What is HeadlessBrowserAPI?
It is a new API for scraping the HTML of web pages whose content is rendered with JavaScript.
But what does this mean exactly? Here is a quick example web page: https://worldofwarships.asia/en/news/general-news/2020-wows-wallpapers/. If you open this link, you will see an article load. Now scroll to the bottom of the page, right-click the gray footer (the only region of the page where right-clicking is possible) and click “View page source” to see the HTML source code of the page. Usually the source also contains the text of the article displayed on the page. In this specific case it does not, because the article is loaded dynamically with JavaScript after the visitor opens the page.
Because of this, conventional scrapers cannot get usable content from this page: they download the HTML source but do not execute JavaScript.
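To make the problem concrete, here is a minimal Python sketch of what a conventional scraper sees. The HTML snippet is a made-up stand-in for a JavaScript-driven page (it is not the real source of the example page above), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the article container is empty in the served HTML and
# would only be filled in by the script once a browser executes it.
STATIC_HTML = """
<html><body>
  <div id="article"></div>
  <script>
    document.getElementById("article").textContent = "Article text";
  </script>
  <footer>Site footer</footer>
</body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>/<style>, as a non-JavaScript scraper would."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text fragments found in the raw, unrendered HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(STATIC_HTML))  # prints ['Site footer']
```

The article text never shows up, because it only exists after the script runs in a browser.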
To get around this, the page needs to be loaded in a browser that can render JavaScript. Fortunately, there is a wide variety of headless browsers available that can automate the scraping of such pages and return their JavaScript-rendered HTML content.
The downside is that these headless browsers need to be installed on your server, and the script you are using needs to be set up to find and use them. This requires some technical knowledge.
This API solves that problem. It does not require any setup on your part: just call a simple API endpoint and you will get the JavaScript-rendered HTML of the scraped pages in a matter of seconds!
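As a rough illustration of what "calling a simple API endpoint" could look like, here is a Python sketch using only the standard library. The endpoint URL, the `apikey` and `url` parameter names, and the function names are all assumptions made for this example, not the documented interface. Check the API's own documentation for the real values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values: replace them with the endpoint and key from the API's documentation.
API_ENDPOINT = "https://api.example.com/render"
API_KEY = "YOUR_API_KEY"


def build_request_url(target_url, endpoint=API_ENDPOINT, api_key=API_KEY):
    """Build the GET URL; the target page URL is percent-encoded as a query parameter."""
    query = urlencode({"apikey": api_key, "url": target_url})
    return f"{endpoint}?{query}"


def fetch_rendered_html(target_url, timeout=60):
    """Call the (assumed) endpoint and return the response body as text."""
    with urlopen(build_request_url(target_url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Only prints the request URL; no network call is made here.
    print(build_request_url(
        "https://worldofwarships.asia/en/news/general-news/2020-wows-wallpapers/"
    ))
```

With a real endpoint and key in place, `fetch_rendered_html(url)` would return the rendered page as a string, ready to be passed to any HTML parser.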
Is it also able to scrape Dark Web links (.onion)?
Yes! If you use the Tor endpoint of the API, it will also be able to access the Onion Network and scrape content from any Dark Web website!
Note that the Tor endpoint automatically routes requests through the Tor network's proxy chain, so you do not need to enter your own proxies for this API node.