Quick Start

Welcome to the quick-start! Below are the three API nodes we support (Puppeteer, Tor and PhantomJS), with quick directions for getting up and running with each.

Basic Usage Example

# Add your API key and replace the scraped URL
curl "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=APIKEY&url=https://whatismyipaddress.com/"
Full Documentation

HeadlessBrowserAPI is designed to simplify web scraping. A few things to consider before we get started:

  • Each request is automatically retried until it completes, with up to 60 seconds of wait time to fully download scraped pages. Set your request timeout to 60 seconds to ensure that you receive the content of all pages we send (even the slower ones). If a request times out after 60 seconds, the API returns an error detailing the timeout, and you may retry the request afterwards. Make sure to catch these errors! They occur on roughly 1-2% of requests for hard-to-scrape websites. Scrape only HTML files; other file types, such as images or PDFs, will return an HTML document with the file embedded in the content.
  • There is a 5MB maximum size limit per request. If this is exceeded, the API will return an error.
  • If you exceed your daily API call count, the API will respond with an error indicating that you are rate limited.
  • Each request returns a JSON string containing the URL you scraped and the raw HTML from the requested page.
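
The rules above can be sketched as a minimal Python client with a 60-second timeout, a simple retry loop, and handling of the error JSON. The function names and retry policy are illustrative, not part of the API:

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Endpoint from the docs below; the helper names are illustrative only.
ENDPOINT = "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer"

def parse_result(body: str) -> str:
    """Return the scraped HTML, or raise if the API reported an error."""
    data = json.loads(body)
    if "error" in data:
        raise RuntimeError("API error: " + data["error"])
    return data["html"]

def scrape(apikey: str, url: str, retries: int = 2) -> str:
    """Call the API with a 60 s request timeout, retrying failed calls."""
    query = urllib.parse.urlencode({"apikey": apikey, "url": url})
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(retries + 1):
        try:
            with urllib.request.urlopen(ENDPOINT + "?" + query, timeout=60) as resp:
                return parse_result(resp.read().decode("utf-8"))
        except (TimeoutError, urllib.error.URLError, RuntimeError) as exc:
            last_error = exc  # timeouts hit roughly 1-2% of hard-to-scrape pages
    raise last_error
```

Note that `urlencode` also takes care of URL-encoding the target URL, which the parameter lists below recommend.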

Rendering Javascript

If you are crawling a page that requires JavaScript rendering, the API can fetch it using any of the headless browsers it supports (Puppeteer, Tor and PhantomJS). JavaScript is rendered by default on all API nodes and all subscription plans; you do not need to set any parameter or enable anything.

Tutorial Video and Examples


Check the available API nodes and their corresponding headless browsers below:

1. Using Puppeteer

This API node calls the Puppeteer headless browser to render web pages and return their HTML content. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get the fully rendered HTML in the API response.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/puppeteer

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=APIKEY&url=https://whatismyipaddress.com/"

Result:

{
  "url": "https:\/\/whatismyipaddress.com\/",
  "html": "<div id=\"outertop\">\n<div id=\"main\">\n<div id=\"wrap\">\n<div id=\"header\">\n<div id=\"logo\"><a href=\"https:\/\/whatismyipaddress.com\">
(rest of the HTML content redacted)..."
}

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode the URL you wish to scrape (especially if it contains URL parameters, which might break the call if not urlencoded).
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – set the cookie values you want to send with each request. The syntax for this field is: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This grants access to password-restricted websites. It is not for site login forms like wp-login; for those, use cookies instead – don’t forget to URL encode this value also
  • timeout – numeric value; sets the timeout (in milliseconds) after which the API call returns unsuccessfully if the scraped page does not respond. The maximum (and default) value is 60000 milliseconds (60 seconds). You can lower this value if you don’t want to wait long for pages that don’t respond; however, for best results in most cases, we recommend keeping it above 30 seconds (30000).
  • proxy_url – if you want to use a proxy to crawl webpages, input its address here. Required format: IP address/URL:port. You can input a comma-separated list of proxies (the API will select a random one from the list each time). Don’t forget to urlencode this value.
  • proxy_auth – if your proxy requires authentication, input its credentials here. Required format: username:password. You can input a comma-separated list of users/passwords; if a proxy does not have a user/password, leave its slot blank in the list. Example: user1:pass1,user2:pass2,,user4:pass4 – don’t forget to urlencode this value also.
  • sleep – numeric value; sets how long (in milliseconds) the API should wait after the page has loaded, for further elements to render. This is useful for pages that load content after a delay or for larger pages that need more time to render. The maximum value is 8000 milliseconds (8 seconds); the default is 0. Use this only if you are sure you need it, because otherwise it only slows down the crawl rate and makes your API requests take longer to finish.
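
Several of these values must be URL-encoded before they go into the query string. A sketch using Python's standard library (all parameter values below are made up for illustration):

```python
from urllib.parse import urlencode

# Made-up example values; urlencode() percent-encodes each one, so the
# user agent, cookies and timing fields survive the query string intact
# (note it encodes spaces as "+", which servers treat like "%20").
params = {
    "apikey": "APIKEY",
    "url": "https://whatismyipaddress.com/",
    "custom_user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "custom_cookies": "cookie_key1=cookie_value1; cookie_key2=cookie_value2",
    "timeout": "60000",
    "sleep": "2000",
}
request_url = "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?" + urlencode(params)
```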

API result:

– In case of success: a JSON with the following content:

  • url – The URL that was scraped.
  • html – The scraped HTML content, with JavaScript rendered content.

– In case of error: a JSON with the following content:

  • error – The error message containing the reason of the failure.

2. Using Tor

This API node calls the Tor headless browser to render web pages and return their HTML content. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get the fully rendered HTML in the API response.

Notice: You will also be able to scrape .onion links using this endpoint! In addition, this endpoint uses the proxy chain of the Tor network, so you are not required to add your own proxy for this API node; it uses a random one by default.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/tor

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/tor?apikey=APIKEY&url=https://whatismyipaddress.com/"

Result:

{
  "url": "https:\/\/whatismyipaddress.com\/",
  "html": "<div id=\"outertop\">\n<div id=\"main\">\n<div id=\"wrap\">\n<div id=\"header\">\n<div id=\"logo\"><a href=\"https:\/\/whatismyipaddress.com\">
(rest of the HTML content redacted)..."
}

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode the URL you wish to scrape (especially if it contains URL parameters, which might break the call if not urlencoded).
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – set the cookie values you want to send with each request. The syntax for this field is: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This grants access to password-restricted websites. It is not for site login forms like wp-login; for those, use cookies instead – don’t forget to URL encode this value also
  • timeout – numeric value; sets the timeout (in milliseconds) after which the API call returns unsuccessfully if the scraped page does not respond. The maximum (and default) value is 60000 milliseconds (60 seconds). You can lower this value if you don’t want to wait long for pages that don’t respond; however, for best results in most cases, we recommend keeping it above 30 seconds (30000).
  • sleep – numeric value; sets how long (in milliseconds) the API should wait after the page has loaded, for further elements to render. This is useful for pages that load content after a delay or for larger pages that need more time to render. The maximum value is 8000 milliseconds (8 seconds); the default is 0. Use this only if you are sure you need it, because otherwise it only slows down the crawl rate and makes your API requests take longer to finish.

API result:

– In case of success: a JSON with the following content:

  • url – The URL that was scraped.
  • html – The scraped HTML content, with JavaScript rendered content.

– In case of error: a JSON with the following content:

  • error – The error message containing the reason of the failure.
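
Because the Tor node handles proxying itself, a request for a hidden service only needs the endpoint swapped. A sketch that just builds the request URL (the .onion address is hypothetical; no request is sent here):

```python
from urllib.parse import urlencode

TOR_ENDPOINT = "https://headlessbrowserapi.com/apis/scrape/v1/tor"

# Hypothetical .onion address; this node has no proxy parameters because
# requests are already routed through the Tor network.
query = urlencode({"apikey": "APIKEY", "url": "http://exampleonionsite.onion/"})
request_url = TOR_ENDPOINT + "?" + query
```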

3. Using PhantomJS

This API node calls the PhantomJS headless browser to render web pages and return their HTML content. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get the fully rendered HTML in the API response.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/phantomjs

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/phantomjs?apikey=APIKEY&url=https://whatismyipaddress.com/"

Result:

{
  "url": "https:\/\/whatismyipaddress.com\/",
  "html": "<div id=\"outertop\">\n<div id=\"main\">\n<div id=\"wrap\">\n<div id=\"header\">\n<div id=\"logo\"><a href=\"https:\/\/whatismyipaddress.com\">
(rest of the HTML content redacted)..."
}

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode the URL you wish to scrape (especially if it contains URL parameters, which might break the call if not urlencoded).
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – set the cookie values you want to send with each request. The syntax for this field is: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This grants access to password-restricted websites. It is not for site login forms like wp-login; for those, use cookies instead – don’t forget to URL encode this value also
  • timeout – numeric value; sets the timeout (in milliseconds) after which the API call returns unsuccessfully if the scraped page does not respond. The maximum (and default) value is 60000 milliseconds (60 seconds). You can lower this value if you don’t want to wait long for pages that don’t respond; however, for best results in most cases, we recommend keeping it above 30 seconds (30000).
  • proxy_url – if you want to use a proxy to crawl webpages, input its address here. Required format: IP address/URL:port. You can input a comma-separated list of proxies (the API will select a random one from the list each time). Don’t forget to urlencode this value.
  • proxy_auth – if your proxy requires authentication, input its credentials here. Required format: username:password. You can input a comma-separated list of users/passwords; if a proxy does not have a user/password, leave its slot blank in the list. Example: user1:pass1,user2:pass2,,user4:pass4 – don’t forget to urlencode this value also.
  • sleep – numeric value; sets how long (in milliseconds) the API should wait after the page has loaded, for further elements to render. This is useful for pages that load content after a delay or for larger pages that need more time to render. The maximum value is 8000 milliseconds (8 seconds); the default is 0. Use this only if you are sure you need it, because otherwise it only slows down the crawl rate and makes your API requests take longer to finish.
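
The proxy_url/proxy_auth list format pairs entries by position, with an empty slot for a proxy that has no credentials. A sketch of building and encoding such a pair (addresses and credentials are made up):

```python
from urllib.parse import quote

# Made-up proxy pool; the auth list lines up with the proxy list by
# position, and the empty string in the middle means "no credentials".
proxies = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:3128"]
auths = ["user1:pass1", "", "user3:pass3"]

# quote(..., safe="") percent-encodes ":" and "," too, as the docs ask.
proxy_url = quote(",".join(proxies), safe="")
proxy_auth = quote(",".join(auths), safe="")
```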

API result:

– In case of success: a JSON with the following content:

  • url – The URL that was scraped.
  • html – The scraped HTML content, with JavaScript rendered content.

– In case of error: a JSON with the following content:

  • error – The error message containing the reason of the failure.

API Use Case Example in the Echo RSS WordPress Plugin


What’s next?

Check also the WordPress plugins that use this API: just add your API key in their settings, and you are ready to scrape JavaScript-rendered HTML content.

There’s a lot more that you can configure and tune in HeadlessBrowserAPI to handle the needs of your application. Be sure to read about all the options it exposes and how to get the most out of this API.