Quick Start

Welcome to the quick-start! Below are the three headless browser integrations we support (Puppeteer, Tor and PhantomJS), with quick directions for getting up and running.

Basic Usage Example
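As a minimal starting point, a single curl call to the Puppeteer node (documented in detail below) is enough to fetch a rendered page. The API key and target URL below are placeholders:

# Minimal call – replace APIKEY with your API key and the url value with the page you want to scrape
curl "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=APIKEY&url=https://example.com/"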

Full Documentation

HeadlessBrowserAPI is designed to simplify web scraping. A few things to consider before we get started:

  • Each request will be retried until it can be successfully completed (with up to 60 seconds of wait time to fully download the scraped pages). Remember to set your request timeout to 60 seconds to ensure that you receive the content of all pages we send (even the slower ones). If a request times out after 60 seconds, the API will return an error detailing the timeout, and you may retry the request afterwards. Make sure to catch these errors! They will occur on roughly 1-2% of requests for hard-to-scrape websites. You should scrape only HTML files; other file types such as images or PDFs will return an HTML document with the image or file embedded in the content. A minimal timeout and error-handling sketch follows this list.
  • Also remember that there is a 5 MB maximum size limit per request; if this is exceeded, the API will return an error.
  • If you exceed your daily API call limit, the API will respond with an error telling you that you are rate limited.
  • The server is set to the UTC timezone; call counts are reset daily based on this time zone.
  • Each request returns a JSON string containing the URL you scraped and the raw HTML of the requested page.
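As mentioned above, a simple way to honor the 60-second limit and catch error responses is to set the client timeout slightly above it and inspect the returned JSON before using it. A minimal sketch using curl and jq (jq is assumed to be installed; the Puppeteer endpoint and example.com are placeholders):

# Allow slightly more than the API's 60-second maximum, then check for an "error" field before saving the HTML
response=$(curl -s --max-time 65 "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=APIKEY&url=https://example.com/")
if echo "$response" | jq -e 'has("error")' > /dev/null; then
  echo "Scrape failed: $(echo "$response" | jq -r '.error')" >&2   # safe to retry the request afterwards
else
  echo "$response" | jq -r '.html' > page.html                     # rendered HTML of the scraped page
fi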

Rendering JavaScript

If you are crawling a page that requires JavaScript rendering, the API can fetch it using any of the headless browsers it supports (Puppeteer, Tor and PhantomJS). JavaScript is rendered by default on all API nodes and all subscription plans; you do not need to set any parameter or enable anything.

Tutorial Video and Examples


Check the available API nodes and their corresponding headless browsers below:

1. Using Puppeteer for web scraping

This API node calls the Puppeteer headless browser to render the web pages and return their HTML content. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get the fully rendered HTML in the API response.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/puppeteer

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=APIKEY&url=https://whatismyipaddress.com/"

Result:

{
  "url": "https:\/\/whatismyipaddress.com\/",
  "html": "<div id=\"outertop\">\n<div id=\"main\">\n<div id=\"wrap\">\n<div id=\"header\">\n<div id=\"logo\"><a href=\"https:\/\/whatismyipaddress.com\">
(rest of the HTML content redacted)..."
}

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode this value, especially if it contains URL parameters, which might break the call if not urlencoded (see the example after this list).
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – select the cookie values you want to send with each request. The syntax for this field is the following: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This feature grants access to password-restricted websites. This is not for form-based logins such as wp-login; for those, you must use cookies – don’t forget to URL encode this value also
  • timeout – numeric value, set the timeout (in milliseconds) after which the API call will return unsuccessfully, if the scraped page does not respond. The maximum and also the default value for this is 60000 milliseconds (60 seconds). You can lower this value, if you don’t want to wait for long periods of time for pages that don’t respond. However, for best results in most cases, we recommend that you keep this value above 30 seconds.
  • proxy_url – If you want to use a proxy to crawl webpages, input its address here. Required format: IP Address/URL:port. You can input a comma separated list of proxies (the API will select a random one from the list each time). Don’t forget to urlencode this value. If you don’t add your own proxy, the scraper will randomly select a proxy from a pool of 200,000 premium proxies available to the API, making each request you make come from a different IP address. If you want to disable automatic proxies (and use the IP address of this server), add the value disabled to this parameter.
  • proxy_auth – If you want to use a proxy to crawl webpages and it requires authentication, input its authentication details here. Required format: username:password. You can input a comma separated list of users/passwords. If a proxy does not have a user/password, leave its entry blank in the list. Example: user1:pass1,user2:pass2,,user4:pass4 – don’t forget to urlencode this value also.
  • sleep – numeric value, set how long (in milliseconds) the API should wait after successfully loading the page for further elements to render on the screen. This is useful for pages that load content after a delay or for larger pages that need more screen time to render. The maximum value is 8000 milliseconds (8 seconds). The default value is 0. Use this only if you are sure that you need it, because otherwise it will only slow down the page crawling rate and make the requests you send to the API take longer to finish.
  • jsexec – add your custom JavaScript code which will be injected into the scraped HTML content and executed (XSS). The code you send in this parameter must be URL encoded.
  • localstorage – select the local storage values you want to send with each request (these are similar to cookies, but some modern sites prefer them over cookies). The syntax for this field is the following: local_key1=local_value1; local_key2=local_value2 – don’t forget to URL encode this value also
  • solvecaptcha – set this parameter to on to enable automatic captcha solving. Most types of captchas can be solved using this feature. When this feature is used, it is recommended to increase the timeout you wait in your script for the API response to at least 160 seconds, as captcha solving alone can take up to 120 seconds to complete. The default value of this feature is off.
  • enableadblock – set this parameter to on to enable automatic blocking of ads from the scraped pages. The default value of this feature is off.
  • readability – set this parameter to on to auto detect and return only the readable content of the scraped HTML page (best for scraping articles and posts).
  • clickelement – set the page selector where the headless browser should do a single click. Example: #captcha-submit or .btn.btn-info. If the element is not found in the page, nothing will happen.
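As noted for the url and custom_user_agent parameters above, values should be URL encoded. One convenient way to do this is to let curl handle the encoding with -G and --data-urlencode; a sketch with placeholder values:

# curl URL-encodes each value and appends it to the query string (-G sends the data as GET parameters)
curl -G "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer" \
  --data-urlencode "apikey=APIKEY" \
  --data-urlencode "url=https://example.com/search?q=headless browsers" \
  --data-urlencode "custom_user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" \
  --data-urlencode "custom_cookies=session_id=abc123; theme=dark" \
  --data-urlencode "timeout=45000"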

API result:

– In case of success: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • url – The URL that was scraped.
  • html – The scraped HTML content, with JavaScript rendered content.

– In case of error: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • error – The error message containing the reason of the failure.
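Both the success and error responses include the apicalls field, so it can be used to keep an eye on your remaining daily quota. A minimal sketch (jq assumed; the URL is a placeholder):

# Print the number of API calls remaining on your key after a request
remaining=$(curl -s "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=APIKEY&url=https://example.com/" | jq -r '.apicalls')
echo "API calls remaining: $remaining"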

2. Using Tor for web scraping

This API node calls the Tor headless browser to render the web pages and return their HTML content. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get the fully rendered HTML in the API response.

Notice: You can also scrape .onion links using this endpoint (see the example after the Result below)! This endpoint routes requests through the Tor network’s proxy chain, so you do not need to add your own proxy for this API node; a random one is used by default.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/tor

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/tor?apikey=APIKEY&url=https://whatismyipaddress.com/"

Result:

{
  "url": "https:\/\/whatismyipaddress.com\/",
  "html": "<div id=\"outertop\">\n<div id=\"main\">\n<div id=\"wrap\">\n<div id=\"header\">\n<div id=\"logo\"><a href=\"https:\/\/whatismyipaddress.com\">
(rest of the HTML content redacted)..."
}
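Because this node routes requests through the Tor network, the same call also works for .onion addresses. A sketch with a placeholder hidden-service address:

# Scrape a hidden service – the .onion address below is a placeholder, replace it with a real one
curl "https://headlessbrowserapi.com/apis/scrape/v1/tor?apikey=APIKEY&url=http://exampleonionaddressplaceholder.onion/"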

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode this value, especially if it contains URL parameters, which might break the call if not urlencoded.
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – select the cookie values you want to send with each request. The syntax for this field is the following: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This feature grants access to password-restricted websites. This is not for form-based logins such as wp-login; for those, you must use cookies – don’t forget to URL encode this value also
  • timeout – numeric value, set the timeout (in milliseconds) after which the API call will return unsuccessfully, if the scraped page does not respond. The maximum and also the default value for this is 60000 milliseconds (60 seconds). You can lower this value, if you don’t want to wait for long periods of time for pages that don’t respond. However, for best results in most cases, we recommend that you keep this value above 30 seconds.
  • sleep – numeric value, set how long (in milliseconds) the API should wait after successfully loading the page for further elements to render on the screen. This is useful for pages that load content after a delay or for larger pages that need more screen time to render. The maximum value is 8000 milliseconds (8 seconds). The default value is 0. Use this only if you are sure that you need it, because otherwise it will only slow down the page crawling rate and make the requests you send to the API take longer to finish.
  • jsexec – add your custom JavaScript code which will be injected into the scraped HTML content and executed (XSS). The code you send in this parameter must be URL encoded.
  • localstorage – select the local storage values you want to send with each request (these are similar to cookies, but some modern sites prefer them over cookies). The syntax for this field is the following: local_key1=local_value1; local_key2=local_value2 – don’t forget to URL encode this value also
  • solvecaptcha – set this parameter to on to enable automatic captcha solving. Most types of captchas can be solved using this feature. When this feature is used, it is recommended to increase the timeout you wait in your script for the API response to at least 160 seconds, as captcha solving alone can take up to 120 seconds to complete. The default value of this feature is off.
  • enableadblock – set this parameter to on to enable automatic blocking of ads from the scraped pages. The default value of this feature is off.
  • readability – set this parameter to on to auto detect and return only the readable content of the scraped HTML page (best for scraping articles and posts).
  • clickelement – set the page selector where the headless browser should do a single click. Example: #captcha-submit or .btn.btn-info. If the element is not found in the page, nothing will happen.
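As with the Puppeteer node, the optional parameters can be combined in a single call. A sketch sending URL-encoded custom_cookies and localstorage values (all values are placeholders):

# Send cookies and local storage entries with the request; curl URL-encodes both values
curl -G "https://headlessbrowserapi.com/apis/scrape/v1/tor" \
  --data-urlencode "apikey=APIKEY" \
  --data-urlencode "url=https://example.com/" \
  --data-urlencode "custom_cookies=session_id=abc123; consent=true" \
  --data-urlencode "localstorage=theme=dark; language=en"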

API result:

– In case of success: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • url – The URL that was scraped.
  • html – The scraped HTML content, with JavaScript rendered content.

– In case of error: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • error – The error message containing the reason of the failure.

3. Using PhantomJS for web scraping

This API node calls the PhantomJS headless browser to render the web pages and return their HTML content. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get the fully rendered HTML in the API response.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/phantomjs

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/phantomjs?apikey=APIKEY&url=https://whatismyipaddress.com/"

Result:

{
  "url": "https:\/\/whatismyipaddress.com\/",
  "html": "<div id=\"outertop\">\n<div id=\"main\">\n<div id=\"wrap\">\n<div id=\"header\">\n<div id=\"logo\"><a href=\"https:\/\/whatismyipaddress.com\">
(rest of the HTML content redacted)..."
}

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode this value, especially if it contains URL parameters, which might break the call if not urlencoded.
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – select the cookie values you want to send with each request. The syntax for this field is the following: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This feature grants access to password-restricted websites. This is not for form-based logins such as wp-login; for those, you must use cookies – don’t forget to URL encode this value also
  • timeout – numeric value, set the timeout (in milliseconds) after which the API call will return unsuccessfully, if the scraped page does not respond. The maximum and also the default value for this is 60000 milliseconds (60 seconds). You can lower this value, if you don’t want to wait for long periods of time for pages that don’t respond. However, for best results in most cases, we recommend that you keep this value above 30 seconds.
  • proxy_url – If you want to use a proxy to crawl webpages, input its address here. Required format: IP Address/URL:port. You can input a comma separated list of proxies (the API will select a random one from the list each time). Don’t forget to urlencode this value. If you don’t add your own proxy, the scraper will randomly select a proxy from a pool of 200,000 premium proxies available to the API, making each request you make come from a different IP address. If you want to disable automatic proxies (and use the IP address of this server), add the value disabled to this parameter.
  • proxy_auth – If you want to use a proxy to crawl webpages and it requires authentication, input its authentication details here. Required format: username:password. You can input a comma separated list of users/passwords. If a proxy does not have a user/password, leave its entry blank in the list. Example: user1:pass1,user2:pass2,,user4:pass4 – don’t forget to urlencode this value also.
  • sleep – numeric value, set how long (in milliseconds) the API should wait after successfully loading the page for further elements to render on the screen. This is useful for pages that load content after a delay or for larger pages that need more screen time to render. The maximum value is 8000 milliseconds (8 seconds). The default value is 0. Use this only if you are sure that you need it, because otherwise it will only slow down the page crawling rate and make the requests you send to the API take longer to finish.
  • jsexec – add your custom JavaScript code which will be injected into the scraped HTML content and executed (XSS). The code you send in this parameter must be URL encoded.
  • readability – set this parameter to on to auto detect and return only the readable content of the scraped HTML page (best for scraping articles and posts).
  • localstorage – select the local storage values you want to send with each request (these are similar to cookies, but some modern sites prefer them over cookies). The syntax for this field is the following: local_key1=local_value1; local_key2=local_value2 – don’t forget to URL encode this value also
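For instance, a call that combines readability with a URL-encoded jsexec snippet might look like the sketch below (the target URL and JavaScript are placeholders):

# Inject a small JavaScript snippet and return only the readable article content
curl -G "https://headlessbrowserapi.com/apis/scrape/v1/phantomjs" \
  --data-urlencode "apikey=APIKEY" \
  --data-urlencode "url=https://example.com/article" \
  --data-urlencode "readability=on" \
  --data-urlencode "jsexec=document.title = 'scraped';"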

API result:

– In case of success: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • url – The URL that was scraped.
  • html – The scraped HTML content, with JavaScript rendered content.

– In case of error: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • error – The error message containing the reason of the failure.

4. Create screenshots of websites

This API node calls the Puppeteer headless browser to create a screenshot of the website you link in the API call. You do not need to install anything on your server; the API handles everything for you. Just call it and you will get a full-page screenshot of any website.

API Endpoint:
https://headlessbrowserapi.com/apis/scrape/v1/screenshot

Sample Call:

# Add your API key, replace the scraped URL and add optional parameters
curl "https://headlessbrowserapi.com/apis/scrape/v1/screenshot?apikey=APIKEY&url=https://google.com/"

Result:

(full-page screenshot image of google.com)

API parameters:

  • apikey*required – add your API key for the call – be sure to have a valid subscription for the call to work
  • url*required – add the URL you wish to scrape – for best results, urlencode this value, especially if it contains URL parameters, which might break the call if not urlencoded.
  • custom_user_agent – add the user agent you wish to use in the scraping process. For best results, URL encode this value when sending it to the API call – example: for the following user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 – the URL encoded value is: Mozilla%2F5.0%20%28Windows%20NT%2010.0%3B%20Win64%3B%20x64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F87.0.4280.88%20Safari%2F537.36
  • custom_cookies – select the cookie values you want to send with each request. The syntax for this field is the following: cookie_key1=cookie_value1; cookie_key2=cookie_value2 – don’t forget to URL encode this value also
  • user_pass – input the user:pass for the website (HTTP basic authentication). This feature grants access to password-restricted websites. This is not for form-based logins such as wp-login; for those, you must use cookies – don’t forget to URL encode this value also
  • timeout – numeric value, set the timeout (in milliseconds) after which the API call will return unsuccessfully, if the scraped page does not respond. The maximum and also the default value for this is 60000 milliseconds (60 seconds). You can lower this value, if you don’t want to wait for long periods of time for pages that don’t respond. However, for best results in most cases, we recommend that you keep this value above 30 seconds.
  • proxy_url – If you want to use a proxy to crawl webpages, input its address here. Required format: IP Address/URL:port. You can input a comma separated list of proxies (the API will select a random one from the list each time). Don’t forget to urlencode this value. If you don’t add your own proxy, the scraper will randomly select a proxy from a pool of 200,000 premium proxies available to the API, making each request you make come from a different IP address. If you want to disable automatic proxies (and use the IP address of this server), add the value disabled to this parameter.
  • proxy_auth – If you want to use a proxy to crawl webpages and it requires authentication, input its authentication details here. Required format: username:password. You can input a comma separated list of users/passwords. If a proxy does not have a user/password, leave its entry blank in the list. Example: user1:pass1,user2:pass2,,user4:pass4 – don’t forget to urlencode this value also.
  • jsexec – add your custom JavaScript code which will be injected into the scraped HTML content and executed (XSS). The code you send in this parameter must be URL encoded.
  • localstorage – select the local storage values you want to send with each request (these are similar to cookies, but some modern sites prefer them over cookies). The syntax for this field is the following: local_key1=local_value1; local_key2=local_value2 – don’t forget to URL encode this value also
  • width – set the width of the screenshot to be created. If you don’t add this parameter, the default value will be 1920
  • height – set the height of the screenshot to be created. If you want to create full page screenshots, regardless of the height of the page, add 0 in this parameter. The default value for this parameter is 0 (full page screenshots)
  • format – select the output format of the screenshot returned by the API. Possible values are: jpg or pdf. If this parameter is omitted, the default value is jpg
  • solvecaptcha – set this parameter to on to enable automatic captcha solving. Most types of captchas can be solved using this feature. When this feature is used, it is recommended to increase the timeout you wait in your script for the API response to at least 160 seconds, as captcha solving alone can take up to 120 seconds to complete. The default value of this feature is off.
  • enableadblock – set this parameter to on to enable automatic blocking of ads from the scraped pages. The default value of this feature is off.
  • clickelement – set the page selector where the headless browser should do a single click. Example: #captcha-submit or .btn.btn-info. If the element is not found in the page, nothing will happen.
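For example, to save a full-page PDF screenshot at a custom width, the binary response can be written straight to a file (all values are placeholders):

# height=0 keeps the full page; -o writes the returned file to disk
curl -G "https://headlessbrowserapi.com/apis/scrape/v1/screenshot" \
  --data-urlencode "apikey=APIKEY" \
  --data-urlencode "url=https://google.com/" \
  --data-urlencode "width=1280" \
  --data-urlencode "height=0" \
  --data-urlencode "format=pdf" \
  -o screenshot.pdf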

API result:

– In case of success: a JPG image (or a PDF, if format=pdf was requested) containing the screenshot of the website. A sketch for telling success and error responses apart follows the error fields below.

– In case of error: a JSON with the following content:

  • apicalls – The number of API calls remaining for your API key.
  • error – The error message containing the reason of the failure.
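Since a successful call returns a binary image while a failure returns JSON, a script needs to tell the two apart before treating the output as a screenshot. A sketch that checks the Content-Type reported by curl (assuming error responses are served as JSON):

# Save the response and inspect its Content-Type to distinguish an image from a JSON error
content_type=$(curl -s -o screenshot.jpg -w '%{content_type}' "https://headlessbrowserapi.com/apis/scrape/v1/screenshot?apikey=APIKEY&url=https://google.com/")
if echo "$content_type" | grep -qi 'application/json'; then
  cat screenshot.jpg >&2   # the file actually contains the error JSON
  rm screenshot.jpg
fi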

API Use Case Example in the Echo RSS WordPress Plugin


What’s next?

Also check the WordPress plugins that use this API (just add your API key in their settings, and you are ready to scrape JavaScript-rendered HTML content).

There’s a lot more that you can configure and tune in HeadlessBrowserAPI to handle the needs of your application. Be sure to read about all the options it exposes and how to get the most out of this API.