Amazon Crawling API: Integration and Best Practices of Pangolin Scrape API and Data API (with Code)

Amazon Crawling API - A comprehensive guide to leveraging Pangolin's Scrape API and Data API for building effective Amazon scraper solutions, with best practices and detailed code examples.

In this data-driven era, the ability to accurately and efficiently acquire data from e-commerce platforms is crucial for a business’s competitiveness. Amazon, being one of the world’s largest e-commerce platforms, holds a vast trove of business-critical data for market trend analysis, competitor intelligence, and price monitoring. However, scraping data from Amazon is not a trivial task; it involves complex challenges like anti-scraping mechanisms and dynamic page rendering. This article will delve into how to use Pangolin’s Scrape API and Data API to build a robust Amazon crawling solution, and will share integration tips and best practices, along with practical code examples to guide you through the data acquisition process.

Understanding Pangolin’s Amazon Crawling API Solutions

Pangolin provides two core APIs to cater to different Amazon data scraping needs: the Scrape API and the Data API. Each has a specific focus, and together they provide comprehensive coverage of Amazon platform data.

1. Scrape API: Full Page Capture, Replicating Real User Experience

The Scrape API focuses on retrieving the raw HTML data of Amazon pages, aiming to replicate the real user experience within a browser. By simulating browser behavior, it bypasses Amazon’s anti-scraping mechanisms and allows developers to specify a specific zip code for location-based data. The advantages of this approach include:

  • High Fidelity: Retrieves page data that matches the real browser experience, including dynamically loaded content.
  • Flexibility: Enables the scraping of any Amazon page, be it product listing pages, product detail pages, or search results pages.
  • Customization: Supports zip code specification via the bizContext parameter, allowing the retrieval of product information for specific regions.
  • Asynchronous Processing: Uses an asynchronous callback mechanism to push scraped data to the developer’s designated callbackUrl, preventing long wait times.

2. Data API: Structured Data, Direct Access to Core Information

The Data API focuses on providing structured data, making it easier for developers to extract and analyze core information from the Amazon platform. It offers a set of predefined bizKey values, supporting the scraping of:

  • Product Listings: Retrieves product listings based on category (amzProductOfCategory) or seller (amzProductOfSeller).
  • Product Details: Scrapes the detailed information of a single product (amzProduct).
  • Keyword Search Results: Retrieves product listings based on keywords (amzKeyword).
  • Ranking Data: Acquires data from bestsellers (bestSellers) or new releases (newReleases) lists.
  • Product Reviews: Scrapes product review information, with support for pagination and different country sites.

The advantages of the Data API include:

  • Structured Output: Data is returned in JSON format, facilitating parsing and storage.
  • Efficiency and Convenience: Eliminates the need to write complex HTML parsing code, directly delivering the desired data.
  • Targeted Data: The predefined bizKey values cover the most common Amazon data scraping scenarios.
  • Review Scraping: A dedicated /review interface efficiently retrieves product reviews, with support for different countries and regions.

In-Depth Look at Pangolin’s Amazon Scraping API Interfaces

Below is a detailed analysis of the Scrape API and Data API interfaces provided by Pangolin, based on the official documentation, with code examples for each.

1. Scrape API (http://scrape.pangolinfo.com/api/task/receive/v1)

  • Request URL: http://scrape.pangolinfo.com/api/task/receive/v1
  • Request Method: POST
  • Request Header: Content-Type: application/json
  • Request Parameters (JSON):
    • token (Required): User authentication token for identification.
    • url (Required): The Amazon page URL to scrape.
    • callbackUrl (Required): The service address where developers receive data.
    • proxySession (Optional): A 32-character UUID to specify a specific IP address for scraping. The IP is retained for the day and becomes invalid after midnight.
    • callbackHeaders (Optional): Data attached to the header of the callback request. Please encode values correctly.
    • bizContext (Optional): A JSON object containing the zipcode (Amazon zip code information).
  • Response Parameters (JSON):
    • code: System status code (0 indicates success).
    • message: System status message.
    • data: A JSON object containing the crawl task ID (data), business status code (bizCode), and business status message (bizMsg).

Important Note: The Scrape API returns the raw HTML of the page. Developers need to parse and extract data from it themselves.

Example Python Code (using the requests library):

import requests

url = "http://scrape.pangolinfo.com/api/task/receive/v1?token=YOUR_TOKEN"

payload = {
    "url": "https://www.amazon.com/s?k=baby",
    "callbackUrl": "http://your-callback-url/data",
    "bizContext": {
        "zipcode": "90001"  # Retrieve region-specific product information
    }
}
headers = {
    "Content-Type": "application/json"
}

# The json= argument serializes the payload and is idiomatic for JSON APIs
response = requests.post(url, headers=headers, json=payload)
print(response.text)

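Since the Scrape API delivers raw HTML to your callbackUrl, you need a parsing step on your side. As a minimal sketch, the standard library’s html.parser can pull ASINs out of a search-results page; the sample HTML and the data-asin attribute below are illustrative assumptions, so inspect the real markup before relying on any selector (a full-featured library such as Beautiful Soup is usually more convenient in practice).

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for the raw HTML that the Scrape API
# pushes to your callbackUrl; the real payload is a full Amazon page.
SAMPLE_HTML = """
<div data-asin="B081T7N948"><h2>Example Baby Product</h2></div>
<div data-asin="B07XJ8C8F5"><h2>Another Product</h2></div>
"""

class AsinCollector(HTMLParser):
    """Collect every data-asin attribute seen while parsing."""
    def __init__(self):
        super().__init__()
        self.asins = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name == "data-asin" and value:
                self.asins.append(value)

parser = AsinCollector()
parser.feed(SAMPLE_HTML)
print(parser.asins)  # → ['B081T7N948', 'B07XJ8C8F5']
```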

2. Data API (https://extapi.pangolinfo.com/api/v1)

The Data API is split into multiple sub-interfaces, including refresh token, submit task, and submit review task interfaces.

2.1 Refresh Token Interface (https://extapi.pangolinfo.com/api/v1/refreshToken)

  • Request URL: https://extapi.pangolinfo.com/api/v1/refreshToken
  • Request Method: POST
  • Request Header: Content-Type: application/json
  • Request Parameters (JSON):
    • email (Required): Registered email address.
    • password (Required): Password.
  • Response Parameters (JSON):
    • code: System status code (0 indicates success).
    • subCode: Sub status code.
    • message: System status message.
    • data: API access token, corresponding to xxxx in Authorization: Bearer xxxx.

Important Note: After obtaining the token, include it in the Authorization request header (Authorization: Bearer <token>) of subsequent Data API requests.

Example Python Code (using the requests library):

import requests

url = "https://extapi.pangolinfo.com/api/v1/refreshToken"
payload = {
    "email": "[email protected]",
    "password": "your_password"
}
headers = {
    "Content-Type": "application/json"
}
response = requests.post(url, headers=headers, json=payload)
token_data = response.json()["data"]
print(token_data)  # Save this token for use in the headers of other Data API requests

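Because every Data API call needs a valid token, it can be convenient to cache it instead of hitting refreshToken on every request. The sketch below is a convenience wrapper built on assumptions: the one-hour lifetime is illustrative (check your account’s actual token validity), and fetch_token is any callable that performs the refreshToken request and returns the token string.

```python
import time

class TokenCache:
    """Cache the Data API token and refresh it only when it has expired.

    `fetch_token`: any callable that performs the refreshToken request
    and returns the token string. `lifetime_seconds` is an assumed
    validity window, not a documented value.
    """
    def __init__(self, fetch_token, lifetime_seconds=3600):
        self._fetch_token = fetch_token
        self._lifetime = lifetime_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh only when there is no token yet or it has expired
        if self._token is None or time.time() >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = time.time() + self._lifetime
        return self._token
```

For example: `cache = TokenCache(fetch_token=my_refresh_call)`, then pass `cache.get()` wherever a token is needed.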

2.2 Submit Task Interface (https://extapi.pangolinfo.com/api/v1)

  • Request URL: https://extapi.pangolinfo.com/api/v1
  • Request Method: GET
  • Request Header: Content-Type: application/x-www-form-urlencoded
  • Request Parameters (Query String):
    • token (Required): API token (the data value obtained from the refresh token interface).
    • url (Required): Target web page URL.
    • callbackUrl (Required): Service address for receiving data.
    • bizKey (Required): Business type (e.g., amzProductOfCategory, amzProductOfSeller, amzProduct, amzKeyword, bestSellers, newReleases).
    • zipcode (Optional): Amazon zip code information.
    • rawData (Optional): Indicates whether to return raw data. The default is false.
  • Request Header: Authorization: Bearer xxxx (the data value obtained from the refresh token interface)
  • Response Parameters (JSON):
    • code: System status code (0 indicates success).
    • data: A JSON object containing the crawl task ID (data), business status code (bizCode), and business status message (bizMsg).
    • message: System status message.

Example Python Code (using the requests library):

import requests

token = "YOUR_TOKEN_HERE"  # Retrieved from the refreshToken interface
url = "https://extapi.pangolinfo.com/api/v1"
params = {
    "token": token,
    "url": "https://www.amazon.com/gp/bestsellers/kitchen/ref=zg_bs_kitchen_sm",
    "callbackUrl": "http://your-callback-url/data",
    "bizKey": "bestSellers",
    "zipcode": "10041",
    "rawData": "false"
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Authorization": f"Bearer {token}"
}
response = requests.get(url, params=params, headers=headers)
print(response.text)


2.3 Submit Review Task Interface (https://extapi.pangolinfo.com/api/v1/review)

  • Request URL: https://extapi.pangolinfo.com/api/v1/review
  • Request Method: GET
  • Request Header: Content-Type: application/x-www-form-urlencoded
  • Request Parameters (Query String):
    • token (Required): API token (the data value obtained from the refresh token interface).
    • asin (Required): Target product ASIN.
    • callbackUrl (Required): Service address for receiving data.
    • page (Required): Review page number.
    • country_code (Optional): Target country code (e.g., us, de, uk, fr, jp, ca, it, au, es).
  • Request Header: Authorization: Bearer xxxx (the data value obtained from the refresh token interface)
  • Response Parameters (JSON):
    • code: System status code (0 indicates success).
    • data: A JSON object containing the crawl task ID (data), business status code (bizCode), and business status message (bizMsg).
    • message: System status message.

Example Python Code (using the requests library):

import requests

token = "YOUR_TOKEN_HERE"  # Retrieved from the refreshToken interface
url = "https://extapi.pangolinfo.com/api/v1/review"
params = {
    "token": token,
    "asin": "B081T7N948",
    "callbackUrl": "http://your-callback-url/data",
    "page": 1,
    "country_code": "us"
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Authorization": f"Bearer {token}"
}
response = requests.get(url, params=params, headers=headers)
print(response.text)

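Since the /review interface takes one page per request, collecting a full review history means submitting one task per page. A hedged sketch of building those per-page parameter sets follows; the page count and the throttling suggestion in the comment are assumptions, not documented limits.

```python
def build_review_requests(token, asin, callback_url, pages, country_code="us"):
    """Build one query-parameter dict per review page.

    Parameter names match the /review interface; `pages` is an
    illustrative choice, not a documented maximum.
    """
    return [
        {
            "token": token,
            "asin": asin,
            "callbackUrl": callback_url,
            "page": page,
            "country_code": country_code,
        }
        for page in range(1, pages + 1)
    ]

requests_to_send = build_review_requests(
    "YOUR_TOKEN", "B081T7N948", "http://your-callback-url/data", pages=3
)
# In real use, loop over requests_to_send and call
# requests.get(".../api/v1/review", params=p, headers=...) for each,
# sleeping between calls (e.g. time.sleep(1)) to respect rate limits.
```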

Integration and Best Practices

Next, let’s discuss how to effectively integrate these Amazon crawling APIs and some best practices:

  1. Choose the Appropriate API: Select the correct API based on your needs. Choose the Scrape API to retrieve complete page data, or the Data API if you require structured data. For product review scraping, directly utilize the /review interface in the Data API.
  2. Secure Token Management: Securely store and manage your tokens to avoid leaks. Utilize the refresh token interface to renew your tokens and maintain API access.
  3. Handle Asynchronous Callbacks: Both the Scrape API and Data API use asynchronous callback mechanisms. Ensure that your server has a data-receiving service ready to process the scraped data. Refer to the provided documentation for sample Java Spring Boot code.
  4. Use proxySession Wisely: If you need to scrape using a specific IP address, use the proxySession parameter. Be mindful of its expiry.
  5. Optimize Data Extraction: For HTML data returned by the Scrape API, use efficient HTML parsing libraries (like Beautiful Soup or Jsoup) for data extraction. Avoid using regular expressions for parsing to improve efficiency and stability.
  6. Adhere to Amazon’s Robots.txt: Although Pangolin provides a strong scraping capability, respect Amazon’s robots.txt file. Avoid excessive scraping and adhere to the legal regulations and website terms of service.
  7. Error Handling: Implement comprehensive error handling: capture the error codes returned by the API and act on them accordingly. For example, if code is 1001 or 1004, check your request parameters or token.
  8. Data Caching: For repeatedly scraped data, implement caching mechanisms to avoid unnecessary requests, relieve server pressure, and speed up data acquisition.
  9. Rate Limiting: Control scraping frequency to avoid putting too much pressure on Amazon’s servers. Implement appropriate request intervals to prevent IP bans.
  10. Monitoring and Logging: Keep logs of API calls, monitor the execution of scraping tasks, and identify and resolve issues promptly.
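To make the callback-handling practice above concrete, here is a minimal receiver sketch using only Python’s standard library. The port, the body-sniffing logic, and the storage step are assumptions; the official documentation’s Java Spring Boot sample plays the same role.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_callback_body(body: bytes) -> dict:
    """Decode a callback payload: the Data API pushes JSON,
    while the Scrape API pushes raw HTML."""
    try:
        return {"kind": "json", "data": json.loads(body)}
    except json.JSONDecodeError:
        return {"kind": "html", "data": body.decode("utf-8", errors="replace")}

class CallbackHandler(BaseHTTPRequestHandler):
    """Minimal receiver for data that Pangolin pushes to your callbackUrl."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_callback_body(self.rfile.read(length))
        # TODO: persist `payload` to your database or message queue here.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

# To run locally (illustrative port):
# HTTPServer(("0.0.0.0", 8080), CallbackHandler).serve_forever()
```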

Conclusion

Pangolin’s Scrape API and Data API offer developers powerful Amazon crawling API tools, catering to various data acquisition needs. By understanding each API’s functions and parameters, and combining them with the best practices and code examples shared here, you can build an efficient, stable, and reliable Amazon data scraping system to support your business decisions. Always remember, however, that data scraping must be done legally: comply with the terms of service, use the technology responsibly, and never abuse the system. Only then can you truly unlock the value of data scraping and drive business growth.
