How to Use Python Web Scraping to Fetch Data from Amazon

This article shows how to use Python web scraping to extract data from Amazon. It covers techniques for dealing with Amazon's anti-scraping measures, including CAPTCHA pages, dynamically loaded content, rate limiting, and scraper-detection algorithms. Combined, these strategies make it more likely that you can collect the data you need from Amazon's website.

Amazon is one of the world’s largest online shopping platforms, with an extensive catalog of product information, user reviews, and more. In this article, we walk step by step through getting past Amazon’s anti-scraping mechanisms to collect the information you need.

Exploring Amazon’s Anti-Scraping Mechanisms

Before diving in, let’s attempt to access Amazon using several common Python web scraping modules and observe the effectiveness of its anti-scraping mechanisms.

1. urllib Module

First, let’s try accessing Amazon using the urllib module.

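import urllib.request
import urllib.error

try:
    resp = urllib.request.urlopen('https://www.amazon.com')
    print(resp.status)
except urllib.error.HTTPError as err:
    # urlopen raises on 4xx/5xx responses, so the status code is read from the exception
    print(err.code)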

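The request fails with status code 503 (Service Unavailable). Note that urlopen reports 4xx and 5xx responses by raising urllib.error.HTTPError rather than returning them. The 503 suggests that Amazon has identified our request as coming from a scraper and has refused to serve it.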

2. requests Module

Next, we attempt to access Amazon using the requests module.

import requests
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)

Similarly, the result returns a status code of 503, indicating that Amazon has rejected our request.

3. Selenium Automation Module

Lastly, we try to access Amazon using the selenium module for automation.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Path to the local chromedriver binary
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
# Set Chrome browser to headless mode
options = Options()
options.add_argument('--headless')
# Launch the browser (Selenium 4 API: the driver path is passed through Service)
browser = webdriver.Chrome(service=Service(chromedriver), options=options)

url = "https://www.amazon.com"
browser.get(url)
# The rendered HTML is available as browser.page_source
print(len(browser.page_source))
# Close the browser and end the driver process
browser.quit()

In our test, Selenium loaded Amazon successfully, and the page’s HTML source could be read from browser.page_source.

Methods to Bypass Amazon’s Anti-Scraping Mechanisms

1. Spoofing User-Agent

Amazon and similar websites often identify scrapers by checking the user-agent in the request header. Therefore, we can spoof the user-agent to make the request look more like it’s coming from a regular browser rather than a scraper.

import requests
 
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',  # 'br' is left out: requests only decodes Brotli when a brotli package is installed
    'Connection': 'keep-alive',
    'Cookie': 'your_cookie_value',
    'TE': 'Trailers'
}
r = requests.get(url, headers=web_header)
print(r.status_code)
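
Sending the same User-Agent on every request is itself a recognizable pattern, so a common extension is to pick one at random from a small pool. The following is a minimal sketch of that idea, not part of the original walkthrough; the USER_AGENTS list and the build_headers helper are hypothetical names, and the strings in the pool are examples that should be replaced with current browser values.

```python
import random

# Hypothetical pool of browser User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
]


def build_headers(rng=random):
    """Return request headers with a User-Agent picked at random from the pool."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
    }
```

The result can be passed straight to requests, for example requests.get(url, headers=build_headers()). Whether this is enough to avoid a 503 depends on Amazon's current checks.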

2. Using Proxy IPs

By using proxy IPs, we can hide the real request source, increase the stealthiness of the crawl, and reduce the likelihood of being banned.

import requests
 
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
proxy = {
    'http': 'http://your_proxy_ip:port',
    'https': 'http://your_proxy_ip:port'  # the proxy URL normally uses http:// even for HTTPS traffic
}
r = requests.get(url, proxies=proxy)
print(r.status_code)
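
A single proxy can still be banned, so proxies are usually rotated. The sketch below cycles through a pool in round-robin order and returns a dictionary in the shape that requests expects for its proxies argument. The PROXY_POOL addresses and the next_proxies helper are hypothetical and only illustrate the pattern.

```python
import itertools

# Hypothetical proxy endpoints; substitute proxies you are permitted to use.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# Endless round-robin iterator over the pool
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy_url = next(_proxy_cycle)
    # The same proxy URL is used for both plain HTTP and HTTPS traffic.
    return {"http": proxy_url, "https": proxy_url}
```

Each request can then use a different exit address, for example requests.get(url, proxies=next_proxies()).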

3. Using CAPTCHA Recognition Technology

Sometimes, Amazon may return a CAPTCHA page to block scrapers. We can use CAPTCHA recognition technology to automatically identify and input the CAPTCHA, allowing the crawl to continue.

# Auto-identify and input CAPTCHA using CAPTCHA recognition technology
# Omitted (requires the use of third-party CAPTCHA recognition services)
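
Before any CAPTCHA can be solved, the scraper has to notice that it received one instead of the page it asked for. The sketch below covers only that detection step. The marker strings are assumptions based on text that has appeared on Amazon's CAPTCHA page; they are not a documented contract and may change, and looks_like_captcha is a hypothetical helper name.

```python
# Assumed markers: strings that have appeared on Amazon's CAPTCHA page.
# They are not guaranteed to stay stable.
CAPTCHA_MARKERS = (
    "/errors/validateCaptcha",
    "Enter the characters you see below",
)


def looks_like_captcha(html):
    """Return True if the HTML contains any of the assumed CAPTCHA markers."""
    return any(marker in html for marker in CAPTCHA_MARKERS)
```

A scraper could call looks_like_captcha(r.text) after each request and, when it returns True, pause, switch proxy, or hand the page to a third-party recognition service.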

Conclusion:

With the methods described above, we can often get past Amazon’s anti-scraping mechanisms and obtain the desired data. However, none of them is guaranteed to keep working, and we may still run into various defenses when collecting data from the Amazon site. Here are some common anti-scraping measures:

CAPTCHA: Amazon may return a CAPTCHA page when a scraper attempts to access it, requiring users to manually enter a CAPTCHA to confirm their identity. This requires the use of CAPTCHA recognition technology to automatically identify and input the CAPTCHA.

Dynamic loading: Amazon web pages may use Ajax or JavaScript to dynamically load content instead of loading all content at once. This requires the use of tools such as Selenium to simulate browser behavior, ensuring that all content is loaded and retrieved.

Rate limiting: Amazon servers may restrict an IP address that requests pages too frequently, in order to prevent the excessive server load that heavy scraping causes. When scraping, it is important to control the request frequency to avoid IP bans.

Anti-scraping algorithms: Amazon may use various algorithms to detect scraper behavior, such as checking specific fields in request headers or analyzing request frequencies. In such cases, it is necessary to keep adjusting request headers, rotate proxy IPs, and apply other methods to evade detection.
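
The rate-limiting point above can be handled by spacing out retries. The following is a minimal sketch of retrying with exponential backoff and random jitter whenever a 503 comes back. The fetch_with_backoff name, its parameters, and the delay values are assumptions chosen for illustration; fetch is any callable that returns an object with a status_code attribute, such as a thin wrapper around requests.get.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-503 status or attempts run out.

    fetch is assumed to return an object with a status_code attribute.
    The last response is returned even if it is still a 503.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 503:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: base_delay * 2**attempt plus up to 1 second
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return response
```

With requests, a call might look like fetch_with_backoff(lambda u: requests.get(u, timeout=10), url). The sleep parameter exists so the delay can be replaced in tests.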

In summary, dealing with Amazon’s anti-scraping measures requires the comprehensive application of various technologies and strategies to ensure successful data collection.

