How to Use Python Web Scraping to Fetch Data from Amazon


Amazon stands as the largest global shopping platform, boasting an extensive array of product information, user reviews, and more. Today, we’ll guide you step by step in bypassing Amazon’s anti-scraping mechanisms to scrape the useful information you desire.

Exploring Amazon’s Anti-Scraping Mechanisms

Before diving in, let’s attempt to access Amazon using several common Python web scraping modules and observe the effectiveness of its anti-scraping mechanisms.

1. urllib Module

First, let’s try accessing Amazon using the urllib module.

# -*- coding:utf-8 -*-
import urllib.request
import urllib.error

try:
    req = urllib.request.urlopen('https://www.amazon.com')
    print(req.code)
except urllib.error.HTTPError as e:
    # urlopen raises HTTPError for non-2xx responses such as 503
    print(e.code)

The result returns a status code of 503, indicating that Amazon has identified our request as coming from a scraper and has refused to provide service.

2. requests Module

Next, we attempt to access Amazon using the requests module.

import requests
url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)

Similarly, the result returns a status code of 503, indicating that Amazon has rejected our request.

3. Selenium Automation Module

Lastly, we try to access Amazon using the selenium module for automation.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Path to the ChromeDriver executable
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
# Set Chrome browser to headless mode
options = Options()
options.add_argument('--headless')
# Launch the browser (Selenium 4 style: pass the driver path via a Service object)
browser = webdriver.Chrome(service=Service(chromedriver), options=options)

url = "https://www.amazon.com"
browser.get(url)

This time the access succeeds: Selenium drives a real browser, so Amazon serves the page, and we can retrieve its source code via browser.page_source.

Methods to Bypass Amazon’s Anti-Scraping Mechanisms

1. Spoofing User-Agent

Amazon and similar websites often identify scrapers by checking the user-agent in the request header. Therefore, we can spoof the user-agent to make the request look more like it’s coming from a regular browser rather than a scraper.

import requests
 
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'your_cookie_value',
    'TE': 'Trailers'
}
r = requests.get(url, headers=web_header)
print(r.status_code)

2. Using Proxy IPs

By using proxy IPs, we can hide the real request source, increase the stealthiness of the crawl, and reduce the likelihood of being banned.

import requests
 
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
proxy = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port'
}
r = requests.get(url, proxies=proxy)
print(r.status_code)

3. Using CAPTCHA Recognition Technology

Sometimes, Amazon may return a CAPTCHA page to block scrapers. We can use CAPTCHA recognition technology to automatically identify and input the CAPTCHA, allowing the crawl to continue.

# Auto-identify and input CAPTCHA using CAPTCHA recognition technology
# Omitted (requires the use of third-party CAPTCHA recognition services)
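As a first step, it helps to detect programmatically that Amazon has returned a CAPTCHA page rather than the content you asked for, so you know when to invoke a solving service or back off. The helper below is a minimal sketch; the marker strings are assumptions based on commonly reported Amazon CAPTCHA pages and should be adjusted to match the responses you actually receive.

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic check for Amazon's CAPTCHA interstitial page.

    The marker strings are assumptions (commonly observed on Amazon's
    CAPTCHA page), not guaranteed values -- tune them for your responses.
    """
    markers = (
        "Enter the characters you see below",
        "api-services-support@amazon.com",
        "/errors/validateCaptcha",
    )
    return any(marker in html for marker in markers)

# A response containing the CAPTCHA form action is flagged
sample = '<form method="get" action="/errors/validateCaptcha">...</form>'
print(looks_like_captcha(sample))                        # True
print(looks_like_captcha("<html>product page</html>"))   # False
```

When the check fires, you would pass the page to a third-party recognition service (or retry through a different proxy) instead of parsing it as product data.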

Conclusion:

Through the methods described above, we can successfully bypass Amazon’s anti-scraping mechanisms and smoothly obtain the desired data. However, it is important to note that when collecting data from the Amazon site, we may encounter various anti-scraping measures. Here are some common anti-scraping measures:

CAPTCHA: Amazon may return a CAPTCHA page when a scraper attempts to access it, requiring users to manually enter a CAPTCHA to confirm their identity. This requires the use of CAPTCHA recognition technology to automatically identify and input the CAPTCHA.

Dynamic loading: Amazon web pages may use Ajax or JavaScript to dynamically load content instead of loading all content at once. This requires the use of tools such as Selenium to simulate browser behavior, ensuring that all content is loaded and retrieved.

Rate limiting: Amazon's servers may throttle or block an IP address that requests the same pages too frequently, to protect themselves from excessive load. When scraping, control the request frequency, for example by inserting delays between requests, to avoid IP bans.
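Controlling request frequency can be done with a small throttle helper that enforces a minimum delay, plus random jitter to look less robotic, between successive requests. This is a minimal sketch; the delay values are illustrative and should be tuned for your workload.

```python
import random
import time

class Throttle:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_delay=2.0, jitter=1.0):
        self.min_delay = min_delay  # seconds between requests (illustrative)
        self.jitter = jitter        # extra random delay, makes timing less uniform
        self._last = None

    def wait(self):
        """Sleep until at least min_delay (plus jitter) has passed since the last call."""
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            remaining = self.min_delay + random.uniform(0, self.jitter) - elapsed
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_delay=0.1, jitter=0.0)
start = time.monotonic()
throttle.wait()   # first call returns immediately
throttle.wait()   # second call sleeps until min_delay has elapsed
print(time.monotonic() - start >= 0.1)  # True
```

Call throttle.wait() immediately before each requests.get to keep the crawl under the rate limit.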

Anti-scraping algorithms: Amazon may use various algorithms to detect scraper behavior, such as detecting specific fields in request headers or analyzing request frequencies. In such cases, it is necessary to constantly adjust request headers, use proxy IPs, and other methods to evade detection.
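The header-rotation tactic mentioned above can be sketched as follows. The User-Agent strings and header values here are illustrative examples only; in practice you would maintain a larger, up-to-date pool and combine this with proxy rotation.

```python
import random
import requests

# Illustrative pool of real-looking User-Agent strings; keep such a list
# current with actual browser releases in a real crawler
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/114.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/16.5 Safari/605.1.15',
]

def build_session():
    """Return a requests.Session with a randomly chosen User-Agent."""
    session = requests.Session()
    session.headers.update({
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.9',
    })
    return session

session = build_session()
print(session.headers['User-Agent'] in USER_AGENTS)  # True
```

Creating a fresh session (and, ideally, switching proxies) every few requests varies the request fingerprint that detection algorithms look for.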

In summary, dealing with Amazon’s anti-scraping measures requires the comprehensive application of various technologies and strategies to ensure successful data collection.

Advertisement Time: If you want to simplify the data collection process, consider trying Pangolin Scrape API. Just push the collection tasks to the API according to your needs, making data collection easier and more efficient!

If you are a novice with zero coding knowledge and do not want to use complex RPA tools, you can try Pangolin Scrapper. It can collect data from Amazon sites in real-time based on keywords or ASINs, without the need for configuration, and download it in Excel format for analysis and processing with just one click.
