Amazon is the world’s largest shopping platform, offering a wealth of product information, user reviews, and more. In this guide, we’ll walk you step by step through bypassing Amazon’s anti-scraping mechanisms so you can scrape the information you need.
Exploring Amazon’s Anti-Scraping Mechanisms
Before diving in, let’s attempt to access Amazon using several common Python web scraping modules and observe the effectiveness of its anti-scraping mechanisms.
1. urllib Module
First, let’s try accessing Amazon using the urllib module.
# -*- coding:utf-8 -*-
import urllib.request
from urllib.error import HTTPError
try:
    req = urllib.request.urlopen('https://www.amazon.com')
    print(req.code)
except HTTPError as e:
    print(e.code)  # urlopen raises HTTPError on a 503 response, so read the status code here
The result is a 503 status code, indicating that Amazon has identified our request as coming from a scraper and refused to serve it.
2. requests Module
Next, we attempt to access Amazon using the requests module.
import requests
url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)
Similarly, the result returns a status code of 503, indicating that Amazon has rejected our request.
3. Selenium Automation Module
Lastly, we try to access Amazon using the selenium automation module.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
# Path to the ChromeDriver executable
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
# Set Chrome browser to headless mode
options = Options()
options.add_argument('--headless')
# Launch the browser (Selenium 4 style; the old chrome_options argument is deprecated)
browser = webdriver.Chrome(service=Service(chromedriver), options=options)
url = "https://www.amazon.com"
browser.get(url)
print(browser.page_source[:500])  # confirm we received real page content
This attempt succeeds: because Selenium drives a real browser, Amazon serves the page and we can retrieve its source code.
Methods to Bypass Amazon’s Anti-Scraping Mechanisms
1. Spoofing User-Agent
Amazon and similar websites often identify scrapers by checking the User-Agent in the request header. We can therefore spoof the User-Agent so the request looks like it comes from a regular browser rather than a scraper.
import requests
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'your_cookie_value',
    'TE': 'Trailers'
}
r = requests.get(url, headers=web_header)
print(r.status_code)
2. Using Proxy IPs
By using proxy IPs, we can hide the real request source, increase the stealthiness of the crawl, and reduce the likelihood of being banned.
import requests
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
proxy = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port'
}
r = requests.get(url, proxies=proxy)
print(r.status_code)
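If a single proxy gets blocked, it can help to rotate through a pool of them. Here is a minimal sketch of that idea; the proxy addresses below are placeholders you would replace with endpoints from your own provider:
import random
import requests
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
# Placeholder pool; replace with real proxies from your provider
proxy_pool = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]
for attempt in range(3):
    proxy_url = random.choice(proxy_pool)
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        r = requests.get(url, proxies=proxies, timeout=10)
        if r.status_code == 200:
            print('Success via', proxy_url)
            break
        print(r.status_code, 'via', proxy_url)
    except requests.RequestException as e:
        # Bad proxies often time out or refuse the connection; try the next one
        print('Proxy failed:', e)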
3. Using CAPTCHA Recognition Technology
Sometimes, Amazon may return a CAPTCHA page to block scrapers. We can use CAPTCHA recognition technology to automatically identify and input the CAPTCHA, allowing the crawl to continue.
# Auto-identify and input CAPTCHA using CAPTCHA recognition technology
# Omitted (requires the use of third-party CAPTCHA recognition services)
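As a rough illustration of where recognition fits into the flow, here is a minimal sketch. The phrase check is a heuristic for Amazon’s robot-check page, and the actual recognition call depends on whichever third-party service you choose:
import requests
url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
# Amazon's interstitial robot-check page typically contains this phrase
if 'Enter the characters you see below' in r.text:
    # At this point you would: 1) parse the CAPTCHA image URL out of r.text,
    # 2) download the image, 3) send the bytes to your recognition service,
    # 4) POST the recognized text back along with the form's hidden fields
    print('CAPTCHA page detected; hand the image to a recognition service')
else:
    print('No CAPTCHA; proceed with parsing')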
Conclusion:
Through the methods described above, we can successfully bypass Amazon’s anti-scraping mechanisms and obtain the data we want. Note, however, that when collecting data from Amazon we may run into a variety of anti-scraping measures. The most common ones are:
CAPTCHA: Amazon may return a CAPTCHA page when a scraper attempts to access it, requiring users to manually enter a CAPTCHA to confirm their identity. This requires the use of CAPTCHA recognition technology to automatically identify and input the CAPTCHA.
Dynamic loading: Amazon web pages may use Ajax or JavaScript to load content dynamically instead of all at once. This requires tools such as Selenium to simulate browser behavior and ensure all content has loaded before it is retrieved (see the first sketch after this list).
Rate limiting: Amazon’s servers may throttle an IP address that requests the same pages too frequently, to prevent excessive server load. When scraping, control the request frequency to avoid IP bans (see the second sketch after this list).
Anti-scraping algorithms: Amazon may use various algorithms to detect scraper behavior, such as checking specific fields in request headers or analyzing request frequency. In such cases, you need to rotate request headers, switch proxy IPs, and apply similar tactics to evade detection.
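For the dynamic-loading case, Selenium’s explicit waits let you block until the content you need has actually rendered. A minimal sketch, assuming Selenium 4 (which can manage the driver itself); the CSS selector for review blocks is an assumption and may need adjusting to the live page:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get('https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx')
# Wait up to 10 seconds for the dynamically loaded reviews to appear;
# the selector below is an assumption about the page structure
reviews = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div[data-hook="review"]'))
)
for review in reviews:
    print(review.text)
browser.quit()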
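For rate limiting and header-based detection, a simple throttle combined with rotating User-Agent strings goes a long way. A sketch, with a placeholder URL list and a small hand-picked pool of browser User-Agents:
import random
import time
import requests
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]
urls = [
    'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx',
    # ... more pages
]
for url in urls:
    # Pick a fresh User-Agent for each request to vary the header fingerprint
    headers = {'User-Agent': random.choice(user_agents)}
    r = requests.get(url, headers=headers)
    print(url, r.status_code)
    # Sleep a randomized interval between requests to stay under rate limits
    time.sleep(random.uniform(2, 6))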
In summary, dealing with Amazon’s anti-scraping measures requires the comprehensive application of various technologies and strategies to ensure successful data collection.
Advertisement Time: If you want to simplify the data collection process, consider trying Pangolin Scrape API. Just push the collection tasks to the API according to your needs, making data collection easier and more efficient!
If you are a novice with zero coding knowledge and don’t want to use complex RPA tools, you can try Pangolin Scrapper. It collects data from Amazon sites in real time by keyword or ASIN, requires no configuration, and downloads the results in Excel format with one click for analysis and processing.