I. Introduction
1.1 Importance of Amazon's New Releases List
Amazon is one of the world’s largest e-commerce platforms, and its New Releases list showcases the latest and most popular products. For e-commerce sellers and market analysts, understanding these new releases can help capture market trends and consumer preferences, thereby optimizing product strategies and marketing plans.
1.2 Value of Data Scraping from Amazon
By scraping data from Amazon, users can conduct multi-dimensional market analysis. For example, analyzing product price trends, customer reviews, and sales rankings can provide data support for product development and marketing promotion. Additionally, e-commerce sellers can analyze competitors to adjust their operational strategies and improve their competitiveness.
II. Challenges in Scraping Amazon’s New Releases Data
2.1 Dynamic Content Loading
Amazon’s web content is often dynamically loaded, making it impossible to retrieve all data using traditional static scraping methods. This requires tools that can handle dynamic content loading, such as Selenium or Playwright.
2.2 Anti-Scraping Mechanisms
Amazon has powerful anti-scraping mechanisms that detect and block frequent and abnormal requests. This includes detecting user behavior patterns and using CAPTCHA. Bypassing these mechanisms requires advanced techniques, such as IP proxy pools and request frequency control.
2.3 IP Restrictions and Captchas
Amazon restricts frequent requests from the same IP address and may trigger CAPTCHA verification. This requires the scraping program to handle IP restrictions and CAPTCHAs, ensuring continuous and stable data scraping.
2.4 Complex Data Structure
Amazon pages have complex data structures, and there may be differences between different pages. This requires the scraping program to have strong flexibility and adaptability to accurately extract the required data according to different page structures.
III. Preparation of Python Scraping Environment
3.1 Installing Python and Necessary Libraries
First, we need to install Python and related libraries. Here are the steps to install Python and some commonly used libraries:
# Install Python
sudo apt update
sudo apt install python3
sudo apt install python3-pip
# Install necessary libraries
pip3 install scrapy selenium requests beautifulsoup4
3.2 Choosing the Right Scraping Framework (e.g., Scrapy)
Scrapy is a powerful and flexible scraping framework suitable for handling large-scale data scraping tasks. We can install Scrapy with the following command:
pip3 install scrapy
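If you have not created a Scrapy project yet, you can generate one now. The later examples assume a project named myproject, so a minimal setup would be:
# Create a new Scrapy project (the name myproject matches the later import examples)
scrapy startproject myproject
cd myproject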
3.3 Setting Up a Virtual Environment
To ensure dependency management for the project, we recommend using a virtual environment:
# Install virtualenv
pip3 install virtualenv
# Create a virtual environment
virtualenv venv
# Activate the virtual environment
source venv/bin/activate
IV. Designing the Scraper Architecture
4.1 Defining Target URLs and Data Structure
First, we need to determine the target URLs and the data structure to be extracted. For example, we need to extract product name, price, rating, and other information from Amazon’s New Releases list.
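The full Spider example in Section 11.1 imports a ProductItem class from myproject/items.py. As a sketch, an item definition matching the fields above could look like this:
# myproject/items.py -- minimal item definition (sketch) for the fields listed above
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()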
4.2 Creating a Spider Class
In Scrapy, we define the behavior of the scraper by creating a Spider class. Here is a simple Spider class example:
import scrapy

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    start_urls = [
        'https://www.amazon.com/s?i=new-releases',
    ]

    def parse(self, response):
        for product in response.css('div.s-main-slot div.s-result-item'):
            yield {
                'name': product.css('span.a-text-normal::text').get(),
                'price': product.css('span.a-price-whole::text').get(),
                'rating': product.css('span.a-icon-alt::text').get(),
            }
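Assuming the Spider is saved inside a Scrapy project, it can be run and its output exported with Scrapy's command-line tool, for example:
# Run the spider and export the scraped items to a JSON file
scrapy crawl amazon -o new_releases.json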
4.3 Implementing Data Parsing Functions
Data parsing functions are used to extract the required data from the response. In the above example, we use CSS selectors to locate and extract product information.
V. Handling Dynamic Content Loading
5.1 Using Selenium to Simulate Browser Behavior
Selenium can simulate user operations and load dynamic content. Here is an example of using Selenium to load an Amazon page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
service = Service('/path/to/chromedriver')
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # the Options.headless attribute was removed in newer Selenium versions
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.amazon.com/s?i=new-releases')
5.2 Waiting for the Page to Load Completely
When using Selenium, we need to wait for the page to load completely before extracting data:
wait = WebDriverWait(driver, 10)
products = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.s-result-item')))
5.3 Extracting JavaScript Rendered Data
Once the page loads, we can use Selenium to extract data rendered by JavaScript:
for product in products:
    name = product.find_element(By.CSS_SELECTOR, 'span.a-text-normal').text
    price = product.find_element(By.CSS_SELECTOR, 'span.a-price-whole').text
    rating = product.find_element(By.CSS_SELECTOR, 'span.a-icon-alt').text
    print(f'Name: {name}, Price: {price}, Rating: {rating}')
VI. Bypassing Anti-Scraping Mechanisms
6.1 Setting User-Agent
Setting a User-Agent can simulate real user requests and bypass some anti-scraping detections:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get('https://www.amazon.com/s?i=new-releases', headers=headers)
6.2 Implementing an IP Proxy Pool
Routing requests through proxies helps avoid IP bans. The example below configures a single proxy; a real pool rotates across many addresses, as sketched after the code:
import requests

proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port'
}
response = requests.get('https://www.amazon.com/s?i=new-releases', headers=headers, proxies=proxies)
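A simple pool rotates across several proxies. The addresses below are placeholders that you would replace with proxies from your own provider:
import random
import requests

# Placeholder proxy addresses -- replace with real proxies from your provider
proxy_pool = [
    'http://proxy1_ip:port',
    'http://proxy2_ip:port',
    'http://proxy3_ip:port',
]

proxy = random.choice(proxy_pool)  # pick a different proxy for each request
response = requests.get(
    'https://www.amazon.com/s?i=new-releases',
    headers=headers,
    proxies={'http': proxy, 'https': proxy},
)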
6.3 Controlling Request Frequency
Controlling the request frequency can reduce the risk of detection:
import time

for url in urls:
    response = requests.get(url, headers=headers)
    # Process the response data here
    time.sleep(2)  # Delay 2 seconds between requests
VII. Handling Captchas and Login
7.1 Recognizing Captchas (OCR Technology)
We can try OCR to recognize simple captchas, for example with Tesseract via the pytesseract library. Note that Amazon's CAPTCHAs are designed to resist automated recognition, so plain OCR will not always succeed:
from PIL import Image
import pytesseract
captcha_image = Image.open('captcha.png')
captcha_text = pytesseract.image_to_string(captcha_image)
7.2 Simulating the Login Process
Using Selenium, we can simulate the login process:
driver.get('https://www.amazon.com/ap/signin')
# Note: Amazon often splits sign-in into two steps (email, a Continue button, then
# the password field), so an extra click may be required in practice.
username = driver.find_element(By.ID, 'ap_email')
username.send_keys('[email protected]')
password = driver.find_element(By.ID, 'ap_password')
password.send_keys('your_password')
login_button = driver.find_element(By.ID, 'signInSubmit')
login_button.click()
7.3 Maintaining Session State
Using the Session object from the requests library, we can maintain session state (cookies) across requests. Note that Amazon's sign-in form also includes hidden tokens, so the bare POST below is illustrative rather than something that works as-is; see the sketch after the code:
import requests
session = requests.Session()
session.post('https://www.amazon.com/ap/signin', data={'email': '[email protected]', 'password': 'your_password'})
response = session.get('https://www.amazon.com/s?i=new-releases')
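A more practical pattern is to log in once with Selenium (Section 7.2) and copy its cookies into a requests.Session. The sketch below assumes the driver object from the Selenium login example:
import requests

# Transfer cookies from the logged-in Selenium driver into a requests session
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie.get('domain'))

response = session.get('https://www.amazon.com/s?i=new-releases', headers=headers)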
VIII. Data Extraction and Cleaning
8.1 Locating Elements Using XPath or CSS Selectors
Using XPath or CSS selectors can accurately locate page elements:
from lxml import html
tree = html.fromstring(response.content)
names = tree.xpath('//span[@class="a-text-normal"]/text()')
8.2 Extracting Product Information (Name, Price, Rating, etc.)
Example code for extracting product information:
for product in products:
    name = product.find_element(By.CSS_SELECTOR, 'span.a-text-normal').text
    price = product.find_element(By.CSS_SELECTOR, 'span.a-price-whole').text
    rating = product.find_element(By.CSS_SELECTOR, 'span.a-icon-alt').text
8.3 Data Cleaning and Formatting
Clean and format the data for easy storage and analysis:
cleaned_data = []
for product in raw_data:
    name = product['name'].strip()
    price = float(product['price'].replace(',', ''))
    rating = float(product['rating'].split()[0])
    cleaned_data.append({'name': name, 'price': price, 'rating': rating})
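Selectors can miss, so fields may come back as None or in unexpected formats. A slightly more defensive version of the same cleaning step, sketched below, simply skips incomplete or unparsable records:
cleaned_data = []
for product in raw_data:
    name = (product.get('name') or '').strip()
    price_raw = product.get('price') or ''
    rating_raw = product.get('rating') or ''
    if not (name and price_raw and rating_raw):
        continue  # skip records with missing fields
    try:
        price = float(price_raw.replace(',', ''))
        rating = float(rating_raw.split()[0])  # e.g. "4.5 out of 5 stars" -> 4.5
    except ValueError:
        continue  # skip records that fail to parse
    cleaned_data.append({'name': name, 'price': price, 'rating': rating})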
IX. Data Storage
9.1 Choosing the Right Database (e.g., MongoDB)
MongoDB is a NoSQL database well suited to storing scraped data. The commands below use the Ubuntu distribution package; on newer releases you may instead need MongoDB's official mongodb-org packages:
# Install MongoDB
sudo apt install -y mongodb
# Start MongoDB
sudo service mongodb start
9.2 Designing the Data Model
Design the data model to ensure clear data structure:
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["amazon"]
collection = db["new_releases"]
product = {
    'name': 'Sample Product',
    'price': 19.99,
    'rating': 4.5
}
collection.insert_one(product)
9.3 Implementing Data Persistence
Persistently store the extracted data into MongoDB:
for product in cleaned_data:
    collection.insert_one(product)
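For larger batches, pymongo's insert_many reduces round trips to the database:
if cleaned_data:  # insert_many raises an error on an empty list
    collection.insert_many(cleaned_data)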
X. Scraper Optimization
10.1 Multithreading and Asynchronous Processing
Using multithreading or asynchronous processing can improve scraper efficiency:
import threading

def fetch_data(url):
    response = requests.get(url, headers=headers)
    # Process the response data here

threads = []
for url in urls:
    t = threading.Thread(target=fetch_data, args=(url,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()
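For asynchronous processing, a minimal sketch using asyncio and aiohttp (an extra dependency not installed earlier in this article) could look like this:
import asyncio
import aiohttp

async def fetch_data_async(session, url):
    # Fetch a single page, reusing the headers defined earlier
    async with session.get(url, headers=headers) as resp:
        return await resp.text()

async def fetch_all(urls):
    # Fetch all URLs concurrently over a shared client session
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_data_async(session, url) for url in urls))

pages = asyncio.run(fetch_all(urls))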
10.2 Distributed Scraping
Distributed scraping can further enhance the scale and speed of data scraping. Tools like Scrapy-Redis can implement distributed scraping:
# Install Scrapy-Redis
pip3 install scrapy-redis
# Configure Scrapy-Redis in settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://localhost:6379'
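With these settings in place, a spider can read its start URLs from a shared Redis queue instead of hard-coding them, so multiple workers cooperate on one crawl. A minimal sketch:
from scrapy_redis.spiders import RedisSpider

class DistributedAmazonSpider(RedisSpider):
    name = "amazon_distributed"
    # Workers pop start URLs pushed to this Redis list, e.g.:
    # redis-cli lpush amazon:start_urls https://www.amazon.com/s?i=new-releases
    redis_key = "amazon:start_urls"

    def parse(self, response):
        for product in response.css('div.s-main-slot div.s-result-item'):
            yield {'name': product.css('span.a-text-normal::text').get()}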
10.3 Incremental Scraping
Incremental scraping can avoid duplicate data collection and save resources:
last_crawled_time = get_last_crawled_time()

for product in new_products:
    if product['date'] > last_crawled_time:
        collection.insert_one(product)
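The helper get_last_crawled_time() is not defined above. One way to implement it, assuming each stored document carries a 'date' field (for example an ISO-formatted string), is to ask MongoDB for the most recent value:
import pymongo

def get_last_crawled_time():
    # Newest 'date' already stored; an empty string sorts before any ISO date on the first run
    latest = collection.find_one(sort=[('date', pymongo.DESCENDING)])
    return latest['date'] if latest else ''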
XI. Code Implementation Examples
11.1 Spider Class Code
Here is a complete Spider class example:
import scrapy
from myproject.items import ProductItem

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    start_urls = ['https://www.amazon.com/s?i=new-releases']

    def parse(self, response):
        for product in response.css('div.s-main-slot div.s-result-item'):
            item = ProductItem()
            item['name'] = product.css('span.a-text-normal::text').get()
            item['price'] = product.css('span.a-price-whole::text').get()
            item['rating'] = product.css('span.a-icon-alt::text').get()
            yield item
11.2 Data Parsing Functions
Example of data parsing functions:
def parse_product(response):
    name = response.css('span.a-text-normal::text').get()
    price = response.css('span.a-price-whole::text').get()
    rating = response.css('span.a-icon-alt::text').get()
    return {'name': name, 'price': price, 'rating': rating}
11.3 Anti-Scraping Handling Code
Example of anti-scraping handling code:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port'
}
response = requests.get('https://www.amazon.com/s?i=new-releases', headers=headers, proxies=proxies)
11.4 Data Storage Code
Example of data storage code:
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["amazon"]
collection = db["new_releases"]
for product in cleaned_data:
    collection.insert_one(product)
XII. Precautions and Best Practices
12.1 Follow robots.txt Rules
Scrapers should follow the target website’s robots.txt rules to avoid putting undue load on the server. In Scrapy, this is controlled by a setting in settings.py:
ROBOTSTXT_OBEY = True
12.2 Error Handling and Logging
Good error handling and logging can improve the stability and maintainability of the scraper:
import logging
import requests

logging.basicConfig(filename='scrapy.log', level=logging.INFO)

try:
    response = requests.get('https://www.amazon.com/s?i=new-releases', headers=headers, proxies=proxies)
except requests.exceptions.RequestException as e:
    logging.error(f"Request failed: {e}")
12.3 Regular Maintenance and Updates
Regularly maintain and update the scraper to ensure it adapts to changes in the website structure and anti-scraping mechanisms.
XIII. Summary of the Current Situation and Difficulties in Scraping Amazon Data
13.1 Technical Challenges
Scraping Amazon data faces many technical challenges, such as dynamic content loading, anti-scraping mechanisms, IP restrictions, and CAPTCHAs. These problems require a combination of multiple technical means to solve.
13.2 Legal and Ethical Considerations
When scraping data, it is necessary to comply with relevant laws and regulations and respect the terms of use of the target website. Illegal or unethical data scraping behaviors may bring legal risks and ethical disputes.
13.3 Data Quality and Timeliness Issues
Data quality and timeliness are important indicators of data scraping. During scraping, efforts should be made to ensure the accuracy and timeliness of the data to avoid outdated or erroneous data affecting analysis results.
XIV. A Better Choice: Pangolin Scrape API
14.1 Introduction to Scrape API
Pangolin Scrape API is a professional data scraping service that provides efficient and stable data scraping solutions, supporting various target websites and data types.
14.2 Main Features and Advantages
Pangolin Scrape API has the following features and advantages:
- Efficient and Stable: Based on a distributed architecture, it can handle large-scale data scraping tasks, ensuring the efficiency and stability of data scraping.
- Simple and Easy to Use: Provides simple and easy-to-use API interfaces, without complex configuration and programming, allowing users to quickly integrate and use.
- Real-time Updates: Supports real-time data scraping and updates, ensuring the timeliness and accuracy of the data.
- Safe and Reliable: Provides multi-level security protection measures to ensure the legality and security of data scraping.
14.3 Applicable Scenarios
Pangolin Scrape API is suitable for the following scenarios:
- Market Analysis: Scraping product data from e-commerce platforms for market trend analysis and competitor research.
- Data Mining: Obtaining data from various websites for data mining and business intelligence analysis.
- Academic Research: Scraping data required for research to support academic research and thesis writing.
XV. Conclusion
15.1 Limitations of Python Scrapers
Although Python scrapers are powerful tools for data collection, they still have limitations when it comes to handling complex dynamic content, bypassing anti-scraping mechanisms, and maintaining session state.
15.2 Importance of Choosing the Right Data Scraping Method
Choosing the right data scraping method according to specific needs is very important. For complex scraping tasks, using professional data scraping services (such as Pangolin Scrape API) may be a better choice. Regardless of the method chosen, attention should be paid to data quality, timeliness, and legality to ensure the effectiveness and safety of data scraping.