Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

Features

  • Multi-Source Scraping: Collect data from multiple e-commerce websites simultaneously
  • Flexible Configuration: Easy-to-use YAML configuration for different sources and scraping parameters
  • Dual Scraping Methods: Supports both requests/BeautifulSoup and Selenium for dynamic content
  • Data Analysis: Built-in analysis of pricing trends, ratings, and product availability
  • Multiple Output Formats: Save data in CSV, JSON, or Excel formats
  • Robust Error Handling: Comprehensive error handling and retry mechanisms
  • Professional Logging: Detailed logging with configurable levels and outputs
  • Extensible Architecture: Modular design for easy customization and extension

Installation

Prerequisites

  • Python 3.8 or higher
  • Chrome browser (for Selenium functionality)
  • ChromeDriver (compatible with your Chrome version)

Setup

  1. Clone the repository:
git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
cd market-trends-scraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install ChromeDriver (if using Selenium):
# For Ubuntu/Debian:
sudo apt-get install chromium-chromedriver

# For macOS using Homebrew:
brew install chromedriver

# For Windows, download from: https://chromedriver.chromium.org/

Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at config/config.yaml when you first run the scraper.

Example Configuration

scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true

Configuration Options

Scraper Settings

  • delay_between_requests: Delay between requests in seconds (default: 1.0)
  • timeout: Request timeout in seconds (default: 30)
  • max_retries: Maximum number of retry attempts for failed requests (default: 3)
  • user_agent: User agent string for HTTP requests
  • headless: Run browser in headless mode (default: true)
  • window_size: Browser window size as [width, height] (default: [1920, 1080])
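
As a rough sketch of how these settings interact on the requests-based path (the function below is illustrative, not the project's actual implementation):

import time
import requests

def fetch_with_retries(url, *, timeout=30, max_retries=3,
                       delay_between_requests=1.0, user_agent=None):
    """Illustrative request loop: retry failed requests, pausing between attempts."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(delay_between_requests)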

Source Configuration

  • name: Unique identifier for the data source
  • url: Base URL for scraping
  • type: Type of website (e.g., "ecommerce")
  • enabled: Whether to scrape this source (default: true)
  • use_selenium: Use Selenium instead of requests (default: false)
  • selectors: CSS selectors for extracting data
  • pagination: Pagination settings
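
The selectors block maps output fields to CSS selectors applied within each product element. A minimal sketch of that extraction step using BeautifulSoup (extract_products is a hypothetical helper, not part of the package API):

from bs4 import BeautifulSoup

def extract_products(html, selectors):
    """Apply a source's CSS selectors to one page of HTML (illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(selectors["product"]):
        def text_of(key):
            node = item.select_one(selectors[key])
            return node.get_text(strip=True) if node else None
        products.append({
            "name": text_of("name"),
            "price": text_of("price"),
            "rating": text_of("rating"),
            "availability": text_of("availability"),
        })
    return products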

Output Settings

  • format: Output format ("csv", "json", or "excel")
  • include_timestamp: Include timestamp in output filename (default: true)
  • filename: Base filename for output files
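
The project exposes scraper.save_data for this (see the Python API below); as a sketch of the dispatch these settings imply, assuming tabular data and pandas:

from datetime import datetime
import pandas as pd

def save_data(records, fmt="csv", filename="market_trends_data",
              include_timestamp=True):
    """Illustrative writer dispatch driven by the output settings."""
    df = pd.DataFrame(records)
    if include_timestamp:
        filename = f"{filename}_{datetime.now():%Y%m%d_%H%M%S}"
    if fmt == "csv":
        df.to_csv(f"{filename}.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{filename}.json", orient="records", indent=2)
    elif fmt == "excel":
        df.to_excel(f"{filename}.xlsx", index=False)  # requires openpyxl
    else:
        raise ValueError(f"Unsupported format: {fmt}")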

Analysis Settings

  • price_history_days: Number of days to consider for price history (default: 30)
  • trend_threshold: Minimum relative price change, as a fraction, to count as a trend (default: 0.05, i.e. 5%)
  • generate_charts: Generate trend charts (default: true)
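
For example, with the default trend_threshold of 0.05, a price must move at least 5% in either direction before it is flagged as a trend. A tiny illustrative helper (not the project's actual logic):

def classify_trend(old_price, new_price, trend_threshold=0.05):
    """Flag a relative price move beyond the threshold (0.05 = 5%)."""
    change = (new_price - old_price) / old_price
    if change >= trend_threshold:
        return "rising"
    if change <= -trend_threshold:
        return "falling"
    return "stable"

# 45.00 -> 49.99 is a +11.1% change, so it counts as "rising" at the 5% threshold.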

Usage

Command Line Interface

Run the scraper with default settings:

python main.py

Specify a custom configuration file:

python main.py --config path/to/config.yaml

Specify output file:

python main.py --output path/to/output.csv

Run in verbose mode:

python main.py --verbose

Run browser in non-headless mode (for debugging):

python main.py --no-headless
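
These options can be combined:

python main.py --config path/to/config.yaml --output path/to/output.csv --verbose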

Python API

from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Setup logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize scraper
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()
    
    # Save data
    scraper.save_data(data, "output.csv")
    
    # Analyze trends
    analysis = scraper.analyze_trends(data)
    
    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")

Output

Data Output

The scraper produces structured data with the following fields:

Field          Description
name           Product name
price          Product price (as float)
rating         Product rating (as float)
availability   Product availability status
url            Product URL
source         Data source name
scraped_at     Timestamp when the data was scraped
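
Because the output is flat tabular data, it loads straight into pandas for further work (assuming CSV output; the timestamp suffix on the filename is omitted here):

import pandas as pd

df = pd.read_csv("market_trends_data.csv", parse_dates=["scraped_at"])
print(df.groupby("source")["price"].describe())  # per-source price summary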

Analysis Output

The trend analysis includes:

  • Summary Statistics: Total products, source distribution
  • Price Analysis: Average, min, max, median prices and distribution
  • Rating Analysis: Average, min, max ratings and distribution
  • Availability Analysis: Count of products by availability status
  • Price Trends by Source: Comparative analysis across sources

Example analysis output:

{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}

Testing

Run all tests:

pytest

Run unit tests only:

pytest -m unit

Run integration tests only:

pytest -m integration

Run tests with coverage report:

pytest --cov=src --cov-report=html

Project Structure

market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py    # Configuration management
│   ├── logger.py            # Logging utilities
│   └── scraper.py           # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml          # Configuration file
├── data/                    # Output data directory
├── main.py                  # Main entry point
├── requirements.txt         # Python dependencies
├── pytest.ini              # Test configuration
└── README.md               # This file

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Write comprehensive tests for new features
  • Update documentation as needed
  • Ensure all tests pass before submitting

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for educational and research purposes. Users are responsible for:

  • Complying with websites' terms of service
  • Respecting robots.txt files
  • Using the tool ethically and responsibly
  • Not overwhelming servers with too many requests

The authors are not responsible for any misuse of this tool.

Support

If you encounter any issues or have questions:

  1. Check the Issues page
  2. Create a new issue with detailed information
  3. For general questions, use the Discussions tab

Acknowledgments