Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

Features

  • Multi-Source Scraping: Collect data from multiple e-commerce websites simultaneously
  • Flexible Configuration: Easy-to-use YAML configuration for different sources and scraping parameters
  • Dual Scraping Methods: Supports both requests/BeautifulSoup and Selenium for dynamic content
  • Data Analysis: Built-in analysis of pricing trends, ratings, and product availability
  • Multiple Output Formats: Save data in CSV, JSON, or Excel formats
  • Robust Error Handling: Comprehensive error handling and retry mechanisms
  • Professional Logging: Detailed logging with configurable levels and outputs
  • Extensible Architecture: Modular design for easy customization and extension

Installation

Prerequisites

  • Python 3.8 or higher
  • Chrome browser (for Selenium functionality)
  • ChromeDriver (compatible with your Chrome version)

Setup

  1. Clone the repository:
git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
cd market-trends-scraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install ChromeDriver (if using Selenium):
# For Ubuntu/Debian:
sudo apt-get install chromium-chromedriver

# For macOS using Homebrew:
brew install chromedriver

# For Windows, download from: https://chromedriver.chromium.org/

Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at config/config.yaml when you first run the scraper.

Example Configuration

scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true

Configuration Options

Scraper Settings

  • delay_between_requests: Delay between requests in seconds (default: 1.0)
  • timeout: Request timeout in seconds (default: 30)
  • max_retries: Maximum number of retry attempts for failed requests (default: 3)
  • user_agent: User agent string for HTTP requests
  • headless: Run browser in headless mode (default: true)
  • window_size: Browser window size as [width, height] (default: [1920, 1080])
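
As a rough sketch of how these settings interact on the requests-based path (the function below is illustrative, not the project's actual implementation):

import time
import requests

def fetch_with_retries(url, *, timeout=30, max_retries=3,
                       delay_between_requests=1.0, user_agent=None):
    """Illustrative request loop: retry failed requests, pausing between attempts."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(delay_between_requests)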

Source Configuration

  • name: Unique identifier for the data source
  • url: Base URL for scraping
  • type: Type of website (e.g., "ecommerce")
  • enabled: Whether to scrape this source (default: true)
  • use_selenium: Use Selenium instead of requests (default: false)
  • selectors: CSS selectors for extracting data
  • pagination: Pagination settings
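
The selectors block maps output fields to CSS selectors applied within each product element. A minimal sketch of that extraction step using BeautifulSoup (extract_products is a hypothetical helper, not part of the package API):

from bs4 import BeautifulSoup

def extract_products(html, selectors):
    """Apply a source's CSS selectors to one page of HTML (illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(selectors["product"]):
        def text_of(key):
            node = item.select_one(selectors[key])
            return node.get_text(strip=True) if node else None
        products.append({
            "name": text_of("name"),
            "price": text_of("price"),
            "rating": text_of("rating"),
            "availability": text_of("availability"),
        })
    return products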

Output Settings

  • format: Output format ("csv", "json", or "excel")
  • include_timestamp: Include timestamp in output filename (default: true)
  • filename: Base filename for output files
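
The project exposes scraper.save_data for this (see the Python API below); as a sketch of the dispatch these settings imply, assuming tabular data and pandas:

from datetime import datetime
import pandas as pd

def save_data(records, fmt="csv", filename="market_trends_data",
              include_timestamp=True):
    """Illustrative writer dispatch driven by the output settings."""
    df = pd.DataFrame(records)
    if include_timestamp:
        filename = f"{filename}_{datetime.now():%Y%m%d_%H%M%S}"
    if fmt == "csv":
        df.to_csv(f"{filename}.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{filename}.json", orient="records", indent=2)
    elif fmt == "excel":
        df.to_excel(f"{filename}.xlsx", index=False)  # requires openpyxl
    else:
        raise ValueError(f"Unsupported format: {fmt}")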

Analysis Settings

  • price_history_days: Number of days to consider for price history (default: 30)
  • trend_threshold: Minimum relative price change, as a fraction, to count as a trend (default: 0.05, i.e. 5%)
  • generate_charts: Generate trend charts (default: true)
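
For example, with the default trend_threshold of 0.05, a price must move at least 5% in either direction before it is flagged as a trend. A tiny illustrative helper (not the project's actual logic):

def classify_trend(old_price, new_price, trend_threshold=0.05):
    """Flag a relative price move beyond the threshold (0.05 = 5%)."""
    change = (new_price - old_price) / old_price
    if change >= trend_threshold:
        return "rising"
    if change <= -trend_threshold:
        return "falling"
    return "stable"

# 45.00 -> 49.99 is a +11.1% change, so it counts as "rising" at the 5% threshold.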

Usage

Command Line Interface

Run the scraper with default settings:

python main.py

Specify a custom configuration file:

python main.py --config path/to/config.yaml

Specify output file:

python main.py --output path/to/output.csv

Run in verbose mode:

python main.py --verbose

Run browser in non-headless mode (for debugging):

python main.py --no-headless
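
These options can be combined:

python main.py --config path/to/config.yaml --output path/to/output.csv --verbose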

Python API

from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Setup logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize scraper
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()
    
    # Save data
    scraper.save_data(data, "output.csv")
    
    # Analyze trends
    analysis = scraper.analyze_trends(data)
    
    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")

Output

Data Output

The scraper produces structured data with the following fields:

Field          Description
name           Product name
price          Product price (as float)
rating         Product rating (as float)
availability   Product availability status
url            Product URL
source         Data source name
scraped_at     Timestamp when the data was scraped
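
Because the output is flat tabular data, it loads straight into pandas for further work (assuming CSV output; the timestamp suffix on the filename is omitted here):

import pandas as pd

df = pd.read_csv("market_trends_data.csv", parse_dates=["scraped_at"])
print(df.groupby("source")["price"].describe())  # per-source price summary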

Analysis Output

The trend analysis includes:

  • Summary Statistics: Total products, source distribution
  • Price Analysis: Average, min, max, median prices and distribution
  • Rating Analysis: Average, min, max ratings and distribution
  • Availability Analysis: Count of products by availability status
  • Price Trends by Source: Comparative analysis across sources

Example analysis output:

{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}

Testing

Run all tests:

pytest

Run unit tests only:

pytest -m unit

Run integration tests only:

pytest -m integration

Run tests with coverage report:

pytest --cov=src --cov-report=html

Project Structure

market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py    # Configuration management
│   ├── logger.py            # Logging utilities
│   └── scraper.py           # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml          # Configuration file
├── data/                    # Output data directory
├── main.py                  # Main entry point
├── requirements.txt         # Python dependencies
├── pytest.ini              # Test configuration
└── README.md               # This file

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Write comprehensive tests for new features
  • Update documentation as needed
  • Ensure all tests pass before submitting

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for educational and research purposes. Users are responsible for:

  • Complying with websites' terms of service
  • Respecting robots.txt files
  • Using the tool ethically and responsibly
  • Not overwhelming servers with too many requests

The authors are not responsible for any misuse of this tool.

Support

If you encounter any issues or have questions:

  1. Check the Issues page
  2. Create a new issue with detailed information
  3. For general questions, use the Discussions tab

Acknowledgments