# Market Trends Scraper
A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.
## Features
- Multi-Source Scraping: Collect data from multiple e-commerce websites simultaneously
- Flexible Configuration: Easy-to-use YAML configuration for different sources and scraping parameters
- Dual Scraping Methods: Supports both requests/BeautifulSoup and Selenium for dynamic content
- Data Analysis: Built-in analysis of pricing trends, ratings, and product availability
- Multiple Output Formats: Save data in CSV, JSON, or Excel formats
- Robust Error Handling: Comprehensive error handling and retry mechanisms
- Professional Logging: Detailed logging with configurable levels and outputs
- Extensible Architecture: Modular design for easy customization and extension
## Installation

### Prerequisites

- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
   cd market-trends-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install ChromeDriver (if using Selenium):

   ```bash
   # For Ubuntu/Debian:
   sudo apt-get install chromium-chromedriver

   # For macOS using Homebrew:
   brew install chromedriver

   # For Windows, download from: https://chromedriver.chromium.org/
   ```
## Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.
### Example Configuration

```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```
### Configuration Options

#### Scraper Settings

- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3)
- `user_agent`: User agent string for HTTP requests
- `headless`: Run the browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])
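To make these settings concrete, here is a minimal sketch of how they might drive a request loop. `fetch_with_retries` is a hypothetical helper for illustration, not part of the scraper's actual API:

```python
import time

import requests


def fetch_with_retries(url: str, max_retries: int = 3,
                       delay: float = 1.0, timeout: int = 30,
                       user_agent: str = "Mozilla/5.0") -> requests.Response:
    """Fetch a URL, pausing between attempts and retrying on failure."""
    headers = {"User-Agent": user_agent}
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # treat HTTP error codes as failures too
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(delay)  # wait before the next attempt
```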
#### Source Configuration

- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data
- `pagination`: Pagination settings
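As an illustration of how a `selectors` block is typically consumed, the sketch below applies the example selectors with BeautifulSoup to a tiny HTML fragment. It is a simplified stand-in for the scraper's actual extraction logic:

```python
from bs4 import BeautifulSoup

# Selector values taken from the example configuration above.
selectors = {
    "product": "div.product-item",
    "name": "h2.product-title",
    "price": "span.price",
}

html = """
<div class="product-item">
  <h2 class="product-title">Widget</h2>
  <span class="price">$9.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.select(selectors["product"]):
    name = item.select_one(selectors["name"])
    price = item.select_one(selectors["price"])
    print(name.get_text(strip=True), price.get_text(strip=True))
```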
#### Output Settings

- `format`: Output format ("csv", "json", or "excel")
- `include_timestamp`: Include a timestamp in the output filename (default: true)
- `filename`: Base filename for output files
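For illustration, the hypothetical `save_frame` helper below shows how these three settings could combine into a timestamped filename and the matching pandas writer; the scraper's real `save_data` implementation may differ:

```python
from datetime import datetime

import pandas as pd


def save_frame(df: pd.DataFrame, filename: str, fmt: str = "csv",
               include_timestamp: bool = True) -> str:
    """Write a DataFrame using the configured output format; returns the path."""
    if include_timestamp:
        filename = f"{filename}_{datetime.now():%Y%m%d_%H%M%S}"
    if fmt == "csv":
        path = f"{filename}.csv"
        df.to_csv(path, index=False)
    elif fmt == "json":
        path = f"{filename}.json"
        df.to_json(path, orient="records")
    elif fmt == "excel":
        path = f"{filename}.xlsx"
        df.to_excel(path, index=False)  # requires the openpyxl package
    else:
        raise ValueError(f"Unsupported format: {fmt}")
    return path
```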
#### Analysis Settings

- `price_history_days`: Number of days of price history to consider (default: 30)
- `trend_threshold`: Minimum fractional price change to count as a trend (default: 0.05, i.e. 5%)
- `generate_charts`: Generate trend charts (default: true)
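To pin down what `trend_threshold` means in practice, here is a hypothetical check of whether a price move qualifies as a trend under the default threshold:

```python
def is_trend(old_price: float, new_price: float,
             trend_threshold: float = 0.05) -> bool:
    """Return True if the relative price change meets the threshold."""
    if old_price <= 0:
        return False
    change = abs(new_price - old_price) / old_price
    return change >= trend_threshold


print(is_trend(100.0, 104.0))  # False: a 4% change is below the 5% threshold
print(is_trend(100.0, 110.0))  # True: a 10% change exceeds it
```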
## Usage

### Command Line Interface

Run the scraper with default settings:

```bash
python main.py
```

Specify a custom configuration file:

```bash
python main.py --config path/to/config.yaml
```

Specify an output file:

```bash
python main.py --output path/to/output.csv
```

Run in verbose mode:

```bash
python main.py --verbose
```

Run the browser in non-headless mode (for debugging):

```bash
python main.py --no-headless
```
### Python API

```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Set up logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize the scraper
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save the analysis
    scraper.save_analysis(analysis, "analysis.json")
```
## Output

### Data Output

The scraper produces structured data with the following fields:

| Field | Description |
|---|---|
| `name` | Product name |
| `price` | Product price (as float) |
| `rating` | Product rating (as float) |
| `availability` | Product availability status |
| `url` | Product URL |
| `source` | Data source name |
| `scraped_at` | Timestamp when the data was scraped |
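For example, a single scraped record might look like this (all values, including the timestamp format, are illustrative):

```python
record = {
    "name": "Example Widget",
    "price": 19.99,
    "rating": 4.5,
    "availability": "In Stock",
    "url": "https://example-ecommerce.com/products/widget",
    "source": "example_ecommerce",
    "scraped_at": "2024-01-15T12:34:56",
}
```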
### Analysis Output
The trend analysis includes:
- Summary Statistics: Total products, source distribution
- Price Analysis: Average, min, max, median prices and distribution
- Rating Analysis: Average, min, max ratings and distribution
- Availability Analysis: Count of products by availability status
- Price Trends by Source: Comparative analysis across sources
Example analysis output:
```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```
## Testing

Run all tests:

```bash
pytest
```

Run unit tests only:

```bash
pytest -m unit
```

Run integration tests only:

```bash
pytest -m integration
```

Run tests with a coverage report:

```bash
pytest --cov=src --cov-report=html
```
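The `unit` and `integration` selections rely on pytest markers (presumably registered in `pytest.ini`); a test opts in as in this hypothetical example:

```python
import pytest


@pytest.mark.unit  # selected by `pytest -m unit`
def test_addition_is_commutative():
    assert 2 + 3 == 3 + 2
```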
## Project Structure

```text
market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py       # Configuration management
│   ├── logger.py               # Logging utilities
│   └── scraper.py              # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml             # Configuration file
├── data/                       # Output data directory
├── main.py                     # Main entry point
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Test configuration
└── README.md                   # This file
```
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Guidelines
- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Disclaimer

This tool is for educational and research purposes. Users are responsible for:

- Complying with websites' terms of service
- Respecting robots.txt files (see the sketch below)
- Using the tool ethically and responsibly
- Not overwhelming servers with excessive requests
The authors are not responsible for any misuse of this tool.
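For the robots.txt point above, the standard library offers a simple compliance check. This sketch (reusing the example domain from the configuration) is a courtesy pattern, not something the scraper enforces automatically:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the target site's robots.txt and fetch it.
rp = RobotFileParser("https://example-ecommerce.com/robots.txt")
rp.read()

# Check whether our (illustrative) user agent may fetch the search page.
if rp.can_fetch("MyScraperBot/1.0", "https://example-ecommerce.com/search"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")
```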
## Support
If you encounter any issues or have questions:
- Check the Issues page
- Create a new issue with detailed information
- For general questions, use the Discussions tab
## Acknowledgments
- Beautiful Soup for HTML parsing
- Selenium for browser automation
- Pandas for data analysis
- Loguru for logging