# Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

## Features

- **Multi-Source Scraping**: Collect data from multiple e-commerce websites simultaneously
- **Flexible Configuration**: Easy-to-use YAML configuration for different sources and scraping parameters
- **Dual Scraping Methods**: Supports both requests/BeautifulSoup and Selenium for dynamic content
- **Data Analysis**: Built-in analysis of pricing trends, ratings, and product availability
- **Multiple Output Formats**: Save data in CSV, JSON, or Excel formats
- **Robust Error Handling**: Comprehensive error handling and retry mechanisms
- **Professional Logging**: Detailed logging with configurable levels and outputs
- **Extensible Architecture**: Modular design for easy customization and extension

## Installation

### Prerequisites

- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
   cd market-trends-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install ChromeDriver (if using Selenium):

   ```bash
   # For Ubuntu/Debian:
   sudo apt-get install chromium-chromedriver

   # For macOS using Homebrew:
   brew install chromedriver

   # For Windows, download from: https://chromedriver.chromium.org/
   ```
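After installing the driver, you can optionally verify that headless Chrome launches before running the scraper. The snippet below is a minimal smoke test, not part of this project; it assumes Selenium 4.6+ (which can also locate a matching driver automatically via Selenium Manager) and a recent Chrome that supports the `--headless=new` flag.

```python
# Minimal smoke test for the headless Chrome setup (illustrative only).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    # Load a known page and print its title to confirm the browser works.
    driver.get("https://example.com")
    print(f"Chrome OK, loaded: {driver.title}")
finally:
    driver.quit()
```

If this prints a page title without errors, the Selenium-based sources in your configuration should be able to run.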
## Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.

### Example Configuration

```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```

### Configuration Options

#### Scraper Settings

- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3)
- `user_agent`: User agent string for HTTP requests
- `headless`: Run browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])

#### Source Configuration

- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data
- `pagination`: Pagination settings

#### Output Settings

- `format`: Output format ("csv", "json", or "excel")
- `include_timestamp`: Include timestamp in output filename (default: true)
- `filename`: Base filename for output files

#### Analysis Settings

- `price_history_days`: Number of days to consider for price history (default: 30)
- `trend_threshold`: Minimum relative price change to count as a trend (default: 0.05, i.e. 5%)
- `generate_charts`: Generate trend charts (default: true)

## Usage

### Command Line Interface

Run the scraper with default settings:

```bash
python main.py
```

Specify a custom configuration file:

```bash
python main.py --config path/to/config.yaml
```

Specify the output file:

```bash
python main.py --output path/to/output.csv
```

Run in verbose mode:

```bash
python main.py --verbose
```

Run the browser in non-headless mode (for debugging):

```bash
python main.py --no-headless
```

### Python API

```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Set up logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize the scraper
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")
```

## Output

### Data Output

The scraper produces structured data with the following fields:

| Field | Description |
|-------|-------------|
| name | Product name |
| price | Product price (as float) |
| rating | Product rating (as float) |
| availability | Product availability status |
| url | Product URL |
| source | Data source name |
| scraped_at | Timestamp when data was scraped |
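Because the columns are fixed, the CSV output is easy to post-process with your own tooling. A brief sketch (illustrative, not part of the project; the file path is an assumption, since the actual name depends on the `filename` and `include_timestamp` output settings):

```python
import pandas as pd

# Load a CSV produced by the scraper, parsing the documented timestamp column.
# The path below is hypothetical; adjust it to your output settings.
df = pd.read_csv("data/market_trends_data.csv", parse_dates=["scraped_at"])

# Per-source price summary using the documented columns.
print(df.groupby("source")["price"].agg(["count", "mean", "min", "max"]))

# A simple downstream filter: well-rated items priced below their
# source's median price.
median_price = df.groupby("source")["price"].transform("median")
deals = df[(df["rating"] >= 4.0) & (df["price"] < median_price)]
print(deals[["name", "price", "rating", "source"]])
```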
### Analysis Output

The trend analysis includes:

- **Summary Statistics**: Total products, source distribution
- **Price Analysis**: Average, min, max, median prices and distribution
- **Rating Analysis**: Average, min, max ratings and distribution
- **Availability Analysis**: Count of products by availability status
- **Price Trends by Source**: Comparative analysis across sources

Example analysis output:

```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```

## Testing

Run all tests:

```bash
pytest
```

Run unit tests only:

```bash
pytest -m unit
```

Run integration tests only:

```bash
pytest -m integration
```

Run tests with a coverage report:

```bash
pytest --cov=src --cov-report=html
```

## Project Structure

```
market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py      # Configuration management
│   ├── logger.py              # Logging utilities
│   └── scraper.py             # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml            # Configuration file
├── data/                      # Output data directory
├── main.py                    # Main entry point
├── requirements.txt           # Python dependencies
├── pytest.ini                 # Test configuration
└── README.md                  # This file
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Disclaimer

This tool is for educational and research purposes. Users are responsible for:

- Complying with websites' terms of service
- Respecting robots.txt files
- Using the tool ethically and responsibly
- Not overwhelming servers with too many requests

The authors are not responsible for any misuse of this tool.

## Support

If you encounter any issues or have questions:

1. Check the [Issues](https://github.com/iwasforcedtobehere/market-trends-scraper/issues) page
2. Create a new issue with detailed information
3. For general questions, use the [Discussions](https://github.com/iwasforcedtobehere/market-trends-scraper/discussions) tab

## Acknowledgments

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Selenium](https://www.selenium.dev/) for browser automation
- [Pandas](https://pandas.pydata.org/) for data analysis
- [Loguru](https://github.com/Delgan/loguru) for logging