# Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

## Features

- **Multi-Source Scraping**: Collect data from multiple e-commerce websites simultaneously
- **Flexible Configuration**: Easy-to-use YAML configuration for different sources and scraping parameters
- **Dual Scraping Methods**: Supports both requests/BeautifulSoup and Selenium for dynamic content
- **Data Analysis**: Built-in analysis of pricing trends, ratings, and product availability
- **Multiple Output Formats**: Save data in CSV, JSON, or Excel format
- **Robust Error Handling**: Comprehensive error handling and retry mechanisms
- **Professional Logging**: Detailed logging with configurable levels and outputs
- **Extensible Architecture**: Modular design for easy customization and extension

## Installation

### Prerequisites

- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)

### Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
   cd market-trends-scraper
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Install ChromeDriver (if using Selenium):
   ```bash
   # For Ubuntu/Debian:
   sudo apt-get install chromium-chromedriver

   # For macOS using Homebrew:
   brew install chromedriver

   # For Windows, download from: https://chromedriver.chromium.org/
   ```

## Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.

### Example Configuration

```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```

### Configuration Options

#### Scraper Settings

- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3)
- `user_agent`: User agent string for HTTP requests
- `headless`: Run browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])

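
To illustrate how settings like these typically drive the HTTP layer, here is a minimal sketch using `requests` with urllib3's `Retry`. The helper names `build_session` and `fetch` are hypothetical, not part of this project's API:

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical helpers, for illustration only; the project's real
# implementation lives in src/scraper.py.
def build_session(settings):
    session = requests.Session()
    session.headers["User-Agent"] = settings["user_agent"]
    retry = Retry(
        total=settings["max_retries"],               # max_retries
        backoff_factor=1.0,                          # back off between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

def fetch(session, url, settings):
    time.sleep(settings["delay_between_requests"])   # polite delay
    response = session.get(url, timeout=settings["timeout"])
    response.raise_for_status()
    return response.text
```
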
#### Source Configuration

- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data (see the sketch below)
- `pagination`: Pagination settings

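
A minimal sketch of how a `selectors` mapping like the one in the example configuration might drive extraction with BeautifulSoup; `parse_products` is a hypothetical helper, not this project's API:

```python
from bs4 import BeautifulSoup

# Hypothetical illustration of the "selectors" block in action; the
# project's actual extraction logic lives in src/scraper.py.
def parse_products(html, selectors):
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(selectors["product"]):
        def text_of(key):
            element = item.select_one(selectors[key])
            return element.get_text(strip=True) if element else None
        products.append({
            "name": text_of("name"),
            "price": text_of("price"),
            "rating": text_of("rating"),
            "availability": text_of("availability"),
        })
    return products
```
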
#### Output Settings

- `format`: Output format ("csv", "json", or "excel")
- `include_timestamp`: Include timestamp in output filename (default: true)
- `filename`: Base filename for output files

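
All three formats can be produced from the same records via pandas; a rough sketch of the dispatch (the `write_output` helper is hypothetical, distinct from the `scraper.save_data` method shown later):

```python
from datetime import datetime

import pandas as pd

# Hypothetical sketch of the output settings in action.
def write_output(records, output_cfg):
    df = pd.DataFrame(records)
    name = output_cfg["filename"]
    if output_cfg.get("include_timestamp", True):
        name += datetime.now().strftime("_%Y%m%d_%H%M%S")
    fmt = output_cfg["format"]
    if fmt == "csv":
        df.to_csv(f"{name}.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{name}.json", orient="records", indent=2)
    elif fmt == "excel":
        df.to_excel(f"{name}.xlsx", index=False)  # needs openpyxl installed
    else:
        raise ValueError(f"Unknown output format: {fmt}")
```
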
#### Analysis Settings

- `price_history_days`: Number of days of price history to consider (default: 30)
- `trend_threshold`: Minimum relative price change to count as a trend, expressed as a fraction (default: 0.05, i.e. 5%)
- `generate_charts`: Generate trend charts (default: true)

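
For example, with `trend_threshold: 0.05`, a price that moves at least 5% over the lookback window counts as a trend. A sketch of that interpretation (`classify_trend` is a hypothetical helper, not this project's API):

```python
# Hypothetical illustration of trend_threshold: a relative change
# smaller than the threshold in either direction is "stable".
def classify_trend(old_price, new_price, threshold=0.05):
    if old_price <= 0:
        return "unknown"
    change = (new_price - old_price) / old_price
    if change >= threshold:
        return "rising"
    if change <= -threshold:
        return "falling"
    return "stable"

# e.g. classify_trend(40.00, 45.00) -> "rising" (a +12.5% change)
```
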
## Usage

### Command Line Interface

Run the scraper with default settings:
```bash
python main.py
```

Specify a custom configuration file:
```bash
python main.py --config path/to/config.yaml
```

Specify an output file:
```bash
python main.py --output path/to/output.csv
```

Run in verbose mode:
```bash
python main.py --verbose
```

Run the browser in non-headless mode (for debugging):
```bash
python main.py --no-headless
```

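
These flags could be wired up with `argparse` roughly as follows; this is a sketch of the CLI surface described above, not the project's actual `main.py`:

```python
import argparse

# Illustrative parser for the flags documented above.
parser = argparse.ArgumentParser(description="Market Trends Scraper")
parser.add_argument("--config", default="config/config.yaml",
                    help="Path to the YAML configuration file")
parser.add_argument("--output", default=None,
                    help="Path for the output file")
parser.add_argument("--verbose", action="store_true",
                    help="Enable verbose logging")
parser.add_argument("--no-headless", dest="headless", action="store_false",
                    help="Run the browser with a visible window")
args = parser.parse_args()
```
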
### Python API

```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Set up logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize the scraper as a context manager
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")
```

## Output

### Data Output

The scraper produces structured data with the following fields:

| Field | Description |
|-------|-------------|
| name | Product name |
| price | Product price (as float) |
| rating | Product rating (as float) |
| availability | Product availability status |
| url | Product URL |
| source | Data source name |
| scraped_at | Timestamp when data was scraped |

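
In Python terms, a single scraped record therefore looks roughly like this (all values are illustrative, not real output):

```python
# Illustrative record shape; values are made up.
record = {
    "name": "Wireless Mouse",
    "price": 24.99,
    "rating": 4.5,
    "availability": "In Stock",
    "url": "https://example-ecommerce.com/products/wireless-mouse",
    "source": "example_ecommerce",
    "scraped_at": "2024-01-15T10:30:00",
}
```
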
### Analysis Output

The trend analysis includes:

- **Summary Statistics**: Total products, source distribution
- **Price Analysis**: Average, min, max, median prices and distribution
- **Rating Analysis**: Average, min, max ratings and distribution
- **Availability Analysis**: Count of products by availability status
- **Price Trends by Source**: Comparative analysis across sources

Example analysis output:
```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```

## Testing

Run all tests:
```bash
pytest
```

Run unit tests only:
```bash
pytest -m unit
```

Run integration tests only:
```bash
pytest -m integration
```

Run tests with coverage report:
```bash
pytest --cov=src --cov-report=html
```

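
The `-m unit` and `-m integration` selections work because tests carry `pytest.mark` markers. Illustrative tests, not taken from the suite, might look like this:

```python
import pytest

# Illustrative only: shows how markers let `pytest -m unit` and
# `pytest -m integration` select subsets of the suite.
@pytest.mark.unit
def test_price_string_parses_to_float():
    assert float("$24.99".lstrip("$")) == 24.99

@pytest.mark.integration
def test_full_scrape_pipeline():
    pytest.skip("placeholder: would exercise the scraper end to end")
```
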
## Project Structure

```
market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py      # Configuration management
│   ├── logger.py              # Logging utilities
│   └── scraper.py             # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml            # Configuration file
├── data/                      # Output data directory
├── main.py                    # Main entry point
├── requirements.txt           # Python dependencies
├── pytest.ini                 # Test configuration
└── README.md                  # This file
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Disclaimer

This tool is for educational and research purposes. Users are responsible for:

- Complying with websites' terms of service
- Respecting robots.txt files (see the sketch below)
- Using the tool ethically and responsibly
- Not overwhelming servers with too many requests

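
For instance, robots.txt compliance can be checked with nothing but the standard library before fetching any URL. A minimal sketch, where the user-agent string is a placeholder rather than anything this project sends:

```python
from urllib.robotparser import RobotFileParser

# Minimal robots.txt check using only the standard library.
# "MyScraperBot/1.0" is a placeholder user agent.
robots = RobotFileParser()
robots.set_url("https://example-ecommerce.com/robots.txt")
robots.read()

url = "https://example-ecommerce.com/search?q=mouse"
if robots.can_fetch("MyScraperBot/1.0", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)
```
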
The authors are not responsible for any misuse of this tool.

## Support

If you encounter any issues or have questions:

1. Check the [Issues](https://github.com/iwasforcedtobehere/market-trends-scraper/issues) page
2. Create a new issue with detailed information
3. For general questions, use the [Discussions](https://github.com/iwasforcedtobehere/market-trends-scraper/discussions) tab

## Acknowledgments

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Selenium](https://www.selenium.dev/) for browser automation
- [Pandas](https://pandas.pydata.org/) for data analysis
- [Loguru](https://github.com/Delgan/loguru) for logging