# Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

## Features

- **Multi-Source Scraping**: Collect data from multiple e-commerce websites simultaneously
- **Flexible Configuration**: Easy-to-use YAML configuration for different sources and scraping parameters
- **Dual Scraping Methods**: Supports both requests/BeautifulSoup and Selenium for dynamic content
- **Data Analysis**: Built-in analysis of pricing trends, ratings, and product availability
- **Multiple Output Formats**: Save data in CSV, JSON, or Excel formats
- **Robust Error Handling**: Comprehensive error handling and retry mechanisms
- **Professional Logging**: Detailed logging with configurable levels and outputs
- **Extensible Architecture**: Modular design for easy customization and extension

## Installation

### Prerequisites

- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
   cd market-trends-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install ChromeDriver (if using Selenium):

   ```bash
   # For Ubuntu/Debian:
   sudo apt-get install chromium-chromedriver

   # For macOS using Homebrew:
   brew install chromedriver

   # For Windows, download from: https://chromedriver.chromium.org/
   ```

## Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.

### Example Configuration

```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```

### Configuration Options

#### Scraper Settings

- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3)
- `user_agent`: User agent string for HTTP requests
- `headless`: Run browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])
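
For reference, here is a minimal sketch of how these settings might drive a plain `requests` fetch loop. The helper name `fetch_with_retries` is illustrative and not part of the project's API:

```python
import time

import requests


def fetch_with_retries(url, settings):
    """Fetch a URL while honoring the configured timeout, retries, and delay."""
    headers = {"User-Agent": settings["user_agent"]}
    for attempt in range(settings["max_retries"]):
        try:
            response = requests.get(url, headers=headers, timeout=settings["timeout"])
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == settings["max_retries"] - 1:
                raise  # out of retries; surface the error
            # reuse the politeness delay as a simple wait before retrying
            time.sleep(settings["delay_between_requests"])
```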

#### Source Configuration

- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data (see the sketch below)
- `pagination`: Pagination settings
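
As an illustration of how the `selectors` block can be applied, here is a hedged sketch using BeautifulSoup's CSS-selector API; the actual extraction logic lives in `src/scraper.py` and may differ:

```python
from bs4 import BeautifulSoup


def extract_products(html, selectors):
    """Apply the configured CSS selectors to one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(selectors["product"]):
        name = item.select_one(selectors["name"])
        price = item.select_one(selectors["price"])
        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return products
```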

#### Output Settings

- `format`: Output format ("csv", "json", or "excel")
- `include_timestamp`: Include timestamp in output filename (default: true)
- `filename`: Base filename for output files
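
For illustration, these options might combine into an output path roughly as follows (the helper name is hypothetical):

```python
from datetime import datetime


def build_output_path(output_cfg):
    """Compose an output filename from the configured format, base name, and timestamp flag."""
    extensions = {"csv": "csv", "json": "json", "excel": "xlsx"}
    name = output_cfg["filename"]
    if output_cfg["include_timestamp"]:
        name += datetime.now().strftime("_%Y%m%d_%H%M%S")
    return f"{name}.{extensions[output_cfg['format']]}"
```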

#### Database Settings

- `url`: Database connection URL in SQLAlchemy format (default: `sqlite:///data/market_trends.db`)
- `echo`: Log generated SQL statements (default: false)

#### Analysis Settings

- `price_history_days`: Number of days of price history to consider (default: 30)
- `trend_threshold`: Minimum relative price change to count as a trend (default: 0.05, i.e., 5%; see the example below)
- `generate_charts`: Generate trend charts (default: true)
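
Reading the threshold as a relative (fractional) change, the default of 0.05 means a price moving from $100.00 to $106.00 (a 6% change) counts as a trend, while a move to $103.00 does not. A minimal sketch of that check, with an illustrative helper name:

```python
def is_trend(old_price, new_price, threshold=0.05):
    """Return True if the relative price change meets the trend threshold."""
    return abs(new_price - old_price) / old_price >= threshold


assert is_trend(100.00, 106.00) is True   # 6% change: a trend
assert is_trend(100.00, 103.00) is False  # 3% change: not a trend
```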

## Usage

### Command Line Interface

Run the scraper with default settings:

```bash
python main.py
```

Specify a custom configuration file:

```bash
python main.py --config path/to/config.yaml
```

Specify an output file:

```bash
python main.py --output path/to/output.csv
```

Run in verbose mode:

```bash
python main.py --verbose
```

Run the browser in non-headless mode (for debugging):

```bash
python main.py --no-headless
```

### Python API

```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Set up logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize the scraper as a context manager so resources are cleaned up
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")
```

## Output

### Data Output

The scraper produces structured data with the following fields:

| Field | Description |
|-------|-------------|
| name | Product name |
| price | Product price (as float) |
| rating | Product rating (as float) |
| availability | Product availability status |
| url | Product URL |
| source | Data source name |
| scraped_at | Timestamp when the data was scraped |
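
Because the default output format is CSV, results load directly into pandas for ad-hoc analysis. A usage sketch (the path below is an assumption; use whatever output file you configured):

```python
import pandas as pd

# Load the scraped data; the path is illustrative
df = pd.read_csv("data/market_trends_data.csv", parse_dates=["scraped_at"])

# Example: average price per source
print(df.groupby("source")["price"].mean())
```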

### Analysis Output

The trend analysis includes:

- **Summary Statistics**: Total products and distribution across sources
- **Price Analysis**: Average, minimum, maximum, and median prices, plus the price distribution
- **Rating Analysis**: Average, minimum, and maximum ratings, plus the rating distribution
- **Availability Analysis**: Count of products by availability status
- **Price Trends by Source**: Comparative analysis across sources

Example analysis output:

```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```

## Testing

Run all tests:

```bash
pytest
```

Run unit tests only:

```bash
pytest -m unit
```

Run integration tests only:

```bash
pytest -m integration
```

Run tests with coverage report:

```bash
pytest --cov=src --cov-report=html
```
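
Filtering with `pytest -m` assumes the `unit` and `integration` markers are registered in `pytest.ini`; the registration presumably looks something like this (a sketch, not the verbatim file):

```ini
[pytest]
markers =
    unit: fast, isolated unit tests
    integration: slower tests that exercise multiple components
```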

## Project Structure

```
market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py      # Configuration management
│   ├── logger.py              # Logging utilities
│   └── scraper.py             # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml            # Configuration file
├── data/                      # Output data directory
├── main.py                    # Main entry point
├── requirements.txt           # Python dependencies
├── pytest.ini                 # Test configuration
└── README.md                  # This file
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Disclaimer

This tool is for educational and research purposes. Users are responsible for:

- Complying with websites' terms of service
- Respecting robots.txt files (see the sketch below)
- Using the tool ethically and responsibly
- Not overwhelming servers with too many requests

The authors are not responsible for any misuse of this tool.
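
As a minimal illustration of the robots.txt point, Python's standard library can check whether a URL may be fetched before you scrape it (a sketch; the domain is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt (example.com is a placeholder)
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Check whether our user agent may fetch a given URL
if parser.can_fetch("MarketTrendsScraper", "https://example.com/search"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")
```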

## Support

If you encounter any issues or have questions:

1. Check the [Issues](https://github.com/iwasforcedtobehere/market-trends-scraper/issues) page
2. Create a new issue with detailed information
3. For general questions, use the [Discussions](https://github.com/iwasforcedtobehere/market-trends-scraper/discussions) tab

## Acknowledgments

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Selenium](https://www.selenium.dev/) for browser automation
- [Pandas](https://pandas.pydata.org/) for data analysis
- [Loguru](https://github.com/Delgan/loguru) for logging