# Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

## Features

- **Multi-Source Scraping**: Collect data from multiple e-commerce websites simultaneously
- **Flexible Configuration**: Easy-to-use YAML configuration for different sources and scraping parameters
- **Dual Scraping Methods**: Supports both requests/BeautifulSoup and Selenium for dynamic content
- **Data Analysis**: Built-in analysis of pricing trends, ratings, and product availability
- **Multiple Output Formats**: Save data in CSV, JSON, or Excel formats
- **Robust Error Handling**: Comprehensive error handling and retry mechanisms
- **Professional Logging**: Detailed logging with configurable levels and outputs
- **Extensible Architecture**: Modular design for easy customization and extension

## Installation

### Prerequisites

- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
   cd market-trends-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install ChromeDriver (if using Selenium):

   ```bash
   # For Ubuntu/Debian:
   sudo apt-get install chromium-chromedriver

   # For macOS using Homebrew:
   brew install chromedriver

   # For Windows, download from: https://chromedriver.chromium.org/
   ```

## Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.

### Example Configuration

```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```

### Configuration Options

#### Scraper Settings

- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3)
- `user_agent`: User agent string for HTTP requests
- `headless`: Run browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])
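
For reference, here is a minimal sketch of how these settings might drive a plain `requests` fetch loop. The helper name `fetch_with_retries` is illustrative and not part of the project's API:

```python
import time

import requests


def fetch_with_retries(url, settings):
    """Fetch a URL while honoring the configured timeout, retries, and delay."""
    headers = {"User-Agent": settings["user_agent"]}
    for attempt in range(settings["max_retries"]):
        try:
            response = requests.get(url, headers=headers, timeout=settings["timeout"])
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == settings["max_retries"] - 1:
                raise  # out of retries; surface the error
            # reuse the politeness delay as a simple wait before retrying
            time.sleep(settings["delay_between_requests"])
```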

#### Source Configuration

- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data (see the sketch below)
- `pagination`: Pagination settings
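
As an illustration of how the `selectors` block can be applied, here is a hedged sketch using BeautifulSoup's CSS-selector API; the actual extraction logic lives in `src/scraper.py` and may differ:

```python
from bs4 import BeautifulSoup


def extract_products(html, selectors):
    """Apply the configured CSS selectors to one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(selectors["product"]):
        name = item.select_one(selectors["name"])
        price = item.select_one(selectors["price"])
        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return products
```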

#### Output Settings

- `format`: Output format ("csv", "json", or "excel")
- `include_timestamp`: Include timestamp in output filename (default: true)
- `filename`: Base filename for output files
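
For illustration, these options might combine into an output path roughly as follows (the helper name is hypothetical):

```python
from datetime import datetime


def build_output_path(output_cfg):
    """Compose an output filename from the configured format, base name, and timestamp flag."""
    extensions = {"csv": "csv", "json": "json", "excel": "xlsx"}
    name = output_cfg["filename"]
    if output_cfg["include_timestamp"]:
        name += datetime.now().strftime("_%Y%m%d_%H%M%S")
    return f"{name}.{extensions[output_cfg['format']]}"
```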

#### Database Settings

- `url`: Database connection URL in SQLAlchemy format (default: `sqlite:///data/market_trends.db`)
- `echo`: Log generated SQL statements (default: false)

#### Analysis Settings

- `price_history_days`: Number of days of price history to consider (default: 30)
- `trend_threshold`: Minimum relative price change to count as a trend (default: 0.05, i.e., 5%; see the example below)
- `generate_charts`: Generate trend charts (default: true)
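
Reading the threshold as a relative (fractional) change, the default of 0.05 means a price moving from $100.00 to $106.00 (a 6% change) counts as a trend, while a move to $103.00 does not. A minimal sketch of that check, with an illustrative helper name:

```python
def is_trend(old_price, new_price, threshold=0.05):
    """Return True if the relative price change meets the trend threshold."""
    return abs(new_price - old_price) / old_price >= threshold


assert is_trend(100.00, 106.00) is True   # 6% change: a trend
assert is_trend(100.00, 103.00) is False  # 3% change: not a trend
```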

## Usage

### Command Line Interface

Run the scraper with default settings:

```bash
python main.py
```

Specify a custom configuration file:

```bash
python main.py --config path/to/config.yaml
```

Specify an output file:

```bash
python main.py --output path/to/output.csv
```

Run in verbose mode:

```bash
python main.py --verbose
```

Run the browser in non-headless mode (for debugging):

```bash
python main.py --no-headless
```

### Python API

```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Set up logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize the scraper as a context manager so resources are cleaned up
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")
```

## Output

### Data Output

The scraper produces structured data with the following fields:

| Field | Description |
|-------|-------------|
| name | Product name |
| price | Product price (as float) |
| rating | Product rating (as float) |
| availability | Product availability status |
| url | Product URL |
| source | Data source name |
| scraped_at | Timestamp when the data was scraped |
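
Because the default output format is CSV, results load directly into pandas for ad-hoc analysis. A usage sketch (the path below is an assumption; use whatever output file you configured):

```python
import pandas as pd

# Load the scraped data; the path is illustrative
df = pd.read_csv("data/market_trends_data.csv", parse_dates=["scraped_at"])

# Example: average price per source
print(df.groupby("source")["price"].mean())
```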

### Analysis Output

The trend analysis includes:

- **Summary Statistics**: Total products and distribution across sources
- **Price Analysis**: Average, minimum, maximum, and median prices, plus the price distribution
- **Rating Analysis**: Average, minimum, and maximum ratings, plus the rating distribution
- **Availability Analysis**: Count of products by availability status
- **Price Trends by Source**: Comparative analysis across sources

Example analysis output:

```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```

## Testing

Run all tests:

```bash
pytest
```

Run unit tests only:

```bash
pytest -m unit
```

Run integration tests only:

```bash
pytest -m integration
```

Run tests with coverage report:

```bash
pytest --cov=src --cov-report=html
```
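
Filtering with `pytest -m` assumes the `unit` and `integration` markers are registered in `pytest.ini`; the registration presumably looks something like this (a sketch, not the verbatim file):

```ini
[pytest]
markers =
    unit: fast, isolated unit tests
    integration: slower tests that exercise multiple components
```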

## Project Structure

```
market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py      # Configuration management
│   ├── logger.py              # Logging utilities
│   └── scraper.py             # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml            # Configuration file
├── data/                      # Output data directory
├── main.py                    # Main entry point
├── requirements.txt           # Python dependencies
├── pytest.ini                 # Test configuration
└── README.md                  # This file
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Disclaimer

This tool is for educational and research purposes. Users are responsible for:

- Complying with websites' terms of service
- Respecting robots.txt files (see the sketch below)
- Using the tool ethically and responsibly
- Not overwhelming servers with too many requests

The authors are not responsible for any misuse of this tool.
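
As a minimal illustration of the robots.txt point, Python's standard library can check whether a URL may be fetched before you scrape it (a sketch; the domain is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt (example.com is a placeholder)
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Check whether our user agent may fetch a given URL
if parser.can_fetch("MarketTrendsScraper", "https://example.com/search"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")
```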

## Support

If you encounter any issues or have questions:

1. Check the [Issues](https://github.com/iwasforcedtobehere/market-trends-scraper/issues) page
2. Create a new issue with detailed information
3. For general questions, use the [Discussions](https://github.com/iwasforcedtobehere/market-trends-scraper/discussions) tab

## Acknowledgments

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Selenium](https://www.selenium.dev/) for browser automation
- [Pandas](https://pandas.pydata.org/) for data analysis
- [Loguru](https://github.com/Delgan/loguru) for logging