Initial commit: Market Trends Scraper

This commit is contained in:
Dev
2025-09-11 17:46:14 +03:00
commit 4ddcde68d4
17 changed files with 3049 additions and 0 deletions

README.md

@@ -0,0 +1,340 @@
# Market Trends Scraper
A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.
## Features
- **Multi-Source Scraping**: Collect data from multiple e-commerce websites simultaneously
- **Flexible Configuration**: Easy-to-use YAML configuration for different sources and scraping parameters
- **Dual Scraping Methods**: requests/BeautifulSoup for static pages, Selenium for JavaScript-rendered dynamic content
- **Data Analysis**: Built-in analysis of pricing trends, ratings, and product availability
- **Multiple Output Formats**: Save data in CSV, JSON, or Excel formats
- **Robust Error Handling**: Comprehensive retry and failure-handling mechanisms
- **Professional Logging**: Detailed logging with configurable levels and outputs
- **Extensible Architecture**: Modular design for easy customization and extension
## Installation
### Prerequisites
- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)
### Setup
1. Clone the repository:
```bash
git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
cd market-trends-scraper
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Install ChromeDriver (if using Selenium):
```bash
# For Ubuntu/Debian:
sudo apt-get install chromium-chromedriver
# For macOS using Homebrew:
brew install chromedriver
# For Windows, download from: https://chromedriver.chromium.org/
```
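To confirm that Chrome and ChromeDriver are wired up correctly, a quick headless smoke test can help. This is a minimal sketch assuming Selenium 4+, which can also locate a compatible driver automatically via Selenium Manager:
```python
# Quick sanity check that headless Chrome starts (illustrative, not part of the package)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # headless mode, as in the default config
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    print("Headless Chrome OK, page title:", driver.title)
finally:
    driver.quit()
```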
## Configuration
The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.
### Example Configuration
```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```
### Configuration Options
#### Scraper Settings
- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3); see the sketch after this list
- `user_agent`: User agent string for HTTP requests
- `headless`: Run browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])
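Taken together, these settings drive a polite fetch loop: wait between requests, time out slow ones, and retry failures up to a limit. A minimal sketch of such a loop (illustrative only; `fetch_with_retries` is not part of the package API):
```python
import time
import requests

def fetch_with_retries(url, delay=1.0, timeout=30, max_retries=3,
                       user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"):
    """Illustrative fetch loop honoring the scraper settings above."""
    headers = {"User-Agent": user_agent}
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise  # give up after the configured number of attempts
        time.sleep(delay)  # delay_between_requests before the next attempt
```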
#### Source Configuration
- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data; see the parsing sketch after this list
- `pagination`: Pagination settings
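The selectors map directly onto CSS queries. A hedged sketch of how one page of product cards might be parsed with BeautifulSoup, using the selector names from the example configuration (the actual parsing code in `src/scraper.py` may differ):
```python
from bs4 import BeautifulSoup

SELECTORS = {
    "product": "div.product-item",
    "name": "h2.product-title",
    "price": "span.price",
    "rating": "div.rating",
    "availability": "div.stock-status",
}

def parse_products(html, selectors=SELECTORS):
    """Extract one record per product card using the configured CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select(selectors["product"]):
        record = {}
        for field in ("name", "price", "rating", "availability"):
            node = card.select_one(selectors[field])
            record[field] = node.get_text(strip=True) if node else None
        records.append(record)
    return records
```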
#### Output Settings
- `format`: Output format ("csv", "json", or "excel"); see the save sketch after this list
- `include_timestamp`: Include timestamp in output filename (default: true)
- `filename`: Base filename for output files
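A sketch of how these options could translate into a save step with pandas (an assumption about the implementation; the real `save_data` may differ, and Excel output needs `openpyxl`):
```python
from datetime import datetime
import pandas as pd

def save_records(records, fmt="csv", filename="market_trends_data",
                 include_timestamp=True):
    """Write records in the configured format, optionally timestamping the name."""
    df = pd.DataFrame(records)
    if include_timestamp:
        filename = f"{filename}_{datetime.now():%Y%m%d_%H%M%S}"
    if fmt == "csv":
        df.to_csv(f"{filename}.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{filename}.json", orient="records", indent=2)
    elif fmt == "excel":
        df.to_excel(f"{filename}.xlsx", index=False)
    else:
        raise ValueError(f"Unsupported output format: {fmt}")
```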
#### Analysis Settings
- `price_history_days`: Number of days to consider for price history (default: 30)
- `trend_threshold`: Minimum price change, as a fraction, to count as a trend (default: 0.05, i.e. 5%); see the sketch after this list
- `generate_charts`: Generate trend charts (default: true)
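With `trend_threshold: 0.05`, a price must move at least 5% over the history window to be flagged. A minimal sketch of that check (illustrative; not the package's actual trend logic):
```python
def classify_trend(old_price, new_price, threshold=0.05):
    """Classify a price move relative to the configured trend threshold."""
    change = (new_price - old_price) / old_price
    if change >= threshold:
        return "rising"
    if change <= -threshold:
        return "falling"
    return "stable"

# 45.00 -> 49.99 is a ~11% increase, comfortably above the 5% threshold:
print(classify_trend(45.00, 49.99))  # rising
```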
## Usage
### Command Line Interface
Run the scraper with default settings:
```bash
python main.py
```
Specify a custom configuration file:
```bash
python main.py --config path/to/config.yaml
```
Specify output file:
```bash
python main.py --output path/to/output.csv
```
Run in verbose mode:
```bash
python main.py --verbose
```
Run browser in non-headless mode (for debugging):
```bash
python main.py --no-headless
```
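These flags map naturally onto a standard `argparse` parser. A sketch of how `main.py` might define them (an assumption; the actual entry point may differ):
```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Market Trends Scraper")
    parser.add_argument("--config", default="config/config.yaml",
                        help="Path to the YAML configuration file")
    parser.add_argument("--output", help="Path for the output data file")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose logging")
    parser.add_argument("--no-headless", action="store_true",
                        help="Show the browser window (useful for debugging)")
    return parser

args = build_parser().parse_args()
```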
### Python API
```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Setup logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize scraper
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")
```
## Output
### Data Output
The scraper produces structured data with the following fields:
| Field | Description |
|-------|-------------|
| name | Product name |
| price | Product price (as float; see the parsing sketch below) |
| rating | Product rating (as float) |
| availability | Product availability status |
| url | Product URL |
| source | Data source name |
| scraped_at | Timestamp when data was scraped |
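Prices usually arrive from the page as strings like `"$1,299.99"`, so they are normalized to floats before output. A small sketch of such a cleanup step (illustrative; the field cleaning in `src/scraper.py` may differ):
```python
import re
from typing import Optional

def parse_price(raw: str) -> Optional[float]:
    """Strip currency symbols and thousands separators, e.g. "$1,299.99" -> 1299.99."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw or "")
    return float(match.group().replace(",", "")) if match else None

print(parse_price("$1,299.99"))  # 1299.99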
### Analysis Output
The trend analysis includes:
- **Summary Statistics**: Total products, source distribution
- **Price Analysis**: Average, min, max, median prices and distribution
- **Rating Analysis**: Average, min, max ratings and distribution
- **Availability Analysis**: Count of products by availability status
- **Price Trends by Source**: Comparative analysis across sources
Example analysis output:
```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```
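The `price_distribution` block mirrors what pandas' `describe()` returns for a numeric column. A sketch of how the price section could be computed (an assumed approach; the package's `analyze_trends` may differ):
```python
import pandas as pd

def summarize_prices(df):
    """Shape the price summary like the example output above."""
    prices = df["price"].dropna()
    return {
        "average_price": round(float(prices.mean()), 2),
        "min_price": float(prices.min()),
        "max_price": float(prices.max()),
        "median_price": float(prices.median()),
        "price_distribution": prices.describe().round(2).to_dict(),
    }
```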
## Testing
Run all tests:
```bash
pytest
```
Run unit tests only:
```bash
pytest -m unit
```
Run integration tests only:
```bash
pytest -m integration
```
Run tests with coverage report:
```bash
pytest --cov=src --cov-report=html
```
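The `unit` and `integration` selections above rely on pytest markers, which are typically registered in `pytest.ini` and applied per test with a decorator. A minimal sketch (assumed marker setup; see `pytest.ini` in the repo for the actual registration):
```python
import pytest

@pytest.mark.unit
def test_small_isolated_behavior():
    # selected by `pytest -m unit`
    assert 1 + 1 == 2

@pytest.mark.integration
def test_end_to_end_flow():
    # selected by `pytest -m integration`
    assert True
```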
## Project Structure
```
market-trends-scraper/
├── src/
│ ├── __init__.py
│ ├── config_manager.py # Configuration management
│ ├── logger.py # Logging utilities
│ └── scraper.py # Main scraper implementation
├── tests/
│ ├── __init__.py
│ ├── test_config_manager.py
│ ├── test_logger.py
│ ├── test_scraper.py
│ └── test_integration.py
├── config/
│ └── config.yaml # Configuration file
├── data/ # Output data directory
├── main.py # Main entry point
├── requirements.txt # Python dependencies
├── pytest.ini # Test configuration
└── README.md # This file
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Guidelines
- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Disclaimer
This tool is for educational and research purposes. Users are responsible for:
- Complying with websites' terms of service
- Respecting robots.txt files
- Using the tool ethically and responsibly
- Not overwhelming servers with too many requests
The authors are not responsible for any misuse of this tool.
## Support
If you encounter any issues or have questions:
1. Check the [Issues](https://github.com/iwasforcedtobehere/market-trends-scraper/issues) page
2. Create a new issue with detailed information
3. For general questions, use the [Discussions](https://github.com/iwasforcedtobehere/market-trends-scraper/discussions) tab
## Acknowledgments
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Selenium](https://www.selenium.dev/) for browser automation
- [Pandas](https://pandas.pydata.org/) for data analysis
- [Loguru](https://github.com/Delgan/loguru) for logging