# Market Trends Scraper

A powerful and flexible Python web scraper for collecting and analyzing pricing and product trends from e-commerce websites. This tool provides comprehensive market insights by extracting product data, tracking price changes, and generating detailed trend analyses.

## Features

- **Multi-Source Scraping**: Collect data from multiple e-commerce websites simultaneously
- **Flexible Configuration**: Easy-to-use YAML configuration for different sources and scraping parameters
- **Dual Scraping Methods**: Supports both requests/BeautifulSoup and Selenium for dynamic content
- **Data Analysis**: Built-in analysis of pricing trends, ratings, and product availability
- **Multiple Output Formats**: Save data in CSV, JSON, or Excel format
- **Robust Error Handling**: Comprehensive error handling and retry mechanisms
- **Professional Logging**: Detailed logging with configurable levels and outputs
- **Extensible Architecture**: Modular design for easy customization and extension

## Installation

### Prerequisites

- Python 3.8 or higher
- Chrome browser (for Selenium functionality)
- ChromeDriver (compatible with your Chrome version)

### Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/iwasforcedtobehere/market-trends-scraper.git
   cd market-trends-scraper
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Install ChromeDriver (if using Selenium):
   ```bash
   # For Ubuntu/Debian:
   sudo apt-get install chromium-chromedriver

   # For macOS using Homebrew:
   brew install chromedriver

   # For Windows, download from: https://chromedriver.chromium.org/
   ```

## Configuration

The scraper uses a YAML configuration file to define scraping sources and parameters. A default configuration will be created automatically at `config/config.yaml` when you first run the scraper.

### Example Configuration

```yaml
scraper:
  delay_between_requests: 1.0
  timeout: 30
  max_retries: 3
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  headless: true
  window_size: [1920, 1080]

sources:
  - name: "example_ecommerce"
    url: "https://example-ecommerce.com/search"
    type: "ecommerce"
    enabled: true
    use_selenium: false
    selectors:
      product: "div.product-item"
      name: "h2.product-title"
      price: "span.price"
      rating: "div.rating"
      availability: "div.stock-status"
    pagination:
      next_page: "a.next-page"
      max_pages: 10

output:
  format: "csv"
  include_timestamp: true
  filename: "market_trends_data"

database:
  url: "sqlite:///data/market_trends.db"
  echo: false

analysis:
  price_history_days: 30
  trend_threshold: 0.05
  generate_charts: true
```

### Configuration Options

#### Scraper Settings

- `delay_between_requests`: Delay between requests in seconds (default: 1.0)
- `timeout`: Request timeout in seconds (default: 30)
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3)
- `user_agent`: User agent string for HTTP requests
- `headless`: Run browser in headless mode (default: true)
- `window_size`: Browser window size as [width, height] (default: [1920, 1080])

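
To illustrate how settings like these typically drive the HTTP layer, here is a minimal sketch using `requests` with urllib3's `Retry`. The helper names `build_session` and `fetch` are hypothetical, not part of this project's API:

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical helpers, for illustration only; the project's real
# implementation lives in src/scraper.py.
def build_session(settings):
    session = requests.Session()
    session.headers["User-Agent"] = settings["user_agent"]
    retry = Retry(
        total=settings["max_retries"],               # max_retries
        backoff_factor=1.0,                          # back off between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

def fetch(session, url, settings):
    time.sleep(settings["delay_between_requests"])   # polite delay
    response = session.get(url, timeout=settings["timeout"])
    response.raise_for_status()
    return response.text
```
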
#### Source Configuration

- `name`: Unique identifier for the data source
- `url`: Base URL for scraping
- `type`: Type of website (e.g., "ecommerce")
- `enabled`: Whether to scrape this source (default: true)
- `use_selenium`: Use Selenium instead of requests (default: false)
- `selectors`: CSS selectors for extracting data (see the sketch below)
- `pagination`: Pagination settings

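
A minimal sketch of how a `selectors` mapping like the one in the example configuration might drive extraction with BeautifulSoup; `parse_products` is a hypothetical helper, not this project's API:

```python
from bs4 import BeautifulSoup

# Hypothetical illustration of the "selectors" block in action; the
# project's actual extraction logic lives in src/scraper.py.
def parse_products(html, selectors):
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(selectors["product"]):
        def text_of(key):
            element = item.select_one(selectors[key])
            return element.get_text(strip=True) if element else None
        products.append({
            "name": text_of("name"),
            "price": text_of("price"),
            "rating": text_of("rating"),
            "availability": text_of("availability"),
        })
    return products
```
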
#### Output Settings

- `format`: Output format ("csv", "json", or "excel")
- `include_timestamp`: Include timestamp in output filename (default: true)
- `filename`: Base filename for output files

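
All three formats can be produced from the same records via pandas; a rough sketch of the dispatch (the `write_output` helper is hypothetical, distinct from the `scraper.save_data` method shown later):

```python
from datetime import datetime

import pandas as pd

# Hypothetical sketch of the output settings in action.
def write_output(records, output_cfg):
    df = pd.DataFrame(records)
    name = output_cfg["filename"]
    if output_cfg.get("include_timestamp", True):
        name += datetime.now().strftime("_%Y%m%d_%H%M%S")
    fmt = output_cfg["format"]
    if fmt == "csv":
        df.to_csv(f"{name}.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{name}.json", orient="records", indent=2)
    elif fmt == "excel":
        df.to_excel(f"{name}.xlsx", index=False)  # needs openpyxl installed
    else:
        raise ValueError(f"Unknown output format: {fmt}")
```
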
#### Analysis Settings

- `price_history_days`: Number of days of price history to consider (default: 30)
- `trend_threshold`: Minimum relative price change to count as a trend, expressed as a fraction (default: 0.05, i.e. 5%)
- `generate_charts`: Generate trend charts (default: true)

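
For example, with `trend_threshold: 0.05`, a price that moves at least 5% over the lookback window counts as a trend. A sketch of that interpretation (`classify_trend` is a hypothetical helper, not this project's API):

```python
# Hypothetical illustration of trend_threshold: a relative change
# smaller than the threshold in either direction is "stable".
def classify_trend(old_price, new_price, threshold=0.05):
    if old_price <= 0:
        return "unknown"
    change = (new_price - old_price) / old_price
    if change >= threshold:
        return "rising"
    if change <= -threshold:
        return "falling"
    return "stable"

# e.g. classify_trend(40.00, 45.00) -> "rising" (a +12.5% change)
```
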
## Usage

### Command Line Interface

Run the scraper with default settings:
```bash
python main.py
```

Specify a custom configuration file:
```bash
python main.py --config path/to/config.yaml
```

Specify an output file:
```bash
python main.py --output path/to/output.csv
```

Run in verbose mode:
```bash
python main.py --verbose
```

Run the browser in non-headless mode (for debugging):
```bash
python main.py --no-headless
```

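
These flags could be wired up with `argparse` roughly as follows; this is a sketch of the CLI surface described above, not the project's actual `main.py`:

```python
import argparse

# Illustrative parser for the flags documented above.
parser = argparse.ArgumentParser(description="Market Trends Scraper")
parser.add_argument("--config", default="config/config.yaml",
                    help="Path to the YAML configuration file")
parser.add_argument("--output", default=None,
                    help="Path for the output file")
parser.add_argument("--verbose", action="store_true",
                    help="Enable verbose logging")
parser.add_argument("--no-headless", dest="headless", action="store_false",
                    help="Run the browser with a visible window")
args = parser.parse_args()
```
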
### Python API

```python
from src.config_manager import ConfigManager
from src.scraper import MarketTrendsScraper
from src.logger import setup_logger

# Set up logging
setup_logger(verbose=True)

# Load configuration
config_manager = ConfigManager("config/config.yaml")
config = config_manager.load_config()

# Initialize the scraper as a context manager
with MarketTrendsScraper(config, headless=True) as scraper:
    # Scrape data
    data = scraper.scrape_market_trends()

    # Save data
    scraper.save_data(data, "output.csv")

    # Analyze trends
    analysis = scraper.analyze_trends(data)

    # Save analysis
    scraper.save_analysis(analysis, "analysis.json")
```

## Output

### Data Output

The scraper produces structured data with the following fields:

| Field | Description |
|-------|-------------|
| name | Product name |
| price | Product price (as float) |
| rating | Product rating (as float) |
| availability | Product availability status |
| url | Product URL |
| source | Data source name |
| scraped_at | Timestamp when data was scraped |

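
In Python terms, a single scraped record therefore looks roughly like this (all values are illustrative, not real output):

```python
# Illustrative record shape; values are made up.
record = {
    "name": "Wireless Mouse",
    "price": 24.99,
    "rating": 4.5,
    "availability": "In Stock",
    "url": "https://example-ecommerce.com/products/wireless-mouse",
    "source": "example_ecommerce",
    "scraped_at": "2024-01-15T10:30:00",
}
```
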
### Analysis Output

The trend analysis includes:

- **Summary Statistics**: Total products, source distribution
- **Price Analysis**: Average, min, max, median prices and distribution
- **Rating Analysis**: Average, min, max ratings and distribution
- **Availability Analysis**: Count of products by availability status
- **Price Trends by Source**: Comparative analysis across sources

Example analysis output:
```json
{
  "total_products": 150,
  "sources": {
    "example_ecommerce": 100,
    "another_store": 50
  },
  "price_analysis": {
    "average_price": 49.99,
    "min_price": 9.99,
    "max_price": 199.99,
    "median_price": 45.00,
    "price_distribution": {
      "count": 150,
      "mean": 49.99,
      "std": 35.25,
      "min": 9.99,
      "25%": 25.00,
      "50%": 45.00,
      "75%": 75.00,
      "max": 199.99
    }
  },
  "rating_analysis": {
    "average_rating": 4.2,
    "min_rating": 1.0,
    "max_rating": 5.0,
    "rating_distribution": {
      "5.0": 45,
      "4.0": 60,
      "3.0": 30,
      "2.0": 10,
      "1.0": 5
    }
  }
}
```

## Testing

Run all tests:
```bash
pytest
```

Run unit tests only:
```bash
pytest -m unit
```

Run integration tests only:
```bash
pytest -m integration
```

Run tests with coverage report:
```bash
pytest --cov=src --cov-report=html
```

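
The `-m unit` and `-m integration` selections work because tests carry `pytest.mark` markers. Illustrative tests, not taken from the suite, might look like this:

```python
import pytest

# Illustrative only: shows how markers let `pytest -m unit` and
# `pytest -m integration` select subsets of the suite.
@pytest.mark.unit
def test_price_string_parses_to_float():
    assert float("$24.99".lstrip("$")) == 24.99

@pytest.mark.integration
def test_full_scrape_pipeline():
    pytest.skip("placeholder: would exercise the scraper end to end")
```
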
## Project Structure

```
market-trends-scraper/
├── src/
│   ├── __init__.py
│   ├── config_manager.py      # Configuration management
│   ├── logger.py              # Logging utilities
│   └── scraper.py             # Main scraper implementation
├── tests/
│   ├── __init__.py
│   ├── test_config_manager.py
│   ├── test_logger.py
│   ├── test_scraper.py
│   └── test_integration.py
├── config/
│   └── config.yaml            # Configuration file
├── data/                      # Output data directory
├── main.py                    # Main entry point
├── requirements.txt           # Python dependencies
├── pytest.ini                 # Test configuration
└── README.md                  # This file
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Guidelines

- Follow PEP 8 style guidelines
- Write comprehensive tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Disclaimer

This tool is for educational and research purposes. Users are responsible for:

- Complying with websites' terms of service
- Respecting robots.txt files (see the sketch below)
- Using the tool ethically and responsibly
- Not overwhelming servers with too many requests

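
For instance, robots.txt compliance can be checked with nothing but the standard library before fetching any URL. A minimal sketch, where the user-agent string is a placeholder rather than anything this project sends:

```python
from urllib.robotparser import RobotFileParser

# Minimal robots.txt check using only the standard library.
# "MyScraperBot/1.0" is a placeholder user agent.
robots = RobotFileParser()
robots.set_url("https://example-ecommerce.com/robots.txt")
robots.read()

url = "https://example-ecommerce.com/search?q=mouse"
if robots.can_fetch("MyScraperBot/1.0", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)
```
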
The authors are not responsible for any misuse of this tool.

## Support

If you encounter any issues or have questions:

1. Check the [Issues](https://github.com/iwasforcedtobehere/market-trends-scraper/issues) page
2. Create a new issue with detailed information
3. For general questions, use the [Discussions](https://github.com/iwasforcedtobehere/market-trends-scraper/discussions) tab

## Acknowledgments

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [Selenium](https://www.selenium.dev/) for browser automation
- [Pandas](https://pandas.pydata.org/) for data analysis
- [Loguru](https://github.com/Delgan/loguru) for logging