Installation

This guide walks you through installing DigitalChild on your system.

Just want to access the data?

Skip the full pipeline installation and run only the REST API:

# Install API only
pip install -r api_requirements.txt
python run_api.py

API Quick Start

Prerequisites

Required

  • Python 3.12 - Modern Python features required
  • pip - Python package installer
  • Git - Version control (for cloning repository)
  • 1GB+ disk space - For code and a small dataset
  • Internet connection - For scraping documents

Optional

  • 10GB+ disk space - For large document collections
  • Virtual environment tool - venv, virtualenv, or conda
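
The required items above can be checked before you start. The following is a minimal sketch, not part of DigitalChild; the function names are hypothetical, and treating 3.12 as a minimum (rather than an exact version) is an assumption.

```python
import shutil
import sys

# Assumption: 3.12 is treated as a minimum; the guide only states "Python 3.12".
REQUIRED_PYTHON = (3, 12)


def python_ok(version=None, required=REQUIRED_PYTHON):
    """Return True if the (major, minor) version meets the requirement."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= required


def missing_tools(tools=("pip", "git")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Python version OK:", python_ok())
print("Missing tools:", missing_tools() or "none")
```

If `pip` is reported missing but `python -m pip --version` works, pip is installed and only its launcher script is off your PATH.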

Installation Steps

1. Clone the Repository

git clone https://github.com/MissCrispenCakes/DigitalChild.git
cd DigitalChild

2. Set Up Virtual Environment

Linux/macOS:
```bash
python3 -m venv .LittleRainbow
source .LittleRainbow/bin/activate
```
Windows:
```cmd
python -m venv .LittleRainbow
.LittleRainbow\Scripts\activate
```
Conda:
```bash
conda create -n LittleRainbow python=3.12
conda activate LittleRainbow
```

Why virtual environment?

Virtual environments isolate project dependencies, preventing conflicts with other Python projects on your system.
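
To confirm that the environment is actually active before installing anything, you can ask the interpreter itself. This is a small sketch with hypothetical helper names; it relies on `sys.prefix` differing from `sys.base_prefix` inside a venv, and on conda setting the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def in_virtualenv():
    """True when running inside a venv/virtualenv (prefix differs from base)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def in_conda_env():
    """True when a conda environment is active (conda sets CONDA_DEFAULT_ENV)."""
    return bool(os.environ.get("CONDA_DEFAULT_ENV"))


print("Interpreter:", sys.executable)
print("venv active:", in_virtualenv())
print("conda env active:", in_conda_env())
```

If both report `False`, `pip install` will most likely target your system or user Python rather than the project environment.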

3. Install Dependencies

Pipeline Dependencies:

pip install -r requirements.txt

This installs:

  • beautifulsoup4 - HTML parsing
  • selenium - Browser automation (optional)
  • pandas - Data manipulation
  • pypdf - PDF processing
  • python-docx - Word document processing
  • openpyxl - Excel file handling
  • requests - HTTP requests

API Dependencies (Optional):

If you want to run the Flask REST API:

pip install -r api_requirements.txt

This adds:

  • Flask - Web framework
  • Flask-CORS - Cross-origin resource sharing
  • Flask-Caching - Response caching
  • Flask-Limiter - Rate limiting
  • gunicorn - Production server
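
Several of these packages are imported under a different name than the one you install with pip. The sketch below checks which ones can be located without importing them. The helper name is hypothetical, and the import names are the conventional ones for each package, not taken from the DigitalChild source.

```python
from importlib.util import find_spec

# pip name -> import name (assumed conventional import names).
PIPELINE_MODULES = {
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
    "pandas": "pandas",
    "pypdf": "pypdf",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "requests": "requests",
}
API_MODULES = {
    "Flask": "flask",
    "Flask-CORS": "flask_cors",
    "Flask-Caching": "flask_caching",
    "Flask-Limiter": "flask_limiter",
    "gunicorn": "gunicorn",
}


def missing_packages(modules):
    """Return the pip names whose import module cannot be located."""
    return [pip_name for pip_name, module in modules.items()
            if find_spec(module) is None]


print("Missing pipeline packages:", missing_packages(PIPELINE_MODULES) or "none")
print("Missing API packages:", missing_packages(API_MODULES) or "none")
```

Missing API packages are expected if you skipped `api_requirements.txt`.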

4. Initialize Project Structure

python init_project.py

This creates:

  • data/raw/ - Downloaded documents
  • data/processed/ - Extracted text
  • data/metadata/ - Metadata JSON files
  • data/exports/ - CSV export outputs
  • logs/ - Run logs
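
After running the script, you may want to confirm that the directories listed above exist. This is a minimal verification sketch, not the logic of `init_project.py`; `missing_dirs` is a hypothetical helper and the directory list is copied from this page.

```python
from pathlib import Path

# Directories this guide says init_project.py creates.
EXPECTED_DIRS = (
    "data/raw",
    "data/processed",
    "data/metadata",
    "data/exports",
    "logs",
)


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under root."""
    return [rel for rel in expected if not (Path(root) / rel).is_dir()]


print("Missing directories:", missing_dirs() or "none")
```

Run it from the project root, or pass the root explicitly, for example `missing_dirs("/path/to/DigitalChild")`.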

Ready to Go!

Your installation is complete. Proceed to Quick Start to run your first pipeline.

Development Installation

For contributors and developers:

# Install development tools
pip install pre-commit pytest pytest-cov

# Set up pre-commit hooks
pre-commit install

# Verify installation
pytest tests/ -v
pre-commit run --all-files

Verifying Installation

Test your setup:

# Check Python version
python --version  # Should be 3.12.x

# Test imports
python -c "import pandas; import bs4; print('Success!')"

# Run demo (no internet needed)
python utils/pipeline_runner_DEMO.py

Verify API Installation (Optional):

If you installed API dependencies:

# Test Flask import
python -c "import flask; print('Flask ready!')"

# Run API health check
python run_api.py &
sleep 2
python test_api.py
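
The fixed `sleep 2` above can be too short on a slow machine. One alternative is to poll until the server accepts connections. The sketch below assumes the API listens on 127.0.0.1:5000 (Flask's development default); the port that `run_api.py` actually uses is not stated here, so adjust it as needed. `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=5000, timeout=10.0, interval=0.2):
    """Poll until a TCP connection to host:port succeeds or timeout elapses.

    Assumption: port 5000 is only Flask's default, not a confirmed setting.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Server not accepting connections yet; wait and retry.
            time.sleep(interval)
    return False
```

Call `wait_for_port()` after starting the API in the background and run `test_api.py` only if it returns `True`.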

Optional: Selenium Setup

Only needed for the _sel scraper variants, which use browser automation:

1. Install ChromeDriver

Linux (Debian/Ubuntu):
```bash
sudo apt-get install chromium-chromedriver
```
macOS (Homebrew):
```bash
brew install chromedriver
```
Windows:
Download from [ChromeDriver](https://chromedriver.chromium.org/) and add it to your PATH.

2. Verify Selenium

python -c "from selenium import webdriver; print('Selenium ready!')"
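
The import check above only confirms that the Python package is installed. To also confirm that the driver binary is reachable, you can search PATH for it. This is a small sketch with a hypothetical helper name; it does not launch a browser.

```python
import shutil


def find_chromedriver(names=("chromedriver",)):
    """Return the full path of the first driver binary found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


print("ChromeDriver:", find_chromedriver() or "not found on PATH")
```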

Troubleshooting

Python Version Issues

Error: Python 3.12 required

The project uses modern Python features from 3.12. Install Python 3.12 from python.org.

Virtual Environment Not Activating

On Linux/macOS, check file permissions:
```bash
chmod +x .LittleRainbow/bin/activate
```
On Windows (PowerShell), enable script execution:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Dependency Installation Failures

Try upgrading pip first:

pip install --upgrade pip
pip install -r requirements.txt

Import Errors

Ensure you're running from the project root:

# Wrong
cd processors
python pipeline_runner.py  # Error!

# Right
cd /path/to/DigitalChild
python pipeline_runner.py  # Success
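
If you are unsure where the project root is, you can locate it by walking up from the current directory. The sketch below assumes the root is the directory containing both `requirements.txt` and `init_project.py`, since this guide runs both from there; `find_project_root` is a hypothetical helper.

```python
from pathlib import Path

# Assumption: files this guide runs from the repository root.
MARKERS = ("requirements.txt", "init_project.py")


def find_project_root(start=".", markers=MARKERS):
    """Walk upward from start; return the first directory holding all markers."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if all((candidate / marker).is_file() for marker in markers):
            return candidate
    return None


print("Project root:", find_project_root() or "not found")
```

If this prints a directory other than your current one, `cd` there before running pipeline scripts.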

More Help

See First Run Errors for comprehensive troubleshooting.

System Requirements

Minimum

  • Python 3.12
  • 1GB RAM
  • 1GB disk space
  • Broadband internet

Recommended

  • Python 3.12
  • 4GB+ RAM
  • 10GB+ disk space
  • Fast internet connection
  • SSD for faster processing
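
The disk-space figures above can be compared against what is actually free on your drive. This is a minimal sketch with hypothetical helper names; it treats 1GB and 10GB as 1 GiB and 10 GiB, which is an assumption about how the figures are meant.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / GIB


def disk_tier(free, minimum=1.0, recommended=10.0):
    """Classify free space against the thresholds listed in this guide."""
    if free >= recommended:
        return "recommended"
    if free >= minimum:
        return "minimum"
    return "insufficient"


space = free_gib()
print(f"Free space: {space:.1f} GiB ({disk_tier(space)})")
```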

Platform Support

DigitalChild runs on:

  • ✅ Linux (Ubuntu, Debian, Fedora, etc.)
  • ✅ macOS (10.15+)
  • ✅ Windows 10/11
  • ✅ WSL2 (Windows Subsystem for Linux)
  • ✅ Cloud VMs (AWS EC2, Google Cloud, Azure, DigitalOcean)
