Installation

This guide walks you through installing DigitalChild on your system.

Prerequisites

Required

  • Python 3.12 - Modern Python features required
  • pip - Python package installer
  • Git - Version control (for cloning the repository)
  • 1GB+ disk space - For code and a small dataset
  • Internet connection - For scraping documents

Optional

  • 10GB+ disk space - For large document collections
  • Virtual environment tool - venv, virtualenv, or conda
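
The prerequisites above can be checked from Python before you start. This is a minimal sketch, not part of DigitalChild: the function name `check_prerequisites` and its parameters are hypothetical, and it assumes that "Python 3.12" means an exact 3.12.x interpreter.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(path=".", min_free_gb=1.0, required_python=(3, 12)):
    """Return a dict mapping each prerequisite to True (met) or False.

    Hypothetical helper for illustration; not shipped with DigitalChild.
    """
    # Free space on the filesystem that holds `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        # Assumption: the guide's "Python 3.12" means exactly 3.12.x.
        "python": tuple(sys.version_info[:2]) == tuple(required_python),
        # pip is checked for the interpreter running this script.
        "pip": importlib.util.find_spec("pip") is not None,
        # git must be discoverable on PATH.
        "git": shutil.which("git") is not None,
        "disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Pass `min_free_gb=10` to check against the optional 10GB+ figure for large document collections.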

Installation Steps

1. Clone the Repository

git clone https://github.com/MissCrispenCakes/DigitalChild.git
cd DigitalChild

2. Set Up Virtual Environment

Linux/macOS:

```bash
python3 -m venv .LittleRainbow
source .LittleRainbow/bin/activate
```

Windows:

```cmd
python -m venv .LittleRainbow
.LittleRainbow\Scripts\activate
```

Conda (any platform):

```bash
conda create -n digitalchild python=3.12
conda activate digitalchild
```

Why a virtual environment?

Virtual environments isolate project dependencies, preventing conflicts with other Python projects on your system.
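
To confirm that an environment is actually active, one option is to inspect the interpreter itself. The sketch below is illustrative only: `active_environment` is a hypothetical helper, and it relies on two standard behaviours, namely that `venv`/`virtualenv` make `sys.prefix` differ from `sys.base_prefix`, and that conda sets the `CONDA_DEFAULT_ENV` variable on activation.

```python
import os
import sys


def active_environment(prefix=None, base_prefix=None, environ=None):
    """Return "venv", "conda", or None for the current interpreter.

    Hypothetical helper; the arguments exist only so the logic can be
    exercised without switching interpreters.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    environ = os.environ if environ is None else environ

    # Inside venv/virtualenv, sys.prefix points at the environment
    # while sys.base_prefix still points at the base installation.
    if prefix != base_prefix:
        return "venv"
    # conda does not change base_prefix, but exports this variable.
    if environ.get("CONDA_DEFAULT_ENV"):
        return "conda"
    return None


if __name__ == "__main__":
    print(active_environment() or "no virtual environment detected")
```

Note that conda also reports its `base` environment as active, so a `"conda"` result does not by itself prove that `digitalchild` is the selected environment.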

3. Install Dependencies

pip install -r requirements.txt

This installs:

  • beautifulsoup4 - HTML parsing
  • selenium - Browser automation (optional)
  • pandas - Data manipulation
  • PyPDF2 - PDF processing
  • python-docx - Word document processing
  • openpyxl - Excel file handling
  • requests - HTTP requests
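
Several of these packages are imported under a different name than the one pip uses (for example `beautifulsoup4` is imported as `bs4`, and `python-docx` as `docx`). As a rough sketch, the following lists any package from the list above that the current interpreter cannot find. The `PACKAGES` mapping and `missing_packages` function are hypothetical and only cover the packages named in this guide, not the full `requirements.txt`.

```python
import importlib.util

# pip package name -> module name used in `import` statements.
PACKAGES = {
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
    "pandas": "pandas",
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "requests": "requests",
}


def missing_packages(packages=None):
    """Return the pip names of packages whose module cannot be located."""
    packages = PACKAGES if packages is None else packages
    return [
        pip_name
        for pip_name, module in packages.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found" if not missing else f"Missing: {missing}")
```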

4. Initialize Project Structure

python init_project.py

This creates:

  • data/raw/ - Downloaded documents
  • data/processed/ - Extracted text
  • data/metadata/ - Metadata JSON files
  • data/exports/ - CSV export outputs
  • logs/ - Run logs
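
If you want to confirm that initialization worked, a short check along these lines can help. It is a sketch under the assumption that the directories are created relative to the project root exactly as listed above; `missing_dirs` is a hypothetical name, not part of `init_project.py`.

```python
from pathlib import Path

# Directories this guide says init_project.py creates.
EXPECTED_DIRS = [
    "data/raw",
    "data/processed",
    "data/metadata",
    "data/exports",
    "logs",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Project structure looks complete")
```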

Ready to Go!

Your installation is complete. Proceed to Quick Start to run your first pipeline.

Development Installation

For contributors and developers:

# Install development tools
pip install pre-commit pytest pytest-cov

# Set up pre-commit hooks
pre-commit install

# Verify installation
pytest tests/ -v
pre-commit run --all-files

Verifying Installation

Test your setup:

# Check Python version
python --version  # Should be 3.12.x

# Test imports
python -c "import pandas; import bs4; print('Success!')"

# Run demo (no internet needed)
python utils/pipeline_runner_DEMO.py

Optional: Selenium Setup

Only needed for _sel variant scrapers (browser automation):

1. Install ChromeDriver

Ubuntu/Debian:

```bash
sudo apt-get install chromium-chromedriver
```

macOS:

```bash
brew install chromedriver
```

Windows (or manual install): download from [ChromeDriver](https://chromedriver.chromium.org/) and add to PATH.

2. Verify Selenium

python -c "from selenium import webdriver; print('Selenium ready!')"
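
The import check above only confirms that the `selenium` package is installed. To also confirm that the driver binary is reachable, a small sketch like the following looks it up on PATH. The helper name `find_chromedriver` is hypothetical, and the check assumes the binary is named `chromedriver`.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver binary on PATH, or None."""
    # shutil.which mirrors the shell's lookup, including .exe on Windows.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    print(f"ChromeDriver found at {path}" if path else "ChromeDriver not on PATH")
```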

Troubleshooting

Python Version Issues

Error: Python 3.12 required

The project uses modern Python features from 3.12. Install Python 3.12 from python.org.

Virtual Environment Not Activating

Linux/macOS: check file permissions:

```bash
chmod +x .LittleRainbow/bin/activate
```

Windows (PowerShell): enable script execution:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Dependency Installation Failures

Try upgrading pip first:

pip install --upgrade pip
pip install -r requirements.txt

Import Errors

Ensure you're running from the project root:

# Wrong
cd processors
python pipeline_runner.py  # Error!

# Right
cd /path/to/DigitalChild
python pipeline_runner.py  # Success
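
If you are unsure whether your shell is in the project root, one way to find out is to search upward for files that this guide runs from the root. The sketch below assumes `requirements.txt` and `init_project.py` both live in the root, as the earlier steps imply; `find_project_root` and `ROOT_MARKERS` are hypothetical names.

```python
from pathlib import Path

# Assumption: both files sit in the project root, since the install
# steps above use them directly after `cd DigitalChild`.
ROOT_MARKERS = ("requirements.txt", "init_project.py")


def find_project_root(start="."):
    """Walk upward from `start`; return the first directory holding all markers."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if all((candidate / marker).is_file() for marker in ROOT_MARKERS):
            return candidate
    return None


if __name__ == "__main__":
    root = find_project_root()
    if root is None:
        print("Project root not found from here")
    elif root == Path.cwd().resolve():
        print("You are in the project root")
    else:
        print(f"Project root is {root}; cd there before running scripts")
```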

More Help

See First Run Errors for comprehensive troubleshooting.

System Requirements

Minimum

  • Python 3.12
  • 1GB RAM
  • 1GB disk space
  • Broadband internet

Recommended

  • Python 3.12
  • 4GB+ RAM
  • 10GB+ disk space
  • Fast internet connection
  • SSD for faster processing

Platform Support

DigitalChild runs on:

  • ✅ Linux (Ubuntu, Debian, Fedora, etc.)
  • ✅ macOS (10.15+)
  • ✅ Windows 10/11
  • ✅ WSL2 (Windows Subsystem for Linux)
  • ✅ Cloud VMs (AWS EC2, Google Cloud, Azure, DigitalOcean)

Need Help?