Quick Start¶

Get started with DigitalChild in 5 minutes.

Prerequisites¶

Ensure you've installed DigitalChild and activated your virtual environment.

Your First Pipeline Run¶

Step 1: Basic Run¶

Run the pipeline on African Union policy documents:

python pipeline_runner.py --source au_policy

This will:

✅ Scrape AU policy documents (or skip if already downloaded)
✅ Process PDFs to extract text
✅ Apply tags using default tag configuration
✅ Generate exports in data/exports/

Expected output:

[INFO] Starting pipeline for source: au_policy
[INFO] Scraping documents...
[INFO] Found 12 documents
[INFO] Processing documents...
[INFO] Tagging documents...
[INFO] Generating exports...
[INFO] Pipeline complete!
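To check a run from a script rather than by eye, the console output can be summarized with a small parser. This is a minimal sketch that assumes the log lines look exactly like the sample output above; `summarize_run` is a hypothetical helper, not part of DigitalChild.

```python
import re


def summarize_run(output: str) -> dict:
    """Summarize pipeline console output in the format of the sample log.

    Assumption: lines look like "[INFO] Found 12 documents" and
    "[INFO] Pipeline complete!" as in the expected output.
    """
    found = re.search(r"\[INFO\] Found (\d+) documents?", output)
    return {
        # None when no "Found N documents" line is present
        "documents_found": int(found.group(1)) if found else None,
        "completed": "[INFO] Pipeline complete!" in output,
    }
```

Pass it the captured text of a run, for example the contents of a log file, and check that `completed` is true before reading the exports.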

Step 2: View Results¶

Check the exports:

ls data/exports/

You'll find:

tags_summary.csv - Tag frequencies and document counts
metadata_export.csv - All document metadata

Open them in Excel or Google Sheets, or analyze them with pandas:

import pandas as pd

df = pd.read_csv('data/exports/tags_summary.csv')
print(df.head())
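If pandas is not installed, the standard library is enough for a first look. The sketch below reads the header and the first few rows without assuming any column names, since this guide does not list the columns of the export files; `preview_csv` is a hypothetical helper.

```python
import csv
from pathlib import Path


def preview_csv(path, limit=5):
    """Return (header, first `limit` rows) of a CSV file.

    No column names are assumed; rows come back as lists of strings.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # zip stops after `limit` rows without reading the rest of the file
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows
```

For example, `preview_csv("data/exports/tags_summary.csv")` returns the column names first, which is a quick way to see what the export contains before loading it elsewhere.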

Step 3: Explore Metadata¶

View processed documents:

# See metadata
python -m json.tool data/metadata/metadata.json | head -50

# See processed text
ls data/processed/Africa/AU/text/
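The same inspection can be done from Python. Because this guide does not describe the layout of metadata.json, the sketch below only reports the top-level shape of the file and makes no assumption about field names; `describe_json` is a hypothetical helper.

```python
import json
from pathlib import Path


def describe_json(path):
    """Report the top-level shape of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Show at most ten keys so large files stay readable
        return {"type": "object", "size": len(data), "keys": sorted(data)[:10]}
    if isinstance(data, list):
        return {"type": "array", "size": len(data), "keys": []}
    return {"type": type(data).__name__, "size": 1, "keys": []}
```

Calling `describe_json("data/metadata/metadata.json")` shows whether the file is an object or an array and how many entries it holds.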

Common Use Cases¶

Process Specific Country (UPR)¶

python pipeline_runner.py --source upr --country kenya

This scrapes and processes UPR (Universal Periodic Review) documents for Kenya only.

Use Latest Tags¶

python pipeline_runner.py --source au_policy --tags-version latest

The latest alias resolves to the most recent tag configuration (currently tags_v3).

Run Scorecard Workflow¶

python pipeline_runner.py --mode scorecard --scorecard-action all

This:

Enriches metadata with country-level indicators
Exports scorecard summaries
Validates all 2,543 source URLs

Results appear in data/exports/scorecard_*.csv.
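To see which scorecard files a run produced, the glob pattern above can be applied from Python. This is a small sketch; `list_scorecard_exports` is a hypothetical helper, and the individual file names depend on the run.

```python
from pathlib import Path


def list_scorecard_exports(exports_dir="data/exports"):
    """Return the names of files matching scorecard_*.csv, sorted."""
    return sorted(p.name for p in Path(exports_dir).glob("scorecard_*.csv"))
```

An empty list means the scorecard workflow has not written any exports to that directory yet.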

Process Multiple Sources¶

# AU Policy
python pipeline_runner.py --source au_policy

# OHCHR
python pipeline_runner.py --source ohchr

# UNICEF
python pipeline_runner.py --source unicef

Each source has its own scraping logic, tailored to that organization's website.
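Running several sources back to back can be scripted. The sketch below runs the commands above one after another and records each exit code. It assumes the script is started from the project root and that pipeline_runner.py exits with a non-zero code on failure, which this guide does not confirm; `run_sources` is a hypothetical helper.

```python
import subprocess
import sys

SOURCES = ("au_policy", "ohchr", "unicef")


def run_sources(sources=SOURCES, runner=subprocess.run):
    """Run the pipeline once per source and return {source: exit code}.

    `runner` is injectable so the loop can be exercised without
    launching the real pipeline.
    """
    results = {}
    for source in sources:
        command = [sys.executable, "pipeline_runner.py", "--source", source]
        completed = runner(command, check=False)
        results[source] = completed.returncode
    return results
```

Running sources sequentially keeps the log output of each run separate and avoids two runs writing to the same metadata file at once.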

Understanding the Pipeline¶

Data Flow¶

┌─────────────────┐
│   WEB SOURCES   │ (AU, OHCHR, UPR, UNICEF)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    SCRAPERS     │ Download PDFs/DOCX/HTML
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   data/raw/     │ Store downloaded files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   PROCESSORS    │ Extract text
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ data/processed/ │ Store text files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     TAGGER      │ Apply regex rules
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    METADATA     │ metadata.json with tags history
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    EXPORTS      │ CSV files for analysis
└─────────────────┘
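The tagger stage in the diagram applies regex rules to extracted text. The sketch below illustrates that idea only: the rule names, the patterns, and the dictionary layout are hypothetical, and the project's actual tag configuration format is covered in the tags documentation rather than here.

```python
import re

# Hypothetical rules for illustration; not the project's tag configuration.
EXAMPLE_RULES = {
    "child_rights": [r"\bchild rights\b", r"\bchildren's rights\b"],
    "online_safety": [r"\bonline (safety|protection)\b", r"\bcyber ?bullying\b"],
}


def tag_text(text, rules=EXAMPLE_RULES):
    """Return the sorted tags whose patterns match `text` (case-insensitive)."""
    tags = []
    for tag, patterns in rules.items():
        # A tag applies as soon as any one of its patterns matches
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            tags.append(tag)
    return sorted(tags)
```

A document can receive several tags at once, and a document with no matching pattern receives none, which is one reason a tags summary can come out empty.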

File Locations¶

After running the pipeline:

| Path | Contents |
|------|----------|
| `data/raw/au_policy/` | Downloaded PDF files |
| `data/processed/Africa/AU/text/` | Extracted text files |
| `data/metadata/metadata.json` | Document metadata with tags |
| `data/exports/tags_summary.csv` | Tag analysis |
| `logs/` | Run logs with timestamps |
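A quick way to confirm that a run produced everything listed above is to test each path. This is a minimal sketch using the paths from the table for the au_policy source; `check_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Paths taken from the file locations table (au_policy run)
EXPECTED_PATHS = [
    "data/raw/au_policy",
    "data/processed/Africa/AU/text",
    "data/metadata/metadata.json",
    "data/exports/tags_summary.csv",
    "logs",
]


def check_outputs(project_root="."):
    """Return {relative path: True if it exists under project_root}."""
    root = Path(project_root)
    return {rel: (root / rel).exists() for rel in EXPECTED_PATHS}
```

Any path that maps to `False` points at the stage that did not complete, following the order of the data flow.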

Pipeline Modes¶

The pipeline has 3 modes:

scraper (default)

**Complete workflow:** Scrape → Process → Tag → Export

```bash
python pipeline_runner.py --source au_policy
```

urls

**From static URLs:** Process from `configs/url_dict/*.json`

```bash
python pipeline_runner.py --mode urls --source upr
```

scorecard

**Indicator workflow:** Enrich → Export → Validate

```bash
python pipeline_runner.py --mode scorecard --scorecard-action all
```

Command Reference¶

Required Arguments¶

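| Argument | Description | Example |
|----------|-------------|---------|
| `--source` | Data source name (the scorecard examples in this guide omit it) | `au_policy`, `upr`, `ohchr` |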

Optional Arguments¶

| Argument | Description | Example |
|----------|-------------|---------|
| `--tags-version` | Tag config version | `latest`, `v3`, `v2` |
| `--mode` | Pipeline mode | `scraper`, `urls`, `scorecard` |
| `--country` | Filter by country | `kenya`, `south_africa` |
| `--scorecard-action` | Scorecard action | `enrich`, `export`, `validate`, `all` |
| `--no-module-logs` | Disable per-module logs | (flag, no value) |

Examples¶

# Minimal
python pipeline_runner.py --source au_policy

# With tags version
python pipeline_runner.py --source upr --tags-version v3

# Country-specific
python pipeline_runner.py --source upr --country kenya

# Scorecard only
python pipeline_runner.py --mode scorecard --scorecard-action enrich

# URLs mode
python pipeline_runner.py --mode urls --source upr
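When the pipeline is driven from another script, it helps to build the argument list in one place. The sketch below uses only the flags documented in this reference; `build_command` is a hypothetical helper, and it does not check that the values are ones the pipeline accepts.

```python
import sys


def build_command(source=None, mode=None, country=None,
                  tags_version=None, scorecard_action=None):
    """Build an argv list for pipeline_runner.py from the documented flags."""
    command = [sys.executable, "pipeline_runner.py"]
    options = [
        ("--mode", mode),
        ("--source", source),
        ("--country", country),
        ("--tags-version", tags_version),
        ("--scorecard-action", scorecard_action),
    ]
    for flag, value in options:
        # Flags left as None are omitted so the pipeline uses its defaults
        if value is not None:
            command += [flag, value]
    return command
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with country or source names.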

Supported Sources¶

| Source | Description | Documents |
|--------|-------------|-----------|
| `au_policy` | African Union policy documents | ~10-15 |
| `ohchr` | OHCHR Treaty Body database | Hundreds |
| `upr` | Universal Periodic Review (per country) | ~50 per country |
| `unicef` | UNICEF reports | Varies |
| `acerwc` | African Committee of Experts on the Rights and Welfare of the Child | ~20-30 |
| `achpr` | African Commission on Human and Peoples' Rights | ~30-40 |
| `manual` | Manual uploads to `data/raw/manual/` | User-provided |

Next Steps¶

Learn More: Dive deeper into pipeline operations (Read Runbook)
Customize Tags: Add your own tag patterns (Tags Config Format)
Explore Scorecard: Understand country indicators (Scorecard Workflow)
Add Scrapers: Build scrapers for new sources (Scraper Structure)

Troubleshooting¶

No documents found

Check whether documents already exist in data/raw/<source>/. The pipeline skips files it has already downloaded, so delete them to force a re-scrape.

Import errors

Run commands from the project root, not from a subdirectory. Use absolute paths if needed.

Processing failed

Check logs/ for error details. Some PDFs may be scanned images (no text layer) and will fail.

Tags summary empty

Verify documents have text content. Check data/processed/ for .txt files.

See First Run Errors for comprehensive troubleshooting.
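For the scanned-PDF and empty tags summary cases above, it can help to list the extracted text files that contain nothing but whitespace. This is a minimal sketch that assumes extracted text is stored as .txt files somewhere under data/processed/; `find_empty_text_files` is a hypothetical helper.

```python
from pathlib import Path


def find_empty_text_files(processed_dir="data/processed"):
    """Return .txt files under processed_dir that hold only whitespace."""
    empty = []
    for path in sorted(Path(processed_dir).rglob("*.txt")):
        # errors="replace" keeps the scan going past badly encoded files
        if not path.read_text(encoding="utf-8", errors="replace").strip():
            empty.append(path)
    return empty
```

Files reported here are likely candidates for scanned source PDFs, since a PDF without a text layer yields no extractable text.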

Pro Tips¶

Incremental Processing

The pipeline skips already-downloaded files, so running it again processes only new documents.

Parallel Analysis

Exported CSV files can be analyzed side by side in R, Python, Excel, or Tableau.

Custom Tags

Edit configs/tags_v3.json to add your own regex patterns. Re-run with --tags-version v3.

Version Control

Tags history preserves all tagging operations. Compare results across tag versions using metadata.
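Before adding patterns to a tag configuration, as the Custom Tags tip suggests, it is worth confirming that each one compiles. The sketch below checks a plain list of pattern strings with Python's `re` module; it assumes the pipeline uses Python-compatible regex syntax and says nothing about the layout of the JSON file itself. `invalid_patterns` is a hypothetical helper.

```python
import re


def invalid_patterns(patterns):
    """Return (pattern, error message) pairs for patterns that fail to compile."""
    problems = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((pattern, str(exc)))
    return problems
```

An empty result means every pattern is syntactically valid; it does not show that a pattern matches the text you intend, so test new patterns against a few sample documents as well.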

Getting Help¶

Documentation: Full docs index
FAQ: Common questions
Issues: GitHub Issues
Discussions: GitHub Discussions

Happy analyzing! 🌈