
Quick Start

Get started with DigitalChild in 5 minutes.

Prerequisites

Ensure you've installed DigitalChild and activated your virtual environment.

Your First Pipeline Run

Step 1: Basic Run

Run the pipeline on African Union policy documents:

python pipeline_runner.py --source au_policy

This will:

  1. ✅ Scrape AU policy documents (or skip if already downloaded)
  2. ✅ Process PDFs to extract text
  3. ✅ Apply tags using default tag configuration
  4. ✅ Generate exports in data/exports/

Expected output:

[INFO] Starting pipeline for source: au_policy
[INFO] Scraping documents...
[INFO] Found 12 documents
[INFO] Processing documents...
[INFO] Tagging documents...
[INFO] Generating exports...
[INFO] Pipeline complete!

Step 2: View Results

Check the exports:

ls data/exports/

You'll find:

  • tags_summary.csv - Tag frequencies and document counts
  • metadata_export.csv - All document metadata

Open in Excel, Google Sheets, or analyze with pandas:

import pandas as pd

df = pd.read_csv('data/exports/tags_summary.csv')
print(df.head())

Step 3: Explore Metadata

View processed documents:

# See metadata
cat data/metadata/metadata.json | python -m json.tool | head -50

# See processed text
ls data/processed/Africa/AU/text/
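
The same inspection can be done from Python. The sketch below is illustrative only: the helper name `summarize_metadata` is hypothetical, and because the layout of metadata.json is not documented on this page, it reports only the top-level type, size, and (for an object) the first few keys.

```python
import json
from pathlib import Path


def summarize_metadata(path):
    """Summarize a JSON file without assuming anything about its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {
        "type": type(data).__name__,
        # Only containers have a meaningful size.
        "size": len(data) if isinstance(data, (dict, list)) else None,
    }
    if isinstance(data, dict):
        # Top-level keys give a first hint of how the file is organised.
        summary["keys"] = sorted(data)[:10]
    return summary
```

Calling `summarize_metadata("data/metadata/metadata.json")` from the project root gives a quick overview before you open the full file.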

Common Use Cases

Process a Specific Country (UPR)

python pipeline_runner.py --source upr --country kenya

This scrapes and processes UPR (Universal Periodic Review) documents for Kenya specifically.

Use Latest Tags

python pipeline_runner.py --source au_policy --tags-version latest

The latest version points to the most recent tag configuration (currently tags_v3).

Run Scorecard Workflow

python pipeline_runner.py --mode scorecard --scorecard-action all

This:

  1. Enriches metadata with country-level indicators
  2. Exports scorecard summaries
  3. Validates all 2,543 source URLs

Results appear in data/exports/scorecard_*.csv.
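
To collect every scorecard export in one place, a small standard-library sketch such as the following may help. The helper name `load_scorecard_exports` is hypothetical, and the column names inside each CSV are not documented here, so rows are returned as plain dictionaries keyed by whatever header each file has.

```python
import csv
from pathlib import Path


def load_scorecard_exports(export_dir="data/exports"):
    """Map each scorecard_*.csv file name to its rows (one dict per row)."""
    exports = {}
    for csv_path in sorted(Path(export_dir).glob("scorecard_*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            exports[csv_path.name] = list(csv.DictReader(handle))
    return exports
```

Files in the same directory that do not match the `scorecard_*.csv` pattern, such as tags_summary.csv, are ignored.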

Process Multiple Sources

# AU Policy
python pipeline_runner.py --source au_policy

# OHCHR
python pipeline_runner.py --source ohchr

# UNICEF
python pipeline_runner.py --source unicef

Each source has unique scraping logic for that organization's website.
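
If you run several sources regularly, the commands above can be driven from a short script. This is a sketch, not part of DigitalChild: `build_command` and `run_sources` are hypothetical helpers, only the `--source` and `--tags-version` flags shown on this page are used, and treating a non-zero exit code as failure is an assumption about how pipeline_runner.py reports errors.

```python
import subprocess
import sys


def build_command(source, tags_version=None):
    """Build the pipeline command line for one source."""
    cmd = [sys.executable, "pipeline_runner.py", "--source", source]
    if tags_version:
        cmd += ["--tags-version", tags_version]
    return cmd


def run_sources(sources, tags_version=None, runner=subprocess.run):
    """Run each source in turn and return {source: exit code}."""
    results = {}
    for source in sources:
        completed = runner(build_command(source, tags_version))
        results[source] = completed.returncode
    return results
```

Run from the project root, `run_sources(["au_policy", "ohchr", "unicef"])` executes the three commands one after another. The `runner` parameter exists so the loop can be tested without launching the real pipeline.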

Understanding the Pipeline

Data Flow

┌─────────────────┐
│   WEB SOURCES   │ (AU, OHCHR, UPR, UNICEF)
└────────┬────────┘
         ▼
┌─────────────────┐
│    SCRAPERS     │ Download PDFs/DOCX/HTML
└────────┬────────┘
         ▼
┌─────────────────┐
│   data/raw/     │ Store downloaded files
└────────┬────────┘
         ▼
┌─────────────────┐
│   PROCESSORS    │ Extract text
└────────┬────────┘
         ▼
┌─────────────────┐
│ data/processed/ │ Store text files
└────────┬────────┘
         ▼
┌─────────────────┐
│     TAGGER      │ Apply regex rules
└────────┬────────┘
         ▼
┌─────────────────┐
│    METADATA     │ metadata.json with tags history
└────────┬────────┘
         ▼
┌─────────────────┐
│    EXPORTS      │ CSV files for analysis
└─────────────────┘
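
To make the tagger stage concrete, here is a minimal sketch of regex-based tagging. It is not the project's tagger: the rule names and patterns in `EXAMPLE_RULES` are invented for illustration, and the real tag configuration format may differ.

```python
import re

# Hypothetical rules: tag name -> list of regex patterns.
EXAMPLE_RULES = {
    "child_rights": [r"\bchild(ren)?('s)? rights\b"],
    "data_protection": [r"\bdata protection\b", r"\bprivacy\b"],
}


def tag_text(text, rules):
    """Return the sorted tags that have at least one pattern matching the text."""
    matched = []
    for tag, patterns in rules.items():
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            matched.append(tag)
    return sorted(matched)
```

A tag is applied as soon as any one of its patterns matches, which keeps each rule easy to extend with alternative phrasings.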

File Locations

After running the pipeline:

| Path | Contents |
|------|----------|
| `data/raw/au_policy/` | Downloaded PDF files |
| `data/processed/Africa/AU/text/` | Extracted text files |
| `data/metadata/metadata.json` | Document metadata with tags |
| `data/exports/tags_summary.csv` | Tag analysis |
| `logs/` | Run logs with timestamps |
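
After a run you can confirm that these locations were created. The sketch below checks only for existence; `check_outputs` is a hypothetical helper, and the path list is copied from the locations above for the au_policy source.

```python
from pathlib import Path

# Paths listed in this section, relative to the project root.
EXPECTED_PATHS = [
    "data/raw/au_policy",
    "data/processed/Africa/AU/text",
    "data/metadata/metadata.json",
    "data/exports/tags_summary.csv",
    "logs",
]


def check_outputs(project_root="."):
    """Return {relative path: True if it exists under project_root}."""
    root = Path(project_root)
    return {rel: (root / rel).exists() for rel in EXPECTED_PATHS}
```

Any entry reported as `False` points to the pipeline stage that is worth checking in the logs first.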

Pipeline Modes

The pipeline has three modes, selected with `--mode`:

**Complete workflow (`scraper`, also used when `--mode` is omitted):** Scrape → Process → Tag → Export

```bash
python pipeline_runner.py --source au_policy
```

**From static URLs (`urls`):** Process from `configs/url_dict/*.json`

```bash
python pipeline_runner.py --mode urls --source upr
```

**Indicator workflow (`scorecard`):** Enrich → Export → Validate

```bash
python pipeline_runner.py --mode scorecard --scorecard-action all
```

Command Reference

Required Arguments

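| Argument | Description | Example |
|----------|-------------|---------|
| `--source` | Data source name (the scorecard-mode examples on this page omit it) | `au_policy`, `upr`, `ohchr` |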

Optional Arguments

| Argument | Description | Example |
|----------|-------------|---------|
| `--tags-version` | Tag config version | `latest`, `v3`, `v2` |
| `--mode` | Pipeline mode | `scraper`, `urls`, `scorecard` |
| `--country` | Filter by country | `kenya`, `south_africa` |
| `--scorecard-action` | Scorecard action | `enrich`, `export`, `validate`, `all` |
| `--no-module-logs` | Disable per-module logs | (flag, no value) |

Examples

# Minimal
python pipeline_runner.py --source au_policy

# With tags version
python pipeline_runner.py --source upr --tags-version v3

# Country-specific
python pipeline_runner.py --source upr --country kenya

# Scorecard only
python pipeline_runner.py --mode scorecard --scorecard-action enrich

# URLs mode
python pipeline_runner.py --mode urls --source upr

Supported Sources

| Source | Description | Documents |
|--------|-------------|-----------|
| `au_policy` | African Union policy documents | ~10-15 |
| `ohchr` | OHCHR Treaty Body database | Hundreds |
| `upr` | Universal Periodic Review (per country) | ~50 per country |
| `unicef` | UNICEF reports | Varies |
| `acerwc` | African Committee on Child Rights | ~20-30 |
| `achpr` | African Commission on Human Rights | ~30-40 |
| `manual` | Manual uploads to `data/raw/manual/` | User-provided |

Troubleshooting

No documents found

Check whether documents already exist in data/raw/<source>/. The pipeline skips existing files, so delete them to force a re-scrape.

Import errors

Run commands from the project root, not from a subdirectory. Use absolute paths if needed.

Processing failed

Check logs/ for error details. PDFs that are scanned images have no text layer, so text extraction fails for them.

Tags summary empty

Verify documents have text content. Check data/processed/ for .txt files.
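
One way to run this check is to scan for text files that are empty or nearly empty, which is also a typical symptom of scanned PDFs. The sketch is illustrative: `find_sparse_text_files` is a hypothetical helper, and the 200-character threshold is an arbitrary assumption you should tune.

```python
from pathlib import Path


def find_sparse_text_files(processed_dir="data/processed", min_chars=200):
    """Return .txt files whose stripped content is shorter than min_chars."""
    sparse = []
    for txt_path in sorted(Path(processed_dir).rglob("*.txt")):
        # errors="replace" keeps the scan going if a file has bad encoding.
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        if len(text.strip()) < min_chars:
            sparse.append(txt_path)
    return sparse
```

Files returned by this function contribute little or nothing to tagging, so they are good candidates for manual review.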

See First Run Errors for comprehensive troubleshooting.

Pro Tips

Incremental Processing

The pipeline skips already-downloaded files. Run it again to process only new documents.
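
The skip-if-present idea can be sketched as follows. This shows the general technique only, not DigitalChild's scraper code: `pending_downloads` is a hypothetical helper, and deriving the local file name from the last segment of the URL path is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def pending_downloads(urls, raw_dir):
    """Return the URLs whose target file is not yet present in raw_dir."""
    raw_dir = Path(raw_dir)
    pending = []
    for url in urls:
        # Assumed naming scheme: local file name = last URL path segment.
        filename = Path(urlparse(url).path).name
        if not filename or not (raw_dir / filename).exists():
            pending.append(url)
    return pending
```

Because the check is based on file presence alone, deleting a file from the raw directory is enough to make it eligible for download again.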

Parallel Analysis

The exported CSV files can be analyzed side by side in R, Python, Excel, or Tableau.

Custom Tags

Edit configs/tags_v3.json to add your own regex patterns. Re-run with --tags-version v3.
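
Before adding patterns to the configuration, it can help to confirm that each one compiles. The sketch below is independent of the config file's structure, which is not documented on this page: `invalid_patterns` is a hypothetical helper that takes a plain list of pattern strings.

```python
import re


def invalid_patterns(patterns):
    """Return {pattern: error message} for patterns that fail to compile."""
    errors = {}
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            errors[pattern] = str(exc)
    return errors
```

An empty result means every pattern is syntactically valid. It does not tell you whether a pattern matches what you intend, so test new patterns against a few sample sentences as well.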

Version Control

The tags history in metadata.json records every tagging operation, so you can compare results across tag versions.

Happy analyzing! 🌈