# Quick Start
Get started with DigitalChild in 5 minutes.
## Prerequisites
Ensure you've installed DigitalChild and activated your virtual environment.
## Your First Pipeline Run
### Step 1: Basic Run
Run the pipeline on African Union policy documents:
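Assuming the default scraper mode, this is the same invocation shown under Pipeline Modes below:

```bash
python pipeline_runner.py --source au_policy
```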
This will:
- ✅ Scrape AU policy documents (or skip if already downloaded)
- ✅ Process PDFs to extract text
- ✅ Apply tags using default tag configuration
- ✅ Generate exports in `data/exports/`
Expected output:

```
[INFO] Starting pipeline for source: au_policy
[INFO] Scraping documents...
[INFO] Found 12 documents
[INFO] Processing documents...
[INFO] Tagging documents...
[INFO] Generating exports...
[INFO] Pipeline complete!
```
### Step 2: View Results
Check the exports:
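One quick way to check from the shell (the export directory is the default `data/exports/` noted above):

```bash
ls data/exports/
```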
You'll find:
- `tags_summary.csv` - Tag frequencies and document counts
- `metadata_export.csv` - All document metadata
Open in Excel, Google Sheets, or analyze with pandas:
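A minimal pandas sketch is below. The column names (`tag`, `frequency`, `document_count`) are assumptions for illustration; check the actual CSV header of your export, and replace the inline sample with `pd.read_csv("data/exports/tags_summary.csv")`.

```python
import io

import pandas as pd

# Inline sample mimicking tags_summary.csv; the column names here are
# assumptions, not guaranteed to match the real export.
csv_text = """tag,frequency,document_count
child_protection,42,10
education,31,8
health,25,7
"""

df = pd.read_csv(io.StringIO(csv_text))

# Most frequent tags first
top = df.sort_values("frequency", ascending=False)
print(top.head())

# Tags appearing in at least 8 documents
print(df[df["document_count"] >= 8]["tag"].tolist())
```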
### Step 3: Explore Metadata
View processed documents:
```bash
# See metadata
cat data/metadata/metadata.json | python -m json.tool | head -50

# See processed text
ls data/processed/Africa/AU/text/
```
## Common Use Cases
### Process Specific Country (UPR)
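The invocation, also listed under Examples below, combines the `--source` and `--country` flags:

```bash
python pipeline_runner.py --source upr --country kenya
```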
This scrapes and processes UPR (Universal Periodic Review) documents for Kenya specifically.
### Use Latest Tags
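Pass `--tags-version latest` (the source shown here is just an example):

```bash
python pipeline_runner.py --source au_policy --tags-version latest
```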
The `latest` version points to the most recent tag configuration (currently `tags_v3`).
### Run Scorecard Workflow
This:
- Enriches metadata with country-level indicators
- Exports scorecard summaries
- Validates all 2,543 source URLs
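The steps above correspond to the `all` action, the same invocation shown under Pipeline Modes:

```bash
python pipeline_runner.py --mode scorecard --scorecard-action all
```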
Results appear in `data/exports/scorecard_*.csv`.
### Process Multiple Sources
```bash
# AU Policy
python pipeline_runner.py --source au_policy

# OHCHR
python pipeline_runner.py --source ohchr

# UNICEF
python pipeline_runner.py --source unicef
```
Each source has its own scraping logic, tailored to that organization's website.
## Understanding the Pipeline
### Data Flow
```
┌─────────────────┐
│   WEB SOURCES   │  (AU, OHCHR, UPR, UNICEF)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    SCRAPERS     │  Download PDFs/DOCX/HTML
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    data/raw/    │  Store downloaded files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   PROCESSORS    │  Extract text
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ data/processed/ │  Store text files
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     TAGGER      │  Apply regex rules
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    METADATA     │  metadata.json with tags history
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     EXPORTS     │  CSV files for analysis
└─────────────────┘
```
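The tagging step can be pictured as a small regex matcher over the processed text. This is an illustrative sketch, not the project's actual implementation: the tag names and patterns below are made up, and the real rules live in `configs/tags_v*.json`.

```python
import re

# Hypothetical tag configuration: tag name -> list of regex patterns.
# The real pipeline loads its rules from configs/tags_v3.json.
TAG_PATTERNS = {
    "education": [r"\bschool(s|ing)?\b", r"\bcurricul(um|a)\b"],
    "child_protection": [r"\bchild (labou?r|marriage|protection)\b"],
}

def apply_tags(text: str) -> list[str]:
    """Return the tags whose patterns match anywhere in the text."""
    matched = []
    for tag, patterns in TAG_PATTERNS.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            matched.append(tag)
    return matched

doc = "The policy expands schooling and bans child labour."
print(apply_tags(doc))  # matches both example tags
```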
### File Locations
After running the pipeline:
| Path | Contents |
|---|---|
| `data/raw/au_policy/` | Downloaded PDF files |
| `data/processed/Africa/AU/text/` | Extracted text files |
| `data/metadata/metadata.json` | Document metadata with tags |
| `data/exports/tags_summary.csv` | Tag analysis |
| `logs/` | Run logs with timestamps |
## Pipeline Modes

The pipeline has three modes:
**Complete workflow:** Scrape → Process → Tag → Export
```bash
python pipeline_runner.py --source au_policy
```
**From static URLs:** Process from `configs/url_dict/*.json`
```bash
python pipeline_runner.py --mode urls --source upr
```
**Indicator workflow:** Enrich → Export → Validate
```bash
python pipeline_runner.py --mode scorecard --scorecard-action all
```
## Command Reference
### Required Arguments
| Argument | Description | Example |
|---|---|---|
| `--source` | Data source name | `au_policy`, `upr`, `ohchr` |
### Optional Arguments
| Argument | Description | Example |
|---|---|---|
| `--tags-version` | Tag config version | `latest`, `v3`, `v2` |
| `--mode` | Pipeline mode | `scraper`, `urls`, `scorecard` |
| `--country` | Filter by country | `kenya`, `south_africa` |
| `--scorecard-action` | Scorecard action | `enrich`, `export`, `validate`, `all` |
| `--no-module-logs` | Disable per-module logs | (flag, no value) |
### Examples
```bash
# Minimal
python pipeline_runner.py --source au_policy

# With tags version
python pipeline_runner.py --source upr --tags-version v3

# Country-specific
python pipeline_runner.py --source upr --country kenya

# Scorecard only
python pipeline_runner.py --mode scorecard --scorecard-action enrich

# URLs mode
python pipeline_runner.py --mode urls --source upr
```
## Supported Sources
| Source | Description | Documents |
|---|---|---|
| `au_policy` | African Union policy documents | ~10-15 |
| `ohchr` | OHCHR Treaty Body database | Hundreds |
| `upr` | Universal Periodic Review (per country) | ~50 per country |
| `unicef` | UNICEF reports | Varies |
| `acerwc` | African Committee on Child Rights | ~20-30 |
| `achpr` | African Commission on Human Rights | ~30-40 |
| `manual` | Manual uploads to `data/raw/manual/` | User-provided |
## Next Steps
- **Learn More**: Dive deeper into pipeline operations
- **Customize Tags**: Add your own tag patterns
- **Explore Scorecard**: Understand country indicators
- **Add Scrapers**: Build scrapers for new sources
## Troubleshooting
**No documents found**

Check whether documents already exist in `data/raw/<source>/`. The pipeline skips existing files; delete them to re-scrape.

**Import errors**

Ensure you're running from the project root, not from a subdirectory. Use absolute paths if needed.

**Processing failed**

Check `logs/` for error details. Some PDFs may be scanned images (no text layer) and will fail.

**Tags summary empty**

Verify the documents have text content. Check `data/processed/` for `.txt` files.
See First Run Errors for comprehensive troubleshooting.
## Pro Tips
**Incremental Processing**

The pipeline skips already-downloaded files. Run it again to process only new documents.

**Parallel Analysis**

Exported CSV files can be analyzed in parallel with R, Python, Excel, or Tableau.

**Custom Tags**

Edit `configs/tags_v3.json` to add your own regex patterns, then re-run with `--tags-version v3`.

**Version Control**

The tags history preserves all tagging operations, so you can compare results across tag versions using the metadata.
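For the Custom Tags tip, a hypothetical entry might look like the fragment below. The schema (tag names mapping to pattern lists) is an assumption for illustration; check the existing `configs/tags_v3.json` for the actual structure before editing.

```json
{
  "water_sanitation": {
    "patterns": ["\\bWASH\\b", "clean water", "sanitation"]
  }
}
```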
## Getting Help
- Documentation: Full docs index
- FAQ: Common questions
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Happy analyzing! 🌈