# System Architecture
This document provides a high-level overview of the DigitalChild pipeline architecture.
## Purpose
DigitalChild is a data pipeline that:
- Scrapes human rights documents from international organizations
- Processes documents into structured, analyzable text
- Analyzes content using regex-based tagging and enrichment
- Enriches with country-level indicators via scorecard system
- Exports analysis results for research use
Focus: Child and LGBTQ+ digital rights, with particular emphasis on AI policy, data protection, and online safety.
## High-Level Architecture

```
┌──────────────────┐
│     SOURCES      │  (Web: AU, OHCHR, UPR, UNICEF, etc.)
└────────┬─────────┘
         │ HTTP/Selenium
         ▼
┌──────────────────┐
│     SCRAPERS     │  (Download PDFs, DOCX, HTML)
└────────┬─────────┘
         │ Files
         ▼
┌──────────────────┐
│    data/raw/     │  (Raw documents by source)
└────────┬─────────┘
         ▼
┌──────────────────┐
│    PROCESSORS    │  (PDF→text, DOCX→text, HTML→text)
└────────┬─────────┘
         │ Text files
         ▼
┌──────────────────┐
│ data/processed/  │  (Extracted text by region/org)
└────────┬─────────┘
         ▼
┌──────────────────┐
│      TAGGER      │  (Apply regex rules from configs)
└────────┬─────────┘
         │ Tags
         ▼
┌──────────────────┐
│     METADATA     │  (metadata.json with tags_history)
└────────┬─────────┘
         ▼
┌──────────────────┐
│     ENRICHER     │  (Add scorecard indicators)
└────────┬─────────┘
         │ Enriched metadata
         ▼
┌──────────────────┐
│    EXPORTERS     │  (Generate CSV summaries)
└────────┬─────────┘
         ▼
┌──────────────────┐
│  data/exports/   │  (CSV files for analysis)
└──────────────────┘
```
## Core Components

### 1. Pipeline Runner
File: pipeline_runner.py
Responsibilities:
- Orchestrates entire workflow
- Handles CLI arguments
- Manages logging
- Supports 3 modes:
  - `scraper`: Run scrapers, process, tag, export
  - `urls`: Process from static URL dictionaries
  - `scorecard`: Enrich/export/validate scorecard data
Entry Point:
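The entry-point code block appears to have been lost in extraction. Based on the CLI usage shown elsewhere in this document (`--source`, `--mode`), a hedged sketch of what the runner's entry point might look like; flag help strings and dispatch details are assumptions, not the project's actual code:

```python
import argparse


def main(argv=None):
    """Parse CLI arguments and dispatch the requested pipeline mode."""
    parser = argparse.ArgumentParser(description="DigitalChild pipeline runner")
    parser.add_argument("--source", help="scraper source key, e.g. au_policy")
    parser.add_argument("--mode", default="scraper",
                        choices=["scraper", "urls", "scorecard"],
                        help="pipeline mode to run")
    args = parser.parse_args(argv)
    # The real runner would dispatch to the scraper/urls/scorecard workflow here.
    return args


if __name__ == "__main__":
    main()
```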
### 2. Scrapers (scrapers/)
Purpose: Fetch documents from web sources
Structure:

- Each source has its own module (e.g., `au_policy.py`)
- Two variants: requests-based and Selenium (`_sel` suffix)
- Implements a standard `scrape()` function
- Outputs to `data/raw/<source>/`
Example Sources:
- `au_policy`: African Union policy documents
- `ohchr`: OHCHR Treaty Body database
- `upr`: Universal Periodic Review documents
- `unicef`: UNICEF reports
- `acerwc`: African Committee on Child Rights
- `achpr`: African Commission on Human Rights
Key Features:
- Skip existing files (idempotent)
- Configurable timeouts and retry logic
- Logging for all operations
- Error handling and graceful failures
### 3. Processors (processors/)
Purpose: Convert documents to analyzable text
Modules:
- `pdf_to_text.py`: Extract text from PDFs (PyPDF2)
- `docx_to_text.py`: Extract text from Word docs (python-docx)
- `html_to_text.py`: Extract text from HTML (BeautifulSoup4)
- `fallback_handler.py`: Try processors until one succeeds
Output: Text files in data/processed/<region>/<org>/text/
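The "try processors until one succeeds" idea behind `fallback_handler.py` can be sketched generically. The function name and signature below are assumptions, not the module's real API:

```python
from typing import Callable, Optional


def convert_with_fallback(path: str,
                          converters: list[Callable[[str], str]]) -> Optional[str]:
    """Return text from the first converter that succeeds, else None."""
    for convert in converters:
        try:
            return convert(path)
        except Exception:
            continue  # this converter failed; try the next one
    return None
```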
### 4. Tagger (processors/tagger.py)
Purpose: Apply regex-based tags to documents
How it works:
- Load tag config (e.g., `configs/tags_v3.json`)
- Apply regex patterns to text
- Record matched tags
- Store in `metadata.json` with version and timestamp
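The tagging step above can be sketched as a single function. The per-tag patterns here are placeholders standing in for the real `configs/tags_v3.json` contents, and the record shape is an assumption:

```python
import re
from datetime import datetime, timezone


def tag_document(text: str, tag_patterns: dict[str, str]) -> dict:
    """Apply each tag's regex to the text and record the matches."""
    matched = sorted(tag for tag, pattern in tag_patterns.items()
                     if re.search(pattern, text, re.IGNORECASE))
    return {
        "version": "v3",
        "tags": matched,
        "tagged_at": datetime.now(timezone.utc).isoformat(),
    }
```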
Tags include:
- ChildRights, LGBTQ, AI, Privacy
- DigitalPolicy, OnlineRights, DataProtection
- And more (expandable via configs)
Versioning:
- Multiple tag versions (v1, v2, v3, digital)
- `tags_main.json` maps version aliases
- Tags history preserves all versions for comparison
### 5. Scorecard System
Purpose: Enrich documents with country-level indicators
Components:
#### A. Data Loader (processors/scorecard.py)
- Loads `data/scorecard/scorecard_main_presentation.xlsx` (canonical file: 194 countries, 10 indicators)
- Provides query functions
- Caches data in memory
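The load-once, serve-from-memory behavior described above might look like the sketch below. The `Country` column name and the cache shape are assumptions; only the file path comes from this document:

```python
import pandas as pd

_CACHE: dict[str, pd.DataFrame] = {}


def load_scorecard(
    path: str = "data/scorecard/scorecard_main_presentation.xlsx",
) -> pd.DataFrame:
    """Read the canonical scorecard once, then serve it from memory."""
    if path not in _CACHE:
        _CACHE[path] = pd.read_excel(path)
    return _CACHE[path]


def indicators_for(country: str, df: pd.DataFrame) -> dict:
    """Return the indicator row for a country, or {} if unknown."""
    row = df[df["Country"] == country]
    return {} if row.empty else row.iloc[0].to_dict()
```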
#### B. Enricher (processors/scorecard_enricher.py)
- Matches documents to countries
- Adds indicator data to metadata
- Tracks enrichment timestamp
#### C. Exporter (processors/scorecard_export.py)
- Exports to CSV formats:
- Summary (countries × indicators)
- Sources (all source URLs)
- By indicator
- By region
#### D. Validator (processors/scorecard_validator.py)
- Validates 2,543 source URLs
- Parallel workers for performance
- Retry logic for transient failures
- Generates broken links report
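The parallel validation with retry described above can be sketched with a thread pool; 10 workers mirrors the configuration mentioned later in this document. The HEAD-request approach and function names are assumptions:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_url(url: str, retries: int = 2, timeout: int = 10) -> bool:
    """Return True if the URL answers with a non-error status."""
    for _ in range(retries):
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except (OSError, ValueError):
            continue  # transient failure or bad URL: retry, then give up
    return False


def validate_all(urls: list[str], workers: int = 10) -> dict[str, bool]:
    """Check many URLs concurrently and map each to its result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(check_url, urls)))
```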
#### E. Diff Monitor (processors/scorecard_diff.py)
- Monitors sources for changes
- Content hashing for comparison
- Detects stale data
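Content-hash comparison, as used by the diff monitor, reduces to a few lines; SHA-256 is an assumption here, as the document does not name the hash algorithm:

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Hash fetched page content for later comparison."""
    return hashlib.sha256(content).hexdigest()


def has_changed(new_content: bytes, stored_hash: str) -> bool:
    """Compare freshly fetched content against its last recorded hash."""
    return content_hash(new_content) != stored_hash
```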
10 Indicators Tracked:
- AI_Policy_Status
- Data_Protection_Law
- LGBTQ_Legal_Status
- Child_Online_Protection
- SIM_Biometric
- Encryption_Backdoors
- Promotion_Propaganda
- DPA_Independence
- Content_Moderation
- Age_Verification
### 6. Validators (processors/validators.py)
Purpose: Centralized input validation and security
Functions:
- URL validation (blocks malicious patterns)
- Path validation (prevents traversal attacks)
- File validation (size, extension checks)
- String validation (length, patterns)
- Config validation (JSON structure)
- Schema validation (metadata documents)
Security: Protects against:
- Path traversal (e.g., `../../../etc/passwd`)
- Malicious URLs (e.g., `javascript:`, `file:`)
- File bombs (size limits)
- Invalid configs
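Two of the protections listed above, scheme allow-listing and traversal-safe path resolution, can be sketched as follows. The real `validators.py` API is not shown in this document, so these function names are assumptions:

```python
import os
from urllib.parse import urlparse

SAFE_SCHEMES = {"http", "https"}


def is_safe_url(url: str) -> bool:
    """Reject javascript:, file:, and other non-HTTP schemes."""
    return urlparse(url).scheme.lower() in SAFE_SCHEMES


def is_safe_path(base_dir: str, user_path: str) -> bool:
    """Reject paths that escape base_dir via .. traversal."""
    resolved = os.path.realpath(os.path.join(base_dir, user_path))
    return resolved.startswith(os.path.realpath(base_dir) + os.sep)
```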
### 7. Metadata System
File: data/metadata/metadata.json
Structure:

```json
{
  "project_identity": {...},
  "documents": [
    {
      "id": "doc-123.pdf",
      "source": "au_policy",
      "country": "Kenya",
      "year": 2024,
      "tags_history": [...],
      "recommendations_history": [...],
      "scorecard": {...},
      "last_processed": "2025-01-19T10:00:00Z"
    }
  ]
}
```
Tracking:
- Document metadata (source, country, year)
- Tags history (versions, timestamps)
- Recommendations (future)
- Scorecard indicators
- Processing timestamps
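A tagging run appending to a document's `tags_history` might look like the sketch below. Field names follow the structure shown above; the function itself is illustrative, not the project's actual code:

```python
from datetime import datetime, timezone


def record_tags(metadata: dict, doc_id: str, version: str,
                tags: list[str]) -> None:
    """Append a tagging result to the matching document's history."""
    for doc in metadata["documents"]:
        if doc["id"] == doc_id:
            now = datetime.now(timezone.utc).isoformat()
            doc.setdefault("tags_history", []).append({
                "version": version,
                "tags": tags,
                "tagged_at": now,
            })
            doc["last_processed"] = now
```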
### 8. Logging System (processors/logger.py)
Features:
- Unified run logs
- Per-module logs (optional)
- Timestamped filenames
- Console + file output
- Configurable via `--no-module-logs`
Levels:
- INFO: Normal operations
- WARNING: Recoverable issues
- ERROR: Non-recoverable failures
## Data Flow

### Standard Pipeline Execution
1. User runs: python pipeline_runner.py --source au_policy
2. Pipeline Runner:
- Initializes logging
- Loads SCRAPER_MAP configuration
- Determines source and output paths
3. Scraper Phase:
- Downloads documents to data/raw/au_policy/
- Skips existing files
- Returns list of file paths
4. Processing Phase:
For each downloaded file:
- Detect file type (PDF, DOCX, HTML)
- Convert to text → data/processed/Africa/AU/text/
- Extract metadata (year, country from filename/content)
5. Tagging Phase:
- Load tag config (tags_v3.json)
- Apply regex rules to each document
- Store tags in metadata.json
6. Export Phase:
- Generate tags_summary.csv
- Count tag frequencies
- Add project branding footer
7. Logging:
- Write unified log to logs/<timestamp>_au_policy_run.log
- Optional per-module logs
### Scorecard Workflow
1. User runs: python pipeline_runner.py --mode scorecard --scorecard-action all
2. Enrich:
- Load metadata.json
- Load data/scorecard/scorecard_main_presentation.xlsx
- Match documents to countries
- Add indicators to metadata
- Save updated metadata.json
3. Export:
- Generate scorecard_summary.csv
- Generate scorecard_sources.csv
- Generate indicator-specific CSVs
4. Validate:
- Load all source URLs (2,543)
- Validate in parallel (10 workers)
- Generate validation report
- Create broken links CSV
5. Diff (optional):
- Fetch monitored sources
- Compare content hashes
- Detect changes
- Generate diff report
## Extensibility Points

### Adding a New Scraper
- Create `scrapers/new_source.py`
- Implement `scrape()` function
- Add to `SCRAPER_MAP` in `pipeline_runner.py`
- Add tests in `tests/test_new_source.py`
Template:

```python
def scrape(base_url=None, countries=None):
    """Download documents. Returns list of file paths."""
    # Implementation
    return downloaded_files
```
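The `SCRAPER_MAP` dispatch in `pipeline_runner.py` might work along these lines; the stub `scrape` functions below stand in for real scraper modules, and the mapping's exact shape is an assumption:

```python
def au_policy_scrape() -> list[str]:
    """Stub standing in for scrapers.au_policy.scrape."""
    return ["data/raw/au_policy/doc.pdf"]


def new_source_scrape() -> list[str]:
    """Stub standing in for the newly added scraper."""
    return []


SCRAPER_MAP = {
    "au_policy": au_policy_scrape,
    "new_source": new_source_scrape,  # key used with --source new_source
}


def run_source(source: str) -> list[str]:
    """Look up and run the scraper registered for a source key."""
    if source not in SCRAPER_MAP:
        raise ValueError(f"Unknown source: {source}")
    return SCRAPER_MAP[source]()
```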
### Adding a New Processor
- Create `processors/new_processor.py`
- Implement `convert(input_path, output_dir)` function
- Update `fallback_handler.py` if needed
- Add tests
### Adding New Tags
- Edit `configs/tags_vX.json`
- Add new tag categories and regex patterns
- Update version in `configs/tags_main.json`
- Run tagging: `python pipeline_runner.py --tags-version vX`
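A hypothetical `configs/tags_vX.json` fragment illustrating the steps above; the actual schema is not shown in this document, so the field names and patterns here are assumptions:

```json
{
  "version": "vX",
  "tags": {
    "AI": ["\\bartificial intelligence\\b", "\\bAI\\b"],
    "DataProtection": ["data protection", "\\bGDPR\\b"]
  }
}
```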
### Adding Scorecard Indicators
- Edit `data/scorecard/scorecard_main_presentation.xlsx`
- Add new column for indicator
- Add source URLs
- Update `INDICATOR_COLUMNS` in `processors/scorecard.py`
- Re-run enrichment
## Testing Strategy
Test Suite: 124 tests covering:
- 68 validator tests (comprehensive security checks)
- 20 scorecard tests (load, enrich, export, validate)
- 36 other tests (tagger, processors, metadata, logging)
Test Organization:

```
tests/
├── test_validators.py        # Input validation
├── test_scorecard.py         # Scorecard system
├── test_tagger.py            # Tagging logic
├── test_metadata.py          # Metadata operations
├── test_logging.py           # Logging system
├── test_fallback_handler.py  # Multi-format processing
└── conftest.py               # Pytest configuration
```
Run tests:

```bash
pytest tests/ -v                     # All tests
pytest tests/test_validators.py -v   # Specific module
pytest tests/ --cov                  # With coverage
```
## Performance Considerations

### Bottlenecks
- Scraping: Network I/O bound
  - Mitigated by: Timeouts, skip existing files
- PDF Processing: CPU bound
  - Mitigated by: Fallback handler, efficient PyPDF2 usage
- URL Validation: Network I/O bound
  - Mitigated by: Parallel workers (10 concurrent), caching
### Scalability
Current scale:
- 194 countries
- 2,543 source URLs
- 7 data sources
- Processing hundreds of documents
Future scale: System designed to handle thousands of documents with:
- Incremental processing (skip processed files)
- Efficient caching
- Modular architecture
### Optimization Opportunities
- Database instead of JSON (PostgreSQL for metadata)
- Async scrapers (aiohttp)
- Distributed processing (Celery)
- Content delivery network (CDN for exports)
## Security Architecture

### Input Validation
All external inputs are validated through `validators.py`:
- URLs validated before HTTP requests
- File paths validated before file operations
- Configs validated before loading
### Attack Surface
Minimized by:
- No user authentication (static site deployment)
- No database (JSON-based metadata)
- No eval/exec of untrusted code
- Sandboxed scraping (timeout limits)
Protected against:
- Path traversal attacks
- Malicious URL injection
- File upload vulnerabilities
- XSS (no dynamic web content)
## Deployment Architecture

### Development
```
Local Machine
├── Python 3.12 virtual environment
├── Git repository
├── Pre-commit hooks
└── Pytest for testing
```
### Production (Planned)
```
GitHub Repository
├── GitHub Actions (CI/CD)
│   ├── Run tests
│   ├── Check code quality
│   └── Deploy docs
├── GitHub Pages (Static site)
│   ├── MkDocs-generated docs
│   ├── Scorecard visualizations
│   └── Custom domain (GRIMdata.org)
└── Data Files (Gitignored)
    ├── Scraped documents
    ├── Processed text
    └── Export CSVs
```
## Technology Stack
Core:
- Python 3.12
- BeautifulSoup4 (HTML parsing)
- Selenium (dynamic scraping)
- pandas (data manipulation)
- PyPDF2 (PDF processing)
Testing:
- pytest
- pytest-cov
- pre-commit hooks
Code Quality:
- black (formatting)
- isort (import sorting)
- flake8 (linting)
- mdformat (markdown)
Documentation:
- MkDocs (site generation)
- Material theme
- 25 markdown files
Deployment:
- GitHub Pages
- Custom domains
- Automatic SSL
## Future Architecture

### Phase 3: Advanced Processing
- Recommendations extraction (NLP-based)
- Timeline analysis
- Comparison across versions
### Phase 4: Research Dashboard
- Flask backend (REST API)
- React/Vue frontend
- Interactive visualizations (D3.js, Plotly)
- Database migration (PostgreSQL)
### Phase 5: Global Expansion
- Multi-language support
- Additional regions (Europe, Asia, Americas)
- Machine learning for classification
- Automated report generation
## Integration Points

### External Systems
Currently integrates with:
- GitHub (version control, CI/CD)
- Public data sources (AU, OHCHR, UPR, etc.)
Planned integrations:
- Zotero (citation management)
- SPARQL endpoints (semantic queries)
- Research databases
### APIs
Current: None (CLI-based)
Planned: REST API for:
- `/api/documents`: Search and filter
- `/api/scorecard`: Query indicators
- `/api/export`: Download datasets
## Further Reading
- PIPELINE_FLOW.md - Detailed pipeline flow
- DIRECTORY_STRUCTURE.md - File organization
- METADATA_SCHEMA.md - Metadata structure
- SCORECARD_WORKFLOW.md - Scorecard system details
- VALIDATORS_USAGE.md - Validation framework
Last updated: January 2026