Scorecard System Workflow¶
This document describes the complete scorecard system for the DigitalChild project, which tracks 10 human rights indicators across countries and enriches document metadata.
Overview¶
The scorecard system provides country-level data on digital child protection policies and LGBTQ+ rights. It consists of:
- Data Source: data/scorecard/scorecard_main_presentation.xlsx - Canonical scorecard file (Sept 2025 conference)
- Loader: processors/scorecard.py - Loads and caches scorecard data
- Enricher: processors/scorecard_enricher.py - Adds scorecard data to document metadata
- Exporter: processors/scorecard_export.py - Creates CSV exports for website/analysis
- Validator: processors/scorecard_validator.py - Checks source URLs for broken links
- Diff Checker: processors/scorecard_diff.py - Monitors sources for changes
Indicators¶
The scorecard tracks 10 indicators (each with value + source URL):
- AI_Policy_Status - National AI policy/strategy status
- Data_Protection_Law - Data protection/privacy legislation
- Children_Data_Safeguards - Child-specific data protection measures
- SOGI_Sensitive_Data - Sexual orientation/gender identity data protections
- DPA_Independence - Data Protection Authority independence
- DPIA_Required_High_Risk_AI - Data protection impact assessments for AI
- LGBTQ_Legal_Status - Legal status of LGBTQ+ people
- Promotion_Propaganda_Offences - Anti-LGBTQ+ propaganda laws
- COP_Strategy - Child online protection strategy
- SIM_Biometric_ID_Linkage - SIM registration and biometric requirements
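As an illustration of the value + source pairing, the sketch below reads one spreadsheet row (given as a plain dict) into indicator entries. The `IndicatorEntry` class and `read_indicators` function are hypothetical names, not part of the project's loader, and the `<indicator>_Source` column naming is an assumption taken from the "Add New Indicator" section.

```python
from dataclasses import dataclass
from typing import Optional

# Indicator names as listed in this document.
INDICATORS = [
    "AI_Policy_Status",
    "Data_Protection_Law",
    "Children_Data_Safeguards",
    "SOGI_Sensitive_Data",
    "DPA_Independence",
    "DPIA_Required_High_Risk_AI",
    "LGBTQ_Legal_Status",
    "Promotion_Propaganda_Offences",
    "COP_Strategy",
    "SIM_Biometric_ID_Linkage",
]


@dataclass
class IndicatorEntry:
    """Hypothetical container for one indicator of one country."""
    value: str
    source: Optional[str]


def read_indicators(row: dict) -> dict:
    """Collect value/source pairs from one row (column name -> cell value).

    Assumes the source column is named '<indicator>_Source'.
    Indicators whose value cell is empty are skipped.
    """
    entries = {}
    for name in INDICATORS:
        value = row.get(name)
        if value is None or str(value).strip() == "":
            continue
        source = row.get(f"{name}_Source") or None
        entries[name] = IndicatorEntry(value=str(value).strip(), source=source)
    return entries
```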
Architecture¶
      scorecard_main_presentation.xlsx
                    ↓
          scorecard.py (loader)
                    ↓
        ┌───────────┴───────────┐
        ↓                       ↓
scorecard_enricher.py   scorecard_export.py
  (add to metadata)        (CSV exports)
        ↓
  metadata.json
(enriched documents)
Workflows¶
1. Initial Scorecard Setup¶
# 1. Scorecard file located at data/scorecard/scorecard_main_presentation.xlsx
# 2. Test scorecard loads correctly
python -c "from processors.scorecard import load_scorecard; print(load_scorecard())"
# 3. Run tests to verify
pytest tests/test_scorecard.py -v
2. Enrich Document Metadata¶
Add scorecard indicators to documents based on their country:
# Enrich all documents in metadata.json
python processors/scorecard_enricher.py
# Dry run (don't save changes)
python processors/scorecard_enricher.py --dry-run
# Show enrichment summary only
python processors/scorecard_enricher.py --summary
Programmatic usage:
from processors.scorecard_enricher import enrich_document, enrich_all_metadata
# Enrich single document
doc = {"id": "doc-1", "country": "Albania"}
enriched_doc = enrich_document(doc)
# Enrich all metadata
stats = enrich_all_metadata(save=True)
print(f"Enriched {stats['enriched']} documents")
Output format:
{
  "id": "doc-1",
  "country": "Albania",
  "scorecard": {
    "matched_country": "Albania",
    "enriched_at": "2024-01-15T10:30:00Z",
    "indicators": {
      "AI_Policy_Status": {
        "value": "Draft policy under development (2023)",
        "source": "https://..."
      },
      "Data_Protection_Law": {
        "value": "Law No. 9887 (2008), aligned with GDPR",
        "source": "https://..."
      }
      // ... 8 more indicators
    }
  }
}
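To make the output shape concrete, here is a minimal sketch that builds the same structure from an in-memory scorecard. It is not the project's `enrich_document` implementation: `enrich_sketch`, `build_scorecard_block`, and the `{country: {indicator: (value, source)}}` input layout are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def build_scorecard_block(country: str, indicators: dict) -> dict:
    """Build the 'scorecard' block for one country.

    `indicators` maps indicator name -> (value, source URL).
    """
    return {
        "matched_country": country,
        # UTC timestamp in the same ISO 8601 form as the example output.
        "enriched_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "indicators": {
            name: {"value": value, "source": source}
            for name, (value, source) in indicators.items()
            if value  # empty values are skipped
        },
    }


def enrich_sketch(doc: dict, scorecard: dict) -> dict:
    """Return a copy of `doc` with a scorecard block if its country is known."""
    enriched = dict(doc)
    country = doc.get("country")
    if country in scorecard:
        enriched["scorecard"] = build_scorecard_block(country, scorecard[country])
    return enriched
```

Documents whose country is not in the scorecard are returned unchanged in this sketch, so callers can tell enriched from unenriched documents by the presence of the `scorecard` key.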
3. Export Scorecard Data¶
Generate CSV exports for website/analysis:
# From Python code
from processors.scorecard_export import export_scorecard
exports = export_scorecard()
# Returns:
# {
# "summary": "data/exports/scorecard_summary.csv",
# "sources": "data/exports/scorecard_sources.csv",
# "indicator_counts": "data/exports/scorecard_indicator_counts.csv"
# }
Export types:
- Summary CSV: All countries with all indicators (for main table)
- Sources CSV: All source URLs (for verification/citation)
- Indicator Counts: Distribution of values per indicator (for charts)
- By Indicator: Individual CSV per indicator
- By Region: Countries filtered by region
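The "Indicator Counts" export can be pictured as a value tally per indicator. The sketch below shows one way to compute and serialize such a tally with the standard library; the function names and the `indicator,value,count` column layout are assumptions, and the real exporter may differ.

```python
import csv
import io
from collections import Counter


def indicator_counts(rows, indicator):
    """Count how often each non-empty value of `indicator` occurs in `rows`."""
    counts = Counter()
    for row in rows:
        value = (row.get(indicator) or "").strip()
        if value:
            counts[value] += 1
    return counts


def counts_to_csv(counts, indicator):
    """Serialize the counts as CSV text, most frequent value first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["indicator", "value", "count"])
    for value, count in counts.most_common():
        writer.writerow([indicator, value, count])
    return buf.getvalue()
```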
Programmatic usage:
from processors.scorecard_export import ScorecardExporter
exporter = ScorecardExporter()
# Export specific region
exporter.export_by_region("Africa", "data/exports/africa.csv")
# Export specific indicator
exporter.export_by_indicator("LGBTQ_Legal_Status", "data/exports/lgbtq_status.csv")
# Export all at once
exports = exporter.export_all()
4. Validate Source URLs¶
Check all source URLs for broken/redirected links:
# Run validation
python processors/scorecard_validator.py
# Custom worker count
python processors/scorecard_validator.py --workers 20
# Don't save reports
python processors/scorecard_validator.py --no-save
Output:
- data/exports/scorecard_url_validation.json - Full validation report
- data/exports/scorecard_broken_links.csv - Broken links only (for review)
Programmatic usage:
from processors.scorecard_validator import run_validation
report = run_validation(save_reports=True)
print(f"{report['ok']} OK, {report['broken']} broken, {report['redirected']} redirected")
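For readers curious how a report with `ok`, `broken`, and `redirected` counts might be produced, the following sketch classifies URLs concurrently. It is an illustration under assumptions, not the validator's actual code: `validate_urls`, `classify`, and the injected `fetch(url)` callable (returning a status code and the final URL) are hypothetical, and the classification rules are a guess at reasonable behaviour.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(status_code, final_url, original_url):
    """Map one fetch result to 'ok', 'broken', or 'redirected'."""
    if status_code is None or status_code >= 400:
        return "broken"
    if 300 <= status_code < 400 or final_url != original_url:
        return "redirected"
    return "ok"


def validate_urls(urls, fetch, max_workers=10):
    """Check URLs in parallel.

    `fetch(url)` is a caller-supplied function returning
    (status_code, final_url); any exception counts as a broken link.
    """
    def check(url):
        try:
            status, final_url = fetch(url)
        except Exception:
            status, final_url = None, url
        return url, classify(status, final_url, url)

    report = {"ok": 0, "broken": 0, "redirected": 0, "details": {}}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, outcome in pool.map(check, urls):
            report[outcome] += 1
            report["details"][url] = outcome
    return report
```

Passing `fetch` in as a parameter keeps the sketch free of network calls, so it can be exercised with a stub in tests.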
5. Monitor for Changes¶
Check monitored sources for content changes:
# Check all monitored sources
python processors/scorecard_diff.py
# Check specific country sources
python processors/scorecard_diff.py --country "South Africa"
# Check sources only (skip stale entry detection)
python processors/scorecard_diff.py --sources-only
Monitored sources:
- UNESCO AI Policy Observatory
- UNCTAD Data Protection Tracker
- ILGA World Maps
- Human Dignity Trust
- GSMA SIM Registration
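One common way to detect that a monitored page has changed is to compare a hash of its content with a cached hash from the previous check. The sketch below shows that technique; whether `scorecard_diff.py` works this way is an assumption, and `has_changed`, `content_fingerprint`, and the one-JSON-file-per-source cache layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(name: str, text: str, cache_dir: Path) -> bool:
    """Compare `text` with the cached fingerprint for source `name`.

    The cache is updated on every call. The first time a source is seen
    there is nothing to compare against, so it is reported as unchanged.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{name}.json"
    new_hash = content_fingerprint(text)
    old_hash = None
    if cache_file.exists():
        old_hash = json.loads(cache_file.read_text(encoding="utf-8")).get("sha256")
    cache_file.write_text(json.dumps({"sha256": new_hash}), encoding="utf-8")
    return old_hash is not None and old_hash != new_hash
```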
Programmatic usage:
from processors.scorecard_diff import run_diff_check, check_country_sources
# Full check
report = run_diff_check(save_report=True)
# Check specific country
results = check_country_sources("Kenya")
Integration with Pipeline¶
The scorecard enrichment is not part of the main pipeline (pipeline_runner.py) by default. It's a separate step run after documents are processed.
Typical workflow:
# 1. Run pipeline to scrape and process documents
python pipeline_runner.py --source upr --country "Kenya"
# 2. Enrich metadata with scorecard
python processors/scorecard_enricher.py
# 3. Export scorecard data for website
python -c "from processors.scorecard_export import export_scorecard; export_scorecard()"
Adding to Pipeline (Optional)¶
To integrate scorecard enrichment into the pipeline:
# In pipeline_runner.py, after process_documents():
from processors.scorecard_enricher import enrich_all_metadata
# After processing is complete
if args.enrich_scorecard:
    logger.info("Enriching metadata with scorecard indicators...")
    stats = enrich_all_metadata(save=True)
    logger.info(f"Enriched {stats['enriched']} documents")
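The snippet above reads `args.enrich_scorecard`, which implies a matching command-line flag. A minimal sketch of how such a flag could be declared with `argparse` follows; the flag name `--enrich-scorecard` and the `build_parser` helper are assumptions, and the actual argument parser in pipeline_runner.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with an opt-in scorecard enrichment flag."""
    parser = argparse.ArgumentParser(description="Pipeline runner (sketch)")
    parser.add_argument("--source", help="Source to scrape, e.g. upr")
    parser.add_argument("--country", help="Country to process")
    # argparse exposes '--enrich-scorecard' as args.enrich_scorecard.
    parser.add_argument(
        "--enrich-scorecard",
        action="store_true",
        help="Enrich metadata with scorecard indicators after processing",
    )
    return parser
```

Because the flag uses `store_true`, enrichment stays off unless it is requested explicitly, which matches the default described in this section.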
Maintenance Tasks¶
Update Scorecard Data¶
- Edit data/scorecard/scorecard_main_presentation.xlsx with new data
- Force reload: load_scorecard(force_reload=True)
- Re-enrich metadata: python processors/scorecard_enricher.py
- Re-export: python -c "from processors.scorecard_export import export_scorecard; export_scorecard()"
Verify Data Quality¶
# Check for broken links
python processors/scorecard_validator.py
# Check for stale entries
python processors/scorecard_diff.py
# Run all scorecard tests
pytest tests/test_scorecard.py -v
Add New Indicator¶
- Add column pair to data/scorecard/scorecard_main_presentation.xlsx:
  - New_Indicator (value column)
  - New_Indicator_Source (source URL column)
- Update INDICATOR_COLUMNS in processors/scorecard.py to include the new indicator
- Re-run enrichment and exports
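Before re-running enrichment, it can help to confirm that the spreadsheet really contains both columns for every indicator. The helper below is a hypothetical check, not part of the project; it assumes the `<indicator>_Source` naming shown in the steps above and takes the header row as a plain list of strings.

```python
def missing_columns(headers, indicators):
    """Return the value/source column names that are absent from `headers`."""
    present = set(headers)
    missing = []
    for name in indicators:
        # Each indicator needs a value column and a matching source column.
        for column in (name, f"{name}_Source"):
            if column not in present:
                missing.append(column)
    return missing


# Example: the source column for New_Indicator was forgotten.
headers = ["Country", "COP_Strategy", "COP_Strategy_Source", "New_Indicator"]
print(missing_columns(headers, ["COP_Strategy", "New_Indicator"]))
```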
File Locations¶
- Source Data: data/scorecard/scorecard_main_presentation.xlsx (canonical file)
- Archived Data: data/archive/ (superseded scorecard files)
- Exports: data/exports/scorecard_*.csv
- Validation Reports: data/exports/scorecard_url_validation.json
- Diff Reports: data/exports/scorecard_diff_report.json
- Cache: data/cache/scorecard_sources/*.json
Testing¶
# Run all scorecard tests
pytest tests/test_scorecard.py -v
# Run specific test class
pytest tests/test_scorecard.py::TestScorecardLoader -v
# Run with coverage
pytest tests/test_scorecard.py --cov=processors.scorecard --cov-report=html
Troubleshooting¶
Country Not Found¶
Problem: Document country doesn't match scorecard country names
Solution: The loader tries multiple normalization methods:
- Exact match (case-insensitive)
- ISO code lookup
- Fuzzy matching
Check country names in metadata vs scorecard:
from processors.scorecard import get_countries_list
countries = get_countries_list()
print(countries) # List all scorecard countries
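The three matching steps listed above (exact, ISO code, fuzzy) can be sketched with the standard library as follows. This is an approximation for illustration, not the loader's actual logic: `match_country`, the small `ISO_CODES` table, and the 0.8 similarity cutoff are all assumptions.

```python
import difflib

# Illustrative subset of ISO 3166-1 alpha-3 codes; the project's mapping
# is documented separately and is likely more complete.
ISO_CODES = {"ALB": "Albania", "KEN": "Kenya", "ZAF": "South Africa"}


def match_country(name, countries, iso_codes=ISO_CODES, cutoff=0.8):
    """Resolve `name` to one of `countries`, or return None if nothing fits."""
    query = name.strip()
    by_lower = {c.lower(): c for c in countries}

    # 1. Exact match (case-insensitive)
    if query.lower() in by_lower:
        return by_lower[query.lower()]

    # 2. ISO code lookup
    iso_match = iso_codes.get(query.upper())
    if iso_match in countries:
        return iso_match

    # 3. Fuzzy matching on lowercased names
    close = difflib.get_close_matches(query.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None
```

A stricter cutoff reduces false matches between similarly spelled countries at the cost of rejecting more typos, so the threshold is worth tuning against real metadata.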
Missing Indicators¶
Problem: Enriched document missing some indicators
Solution: Check for empty cells in data/scorecard/scorecard_main_presentation.xlsx. Empty values are skipped.
Validation Timeouts¶
Problem: URL validation takes too long or requests time out
Solution: Reduce the worker count (fewer concurrent requests run slower but time out less often) or increase the timeout:
from processors.scorecard_validator import validate_all_urls
report = validate_all_urls(max_workers=5) # Slower but more reliable
Future Enhancements¶
- Auto-update from sources: Automatically scrape monitored sources and update scorecard
- Version tracking: Track scorecard changes over time
- API endpoint: Serve scorecard data via REST API for website
- Visualization: Generate charts/maps from scorecard data
- Comparison mode: Compare countries side-by-side
- Timeline view: Show indicator changes over time per country
Related Documentation¶
- METADATA_SCHEMA.md - Document metadata structure
- PIPELINE_FLOW.md - Main pipeline workflow
- ISO_MAPPING.md - Country code standards