Project Directory Structure
This document explains the purpose of each folder and subfolder in the DigitalChild project.
Top-Level Files
- pipeline_runner.py → Main entry point for running pipelines (scraper, urls, scorecard modes)
- init_project.py → Bootstrap script to create directory structure and placeholders
- requirements.txt → Python dependencies
- CLAUDE.md → Comprehensive guide for Claude Code AI assistant
- README.md → Project quickstart and documentation
- scorecard_main.xlsx → Scorecard data (194 countries, 10 indicators each)
- .github/workflows/ci.yml → GitHub Actions CI pipeline
- .pre-commit-config.yaml → Pre-commit hooks configuration
Scrapers (scrapers/)
Code for fetching raw documents from various sources.
NO __init__.py - Modules are imported directly by name.
Implemented Scrapers
- au_policy.py → African Union policy documents
- au_policy_sel.py → AU policy (Selenium variant)
- ohchr.py → OHCHR Treaty Body database documents
- ohchr_sel.py → OHCHR (Selenium variant)
- upr.py → Universal Periodic Review documents
- upr_sel.py → UPR (Selenium variant)
- unicef.py → UNICEF reports and publications
- unicef_sel.py → UNICEF (Selenium variant)
- acerwc.py → African Committee of Experts on the Rights and Welfare of the Child
- acerwc_sel.py → ACERWC (Selenium variant)
- achpr.py → African Commission on Human and Peoples' Rights
- achpr_sel.py → ACHPR (Selenium variant)
Notes:
- Each scraper implements a scrape() function
- Returns list of downloaded file paths
- Writes to data/raw/<source>/
- Skips existing files automatically
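The contract in the notes above could look roughly like this. It is a minimal sketch, not the project's code: the `documents` mapping, the `fetch` callable, and the `example_source` folder are hypothetical stand-ins for a real source and network client, and returning only newly saved files is an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, List


def scrape(
    documents: Dict[str, str],
    fetch: Callable[[str], bytes],
    raw_dir: Path = Path("data/raw/example_source"),
) -> List[str]:
    """Save each document into raw_dir and return the paths written.

    `documents` maps a target filename to its URL and `fetch` returns the
    body for a URL. Both are assumptions made for this sketch.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for filename, url in documents.items():
        target = raw_dir / filename
        if target.exists():
            # Skip files that a previous run already saved.
            continue
        target.write_bytes(fetch(url))
        downloaded.append(str(target))
    return downloaded
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real scraper would more likely call its HTTP client directly.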
Processors (processors/)
Code for processing raw documents and enriching metadata.
HAS __init__.py - Proper Python package.
Document Processing
- pdf_to_text.py → Extract text from PDF files
- docx_to_text.py → Extract text from DOCX files
- html_to_text.py → Extract text from HTML files
- fallback_handler.py → Tries multiple processors until success
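The "try processors until one succeeds" idea behind fallback_handler.py can be sketched as below. The function name and the rule that an empty result counts as a failure are assumptions for illustration; the real module may behave differently.

```python
from typing import Callable, Iterable, Optional


def extract_with_fallback(
    path: str, processors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Return the text from the first processor that succeeds, else None."""
    for processor in processors:
        try:
            text = processor(path)
        except Exception:
            # This processor could not handle the file; try the next one.
            continue
        if text:
            # Assumption: an empty string is treated as a failed extraction.
            return text
    return None
```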
Feature Extraction
- tagger.py → Applies regex-based tag rules to documents
- tags_summary.py → Generates CSV summary of tag frequencies
- recommendations.py → Extracts recommendations from treaty body docs (regex-based)
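Regex-based tagging, as tagger.py is described, might work along these lines. The rule layout (a tag name mapped to a list of patterns) and the sample rules are hypothetical; the actual format is defined in docs/standards/TAGS_CONFIG_FORMAT.md.

```python
import re
from typing import Dict, List


def apply_tags(text: str, rules: Dict[str, List[str]]) -> List[str]:
    """Return the sorted tags whose patterns match anywhere in text."""
    matched = []
    for tag, patterns in rules.items():
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            matched.append(tag)
    return sorted(matched)


# Hypothetical rules, for illustration only.
rules = {
    "child_rights": [r"\bchild(ren)?\b"],
    "privacy": [r"\bprivacy\b", r"\bdata protection\b"],
}
print(apply_tags("Children's data protection online", rules))
# ['child_rights', 'privacy']
```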
Scorecard System
- scorecard.py → Loads scorecard data from Excel, provides query functions
- scorecard_enricher.py → Enriches metadata with country-level indicators
- scorecard_export.py → Exports scorecard to CSV formats (summary, sources, by-indicator, by-region)
- scorecard_validator.py → Validates all source URLs (parallel workers, retry logic)
- scorecard_diff.py → Monitors sources for changes, detects stale entries
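As a rough illustration of "parallel workers, retry logic" in scorecard_validator.py, the sketch below checks URLs in a thread pool and retries each one a few times. The function names, the `check` callable, and the default worker and retry counts are all assumptions, not the module's actual interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_with_retry(
    url: str, check: Callable[[str], bool], retries: int = 3, delay: float = 0.0
) -> bool:
    """Call check(url) up to `retries` times; True on the first success."""
    for attempt in range(retries):
        try:
            if check(url):
                return True
        except Exception:
            # Treat any error as a failed attempt and retry.
            pass
        if attempt < retries - 1:
            time.sleep(delay)
    return False


def validate_urls(
    urls: Iterable[str], check: Callable[[str], bool], workers: int = 4
) -> Dict[str, bool]:
    """Validate URLs concurrently and map each URL to its result."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: check_with_retry(u, check), url_list)
        return dict(zip(url_list, results))
```

In practice `check` would wrap an HTTP request; it is left as a parameter here so the retry and pooling logic can be read on its own.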
Validation & Security
- validators.py → Centralized validation module (68 tests)
  - URL validation (malicious pattern blocking)
  - Path validation (traversal protection)
  - File validation (size, extension checks)
  - String validation (length, patterns, regex)
  - Config validation (JSON, tags, structure)
  - Schema validation (metadata, documents)
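Two of the checks listed above, URL validation and path traversal protection, can be sketched with the standard library alone. The function names and the exact acceptance rules are assumptions; see docs/VALIDATORS_USAGE.md for the module's real interface.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host name."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def is_within(base_dir: str, candidate: str) -> bool:
    """Return True if candidate resolves to a location inside base_dir."""
    base = Path(base_dir).resolve()
    # Resolving collapses ".." segments, so traversal attempts land outside base.
    target = (base / candidate).resolve()
    return target == base or base in target.parents
```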
Utilities
- logger.py → Unified logging system (console + file logging)
- json_normalizer.py → Normalizes country/region names, preserves _raw fields
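The "normalize but keep the raw value" behaviour of json_normalizer.py might look like the sketch below. The alias table, the `country` key, and the `country_raw` field name are hypothetical; only the idea of preserving the original in a `_raw` field comes from the description above.

```python
from typing import Dict

# Hypothetical alias table; the project's real mapping is not shown here.
COUNTRY_ALIASES = {
    "ivory coast": "Côte d'Ivoire",
    "drc": "Democratic Republic of the Congo",
}


def normalize_country(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy with a normalized `country` and the input in `country_raw`."""
    result = dict(record)
    raw = record.get("country", "")
    result["country_raw"] = raw  # keep the value exactly as it arrived
    cleaned = raw.strip()
    result["country"] = COUNTRY_ALIASES.get(cleaned.lower(), cleaned)
    return result
```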
Configs (configs/)
JSON configuration files for tags, recommendations, and filters.
Tags Configs
- tags_v1.json → Original tag rules
- tags_v2.json → Expanded tag rules
- tags_v3.json → Current comprehensive tag rules
- tags_digital.json → Digital-specific tags
- tags_main.json → Version mapping and latest alias
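A version mapping with a latest alias, as tags_main.json is described, could be resolved as in this sketch. The JSON layout and key names are assumed for illustration (see docs/notes/TAGS_MAIN_NOTES.md for the real structure), and pointing latest at v3 is a guess based on tags_v3.json being listed as current.

```python
import json

# Hypothetical layout for tags_main.json; the real keys may differ.
TAGS_MAIN = json.loads("""
{
  "latest": "v3",
  "versions": {
    "v1": "tags_v1.json",
    "v2": "tags_v2.json",
    "v3": "tags_v3.json"
  }
}
""")


def resolve_tags_file(version: str, mapping: dict = TAGS_MAIN) -> str:
    """Map a version name, or the `latest` alias, to a config filename."""
    if version == "latest":
        version = mapping["latest"]
    return mapping["versions"][version]
```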
URL Dictionaries
- url_dict/ → Static URL dictionaries for sources
  - Used when scraping isn't feasible
  - Processed via --mode urls in pipeline_runner
Future Configs
- recommendations/ → Recommendations extraction configs (planned)
- comparison/ → Comparison configuration files (planned)
- filters/ → Filter configuration files (planned)
Data (data/)
All data directories are gitignored - not committed to version control.
Raw Data (data/raw/)
Source documents as scraped or manually uploaded.
- data/raw/au_policy/ → AU policy PDFs
- data/raw/ohchr/ → OHCHR treaty body documents
- data/raw/upr/ → UPR reports
- data/raw/unicef/ → UNICEF publications
- data/raw/acerwc/ → ACERWC documents
- data/raw/achpr/ → ACHPR documents
- data/raw/manual/ → Manually uploaded files
Processed Data (data/processed/)
Converted text, structured data, OCR results.
Organized as: data/processed/<region>/<org>/text/
Examples:
- data/processed/Africa/AU/text/
- data/processed/Global/OHCHR/text/
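The layout above maps directly onto a small path helper. The function name is hypothetical; the path pattern is the one documented here.

```python
from pathlib import Path


def processed_text_dir(region: str, org: str, root: str = "data/processed") -> Path:
    """Build data/processed/<region>/<org>/text for a region and organization."""
    return Path(root) / region / org / "text"


print(processed_text_dir("Africa", "AU").as_posix())
# data/processed/Africa/AU/text
```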
Metadata (data/metadata/)
- metadata.json → Central metadata file
  - Project identity
  - Documents list with tags_history, recommendations_history, scorecard
  - Full schema in docs/standards/METADATA_SCHEMA.md
Scorecard (data/scorecard/)
Primary scorecard data files (Excel).
- scorecard_main_presentation.xlsx → CANONICAL FILE - Most complete data (Sept 2025 conference)
  - 194 countries × 44 columns
  - Color-coded regional groups
  - Complete source URLs for all indicators
  - Regional analysis sheets (SADC, ECOWAS)
- Global_QueerAI_Child_Scorecard_MASTER.xlsx → Visualization version
- _GLOBAL_Policy_Matrix_UPR_Main_FINAL.xlsx → Source verification tool
Archive (data/archive/)
Superseded or archived scorecard files.
- scorecard_main.xlsx → Earlier version (Jan 2025)
- scorecard_main_ALL_filled_sources.xlsx → Source URL collection (Jan 2026)
Verification Documentation (data/)
Data quality verification and reconciliation analysis.
- VERIFICATION_RESULTS.md → Detailed verification of 9 data conflicts between scorecard files
- DATA_RECONCILIATION_RECOMMENDATIONS.md → Strategic recommendations for file consolidation
- REVISION_SUMMARY.md → Major correction to understanding (file maintenance vs research quality)
- VERIFICATION_CHECKLIST.md → Template for verifying high-priority conflicts
Exports (data/exports/)
Output summaries, comparisons, timelines, scorecard exports.
- tags_summary.csv → Tag frequency counts
- scorecard_summary.csv → Country indicators summary
- scorecard_sources.csv → All source URLs (2,543 tracked)
- scorecard_url_validation.json → URL validation results
- scorecard_broken_links.csv → Broken links report
- scorecard_diff_report.json → Source change detection results
- scorecard_<indicator>.csv → Per-indicator exports
- scorecard_<region>.csv → Per-region exports
Cache (data/cache/)
Temporary cache files for scorecard source monitoring.
- scorecard_sources/ → Content hashes for change detection
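Hash-based change detection of the kind this cache supports can be sketched as follows. SHA-256 and the function names are assumptions; the hash actually used by scorecard_diff.py is not stated here.

```python
import hashlib
from typing import Optional


def content_hash(content: bytes) -> str:
    """Return a hex digest that identifies this exact content."""
    return hashlib.sha256(content).hexdigest()


def has_changed(content: bytes, cached_hash: Optional[str]) -> bool:
    """True when there is no cached hash or the content no longer matches it."""
    return cached_hash is None or content_hash(content) != cached_hash
```

Storing only the digest keeps the cache small, since the source content itself never has to be saved to detect a change.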
Logs (logs/)
Gitignored - not committed to version control.
Unified and per-module logs per run.
- Format: <timestamp>_<pipeline>.log
- Example: 2025-08-28_12-22-40_au_policy_run.log
- Controlled via --no-module-logs flag in pipeline_runner.py
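The filename format above can be reproduced with `strftime`. The helper name is hypothetical and the timestamp pattern is inferred from the example filename.

```python
from datetime import datetime


def log_filename(pipeline: str, when: datetime) -> str:
    """Build <timestamp>_<pipeline>.log with a YYYY-MM-DD_HH-MM-SS timestamp."""
    return f"{when.strftime('%Y-%m-%d_%H-%M-%S')}_{pipeline}.log"


print(log_filename("au_policy_run", datetime(2025, 8, 28, 12, 22, 40)))
# 2025-08-28_12-22-40_au_policy_run.log
```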
Docs (docs/)
Comprehensive documentation (24+ markdown files).
Main Docs (docs/)
- README.md → Documentation overview and navigation
- DOCS_INDEX.md → Complete index of all documentation
- ROADMAP.md → Project roadmap and development phases
- SCORECARD_WORKFLOW.md → Complete scorecard system guide
- SCORECARD_REVIEW_SUMMARY.md → Scorecard review and analysis
- VALIDATORS_USAGE.md → Using the centralized validation module
- TAGS_VISUALIZATION_PLAN.md → Future visualization plans
- SOURCE_FEASIBILITY_CHECKLIST.md → Checklist for evaluating new sources
Implementation Notes (docs/notes/)
- DIRECTORY_STRUCTURE.md → This file
- PIPELINE_FLOW.md → End-to-end data flow
- PIPELINE_LOGGING.md → Logging system details
- TAGS_MAIN_NOTES.md → Tag version management
- TAGS_EXPORT_NOTES.md → How to read tag exports
- COMPARISON_EXPORT_NOTES.md → Comparison export format
Run Guides (docs/runs/)
- RUNBOOK.md → Common commands for pipelines and tests
- FIRST_RUN_ERRORS.md → Troubleshooting first-run issues
- PROCESSOR_TEST_RUN.md → Testing individual processors
Standards (docs/standards/)
- METADATA_SCHEMA.md → Complete metadata JSON schema
- TAGS_CONFIG_FORMAT.md → Tag configuration structure
- RECOMMENDATIONS_CONFIG_FORMAT.md → Recommendations config format
- COMPARISON_CONFIG_FORMAT.md → Comparison config format
- FILE_NAMING_STANDARDS.md → File naming conventions
- DOC_TYPE_STANDARDS.md → Document type classifications
- ISO_MAPPING.md → Country/region ISO mapping
- SCRAPER_STRUCTURE.md → How to build scrapers
Tests (tests/)
Comprehensive test suite (124 tests, 100% passing).
Test Files
- conftest.py → Pytest configuration, adds project root to path
- test_validators.py → 68 tests for validators module
- test_scorecard.py → 20 tests for scorecard system
- test_tagger.py → Tagging functionality tests
- test_metadata.py → Metadata operations tests
- test_logging.py → Logging system tests
- test_fallback_handler.py → Multi-format processing tests
- test_year_extraction.py → Year extraction tests
- test_country_region.py → Country/region normalization tests
- test_comparison.py → Comparison export tests
- test_recommendations.py → Recommendations extraction tests
- test_csv_footer.py → CSV footer formatting tests
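Because scrapers/ has no __init__.py, tests need the project root on `sys.path` to import modules by name. One common way a conftest.py does this is sketched below; the helper name is hypothetical, and the real file may simply run the equivalent lines at import time using `__file__`.

```python
import sys
from pathlib import Path


def add_project_root(conftest_file: str) -> str:
    """Put the directory above tests/ at the front of sys.path."""
    # conftest.py lives in tests/, so its grandparent is the project root.
    root = str(Path(conftest_file).resolve().parent.parent)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```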
Test Data
- test_data/ → Sample files for testing
  - Temporary files created during tests are cleaned up automatically
GitHub Actions (.github/)
CI/CD pipeline configuration.
- .github/workflows/ci.yml → Runs on push/PR to main/homebase/basecamp
  - Job 1: test - Python 3.12, runs pre-commit + pytest
  - Job 2: docs - Python 3.11, checks markdown formatting
Virtual Environment (.LittleRainbow/)
Gitignored - Python virtual environment.
- Created with: python3 -m venv .LittleRainbow
- Activated with: source .LittleRainbow/bin/activate
- Contains all dependencies from requirements.txt
Presentations (presentations/)
Conference presentations and publication materials.
- QUEERAI.pdf → Conference presentation slides (15 slides)
- QueerAI_Slides_Data.pdf → Detailed analytical frameworks (20 pages)
- Script - QueerAI.pdf → Presentation script
- README.md → Documentation of presentation materials
Conference: Second International Conference on Children's Rights, Stellenbosch, South Africa (Sept 9-11, 2025)
Repository Cleaning (repo-cleaning/)
Temporary development files used during site development.
- alignmentissue_01.PNG through alignmentissue_08.PNG → Screenshots documenting layout issues
- __stillToFix_01.PNG and __stillToFix_02.PNG → Outstanding issues
- README.md → Documentation of temporary files
Note: These files are temporary and should not be committed to the repository.
Summary
DigitalChild/
├── pipeline_runner.py # Main entry point
├── init_project.py # Bootstrap script
├── requirements.txt # Dependencies
├── CLAUDE.md # AI assistant guide
├── README.md # Project quickstart
├── scrapers/ # Data fetching (no __init__.py)
├── processors/ # Data processing (has __init__.py)
├── configs/ # Tag/rec configs, URL dictionaries
├── data/ # Data storage (gitignored)
│ ├── raw/ # Source documents
│ ├── processed/ # Converted text
│ ├── metadata/ # Central metadata.json
│ ├── scorecard/ # Scorecard Excel files (primary data)
│ ├── archive/ # Archived/superseded files
│ ├── exports/ # CSV/JSON outputs
│ ├── cache/ # Temporary cache
│ ├── VERIFICATION_RESULTS.md # Data conflict verification
│ ├── DATA_RECONCILIATION_RECOMMENDATIONS.md # Consolidation guide
│ ├── REVISION_SUMMARY.md # Analysis correction
│ └── VERIFICATION_CHECKLIST.md # Verification template
├── presentations/ # Conference materials
├── repo-cleaning/ # Temporary dev screenshots
├── logs/ # Run logs (gitignored)
├── docs/ # Documentation (24+ files)
│ ├── notes/ # Implementation notes
│ ├── runs/ # Run guides
│ └── standards/ # Format standards
├── tests/ # Test suite (124 tests)
└── .github/ # CI/CD configuration
Last updated: January 2026