Project Roadmap
This roadmap outlines the milestones for the GRIMdata / LittleRainbowRights pipeline.
Phase 1: Core Pipeline (✅ COMPLETE)
- Project scaffolding (init_project.py)
- AU Policy scraper (requests and Selenium variants)
- PDF → text processor
- DOCX → text processor
- HTML → text processor
- Fallback handler for multi-format processing
- Tagging system (v1, v2, v3, digital versions)
- Tags version management (tags_main.json)
- Metadata tracking with history
- Unified logging system with per-module logs
- Comprehensive test suite (124 tests passing)
- Documentation (setup, structure, standards, pipeline flow)
- CI/CD pipeline with GitHub Actions
- Pre-commit hooks (black, isort, flake8, markdown, yaml)
Phase 2: Data Enrichment & Validation (✅ COMPLETE)
Scorecard System
- Scorecard data loading (194 countries, 10 indicators each)
- Scorecard metadata enrichment
- Scorecard CSV exports (summary, sources, by-indicator, by-region)
- URL validation system (parallel workers, retry logic)
- Source change detection and monitoring
- Diff checking for stale scorecard entries
- Integration with pipeline runner
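The URL validation item above mentions parallel workers and retry logic. A minimal sketch of how the two can be combined follows. It is not the project's actual implementation: the function names and the caller-supplied `fetch` callable are hypothetical, and in practice `fetch` might wrap an HTTP HEAD request that returns a status code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_with_retry(url: str, fetch: Callable[[str], int],
                     retries: int = 3, backoff: float = 0.0) -> bool:
    """Return True if `fetch` reports a 2xx/3xx status within `retries` attempts."""
    for attempt in range(retries):
        try:
            status = fetch(url)
            if 200 <= status < 400:
                return True
        except OSError:
            pass  # treat network-level errors as a failed attempt
        if attempt < retries - 1:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff between tries
    return False


def validate_urls(urls: Iterable[str], fetch: Callable[[str], int],
                  workers: int = 8, retries: int = 3) -> Dict[str, bool]:
    """Check many URLs concurrently and map each URL to a pass/fail flag."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: check_with_retry(u, fetch, retries), url_list)
        return dict(zip(url_list, results))
```

Passing `fetch` in as a parameter keeps the retry and concurrency logic testable without network access.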
Validation & Security
- Centralized validators module (68 tests)
- URL validation with malicious pattern blocking
- Path validation with traversal protection
- File validation (size limits, extension checks)
- String validation (length, patterns, regex)
- Config validation (JSON, tags, structure)
- Schema validation (metadata, documents)
- Security improvements across all processors
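To illustrate two of the checks listed above, the sketch below shows one common way to implement path traversal protection and scheme-based URL filtering with the standard library. The function names and the allowed-scheme set are assumptions made for illustration; the project's validators module may work differently.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed allow-list; schemes such as javascript:, file:, and data: are rejected.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_path(base_dir: str, candidate: str) -> bool:
    """Return True only if `candidate` resolves to a location inside `base_dir`."""
    base = Path(base_dir).resolve()
    target = (base / candidate).resolve()  # collapses ".." and follows symlinks
    return target == base or base in target.parents


def is_safe_url(url: str) -> bool:
    """Return True for http(s) URLs that include a host."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False  # malformed URL, e.g. an unbalanced IPv6 bracket
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)
```

Resolving both paths before comparing them means that `../` segments and absolute paths are judged by where they actually point, not by how they are spelled.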
Expanded Sources
- OHCHR Treaty Body scraper
- UPR documents scraper
- UNICEF reports scraper
- ACERWC scraper
- ACHPR scraper
- Manual upload ingestion (via data/raw/manual/)
- Static URL dictionary processing
Phase 3: Advanced Processing (⏳ IN PROGRESS)
Recommendations System
- Recommendations extraction (regex-based)
- Recommendations config format (recs_v1.json)
- Recommendations versioning and history tracking
- NLP-based recommendations extraction (future)
- Recommendations export to CSV
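As a rough illustration of regex-based recommendations extraction, the sketch below splits text into sentences and keeps those containing a trigger phrase. The trigger list and function name are hypothetical; the roadmap does not show the schema of recs_v1.json, so nothing here should be read as its actual contents.

```python
import re
from typing import List, Sequence

# Hypothetical trigger phrases; the real patterns would come from the recs config.
TRIGGER_PATTERNS = [r"\b(?:recommends|urges|calls upon|encourages)\b"]


def extract_recommendations(text: str,
                            patterns: Sequence[str] = TRIGGER_PATTERNS) -> List[str]:
    """Return the sentences in `text` that match any trigger pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if any(c.search(s) for c in compiled)]
```

A sentence split this simple will mishandle abbreviations and numbered paragraphs, which is one reason an NLP-based extractor is listed as future work.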
Comparison & Analysis
- Timeline exports (tags_timeline.py)
- Comparison across tagging versions
- Comparison across recommendations versions
- Comparison export to CSV with version headers
- Year-over-year trend analysis
Enhanced Normalization
- Country/region normalization with ISO codes
- Preservation of _raw fields for provenance
- Complete ISO 3166-1 alpha-2 mapping
- Automatic doc type classification (Policy, Law, TreatyBody, etc.)
- Source reliability scoring
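The normalization items above pair ISO codes with preserved raw values. A minimal sketch of that idea is shown below, assuming a `country` field and a `country_raw` companion field; both field names, the function name, and the three-entry lookup table are illustrative and not taken from the project.

```python
from typing import Any, Dict, Optional

# Tiny illustrative subset of an ISO 3166-1 alpha-2 lookup table.
ISO_ALPHA2: Dict[str, str] = {
    "kenya": "KE",
    "south africa": "ZA",
    "nigeria": "NG",
}


def normalize_country(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with an ISO code and the original value preserved."""
    raw = record.get("country", "")
    code: Optional[str] = ISO_ALPHA2.get(raw.strip().lower())
    normalized = dict(record)          # leave the caller's record untouched
    normalized["country_raw"] = raw    # provenance: keep the value exactly as scraped
    normalized["country"] = code       # None signals an unmapped name
    return normalized
```

Keeping the raw value next to the normalized one makes it possible to audit or redo the mapping later without re-scraping the source.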
Phase 4: Research Dashboard (🔜 PLANNED)
Backend API
- Flask backend for serving metadata/exports
- RESTful API endpoints:
  - /api/documents - list/filter documents
  - /api/tags - tag frequency and filters
  - /api/scorecard - country indicators
  - /api/timeline - temporal analysis
  - /api/export - download datasets
- Authentication and rate limiting
- Caching layer for performance
Visualization Frontend
- Interactive dashboard (React or Vue.js)
- Tag frequency bar charts and heatmaps
- Timeline visualizations (D3.js or Plotly)
- Country/region filtering
- Scorecard indicator displays
- Interactive filters (region, country, tags, year, source)
- Export/download UI for datasets
- Mobile-responsive design
Charts & Analysis
- Tag frequency bar charts
- Timeline view (tags over time)
- Geographic heatmaps (countries × tags)
- Comparison mode (versions side by side)
- Scorecard indicator visualizations
- Gap analysis (missing data visualization)
- Correlation analysis (tags vs indicators)
Phase 5: Global Expansion (📅 FUTURE)
Geographic Expansion
- European sources (EU, Council of Europe)
- Asian sources (ASEAN, national bodies)
- Americas sources (OAS, IACHR)
- Merge African + global content
- Multi-language support (translation pipeline)
Advanced Features
- Machine learning for document classification
- Automated entity extraction (organizations, people, dates)
- Sentiment analysis on recommendations
- Network analysis (document citations)
- Automated report generation
- Email alerts for new documents
- Collaborative annotation tools
Integration & API
- Public API for researchers
- Integration with human rights databases
- Data export to common formats (JSON-LD, RDF)
- Citation management integration (Zotero, Mendeley)
- SPARQL endpoint for semantic queries
Phase 6: Sustainability & Community (📅 FUTURE)
Infrastructure
- Cloud deployment (AWS/GCP/Azure)
- Automated daily scraper runs
- Database migration (from JSON to PostgreSQL)
- Data versioning and snapshots
- Disaster recovery and backups
Community & Documentation
- Contributor guidelines
- Research methodology documentation
- User guides and tutorials
- Video walkthroughs
- Academic publications and citations
- Conference presentations
Quality & Maintenance
- Automated data quality checks
- Link rot monitoring and alerts
- Performance optimization (caching, indexing)
- Code refactoring and technical debt reduction
- Security audits and updates
Current Status Summary
Completed:
- ✅ Core pipeline (scraping, processing, tagging)
- ✅ Scorecard system (194 countries, 2,543 source URLs)
- ✅ Validation and security framework
- ✅ 124 tests (100% passing)
- ✅ Comprehensive documentation (24+ files)
In Progress:
- ⏳ Recommendations extraction system
- ⏳ Timeline and comparison exports
- ⏳ Enhanced doc type classification
Next Priority:
- 🎯 Complete recommendations system
- 🎯 Build comparison/timeline exports
- 🎯 Begin research dashboard prototyping
Metrics
- Lines of Code: 15,000+ (Python, config, tests)
- Tests: 124 (100% passing)
- Documentation: 24 markdown files, 1 comprehensive guide (CLAUDE.md)
- Data Sources: 7 (AU, OHCHR, UPR, UNICEF, ACERWC, and ACHPR scrapers, plus manual upload)
- Countries Tracked: 194 (via scorecard)
- Indicators: 10 per country
- Source URLs: 2,543 tracked and validated
- Tags Versions: 4 (v1, v2, v3, digital)
Contributing
The project is actively developed. Contributions are welcome in the following areas:
- New scrapers for additional sources
- Enhanced processors (OCR, image extraction)
- Visualization components for dashboard
- Documentation improvements and examples
- Testing coverage expansion
- Performance optimizations
See CLAUDE.md for the development guide.
Notes
- The end-to-end pipeline is production-ready for the AU Policy and scorecard workflow
- Future work focuses on expanding analytics and building the research dashboard
- All core infrastructure is stable and well-tested
- Documentation is comprehensive and up-to-date
Last updated: January 2026