Project Roadmap
This roadmap outlines the milestones for the GRIMdata / LittleRainbowRights pipeline.
Phase 1: Core Pipeline (✅ COMPLETE)
- Project scaffolding (init_project.py)
- AU Policy scraper (requests and Selenium variants)
- PDF → text processor
- DOCX → text processor
- HTML → text processor
- Fallback handler for multi-format processing
- Tagging system (v1, v2, v3, digital versions)
- Tags version management (tags_main.json)
- Metadata tracking with history
- Unified logging system with per-module logs
- Comprehensive test suite (170 tests passing)
- Documentation (setup, structure, standards, pipeline flow)
- CI/CD pipeline with GitHub Actions
- Pre-commit hooks (black, isort, flake8, markdown, yaml)
Phase 2: Data Enrichment & Validation (✅ COMPLETE)
Scorecard System
- Scorecard data loading (194 countries, 10 indicators each)
- Scorecard metadata enrichment
- Scorecard CSV exports (summary, sources, by-indicator, by-region)
- URL validation system (parallel workers, retry logic)
- Source change detection and monitoring
- Diff checking for stale scorecard entries
- Integration with pipeline runner
Note: System infrastructure complete; country-level data population and validation ongoing (0-1-2 scoring framework finalized January 2026).
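The URL validation item above mentions parallel workers and retry logic. A minimal sketch of that combination follows. It is not the project's implementation: the names `check_with_retry`, `validate_urls`, and the injected `probe` callable are hypothetical, and a real probe would issue an HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_with_retry(url: str, probe: Callable[[str], bool], retries: int = 3) -> bool:
    """Call probe(url) up to `retries` times; an exception counts as a failed attempt."""
    for _ in range(retries):
        try:
            if probe(url):
                return True
        except Exception:
            # Assumption: any error is retryable. A real validator would
            # likely log the error and wait before the next attempt.
            pass
    return False


def validate_urls(
    urls: Iterable[str], probe: Callable[[str], bool], workers: int = 4
) -> Dict[str, bool]:
    """Check URLs concurrently and return a {url: reachable} mapping."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        results = pool.map(lambda u: check_with_retry(u, probe), url_list)
        return dict(zip(url_list, results))
```

Passing the probe in as a parameter keeps the retry and concurrency logic testable without network access.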
Validation & Security
- Centralized validators module (68 tests)
- URL validation with malicious pattern blocking
- Path validation with traversal protection
- File validation (size limits, extension checks)
- String validation (length, patterns, regex)
- Config validation (JSON, tags, structure)
- Schema validation (metadata, documents)
- Security improvements across all processors
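To illustrate the path validation item above, here is a minimal sketch of traversal protection. The function name `is_safe_path` and its signature are assumptions for illustration and are not taken from the project's validators module.

```python
from pathlib import Path


def is_safe_path(base_dir: str, candidate: str) -> bool:
    """Return True only if `candidate` resolves to a location inside `base_dir`."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments (and follows symlinks) before comparing,
    # so "raw/../../secret" is judged by where it actually ends up.
    target = (base / candidate).resolve()
    return target == base or base in target.parents
```

Joining an absolute `candidate` onto `base` discards `base` entirely, so absolute paths outside the base directory are rejected by the same check.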
Expanded Sources
- OHCHR Treaty Body scraper
- UPR documents scraper
- UNICEF reports scraper
- ACERWC scraper
- ACHPR scraper
- Manual upload ingestion (via data/raw/manual/)
- Static URL dictionary processing
Phase 3: Advanced Processing (✅ COMPLETE - 8/9 Tasks)
Recommendations System
- Recommendations extraction (regex-based)
- Recommendations config format (recs_v1.json)
- Recommendations versioning and history tracking
- NLP-based recommendations extraction (future)
- Recommendations export to CSV
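A regex-based extractor of the kind listed above can be sketched as follows. The pattern and the function name `extract_recommendations` are hypothetical; the project's actual patterns are defined in recs_v1.json, whose format is not shown here.

```python
import re
from typing import List

# Hypothetical pattern: any sentence (text up to the next period) that
# contains one of a few recommendation verbs.
RECOMMENDATION_RE = re.compile(
    r"[^.]*\b(?:recommends|urges|calls upon)\b[^.]*\.",
    re.IGNORECASE,
)


def extract_recommendations(text: str) -> List[str]:
    """Return each sentence in `text` that matches the recommendation pattern."""
    return [m.group(0).strip() for m in RECOMMENDATION_RE.finditer(text)]
```

Splitting on periods is a simplification; abbreviations and paragraph numbers such as "para. 12" would break sentences early, which is one reason NLP-based extraction is listed as future work.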
Comparison & Analysis
- Timeline exports (tags_timeline.py, tags_timeline_country.py, tags_timeline_region.py)
- Comparison across tagging versions
- Comparison across recommendations versions
- Comparison export to CSV with version headers
- Year-over-year trend analysis (via timeline exports)
Enhanced Normalization
- Country/region normalization with ISO codes
- Preservation of _raw fields for provenance
- Complete ISO 3166-1 alpha-2 mapping (194 countries)
- Automatic doc type classification (Policy, Law, TreatyBody, etc.)
- Source reliability scoring (Phase 4)
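The normalization items above pair ISO code lookup with preserved _raw fields. A minimal sketch of that idea follows, assuming hypothetical field names (`country`, `country_raw`, `country_iso`) and a three-entry lookup table in place of the full 194-country mapping.

```python
from typing import Dict, Optional

# Illustrative subset only; the project maps 194 countries.
ISO_ALPHA2 = {"kenya": "KE", "south africa": "ZA", "nigeria": "NG"}


def normalize_country(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return a copy of `record` with the raw value kept and an ISO code added."""
    raw = record.get("country", "")
    out: Dict[str, Optional[str]] = dict(record)
    # Keep the value exactly as scraped so the normalization can be audited.
    out["country_raw"] = raw
    # Lookup is whitespace- and case-insensitive; unknown names map to None.
    out["country_iso"] = ISO_ALPHA2.get(raw.strip().lower())
    return out
```

Writing the normalized value to a separate field, instead of overwriting the original, means a wrong mapping can be corrected later without re-scraping.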
Scorecard Maintenance
- Alternative source identification for failed monitors
- Phase 1 critical updates (6 countries, 18 fields, entries 20+ years old)
- Multi-format exports (CSV, XLSX, ODS, Google Sheets JSON)
- Update documentation and workflows
- Phase 2-4 updates (ongoing maintenance)
Phase 4: REST API Backend (✅ COMPLETE)
Flask API Infrastructure
- App factory pattern with configuration management
- Flask-CORS, Flask-Caching, Flask-Limiter extensions
- Request validation and error handling
- Standard JSON response format with metadata
- Comprehensive API documentation
Core API Endpoints
- Health and system info endpoints
- Documents API (list, filter, detail with pagination and sorting)
- Scorecard API (countries summary, indicators, statistics)
- Tags API (frequency analysis, version management, filtering)
- Timeline API (temporal analysis, year × tag matrices)
- Export API (CSV downloads with SPDX license headers)
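The year × tag matrices mentioned for the Timeline API can be built with a simple counting pass. The sketch below assumes each document is a mapping with hypothetical `year` and `tags` keys; the actual metadata schema may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def year_tag_matrix(docs: Iterable[Mapping]) -> Dict[int, Dict[str, int]]:
    """Count how often each tag appears per year: {year: {tag: count}}."""
    matrix = defaultdict(Counter)
    for doc in docs:
        # Counter.update adds one to the count of every tag in the list.
        matrix[doc["year"]].update(doc["tags"])
    # Convert to plain dicts, ordered by year, so the result is JSON-friendly.
    return {year: dict(counts) for year, counts in sorted(matrix.items())}
```

A nested dict leaves absent year/tag pairs implicit; a consumer that needs a dense grid can fill in zeros for them.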
Authentication & Security
- API key authentication via X-API-Key header
- Dynamic rate limiting (100 req/hr public, 1000 req/hr authenticated)
- Custom limits for expensive operations
- Security headers and best practices
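The tiered limits above (100 req/hr public, 1000 req/hr authenticated) amount to choosing a limit per request based on the X-API-Key header. A minimal sketch of that decision follows; `rate_limit_for` and the key store are hypothetical, and the way the project wires this into Flask-Limiter is not shown in this roadmap.

```python
from typing import Mapping, Set

# Limit strings in the "N per hour" notation that Flask-Limiter accepts.
PUBLIC_LIMIT = "100 per hour"
AUTHENTICATED_LIMIT = "1000 per hour"


def rate_limit_for(headers: Mapping[str, str], valid_keys: Set[str]) -> str:
    """Pick the rate limit for a request from its X-API-Key header."""
    # Note: a plain dict is case-sensitive, whereas Flask's request.headers
    # matches header names case-insensitively.
    key = headers.get("X-API-Key")
    if key is not None and key in valid_keys:
        return AUTHENTICATED_LIMIT
    # Missing or unrecognized keys fall back to the public tier.
    return PUBLIC_LIMIT
```

An unrecognized key is treated like no key at all here; rejecting it with an authentication error instead is an equally reasonable design.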
Production Deployment
- Docker and docker-compose configuration
- Nginx reverse proxy with SSL/TLS support
- Redis for caching and rate limiting
- Health checks and monitoring
- Complete production deployment guide
Testing & Quality
- 104 integration tests (100% pass rate)
- All pre-commit hooks passing
- Full endpoint coverage
Status: 14 endpoints operational, production-ready with Docker deployment
Phase 5: Visualization Dashboard (📅 NEXT)
Interactive Frontend
- Dashboard framework (React or Vue.js)
- Tag frequency visualizations (bar charts, heatmaps)
- Timeline views (tags over time with D3.js or Plotly)
- Geographic heatmaps (countries × tags)
- Scorecard indicator displays
- Interactive filters (region, country, tags, year, source)
- Export/download UI for datasets
- Mobile-responsive design
- API integration with authentication
- Real-time data updates
Advanced Analytics
- Comparison mode (version side-by-side)
- Gap analysis (missing data visualization)
- Correlation analysis (tags vs indicators)
- Trend analysis and forecasting
- Custom report generation
Note: Backend API is complete and ready for frontend integration
Phase 6: Global Expansion (📅 FUTURE)
Geographic Coverage
- European sources (EU, Council of Europe)
- Asian sources (ASEAN, national bodies)
- Americas sources (OAS, IACHR)
- Merge African + global datasets
- Multi-language support (translation pipeline)
Advanced Features
- Machine learning for document classification
- Automated entity extraction (organizations, people, dates)
- Sentiment analysis on recommendations
- Network analysis (document citations)
- Automated report generation
- Email alerts for new documents
Phase 7: Infrastructure & Community (📅 FUTURE)
Infrastructure
- Cloud deployment (AWS/GCP/Azure)
- Automated daily scraper runs
- Database migration (from JSON to PostgreSQL)
- Data versioning and snapshots
- Disaster recovery and backups
Community & Documentation
- Contributor guidelines
- Research methodology documentation
- User guides and tutorials
- Video walkthroughs
- Academic publications and citations
- Conference presentations
Quality & Maintenance
- Automated data quality checks
- Link rot monitoring and alerts
- Performance optimization (caching, indexing)
- Code refactoring and technical debt reduction
- Security audits and updates
Current Status Summary
Completed:
- ✅ Core pipeline (scraping, processing, tagging) - Multiple sources: 6 automated scrapers + direct URL tracking
- ✅ Scorecard system (194 countries, 10 indicators, 2,543 source URLs tracked)
- ✅ Validation and security framework - 170 tests passing (68 validator tests)
- ✅ Recommendations extraction system - Regex-based with versioning and history tracking
- ✅ Timeline exports - Global, by-country, and by-region analysis over time
- ✅ Comparison analytics - Compare tags and recommendations across versions
- ✅ ISO 3166-1 alpha-2 country code mapping - 194 countries fully mapped
- ✅ Document type classifier - Multi-stage rules-based classification
- ✅ Scorecard maintenance - Phase 1 critical updates (6 countries, 18 fields updated)
- ✅ Multi-format scorecard exports - CSV, XLSX, ODS, Google Sheets JSON
- ✅ Comprehensive documentation (40+ markdown files)
Completed (Phase 4 - All 5 Weeks):
- ✅ Flask API backend (14 endpoints working, documented, tested)
- ✅ Tags, Timeline, Export APIs (frequency analysis, temporal data, CSV downloads)
- ✅ Authentication and rate limiting (API keys, dynamic limits 100-2000 req/hr)
- ✅ Production deployment (Docker, Redis, Nginx, complete guide)
- ✅ 104 integration tests passing (100% success rate)
Next Priority:
- 🎯 Phase 5: Dashboard frontend (React/Vue.js with D3.js visualizations)
- 🎯 Source reliability scoring
- 🎯 Continue scorecard maintenance (Phases 2-4: 41 remaining stale entries)
- 🎯 NLP-based recommendations extraction (advanced features)
Metrics
- Lines of Code: ~21,000+ (Python, config, tests, API, deployment)
- Tests: 274 (170 pipeline + 104 API)
- Documentation: 75+ markdown files, comprehensive API docs
- Data Sources: 7 scrapers (AU, OHCHR, UPR, UNICEF, ACERWC, ACHPR, manual)
- Countries Tracked: 194 (via scorecard, all with ISO 3166-1 alpha-2 codes)
- Documents Tracked: 78 (via metadata.json)
- Indicators: 10 per country (29 total indicator fields tracked)
- Source URLs: 2,543 tracked and validated
- Tags Versions: 4 (v1, v2, v3, digital)
- Export Formats: CSV, XLSX, ODS, Google Sheets JSON (scorecard)
- API Endpoints: 14 production-ready (health, info, documents × 2, scorecard × 3, tags × 2, timeline × 1, export × 2)
- Authentication: API key based with dynamic rate limiting (100-2000 req/hr)
- Deployment: Docker + docker-compose + Nginx + Redis
Contributing
The project is actively developed. Contributions are welcome in the following areas:
- New scrapers for additional sources
- Enhanced processors (OCR, image extraction)
- Visualization components for dashboard
- Documentation improvements and examples
- Testing coverage expansion
- Performance optimizations
Notes
- End-to-end pipeline is production-ready for AU Policy + scorecard workflow
- Future work focuses on expanding analytics and building the research dashboard
- All core infrastructure is stable and well-tested (170 tests passing in ~106 seconds)
- Documentation is comprehensive and up-to-date
- Recent Completions (January 2026):
    - Migrated from PyPDF2 to pypdf - no more deprecation warnings
    - ISO 3166-1 alpha-2 mapping for all 194 countries
    - Document type classifier (multi-stage rules-based)
    - Scorecard Phase 1 maintenance (6 countries, 18 fields updated)
    - REUSE 3.0 compliance - 235/235 files with SPDX headers
    - Phase 4 API Backend (All 5 weeks):
        - Week 1-2: Foundation + core endpoints (documents, scorecard)
        - Week 3: Extended APIs (tags, timeline, export)
        - Week 4: Authentication & rate limiting (API keys, dynamic limits)
        - Week 5: Production deployment (Docker, Redis, Nginx, 678-line guide)
        - 14 endpoints operational, 104 tests passing (100% success rate)
Last updated: January 2026