Project Roadmap

This roadmap outlines the milestones for the GRIMdata / LittleRainbowRights pipeline.


Phase 1: Core Pipeline (✅ COMPLETE)

  • Project scaffolding (init_project.py)
  • AU Policy scraper (requests and Selenium variants)
  • PDF → text processor
  • DOCX → text processor
  • HTML → text processor
  • Fallback handler for multi-format processing
  • Tagging system (v1, v2, v3, digital versions)
  • Tags version management (tags_main.json)
  • Metadata tracking with history
  • Unified logging system with per-module logs
  • Comprehensive test suite (170 tests passing)
  • Documentation (setup, structure, standards, pipeline flow)
  • CI/CD pipeline with GitHub Actions
  • Pre-commit hooks (black, isort, flake8, markdown, yaml)

Phase 2: Data Enrichment & Validation (✅ COMPLETE)

Scorecard System

  • Scorecard data loading (194 countries, 10 indicators each)
  • Scorecard metadata enrichment
  • Scorecard CSV exports (summary, sources, by-indicator, by-region)
  • URL validation system (parallel workers, retry logic)
  • Source change detection and monitoring
  • Diff checking for stale scorecard entries
  • Integration with pipeline runner

Note: System infrastructure complete; country-level data population and validation ongoing (0-1-2 scoring framework finalized January 2026).

Validation & Security

  • Centralized validators module (68 tests)
  • URL validation with malicious pattern blocking
  • Path validation with traversal protection
  • File validation (size limits, extension checks)
  • String validation (length, patterns, regex)
  • Config validation (JSON, tags, structure)
  • Schema validation (metadata, documents)
  • Security improvements across all processors
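The path traversal protection listed above can be illustrated with a minimal sketch. The function name `is_within_base` and the example directories are hypothetical and are not taken from the project's validators module; the idea is simply to resolve the candidate path and confirm it still sits under the allowed base directory.

```python
from pathlib import Path


def is_within_base(base_dir: str, candidate: str) -> bool:
    """Return True if candidate, resolved against base_dir, stays inside base_dir."""
    base = Path(base_dir).resolve()
    # Joining first and resolving second collapses any ".." segments.
    target = (base / candidate).resolve()
    return target == base or base in target.parents


print(is_within_base("data/raw", "manual/report.pdf"))  # True
print(is_within_base("data/raw", "../../etc/passwd"))   # False
```

An absolute candidate such as `/etc/passwd` replaces the base when joined, so it is rejected by the same check.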

Expanded Sources

  • OHCHR Treaty Body scraper
  • UPR documents scraper
  • UNICEF reports scraper
  • ACERWC scraper
  • ACHPR scraper
  • Manual upload ingestion (via data/raw/manual/)
  • Static URL dictionary processing

Phase 3: Advanced Processing (✅ COMPLETE - 8/9 Tasks)

Recommendations System

  • Recommendations extraction (regex-based)
  • Recommendations config format (recs_v1.json)
  • Recommendations versioning and history tracking
  • NLP-based recommendations extraction (future)
  • Recommendations export to CSV
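As a rough illustration of regex-based recommendations extraction, the sketch below pulls out sentences that contain a recommendation verb. The pattern and the function name `extract_recommendations` are assumptions made for this example; the actual patterns live in the project's recommendations config and may differ.

```python
import re
from typing import List

# Hypothetical pattern: any sentence containing "recommend(s)", "urge(s)",
# or "call(s) on/upon". Not the project's recs_v1.json rules.
RECOMMENDATION_PATTERN = re.compile(
    r"[^.]*\b(?:recommends?|urges?|calls? (?:up)?on)\b[^.]*\.",
    re.IGNORECASE,
)


def extract_recommendations(text: str) -> List[str]:
    """Return each sentence that matches the recommendation pattern."""
    return [m.group(0).strip() for m in RECOMMENDATION_PATTERN.finditer(text)]


sample = (
    "The Committee notes progress. "
    "The Committee recommends that the State adopt a data protection law. "
    "It also urges the State to train teachers."
)
for rec in extract_recommendations(sample):
    print(rec)
```

Splitting on full stops is a simplification; abbreviations such as "para. 12" would break sentences early, which is one reason an NLP-based extractor is listed as future work.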

Comparison & Analysis

  • Timeline exports (tags_timeline.py, tags_timeline_country.py, tags_timeline_region.py)
  • Comparison across tagging versions
  • Comparison across recommendations versions
  • Comparison export to CSV with version headers
  • Year-over-year trend analysis (via timeline exports)

Enhanced Normalization

  • Country/region normalization with ISO codes
  • Preservation of _raw fields for provenance
  • Complete ISO 3166-1 alpha-2 mapping (194 countries)
  • Automatic doc type classification (Policy, Law, TreatyBody, etc.)
  • Source reliability scoring (deferred; see Next Priority)
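The normalization approach above (ISO codes plus preserved `_raw` fields) might look roughly like the following sketch. The field names `country_raw` and `country_iso`, the function name, and the three-entry lookup table are illustrative assumptions; the project's mapping covers all 194 countries.

```python
from typing import Any, Dict

# Tiny illustrative subset of an ISO 3166-1 alpha-2 lookup.
ISO_ALPHA2 = {"kenya": "KE", "south africa": "ZA", "nigeria": "NG"}


def normalize_country(record: Dict[str, Any]) -> Dict[str, Any]:
    """Add an ISO code while keeping the original value for provenance."""
    raw = record.get("country", "")
    normalized = dict(record)  # copy so the input record is not mutated
    normalized["country_raw"] = raw
    normalized["country_iso"] = ISO_ALPHA2.get(raw.strip().lower())
    return normalized


print(normalize_country({"country": " Kenya ", "year": 2021}))
```

Unknown names map to `None` rather than raising, so unmatched entries can be reviewed later without losing the raw value.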

Scorecard Maintenance

  • Alternative source identification for failed monitors
  • Phase 1 critical updates (6 countries, 18 fields, entries 20+ years old)
  • Multi-format exports (CSV, XLSX, ODS, Google Sheets JSON)
  • Update documentation and workflows
  • Phase 2-4 updates (ongoing maintenance)

Phase 4: REST API Backend (✅ COMPLETE)

Flask API Infrastructure

  • App factory pattern with configuration management
  • Flask-CORS, Flask-Caching, Flask-Limiter extensions
  • Request validation and error handling
  • Standard JSON response format with metadata
  • Comprehensive API documentation
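To make "standard JSON response format with metadata" concrete, here is one possible envelope. The key names (`status`, `data`, `meta`) and the helper `make_response` are assumptions for illustration only; the API documentation is the authority on the real shape.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def make_response(
    data: Any, total: Optional[int] = None, page: Optional[int] = None
) -> Dict[str, Any]:
    """Wrap a payload in a uniform envelope with optional paging metadata."""
    meta: Dict[str, Any] = {"timestamp": datetime.now(timezone.utc).isoformat()}
    if total is not None:
        meta["total"] = total
    if page is not None:
        meta["page"] = page
    return {"status": "success", "data": data, "meta": meta}


response = make_response([{"id": "doc-001"}], total=78, page=1)
print(sorted(response.keys()))
```

Keeping paging details under a separate metadata key means list and detail endpoints can share one envelope.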

Core API Endpoints

  • Health and system info endpoints
  • Documents API (list, filter, detail with pagination and sorting)
  • Scorecard API (countries summary, indicators, statistics)
  • Tags API (frequency analysis, version management, filtering)
  • Timeline API (temporal analysis, year × tag matrices)
  • Export API (CSV downloads with SPDX license headers)
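The year × tag matrix mentioned for the Timeline API can be sketched as a nested count. The document fields `year` and `tags` and the function name `year_tag_matrix` are hypothetical; the real metadata schema may use different keys.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def year_tag_matrix(documents: Iterable[Dict[str, Any]]) -> Dict[int, Dict[str, int]]:
    """Count how often each tag appears per year."""
    matrix: Dict[int, Counter] = defaultdict(Counter)
    for doc in documents:
        for tag in doc.get("tags", []):
            matrix[doc["year"]][tag] += 1
    # Convert to plain dicts, ordered by year, for JSON serialization.
    return {year: dict(counts) for year, counts in sorted(matrix.items())}


docs = [
    {"year": 2021, "tags": ["privacy", "education"]},
    {"year": 2020, "tags": ["privacy"]},
    {"year": 2021, "tags": ["privacy"]},
]
print(year_tag_matrix(docs))
# {2020: {'privacy': 1}, 2021: {'privacy': 2, 'education': 1}}
```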

Authentication & Security

  • API key authentication via X-API-Key header
  • Dynamic rate limiting (100 req/hr public, 1000 req/hr authenticated)
  • Custom limits for expensive operations
  • Security headers and best practices
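The two rate-limit tiers above can be expressed as a small selector keyed on the `X-API-Key` header. This is a sketch under stated assumptions: the function `rate_limit_for` and the in-memory key set are hypothetical, and how the project actually wires limits into Flask-Limiter is not shown here. The "N per hour" strings follow the notation Flask-Limiter uses for limits.

```python
from typing import Mapping, Set

PUBLIC_LIMIT = "100 per hour"           # unauthenticated callers
AUTHENTICATED_LIMIT = "1000 per hour"   # callers presenting a valid key


def rate_limit_for(headers: Mapping[str, str], valid_keys: Set[str]) -> str:
    """Pick the limit string for a request based on its X-API-Key header."""
    api_key = headers.get("X-API-Key")
    if api_key is not None and api_key in valid_keys:
        return AUTHENTICATED_LIMIT
    return PUBLIC_LIMIT


keys = {"example-key"}  # placeholder; real keys would come from configuration
print(rate_limit_for({"X-API-Key": "example-key"}, keys))  # 1000 per hour
print(rate_limit_for({}, keys))                            # 100 per hour
```

An unknown key falls back to the public tier here; an API could equally reject it with a 401, which is a design choice this sketch does not settle.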

Production Deployment

  • Docker and docker-compose configuration
  • Nginx reverse proxy with SSL/TLS support
  • Redis for caching and rate limiting
  • Health checks and monitoring
  • Complete production deployment guide

Testing & Quality

  • 104 integration tests (100% pass rate)
  • All pre-commit hooks passing
  • Full endpoint coverage

Status: 14 endpoints operational, production-ready with Docker deployment


Phase 5: Visualization Dashboard (📅 NEXT)

Interactive Frontend

  • Dashboard framework (React or Vue.js)
  • Tag frequency visualizations (bar charts, heatmaps)
  • Timeline views (tags over time with D3.js or Plotly)
  • Geographic heatmaps (countries × tags)
  • Scorecard indicator displays
  • Interactive filters (region, country, tags, year, source)
  • Export/download UI for datasets
  • Mobile-responsive design
  • API integration with authentication
  • Real-time data updates

Advanced Analytics

  • Comparison mode (version side-by-side)
  • Gap analysis (missing data visualization)
  • Correlation analysis (tags vs indicators)
  • Trend analysis and forecasting
  • Custom report generation

Note: Backend API is complete and ready for frontend integration


Phase 6: Global Expansion (📅 FUTURE)

Geographic Coverage

  • European sources (EU, Council of Europe)
  • Asian sources (ASEAN, national bodies)
  • Americas sources (OAS, IACHR)
  • Merge African + global datasets
  • Multi-language support (translation pipeline)

Advanced Features

  • Machine learning for document classification
  • Automated entity extraction (organizations, people, dates)
  • Sentiment analysis on recommendations
  • Network analysis (document citations)
  • Automated report generation
  • Email alerts for new documents

Phase 7: Infrastructure & Community (📅 FUTURE)

Infrastructure

  • Cloud deployment (AWS/GCP/Azure)
  • Automated daily scraper runs
  • Database migration (from JSON to PostgreSQL)
  • Data versioning and snapshots
  • Disaster recovery and backups

Community & Documentation

  • Contributor guidelines
  • Research methodology documentation
  • User guides and tutorials
  • Video walkthroughs
  • Academic publications and citations
  • Conference presentations

Quality & Maintenance

  • Automated data quality checks
  • Link rot monitoring and alerts
  • Performance optimization (caching, indexing)
  • Code refactoring and technical debt reduction
  • Security audits and updates

Current Status Summary

Completed:

  • ✅ Core pipeline (scraping, processing, tagging) - Multiple sources: 6 automated scrapers + direct URL tracking
  • ✅ Scorecard system (194 countries, 10 indicators, 2,543 source URLs tracked)
  • ✅ Validation and security framework - 170 tests passing (68 validator tests)
  • ✅ Recommendations extraction system - Regex-based with versioning and history tracking
  • ✅ Timeline exports - Global, by-country, and by-region analysis over time
  • ✅ Comparison analytics - Compare tags and recommendations across versions
  • ✅ ISO 3166-1 alpha-2 country code mapping - 194 countries fully mapped
  • ✅ Document type classifier - Multi-stage rules-based classification
  • ✅ Scorecard maintenance - Phase 1 critical updates (6 countries, 18 fields updated)
  • ✅ Multi-format scorecard exports - CSV, XLSX, ODS, Google Sheets JSON
  • ✅ Comprehensive documentation (40+ markdown files)

Completed (Phase 4 - All 5 Weeks):

  • ✅ Flask API backend (14 endpoints working, documented, tested)
  • ✅ Tags, Timeline, Export APIs (frequency analysis, temporal data, CSV downloads)
  • ✅ Authentication and rate limiting (API keys, dynamic limits 100-2000 req/hr)
  • ✅ Production deployment (Docker, Redis, Nginx, complete guide)
  • ✅ 104 integration tests passing (100% success rate)

Next Priority:

  • 🎯 Phase 5: Dashboard frontend (React/Vue.js with D3.js visualizations)
  • 🎯 Source reliability scoring
  • 🎯 Continue scorecard maintenance (Phases 2-4: 41 remaining stale entries)
  • 🎯 NLP-based recommendations extraction (advanced features)

Metrics

  • Lines of Code: 21,000+ (Python, config, tests, API, deployment)
  • Tests: 274 (170 pipeline + 104 API)
  • Documentation: 75+ markdown files, comprehensive API docs
  • Data Sources: 7 (6 automated scrapers: AU, OHCHR, UPR, UNICEF, ACERWC, ACHPR; plus manual upload)
  • Countries Tracked: 194 (via scorecard, all with ISO 3166-1 alpha-2 codes)
  • Documents Tracked: 78 (via metadata.json)
  • Indicators: 10 per country (29 total indicator fields tracked)
  • Source URLs: 2,543 tracked and validated
  • Tags Versions: 4 (v1, v2, v3, digital)
  • Export Formats: CSV, XLSX, ODS, Google Sheets JSON (scorecard)
  • API Endpoints: 14 production-ready (health, info, documents × 2, scorecard × 3, tags × 2, timeline × 1, export × 2)
  • Authentication: API key based with dynamic rate limiting (100-2000 req/hr)
  • Deployment: Docker + docker-compose + Nginx + Redis

Contributing

The project is actively developed. Contributions are welcome in:

  1. New scrapers for additional sources
  2. Enhanced processors (OCR, image extraction)
  3. Visualization components for dashboard
  4. Documentation improvements and examples
  5. Testing coverage expansion
  6. Performance optimizations

Notes

  • End-to-end pipeline is production-ready for AU Policy + scorecard workflow
  • Future work focuses on expanding analytics and building research dashboard
  • All core infrastructure is stable and well-tested (170 tests passing in ~106 seconds)
  • Documentation is comprehensive and up-to-date
  • Recent Completions (January 2026):
    • Migrated from PyPDF2 to pypdf - no more deprecation warnings
    • ISO 3166-1 alpha-2 mapping for all 194 countries
    • Document type classifier (multi-stage rules-based)
    • Scorecard Phase 1 maintenance (6 countries, 18 fields updated)
    • REUSE 3.0 compliance - 235/235 files with SPDX headers
    • Phase 4 API Backend (All 5 weeks):
      • Week 1-2: Foundation + core endpoints (documents, scorecard)
      • Week 3: Extended APIs (tags, timeline, export)
      • Week 4: Authentication & rate limiting (API keys, dynamic limits)
      • Week 5: Production deployment (Docker, Redis, Nginx, 678-line guide)
    • 14 endpoints operational, 104 tests passing (100% success rate)

Last updated: January 2026