🌍 Project Domains:
- https://GRIMdata.org
- https://LittleRainbowRights.com
PIPELINE FLOW
This document explains the flow of data through the pipeline.
1. Input Sources

- Scrapers pull content from:
  - AU policy PDFs (demo implemented)
  - OHCHR Treaty Body Database (planned)
  - UPR, UNICEF, ACERWC, ACHPR (planned)
- Manual files can be dropped into `data/raw/manual/` (see the sketch after this list).
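A minimal sketch of how manually dropped files could be discovered; `list_manual_files` is a hypothetical helper for illustration, not one of the pipeline's actual modules:

```python
from pathlib import Path

# Directory named in this document; the helper below is an assumed sketch.
MANUAL_DIR = Path("data/raw/manual")

def list_manual_files() -> list[Path]:
    """Collect manually dropped files so the pipeline can queue them for ingestion."""
    return sorted(p for p in MANUAL_DIR.glob("**/*") if p.is_file())

if __name__ == "__main__":
    for path in list_manual_files():
        print(f"queued: {path}")
```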
2. Processing

- File type detection: attempt to process based on the file extension (see the dispatch sketch after this list).
  - `.pdf` → `pdf_to_text.py`
  - `.docx` → `docx_to_text.py`
  - `.html` → `html_to_text.py`
  - Fallback → try processors in order until one succeeds.
- Normalization: `json_normalizer.py` cleans country/region names and preserves `_raw` fields for provenance (sketched below).
- Tagging: `tagger.py` applies regex rules from `configs/tags_v1.json`. Tags are stored in `tags_history` in `metadata.json` (sketched below).
- Recommendations (future): config-driven extraction of recommendations.
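A minimal sketch of the extension dispatch and fallback behaviour described above, assuming each processor module exposes a function that takes a path and returns text; the stub bodies and this interface are assumptions, not the project's actual API:

```python
from pathlib import Path
from typing import Callable

# Stub processors standing in for pdf_to_text.py, docx_to_text.py, and
# html_to_text.py; the real modules may expose a different interface.
def pdf_to_text(path: Path) -> str:
    return f"[stub] text extracted from PDF {path.name}"

def docx_to_text(path: Path) -> str:
    return f"[stub] text extracted from DOCX {path.name}"

def html_to_text(path: Path) -> str:
    return f"[stub] text extracted from HTML {path.name}"

PROCESSORS: dict[str, Callable[[Path], str]] = {
    ".pdf": pdf_to_text,
    ".docx": docx_to_text,
    ".html": html_to_text,
}

def extract_text(path: Path) -> str:
    """Dispatch on extension; otherwise try every processor until one succeeds."""
    processor = PROCESSORS.get(path.suffix.lower())
    if processor is not None:
        return processor(path)
    for candidate in PROCESSORS.values():  # fallback chain
        try:
            return candidate(path)
        except Exception:
            continue
    raise ValueError(f"No processor could handle {path}")
```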
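The normalization step, sketched under the assumption that provenance is kept as `_raw`-prefixed fields alongside the cleaned values; the alias table and field names are illustrative, and the real rules live in `json_normalizer.py`:

```python
# Illustrative alias table; the real cleanup rules live in json_normalizer.py.
COUNTRY_ALIASES = {
    "DRC": "Democratic Republic of the Congo",
    "Tanzania": "United Republic of Tanzania",
}

def normalize_record(record: dict) -> dict:
    """Clean country/region names while preserving originals for provenance."""
    normalized = dict(record)
    for field in ("country", "region"):
        raw = record.get(field)
        if raw is not None:
            normalized[f"_raw_{field}"] = raw  # provenance copy of the original value
            normalized[field] = COUNTRY_ALIASES.get(raw, raw)
    return normalized
```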
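And the tagging step, sketched assuming `configs/tags_v1.json` maps tag names to regex patterns; only the config path and the `tags_history` destination come from this document:

```python
import json
import re
from pathlib import Path

def load_tag_rules(config_path: str = "configs/tags_v1.json") -> dict[str, re.Pattern]:
    """Load regex rules. Assumed config shape: {"tag_name": "regex pattern", ...}."""
    rules = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {tag: re.compile(pattern, re.IGNORECASE) for tag, pattern in rules.items()}

def apply_tags(text: str, rules: dict[str, re.Pattern]) -> list[str]:
    """Return every tag whose pattern matches the document text."""
    return [tag for tag, pattern in rules.items() if pattern.search(text)]
```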
3. Metadata

`metadata.json` tracks (a sketch follows this list):

- Project identity
- Documents ingested
- Tags history (with versions)
- Recommendations history (with versions)
- Last processed timestamp
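For orientation, a sketch of what one `metadata.json` might contain given the fields above; every key name and sample value here is an assumption:

```python
import json
from datetime import datetime, timezone

# Hypothetical metadata.json shape; key names and sample values are assumptions.
metadata = {
    "project": "GRIMdata",                                         # project identity
    "documents": [{"id": "au-policy-001", "source": "AU policy PDFs"}],
    "tags_history": [{"version": "tags_v1", "tags": ["education", "health"]}],
    "recommendations_history": [],                                 # future extraction step
    "last_processed": datetime.now(timezone.utc).isoformat(),
}

with open("data/metadata/metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)
```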
4. Exports

- `tags_summary.py` → counts tags and outputs a CSV with a branding footer (see the sketch below).
- `tags_timeline*.py` → (future) timeline exports.
- `comparison.py` → (future) compare tagging/recommendations across versions.
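A minimal sketch of the tag-count export in the spirit of `tags_summary.py`; the output path, column names, and footer wording are assumptions:

```python
import csv
from collections import Counter

def export_tag_counts(tag_lists, out_path="data/exports/tags_summary.csv"):
    """Count tags across documents and write a CSV with a branding footer."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["tag", "count"])
        for tag, count in counts.most_common():
            writer.writerow([tag, count])
        writer.writerow([])  # blank row before the footer
        writer.writerow(["Generated by the GRIMdata.org pipeline"])  # assumed wording

export_tag_counts([["education", "health"], ["education"]])
```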
5. Logging

- `logger.py` provides unified and per-module logging (see the sketch below).
- Controlled via the `--no-module-logs` flag in `pipeline_runner.py`.
- Logs are written to the `logs/` directory.
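A sketch of how unified plus per-module logging could be wired, with a switch mirroring the `--no-module-logs` flag; the handler layout and log file names are assumptions:

```python
import logging
from pathlib import Path

LOG_DIR = Path("logs")

def get_logger(module: str, module_logs: bool = True) -> logging.Logger:
    """Attach a unified pipeline log and, optionally, a per-module log file.

    Passing module_logs=False mirrors running pipeline_runner.py with
    --no-module-logs (assumed semantics).
    """
    LOG_DIR.mkdir(exist_ok=True)
    logger = logging.getLogger(module)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        unified = logging.FileHandler(LOG_DIR / "pipeline.log")
        unified.setFormatter(fmt)
        logger.addHandler(unified)
        if module_logs:
            per_module = logging.FileHandler(LOG_DIR / f"{module}.log")
            per_module.setFormatter(fmt)
            logger.addHandler(per_module)
    return logger
```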
6. Review

- Researchers can inspect:
  - Raw files in `data/raw/`
  - Processed text in `data/processed/`
  - Metadata in `data/metadata/metadata.json`
  - Exports in `data/exports/`
  - Logs in `logs/`