Frequently Asked Questions (FAQ)¶
General Questions¶
What is GRIMdata?¶
GRIMdata (Global Rights Information Monitoring) is a research platform at grimdata.org hosting multiple human rights data analysis projects. Currently, GRIMdata features two projects:
- LittleRainbowRights - Child and LGBTQ+ digital rights research
- Repository: DigitalChild
- Tracks 10 indicators across 194 countries
- SGBV-UPR - Sexual and gender-based violence analysis
- Repository: HumanRights (currently private - retooling in progress; a broken laptop holds the full story)
- UPR recommendations analysis
What is DigitalChild?¶
DigitalChild is the open-source Python pipeline that powers the LittleRainbowRights project. It scrapes documents from international organizations, processes them into structured data, applies automated tagging, and enriches them with country-level indicators for digital rights analysis.
Who is this for?¶
- Researchers: Human rights academics studying digital rights trends
- NGOs: Organizations tracking child/LGBTQ+ protections globally
- Policy Analysts: Those comparing policies across countries
- Journalists: Investigating digital rights stories
- Students: Learning about data analysis and human rights
Is this free to use?¶
Yes! The code is licensed under MIT (permissive, free for any use including commercial). The data and documentation are licensed under CC BY 4.0 (free to use with attribution).
Can I use this for my research?¶
Absolutely! That's the intended purpose. Please cite the project using the format in CITATION.cff.
Getting Started¶
What do I need to run this?¶
Minimum requirements:
- Python 3.12
- 1GB disk space (for code + small dataset)
- Internet connection (for scraping)
Recommended:
- 10GB+ disk space (for large document collections)
- Good internet connection (faster scraping)
How do I install it?¶
# 1. Clone the repository
git clone https://github.com/MissCrispenCakes/DigitalChild.git
cd DigitalChild
# 2. Set up virtual environment
python3 -m venv .LittleRainbow
source .LittleRainbow/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Initialize project structure
python init_project.py
# 5. Run the pipeline
python pipeline_runner.py --source au_policy
See docs/guides/FIRST_RUN_ERRORS.md for troubleshooting.
How long does it take to scrape documents?¶
Depends on the source and number of documents:
- AU Policy: ~5-10 minutes (small collection)
- UPR (single country): ~2-5 minutes
- OHCHR: ~15-30 minutes (larger collection)
The pipeline skips already-downloaded files, so subsequent runs are faster.
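The skip logic can be sketched roughly as follows. Note that `fetch_if_missing` is a hypothetical stand-in for illustration, not the pipeline's actual download API:

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest_dir: str) -> str:
    """Download url into dest_dir unless the file already exists there."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, os.path.basename(url))
    if os.path.exists(path):
        return path  # already downloaded: skip, saving time and bandwidth
    urllib.request.urlretrieve(url, path)
    return path
```

Because the check happens before any network request, a re-run over an already-populated `data/raw/<source>/` directory finishes in seconds.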
Where do the downloaded files go?¶
- Raw documents: data/raw/&lt;source&gt;/
- Processed text: data/processed/&lt;region&gt;/&lt;org&gt;/text/
- Metadata: data/metadata/metadata.json
- Exports: data/exports/
- Logs: logs/
Features & Capabilities¶
What sources does it support?¶
Currently supports 7 sources:
- AU Policy - African Union policy documents
- OHCHR - Office of the High Commissioner for Human Rights
- UPR - Universal Periodic Review documents
- UNICEF - UNICEF reports and publications
- ACERWC - African Committee on Child Rights
- ACHPR - African Commission on Human Rights
- Manual - Upload your own documents to data/raw/manual/
What file formats can it process?¶
- PDF (most common)
- DOCX (Microsoft Word)
- HTML (web pages)
The fallback_handler automatically tries different processors until one succeeds.
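A fallback chain of this kind can be sketched as follows. The function and processor names here are placeholders, not the real fallback_handler API:

```python
def extract_text(path: str, processors) -> str:
    """Try each processor in order; return the first successful result."""
    errors = []
    for proc in processors:
        try:
            return proc(path)
        except Exception as exc:  # a processor failing is expected; try the next one
            errors.append(f"{proc.__name__}: {exc}")
    raise RuntimeError("All processors failed: " + "; ".join(errors))
```

Ordering the processors from most to least reliable for each format keeps the common case fast while still handling malformed files.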
What is the scorecard system?¶
The scorecard tracks 10 human rights indicators across 194 countries:
- Data Protection Law - Comprehensive data protection legislation
- DPA Independence - Data Protection Authority independence from executive control
- Children's Data Safeguards - Child-specific data governance safeguards in binding law
- Child Online Protection Strategy - National COP framework addressing online harms
- SOGI Sensitive Data - Sexual orientation/gender identity as sensitive data
- LGBTQ+ Legal Status - Legal recognition and protections for LGBTQ+ individuals
- LGBTQ+ Promotion/Propaganda Offences - Laws restricting LGBTQ+ discussion/advocacy
- AI Policy Status - National AI strategy or framework
- DPIA Required for High-Risk AI - Data Protection Impact Assessments for high-risk AI
- SIM Card Biometric ID Linkage - Biometric data required for SIM registration
Each indicator includes the current status, categories, and source URLs for verification.
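Conceptually, one indicator record for one country might look like the following. The field names and values here are illustrative, not the project's exact schema:

```python
# Hypothetical shape of a single scorecard entry (illustrative only)
indicator = {
    "country": "KE",                      # ISO country code
    "indicator": "data_protection_law",   # one of the 10 indicators
    "status": "enacted",                  # current status
    "category": "comprehensive",          # classification bucket
    "sources": [
        "https://example.org/ke-data-protection-act",  # placeholder URL
    ],
}
```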
How does tagging work?¶
The tagger applies regex-based rules to identify mentions of:
- Child rights
- LGBTQ+ rights
- AI and automation
- Privacy and data protection
- Digital policy
- Online rights
- And more...
Tags are versioned (v1, v2, v3, digital, queerai) allowing comparison across rule sets.
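In spirit, regex-based tagging works like this sketch. The patterns and tag names below are illustrative, not the project's actual rule set:

```python
import re

# Illustrative rules: each tag maps to one compiled pattern
TAG_RULES = {
    "child_rights": re.compile(r"\bchild(?:ren)?(?:'s)? rights?\b", re.I),
    "privacy": re.compile(r"\b(?:privacy|data protection)\b", re.I),
}

def tag_document(text: str) -> list[str]:
    """Return the tags whose pattern matches anywhere in the text."""
    return [tag for tag, pattern in TAG_RULES.items() if pattern.search(text)]
```

Because each rule set lives in its own versioned config, the same corpus can be re-tagged under v1 and v3 and the results compared directly.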
What tag versions are available?¶
The pipeline supports multiple tag configurations:
- latest - Alias for the most current version (currently v3) - recommended
- v1 - Original basic tags
- v2 - Expanded tag set
- v3 - Current comprehensive tags
- digital - Digital rights focused tags
- queerai - QueerAI conference tags
Use with: python pipeline_runner.py --source au_policy --tags-version latest
Can I add my own tags?¶
Yes! Edit configs/tags_v3.json (or create a new version) and add your regex patterns:
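A new entry might look something like this; the shape below is illustrative, so mirror an existing entry in configs/tags_v3.json rather than copying this exact schema:

```json
{
  "my_custom_tag": {
    "patterns": ["digital literacy", "online education"],
    "description": "Mentions of digital literacy programmes"
  }
}
```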
Then run: python pipeline_runner.py --tags-version v3
Technical Questions¶
Why Python 3.12 specifically?¶
The project uses modern Python features available in 3.12. While it may work with 3.10+, we only test and support 3.12.
Can I run this on Windows?¶
Yes! The code works on Windows, Mac, and Linux. Paths are handled with os.path for cross-platform compatibility.
Do I need Selenium/ChromeDriver?¶
No, unless you want to use Selenium scrapers (_sel variants). The standard scrapers use requests and work without browser drivers.
How much bandwidth does scraping use?¶
Varies by source:
- Small collection: 10-50 MB
- Large collection: 100-500 MB
- Full multi-source scrape: 1-5 GB
The pipeline skips existing files, so re-runs use minimal bandwidth.
Can I run this in the cloud?¶
Yes! Works on:
- AWS EC2
- Google Cloud Compute
- Azure VMs
- DigitalOcean Droplets
- Any Linux server
Just ensure Python 3.12 is installed and you have sufficient disk space.
Does it use a database?¶
Currently uses JSON files (metadata.json) for simplicity. A future version may migrate to PostgreSQL for better performance at scale.
Data & Privacy¶
What data does this collect?¶
The pipeline downloads publicly available documents from government and UN websites. No personal data is collected from users.
Is the scraped data private?¶
The documents are already public (published by governments/UN). However, be mindful of:
- Where you store the data
- Who has access to your machine
- Data retention policies
See DATA_GOVERNANCE.md for details.
Can I share the scraped documents?¶
The documents themselves are typically public domain (government/UN publications). However:
- Check the original source's terms
- Respect copyright if applicable
- Attribute the original publishers
Your compiled dataset (scorecard, tags, analysis) is licensed CC BY 4.0 (requires attribution).
How often is the scorecard data updated?¶
Manually updated as new information becomes available. The scorecard_diff.py module monitors sources for changes and alerts when updates are needed.
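Change monitoring of this kind is commonly done by fingerprinting each source page and comparing against the last-seen value; a minimal sketch (not the actual scorecard_diff.py implementation):

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Stable fingerprint of a source page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_changed(new_text: str, last_fingerprint: str) -> bool:
    """True if the page content differs from the stored fingerprint."""
    return content_fingerprint(new_text) != last_fingerprint
```

When a change is flagged, a maintainer reviews the source and updates the affected indicator by hand.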
Are the scorecard sources reliable?¶
We aim to use authoritative sources (UNESCO, UNCTAD, ILGA, UNICEF, etc.). Each indicator includes the source URL for verification. If you find an error, please report it.
Contributing & Development¶
Can I contribute?¶
Yes! See CONTRIBUTING.md for guidelines. Contributions welcome:
- Bug reports
- New scrapers
- Documentation improvements
- Test coverage
- Visualization ideas
I found a bug. What should I do?¶
- Check if it's already reported: Issues
- Review FIRST_RUN_ERRORS.md
- If not resolved, open a new issue with details
How do I add a new scraper?¶
See SCRAPER_STRUCTURE.md for a template and guide.
Basic steps:
- Create scrapers/new_source.py
- Implement a scrape() function
- Add it to SCRAPER_MAP in pipeline_runner.py
- Add tests
- Submit pull request
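The steps above can be skeletoned like this. The signature, directory layout, and registration details are assumptions for illustration; SCRAPER_STRUCTURE.md has the authoritative template:

```python
# scrapers/new_source.py (hypothetical skeleton)
import os
import urllib.request

BASE_URL = "https://example.org/documents/"  # placeholder source URL
OUT_DIR = os.path.join("data", "raw", "new_source")

def scrape() -> list[str]:
    """Download this source's documents and return the saved file paths."""
    os.makedirs(OUT_DIR, exist_ok=True)
    saved = []
    for name in ["report-2024.pdf"]:  # in practice, discovered from the site
        dest = os.path.join(OUT_DIR, name)
        if not os.path.exists(dest):  # skip files already downloaded
            urllib.request.urlretrieve(BASE_URL + name, dest)
        saved.append(dest)
    return saved
```

The module would then be registered in pipeline_runner.py, presumably along the lines of `SCRAPER_MAP["new_source"] = scrape`.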
How can I test my changes?¶
# Run all tests
pytest tests/ -v
# Run specific test
pytest tests/test_validators.py -v
# With coverage
pytest tests/ --cov
Troubleshooting¶
The scraper isn't downloading anything. Why?¶
Common causes:
- Already downloaded: Pipeline skips existing files (check data/raw/&lt;source&gt;/)
- Network issues: Firewall, proxy, or connection problems
- Source changed: Website structure may have changed
- Robots.txt blocking: Some sites block automated access
Check logs in logs/ for error messages.
Processing fails with "File not found" error¶
Run python init_project.py to ensure all directories exist.
Tests are failing¶
Common reasons:
- Pre-commit not installed: Run pip install pre-commit && pre-commit install
- Dependencies outdated: Run pip install --upgrade -r requirements.txt
- Python version: Ensure Python 3.12 is active
- Working directory: Run from the project root, not a subdirectory
I'm getting import errors¶
Ensure you're running commands from the project root directory (where pipeline_runner.py is located), not from subdirectories.
The website/visualization isn't working¶
The website is static and generated from docs. If it's not working:
- Ensure MkDocs is installed: pip install mkdocs mkdocs-material
- Build locally: mkdocs serve
- Check the mkdocs.yml configuration
Research & Citations¶
How should I cite this project?¶
Use the format in CITATION.cff:
Vollmer, S.C. (2025). DigitalChild: Human Rights Data Pipeline for Child
and LGBTQ+ Digital Protection.
Available at: https://github.com/MissCrispenCakes/DigitalChild
ORCID: 0000-0002-3359-2810
Are there published papers using this?¶
Check the project website at grimdata.org for latest publications.
Can I use this for my thesis/dissertation?¶
Absolutely! That's an intended use case. Please cite the project and consider contributing back improvements.
How can I get involved in research collaborations?¶
Open a discussion or reach out via the website contact form.
Future Development¶
What features are planned?¶
See ROADMAP.md for the full roadmap. Highlights:
- Recommendations extraction (NLP-based)
- Timeline visualizations
- Comparison analytics
- Interactive research dashboard
- Global expansion (Europe, Asia, Americas)
When will the research dashboard be ready?¶
Target: Late 2026. It's in Phase 4 of the roadmap. Focus right now is on completing Phase 3 (advanced processing).
Can I request a feature?¶
Yes! Open a feature request issue or discussion. No guarantees, but we're open to suggestions that align with the project mission.
Contact & Support¶
How do I get help?¶
- Documentation: Check the documentation
- FAQ: This page
- Issues: Search existing issues
- Discussions: Ask in discussions
Is there a mailing list or community forum?¶
Not yet. Use GitHub Discussions for now. A community forum may be added in the future.
Who maintains this project?¶
This project is maintained part-time by one person. Please be patient with response times!
How can I support the project?¶
- ⭐ Star the repo on GitHub
- 📢 Share it with researchers in your network
- 🐛 Report bugs and issues
- 💻 Contribute code or documentation
- 📝 Cite it in your publications
- 💰 Consider sponsoring (if/when GitHub Sponsors is enabled)
Didn't find your answer? Open a discussion or issue.
Last updated: January 2026