Frequently Asked Questions (FAQ)¶
General Questions¶
What is GRIMdata?¶
GRIMdata (Global Rights Information Monitoring) is a research platform at grimdata.org hosting multiple human rights data analysis projects. Currently, GRIMdata features two projects:
- LittleRainbowRights - Child and LGBTQ+ digital rights research
- Repository: DigitalChild
- Tracks 10 indicators across 194 countries
- SGBV-UPR - Sexual and gender-based violence analysis
- Repository: HumanRights (currently private - retooling in progress; a broken laptop holds the full story)
- UPR recommendations analysis
What is DigitalChild?¶
DigitalChild is the open-source Python pipeline that powers the LittleRainbowRights project. It scrapes documents from international organizations, processes them into structured data, applies automated tagging, and enriches them with country-level indicators for digital rights analysis.
Who is this for?¶
- Researchers: Human rights academics studying digital rights trends
- NGOs: Organizations tracking child/LGBTQ+ protections globally
- Policy Analysts: Those comparing policies across countries
- Journalists: Investigating digital rights stories
- Students: Learning about data analysis and human rights
Is this free to use?¶
Yes! The code is licensed under MIT (permissive, free for any use including commercial). The data and documentation are licensed under CC BY 4.0 (free to use with attribution).
Can I use this for my research?¶
Absolutely! That's the intended purpose. Please cite the project using the format in CITATION.cff.
Getting Started¶
What do I need to run this?¶
Minimum requirements:
- Python 3.12
- 1GB disk space (for code + small dataset)
- Internet connection (for scraping)
Recommended:
- 10GB+ disk space (for large document collections)
- Good internet connection (faster scraping)
How do I install it?¶
# 1. Clone the repository
git clone https://github.com/MissCrispenCakes/DigitalChild.git
cd DigitalChild
# 2. Set up virtual environment
python3 -m venv .LittleRainbow
source .LittleRainbow/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Initialize project structure
python init_project.py
# 5. Run the pipeline
python pipeline_runner.py --source au_policy
See docs/guides/FIRST_RUN_ERRORS.md for troubleshooting.
How long does it take to scrape documents?¶
Depends on the source and number of documents:
- AU Policy: ~5-10 minutes (small collection)
- UPR (single country): ~2-5 minutes
- OHCHR: ~15-30 minutes (larger collection)
The pipeline skips already-downloaded files, so subsequent runs are faster.
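The skip logic can be sketched roughly as follows. Note that `fetch_if_missing` is a hypothetical stand-in for illustration, not the pipeline's actual download API:

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest_dir: str) -> str:
    """Download url into dest_dir unless the file already exists there."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, os.path.basename(url))
    if os.path.exists(path):
        return path  # already downloaded: skip, saving time and bandwidth
    urllib.request.urlretrieve(url, path)
    return path
```

Because the check happens before any network request, a re-run over an already-populated `data/raw/<source>/` directory finishes in seconds.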
Where do the downloaded files go?¶
- Raw documents: data/raw/&lt;source&gt;/
- Processed text: data/processed/&lt;region&gt;/&lt;org&gt;/text/
- Metadata: data/metadata/metadata.json
- Exports: data/exports/
- Logs: logs/
Features & Capabilities¶
What sources does it support?¶
Currently supports 7 sources:
- AU Policy - African Union policy documents
- OHCHR - Office of the High Commissioner for Human Rights
- UPR - Universal Periodic Review documents
- UNICEF - UNICEF reports and publications
- ACERWC - African Committee on Child Rights
- ACHPR - African Commission on Human Rights
- Manual - Upload your own documents to data/raw/manual/
What file formats can it process?¶
- PDF (most common)
- DOCX (Microsoft Word)
- HTML (web pages)
The fallback_handler automatically tries different processors until one succeeds.
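A fallback chain of this kind can be sketched as follows. The function and processor names here are placeholders, not the real fallback_handler API:

```python
def extract_text(path: str, processors) -> str:
    """Try each processor in order; return the first successful result."""
    errors = []
    for proc in processors:
        try:
            return proc(path)
        except Exception as exc:  # a processor failing is expected; try the next one
            errors.append(f"{proc.__name__}: {exc}")
    raise RuntimeError("All processors failed: " + "; ".join(errors))
```

Ordering the processors from most to least reliable for each format keeps the common case fast while still handling malformed files.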
What is the scorecard system?¶
The scorecard tracks 10 human rights indicators across 194 countries:
- Data Protection Law - Comprehensive data protection legislation
- DPA Independence - Data Protection Authority independence from executive control
- Children's Data Safeguards - Child-specific data governance safeguards in binding law
- Child Online Protection Strategy - National COP framework addressing online harms
- SOGI Sensitive Data - Sexual orientation/gender identity as sensitive data
- LGBTQ+ Legal Status - Legal recognition and protections for LGBTQ+ individuals
- LGBTQ+ Promotion/Propaganda Offences - Laws restricting LGBTQ+ discussion/advocacy
- AI Policy Status - National AI strategy or framework
- DPIA Required for High-Risk AI - Data Protection Impact Assessments for high-risk AI
- SIM Card Biometric ID Linkage - Biometric data required for SIM registration
Each indicator includes the current status, categories, and source URLs for verification.
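Conceptually, one indicator record for one country might look like the following. The field names and values here are illustrative, not the project's exact schema:

```python
# Hypothetical shape of a single scorecard entry (illustrative only)
indicator = {
    "country": "KE",                      # ISO country code
    "indicator": "data_protection_law",   # one of the 10 indicators
    "status": "enacted",                  # current status
    "category": "comprehensive",          # classification bucket
    "sources": [
        "https://example.org/ke-data-protection-act",  # placeholder URL
    ],
}
```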
How does tagging work?¶
The tagger applies regex-based rules to identify mentions of:
- Child rights
- LGBTQ+ rights
- AI and automation
- Privacy and data protection
- Digital policy
- Online rights
- And more...
Tags are versioned (v1, v2, v3, digital, queerai) allowing comparison across rule sets.
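In spirit, regex-based tagging works like this sketch. The patterns and tag names below are illustrative, not the project's actual rule set:

```python
import re

# Illustrative rules: each tag maps to one compiled pattern
TAG_RULES = {
    "child_rights": re.compile(r"\bchild(?:ren)?(?:'s)? rights?\b", re.I),
    "privacy": re.compile(r"\b(?:privacy|data protection)\b", re.I),
}

def tag_document(text: str) -> list[str]:
    """Return the tags whose pattern matches anywhere in the text."""
    return [tag for tag, pattern in TAG_RULES.items() if pattern.search(text)]
```

Because each rule set lives in its own versioned config, the same corpus can be re-tagged under v1 and v3 and the results compared directly.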
What tag versions are available?¶
The pipeline supports multiple tag configurations:
- latest - Alias for the most current version (currently v3) - recommended
- v1 - Original basic tags
- v2 - Expanded tag set
- v3 - Current comprehensive tags
- digital - Digital rights focused tags
- queerai - QueerAI conference tags
Use with: python pipeline_runner.py --source au_policy --tags-version latest
Can I add my own tags?¶
Yes! Edit configs/tags_v3.json (or create a new version) and add your regex patterns:
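A new entry might look something like this; the shape below is illustrative, so mirror an existing entry in configs/tags_v3.json rather than copying this exact schema:

```json
{
  "my_custom_tag": {
    "patterns": ["digital literacy", "online education"],
    "description": "Mentions of digital literacy programmes"
  }
}
```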
Then run: python pipeline_runner.py --tags-version v3
Technical Questions¶
Why Python 3.12 specifically?¶
The project uses modern Python features available in 3.12. While it may work with 3.10+, we only test and support 3.12.
Can I run this on Windows?¶
Yes! The code works on Windows, Mac, and Linux. Paths are handled with os.path for cross-platform compatibility.
Do I need Selenium/ChromeDriver?¶
No, unless you want to use Selenium scrapers (_sel variants). The standard scrapers use requests and work without browser drivers.
How much bandwidth does scraping use?¶
Varies by source:
- Small collection: 10-50 MB
- Large collection: 100-500 MB
- Full multi-source scrape: 1-5 GB
The pipeline skips existing files, so re-runs use minimal bandwidth.
Can I run this in the cloud?¶
Yes! Works on:
- AWS EC2
- Google Cloud Compute
- Azure VMs
- DigitalOcean Droplets
- Any Linux server
Just ensure Python 3.12 is installed and you have sufficient disk space.
Does it use a database?¶
Currently uses JSON files (metadata.json) for simplicity. A future version may migrate to PostgreSQL for better performance at scale.
Data & Privacy¶
What data does this collect?¶
The pipeline downloads publicly available documents from government and UN websites. No personal data is collected from users.
Is the scraped data private?¶
The documents are already public (published by governments/UN). However, be mindful of:
- Where you store the data
- Who has access to your machine
- Data retention policies
See DATA_GOVERNANCE.md for details.
Can I share the scraped documents?¶
The documents themselves are typically public domain (government/UN publications). However:
- Check the original source's terms
- Respect copyright if applicable
- Attribute the original publishers
Your compiled dataset (scorecard, tags, analysis) is licensed CC BY 4.0 (requires attribution).
How often is the scorecard data updated?¶
Manually updated as new information becomes available. The scorecard_diff.py module monitors sources for changes and alerts when updates are needed.
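Change monitoring of this kind is commonly done by fingerprinting each source page and comparing against the last-seen value; a minimal sketch (not the actual scorecard_diff.py implementation):

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Stable fingerprint of a source page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_changed(new_text: str, last_fingerprint: str) -> bool:
    """True if the page content differs from the stored fingerprint."""
    return content_fingerprint(new_text) != last_fingerprint
```

When a change is flagged, a maintainer reviews the source and updates the affected indicator by hand.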
Are the scorecard sources reliable?¶
We aim to use authoritative sources (UNESCO, UNCTAD, ILGA, UNICEF, etc.). Each indicator includes the source URL for verification. If you find an error, please report it.
Contributing & Development¶
Can I contribute?¶
Yes! See CONTRIBUTING.md for guidelines. Contributions welcome:
- Bug reports
- New scrapers
- Documentation improvements
- Test coverage
- Visualization ideas
I found a bug. What should I do?¶
- Check if it's already reported: Issues
- Review FIRST_RUN_ERRORS.md
- If not resolved, open a new issue with details
How do I add a new scraper?¶
See SCRAPER_STRUCTURE.md for a template and guide.
Basic steps:
- Create scrapers/new_source.py
- Implement a scrape() function
- Add it to SCRAPER_MAP in pipeline_runner.py
- Add tests
- Submit pull request
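The steps above can be skeletoned like this. The signature, directory layout, and registration details are assumptions for illustration; SCRAPER_STRUCTURE.md has the authoritative template:

```python
# scrapers/new_source.py (hypothetical skeleton)
import os
import urllib.request

BASE_URL = "https://example.org/documents/"  # placeholder source URL
OUT_DIR = os.path.join("data", "raw", "new_source")

def scrape() -> list[str]:
    """Download this source's documents and return the saved file paths."""
    os.makedirs(OUT_DIR, exist_ok=True)
    saved = []
    for name in ["report-2024.pdf"]:  # in practice, discovered from the site
        dest = os.path.join(OUT_DIR, name)
        if not os.path.exists(dest):  # skip files already downloaded
            urllib.request.urlretrieve(BASE_URL + name, dest)
        saved.append(dest)
    return saved
```

The module would then be registered in pipeline_runner.py, presumably along the lines of `SCRAPER_MAP["new_source"] = scrape`.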
How can I test my changes?¶
# Run all tests
pytest tests/ -v
# Run specific test
pytest tests/test_validators.py -v
# With coverage
pytest tests/ --cov
Troubleshooting¶
The scraper isn't downloading anything. Why?¶
Common causes:
- Already downloaded: Pipeline skips existing files (check data/raw/&lt;source&gt;/)
- Network issues: Firewall, proxy, or connection problems
- Source changed: Website structure may have changed
- Robots.txt blocking: Some sites block automated access
Check logs in logs/ for error messages.
Processing fails with "File not found" error¶
Run python init_project.py to ensure all directories exist.
Tests are failing¶
Common reasons:
- Pre-commit not installed: Run pip install pre-commit && pre-commit install
- Dependencies outdated: Run pip install --upgrade -r requirements.txt
- Python version: Ensure Python 3.12 is active
- Working directory: Run from the project root, not a subdirectory
I'm getting import errors¶
Ensure you're running commands from the project root directory (where pipeline_runner.py is located), not from subdirectories.
The website/visualization isn't working¶
The website is static and generated from docs. If it's not working:
- Ensure MkDocs is installed: pip install mkdocs mkdocs-material
- Build locally: mkdocs serve
- Check the mkdocs.yml configuration
Research & Citations¶
How should I cite this project?¶
Use the format in CITATION.cff:
Vollmer, S.C. (2025). DigitalChild: Human Rights Data Pipeline for Child
and LGBTQ+ Digital Protection.
Available at: https://github.com/MissCrispenCakes/DigitalChild
ORCID: 0000-0002-3359-2810
Are there published papers using this?¶
Check the project website at grimdata.org for latest publications.
Can I use this for my thesis/dissertation?¶
Absolutely! That's an intended use case. Please cite the project and consider contributing back improvements.
How can I get involved in research collaborations?¶
Open a discussion or reach out via the website contact form.
Future Development¶
What features are planned?¶
See ROADMAP.md for the full roadmap. Highlights:
- Recommendations extraction (NLP-based)
- Timeline visualizations
- Comparison analytics
- Interactive research dashboard
- Global expansion (Europe, Asia, Americas)
When will the research dashboard be ready?¶
Target: Late 2026. It's in Phase 4 of the roadmap. Focus right now is on completing Phase 3 (advanced processing).
Can I request a feature?¶
Yes! Open a feature request issue or discussion. No guarantees, but we're open to suggestions that align with the project mission.
Contact & Support¶
How do I get help?¶
- Documentation: Check the documentation
- FAQ: This page
- Issues: Search existing issues
- Discussions: Ask in discussions
Is there a mailing list or community forum?¶
Not yet. Use GitHub Discussions for now. A community forum may be added in the future.
Who maintains this project?¶
This project is maintained part-time by one person. Please be patient with response times!
How can I support the project?¶
- ⭐ Star the repo on GitHub
- 📢 Share it with researchers in your network
- 🐛 Report bugs and issues
- 💻 Contribute code or documentation
- 📝 Cite it in your publications
- 💰 Consider sponsoring (if/when GitHub Sponsors is enabled)
Didn't find your answer? Open a discussion or issue.
Last updated: January 2026