Metadata Schema¶
This document defines the complete schema for data/metadata/metadata.json.
Top-Level Structure¶
Project Identity¶
{
"project_identity": {
"name": "GRIMdata / LittleRainbowRights",
"domains": [
"https://GRIMdata.org",
"https://LittleRainbowRights.com"
],
"note": "Pipeline for analyzing child & LGBTQ+ digital protections."
}
}
Document Schema¶
Core Fields¶
{
"id": "AU_Digital_Compact_2024.pdf",
"source": "au_policy",
"file_type": "PDF",
"ingestion_method": "scraper",
"last_processed": "2025-08-28T15:22:00Z"
}
Field Descriptions:
- id (string, required): Unique document identifier (usually filename)
- source (string, required): Source scraper or ingestion method
  - Examples: au_policy, ohchr, upr, unicef, manual
- file_type (string): File format
  - Examples: PDF, DOCX, HTML, TXT
- ingestion_method (string): How document was obtained
  - Examples: scraper, manual, api, url_dict
- last_processed (string, ISO 8601): Last processing timestamp
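As a rough illustration of how the core fields fit together, the sketch below assembles a record from a file path. The helper name build_core_record and its parameters are hypothetical and not part of the pipeline; deriving file_type from the file extension is an assumption made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_core_record(path, source, ingestion_method, now=None):
    """Hypothetical helper: assemble the core fields for one document."""
    file_path = Path(path)
    # Use an injected time when given (handy for tests), else the current UTC time.
    now = now or datetime.now(timezone.utc)
    return {
        "id": file_path.name,  # the id is usually the filename
        "source": source,
        # Assumption: the file type is the upper-cased extension, e.g. "PDF".
        "file_type": file_path.suffix.lstrip(".").upper(),
        "ingestion_method": ingestion_method,
        # ISO 8601 timestamp in UTC with a trailing "Z".
        "last_processed": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


record = build_core_record("downloads/AU_Digital_Compact_2024.pdf", "au_policy", "scraper")
print(record["id"], record["file_type"])
```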
Geographic Fields¶
{
"country": "African_Union",
"country_raw": "African Union",
"country_iso": "AU",
"region": "Africa",
"region_raw": "Sub-Saharan Africa"
}
Field Descriptions:
- country (string): Normalized country name (underscores, no special chars)
- country_raw (string): Original country name as extracted
- country_iso (string, optional): ISO 3166-1 alpha-2 code
  - Examples: KE (Kenya), NG (Nigeria), ZA (South Africa)
- region (string): Normalized region name
- region_raw (string): Original region string as extracted
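The relationship between country_raw and country can be sketched as follows. This is one plausible reading of "underscores, no special chars", not the pipeline's actual normalizer; the function name normalize_name and the ASCII-folding step are assumptions.

```python
import re
import unicodedata


def normalize_name(raw):
    """Hypothetical normalizer: ASCII-fold, drop punctuation, join words with underscores."""
    # Decompose accented characters, then discard anything outside ASCII.
    folded = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    # Turn every run of non-alphanumeric characters into a single space, then split.
    words = re.sub(r"[^A-Za-z0-9]+", " ", folded).split()
    return "_".join(words)


country_raw = "African Union"
geo = {
    "country": normalize_name(country_raw),  # normalized form used for matching
    "country_raw": country_raw,              # original string kept for provenance
}
print(geo)
```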
Temporal Fields¶
Field Descriptions:
- year (integer, optional): Document publication year (1900-2100)
- year_extracted_from (string, optional): Extraction source
  - Examples: filename, first_page, metadata, url
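To make the two temporal fields concrete, here is a minimal sketch of the filename case only. The function year_from_filename is hypothetical, and the rule of accepting the first standalone four-digit number in the 1900-2100 range is an assumption rather than the pipeline's documented behavior.

```python
import re

YEAR_MIN, YEAR_MAX = 1900, 2100
# A four-digit run that is not part of a longer number (so "20240115" is skipped).
_YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def year_from_filename(filename):
    """Return (year, "filename") for the first in-range year found, else None."""
    for match in _YEAR_RE.finditer(filename):
        year = int(match.group(1))
        if YEAR_MIN <= year <= YEAR_MAX:
            return year, "filename"
    return None


doc = {"id": "Kenya_UPR_Report_2020.pdf"}
found = year_from_filename(doc["id"])
if found is not None:
    # Both fields are optional, so they are only set when a year was found.
    doc["year"], doc["year_extracted_from"] = found
print(doc)
```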
Classification Fields¶
Field Descriptions:
- doc_type (string, optional): Document type classification
  - Examples: Policy, Law, Treaty, UPR, Observation, Report, Research
Tags History¶
{
"tags_history": [
{
"tags": ["AI", "DigitalPolicy", "ChildRights"],
"version": "tags_v3",
"timestamp": "2025-08-28T15:22:00Z"
},
{
"tags": ["AI", "ChildRights"],
"version": "tags_v1",
"timestamp": "2025-08-20T10:15:00Z"
}
]
}
Field Descriptions:
- tags_history (array): Historical record of tag applications
  - tags (array of strings): Tags matched for this version
  - version (string): Tag config version used
    - Examples: tags_v1, tags_v2, tags_v3, tags_digital, latest
  - timestamp (string, ISO 8601): When tags were applied
Notes:
- Multiple entries allow comparison across tag versions
- Most recent entry typically at index 0
- Empty array [] if document hasn't been tagged
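The notes above suggest three common operations: recording a new entry at index 0, finding the most recent entry, and comparing tag versions. A minimal sketch follows; the helper names (record_tags, latest_entry, tag_diff) are hypothetical. Because the newest entry is only "typically" first, latest_entry compares timestamps rather than trusting the order.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    # fromisoformat() on older Python versions rejects a trailing "Z".
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def record_tags(doc, tags, version, when=None):
    """Insert a new entry at index 0, creating the array if missing or null."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "tags": list(tags),
        "version": version,
        "timestamp": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    history = doc.get("tags_history") or []
    history.insert(0, entry)
    doc["tags_history"] = history
    return entry


def latest_entry(history):
    """Return the entry with the newest timestamp, or None for an empty history."""
    if not history:
        return None
    return max(history, key=lambda e: parse_ts(e["timestamp"]))


def tag_diff(history, old_version, new_version):
    """Compare the tag sets of two config versions (newest entry per version wins)."""
    by_version = {}
    for entry in sorted(history, key=lambda e: parse_ts(e["timestamp"])):
        by_version[entry["version"]] = set(entry["tags"])
    old, new = by_version[old_version], by_version[new_version]
    return {"added": sorted(new - old), "removed": sorted(old - new)}


history = [
    {"tags": ["AI", "DigitalPolicy", "ChildRights"], "version": "tags_v3",
     "timestamp": "2025-08-28T15:22:00Z"},
    {"tags": ["AI", "ChildRights"], "version": "tags_v1",
     "timestamp": "2025-08-20T10:15:00Z"},
]
print(tag_diff(history, "tags_v1", "tags_v3"))
```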
Recommendations History¶
{
"recommendations_history": [
{
"recommendations": [
"The Committee recommends that the State Party adopt...",
"urges the Government to ensure..."
],
"version": "recs_v1",
"timestamp": "2025-08-28T16:00:00Z"
}
]
}
Field Descriptions:
- recommendations_history (array): Historical record of recommendations extraction
  - recommendations (array of strings): Extracted recommendation texts
  - version (string): Recommendations config version
  - timestamp (string, ISO 8601): When recommendations were extracted
Notes:
- Used for treaty body documents, UPR reports, observations
- Empty array [] if no recommendations found or not yet processed
Scorecard Integration¶
{
"scorecard": {
"matched_country": "Kenya",
"enriched_at": "2025-01-15T10:30:00Z",
"indicators": {
"AI_Policy_Status": {
"value": "Draft policy under development (2024)",
"source": "https://unesco.org/ai-policy-observatory/kenya"
},
"Data_Protection_Law": {
"value": "In force - Data Protection Act 2019",
"source": "https://unctad.org/data-protection-tracker"
},
"LGBTQ_Legal_Status": {
"value": "Criminalized - Penal Code Sections 162-165",
"source": "https://ilga.org/kenya-profile"
}
}
}
}
Field Descriptions:
- scorecard (object, optional): Country-level human rights indicators
  - matched_country (string): Country name matched in scorecard
  - enriched_at (string, ISO 8601): When scorecard data was added
  - indicators (object): Dictionary of indicator data
    - Each indicator has:
      - value (string): Indicator status/description
      - source (string): Source URL for verification
Available Indicators (10 per country):
- AI_Policy_Status - National AI policy/strategy status
- Data_Protection_Law - Data protection legislation status
- LGBTQ_Legal_Status - Legal status of LGBTQ+ rights
- Child_Online_Protection - Child protection laws and policies
- SIM_Biometric - SIM card registration requirements
- Encryption_Backdoors - Government encryption/backdoor requirements
- Promotion_Propaganda - LGBTQ+ promotion/propaganda laws
- DPA_Independence - Data Protection Authority independence
- Content_Moderation - Content moderation legal framework
- Age_Verification - Age verification requirements
Notes:
- Scorecard is added during enrichment step
- Only present if document has a country field AND country exists in scorecard
- 194 countries currently tracked in data/scorecard/scorecard_main_presentation.xlsx (canonical file)
- 2,543 source URLs tracked and validated
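The presence rule in the notes (a scorecard block appears only when the document has a country and that country exists in the scorecard) might look roughly like this. Everything here is a sketch: enrich_with_scorecard, get_indicator, and the in-memory scorecard_by_country dictionary are hypothetical, and keying the scorecard by the same country string stored on the document is an assumption.

```python
from datetime import datetime, timezone


def enrich_with_scorecard(doc, scorecard_by_country, now=None):
    """Attach a scorecard block if the document's country is tracked; return True if added."""
    country = doc.get("country")
    indicators = scorecard_by_country.get(country) if country else None
    if indicators is None:
        return False  # no country field, or country not in the scorecard
    now = now or datetime.now(timezone.utc)
    doc["scorecard"] = {
        "matched_country": country,
        "enriched_at": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "indicators": indicators,
    }
    return True


def get_indicator(doc, name):
    """Return (value, source) for one indicator, or None if it is unavailable."""
    indicators = (doc.get("scorecard") or {}).get("indicators") or {}
    indicator = indicators.get(name)
    if indicator is None:
        return None
    return indicator["value"], indicator["source"]


# Illustrative in-memory scorecard, reusing one indicator from the example above.
scorecard_by_country = {
    "Kenya": {
        "Data_Protection_Law": {
            "value": "In force - Data Protection Act 2019",
            "source": "https://unctad.org/data-protection-tracker",
        }
    }
}
doc = {"id": "Kenya_UPR_Report_2020.pdf", "source": "upr", "country": "Kenya"}
if enrich_with_scorecard(doc, scorecard_by_country):
    print(get_indicator(doc, "Data_Protection_Law"))
```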
Complete Example¶
{
"project_identity": {
"name": "GRIMdata / LittleRainbowRights",
"domains": [
"https://GRIMdata.org",
"https://LittleRainbowRights.com"
],
"note": "Pipeline for analyzing child & LGBTQ+ digital protections."
},
"documents": [
{
"id": "Kenya_UPR_Report_2020.pdf",
"source": "upr",
"country": "Kenya",
"country_raw": "Kenya",
"country_iso": "KE",
"region": "Africa",
"region_raw": "Sub-Saharan Africa",
"year": 2020,
"year_extracted_from": "filename",
"doc_type": "UPR",
"file_type": "PDF",
"ingestion_method": "scraper",
"tags_history": [
{
"tags": ["ChildRights", "LGBTQ", "Privacy", "OnlineRights"],
"version": "tags_v3",
"timestamp": "2025-08-28T15:22:00Z"
}
],
"recommendations_history": [
{
"recommendations": [
"The Committee recommends that Kenya adopt comprehensive child online protection legislation",
"urges the State to decriminalize consensual same-sex conduct"
],
"version": "recs_v1",
"timestamp": "2025-08-28T16:00:00Z"
}
],
"scorecard": {
"matched_country": "Kenya",
"enriched_at": "2025-01-15T10:30:00Z",
"indicators": {
"AI_Policy_Status": {
"value": "Draft policy under development (2024)",
"source": "https://unesco.org/ai-policy-observatory/kenya"
},
"Data_Protection_Law": {
"value": "In force - Data Protection Act 2019",
"source": "https://unctad.org/data-protection-tracker"
},
"LGBTQ_Legal_Status": {
"value": "Criminalized - Penal Code Sections 162-165",
"source": "https://ilga.org/kenya-profile"
}
}
},
"last_processed": "2025-08-28T15:22:00Z"
}
]
}
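A file shaped like the complete example can be read with the standard json module. The sketch below assumes the top-level keys shown above (project_identity and documents); the helper names load_metadata, documents_for_country, and summarize are hypothetical.

```python
import json
from pathlib import Path

METADATA_PATH = Path("data/metadata/metadata.json")


def load_metadata(path=METADATA_PATH):
    """Read the metadata file into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def documents_for_country(metadata, country):
    """Return all document records whose normalized country matches."""
    return [d for d in metadata.get("documents", []) if d.get("country") == country]


def summarize(doc):
    """Condense one record, tolerating omitted or null optional fields."""
    history = doc.get("tags_history") or []
    return {
        "id": doc["id"],
        "year": doc.get("year"),
        # Assumes the most recent entry is at index 0, as is typical.
        "tags": history[0]["tags"] if history else [],
        "has_scorecard": doc.get("scorecard") is not None,
    }


# Usage, assuming the file exists at the documented path:
# metadata = load_metadata()
# for doc in documents_for_country(metadata, "Kenya"):
#     print(summarize(doc))
```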
Validation¶
Documents are validated using processors/validators.py:
from processors.validators import validate_document_metadata
# Validates:
# - Required fields (id, source)
# - Field types (year must be int, tags_history must be list)
# - Value constraints (year 1900-2100, non-empty strings)
validate_document_metadata(document)
See VALIDATORS_USAGE.md for complete validation guide.
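For readers without the validators module at hand, the three kinds of checks listed above can be approximated in a few lines. This is a standalone sketch, not the implementation of validate_document_metadata; the function check_document and its return value (a list of problem strings) are assumptions made for illustration.

```python
def check_document(doc):
    """Return a list of problems; an empty list means the record passes these checks."""
    problems = []

    # Required fields: id and source must be non-empty strings.
    for field in ("id", "source"):
        value = doc.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: required non-empty string")

    # Field type and value constraint for the optional year.
    year = doc.get("year")
    if year is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(year, int) or isinstance(year, bool):
            problems.append("year: must be an integer")
        elif not 1900 <= year <= 2100:
            problems.append("year: must be between 1900 and 2100")

    # Field type for the tag history.
    tags_history = doc.get("tags_history")
    if tags_history is not None and not isinstance(tags_history, list):
        problems.append("tags_history: must be a list")

    return problems


print(check_document({"id": "Kenya_UPR_Report_2020.pdf", "source": "upr", "year": 2020}))
```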
Notes¶
- All timestamps use ISO 8601 format with UTC timezone
- Fields with _raw preserve original values for provenance
- History arrays allow version comparison and research reproducibility
- Empty arrays [] are valid for history fields if not yet processed
- Optional fields may be omitted entirely or set to null
- Scorecard enrichment is independent of tags/recommendations
- Schema evolves as features are added; maintain backward compatibility
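Since optional fields may be either omitted or null, and history arrays may be empty, consumers generally need to treat those cases alike. A small sketch, with hypothetical helper names drop_nulls and pending_histories:

```python
HISTORY_FIELDS = ("tags_history", "recommendations_history")


def drop_nulls(doc):
    """Return a copy without null-valued keys, so omitted and null fields look the same."""
    return {key: value for key, value in doc.items() if value is not None}


def pending_histories(doc):
    """List the history fields that are missing, null, or still empty."""
    return [field for field in HISTORY_FIELDS if not doc.get(field)]


doc = {
    "id": "AU_Digital_Compact_2024.pdf",
    "source": "au_policy",
    "year": None,            # explicitly null
    "tags_history": [],      # valid, but not yet tagged
}
print(drop_nulls(doc))
print(pending_histories(doc))
```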
Last updated: January 2026