Source Feasibility Checklist
This checklist helps determine whether a source can be scraped or must be ingested manually.
Questions to Ask
- Is there a public URL or portal for the documents?
  - Yes → proceed to scrape.
  - No → manual ingestion required.
- Does the site allow automated scraping?
  - Check robots.txt and terms of service.
  - If blocked, use manual ingestion.
- Is there an API or bulk download option?
  - If yes → preferred over scraping.
- Are documents PDFs, DOCX, HTML, or mixed?
  - PDFs → use pdf_to_text.py.
  - DOCX → use docx_to_text.py.
  - HTML → use html_to_text.py.
  - Mixed → use fallback_handler.py.
- Do file names include year/country?
  - Yes → easier metadata extraction.
  - No → rely on text scanning.
- Are there metadata pages (HTML tables, JSON endpoints)?
  - Yes → scrape for structured metadata.
  - No → metadata must be inferred.
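The robots.txt check in the list above can be partly automated. The following is a minimal sketch using Python's standard `urllib.robotparser`; the function name `is_scraping_allowed`, the sample robots.txt content, and the example.org URLs are hypothetical and only illustrate the idea.

```python
from urllib.robotparser import RobotFileParser


def is_scraping_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content and URLs, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_scraping_allowed(SAMPLE_ROBOTS, "https://example.org/docs/report.pdf"))
print(is_scraping_allowed(SAMPLE_ROBOTS, "https://example.org/private/report.pdf"))
```

A passing robots.txt check does not settle the question on its own: the terms of service still need to be read by a person.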
Output of Feasibility Review
- scrapable: true/false
- api_available: true/false
- file_formats: list
- obstacles: notes
- priority: high/medium/low
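One way to keep review results consistent is to store them in a small typed record. The sketch below is an assumption about how that might look, not a description of any existing code: the class name `FeasibilityReview`, the default values, and the sample values are hypothetical, while the field names follow the output fields listed above.

```python
from dataclasses import asdict, dataclass, field
from typing import List

ALLOWED_PRIORITIES = ("high", "medium", "low")


@dataclass
class FeasibilityReview:
    """Hypothetical record holding the output of one feasibility review."""

    scrapable: bool
    api_available: bool
    file_formats: List[str] = field(default_factory=list)
    obstacles: str = ""          # free-text notes
    priority: str = "medium"     # assumed default; one of ALLOWED_PRIORITIES

    def __post_init__(self) -> None:
        # Reject values outside the three levels named in the checklist.
        if self.priority not in ALLOWED_PRIORITIES:
            raise ValueError(f"priority must be one of {ALLOWED_PRIORITIES}")


# Illustrative values only.
review = FeasibilityReview(
    scrapable=False,
    api_available=True,
    file_formats=["html"],
    obstacles="login required",
    priority="low",
)
print(asdict(review))
```

Validating `priority` at construction time catches typos early, before a record is written to whatever store the pipeline uses.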
Example
- Source: OHCHR Treaty Body Database
- scrapable: true
- api_available: false
- file_formats: [pdf, docx]
- obstacles: pagination, session cookies
- priority: high
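Given a `file_formats` list like the one in this example, the choice of converter script can be sketched as a lookup. The script names come from the checklist. The function name `pick_converter`, the format labels used as keys, and the choice to fall back for unknown or empty inputs are assumptions made for illustration.

```python
from typing import List

# Script names are taken from the checklist; the keys are assumed format labels.
CONVERTERS = {
    "pdf": "pdf_to_text.py",
    "docx": "docx_to_text.py",
    "html": "html_to_text.py",
}
FALLBACK = "fallback_handler.py"


def pick_converter(file_formats: List[str]) -> str:
    """Return the converter script name for a source's list of file formats."""
    # Normalise labels such as ".PDF" or "Pdf" to "pdf".
    formats = {f.lower().lstrip(".") for f in file_formats}
    if len(formats) == 1:
        (only,) = formats
        # Assumption: an unrecognised single format also goes to the fallback.
        return CONVERTERS.get(only, FALLBACK)
    # Mixed (or empty) format lists go to the fallback handler.
    return FALLBACK


print(pick_converter(["pdf", "docx"]))
print(pick_converter(["pdf"]))
```

Under these assumptions, a source listing both pdf and docx counts as mixed and is routed to fallback_handler.py.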