Source Feasibility Checklist

This checklist helps determine whether a source can be scraped or must be ingested manually.


Questions to Ask

  1. Is there a public URL or portal for the documents?
       • Yes → proceed to scrape.
       • No → manual ingestion required.

  2. Does the site allow automated scraping?
       • Check robots.txt and terms of service.
       • If blocked, use manual ingestion.

  3. Is there an API or bulk download option?
       • If yes → preferred over scraping.

  4. Are documents PDFs, DOCX, HTML, or mixed?
       • PDFs → use pdf_to_text.py.
       • DOCX → use docx_to_text.py.
       • HTML → use html_to_text.py.
       • Mixed → use fallback_handler.py.

  5. Do file names include year/country?
       • Yes → easier metadata extraction.
       • No → rely on text scanning.

  6. Are there metadata pages (HTML tables, JSON endpoints)?
       • Yes → scrape for structured metadata.
       • No → metadata must be inferred.
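
The robots.txt check and the format routing in the questions above can be partly automated. The following is a minimal sketch, not the project's actual tooling: the function names allowed_by_robots and pick_converter, the CONVERTERS mapping, and the choice to treat unknown extensions as "mixed" are all assumptions made for illustration. Only the four script names come from the checklist. The robots.txt text is passed in as a string so the check can be run without network access.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Script names are taken from the checklist; the extension mapping is an assumption.
CONVERTERS = {
    ".pdf": "pdf_to_text.py",
    ".docx": "docx_to_text.py",
    ".html": "html_to_text.py",
    ".htm": "html_to_text.py",
}
FALLBACK = "fallback_handler.py"


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def pick_converter(urls: list[str]) -> str:
    """Pick one converter for a source; mixed or unknown formats use the fallback."""
    # Compare extensions case-insensitively, ignoring query strings.
    suffixes = {PurePosixPath(urlparse(u).path).suffix.lower() for u in urls}
    scripts = {CONVERTERS.get(s, FALLBACK) for s in suffixes}
    return scripts.pop() if len(scripts) == 1 else FALLBACK


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(robots, "https://example.org/docs/report.pdf"))
    print(pick_converter(["https://example.org/a.pdf", "https://example.org/b.docx"]))
```

A passing robots.txt check does not settle the question on its own: the terms of service still need to be read by a person before a source is marked as scrapable.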

Output of Feasibility Review

  • scrapable: true/false
  • api_available: true/false
  • file_formats: list of formats found (e.g. pdf, docx, html)
  • obstacles: free-text notes on anything that complicates ingestion
  • priority: high/medium/low

Example

  • Source: OHCHR Treaty Body Database
  • scrapable: true
  • api_available: false
  • file_formats: [pdf, docx]
  • obstacles: pagination, session cookies
  • priority: high
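
If the review output is stored programmatically, one option is a small record type that rejects priority values outside high/medium/low. This is a sketch under assumptions: the class name FeasibilityReview, the to_json helper, the default values, and the inclusion of a source field (taken from the example above, not from the output list) are illustrative choices, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass, field

ALLOWED_PRIORITIES = {"high", "medium", "low"}


@dataclass
class FeasibilityReview:
    """One feasibility review result (hypothetical record type)."""

    source: str
    scrapable: bool
    api_available: bool
    file_formats: list[str] = field(default_factory=list)
    obstacles: str = ""
    priority: str = "medium"  # default is an assumption

    def __post_init__(self) -> None:
        # Reject priorities the checklist does not define.
        if self.priority not in ALLOWED_PRIORITIES:
            raise ValueError(
                f"priority must be high, medium, or low, got {self.priority!r}"
            )

    def to_json(self) -> str:
        """Serialize the review to a JSON string."""
        return json.dumps(asdict(self), indent=2)


# The example from the checklist, expressed as a record.
example = FeasibilityReview(
    source="OHCHR Treaty Body Database",
    scrapable=True,
    api_available=False,
    file_formats=["pdf", "docx"],
    obstacles="pagination, session cookies",
    priority="high",
)

if __name__ == "__main__":
    print(example.to_json())
```

Validating priority at construction time keeps misspelled values out of stored reviews, which matters if the field is later used to sort or filter sources.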