๐ŸŒ Project Domains:

  • https://GRIMdata.org
  • https://LittleRainbowRights.com

Processor Test Run Recipe

This guide shows how to test each processor (PDF, DOCX, HTML, fallback, tagger) independently before running the full pipeline.


1. Test PDF Processor

echo "This is a test PDF with child and AI mentioned." > sample.txt
# Writing PDF output with pandoc requires a PDF engine (pdflatex by default) on PATH.
pandoc sample.txt -o sample.pdf

python -c "from processors import pdf_to_text; pdf_to_text.convert('sample.pdf', 'out')"

Expected:

  • File out/sample.txt created.
  • Contains text from the PDF.
  • Log entry in logs/*pdf_to_text.log.
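
The three expectations above can also be checked with a short script instead of by eye. The following is a minimal sketch using only the standard library. The helper name `check_processor_output` is hypothetical and not part of the `processors` package, and the log file is matched with a glob because this guide only gives the pattern `logs/*pdf_to_text.log`.

```python
import glob
from pathlib import Path


def check_processor_output(out_file, expected_text, log_pattern):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for this recipe, not part of the processors package.
    """
    problems = []
    path = Path(out_file)
    if not path.is_file():
        problems.append(f"missing output file: {out_file}")
    elif expected_text not in path.read_text(encoding="utf-8", errors="replace"):
        problems.append(f"expected text not found in {out_file}")
    # Only the glob pattern is known, so any matching log file counts.
    if not glob.glob(log_pattern):
        problems.append(f"no log file matches {log_pattern}")
    return problems
```

For the PDF step, a call such as `check_processor_output("out/sample.txt", "child and AI", "logs/*pdf_to_text.log")` should return an empty list, assuming the extractor keeps that phrase on a single line.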

2. Test DOCX Processor

echo "This is a test DOCX mentioning youth and privacy." > sample_docx.txt
pandoc sample_docx.txt -o sample.docx

python -c "from processors import docx_to_text; docx_to_text.convert('sample.docx', 'out')"

Expected:

  • File out/sample.txt created (the same path as in step 1, so the earlier output is overwritten).
  • Contains the text from the DOCX.
  • Log entry in logs/*docx_to_text.log.

3. Test HTML Processor

echo '<html><body><p>This HTML mentions LGBT and data protection.</p></body></html>' > sample.html

python -c "from processors import html_to_text; html_to_text.convert('sample.html', 'out')"

Expected:

  • File out/sample.txt created.
  • Contains text: “This HTML mentions LGBT and data protection.”
  • Log entry in logs/*html_to_text.log.
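
For an independent cross-check of out/sample.txt, the same HTML can be reduced to text with Python's standard `html.parser` module. This is only a reference sketch for comparison. It is not how `html_to_text` is implemented (its internals are not shown in this guide), and the names `_TextCollector` and `extract_text` are made up for the example.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Passing the contents of sample.html to `extract_text` gives a string to compare against out/sample.txt. Whitespace and line breaks may differ from what the real processor writes.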

4. Test Fallback Handler

python -c "from processors import fallback_handler; print(fallback_handler.process_with_fallback('sample.html', 'out'))"

Expected:

  • Prints path to extracted text file.
  • Log entries show fallback attempts.
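
This guide does not show how `process_with_fallback` picks its handlers. As a rough mental model of what the "fallback attempts" in the log might correspond to, here is a hypothetical sketch of a try-in-order chain. The function name `process_with_fallback_sketch`, the `(name, callable)` handler list, and the log messages are all assumptions for illustration, not the project's actual code or log format.

```python
import logging

logger = logging.getLogger("fallback_sketch")


def process_with_fallback_sketch(path, handlers):
    """Try each (name, handler) pair in order and return the first result.

    `handlers` is a hypothetical list of (name, callable) pairs. Each callable
    takes the input path and either returns an output path or raises.
    """
    failed = []
    for name, handler in handlers:
        try:
            result = handler(path)
        except Exception as exc:  # broad on purpose: any failure moves on to the next handler
            logger.warning("handler %s failed for %s: %s", name, path, exc)
            failed.append(name)
            continue
        logger.info("handler %s succeeded for %s", name, path)
        return result
    raise RuntimeError(f"all handlers failed for {path}: {failed}")
```

In a chain like this, one warning is logged per failed handler before the first success, which is the kind of pattern to look for when reading the real log.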

5. Run Tagger on Extracted Text

python -c "from processors import tagger; text=open('out/sample.txt').read(); print(tagger.apply_tags(text, 'configs/tags_v1.json'))"

Expected:

  • Prints tags like ['ChildRights', 'AI'] or ['LGBTQ', 'Privacy'], depending on which sample was last written to out/sample.txt.
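
To make the expected tags easier to reason about, here is a minimal keyword-matching sketch that reproduces them for the sample sentences in this guide. The rule table `EXAMPLE_RULES` and the function `apply_tags_sketch` are hypothetical. The real `configs/tags_v1.json` and `tagger.apply_tags` may use a different structure and matching strategy.

```python
import re

# Hypothetical rules: tag name -> keywords. The real configs/tags_v1.json
# may be structured differently.
EXAMPLE_RULES = {
    "ChildRights": ["child", "children", "youth"],
    "AI": ["AI", "artificial intelligence"],
    "LGBTQ": ["LGBT", "LGBTQ"],
    "Privacy": ["privacy", "data protection"],
}


def apply_tags_sketch(text, rules):
    """Return tags whose keywords occur in text as whole words, ignoring case."""
    tags = []
    for tag, keywords in rules.items():
        # \b anchors keep "AI" from matching inside words such as "plain".
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.append(tag)
    return tags
```

Under these assumed rules, the PDF sample sentence yields `['ChildRights', 'AI']` and the HTML sample sentence yields `['LGBTQ', 'Privacy']`, matching the tags listed above.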

6. Run Tests

pytest tests/ -v

Expected:

  • test_tagger.py, test_logging.py, test_fallback_handler.py, and test_metadata.py all pass.