Project Domains:
- https://GRIMdata.org
- https://LittleRainbowRights.com
Processor Test Run Recipe
This guide shows how to test each processor (PDF, DOCX, HTML, fallback, tagger) independently before running the full pipeline.
1. Test PDF Processor
echo "This is a test PDF with child and AI mentioned." > sample.txt
pandoc sample.txt -o sample.pdf
python -c "from processors import pdf_to_text; pdf_to_text.convert('sample.pdf', 'out')"
Expected:
- File out/sample.txt created.
- Contains text from the PDF.
- Log entry in logs/*pdf_to_text.log.
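The three expectations above can be checked with a small script rather than by eye. The following is a minimal sketch and not part of the project: check_conversion is a hypothetical helper, and the phrase it searches for ("test PDF") is an assumption about how the sample sentence survives PDF extraction.

```python
from pathlib import Path


def check_conversion(out_file, expected_phrase, log_pattern, log_dir="logs"):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for this recipe, not part of the processors package.
    """
    problems = []
    out_path = Path(out_file)
    if not out_path.is_file():
        problems.append(f"missing output file: {out_path}")
    elif expected_phrase not in out_path.read_text(encoding="utf-8", errors="replace"):
        problems.append(f"expected phrase not found in {out_path}")

    # Path.glob accepts shell-style wildcards such as *pdf_to_text.log
    log_root = Path(log_dir)
    if not log_root.is_dir() or not any(log_root.glob(log_pattern)):
        problems.append(f"no log file matching {log_root}/{log_pattern}")
    return problems


if __name__ == "__main__":
    for problem in check_conversion("out/sample.txt", "test PDF", "*pdf_to_text.log"):
        print(problem)
```

The same helper can be reused for the DOCX and HTML steps by changing the phrase and the log pattern.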
2. Test DOCX Processor
echo "This is a test DOCX mentioning youth and privacy." > sample_docx.txt
pandoc sample_docx.txt -o sample.docx
python -c "from processors import docx_to_text; docx_to_text.convert('sample.docx', 'out')"
Expected:
- File out/sample.txt created.
- Contains text from DOCX.
- Log entry in logs/*docx_to_text.log.
3. Test HTML Processor
echo '<html><body><p>This HTML mentions LGBT and data protection.</p></body></html>' > sample.html
python -c "from processors import html_to_text; html_to_text.convert('sample.html', 'out')"
Expected:
- File out/sample.txt created.
- Contains text: "This HTML mentions LGBT and data protection."
- Log entry in logs/*html_to_text.log.
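To have a reference value to compare out/sample.txt against, the visible text of sample.html can be extracted independently with the standard library. This is a sketch for building that reference only; it makes no claim about how html_to_text works internally, and TextCollector and visible_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><body><p>This HTML mentions LGBT and data protection.</p></body></html>"
print(visible_text(sample))  # prints: This HTML mentions LGBT and data protection.
```

The processor's output may differ in whitespace, so comparing after stripping both sides is a reasonable choice.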
4. Test Fallback Handler
python -c "from processors import fallback_handler; print(fallback_handler.process_with_fallback('sample.html', 'out'))"
Expected:
- Prints path to extracted text file.
- Log entries show fallback attempts.
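The log entries in this step come from handlers being tried one after another. As a mental model only, a fallback chain can be sketched as below. This is not the project's fallback_handler: try_in_order, the two toy handlers, and the returned paths are all hypothetical.

```python
import logging

logger = logging.getLogger("fallback_sketch")


def try_in_order(path, handlers):
    """Call each (name, handler) pair until one returns without raising."""
    failed = []
    for name, handler in handlers:
        try:
            result = handler(path)
        except Exception as exc:  # broad on purpose: any failure moves on to the next handler
            logger.warning("handler %s failed for %s: %s", name, path, exc)
            failed.append(name)
            continue
        logger.info("handler %s succeeded for %s", name, path)
        return result
    raise RuntimeError(f"all handlers failed for {path}: {failed}")


def strict_handler(path):
    # Toy handler that only accepts .pdf paths.
    if not path.endswith(".pdf"):
        raise ValueError("not a PDF")
    return "out/from_pdf.txt"


def plain_text_handler(path):
    # Toy handler that accepts anything.
    return "out/from_plain.txt"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    chain = [("pdf", strict_handler), ("plain", plain_text_handler)]
    print(try_in_order("sample.html", chain))
```

With this shape, one warning per failed attempt followed by a single success line is what "fallback attempts" would look like in a log.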
5. Run Tagger on Extracted Text
python -c "from processors import tagger; text=open('out/sample.txt').read(); print(tagger.apply_tags(text, 'configs/tags_v1.json'))"
Expected:
- Prints tags like ['ChildRights', 'AI'] or ['LGBTQ', 'Privacy'] depending on sample text.
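To see why the sample sentences would yield those tags, here is a minimal keyword-matching sketch. It is an illustration under assumptions, not the project's tagger: apply_keyword_tags is a hypothetical function, and the rules below are invented for this example; the real configs/tags_v1.json may use a different schema and different keywords.

```python
import json
import re


def apply_keyword_tags(text, rules):
    """Return tags whose keywords appear in text as whole words, ignoring case."""
    tags = []
    for tag, keywords in rules.items():
        if not keywords:
            continue
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.append(tag)
    return tags


# Hypothetical rules, written only for this sketch.
rules = json.loads("""
{
  "ChildRights": ["child", "children", "youth"],
  "AI": ["AI", "artificial intelligence"],
  "LGBTQ": ["LGBT", "LGBTQ"],
  "Privacy": ["privacy", "data protection"]
}
""")

print(apply_keyword_tags("This is a test PDF with child and AI mentioned.", rules))
# prints: ['ChildRights', 'AI']
print(apply_keyword_tags("This HTML mentions LGBT and data protection.", rules))
# prints: ['LGBTQ', 'Privacy']
```

Whole-word matching is used so that a short keyword such as "AI" does not fire on words like "rainy".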
6. Run Tests
Expected:
- test_tagger.py, test_logging.py, test_fallback_handler.py, and test_metadata.py all pass.