## Describe the change

This PR adds a new `ScanPptx` scanner for extracting metadata and text content from Microsoft PowerPoint OOXML files (.pptx, .pptm, .potx, .ppsx).

### Summary

- `ScanPptx` scanner that extracts document metadata (author, title, creation date, etc.), slide/word/image counts, hyperlinks, the full text of speaker notes, and optionally the full text content
- Uses the `python-pptx` library (v1.0.2) for reliable OOXML parsing
- Mirrors the `ScanDocx` scanner for consistency

### New dependency

- `python-pptx==1.0.2`

### Motivation

PPTX files were previously only processed by `ScanZip` (extracting internal XML files) without extracting the actual presentation content. This scanner enables direct extraction of slide text, which is valuable for content analysis and for detecting phishing lures embedded in PowerPoint presentations.

### Files changed

- `src/python/strelka/scanners/scan_pptx.py`
- `build/python/backend/requirements.txt`
- `build/configs/scanners.yaml`
- `build/configs/taste.yara` (`pptx_file` YARA rule)
- `src/python/strelka/tests/test_scan_pptx.py`
- `src/python/strelka/tests/test_scan_pptx_standalone.py`
- `src/python/strelka/tests/fixtures/test.pptx`

## Describe testing procedures

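For a quick sanity check that does not need the Strelka stack, a few of the fields `ScanPptx` reports can be approximated with the standard library alone. The sketch below is an illustration only, not the scanner's code (which uses `python-pptx`). The helper names `summarize_pptx` and `build_demo_pptx` are hypothetical, the word count is a plain whitespace split, and the demo archive is only PPTX-shaped, not a presentation PowerPoint would open.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# OOXML namespaces for core properties, DrawingML text runs, and package relationships.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "rel": "http://schemas.openxmlformats.org/package/2006/relationships",
}
PML_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def summarize_pptx(data: bytes) -> dict:
    """Hypothetical helper: collect a few ScanPptx-like fields from raw PPTX bytes."""
    summary = {
        "title": "",
        "last_modified_by": "",
        "slide_count": 0,
        "word_count": 0,
        "hyperlinks": [],
    }
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "docProps/core.xml" in names:
            core = ET.fromstring(zf.read("docProps/core.xml"))
            summary["title"] = core.findtext("dc:title", default="", namespaces=NS)
            summary["last_modified_by"] = core.findtext(
                "cp:lastModifiedBy", default="", namespaces=NS
            )
        for name in sorted(names):
            if name.startswith("ppt/slides/slide") and name.endswith(".xml"):
                # One slide part per slide; text lives in <a:t> runs.
                summary["slide_count"] += 1
                slide = ET.fromstring(zf.read(name))
                for run in slide.iterfind(".//a:t", NS):
                    summary["word_count"] += len((run.text or "").split())
            elif name.startswith("ppt/slides/_rels/") and name.endswith(".rels"):
                # External hyperlinks are stored as relationships of the slide part.
                rels = ET.fromstring(zf.read(name))
                for rel in rels.iterfind("rel:Relationship", NS):
                    if rel.get("Type") == HYPERLINK_TYPE:
                        summary["hyperlinks"].append(rel.get("Target"))
    return summary


def build_demo_pptx() -> bytes:
    """Build a tiny PPTX-shaped ZIP for the demo (not a fully valid presentation)."""
    core = (
        '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}">'
        "<dc:title>Contract update</dc:title>"
        "<cp:lastModifiedBy>Test Author</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    ).format(cp=NS["cp"], dc=NS["dc"])
    slide = '<p:sld xmlns:p="{p}" xmlns:a="{a}"><a:t>{text}</a:t></p:sld>'
    rels = (
        '<Relationships xmlns="{rel}">'
        '<Relationship Id="rId1" Type="{type}" '
        'Target="https://phishing.example.com/login" TargetMode="External"/>'
        "</Relationships>"
    ).format(rel=NS["rel"], type=HYPERLINK_TYPE)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", core)
        zf.writestr(
            "ppt/slides/slide1.xml",
            slide.format(p=PML_NS, a=NS["a"], text="Please review the contract"),
        )
        zf.writestr(
            "ppt/slides/slide2.xml",
            slide.format(p=PML_NS, a=NS["a"], text="Sign in here"),
        )
        zf.writestr("ppt/slides/_rels/slide2.xml.rels", rels)
    return buf.getvalue()


if __name__ == "__main__":
    print(summarize_pptx(build_demo_pptx()))
```

Running the same helper against the bytes of `src/python/strelka/tests/fixtures/test.pptx` gives a rough cross-check of the scanner's `slide_count` and `hyperlinks` values, though the counts may differ wherever `python-pptx` treats text differently from this simplified walk.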
### Unit tests

    python -m pytest src/python/strelka/tests/test_scan_pptx.py -v
    python -m pytest src/python/strelka/tests/test_scan_pptx_standalone.py -v

### Manual testing

Build and run Strelka:

    docker-compose up --build

Submit a PPTX file:

    strelka-fileshot -c fileshot.yaml /path/to/test.pptx

## Sample output

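The `hyperlinks` entry in the output shown in this section is a click-tracking URL of the kind used in phishing lures: the path carries a percent-encoded redirect target and the fragment is a hex string. As a hedged illustration of the downstream analysis this scanner makes possible, the sketch below unpacks such a link with the standard library. The helper name `unpack_tracking_link` is hypothetical, and the `/click/<target>/<id>#<hex>` layout is an assumption read off this one sample URL, not something the scanner itself decodes.

```python
from urllib.parse import unquote, urlsplit


def unpack_tracking_link(url: str) -> dict:
    """Hypothetical helper: split a click-tracking URL into its embedded parts."""
    parts = urlsplit(url)
    # Assumption: the redirect target is the percent-encoded segment after "/click/".
    _, _, tail = parts.path.partition("/click/")
    encoded_target = tail.split("/", 1)[0]
    try:
        # Assumption: the fragment is hex-encoded ASCII text.
        fragment_text = bytes.fromhex(parts.fragment).decode("ascii")
    except ValueError:
        # Covers both non-hex input and bytes that are not ASCII.
        fragment_text = None
    return {
        "host": parts.hostname,
        "target": unquote(encoded_target),
        "fragment_text": fragment_text,
    }


if __name__ == "__main__":
    sample = (
        "https://test.tracking-domain.example.com/click/"
        "https%3A%2F%2Fphishing.example.com%2Flogin/tracking-id-12345"
        "#6a6f686e2e646f65406578616d706c652e636f6d"
    )
    print(unpack_tracking_link(sample))
```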

    {
      "file": {
        "flavors": {
          "mime": ["application/vnd.openxmlformats-officedocument.presentationml.presentation"],
          "yara": ["pptx_file", "zip_file"]
        },
        "scanners": ["ScanPptx", "ScanZip"]
      },
      "scan": {
        "pptx": {
          "elapsed": 0.045123,
          "flags": [],
          "author": "",
          "category": "",
          "comments": "generated using python-pptx",
          "content_status": "",
          "created": 1359299656,
          "identifier": "",
          "keywords": "",
          "language": "",
          "last_modified_by": "Test Author",
          "modified": 1359299758,
          "revision": 1,
          "subject": "",
          "title": "",
          "version": "",
          "slide_count": 4,
          "word_count": 307,
          "image_count": 1,
          "notes": [
            "Speaker notes for slide 1: Introduction to contract update.",
            "Speaker notes for slide 2: Summary of key changes.",
            "Speaker notes for slide 3: Required steps for completion.",
            "Speaker notes for slide 4: Contact information and support."
          ],
          "hyperlinks": [
            "https://test.tracking-domain.example.com/click/https%3A%2F%2Fphishing.example.com%2Flogin/tracking-id-12345#6a6f686e2e646f65406578616d706c652e636f6d"
          ]
        }
      }
    }

## Checklist