@MSAdministrator MSAdministrator commented Dec 18, 2025

Describe the change

This PR adds a new ScanPptx scanner for extracting metadata and text content from Microsoft PowerPoint OOXML files (.pptx, .pptm, .potx, .ppsx).

Summary

  • Adds a ScanPptx scanner that extracts document metadata (author, title, creation date, etc.), slide/word/image counts, hyperlinks, the full text of speaker notes, and, optionally, the full text content
  • Uses the python-pptx library (v1.0.2) for reliable OOXML parsing
  • Follows the same pattern as the existing ScanDocx scanner for consistency
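
For readers unfamiliar with python-pptx, the kind of traversal described above can be sketched roughly as follows. This is not the code in scan_pptx.py: the function names `summarize_presentation` and `summarize_file`, the returned keys, and the whitespace-based word count are assumptions made for illustration. Only the python-pptx attributes used (`slides`, `shapes`, `has_text_frame`, `text_frame`, `runs`, `hyperlink.address`, `has_notes_slide`, `notes_slide.notes_text_frame`) come from the library.

```python
def summarize_presentation(prs):
    """Collect slide text, speaker notes, and run-level hyperlinks.

    `prs` is expected to behave like a python-pptx Presentation object.
    The output keys are illustrative, not the scanner's actual schema.
    """
    slide_text, notes, hyperlinks = [], [], []
    for slide in prs.slides:
        for shape in slide.shapes:
            # Pictures, connectors, etc. have no text frame.
            if not shape.has_text_frame:
                continue
            slide_text.append(shape.text_frame.text)
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    # address is None when the run carries no link.
                    address = run.hyperlink.address
                    if address:
                        hyperlinks.append(address)
        # Checking has_notes_slide first avoids creating an empty notes slide.
        if slide.has_notes_slide:
            notes_frame = slide.notes_slide.notes_text_frame
            if notes_frame is not None and notes_frame.text.strip():
                notes.append(notes_frame.text)
    text = "\n".join(slide_text)
    return {
        "slide_count": len(prs.slides),
        "word_count": len(text.split()),  # assumption: whitespace-delimited words
        "notes": notes,
        "hyperlinks": hyperlinks,
        "text": text,
    }


def summarize_file(path):
    """Open a .pptx file with python-pptx and summarize it."""
    from pptx import Presentation  # imported lazily; requires python-pptx
    return summarize_presentation(Presentation(path))
```

Calling `summarize_file("/path/to/test.pptx")` would return a dictionary loosely resembling the `scan.pptx` fields in the sample output; links attached to whole shapes (click actions) are not covered by this sketch.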

New dependency

python-pptx==1.0.2

Motivation

PPTX files were previously only processed by ScanZip (extracting internal XML files) without extracting the actual presentation content. This scanner enables direct extraction of slide text, which is valuable for content analysis and detecting phishing lures embedded in PowerPoint presentations.

Files changed

| File | Description |
| --- | --- |
| src/python/strelka/scanners/scan_pptx.py | New scanner |
| build/python/backend/requirements.txt | Added python-pptx dependency |
| build/configs/scanners.yaml | Scanner configuration |
| build/configs/taste.yara | Added pptx_file YARA rule |
| src/python/strelka/tests/test_scan_pptx.py | Unit tests |
| src/python/strelka/tests/test_scan_pptx_standalone.py | Standalone pytest |
| src/python/strelka/tests/fixtures/test.pptx | Test fixture |

Describe testing procedures

Unit tests

python -m pytest src/python/strelka/tests/test_scan_pptx.py -v
python -m pytest src/python/strelka/tests/test_scan_pptx_standalone.py -v

Manual testing

Build and run Strelka

docker-compose up --build

Submit a PPTX file

strelka-fileshot -c fileshot.yaml /path/to/test.pptx

Sample output

{
  "file": {
    "flavors": {
      "mime": ["application/vnd.openxmlformats-officedocument.presentationml.presentation"],
      "yara": ["pptx_file", "zip_file"]
    },
    "scanners": ["ScanPptx", "ScanZip"]
  },
  "scan": {
    "pptx": {
      "elapsed": 0.045123,
      "flags": [],
      "author": "",
      "category": "",
      "comments": "generated using python-pptx",
      "content_status": "",
      "created": 1359299656,
      "identifier": "",
      "keywords": "",
      "language": "",
      "last_modified_by": "Test Author",
      "modified": 1359299758,
      "revision": 1,
      "subject": "",
      "title": "",
      "version": "",
      "slide_count": 4,
      "word_count": 307,
      "image_count": 1,
      "notes": [
        "Speaker notes for slide 1: Introduction to contract update.",
        "Speaker notes for slide 2: Summary of key changes.",
        "Speaker notes for slide 3: Required steps for completion.",
        "Speaker notes for slide 4: Contact information and support."
      ],
      "hyperlinks": [
        "https://test.tracking-domain.example.com/click/https%3A%2F%2Fphishing.example.com%2Flogin/tracking-id-12345#6a6f686e2e646f65406578616d706c652e636f6d"
      ]
    }
  }
}
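
To illustrate why the extracted hyperlinks are useful for phishing analysis, here is a small sketch of how a downstream consumer might unpack the tracking-style link in the sample output. It is not part of ScanPptx; the helper name `unpack_tracking_link` and the assumption that the URL fragment is hex-encoded text are purely illustrative. It uses only the Python standard library.

```python
from urllib.parse import urlsplit, unquote


def unpack_tracking_link(url):
    """Return the URL embedded in the path and the decoded fragment, if any.

    Hypothetical helper: assumes the real destination is a percent-encoded
    path segment and that the fragment may be hex-encoded UTF-8 text.
    """
    parts = urlsplit(url)

    embedded = None
    for segment in parts.path.split("/"):
        candidate = unquote(segment)
        if candidate.startswith(("http://", "https://")):
            embedded = candidate
            break

    decoded_fragment = None
    if parts.fragment:
        try:
            decoded_fragment = bytes.fromhex(parts.fragment).decode("utf-8")
        except ValueError:
            # Not valid hex or not valid UTF-8; leave it undecoded.
            decoded_fragment = None

    return {
        "host": parts.hostname,
        "embedded_url": embedded,
        "decoded_fragment": decoded_fragment,
    }
```

Applied to the hyperlink in the sample output, this yields the embedded URL `https://phishing.example.com/login` and the fragment decoded as `john.doe@example.com`, which suggests the link is personalized per recipient.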

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of and tested my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

socket-security bot commented Dec 18, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | python-pptx@1.0.2 | 99 | 100 | 100 | 100 | 100 |

@MSAdministrator MSAdministrator self-assigned this Dec 18, 2025
@MSAdministrator MSAdministrator changed the title Msadministrator.new.pptx scanner 2 Msadministrator.new.pptx scanner Dec 18, 2025
@MSAdministrator MSAdministrator changed the title Msadministrator.new.pptx scanner New ScanPptx scanner Dec 18, 2025
@MSAdministrator MSAdministrator merged commit d146b0a into main Jan 5, 2026
3 checks passed
@MSAdministrator MSAdministrator deleted the msadministrator.new.pptx_scanner_2 branch January 5, 2026 15:39