Skip to content

Conversation

@keaton-sublime
Copy link
Member

adding:

  • scan_pdf_obj_hash.py
  • requirements.txt to pull in the main branch for the most up-to-date pdf object hashing library
  • updating scanners.yaml to also run this scanner on PDFs

Describe the change
This adds in support for PDF object hashing, a method to create a structural hash for PDF objects (think imphash or ja3). This allows us to match complete matches with the object_hash value and partial matches by matching parts of the hash_string

Describe testing procedures
I build the scanner in our strelka docker instance on a VM, and used oneshot to send the file to the strelka scanner. After that was working I added the fields to the platform and tested there, testing against ~200 PDFs attached to emails.

Sample output

  "pdf_obj_hash": {
      "elapsed": 0.001209,
      "hash_string": "Filter|Filter|Filter|Page|ProcSet|N|None|StructTreeRoot|StructElem|StructElem|StructElem|StructElem|Pages|Catalog|None|Font/TrueType|FontDescriptor|Length1|Font/TrueType|Length|FontDescriptor|Length1|Title|",
      "object_hash": "e8e2199997a36a260df436937ba63f7f"
    },

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of and tested my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

adding:
- scan_pdf_obj_hash.py
- requirements.txt to pull in the main branch for the most up-to-date pdf object hashing library
- updating scanners.yaml to also run this scanner on PDFs
adding a time out
self.event["object_hash"] = obj_hash
self.event["hash_string"] = obj_hash_str
else:
self.event["object_hash"] = False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is just considered 'pythonic', but switching up the data type here gives me pause. Up above, we set it to a string "error" when there's a problem, but here we just set it to false. I'm assuming we're doing this so that we can quickly see that there isn't an object hash or a hash string, but then we'd need to check the string in case it exists to see if its an error or not anyway.

Could we track the error and/or flag presence in a field other than object_hash/hash_string, and only populate those keys in the dict if we have a meaningful value to provide?

limit: 2000
pdf_to_png: True
no_object_extraction: True
'ScanPdfObjHash':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use our existing PDF scanner instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m open to migrating this into the existing scanner, but keeping it separate makes sense to me given that the object hashing has a much narrower scope than the overall PDF scanner. I also like the idea of not impacting existing PDF processes in the scanner (pdf->png) or having the hashing guardrails (timeouts, object thresholds) affect the main PDF scanner. That being said, if you have ideas on how to proceed, I’m all ears.

from hashlib import md5
from pdf_object_hashing import pdf_object as po

class ScanPdfObjHash(strelka.Scanner):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this do something for every object? How long is it and is it expensive? We've observed in the past that naively getting every object is expensive and creates a bunch of noise. See #116 for more context, and we can also chat further internally if needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responded internally

Copy link

@dlynch-sublime dlynch-sublime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good for the comments I raised; I'm assuming you and @rw-access are working out resolutions to his comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants