-
Notifications
You must be signed in to change notification settings - Fork 1
adding support for pdf-object-hashing #146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
adding: - scan_pdf_obj_hash.py - requirements.txt to pull in the main branch for the most up-to-date pdf object hashing library - updating scanners.yaml to also run this scanner on PDFs
adding a time out
| self.event["object_hash"] = obj_hash | ||
| self.event["hash_string"] = obj_hash_str | ||
| else: | ||
| self.event["object_hash"] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is just considered 'pythonic', but switching up the data type here gives me pause. Up above, we set it to a string "error" when there's a problem, but here we just set it to false. I'm assuming we're doing this so that we can quickly see that there isn't an object hash or a hash string, but then we'd need to check the string in case it exists to see if its an error or not anyway.
Could we track the error and/or flag presence in a field other than object_hash/hash_string, and only populate those keys in the dict if we have a meaningful value to provide?
| limit: 2000 | ||
| pdf_to_png: True | ||
| no_object_extraction: True | ||
| 'ScanPdfObjHash': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use our existing PDF scanner instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m open to migrating this into the existing scanner, but keeping it separate makes sense to me given that the object hashing has a much narrower scope than the overall PDF scanner. I also like the idea of not impacting existing PDF processes in the scanner (pdf->png) or having the hashing guardrails (timeouts, object thresholds) affect the main PDF scanner. That being said, if you have ideas on how to proceed, I’m all ears.
| from hashlib import md5 | ||
| from pdf_object_hashing import pdf_object as po | ||
|
|
||
| class ScanPdfObjHash(strelka.Scanner): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this do something for every object? How long is it and is it expensive? We've observed in the past that naively getting every object is expensive and creates a bunch of noise. See #116 for more context, and we can also chat further internally if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
responded internally
dlynch-sublime
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes look good for the comments I raised; I'm assuming you and @rw-access are working out resolutions to his comments
adding:
Describe the change
This adds in support for PDF object hashing, a method to create a structural hash for PDF objects (think imphash or ja3). This allows us to match complete matches with the
object_hashvalue and partial matches by matching parts of thehash_stringDescribe testing procedures
I build the scanner in our strelka docker instance on a VM, and used oneshot to send the file to the strelka scanner. After that was working I added the fields to the platform and tested there, testing against ~200 PDFs attached to emails.
Sample output
Checklist