@jokerale

Added pytesseract support so that flat PDFs (those whose pages are images) can be scanned and their text retrieved. Also added a small check that triggers the OCR function when zero lines of text are found in a PDF.
The libraries used have also been added to the requirements files.
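
For readers following along, the fallback described above could look roughly like the sketch below. It is not the code from this PR. The function names `ocr_pdf_lines` and `extract_lines_with_fallback`, the `extract_text_layer` parameter, and the use of `pdf2image` to render pages are all assumptions made for illustration; only pytesseract is named in the description.

```python
from typing import Callable, Iterable, List


def ocr_pdf_lines(pdf_path: str) -> List[str]:
    """OCR every page of a PDF and return its non-empty text lines.

    Hypothetical helper: assumes pdf2image (with Poppler) and pytesseract
    (with the Tesseract binary) are installed.
    """
    # Imported lazily so the fallback logic below can run without them.
    from pdf2image import convert_from_path  # renders PDF pages to PIL images
    import pytesseract

    lines: List[str] = []
    for page_image in convert_from_path(pdf_path):
        text = pytesseract.image_to_string(page_image)
        lines.extend(line.strip() for line in text.splitlines() if line.strip())
    return lines


def extract_lines_with_fallback(
    pdf_path: str,
    extract_text_layer: Callable[[str], Iterable[str]],
    ocr: Callable[[str], List[str]] = ocr_pdf_lines,
) -> List[str]:
    """Use the PDF's text layer if it has one; otherwise fall back to OCR."""
    # Keep only lines that contain something other than whitespace.
    lines = [line for line in extract_text_layer(pdf_path) if line.strip()]
    if len(lines) == 0:
        # Zero lines of text found: treat the file as a scanned (flat) PDF.
        lines = ocr(pdf_path)
    return lines
```

Passing the text-layer extractor and the OCR function as arguments keeps the zero-lines check testable with stubs, without needing Tesseract or a sample PDF.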

@Wazzabeee
Owner

Buonasera 🇮🇹

Thanks for adding this! Could you rebase your PR onto the latest commits of the repo? I added some code-quality checks and reformatting; they should not cause conflicts with your code. Also, before merging this PR, it would be nice to add a PDF that contains only scanned text, so that the example supports and works with scanned text.

If you know how, it would be perfect if you could add one or more tests covering your changes.

I created this project a long time ago so I know the current code is not tested properly, but I will gradually take the time to add tests for all my functions.

Thanks in advance!

jokerale and others added 2 commits May 8, 2024 16:26
feat: add pre commit to repo

fix: remove init

fix: scripts structure

Bump black from 23.11.0 to 24.3.0

Bumps [black](https://github.com/psf/black) from 23.11.0 to 24.3.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@23.11.0...24.3.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Bump nltk from 3.6.3 to 3.6.6

Bumps [nltk](https://github.com/nltk/nltk) from 3.6.3 to 3.6.6.
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](nltk/nltk@3.6.3...3.6.6)

---
updated-dependencies:
- dependency-name: nltk
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

fix: readme & saving path

feat: add setup changelog and version (Wazzabeee#8)

First release

fix: rename package for pypi (Wazzabeee#9)

rename package from plagiarism-checker to plagiarism-detector

fix: rename pypi package (Wazzabeee#10)

fix: rename files with copy-spotter name

feat: add tags and automatic versioning
@jokerale
Author

jokerale commented May 8, 2024

Bonsoir 🇫🇷

I've tried to rebase the PR with the latest commits.
Please let me know if this is the right way.

I'll add some tests for the OCR function, using the added PDF, in a future PR.

Best


2 participants