Writing Ocr Text And Re-Reading Position Differences #189

@ImMyQuest

Hi Guys,

I'm seeing differences in line positioning between OCR'd text as I write it to a PDF and the same text when I read it back from that PDF. As a result, the re-read text is displayed with extra line feeds. I'm not sure whether the way I'm writing the text is the cause. I'm using a TextWriter and its Append method like this: tw.Append(spn.Origin, spn.Text, font, spn.Size), where spn is a Span and font is created with New Font(spn.Font) at the start of the loop.

Attached are the scanned PDF and a screenshot comparing (part of) the OCR'd text with the re-read text, along with 3 text files containing the data after OCR, before writing, and after re-reading. Paragraph 3) splits after the hyphen on the first line and after "still a" on the second line. The data for the second line shows:

before writing, the Y1 values are consistent:
are Blk:7 Ln:5 Org:Point(293.52878, 345.53262) Siz:13.4508705 Y0:329.45575 Y1:345.5487
still Blk:7 Ln:6 Org:Point(315.02258, 345.53262) Siz:10.6641865 Y0:329.45575 Y1:345.5487
a Blk:7 Ln:7 Org:Point(346.25247, 345.53262) Siz:7.007743 Y0:329.45575 Y1:345.5487
problem, they Blk:7 Ln:8 Org:Point(347.7798, 345.53262) Siz:14.379586 Y0:329.45575 Y1:345.5487

after re-reading, the Y1 values vary:
are Bl:7 Ln:1 Y0:331.15366 Y1:349.47372
still Bl:7 Ln:1 Y0:334.1326 Y1:348.65723
a Bl:7 Ln:2 Y0:338.04135 Y1:347.5859
problem, Bl:7 Ln:2 Y0:330.16086 Y1:349.74585
they Bl:7 Ln:2 Y0:330.16086 Y1:349.74585

Currently, I'm testing whether the difference in Y1 values between words is greater than 2 and, if so, adding a newline.
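
To make that check concrete, here is a minimal sketch of it, written in Python for brevity rather than the .NET code actually in use. The function name, the (text, Y1) tuple layout, and the choice to compare each word with the word immediately before it are assumptions for illustration. The Y1 values are copied from the data above.

```python
from typing import List, Optional, Tuple

# (text, y1) pairs. This layout is hypothetical; the values come from the
# "before writing" and "after re-reading" listings in this post.
Word = Tuple[str, float]


def split_lines_by_y1(words: List[Word], tolerance: float = 2.0) -> List[List[str]]:
    """Start a new line whenever Y1 differs from the previous word's Y1
    by more than `tolerance` (assumed: consecutive-word comparison)."""
    lines: List[List[str]] = []
    prev_y1: Optional[float] = None
    for text, y1 in words:
        if prev_y1 is None or abs(y1 - prev_y1) > tolerance:
            lines.append([])  # open a new output line
        lines[-1].append(text)
        prev_y1 = y1
    return lines


before_writing = [
    ("are", 345.5487),
    ("still", 345.5487),
    ("a", 345.5487),
    ("problem, they", 345.5487),
]

after_rereading = [
    ("are", 349.47372),
    ("still", 348.65723),
    ("a", 347.5859),
    ("problem,", 349.74585),
    ("they", 349.74585),
]

print(split_lines_by_y1(before_writing))
print(split_lines_by_y1(after_rereading))
```

With the values before writing, every word lands on one line. With the re-read values, the Y1 difference between "a" and "problem," is about 2.16, so the check inserts a break after "still a", which is the split described above.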

My question is: is the way I'm writing out the text causing the differences in Y1 values, and if so, what can I do about it? Is there a better or easier way to write out the text to get consistent results?

Thanks.

doc.pdf
afterOcr.txt
beforeWrite.txt
reReading.txt
Image
