Description
Hi Guys,
I'm seeing differences in line positioning between the OCR'd text I write out to a PDF and the text I read back from that PDF. As a result, the re-read text is displayed with extra line feeds. I'm not sure whether the cause is how I'm writing the text. I'm using a TextWriter and its Append method like this: tw.Append(spn.Origin, spn.Text, font, spn.Size), where spn is a Span and font is created from New Font(spn.Font) at the start of the loop.
Attached are the scanned PDF and a screenshot comparing (part of) the OCR'd text with the re-read text. Also attached are three text files with the data after OCR, before writing, and after re-reading. Paragraph 3) splits after the hyphen on the first line and after "still a" on the second line. The data for the second line shows:
before writing, the Y1 values are consistent:
are Blk:7 Ln:5 Org:Point(293.52878, 345.53262) Siz:13.4508705 Y0:329.45575 Y1:345.5487
still Blk:7 Ln:6 Org:Point(315.02258, 345.53262) Siz:10.6641865 Y0:329.45575 Y1:345.5487
a Blk:7 Ln:7 Org:Point(346.25247, 345.53262) Siz:7.007743 Y0:329.45575 Y1:345.5487
problem, they Blk:7 Ln:8 Org:Point(347.7798, 345.53262) Siz:14.379586 Y0:329.45575 Y1:345.5487
after re-reading, the Y1 values vary:
are Bl:7 Ln:1 Y0:331.15366 Y1:349.47372
still Bl:7 Ln:1 Y0:334.1326 Y1:348.65723
a Bl:7 Ln:2 Y0:338.04135 Y1:347.5859
problem, Bl:7 Ln:2 Y0:330.16086 Y1:349.74585
they Bl:7 Ln:2 Y0:330.16086 Y1:349.74585
Currently, I test whether the difference in Y1 values between consecutive words is greater than 2 and, if so, add a newline.
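For reference, the check described above can be sketched roughly as follows. This is an illustration in Python, not the actual .NET code: the function name `join_words`, the `(text, y1)` tuple layout, and the `max_y1_gap` parameter are assumptions made for the sketch. The Y1 values are copied from the second-line data listed above.

```python
def join_words(words, max_y1_gap=2.0):
    """Join (text, y1) pairs in reading order, starting a new line
    whenever Y1 jumps by more than max_y1_gap from the previous word."""
    lines = []
    current = []
    prev_y1 = None
    for text, y1 in words:
        if prev_y1 is not None and abs(y1 - prev_y1) > max_y1_gap:
            # Y1 moved too far: treat this word as the start of a new line.
            lines.append(" ".join(current))
            current = []
        current.append(text)
        prev_y1 = y1
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


# Y1 values before writing (all spans share the same Y1).
before = [
    ("are", 345.5487),
    ("still", 345.5487),
    ("a", 345.5487),
    ("problem, they", 345.5487),
]

# Y1 values after re-reading (they vary from word to word).
reread = [
    ("are", 349.47372),
    ("still", 348.65723),
    ("a", 347.5859),
    ("problem,", 349.74585),
    ("they", 349.74585),
]

print(join_words(before))
print(join_words(reread))
```

With the re-read values, the jump from "a" (347.5859) to "problem," (349.74585) is about 2.16, just over the threshold, which is where the unwanted line break appears.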
My question is: is the way I'm writing out the text causing the differences in Y1 values, and if so, what can I do about it? Is there a better or easier way to write out the text to get consistent results?
Thanks.
