Failing text recognition after printing to file pdf

desperado · January 2, 2024, 1:42pm

Hello
I have this error with F39 and previous F3x releases: after ‘printing’ a text from a website into a PDF file and opening that file with GNOME’s PDF readers, I noticed that ‘search’ for a text string does not reliably find the text in the PDF. E.g. when I search for the German word ‘versöhnen’ (which is in the PDF several times), it is not detected by the ‘search’ function. This does not happen with ‘external’ PDF readers like ‘Master PDF Editor’ or Adobe Reader (on an MS Windows desktop). Sometimes I assume it might be caused by certain characters like ä, ö, ü. But anyway: this failing text recognition occurs with GNOME’s Evince/Document Viewer only.
Any idea what might cause the missing/failing text recognition after having created a pdf file with ‘print to file’?

Thank you in advance

vekruse · January 2, 2024, 4:12pm

Printing to PDF results in a raster image encapsulated into a PDF file. This means there are only dots in the files and no letters.
If you however do a Save to Pdf, you will get a proper PDF file where the text is stored as text, and thus searchable.

You can also see that the raster image PDF file is much bigger than the proper PDF file.

desperado · January 3, 2024, 1:41pm

Quoting vekruse: “Printing to PDF results in a raster image encapsulated into a PDF file. This means there are only dots in the files and no letters.”

This does not explain why most of the search strings can be found and only some search strings are not found. And it also does not explain why some PDF viewers fail to recognize 100% of text and some do not fail.

vekruse · January 3, 2024, 1:57pm

It would if some PDF viewers have OCR capability that turns a raster image into text.

desperado · January 3, 2024, 2:28pm

I just made another test with the same text and the same PDF viewer (Evince/Document Viewer) but a different browser:

When I use Brave and print the text into a PDF file, it results in a PDF with a size of 75 kB. Text recognition sometimes fails.
When I use Firefox and print the text into a PDF file, it results in a PDF with a size of 46 kB. Text recognition works with no errors.

Possible conclusion: Brave’s ‘Print to file/Save as PDF’ is the cause of the failing text recognition.

gnwiii · January 3, 2024, 2:33pm

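Not always. Some systems create characters like ä, ö, ü by composing the base character with a decoration (a combining mark), so the text stored in the PDF may not match the string typed into the search box. There are also tricks used to make it harder to clone a site or print its content, and those may be in play here.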
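
To illustrate the composition point, here is a minimal Python sketch (not something from this thread's PDFs) that lists the code points of the two ways an ‘ö’ can be stored. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

precomposed = "\u00f6"   # 'ö' stored as a single code point
combined = "o\u0308"     # 'o' followed by a combining diaeresis

print(describe(precomposed))  # ['U+00F6 LATIN SMALL LETTER O WITH DIAERESIS']
print(describe(combined))     # ['U+006F LATIN SMALL LETTER O', 'U+0308 COMBINING DIAERESIS']

# Both render the same on screen, but they are different strings,
# so a plain string comparison (or a naive search) does not match.
print(precomposed == combined)  # False
```

Pasting a word copied from the PDF into `describe()` would show which form the file actually contains.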

[I was once involved with a non-profit that had a web site. Someone registered a similar name that came up before the legit site in searches, cloned the site, and added copyright notices on every page, then asked for money. I would not be surprised if they also used the phony site to distribute malware.]

vekruse · January 3, 2024, 4:23pm

Are we talking about “Print to PDF”, that is, the printing goes through the various CUPS conversion filters? Or are we talking about “Save to PDF”, meaning the data does NOT go through CUPS?

gnwiii · January 3, 2024, 8:16pm

Both. Strange ways of presenting text can prevent (and are sometimes designed to prevent) use of web content without visiting the original site.

desperado · January 4, 2024, 9:53am

After your discussion I noticed that when printing a webpage from the browser there are usually two options:
‘Save as PDF’

or expand ‘Save to PDF’ to get ‘Print using system dialog’ and then ‘Print to File’.

‘Save as PDF’ results in a file that is twice as large as the one from ‘Print to File’. What is the difference between the two options?

gnwiii · January 4, 2024, 11:31am

“Print to file” converts the rendered document to PDF, while “Save as PDF” converts the HTML directly to PDF. With “Print to file”, the rendered document has “screen quality” elements, and decisions like font substitutions are made in the browser, while “Save as PDF” may make different substitutions depending on the PDF renderer. Images that are higher than screen resolution in the HTML source may be downsampled by “Print to file” but saved at the source resolution by “Save as PDF”. To see what you actually get, you can open the PDF in Inkscape. There are also very useful command-line tools (pdfinfo, pdffonts, pdfimages, all from poppler-utils) that will allow you to determine what major elements are in a PDF document.
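
As a rough sketch of how pdffonts could be used here: its table has a ‘uni’ column that says whether a font carries a ToUnicode map, and fonts without one are a plausible (not certain) reason for text that cannot be searched. The Python below assumes poppler-utils is installed; the function names and the `SAMPLE` table are invented for illustration and are not output from the PDFs in this thread.

```python
import shutil
import subprocess

def run_pdffonts(path):
    """Return the pdffonts table for a PDF, or None if pdffonts is not installed."""
    if shutil.which("pdffonts") is None:
        return None
    result = subprocess.run(["pdffonts", path], capture_output=True, text=True, check=True)
    return result.stdout

def fonts_without_unicode_map(pdffonts_output):
    """List font names whose 'uni' column is 'no' in pdffonts output."""
    missing = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header and dashes
        fields = line.split()
        if len(fields) < 7:
            continue
        # Read from the right: ... emb sub uni <object number> <generation>
        emb, sub, uni = fields[-5:-2]
        if uni == "no":
            missing.append(fields[0])
    return missing

# Made-up sample in the layout pdffonts prints (illustrative only).
SAMPLE = (
    "name                 type          encoding    emb sub uni object ID\n"
    "-------------------- ------------- ----------- --- --- --- ---------\n"
    "ABCDEF+DejaVuSans    CID TrueType  Identity-H  yes yes yes     12  0\n"
    "GHIJKL+Roboto        TrueType      WinAnsi     yes yes no      15  0\n"
)

print(fonts_without_unicode_map(SAMPLE))  # ['GHIJKL+Roboto']
```

With a real file, something like `fonts_without_unicode_map(run_pdffonts("article.pdf"))` would do the same check (the file name is hypothetical, and `run_pdffonts` returns None when the tool is missing).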

There are many commercial HTML-to-PDF converters because browser support fails for some documents. I think there is a war between document providers and people trying to reuse content. I suppose there are now efforts to make HTML that can’t be used to train AIs, and these will also break browser conversions.

desperado · January 8, 2024, 9:04pm

Thank you for the detailed explanation.
I often use this feature to save HTML text (newspaper articles) as PDF, with no images. That’s how I noticed the difficulties with text recognition in the created PDF. With your explanations I assume that I should prefer ‘save to PDF’ over ‘print to file’.

gnwiii · January 9, 2024, 12:00am

Yes, but still no guarantee that the PDF will have useful text content.

vekruse · January 9, 2024, 5:01am

For some explanation see for example https://en.wikipedia.org/wiki/Unicode_equivalence.
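
A minimal sketch of what Unicode equivalence means for searching, assuming (this is only one possible cause) that the PDF stores ‘ö’ in decomposed form while the search box sends the precomposed form. The helper `contains` and the sample sentence are made up for this example.

```python
import unicodedata

word = "vers\u00f6hnen"                          # 'versöhnen' with a precomposed ö
composed = unicodedata.normalize("NFC", word)    # ö as one code point
decomposed = unicodedata.normalize("NFD", word)  # o + combining diaeresis

def contains(haystack, needle):
    """Substring search that normalizes both sides to NFC first."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

# Pretend this is text extracted from a PDF that stores the decomposed form.
page_text = "Sie wollen sich " + decomposed + "."

print(len(composed), len(decomposed))   # 9 10
print(composed in page_text)            # False: naive search misses the word
print(contains(page_text, composed))    # True: normalized search finds it
```

A viewer that normalizes both the document text and the query would find the word either way; one that compares raw code points would not.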

desperado · January 9, 2024, 10:11am

I tested once more ‘save to PDF’ with Firefox and Brave. Brave’s created PDF file is approx. twice the size of Firefox’s PDF file. But text recognition in Firefox’s PDF is much better than in Brave’s PDF.
Strange.
