Hello
I have this error with F39 and previous F3x: after having ‘printed’ a text from a website into a pdf file and opened that file with GNOME’s pdf readers, I noticed that ‘search’ for a text string does not reliably find the text in the pdf. E.g. when I search for the German word ‘versöhnen’ (which is in the pdf several times), it is not detected by the ‘search’ function. This does not happen with ‘external’ PDF readers like ‘Master PDF Editor’ or Adobe Reader (on an MS Windows desktop). I sometimes suspect it is caused by certain characters like ä,ö,ü. But anyway: this failing text recognition occurs with GNOME’s Evince/Document Viewer only.
Any idea what might cause the missing/failing text recognition after having created a pdf file with ‘print to file’?
Printing to PDF results in a raster image encapsulated into a PDF file. This means there are only dots in the files and no letters.
If you however do a Save to Pdf, you will get a proper PDF file where the text is stored as text, and thus searchable.
You can also see that the raster image PDF file is much bigger than the proper PDF file.
Printing to PDF results in a raster image encapsulated into a PDF file. This means there are only dots in the files and no letters.
This does not explain why most of the search strings can be found and only some search strings are not found. And it also does not explain why some PDF viewers fail to recognize 100% of text and some do not fail.
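Not always. Some systems create characters like ä, ö, ü, etc. by composing the base character with a separate combining mark (the decoration) instead of using a single precomposed character, so a search for the precomposed form may not match. There are also tricks used to make it harder to clone a site or print content that may be in play here.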
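To illustrate the composed-versus-decomposed point, here is a minimal Python sketch. It assumes the PDF text layer stores ‘ö’ as ‘o’ plus a combining diaeresis (U+0308), while the search box sends the single precomposed character (U+00F6). Whether that is what actually happens in your file, and whether Evince normalizes before comparing, is an assumption I have not verified. The helper name find_normalized is made up for this example.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring search after bringing both strings to the same Unicode form."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


# The same visible word, encoded in two different ways.
composed = "vers\u00f6hnen"      # 'ö' as one precomposed code point
decomposed = "verso\u0308hnen"   # 'o' followed by a combining diaeresis

# A naive comparison treats them as different strings.
print(composed == decomposed)
print(composed in decomposed)

# After normalization the search term is found.
print(find_normalized(decomposed, composed))
```

A viewer that compares raw code points would behave like the naive comparison, and one that normalizes first would behave like find_normalized. That would fit the observation that only words with umlauts go missing and only in some viewers.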
[I was once involved with a non-profit that had a web site. Someone registered a similar name that came up before the legit site in searches, cloned the site, and added copyright notices on every page, then asked for money. I would not be surprised if they also used the phony site to distribute malware.]
Are we talking about “Print to Pdf”, that is, the printing goes through the various cups conversion filters? Or are we talking about “Save to Pdf” meaning the data does NOT go through cups?
“Print to file” converts the rendered document to PDF, while “Save as PDF” converts HTML directly to PDF. The rendered document has “screen quality” elements, and decisions like font substitutions are made in the browser, while “Save as PDF” may make different substitutions depending on the PDF renderer. Images that are higher than screen resolution in the HTML source may be downsampled in “Print to file” but saved in the source resolution using “Save as PDF”. To see what you actually get, you can open the PDF in Inkscape. There are very useful command-line tools: pdfinfo, pdffonts, pdfimages from poppler-utils that will allow you to determine what major elements are in a PDF document.
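As a rough sketch of how the poppler-utils tools mentioned above could be run on one file in one go, here is a small Python wrapper. The function names build_commands and inspect_pdf are invented for this example, and it assumes the tools are on your PATH (tools that are missing are reported rather than run).

```python
import shutil
import subprocess
from typing import Dict, List


def build_commands(pdf_path: str) -> List[List[str]]:
    """Return the poppler-utils command lines to run against one PDF."""
    return [
        ["pdfinfo", pdf_path],             # producer, page count, PDF version
        ["pdffonts", pdf_path],            # fonts used, embedding, Unicode maps
        ["pdfimages", "-list", pdf_path],  # list embedded images, extract nothing
    ]


def inspect_pdf(pdf_path: str) -> Dict[str, str]:
    """Run each tool and collect its output, keyed by tool name."""
    reports = {}
    for cmd in build_commands(pdf_path):
        tool = cmd[0]
        if shutil.which(tool) is None:
            reports[tool] = "not installed"
            continue
        result = subprocess.run(cmd, capture_output=True, text=True)
        # On failure the tools write their message to stderr instead.
        reports[tool] = result.stdout or result.stderr
    return reports
```

Calling inspect_pdf("article.pdf") (the file name is a placeholder) returns one text report per tool. If pdffonts lists no fonts and pdfimages lists one large image per page, the file is probably a raster scan with no text layer. If fonts are listed, the “uni” column of pdffonts shows whether each font carries a ToUnicode map, and a “no” there is one plausible reason for text that displays correctly but cannot be found by search.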
There are many commercial HTML to PDF converters because the browser support fails for some documents. I think there is a war between document providers and people trying to reuse content. I suppose there are now efforts to make HTML that can’t be used to train AIs, and that will also break browser conversions.
Thank you for the detailed explanation.
I often use this feature to save html text files (newspaper articles) into pdf, with no images. That’s how I noticed the difficulties with text recognition within the created pdf. With your explanations I assume that I should prefer ‘save to PDF’ instead of ‘print to file’.
I tested ‘save to PDF’ once more with Firefox and Brave. The PDF file created by Brave is approx. twice the size of Firefox’s PDF file. But text recognition in Firefox’s PDF is much better than in Brave’s PDF.
Strange.