While parsing multiple PDF documents using Docling, I encountered a memory crash error:
std::bad_alloc
Stage preprocess failed
Document sample-tables.pdf failed to convert
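One way to keep a single oversized PDF from aborting a whole batch is to convert the files one at a time and record failures instead of stopping. The following is a minimal sketch, not Docling's own batching logic: `convert_each` and `convert_one` are hypothetical names, and it assumes the native allocation failure reaches Python as a catchable `MemoryError` or `RuntimeError`, which may not hold if the process itself is killed.

```python
import gc
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def convert_each(
    paths: Iterable[Path],
    convert_one: Callable[[Path], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert documents one at a time, recording failures instead of aborting.

    `convert_one` is a hypothetical callable supplied by the caller; it takes
    a path and returns the converted text.
    """
    converted: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            converted[path.name] = convert_one(path)
        except (MemoryError, RuntimeError) as exc:
            # Assumption: the allocation failure surfaces as one of these.
            failed[path.name] = f"{type(exc).__name__}: {exc}"
        finally:
            # Release what the previous document used before starting the next.
            gc.collect()
    return converted, failed


# Assumed wiring for Docling (not executed here):
#   from docling.document_converter import DocumentConverter
#   converter = DocumentConverter()
#   convert_one = lambda p: converter.convert(p).document.export_to_markdown()
```

The `failed` dictionary then shows which documents, such as `sample-tables.pdf`, need to be retried on their own or with lighter settings.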
I think this isn't Fedora-packaged software? From the screenshots, it looks like it's running on MINGW in Windows.
However, this thread looks like useful advice for Outreachy folks who might be using the same tool. You could move it to the "Project Discussion" section and "mentored-project-teams" tag.
Thank you for the clarification, @pg-tips. You're right: I'm currently testing this on Windows via MINGW64, so it isn't a native Fedora package; it is the Docling tool.
I'd love to move this information to the Project Discussion section so other Outreachy applicants can learn from these OCR and memory errors. Could you please provide direct links to that section and the mentored-project-teams tag, so I can check whether I'm posting in the correct place and fix it if I'm not?
In that case, the Bugzilla link is not the right place. You will have to look up Docling support.
You should really install Fedora! It's a great system. Way better than Windoze.
I have moved the post for you. However, I think that putting this type of post on your personal blog would be better, as it is not a Fedora issue as such. It is of course an Outreachy issue.
Next time, when you press "new issue" you will see different sections. Find "Project Discussion" and add the tags you want.
Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.
Docling is written in Python and supports a number of OCR libraries. Installing it in Fedora might require
Years ago at work we had PDFs created by scanning line-printer output, along with the original "text" files from a CDC system. The text files contained garbled sections, and some PDF pages were missing data when a page didn't feed properly. I used the Tesseract Python library to reconstruct pages corresponding to the garbled text from the PDF files. Fortunately, the misfed pages rarely overlapped garbled sections of the text files, and we had some statistical summaries of the data that allowed us to determine which of the two reconstructed text files was correct. I was very impressed by Tesseract's accuracy.
Looks like it would not be difficult to install Docling in Fedora, but Fedora already has other Tesseract front-ends.
Thanks for the documentation. At first, I wanted to work with Docling on Fedora. I installed WSL and Fedora, launched the Fedora CLI, and decided to install Docling. I went through the entire process of setting up a virtual environment, installing Docling, and verifying the installation.
After a while, I launched Fedora on WSL to use Docling again, but the tool couldn't be found. Since it was my first time, I wasn't sure what to do and concluded that I had to reinstall Docling every time I wanted to work in Fedora. Instead, I decided to switch to my VS Code terminal and work from there.
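A likely cause, assuming Docling was installed inside the virtual environment, is that the environment was simply not active in the new shell session. A virtual environment has to be reactivated each time a terminal is opened (for example with `source <path-to-venv>/bin/activate`, where the path is whatever was chosen at creation time); nothing needs to be reinstalled. The sketch below is one way to check this from Python; the helper names are hypothetical and it only inspects the running interpreter.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def module_available(name: str) -> bool:
    """Return True when a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


def diagnose(module: str = "docling") -> str:
    """Explain why a module might be missing from the current session."""
    if module_available(module):
        return f"{module} is importable from {sys.prefix}"
    if not in_virtualenv():
        return (
            f"{module} not found and no virtual environment is active; "
            "activate the environment it was installed into"
        )
    return f"{module} not found in the active environment at {sys.prefix}"


if __name__ == "__main__":
    print(diagnose())
```

If the script reports that no virtual environment is active, activating the original environment should make the tool available again.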