Fix: Docling memory crash (std::bad_alloc) while parsing multiple PDFs

While parsing multiple PDF documents with Docling, I hit a memory crash:

    std::bad_alloc
    Stage preprocess failed
    Document sample-tables.pdf failed to convert

It happened when running:

       docling --to text data/*.pdf --output outputs

The issue was caused by one complex PDF (sample-tables.pdf) with heavy tables, which made Docling run out of memory during batch processing.

I fixed it by processing the problematic file separately:

    docling --to text data/sample-tables.pdf --output outputs

If you encounter this error while batch processing:

  • Process PDFs individually to identify problematic files
  • Large or complex PDFs (especially with tables) may require separate processing
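
The first tip can be sketched as a small Python wrapper that runs the same `docling` command once per file and collects the ones that fail. This is only a sketch under a few assumptions: `convert_each` and its `runner` parameter are hypothetical names made up for illustration, the `data` and `outputs` folders are the ones from the command above, and it assumes `docling` exits with a non-zero status when a conversion fails (not verified here).

```python
import subprocess
import sys
from pathlib import Path


def convert_each(pdf_paths, output_dir, runner=subprocess.run):
    """Run docling once per PDF and return the inputs whose run failed.

    `runner` is a hypothetical hook so the loop can be tried without docling.
    """
    failed = []
    for pdf in pdf_paths:
        # Same flags as the batch command, but one input file per process,
        # so a crash on one PDF does not take the others down with it.
        cmd = ["docling", "--to", "text", str(pdf), "--output", str(output_dir)]
        result = runner(cmd)
        # Assumption: a failed conversion shows up as a non-zero exit status.
        if result.returncode != 0:
            failed.append(pdf)
    return failed


if __name__ == "__main__":
    pdfs = sorted(Path("data").glob("*.pdf"))
    for pdf in convert_each(pdfs, Path("outputs")):
        print(f"failed: {pdf}", file=sys.stderr)
```

Any file it reports can then be retried on its own, as with sample-tables.pdf above. If `docling` is not on the PATH, `subprocess.run` raises `FileNotFoundError` instead of returning a status.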

You could report this bug to the Fedora maintainer, who may be able to get a fix from upstream.

Thanks, Barry A Scott. I would appreciate it if you could tag a person or a team I can refer to in case I run into errors.

Thanks once again

You raise bugs against the package, then bugzilla will tell the maintainers.


I think this isn’t Fedora-packaged software? From the screenshots, it looks like it’s running on MINGW in Windows.

However, this thread looks like useful advice for Outreachy folks who might be using the same tool. You could move it to the “Project Discussion” section and “mentored-project-teams” tag.


Thank you for the clarification, @pg-tips. You’re right: I’m currently testing this on Windows via MINGW64, so it isn’t a native Fedora package; it is the Docling tool.

I’d love to move this information to the Project Discussion section so other Outreachy applicants can learn from these OCR and memory errors. Could you please provide direct links to that section and the mentored-project-teams tag so I can make sure I’m posting in the correct place?

In that case, the Bugzilla link is not the right place. You will have to look up Docling support.

You should really install Fedora! It’s a great system. Way better than Windoze :slight_smile:

I have moved the post for you. However, I think that putting this type of post on your personal blog would be better, as it is not a Fedora issue as such. It is of course an Outreachy issue.

Next time, when you press ‘new issue’, you will see different sections. Find ‘Project Discussion’ and add the tags you want.

Thanks a lot, @MatH, for the great insights.

From https://docling-project.github.io/docling/getting_started/installation/,

Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.

Docling is written in Python and supports a number of OCR libraries. Installing it in Fedora might require …

Years ago at work we had PDFs created by scanning line-printer output, along with the original “text” files from a CDC system. The text files contained garbled sections, and some PDF pages were missing data when a page didn’t feed properly. I used the Tesseract Python library to reconstruct pages corresponding to the garbled text from the PDF files. Fortunately the misfed pages rarely overlapped garbled sections of the text files, and we had some statistical summaries of the data that allowed us to determine which of the two reconstructed text files was correct. I was very impressed by Tesseract’s accuracy.

Looks like it would not be difficult to install Docling in Fedora, but Fedora already has other Tesseract front-ends.


Thanks for the documentation. At first, I wanted to work with Docling on Fedora. I installed WSL and Fedora, launched the Fedora CLI, and decided to install Docling. I went through the entire process of setting up a virtual environment, installing Docling, and verifying the installation.

After a while, I launched Fedora on WSL to use Docling again, but the tool couldn’t be found. Since it was my first time, I wasn’t sure what to do and concluded that I had to reinstall Docling every time I wanted to work in Fedora. Instead, I decided to switch to my VS Code terminal and work from there.

You do not need to reinstall every time. I use venv for tools on Fedora and I only have to install once.

If you start a new topic we can help you debug and understand what is going on.
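
As a first check before opening that topic, a short Python snippet can show whether the current shell is using a virtual environment and whether `docling` is visible on the PATH. This is a minimal sketch, not a diagnosis: `in_virtualenv` and `find_tool` are hypothetical helper names, and it only assumes the tool was installed into a venv as described above.

```python
import shutil
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def find_tool(name: str = "docling"):
    """Return the full path of the tool if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    print("inside a venv:", in_virtualenv())
    print("docling on PATH:", find_tool() or "not found")
```

If it reports that no venv is active and the tool is not found, one possible explanation is that the environment was not activated in the new session, rather than Docling having been uninstalled.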


Thanks Barry, I will get in touch

This is really nice :raising_hands:@farhana
