Upgrading My PDF Converter to IBM's Docling

Sat, 02 May 2026 00:00:00 +0000

When My Own Tool Couldn’t Handle My Work

The error message was easy to dismiss: “RapidOCR returned empty result!” It appeared twice in the terminal, then silence. Where a 40-page Oracle HCM implementation guide should have been, there was a blank .md file. The PDF had come straight from Oracle’s support portal, the same format I use for every triage session. But this one stored its pages as images, and PyMuPDF4LLM had nothing to work with.

That was one category of failure. The other was quieter. For documents that did convert, I started noticing the tables were wrong — not corrupted, just structurally dissolved. An eligibility matrix that should have had six clearly labeled columns came back as a run of loosely connected text. Useful for nothing.

I had built this tool to serve my Oracle work. Then my Oracle work showed me exactly where it fell short.

The Problem with PyMuPDF4LLM

If you’ve followed this series, you know that PyMuPDF4LLM was a solid choice when I first built the converter. It handled text-based PDFs cleanly, installed without friction, and required almost no configuration. For research papers and simple documentation, it worked well.

But Oracle HCM documentation is a different category of document. Oracle’s guides are dense with tables: configuration reference grids, eligibility matrices, step-and-action setup tables. These are not decorative — they carry most of the meaning. When PyMuPDF4LLM dissolved those tables into unstructured text, it was silently degrading the most important parts of the document.

The image-based PDF problem was a hard wall. If a document was captured as page images rather than extractable text, the converter returned nothing. No partial output, no warning — just empty files.

Discovering Docling

IBM Research Zurich’s AI for Knowledge team open-sourced Docling in July 2024. The project has a specific focus: turning complex documents into structured, AI-ready output. In April 2025, IBM donated it to the LF AI & Data Foundation, and it now powers data ingestion for Red Hat Enterprise Linux AI. As of this writing it has over 24,000 GitHub stars.

What makes Docling different is that it treats document conversion as a computer vision problem, not just a text extraction problem.

Layout analysis: Docling uses an RT-DETR-derived model trained on DocLayNet — IBM’s human-annotated dataset of real-world documents — to detect and classify every region on the page: tables, figures, headers, footers, section titles, body text. It knows the structure before it extracts any content.

Table reconstruction: This is where Docling earns its place for Oracle documentation. It uses a vision transformer called TableFormer that predicts row/column structure and header roles directly from the page image. The result is a proper Markdown table, not a stream of cell values.

Image-based PDFs: For documents stored as page images, Docling integrates OCR into its pipeline natively. The same converter handles text-based and image-based PDFs without any changes on your end.

The Switch

The API change was minimal. The old code:

import pymupdf4llm

md_text = pymupdf4llm.to_markdown(pdf_path)

The new code:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert(pdf_path)
md_text = result.document.export_to_markdown()

Three lines instead of one, but the extra structure pays dividends: DocumentConverter can be initialized once and reused across an entire batch, which matters when processing a folder of 50 Oracle guides.
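To make the reuse point concrete, here is a minimal sketch of a batch loop that shares one converter across a folder. The function name convert_folder and its parameters are hypothetical, not part of my converter or of Docling; the only Docling calls it relies on are convert() and document.export_to_markdown(), the same two used above.

```python
from pathlib import Path


def convert_folder(converter, source_dir, output_dir):
    """Convert every PDF in source_dir using one shared converter object.

    `converter` is assumed to behave like docling's DocumentConverter:
    convert(path) returns a result whose .document has export_to_markdown().
    """
    source = Path(source_dir).expanduser()
    target = Path(output_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf_path in sorted(source.glob("*.pdf")):
        # Reusing the same converter means it is constructed only once
        # for the whole batch instead of once per file.
        result = converter.convert(pdf_path)
        md_path = target / (pdf_path.stem + ".md")
        md_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(md_path)
    return written
```

With Docling installed, a call such as `convert_folder(DocumentConverter(), "~/oracle-pdfs", "~/oracle-md")` would process the folder; the output folder name is a placeholder.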

A note on startup: The first time you run Docling, it downloads its ML models from Hugging Face. You will see this:

Loading weights: 100%|██████████| 770/770 [00:00<00:00, 1656.35it/s]

This is normal. The models cache locally after the first download and subsequent runs start immediately. If you see a warning about HF_TOKEN, that is also expected — Docling works without one, but setting a token removes the rate-limit warning:

echo 'export HF_TOKEN="hf_your_token_here"' >> ~/.zshrc

What Changed in Practice

Oracle documentation: Tables that previously collapsed into text now render as proper Markdown tables. A 6-column configuration reference comes back with headers intact and every row correctly aligned.

AI books: My knowledge base includes dense technical books on LLM engineering and machine learning. These have complex layouts — sidebars, multi-column sections, figures with captions. Docling’s layout model handles these significantly better than PyMuPDF4LLM’s heuristic approach.

Image-based PDFs: Documents that previously produced empty output now convert cleanly. The two-step workaround (ocrmypdf → pdf2md) is no longer necessary for most cases.

Two Other Improvements

While I was updating the engine, I added two things that were overdue:

DOCX support. The converter now handles Word documents using pandoc as a backend. The same pdf2md command works for both file types. This matters for Oracle support exports and study notes from my reMarkable.
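As a rough illustration of how one command can serve both file types, the sketch below picks a backend from the file extension and builds a pandoc invocation for Word files. The function names are hypothetical and the choice of GitHub-flavored Markdown (gfm) as the output format is an assumption; the real converter may call pandoc differently.

```python
import subprocess
from pathlib import Path


def pick_backend(path):
    """Choose a backend name from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "docling"
    if suffix == ".docx":
        return "pandoc"
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


def pandoc_command(docx_path, md_path):
    """Build the pandoc argument list for a Word-to-Markdown conversion."""
    # -f docx reads Word input; -t gfm writes GitHub-flavored Markdown
    # (the output flavor is an assumption made for this sketch).
    return ["pandoc", "-f", "docx", "-t", "gfm", str(docx_path), "-o", str(md_path)]


def convert_docx(docx_path, md_path):
    """Run pandoc; raises CalledProcessError if pandoc exits non-zero."""
    subprocess.run(pandoc_command(docx_path, md_path), check=True)
```

Keeping the command construction separate from the subprocess call makes the dispatch logic easy to check without pandoc installed.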

Batch manifest. When processing a large folder, the converter now writes a manifest file tracking which files have been converted and their checksums. Re-running on the same folder skips files that haven’t changed. A --force flag overrides this when you need a fresh conversion.

pdf2md --batch ~/oracle-pdfs/                         # skips already-converted
pdf2md --batch ~/oracle-pdfs/ --force                 # reconverts everything
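The skip logic behind those two commands can be sketched roughly as follows. This is not the converter's actual code: the JSON manifest layout, the SHA-256 checksum, and every function name here are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path):
    """Load the filename-to-checksum mapping, or an empty dict if absent."""
    manifest_path = Path(manifest_path)
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def files_to_convert(folder, manifest_path, force=False):
    """Return (path, checksum) pairs for PDFs that are new or changed."""
    manifest = load_manifest(manifest_path)
    pending = []
    for path in sorted(Path(folder).glob("*.pdf")):
        checksum = file_checksum(path)
        # force=True mirrors --force: everything is queued again.
        if force or manifest.get(path.name) != checksum:
            pending.append((path, checksum))
    return pending


def record_converted(manifest_path, path, checksum):
    """Store a file's checksum after it has been converted successfully."""
    manifest = load_manifest(manifest_path)
    manifest[Path(path).name] = checksum
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Recording the checksum only after a successful conversion means a file that failed partway is picked up again on the next run.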

What’s Next

The web UI — which I added in the last post — has also been updated to use Docling. Drag a PDF onto it, click Convert, and the same deep-learning pipeline runs behind the scenes.

The next thing I want to add is direct output to the Obsidian inbox. Right now the flow is: convert → download ZIP → move to vault. A toggle that sends output directly to ~/projects/obsidian-vault/00-inbox/ would cut that manual step entirely.
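A first pass at that toggle might look something like the sketch below. Nothing here exists in the tool yet: send_to_inbox is a hypothetical helper, and the numeric-suffix rule for name collisions is just one possible choice.

```python
import shutil
from pathlib import Path


def send_to_inbox(md_path, inbox_dir):
    """Copy a converted Markdown file into the inbox without overwriting."""
    inbox = Path(inbox_dir).expanduser()
    inbox.mkdir(parents=True, exist_ok=True)

    source = Path(md_path)
    target = inbox / source.name
    counter = 1
    while target.exists():
        # Add a numeric suffix rather than clobbering an existing note.
        target = inbox / f"{source.stem}-{counter}{source.suffix}"
        counter += 1

    shutil.copy2(source, target)
    return target
```

Called with the inbox path mentioned above, this would drop each converted file straight into the vault and leave any existing note with the same name untouched.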

The tool is doing what I originally wanted: converting my Oracle documentation and AI library into clean, searchable Markdown. Docling is what makes that reliable for the documents that actually matter.
