oxidize-pdf

Pure Rust PDF library built for AI/RAG. Structure-aware chunks with bounding boxes, heading context, and token estimates, all in one line. No Python runtime, no ML models, no C bindings.

RAG pipeline in one line

use oxidize_pdf::parser::PdfDocument;

let chunks = PdfDocument::open("paper.pdf")?.rag_chunks()?;
// Each chunk: text, page numbers, bounding boxes, element types,
//             heading context, token estimate

Each RagChunk ships with the metadata your vector store actually needs: page numbers for citation, bounding boxes for visual grounding, element types (title, table, paragraph, ...) for filtering, and heading context prepended to full_text for stronger embeddings.
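
To make the metadata concrete, here is a minimal sketch of how a downstream pipeline might consume it. `ChunkMeta` is a hypothetical stand-in, not the crate's `RagChunk`: its field names are assumptions modeled on the metadata listed above, and `select_within_budget` is an illustrative helper rather than a library function. Check the crate documentation for the real definitions.

```rust
/// Hypothetical stand-in for the per-chunk metadata described above.
/// Field names are illustrative assumptions, not the crate's actual definition.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkMeta {
    pub text: String,
    pub heading_context: Option<String>,
    pub page_numbers: Vec<u32>,
    pub element_types: Vec<String>,
    pub token_estimate: usize,
}

impl ChunkMeta {
    /// Text to embed: the heading context (if any) prepended to the chunk body.
    pub fn embedding_text(&self) -> String {
        match self.heading_context.as_deref() {
            Some(h) if !h.is_empty() => format!("{}\n\n{}", h, self.text),
            _ => self.text.clone(),
        }
    }

    /// Human-readable citation such as "p. 5" or "pp. 3-4".
    pub fn citation(&self) -> String {
        let first = self.page_numbers.iter().min();
        let last = self.page_numbers.iter().max();
        match (first, last) {
            (Some(a), Some(b)) if a == b => format!("p. {}", a),
            (Some(a), Some(b)) => format!("pp. {}-{}", a, b),
            _ => String::from("page unknown"),
        }
    }
}

/// Collect chunks of one element type, in order, until adding the next
/// matching chunk would exceed the token budget.
pub fn select_within_budget<'a>(
    chunks: &'a [ChunkMeta],
    element_type: &str,
    token_budget: usize,
) -> Vec<&'a ChunkMeta> {
    let mut used = 0usize;
    let mut selected = Vec::new();
    for chunk in chunks {
        if !chunk.element_types.iter().any(|t| t.as_str() == element_type) {
            continue; // filtered out by element type
        }
        if used + chunk.token_estimate > token_budget {
            break; // next matching chunk would overflow the prompt budget
        }
        used += chunk.token_estimate;
        selected.push(chunk);
    }
    selected
}
```

With this shape, a retriever could filter on `"table"` or `"paragraph"`, pack results into a prompt by `token_estimate`, and attach `citation()` to each answer.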

Choose your runtime

Runtime	Package	Install
Rust	crates.io/oxidize-pdf	`cargo add oxidize-pdf`
Python	pypi.org/oxidize-pdf	`pip install oxidize-pdf`
.NET	nuget.org/OxidizePdf.NET	`dotnet add package OxidizePdf.NET`

AI/RAG ecosystem integrations

Integration	What it gives you	Install
LlamaIndex reader	Drop-in `OxidizePdfReader` for LlamaIndex pipelines. Each chunk becomes a `Document` with `heading_context`, `page_numbers`, `element_types`, and `token_estimate` in `metadata`.	`pip install llama-index-readers-oxidize-pdf`
MCP server	12 tools and 6 resources for PDF read / create / extract / analyze / secure / manipulate over the Model Context Protocol. Ships as the `oxidize-mcp` entry point of the Python package.	`pip install oxidize-pdf` then point your MCP client at `oxidize-mcp`
Claude Code plugin	6 skills (`read-pdf`, `extract-text`, `create-pdf`, `analyze-pdf`, `secure-pdf`, `manipulate-pdf`) and 1 orchestrating agent.	`/plugin marketplace add bzsanti/oxidize-pdf-integrations` then `/plugin install oxidize-pdf@oxidize-pdf`

Why this, not `pypdf + RecursiveCharacterTextSplitter`?

Capability	oxidize-pdf	pypdf + LangChain	unstructured / docling
Structure-aware chunks	yes	no (text-only split)	yes
Bounding boxes per chunk	yes	no	yes
Heading context propagation	yes	no	yes
ML model dependency	none	none	required
Native binary (no Python runtime)	yes	no	no
Language	Rust	Python	Python + ML
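
The first and third rows of the table can be illustrated with a toy comparison. This is a conceptual sketch only, not how oxidize-pdf or LangChain is implemented: `Element`, `Chunk`, `split_fixed`, and `split_by_structure` are hypothetical names, and the input is assumed to be already classified into headings and paragraphs.

```rust
/// Text-only splitting: cut every `size` characters, ignoring structure.
pub fn split_fixed(text: &str, size: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(size.max(1)) // guard against a zero chunk size
        .map(|piece| piece.iter().collect())
        .collect()
}

/// Hypothetical pre-classified page elements (assumed input for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum Element {
    Heading(String),
    Paragraph(String),
}

/// A chunk that remembers which heading it appeared under.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub heading: Option<String>,
    pub text: String,
}

/// Structure-aware splitting: one chunk per paragraph, each carrying the
/// most recent heading so the context survives chunking.
pub fn split_by_structure(elements: &[Element]) -> Vec<Chunk> {
    let mut current_heading: Option<String> = None;
    let mut chunks = Vec::new();
    for element in elements {
        match element {
            Element::Heading(h) => current_heading = Some(h.clone()),
            Element::Paragraph(p) => chunks.push(Chunk {
                heading: current_heading.clone(),
                text: p.clone(),
            }),
        }
    }
    chunks
}
```

The fixed-size splitter can cut a heading away from the paragraph it introduces, while the structure-aware version keeps the heading attached to every chunk beneath it.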

Beyond RAG

Parse

PDF 1.0–1.7, 99.3% success on 9,000+ real-world PDFs. CJK text, lenient recovery, decompression-bomb protection.
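
As a rough illustration of what decompression-bomb protection generally involves, the sketch below caps how many bytes are accepted from a stream. It shows the general technique using only the standard library and is not oxidize-pdf's implementation; `read_capped` and its `limit` parameter are hypothetical names.

```rust
use std::io::{self, Read};

/// Read at most `limit` bytes from `src`, returning an error if the source
/// would produce more. In a real parser `src` would be a decompressor.
pub fn read_capped<R: Read>(src: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one extra byte so "exactly at the limit" and "over the limit"
    // can be told apart without reading the whole stream.
    src.take(limit.saturating_add(1)).read_to_end(&mut out)?;
    if out.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "stream exceeds configured size limit",
        ));
    }
    Ok(out)
}
```

Because the read is bounded up front, even an endless source such as `std::io::repeat(0)` is rejected after `limit + 1` bytes instead of exhausting memory.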

Generate

Multi-page documents with text, graphics, images. 3,000–4,000 pages/sec.

Encrypt

RC4-40/128, AES-128, AES-256 (R5/R6). Read and write.

Signatures

Detection, PKCS#7 verification, certificate validation.

PDF/A

Validation across 8 conformance levels (1a/b, 2a/b/u, 3a/b/u).
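
The eight levels listed above can be modeled as a small enum. This is an illustrative sketch with hypothetical names (`PdfALevel`, `parse`, `part`, `conformance`), not the crate's own type. Note that PDF/A-1 defines no "u" level, which is why there are eight levels rather than nine.

```rust
/// The eight PDF/A conformance levels: part 1 has a/b, parts 2 and 3 have a/b/u.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PdfALevel {
    A1a, A1b,
    A2a, A2b, A2u,
    A3a, A3b, A3u,
}

impl PdfALevel {
    pub const ALL: [PdfALevel; 8] = [
        PdfALevel::A1a, PdfALevel::A1b,
        PdfALevel::A2a, PdfALevel::A2b, PdfALevel::A2u,
        PdfALevel::A3a, PdfALevel::A3b, PdfALevel::A3u,
    ];

    /// Parse a short label such as "1b" or "2U"; returns None for anything else.
    pub fn parse(label: &str) -> Option<PdfALevel> {
        match label.trim().to_ascii_lowercase().as_str() {
            "1a" => Some(PdfALevel::A1a),
            "1b" => Some(PdfALevel::A1b),
            "2a" => Some(PdfALevel::A2a),
            "2b" => Some(PdfALevel::A2b),
            "2u" => Some(PdfALevel::A2u),
            "3a" => Some(PdfALevel::A3a),
            "3b" => Some(PdfALevel::A3b),
            "3u" => Some(PdfALevel::A3u),
            _ => None, // e.g. "1u" does not exist
        }
    }

    /// The part of the PDF/A standard (1, 2, or 3).
    pub fn part(self) -> u8 {
        match self {
            PdfALevel::A1a | PdfALevel::A1b => 1,
            PdfALevel::A2a | PdfALevel::A2b | PdfALevel::A2u => 2,
            PdfALevel::A3a | PdfALevel::A3b | PdfALevel::A3u => 3,
        }
    }

    /// The conformance letter: 'a', 'b', or 'u'.
    pub fn conformance(self) -> char {
        match self {
            PdfALevel::A1a | PdfALevel::A2a | PdfALevel::A3a => 'a',
            PdfALevel::A1b | PdfALevel::A2b | PdfALevel::A3b => 'b',
            PdfALevel::A2u | PdfALevel::A3u => 'u',
        }
    }
}
```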

OCR & JBIG2

Pure-Rust JBIG2 decoder (ITU-T T.88). Tesseract integration optional.