oxidize-pdf

Pure Rust PDF library built for AI/RAG. Structure-aware chunks with bounding boxes, heading context, and token estimates — in one line. No Python, no ML, no C bindings.


RAG pipeline in one line

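use oxidize_pdf::parser::PdfDocument;

// Assumes both calls return error types that implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let chunks = PdfDocument::open("paper.pdf")?.rag_chunks()?;
    // Each chunk: text, page numbers, bounding boxes, element types,
    //             heading context, token estimate
    Ok(())
}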

Each RagChunk ships with the metadata your vector store actually needs: page numbers for citation, bounding boxes for visual grounding, element types (title, table, paragraph, ...) for filtering, and heading context prepended to full_text for stronger embeddings.
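As a rough illustration of how that metadata might be consumed downstream, the sketch below filters out table chunks, respects a token budget, and builds page citations. It does not use the crate: `Chunk`, `ElementKind`, `citation`, `select_for_prompt`, and every field name except `full_text` are hypothetical stand-ins chosen for this example, so check docs.rs for the real `RagChunk` fields before adapting it.

```rust
// Hypothetical stand-ins for the metadata described above. Only `full_text`
// is named in this README; every other name here is an assumption.
#[derive(Debug, Clone, PartialEq)]
enum ElementKind {
    Title,
    Paragraph,
    Table,
}

#[derive(Debug, Clone)]
struct Chunk {
    full_text: String,
    page_numbers: Vec<u32>,
    element_types: Vec<ElementKind>,
    token_estimate: usize,
}

/// Builds a citation such as "paper.pdf, p. 1" or "paper.pdf, pp. 2-3".
fn citation(source: &str, chunk: &Chunk) -> String {
    match (chunk.page_numbers.first(), chunk.page_numbers.last()) {
        (Some(first), Some(last)) if first != last => format!("{source}, pp. {first}-{last}"),
        (Some(first), _) => format!("{source}, p. {first}"),
        _ => source.to_string(),
    }
}

/// Keeps non-table chunks, in document order, until the token budget is used up.
fn select_for_prompt(chunks: &[Chunk], budget: usize) -> Vec<&Chunk> {
    let mut used = 0;
    let mut picked = Vec::new();
    for chunk in chunks {
        if chunk.element_types.contains(&ElementKind::Table) {
            continue; // filter by element type
        }
        if used + chunk.token_estimate > budget {
            break; // the next chunk would exceed the budget
        }
        used += chunk.token_estimate;
        picked.push(chunk);
    }
    picked
}

/// Hand-written sample data standing in for real parser output.
fn sample_chunks() -> Vec<Chunk> {
    let make = |text: &str, pages: &[u32], kinds: &[ElementKind], tokens: usize| Chunk {
        full_text: text.to_string(),
        page_numbers: pages.to_vec(),
        element_types: kinds.to_vec(),
        token_estimate: tokens,
    };
    vec![
        make("Introduction\n\nPDF parsing is hard.", &[1], &[ElementKind::Title, ElementKind::Paragraph], 40),
        make("a | b\n1 | 2", &[2], &[ElementKind::Table], 30),
        make("Methods\n\nWe parse many files.", &[2, 3], &[ElementKind::Paragraph], 50),
        make("Results\n\nIt works.", &[3], &[ElementKind::Paragraph], 60),
    ]
}

fn main() {
    let chunks = sample_chunks();
    for chunk in select_for_prompt(&chunks, 100) {
        println!("[{}] {}", citation("paper.pdf", chunk), chunk.full_text);
    }
}
```

The same pattern applies to a vector store: the citation string and element types go into the record's metadata, and `full_text` is what gets embedded.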

Choose your runtime

| Runtime | Package | Install |
| --- | --- | --- |
| Rust | crates.io/oxidize-pdf | cargo add oxidize-pdf |
| Python | pypi.org/oxidize-pdf | pip install oxidize-pdf |
| .NET | nuget.org/OxidizePdf.NET | dotnet add package OxidizePdf.NET |

AI/RAG ecosystem integrations

| Integration | What it gives you | Install |
| --- | --- | --- |
| LlamaIndex reader | Drop-in OxidizePdfReader for LlamaIndex pipelines. Each chunk becomes a Document with heading_context, page_numbers, element_types, and token_estimate in metadata. | pip install llama-index-readers-oxidize-pdf |
| MCP server | 12 tools and 6 resources for PDF read / create / extract / analyze / secure / manipulate over the Model Context Protocol. Ships as the oxidize-mcp entry point of the Python package. | pip install oxidize-pdf, then point your MCP client at oxidize-mcp |
| Claude Code plugin | 6 skills (read-pdf, extract-text, create-pdf, analyze-pdf, secure-pdf, manipulate-pdf) and 1 orchestrating agent. | /plugin marketplace add bzsanti/oxidize-pdf-integrations, then /plugin install oxidize-pdf@oxidize-pdf |

Why this, not pypdf + RecursiveCharacterTextSplitter?

| Capability | oxidize-pdf | pypdf + LangChain | unstructured / docling |
| --- | --- | --- | --- |
| Structure-aware chunks | yes | no (text-only split) | yes |
| Bounding boxes per chunk | yes | no | yes |
| Heading context propagation | yes | no | yes |
| ML model dependency | none | none | required |
| Native binary (no Python runtime) | yes | no | no |
| Language | Rust | Python | Python + ML |
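To make "heading context propagation" concrete, here is a toy sketch of the general idea: walk a stream of headings and paragraphs, keep a stack of the headings currently in scope, and attach that path to each paragraph. It is not oxidize-pdf's implementation. The `Element` and `ContextChunk` types, the `chunk_with_headings` function, and the " > " separator are all assumptions made for illustration.

```rust
// Toy model of a parsed document: a flat stream of headings and paragraphs.
// Illustration only; all names here are hypothetical, not the crate's API.
enum Element {
    Heading { level: usize, text: String },
    Paragraph(String),
}

struct ContextChunk {
    heading_context: Vec<String>,
    text: String,
}

impl ContextChunk {
    /// Heading path joined with " > ", then a blank line, then the body text.
    fn full_text(&self) -> String {
        if self.heading_context.is_empty() {
            self.text.clone()
        } else {
            format!("{}\n\n{}", self.heading_context.join(" > "), self.text)
        }
    }
}

/// Emits one chunk per paragraph, tagged with the headings that enclose it.
fn chunk_with_headings(elements: &[Element]) -> Vec<ContextChunk> {
    let mut stack: Vec<(usize, String)> = Vec::new();
    let mut chunks = Vec::new();
    for element in elements {
        match element {
            Element::Heading { level, text } => {
                // A new heading closes every open heading at the same or a deeper level.
                while stack.last().map_or(false, |(open, _)| open >= level) {
                    stack.pop();
                }
                stack.push((*level, text.clone()));
            }
            Element::Paragraph(text) => chunks.push(ContextChunk {
                heading_context: stack.iter().map(|(_, h)| h.clone()).collect(),
                text: text.clone(),
            }),
        }
    }
    chunks
}

/// Hand-written sample document used by `main` below.
fn sample_document() -> Vec<Element> {
    let h = |level: usize, text: &str| Element::Heading { level, text: text.to_string() };
    let p = |text: &str| Element::Paragraph(text.to_string());
    vec![
        h(1, "Methods"),
        p("We parse."),
        h(2, "Dataset"),
        p("9,000 files."),
        h(1, "Results"),
        p("It works."),
    ]
}

fn main() {
    for chunk in chunk_with_headings(&sample_document()) {
        println!("{}\n---", chunk.full_text());
    }
}
```

A fixed-size character splitter would embed "9,000 files." with no indication that it describes the dataset; carrying the heading path along keeps that context in the text that gets embedded.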

Beyond RAG

Parse

PDF 1.0–1.7, 99.3% success on 9,000+ real-world PDFs. CJK text, lenient recovery, decompression-bomb protection.

Generate

Multi-page documents with text, graphics, images. 3,000–4,000 pages/sec.

Encrypt

RC4-40/128, AES-128, AES-256 (R5/R6). Read and write.

Signatures

Detection, PKCS#7 verification, certificate validation.

PDF/A

Validation across 8 conformance levels (1a/b, 2a/b/u, 3a/b/u).

OCR & JBIG2

Pure-Rust JBIG2 decoder (ITU-T T.88). Tesseract integration optional.

Links