oxidize-pdf
Pure Rust PDF library built for AI/RAG. Structure-aware chunks with bounding boxes, heading context, and token estimates — in one line. No Python, no ML, no C bindings.
RAG pipeline in one line
use oxidize_pdf::parser::PdfDocument;
let chunks = PdfDocument::open("paper.pdf")?.rag_chunks()?;
// Each chunk: text, page numbers, bounding boxes, element types,
// heading context, token estimate
Each RagChunk ships with the metadata your vector store actually needs: page numbers for citation, bounding boxes for visual grounding, element types (title, table, paragraph, ...) for filtering, and heading context prepended to full_text for stronger embeddings.
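As a rough illustration of how that metadata might be used downstream, here is a minimal, self-contained sketch. `ChunkSketch` and its field names are hypothetical stand-ins modelled on the fields listed above, not the crate's actual `RagChunk` definition, and the helpers `filter_by_element`, `citation`, and `pack_within_budget` are invented for this example. Bounding boxes are left out to keep it short. Check docs.rs/oxidize-pdf for the real types.

```rust
// Hypothetical stand-in for the crate's RagChunk. Field names are assumptions
// based on the metadata described in this README, not the real API.
#[derive(Debug, Clone, PartialEq)]
struct ChunkSketch {
    full_text: String,          // heading context followed by the chunk text
    page_numbers: Vec<u32>,     // pages the chunk spans, used for citation
    element_types: Vec<String>, // e.g. "title", "table", "paragraph"
    token_estimate: usize,      // approximate token count
}

/// Keep only chunks that contain at least one element of the given type.
fn filter_by_element<'a>(chunks: &'a [ChunkSketch], kind: &str) -> Vec<&'a ChunkSketch> {
    chunks
        .iter()
        .filter(|c| c.element_types.iter().any(|t| t == kind))
        .collect()
}

/// Build a citation label such as "p. 3" or "pp. 3-4" from the page numbers.
fn citation(chunk: &ChunkSketch) -> String {
    let first = chunk.page_numbers.iter().min();
    let last = chunk.page_numbers.iter().max();
    match (first, last) {
        (Some(a), Some(b)) if a == b => format!("p. {}", a),
        (Some(a), Some(b)) => format!("pp. {}-{}", a, b),
        _ => String::from("n/a"),
    }
}

/// Take chunks in order until the next one would exceed the token budget.
fn pack_within_budget<'a>(chunks: &'a [ChunkSketch], budget: usize) -> Vec<&'a ChunkSketch> {
    let mut used = 0;
    let mut picked = Vec::new();
    for c in chunks {
        if used + c.token_estimate > budget {
            break;
        }
        used += c.token_estimate;
        picked.push(c);
    }
    picked
}

/// Made-up sample data standing in for the output of a real extraction run.
fn sample_chunks() -> Vec<ChunkSketch> {
    vec![
        ChunkSketch {
            full_text: "Introduction\nThis paper studies chunking.".to_string(),
            page_numbers: vec![1],
            element_types: vec!["title".to_string(), "paragraph".to_string()],
            token_estimate: 40,
        },
        ChunkSketch {
            full_text: "Results\nTable 2: accuracy by method.".to_string(),
            page_numbers: vec![3, 4],
            element_types: vec!["table".to_string()],
            token_estimate: 55,
        },
        ChunkSketch {
            full_text: "Discussion\nLimitations and future work.".to_string(),
            page_numbers: vec![5],
            element_types: vec!["paragraph".to_string()],
            token_estimate: 30,
        },
    ]
}

fn main() {
    let chunks = sample_chunks();
    // Print every table chunk with a page citation in front of it.
    for c in filter_by_element(&chunks, "table") {
        println!("[{}] {}", citation(c), c.full_text);
    }
    // Count how many leading chunks fit into a 100-token context budget.
    println!("{} chunks fit", pack_within_budget(&chunks, 100).len());
}
```

The same pattern would apply to the real chunks: filter on element type before embedding, keep the page numbers alongside each vector for citation, and use the token estimate to stay within a model's context window.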
Choose your runtime
| Runtime | Package | Install |
|---|---|---|
| Rust | crates.io/oxidize-pdf | cargo add oxidize-pdf |
| Python | pypi.org/oxidize-pdf | pip install oxidize-pdf |
| .NET | nuget.org/OxidizePdf.NET | dotnet add package OxidizePdf.NET |
AI/RAG ecosystem integrations
| Integration | What it gives you | Install |
|---|---|---|
| LlamaIndex reader | Drop-in OxidizePdfReader for LlamaIndex pipelines. Each chunk becomes a Document with heading_context, page_numbers, element_types, and token_estimate in metadata. | pip install llama-index-readers-oxidize-pdf |
| MCP server | 12 tools and 6 resources for PDF read / create / extract / analyze / secure / manipulate over the Model Context Protocol. Ships as the oxidize-mcp entry point of the Python package. | pip install oxidize-pdf, then point your MCP client at oxidize-mcp |
| Claude Code plugin | 6 skills (read-pdf, extract-text, create-pdf, analyze-pdf, secure-pdf, manipulate-pdf) and 1 orchestrating agent. | /plugin marketplace add bzsanti/oxidize-pdf-integrations, then /plugin install oxidize-pdf@oxidize-pdf |
Why this, not pypdf + RecursiveCharacterTextSplitter?
| Capability | oxidize-pdf | pypdf + LangChain | unstructured / docling |
|---|---|---|---|
| Structure-aware chunks | yes | no (text-only split) | yes |
| Bounding boxes per chunk | yes | no | yes |
| Heading context propagation | yes | no | yes |
| ML model dependency | none | none | required |
| Native binary (no Python runtime) | yes | no | no |
| Language | Rust | Python | Python + ML |
Beyond RAG
Parse
PDF 1.0–1.7, 99.3% success on 9,000+ real-world PDFs. CJK text, lenient recovery, decompression-bomb protection.
Generate
Multi-page documents with text, graphics, images. 3,000–4,000 pages/sec.
Encrypt
RC4-40/128, AES-128, AES-256 (R5/R6). Read and write.
Signatures
Detection, PKCS#7 verification, certificate validation.
PDF/A
Validation across 8 conformance levels (1a/b, 2a/b/u, 3a/b/u).
OCR & JBIG2
Pure-Rust JBIG2 decoder (ITU-T T.88). Tesseract integration optional.
Links
- API documentation — docs.rs/oxidize-pdf
- Source code & issue tracker — github.com/bzsanti/oxidizePdf
- Python bindings — pypi.org/project/oxidize-pdf
- .NET bindings — nuget.org/OxidizePdf.NET
- Ecosystem integrations (MCP, LlamaIndex, Claude Code) — oxidize-pdf-integrations
- Changelog — CHANGELOG.md