ParseKit: Native Document Parsing and OCR for Ruby

by Christopher Petersen — 6 September 2025

ParseKit brings native document parsing to Ruby through high-performance Rust bindings. Extract text from PDFs, Office documents, and images, all without external dependencies or shelling out to Python.

What makes it special:

Zero runtime dependencies (MuPDF and Tesseract statically linked)
Native Rust performance via Magnus FFI
Unified API for multiple formats (PDF, DOCX, XLSX, PPTX, images)
OCR support for images (PNG, JPEG, TIFF, BMP)
Comprehensive format detection and error handling

Key features:

PDF text extraction with MuPDF (handles complex layouts, tables, forms)
Office document parsing (Word, Excel, PowerPoint)
Image OCR with Tesseract (multiple languages supported)
Smart format detection from content or file extension
Production-ready with comprehensive test suite (334 specs)
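
To give a feel for what "format detection from content or file extension" can mean in practice, here is a minimal pure-Ruby sketch. It is not ParseKit's implementation (that lives in the Rust extension and isn't shown in this post); the `detect_format` helper and both lookup tables are hypothetical names made up for illustration. The idea: trust well-known magic bytes first, and use the extension only to tell ZIP-based Office formats apart or as a fallback.

```ruby
# Illustrative only: a hypothetical helper, not part of the ParseKit API.
MAGIC_SIGNATURES = {
  pdf:  "%PDF-".b,
  png:  "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  bmp:  "BM".b,
  zip:  "PK\x03\x04".b # DOCX, XLSX and PPTX are all ZIP containers
}.freeze

# TIFF has two valid headers (little-endian and big-endian byte order).
TIFF_SIGNATURES = ["II*\x00".b, "MM\x00*".b].freeze

EXTENSION_FORMATS = {
  ".pdf" => :pdf, ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx,
  ".png" => :png, ".jpg" => :jpeg, ".jpeg" => :jpeg,
  ".tif" => :tiff, ".tiff" => :tiff, ".bmp" => :bmp
}.freeze

OFFICE_FORMATS = %i[docx xlsx pptx].freeze

def detect_format(bytes, filename = nil)
  head = bytes.to_s.b[0, 8] || ""
  from_ext = filename && EXTENSION_FORMATS[File.extname(filename).downcase]

  return :tiff if TIFF_SIGNATURES.any? { |sig| head.start_with?(sig) }

  sniffed, = MAGIC_SIGNATURES.find { |_, sig| head.start_with?(sig) }

  if sniffed == :zip
    # The ZIP header alone cannot distinguish Office formats,
    # so lean on the extension when it names one.
    return OFFICE_FORMATS.include?(from_ext) ? from_ext : :zip
  end

  # Content wins over a misleading extension; extension is the fallback.
  sniffed || from_ext || :unknown
end
```

A real detector would go further (for example, inspecting the ZIP entries to identify Office files without relying on the extension), but the content-first, extension-second ordering is the part worth noting.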

Perfect for content management systems, document indexing, data extraction pipelines, and any application that needs to process documents at scale, all within your Ruby process.
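
For a pipeline-style use case, a batch loop might look roughly like the sketch below. Big caveat: this post doesn't show the gem's API, so `ParseKit.parse_file(path)` returning a String is an assumption here; check the README for the real entry points. `extract_all` and `DEFAULT_EXTRACTOR` are hypothetical names. The extractor is injectable so the loop can be exercised without the gem installed.

```ruby
# Sketch of a batch extraction loop with per-file error capture.
# ASSUMPTION: ParseKit.parse_file(path) exists and returns the extracted
# text as a String. Verify against the gem's documentation before relying on it.
DEFAULT_EXTRACTOR = lambda do |path|
  require "parsekit"        # loaded lazily, only when this default is used
  ParseKit.parse_file(path) # assumed entry point, not confirmed by this post
end

def extract_all(paths, extractor: DEFAULT_EXTRACTOR)
  results = { texts: {}, failures: {} }

  paths.each do |path|
    begin
      results[:texts][path] = extractor.call(path)
    rescue StandardError, LoadError => e
      # One unreadable file should not abort the whole batch.
      results[:failures][path] = "#{e.class}: #{e.message}"
    end
  end

  results
end
```

Calling `extract_all(Dir.glob("inbox/*"))` would then return the extracted texts keyed by path, with any per-file errors collected separately for logging or retry.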

Recent improvements in v0.1.1:

Streamlined error handling with better context
Enhanced validation helpers
Improved format detection reliability
Cleaner API surface

This is part of the ruby-nlp ecosystem aimed at bringing best-in-class text processing tools to Ruby. Feedback welcome!

RubyFlow The Ruby and Rails community linklog