ParseKit: Native Document Parsing and OCR for Ruby
ParseKit brings native document parsing to Ruby through high-performance Rust bindings. Extract text from PDFs, Office documents, images, and more, all without external dependencies or shelling out to Python.
What makes it special:
- Zero runtime dependencies (MuPDF and Tesseract statically linked)
- Native Rust performance via Magnus bindings
- Unified API for multiple formats (PDF, DOCX, XLSX, PPTX, images)
- OCR support for images (PNG, JPEG, TIFF, BMP)
- Comprehensive format detection and error handling
Key features:
- PDF text extraction with MuPDF (handles complex layouts, tables, forms)
- Office document parsing (Word, Excel, PowerPoint)
- Image OCR with Tesseract (multiple languages supported)
- Smart format detection from content or file extension
- Production-ready, with a comprehensive test suite (334 specs)
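To make "format detection from content or file extension" concrete, here is a minimal pure-Ruby sketch of the general technique: check the leading magic bytes first, then fall back to the extension. This is not ParseKit's actual implementation; `detect_format`, `SIGNATURES`, and `EXTENSIONS` are hypothetical names made up for illustration.

```ruby
# Hypothetical sketch of content-first format detection.
# Illustrates the general idea only; not ParseKit's real code.

# Well-known magic bytes at the start of each file type.
SIGNATURES = [
  ["%PDF-".b, :pdf],
  ["\x89PNG\r\n\x1A\n".b, :png],
  ["\xFF\xD8\xFF".b, :jpeg],
  ["II*\x00".b, :tiff],   # little-endian TIFF
  ["MM\x00*".b, :tiff],   # big-endian TIFF
  ["BM".b, :bmp],
  ["PK\x03\x04".b, :zip]  # DOCX/XLSX/PPTX are ZIP containers
].freeze

EXTENSIONS = {
  ".pdf" => :pdf, ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx,
  ".png" => :png, ".jpg" => :jpeg, ".jpeg" => :jpeg,
  ".tif" => :tiff, ".tiff" => :tiff, ".bmp" => :bmp
}.freeze

OFFICE_FORMATS = %i[docx xlsx pptx].freeze

def detect_format(bytes, filename = nil)
  head = bytes.to_s.b
  by_content = SIGNATURES.find { |magic, _| head.start_with?(magic) }&.last
  by_name = filename && EXTENSIONS[File.extname(filename).downcase]

  # A ZIP signature alone cannot tell Office formats apart, so the
  # extension decides which one it is (if any).
  if by_content == :zip
    return OFFICE_FORMATS.include?(by_name) ? by_name : :unknown
  end

  by_content || by_name || :unknown
end

puts detect_format("%PDF-1.7\n")                      # content wins
puts detect_format("PK\x03\x04".b, "report.docx")     # ZIP + extension
puts detect_format("", "scan.TIFF")                   # extension fallback
```

Checking content before the extension means a mislabeled file (say, a PDF saved as `.txt`) is still recognized, which is the usual reason to prefer this order.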
Perfect for content management systems, document indexing, data extraction pipelines, and any application that needs to process documents at scale, all within your Ruby process.
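For batch use, a pipeline usually wants one bad file to be recorded rather than abort the run. The sketch below shows that pattern with a stand-in extractor so it runs without the gem; `extract_all` is a hypothetical helper, and the suggestion that you could pass `->(path) { ParseKit.parse_file(path) }` assumes the gem exposes a `parse_file` entry point, so check the README before relying on it.

```ruby
# Hypothetical helper: run an extractor over many files and keep failures
# separate, so one unreadable document does not stop the batch.
def extract_all(paths, extractor)
  results = { texts: {}, errors: {} }
  paths.each do |path|
    begin
      results[:texts][path] = extractor.call(path)
    rescue StandardError => e
      results[:errors][path] = "#{e.class}: #{e.message}"
    end
  end
  results
end

# Stand-in extractor for illustration only. With the gem loaded, this
# could be swapped for a lambda that calls into ParseKit (assumed API).
fake_extractor = lambda do |path|
  raise ArgumentError, "unsupported format" if File.extname(path) == ".xyz"
  "text of #{path}"
end

batch = extract_all(["a.pdf", "b.xyz"], fake_extractor)
puts "extracted: #{batch[:texts].keys.join(', ')}"
puts "failed:    #{batch[:errors].keys.join(', ')}"
```

Passing the extractor in as a callable keeps the batching logic independent of any one parsing library and makes it easy to test.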
Recent improvements in v0.1.1:
- Streamlined error handling with better context
- Enhanced validation helpers
- Improved format detection reliability
- Cleaner API surface
This is part of the ruby-nlp ecosystem aimed at bringing best-in-class text processing tools to Ruby. Feedback welcome!