RubyFlow The Ruby and Rails community linklog



ParseKit: Native Document Parsing and OCR for Ruby

ParseKit brings native document parsing to Ruby through high-performance Rust bindings. It extracts text from PDFs, Office documents, images, and more, all without external dependencies or shelling out to Python.

What makes it special:

  • Zero runtime dependencies (MuPDF and Tesseract statically linked)
  • Native Rust performance via Magnus FFI
  • Unified API for multiple formats (PDF, DOCX, XLSX, PPTX, images)
  • OCR support for images (PNG, JPEG, TIFF, BMP)
  • Comprehensive format detection and error handling

Key features:

  • PDF text extraction with MuPDF (handles complex layouts, tables, forms)
  • Office document parsing (Word, Excel, PowerPoint)
  • Image OCR with Tesseract (multiple languages supported)
  • Smart format detection from content or file extension
  • Production-ready with comprehensive test suite (334 specs)
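
To make "format detection from content or file extension" concrete, here is a minimal, self-contained sketch of the general technique: check the leading bytes against well-known file signatures, then fall back to the extension. This is not ParseKit's actual implementation or API; the module name `FormatSniffer`, its constants, and the `detect` method are all hypothetical and exist only for illustration.

```ruby
# Illustrative sketch only: NOT ParseKit's code or API.
# FormatSniffer, MAGIC, EXTENSIONS and detect are hypothetical names.
module FormatSniffer
  # Well-known leading byte signatures for formats named in the post.
  MAGIC = {
    pdf:  ["%PDF".b],
    png:  ["\x89PNG\r\n\x1A\n".b],
    jpeg: ["\xFF\xD8\xFF".b],
    tiff: ["II*\x00".b, "MM\x00*".b], # little-endian and big-endian TIFF
    bmp:  ["BM".b],                   # short signature, so a weak hint
    zip:  ["PK\x03\x04".b]            # DOCX, XLSX and PPTX are ZIP containers
  }.freeze

  # Extension fallback, also used to refine a generic ZIP match.
  EXTENSIONS = {
    ".pdf" => :pdf, ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx,
    ".png" => :png, ".jpg" => :jpeg, ".jpeg" => :jpeg,
    ".tif" => :tiff, ".tiff" => :tiff, ".bmp" => :bmp
  }.freeze

  OFFICE = %i[docx xlsx pptx].freeze

  # Returns a format symbol, preferring content over the file name.
  def self.detect(bytes, filename = nil)
    head = bytes.b[0, 8] # compare raw bytes, not characters
    by_content = MAGIC.find { |_, sigs| sigs.any? { |s| head.start_with?(s) } }&.first
    by_ext = filename && EXTENSIONS[File.extname(filename).downcase]

    # A ZIP signature cannot distinguish Office formats; let the extension decide.
    return by_ext if by_content == :zip && OFFICE.include?(by_ext)

    by_content || by_ext || :unknown
  end
end

FormatSniffer.detect("%PDF-1.7\n")                      # => :pdf
FormatSniffer.detect("PK\x03\x04...".b, "report.docx")  # => :docx
FormatSniffer.detect("\xFF\xD8\xFF\xE0".b, "wrong.pdf") # => :jpeg
```

Preferring content over the extension means a mislabeled file is still routed to the right parser; the extension only breaks ties (such as Office formats sharing the ZIP signature) or serves as a fallback when no signature matches.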

A good fit for content management systems, document indexing, data extraction pipelines, and any application that needs to process documents at scale, all within your Ruby process.

Recent improvements in v0.1.1:

  • Streamlined error handling with better context
  • Enhanced validation helpers
  • Improved format detection reliability
  • Cleaner API surface

Links:

This is part of the ruby-nlp ecosystem aimed at bringing best-in-class text processing tools to Ruby. Feedback welcome!
