MinerU
Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format. MinerU was born during the pre-training process of InternLM. We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models. Compared to well-known commercial products domestically and internationally, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue on GitHub Issues and attach the relevant PDF.
Key Features
- Remove headers, footers, footnotes, page numbers and other elements to ensure semantic coherence
- Output text in human reading order, suitable for single-column, multi-column and complex layouts
- Retain the original document structure, including titles, paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles and footnotes
- Automatically identify and convert formulas in documents to LaTeX format
- Automatically identify and convert tables in documents to HTML format
- Automatically detect scanned PDFs and garbled PDFs, and enable OCR functionality
- OCR supports detection and recognition of 84 languages
- Support multiple output formats, such as multimodal and NLP Markdown, reading-order-sorted JSON, and information-rich intermediate formats
- Support multiple visualization results, including layout visualization, span visualization, etc., for efficient confirmation of output effects and quality inspection
- Support pure CPU environment operation, and support GPU(CUDA)/NPU(CANN)/MPS acceleration
- Compatible with Windows, Linux and Mac platforms