Papers
2025
Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
2024
Docling Technical Report
PatCID: an open-access dataset of chemical structures in patent documents
Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Identifying global biases in hydro-hazard research by mining the scientific literature
Wealth Over Woe: Global Biases in Hydro-Hazard Research
Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery
ESG Accountability Made Easy: DocQA at Your Service
2023
MolGrapher: Graph-based Visual Recognition of Chemical Structures
PatCID: Large-scale chemical-structure database from images in patent documents
ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents
Optimized Table Tokenization for Table Structure Recognition
2022
TableFormer: Table Structure Understanding with Transformers
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
2016 - 2021
Robust PDF Document Conversion Using Recurrent Neural Networks
Corpus Processing Service: A Knowledge Graph Platform to perform deep data exploration on corpora
Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale