Papers

2025

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

2024

Docling Technical Report

PatCID: an open-access dataset of chemical structures in patent documents

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Identifying global biases in hydro-hazard research by mining the scientific literature

Wealth Over Woe: Global Biases in Hydro-Hazard Research

Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery

ESG Accountability Made Easy: DocQA at Your Service

2023

MolGrapher: Graph-based Visual Recognition of Chemical Structures

PatCID: Large-scale chemical-structure database from images in patent documents

ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

Optimized Table Tokenization for Table Structure Recognition

2022

TableFormer: Table Structure Understanding with Transformers

Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

2016 - 2021

Robust PDF Document Conversion Using Recurrent Neural Networks

Corpus Processing Service: A Knowledge Graph Platform to perform deep data exploration on corpora

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale