Docling

Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding.

Subgrapher: visual fingerprinting of chemical structures

Journal of Cheminformatics

Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code are publicly available.
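
As a rough illustration of how a substructure-based fingerprint can support retrieval, the sketch below represents each molecule image as counts of detected substructures and ranks candidates by Tanimoto similarity. The substructure labels, the count-based fingerprint, and the choice of Tanimoto similarity are assumptions made for this example; they are not SubGrapher's actual fingerprint or matching procedure.

```python
from collections import Counter


def tanimoto(a: Counter, b: Counter) -> float:
    """Tanimoto (Jaccard) similarity generalized to count vectors."""
    keys = set(a) | set(b)
    intersection = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return intersection / union if union else 1.0


def rank(query: Counter, database: dict[str, Counter]) -> list[tuple[str, float]]:
    """Return database entries sorted by decreasing similarity to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical detections: substructure label -> number of instances found.
database = {
    "mol_a": Counter({"benzene_ring": 1, "carboxylic_acid": 1}),
    "mol_b": Counter({"benzene_ring": 2, "amine": 1}),
    "mol_c": Counter({"alkyl_chain": 1, "hydroxyl": 1}),
}
query = Counter({"benzene_ring": 1, "carboxylic_acid": 1, "hydroxyl": 1})

ranking = rank(query, database)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the comparison works on substructure counts rather than a reconstructed molecular graph, a partially detected molecule can still be ranked against the database.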

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Proceedings of the IEEE/CVF International Conference on Computer Vision …

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; datasets will be publicly available soon.

Markushgrapher: Joint visual and textual recognition of markush structures

CVPR 2025 …

The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.

Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

arXiv

Granite Vision is a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. The model is trained on a comprehensive instruction-following dataset including document-related tasks such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite language model. Despite its lightweight architecture, Granite Vision achieves strong results in standard benchmarks related to visual document understanding and the LiveXiv benchmark.

Foundation models for materials discovery–current state and future directions

npj Computational Materials

Reviews the wider field of foundation models—of which large language models are a component—and their application to materials discovery. Explores applications to property prediction, synthesis planning and molecular generation, and examines how new methods of data capture and modalities of data will influence the direction of this emerging field. Discusses the role of various types of foundation models including multimodal approaches that combine text, images, and structure information. Provides perspective on current capabilities and future directions for using foundation models to accelerate discovery in materials science. Covers techniques for transfer learning, fine-tuning, and domain-specific adaptation of foundation models to materials science challenges.

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

AAAI

Docling is an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware. Docling is released as a Python package and can be used as a Python API or CLI tool. Its modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. The toolkit has been integrated in popular frameworks like LangChain, LlamaIndex, and spaCy.
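
A minimal sketch of the Python API usage mentioned above, assuming the `docling` package is installed and following its documented quickstart pattern. The wrapper function name `convert_to_markdown` and the file path in the usage comment are illustrative.

```python
def convert_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) and return it as Markdown."""
    # Imported lazily so this module loads even where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


# Example usage (hypothetical path):
# print(convert_to_markdown("report.pdf"))
```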

ChemQuery: A Natural Language Query‐Driven Service for Comprehensive Exploration of Chemistry Patent Literature

Applied AI Letters

ChemQuery is a tool for easily exploring chemistry-related patents using natural language questions. Uses up-to-date information to return specific answers with their sources. Offers a comprehensive search experience through capabilities like extracting molecules from diagrams, integrating information from PubChem, and allowing complex queries about molecular structures. The Parser module learns to write database queries from natural language questions rather than directly generating answers like LLMs do. This approach is especially important for patent exploration, where literature evolves rapidly and users expect access to the latest knowledge. The query-based approach avoids hallucination issues in LLMs and ensures answers are rooted in actual database records. Supports complex questions about molecular structures and provides direct answers traceable through source patents.

Automated disaster event extraction to understand lessons learned: A large-scale text analysis on the scientific literature of floods, droughts, and landslides.

EGU General Assembly 2025

Large-scale text analysis on the scientific literature of floods, droughts, and landslides to automatically extract disaster event information and understand lessons learned. Uses natural language processing techniques to identify and extract mentions of specific disaster events, their characteristics, impacts, and lessons from scientific literature. Enables systematic analysis of what the research community has learned from past disasters to inform future mitigation and adaptation strategies. Complements the earlier work on research biases by extracting actionable knowledge from the literature about specific disasters and their management. Applies automated event extraction to understand patterns in how different types of hydro-hazards have been studied and what knowledge has been accumulated.

Advanced Layout Analysis Models for Docling

arXiv

This technical report documents the development of novel Layout Analysis models integrated into the Docling document-conversion pipeline. We trained several state-of-the-art object detectors based on the RT-DETR, RT-DETRv2 and DFINE architectures on a heterogeneous corpus of 150,000 documents (both openly available and proprietary). Post-processing steps were applied to the raw detections to make them more applicable to the document conversion task. We evaluated the effectiveness of the layout analysis on various document benchmarks using different methodologies while also measuring the runtime performance across different environments (CPU, Nvidia and Apple GPUs). We introduce five new document layout models achieving 20.6% - 23.9% mAP improvement over Docling's previous baseline, with comparable or better runtime. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. Extensive quantitative and qualitative experiments establish best practices for training, evaluating, and deploying document-layout detectors, providing actionable guidance for the document conversion community. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

COLING 2025 Industry Track

Retrieval Augmented Generation (RAG) systems are widespread applications of Large Language Models in industry. While many tools exist to build custom systems, measuring performance locally with datasets reflective of system use cases is challenging. Shows that using public Q&A datasets to assess retrieval performance can lead to suboptimal system design, and that common tools for dataset generation can lead to unbalanced data. Proposes solutions based on characterizing RAG datasets through labels and label-targeted data generation. Demonstrates that fine-tuned small LLMs can efficiently generate Q&A datasets. The paper's taxonomy comprises four classes: fact_single, summary, reasoning, and unanswerable. Addresses a critical gap in RAG evaluation methodology for real-world deployment.
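
To make the label-based characterization concrete, here is a small sketch that measures how an evaluation set is distributed over the four classes named above and flags under-represented ones. The record layout, the `min_share` threshold, and the helper names are assumptions for illustration, not the paper's tooling.

```python
from collections import Counter

LABELS = ("fact_single", "summary", "reasoning", "unanswerable")


def label_shares(records: list[dict]) -> dict[str, float]:
    """Fraction of Q&A records carrying each taxonomy label."""
    counts = Counter(r["label"] for r in records)
    total = sum(counts[label] for label in LABELS)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


def underrepresented(records: list[dict], min_share: float = 0.15) -> list[str]:
    """Labels whose share falls below min_share (an arbitrary threshold)."""
    shares = label_shares(records)
    return [label for label in LABELS if shares[label] < min_share]


# Toy dataset, heavily skewed toward single-fact questions.
records = (
    [{"question": f"q{i}", "label": "fact_single"} for i in range(7)]
    + [{"question": "s0", "label": "summary"}]
    + [{"question": "r0", "label": "reasoning"},
       {"question": "r1", "label": "reasoning"}]
)

shares = label_shares(records)
missing = underrepresented(records)
print(shares)
print("needs more examples:", missing)
```

A list like `missing` is the kind of signal that label-targeted generation could act on, by requesting more questions of the flagged classes.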

Docling Technical Report

arXiv

This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

ACL

Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. This paper proposes Statements, a novel domain agnostic data structure for extracting quantitative facts and related information from ESG reports. The paper proposes translating tables to statements as a new supervised deep-learning universal information extraction task and introduces SemTabNet, a dataset of over 100K annotated tables. The best T5-based Statement Extraction Model generates statements which are 82% similar to the ground-truth.

Wealth Over Woe: Global Biases in Hydro-Hazard Research

Earth's Future

Floods, droughts, and rainfall-induced landslides are hydro-hazards affecting millions yearly. Anticipation, mitigation, and adaptation are increasingly outpaced by changing magnitude and frequency due to climate change. Uses natural language processing with a new climate hazard taxonomy to review 100 million abstracts and to identify and geolocate those dealing with hydro-hazards. The spatial distribution of study areas is mostly defined by human activity, national wealth, data availability, and population distribution. Key finding: a 'Wealth over Woe' bias, in which 100 times more people need to be affected by hazards before low-income countries reach comparable research activity to high-income countries. Recommends enabling and targeting research on hydro-hazards in highly impacted and under-researched regions, with an urgent need to reduce knowledge base biases in order to mitigate and adapt to changing hydro-hazards for a sustainable and equitable future.

SciScribe: Automating and contextualizing literature reviews in cardiac surgery

The Journal of Thoracic and Cardiovascular Surgery

SciScribe is a work product developed in collaboration between Cleveland Clinic and IBM, wherein IBM's Deep Search platform has been augmented to accelerate literature reviews in cardiac surgery. Automates and accelerates the literature review process, mitigates errors associated with repetition and fatigue, and contextualizes results by linking relevant external data sources. Built as an extension of IBM's Deep Search platform, it ingests full-content publications from PubMed Central and structured records from ClinicalTrials and OpenPayments databases. Supports traditional keyword-based search as well as natural language question answering via large language models. Key features: accumulating personal collections from publications, incorporating contextual information from external databases, semantic questioning and answering of documents, and collating results into tables for informed literature assessment.

Preparing a database for a domain specific application using a centralized data repository

US Patent 11,940,962

Patent describing methods for preparing and configuring databases for domain-specific applications using a centralized data repository. Enables efficient setup and customization of databases with domain knowledge. Facilitates rapid development of specialized database applications.

PatCID: an open-access dataset of chemical structures in patent documents

Nature Communications

The automatic analysis of patent publications has potential to accelerate research across drug discovery and material science. Within patent documents, crucial information often resides in visual depictions of molecule structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) allows access to such information at scale, enabling users to search which molecules are displayed in which documents. Contains 81M chemical-structure images and 14M unique chemical structures sourced from major offices (US, Europe, Japan, Korea, China) since 1978. Achieves 56.0% retrieval rate, higher than automatically-created databases Google Patents (41.5%) and SureChEMBL (23.5%), and manually-created databases Reaxys (53.5%) and SciFinder (49.5%). Enables promising applications for automatic literature review and learning-based molecular generation methods. Dataset is freely accessible.

Method of determining a table structure

US Patent App. 18/316,629

Patent application describing methods for automatically determining and analyzing table structures from documents or images. Identifies columns, rows, headers, and relationships in tabular data. Enables extraction and processing of structured data from various document formats.

KVP10k: a comprehensive dataset for key-value pair extraction in business documents

ICDAR 2024

In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academia, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where the extraction process revolves around extracting information using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k, a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents.

Indus: Effective and efficient language models for scientific applications

Proceedings of the Conference on Empirical Methods in Natural Language …

INDUS is a comprehensive suite of large language models tailored for Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, trained using curated scientific corpora. The suite includes: an encoder model for NLP tasks, a contrastive-learning based text embedding model for information retrieval, and smaller knowledge-distilled versions for latency-constrained applications. Three new scientific benchmark datasets are created (CLIMATE-CHANGE NER, NASA-QA, NASA-IR). INDUS outperforms general-purpose (RoBERTa) and domain-specific (SCIBERT) encoders on both new and existing tasks in these domains.

Image table generation

US Patent App. 18/339,263

Patent application describing methods for automatic generation of tables from image content. Extracts tabular data from visual sources and converts to structured table format. Applicable to document digitization and information extraction from images.

Identifying global biases in hydro-hazard research by mining the scientific literature

EGU General Assembly 2024

Uses natural language processing based on a new climate hazard taxonomy to review 100 million abstracts and to identify and geolocate those dealing with hydro-hazards (floods, droughts, rainfall-induced landslides). Maps the global distribution of almost 300,000 abstracts from published flood, drought, and landslide research studies. Finds that the spatial distribution of study areas is mostly defined by human activity, national wealth, data availability, and population distribution. Identifies the 'Wealth over Woe' bias: 100 times more people need to be affected by hazards before low-income countries reach comparable research activity to high-income countries. Recommends high-priority regions for future research and funding to address these global biases and enable more equitable disaster risk reduction.

Dry photoresist or hardmask for euv lithography

US Patent App. 18/319,757

Patent application describing compositions and methods for dry photoresist or hardmask materials for extreme ultraviolet (EUV) lithography. Related to advanced semiconductor manufacturing and materials science. Addresses challenges in EUV lithography processing.

ESG Accountability Made Easy: DocQA at Your Service

AAAI

Deep Search DocQA is presented as an application that enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations.

MolGrapher: Graph-based Visual Recognition of Chemical Structures

ICCV

The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. MolGrapher is introduced to recognize chemical structures visually using a deep keypoint detector to detect atoms, treating all candidate atoms and bonds as nodes in a graph, and classifying them with a Graph Neural Network. A synthetic data generation pipeline is proposed to address the lack of real training data, and a large-scale benchmark of annotated real molecule images (USPTO-30K) is introduced. Extensive experiments show significant improvements over classical and learning-based methods.

Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery

arXiv

Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned through unsupervised learning techniques. This study demonstrates that by incorporating knowledge graphs from diverse sources and modalities into the sequences or SMILES representation, state-of-the-art results can be achieved for drug-target binding affinity prediction. A set of multimodal knowledge graphs is released, integrating data from seven public data sources and containing over 30 million triples, along with pretrained models and source code for standard benchmark tasks.

Optimized Table Tokenization for Table Structure Recognition

ICDAR

Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. This paper investigates how table-structure representation can be optimised and proposes a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct.
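
The sketch below illustrates how a five-token table-structure language can encode spans, by expanding a token sequence into cells with row and column spans. The token names (`C`, `L`, `U`, `X`, `NL`) and their meanings are assumed here and should be checked against the paper; the parser is an illustrative assumption, not the authors' implementation.

```python
def parse_otsl(tokens: list[str]) -> list[dict]:
    """Expand a token sequence into cells with row/column spans.

    Assumed token meanings:
      C  - starts a new cell
      L  - merges with the cell to the left (extends a column span)
      U  - merges with the cell above (extends a row span)
      X  - merges with both left and upper neighbours (inside a 2D span)
      NL - ends the current row
    """
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        rows.append(current)

    width = len(rows[0]) if rows else 0
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have the same number of tokens")

    cells = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # L, U and X only extend a cell started elsewhere
            colspan = 1
            while c + colspan < width and row[c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "U":
                rowspan += 1
            cells.append({"row": r, "col": c, "rowspan": rowspan, "colspan": colspan})
    return cells


# A 3x3 grid: a header spanning all columns, then a first-column cell spanning two rows.
tokens = ["C", "L", "L", "NL",
          "C", "C", "C", "NL",
          "U", "C", "C", "NL"]
cells = parse_otsl(tokens)
for cell in cells:
    print(cell)
```

Because every row carries the same number of tokens, a rectangular grid can be validated with a simple length check, which hints at why a representation of this kind makes syntactically valid output easier to guarantee than free-form HTML.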

ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

ICDAR

Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. This paper presents the results of the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. The competition recorded 45 team registrations and received official submissions from 21 teams. A clear trend towards adoption of vision-transformer based methods is evident, with the results demonstrating substantial progress towards achieving robust and highly generalizing methods for document layout understanding.

PatCID: Large-scale chemical-structure database from images in patent documents

ACS Fall 2023

PatCID (Patent-extracted Chemical-structure Images database for Discovery) allows users to find patents mentioning a given molecule and, conversely, all molecules covered by specific patents. Contains 81M chemical-structure images and 14M unique chemical structures sourced from documents from major offices (United States, Europe, Japan, Korea, and China) since 1978. Creation relies on three key steps: document page segmentation, image classification to identify molecular-structure images, and molecular recognition to obtain final chemical structures. A graph-based visual recognition model was developed comprising a deep keypoint detector and graph neural network. Performance on a random set shows PatCID retrieves 56.0% of molecules, higher than automatically-created databases like Google Patents and SureChEMBL, and manually-created databases. The dataset is freely accessible for download.

Information extraction from document corpora

US Patent App. 17/508,117

Patent application describing methods and systems for large-scale information extraction from document corpora. Enables automated discovery and extraction of relevant information from collections of documents. Applicable to knowledge discovery, research mining, and enterprise information extraction.

FlowchartQA: The first large-scale benchmark for reasoning over flowcharts

LIMO 2023 …

FlowchartQA is a new and unique large-scale benchmark for visual question answering over flowcharts, comprising close to 1M flowchart images and 6M question-answer pairs covering various aspects of geometric and topological information in the charts. Questions have been carefully balanced to minimize biases. A baseline model and comprehensive ablation studies provide a foundation for future work. Experimental results reveal the potential of FlowchartQA as a testbed for flowchart understanding, which has been previously absent from the community. The paper addresses the intersection of visual perception, logical reasoning, and language understanding in the context of technical diagrams.

Efficient ground truth annotation

US Patent 11,556,852

Patent describing methods for efficient generation and management of ground truth annotations for machine learning datasets. Enables faster and more cost-effective creation of annotated training data. Applicable to computer vision, NLP, and document processing applications.

ClimateHub: Deep Search for Climate, Earth and Environmental Sciences

AGU Fall Meeting 2023

Extracts knowledge from scientific climate literature to create a natural language data mining resource for the climate and Earth sciences community. Leverages IBM Deep Search's corpus extraction service to continuously retrieve documents from public data sources and convert them into machine-readable outputs. Conducts data mining on over 200 million Semantic Scholar abstracts and 2 million arXiv publications using named entity and relationship recognition. Utilizes a climate ontology to develop a climate knowledge graph representation. Offers climate queries for sophisticated searches over geographic entities and hazard traversals, enabling climate scientists to access structured knowledge from the literature.

Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

IEEE CLOUD

Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge due to their huge variability in formats and complex structure. This paper focuses on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with strong reliance on machine-learning methods on cloud infrastructure. The key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. The best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
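
The reported figures can be turned into per-core numbers with a few lines of arithmetic. This is a back-of-the-envelope sketch: it treats "over one million pages per hour" as exactly one million (so the results are bounds) and assumes all cores are fully dedicated to conversion, which the abstract does not state.

```python
PAGES_PER_HOUR = 1_000_000  # lower bound reported above
CPU_CORES = 3072
NODES = 192

cores_per_node = CPU_CORES // NODES
pages_per_second = PAGES_PER_HOUR / 3600
pages_per_core_hour = PAGES_PER_HOUR / CPU_CORES
core_seconds_per_page = CPU_CORES * 3600 / PAGES_PER_HOUR

print(f"cores per node:           {cores_per_node}")
print(f"pages per second (min):   {pages_per_second:.1f}")
print(f"pages per core-hour (min): {pages_per_core_hour:.1f}")
print(f"core-seconds per page (max): {core_seconds_per_page:.2f}")
```

Under these assumptions the cluster sustains at least about 278 pages per second, and each page costs at most about 11 core-seconds.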

Knowledge Graphs in Deep Search Climate-Hub

Fall Meeting

Describes the use of knowledge graphs in the Deep Search Climate-Hub platform. Leverages structured knowledge representations to improve search and discovery in climate and environmental science literature. Demonstrates practical application of knowledge graph technology in domain-specific information retrieval.

Feta: Towards specializing foundational models for expert task applications

Advances in Neural Information Processing Systems 35 (NeurIPS 2022)

Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail part of the data distribution of the huge datasets used for FM pre-training. This underlines the necessity to explicitly evaluate and finetune FMs on such expert tasks, arguably ones that appear the most in practical real-world applications. In this paper, we propose a first of its kind FETA benchmark built around the task of teaching FMs to understand technical documentation, via learning to match their graphical illustrations to corresponding language descriptions. Our FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. FETA is equipped with a procedure for completely automatic annotation extraction (code would be released upon acceptance), allowing easy extension of FETA to more documentation types and application domains in the future. Our automatic annotation leads to an automated performance metric shown to be consistent with metrics computed on human-curated annotations (also released). We provide multiple baselines and analysis of popular FMs on FETA leading to several interesting findings that we believe would be very valuable to the FM community, paving the way towards real-world application of FMs for practical expert tasks currently 'overlooked' by standard benchmarks focusing on common objects.

Doclaynet: A large human-annotated dataset for document-layout segmentation

Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and …

Accurate document layout analysis is a key requirement for high-quality PDF document conversion. This paper presents DocLayNet, a new publicly available document-layout annotation dataset in COCO format containing 80863 manually annotated pages from diverse data sources. For each PDF page, layout annotations provide labelled bounding-boxes with 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine inter-annotator agreement. The paper demonstrates that DocLayNet-trained models are more robust and the preferred choice for general-purpose document-layout analysis compared to models trained on PubLayNet and DocBank.
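As a rough illustration of what layout annotations in COCO format look like, the sketch below parses a tiny COCO-style record and counts bounding boxes per class. The file content, image name, sizes, and the two class names are invented for this example and are not taken from DocLayNet; only the general COCO layout (`images`, `categories`, `annotations`, with `bbox` as `[x, y, width, height]`) is assumed.

```python
import json
from collections import Counter

# Tiny hypothetical annotation file in COCO layout (not real DocLayNet data).
coco_json = """
{
  "images": [{"id": 1, "file_name": "page_0001.png", "width": 1000, "height": 1000}],
  "categories": [{"id": 1, "name": "Text"}, {"id": 2, "name": "Table"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 60, 400, 120]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 200, 400, 90]},
    {"id": 12, "image_id": 1, "category_id": 2, "bbox": [50, 320, 900, 300]}
  ]
}
"""


def count_boxes_per_class(coco: dict) -> Counter:
    """Count annotations per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


def bbox_area(bbox) -> float:
    """Area of a COCO box given as [x, y, width, height]."""
    _x, _y, width, height = bbox
    return width * height


coco = json.loads(coco_json)
class_counts = count_boxes_per_class(coco)
print(class_counts)
print(bbox_area(coco["annotations"][2]["bbox"]))
```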

Accelerating materials discovery using artificial intelligence, high performance computing and robotics

npj Computational Materials 8 (1), 84

In materials discovery, traditional manual, serial, and human-intensive work is being augmented by automated, parallel, and iterative processes driven by Artificial Intelligence, simulation, and experimental automation. This paper describes how these new capabilities enable the acceleration and enrichment of each stage of the discovery cycle. Using the example of developing a novel chemically amplified photoresist, the authors demonstrate how these technologies' impacts are amplified when used in concert with each other as powerful, heterogeneous workflows integrating AI-driven optimization, high-performance computing simulations, and experimental robotics.

Robust PDF Document Conversion Using Recurrent Neural Networks

IAAI

The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally compared to visual methods because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and it is computationally less expensive than visual methods. This allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.
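The weighted average F1 score mentioned above can be sketched as follows: compute F1 per label, then average with each label weighted by its number of true instances. This is a generic illustration, not the authors' evaluation code; the function name `weighted_f1` and the three toy labels are assumptions made for this example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred) -> float:
    """Per-label F1, averaged with weights equal to each label's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Weight each label's F1 by how often the label occurs in the ground truth.
        total += f1 * n
    return total / len(y_true)


# Hypothetical labels for five printing commands (not data from the paper).
y_true = ["title", "text", "text", "text", "table"]
y_pred = ["title", "text", "text", "table", "table"]

print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means frequent labels dominate the score, so a high weighted F1 does not by itself guarantee good performance on rare labels.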

Using knowledge graphs to navigate through geological concepts extracted from documents

Offshore Mediterranean Conference and Exhibition, OMC--019

Presents methods for extracting geological concepts from technical documents and organizing them in knowledge graphs for navigation and discovery. Enables petroleum system analysis and basin studies by automatically identifying and relating geological entities extracted from diverse document sources. Facilitates exploration of geological knowledge contained in technical literature.

Racial Representation Analysis in Dermatology Academic Materials

American Medical Informatics Association (AMIA) Annual Symposium 2021

Analyzes racial representation in dermatology academic materials. Studies have shown that images of skin of color are underrepresented in medical textbooks, academic curricula, and educational resources. This work quantitatively examines the prevalence of darker skin tones in dermatology educational materials and curricula. Findings indicate that representation of diverse skin tones remains critically low, with fewer textbooks having increased representation of dermatological diseases in darker skin tones over time. The research highlights the need for improved diversity and inclusion in medical education.

Corpus Processing Service: A Knowledge Graph Platform to perform deep data exploration on corpora

Applied AI Letters

Knowledge Graphs have emerged as the standard for modeling and exploring knowledge in weakly structured data. The Corpus Processing Service (CPS) is a scalable cloud platform designed to process large document corpora, extract content and embedded facts, and represent these in a consistent knowledge graph that can be intuitively queried. The platform uses state-of-the-art natural language understanding models to extract entities and relationships from documents, complemented with a newly developed graph engine ensuring performant graph queries and powerful analytics. Designed as a modular microservice system on Kubernetes clusters, CPS is validated in real-world applications in the oil and gas industry.

Application of Geocognitive Technologies to Basin & Petroleum System Analyses

Abu Dhabi International Petroleum Exhibition and Conference

Eni and IBM developed a cognitive engine that uses a deep-learning approach to scan documents for basin geology concepts, extracting information about petroleum system elements such as formation name, geological age, and lithology of source rocks, reservoirs, and seals. It enables basin geologists to run automated queries that collect all information related to a basin of interest. The cognitive engine identifies the correct formations, lithologies, and geological ages with an accuracy of 75 to 90%. The system uses convolutional neural networks for structural element recognition and recurrent neural networks for concept extraction, and builds a knowledge graph to link entities and relationships.

An information extraction and knowledge graph platform for accelerating biochemical discoveries

Workshop on Applied Data Science for Healthcare at KDD 2019

Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document ingestion system that integrates data from databases and publications (in PDF format) in a biochemistry knowledge graph (BCKG). The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. After describing the knowledge ingestion framework, we showcase an application of our system in the field of carbohydrate enzymes. The BCKG represents a way to scale knowledge ingestion and automatically exploit prior knowledge to accelerate discovery in biochemical sciences.

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

KDD

Over the past few decades, the number of scientific articles and the volume of technical literature have increased exponentially. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) and the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms and ultimately convert any type of PDF or bitmap document to a structured content representation format. We will show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we will show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us to both gather large amounts of ground-truth in very little time and obtain very good precision/recall metrics in the range of 99% with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serves more than 250 active users for knowledge-engineering project engagements.

Papers

2026

Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

2025

Subgrapher: visual fingerprinting of chemical structures

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Markushgrapher: Joint visual and textual recognition of markush structures

Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

Foundation models for materials discovery–current state and future directions

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

ChemQuery: A Natural Language Query‐Driven Service for Comprehensive Exploration of Chemistry Patent Literature

Automated disaster event extraction to understand lessons learned: A large-scale text analysis on the scientific literature of floods, droughts, and landslides.

Advanced Layout Analysis Models for Docling

2024

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Docling Technical Report

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Wealth Over Woe: Global Biases in Hydro-Hazard Research

SciScribe: Automating and contextualizing literature reviews in cardiac surgery

Preparing a database for a domain specific application using a centralized data repository

PatCID: an open-access dataset of chemical structures in patent documents

Method of determining a table structure

KVP10k: a comprehensive dataset for key-value pair extraction in business documents

Indus: Effective and efficient language models for scientific applications

Image table generation

Identifying global biases in hydro-hazard research by mining the scientific literature

Dry photoresist or hardmask for euv lithography

2023

ESG Accountability Made Easy: DocQA at Your Service

MolGrapher: Graph-based Visual Recognition of Chemical Structures

Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery

Optimized Table Tokenization for Table Structure Recognition

ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

PatCID: Large-scale chemical-structure database from images in patent documents

Information extraction from document corpora

FlowchartQA: The first large-scale benchmark for reasoning over flowcharts

Efficient ground truth annotation

ClimateHub: Deep Search for Climate, Earth and Environmental Sciences

2022

Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

TableFormer: Table Structure Understanding with Transformers

Unsupervised term extraction for highly technical domains

Unsupervised domain generalization by learning a bridge across domains

Knowledge Graphs in Deep Search Climate-Hub

Feta: Towards specializing foundational models for expert task applications

Doclaynet: A large human-annotated dataset for document-layout segmentation

Accelerating materials discovery using artificial intelligence, high performance computing and robotics

2021

Robust PDF Document Conversion Using Recurrent Neural Networks

Using knowledge graphs to navigate through geological concepts extracted from documents

Racial Representation Analysis in Dermatology Academic Materials

2020

Corpus Processing Service: A Knowledge Graph Platform to perform deep data exploration on corpora

2019

Application of Geocognitive Technologies to Basin & Petroleum System Analyses

An information extraction and knowledge graph platform for accelerating biochemical discoveries

2018

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale