Papers

2026

Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

Modern computer-use agents (CUAs) must perceive a screen as a structured state: what elements are visible, where they are, and what text they contain, before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient, low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency for low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a 316M-parameter vision-language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding.
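Dense-parsing metrics such as the paper's PageIoU build on per-element box overlap. As a generic illustration (not the paper's exact PageIoU definition, which is specific to ScreenParse), intersection-over-union between two predicted and ground-truth element boxes can be computed as:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0
```

A matcher over all predicted and annotated elements on a page would aggregate such per-box scores; the exact aggregation in PageIoU is defined by the paper.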

2025

SubGrapher: visual fingerprinting of chemical structures

Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code are publicly available.
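SubGrapher's retrieval rests on comparing substructure-based fingerprints rather than full molecular graphs. A minimal sketch of that idea, using plain sets of detected substructures and Tanimoto similarity (the actual fingerprints are produced by learned instance segmentation, and the group names below are made up for illustration):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints represented as sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# Hypothetical fingerprints: sets of functional groups / backbone fragments
# detected in a chemical-structure image.
query = {"benzene", "carboxyl", "methyl"}
database = {
    "mol_1": {"benzene", "carboxyl", "hydroxyl"},
    "mol_2": {"pyridine", "methyl"},
}

# Rank database molecules by similarity to the query image's fingerprint.
ranked = sorted(database, key=lambda m: tanimoto(query, database[m]), reverse=True)
```

Set-based Tanimoto is the standard retrieval primitive for substructure fingerprints; swapping in learned segmentation outputs changes only how the sets are produced.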

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures content, structure, and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available, datasets will be publicly available soon.

MarkushGrapher: Joint visual and textual recognition of Markush structures

The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.

Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

Granite Vision is a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. The model is trained on a comprehensive instruction-following dataset including document-related tasks such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite language model. Despite its lightweight architecture, Granite Vision achieves strong results in standard benchmarks related to visual document understanding and the LiveXiv benchmark.

Foundation models for materials discovery–current state and future directions

Reviews the wider field of foundation models—of which large language models are a component—and their application to materials discovery. Explores applications to property prediction, synthesis planning and molecular generation, and examines how new methods of data capture and modalities of data will influence the direction of this emerging field. Discusses the role of various types of foundation models including multimodal approaches that combine text, images, and structure information. Provides perspective on current capabilities and future directions for using foundation models to accelerate discovery in materials science. Covers techniques for transfer learning, fine-tuning, and domain-specific adaptation of foundation models to materials science challenges.

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

Docling is an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware. Docling is released as a Python package and can be used as a Python API or CLI tool. Its modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. The toolkit has been integrated in popular frameworks like LangChain, LlamaIndex, and spaCy.

ChemQuery: A Natural Language Query‐Driven Service for Comprehensive Exploration of Chemistry Patent Literature

ChemQuery is a tool for easily exploring chemistry-related patents using natural language questions. Uses up-to-date information to return specific answers with their sources. Offers a comprehensive search experience through capabilities like extracting molecules from diagrams, integrating information from PubChem, and allowing complex queries about molecular structures. The Parser module learns to write database queries from natural language questions rather than directly generating answers as LLMs do. This approach is especially important for patent exploration, where the literature evolves rapidly and users expect access to the latest knowledge. The query-based approach avoids hallucination issues in LLMs and ensures answers are rooted in actual database records. Supports complex questions about molecular structures and provides direct answers traceable through source patents.

Automated disaster event extraction to understand lessons learned: A large-scale text analysis on the scientific literature of floods, droughts, and landslides.

Large-scale text analysis on the scientific literature of floods, droughts, and landslides to automatically extract disaster event information and understand lessons learned. Uses natural language processing techniques to identify and extract mentions of specific disaster events, their characteristics, impacts, and lessons from scientific literature. Enables systematic analysis of what the research community has learned from past disasters to inform future mitigation and adaptation strategies. Complements the earlier work on research biases by extracting actionable knowledge from the literature about specific disasters and their management. Applies automated event extraction to understand patterns in how different types of hydro-hazards have been studied and what knowledge has been accumulated.

Advanced Layout Analysis Models for Docling

This technical report documents the development of novel Layout Analysis models integrated into the Docling document-conversion pipeline. We trained several state-of-the-art object detectors based on the RT-DETR, RT-DETRv2 and DFINE architectures on a heterogeneous corpus of 150,000 documents (both openly available and proprietary). Post-processing steps were applied to the raw detections to make them more applicable to the document conversion task. We evaluated the effectiveness of the layout analysis on various document benchmarks using different methodologies while also measuring the runtime performance across different environments (CPU, Nvidia and Apple GPUs). We introduce five new document layout models achieving 20.6%-23.9% mAP improvement over Docling's previous baseline, with comparable or better runtime. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. Extensive quantitative and qualitative experiments establish best practices for training, evaluating, and deploying document-layout detectors, providing actionable guidance for the document conversion community. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.

2024

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Retrieval Augmented Generation (RAG) systems are widespread applications of Large Language Models in industry. While many tools exist to build custom systems, measuring performance locally with datasets reflective of system use cases is challenging. Shows that using public Q&A datasets to assess retrieval performance can lead to non-optimal systems design, and common tools for dataset generation can lead to unbalanced data. Proposes solutions based on characterizing RAG datasets through labels and label-targeted data generation. Demonstrates that fine-tuned small LLMs can efficiently generate Q&A datasets. The paper's taxonomy comprises four classes: fact_single, summary, reasoning, and unanswerable. Addresses critical gap in RAG evaluation methodology for real-world deployment.

Docling Technical Report

This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. This paper proposes Statements, a novel domain-agnostic data structure for extracting quantitative facts and related information from ESG reports. The paper proposes translating tables to statements as a new supervised deep-learning universal information extraction task and introduces SemTabNet, a dataset of over 100K annotated tables. The best T5-based Statement Extraction Model generates statements that are 82% similar to the ground truth.
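The core idea of "tables to statements" is that every data cell becomes one quantitative fact tied to its row and column headers. A toy flattening sketch (the real Statements schema and SemTabNet annotations are far richer; all field names and figures below are illustrative):

```python
# Hypothetical ESG table: column headers are reporting periods,
# each row pairs a KPI label with its values per period.
table = {
    "headers": ["2021", "2022"],
    "rows": [
        ("Scope 1 emissions (tCO2e)", [120.0, 98.0]),
        ("Water consumption (ML)", [5.1, 4.7]),
    ],
}

def to_statements(table):
    """Flatten a header/row table into one statement per data cell."""
    stmts = []
    for row_label, values in table["rows"]:
        for header, value in zip(table["headers"], values):
            stmts.append({"subject": row_label, "period": header, "value": value})
    return stmts
```

The supervised task in the paper is learning this mapping end-to-end from table images and markup, rather than applying hand-written rules as above.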

Wealth Over Woe: Global Biases in Hydro-Hazard Research

Floods, droughts, and rainfall-induced landslides are hydro-hazards affecting millions yearly. Anticipation, mitigation, and adaptation are increasingly outpaced by changes in magnitude and frequency due to climate change. Uses natural language processing with a new climate hazard taxonomy to review 100 million abstracts and identify and geolocate those dealing with hydro-hazards. The spatial distribution of study areas is mostly defined by human activity, national wealth, data availability, and population distribution. Key finding is the 'Wealth over Woe' bias: 100 times more people need to be affected by hazards before low-income countries reach research activity comparable to high-income countries. Recommends enabling and targeting research on hydro-hazards in highly impacted, under-researched regions, with an urgent need to reduce knowledge-base biases in order to mitigate and adapt to changing hydro-hazards for a sustainable and equitable future.

SciScribe: Automating and contextualizing literature reviews in cardiac surgery

SciScribe is a work product developed in collaboration between Cleveland Clinic and IBM, wherein IBM's Deep Search platform has been augmented to accelerate literature reviews in cardiac surgery. Automates and accelerates the literature review process, mitigates errors associated with repetition and fatigue, and contextualizes results by linking relevant external data sources. Built as an extension of IBM's Deep Search platform, it ingests full-content publications from PubMed Central and structured records from ClinicalTrials and OpenPayments databases. Supports traditional keyword-based search as well as natural language question answering via large language models. Key features: accumulating personal collections from publications, incorporating contextual information from external databases, semantic questioning and answering of documents, and collating results into tables for informed literature assessment.

Preparing a database for a domain specific application using a centralized data repository

Patent describing methods for preparing and configuring databases for domain-specific applications using a centralized data repository. Enables efficient setup and customization of databases with domain knowledge. Facilitates rapid development of specialized database applications.

PatCID: an open-access dataset of chemical structures in patent documents

The automatic analysis of patent publications has potential to accelerate research across drug discovery and material science. Within patent documents, crucial information often resides in visual depictions of molecule structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) allows access to such information at scale, enabling users to search which molecules are displayed in which documents. Contains 81M chemical-structure images and 14M unique chemical structures sourced from major offices (US, Europe, Japan, Korea, China) since 1978. Achieves 56.0% retrieval rate, higher than automatically-created databases Google Patents (41.5%) and SureChEMBL (23.5%), and manually-created databases Reaxys (53.5%) and SciFinder (49.5%). Enables promising applications for automatic literature review and learning-based molecular generation methods. Dataset is freely accessible.

Method of determining a table structure

Patent application describing methods for automatically determining and analyzing table structures from documents or images. Identifies columns, rows, headers, and relationships in tabular data. Enables extraction and processing of structured data from various document formats.

KVP10k: a comprehensive dataset for key-value pair extraction in business documents

In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academia, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where the extraction process revolves around extracting information using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k, a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents.
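The KIE/KVP distinction is easy to make concrete: KIE fills a fixed schema of known keys, while open KVP extraction must discover the keys themselves. A deliberately naive sketch of the latter, using a flat "key: value" text pattern (real KVP10k documents have complex 2D layouts that this ignores; the invoice fields below are made up):

```python
import re

def extract_kvps(text):
    """Discover key-value pairs with no predefined key list."""
    pairs = {}
    for line in text.splitlines():
        # Key: short run of non-colon characters; value: rest of the line.
        m = re.match(r"\s*([^:]{1,40}?)\s*:\s*(.+?)\s*$", line)
        if m:
            pairs[m.group(1)] = m.group(2)
    return pairs

doc = """Invoice No: 2024-0017
Due Date: 31 May 2024
Total: 1,250.00 EUR"""
```

Models trained on KVP10k replace the regex with layout-aware detection, but the output contract is the same: pairs whose keys were not known in advance.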

INDUS: Effective and efficient language models for scientific applications

INDUS is a comprehensive suite of large language models tailored for Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, trained using curated scientific corpora. The suite includes: an encoder model for NLP tasks, a contrastive-learning based text embedding model for information retrieval, and smaller knowledge-distilled versions for latency-constrained applications. Three new scientific benchmark datasets are created (CLIMATE-CHANGE NER, NASA-QA, NASA-IR). INDUS outperforms general-purpose (RoBERTa) and domain-specific (SCIBERT) encoders on both new and existing tasks in these domains.

Image table generation

Patent application describing methods for automatic generation of tables from image content. Extracts tabular data from visual sources and converts to structured table format. Applicable to document digitization and information extraction from images.

Identifying global biases in hydro-hazard research by mining the scientific literature

Uses natural language processing based on a new climate hazard taxonomy to review 100 million abstracts and identify and geolocate those dealing with hydro-hazards (floods, droughts, rainfall-induced landslides). Maps the global distribution of almost 300,000 abstracts from published flood, drought, and landslide research studies. Finds that the spatial distribution of study areas is mostly defined by human activity, national wealth, data availability, and population distribution. Identifies the 'Wealth over Woe' bias: 100 times more people need to be affected by hazards before low-income countries reach research activity comparable to high-income countries. Recommends high-priority regions for future research and funding to address these global biases and enable more equitable disaster risk reduction.

Dry photoresist or hardmask for euv lithography

Patent application describing compositions and methods for dry photoresist or hardmask materials for extreme ultraviolet (EUV) lithography. Related to advanced semiconductor manufacturing and materials science. Addresses challenges in EUV lithography processing.

2023

ESG Accountability Made Easy: DocQA at Your Service

Deep Search DocQA is presented as an application that enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines: document conversion to a machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations.

MolGrapher: Graph-based Visual Recognition of Chemical Structures

The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. MolGrapher is introduced to recognize chemical structures visually using a deep keypoint detector to detect atoms, treating all candidate atoms and bonds as nodes in a graph, and classifying them with a Graph Neural Network. A synthetic data generation pipeline is proposed to address the lack of real training data, and a large-scale benchmark of annotated real molecule images (USPTO-30K) is introduced. Extensive experiments show significant improvements over classical and learning-based methods.

Otter-Knowledge: benchmarks of multimodal knowledge graph representation learning from different sources for drug discovery

Recent research on predicting the binding affinity between drug molecules and proteins use representations learned through unsupervised learning techniques. This study demonstrates that by incorporating knowledge graphs from diverse sources and modalities into the sequences or SMILES representation, state-of-the-art results can be achieved for drug-target binding affinity prediction. A set of multimodal knowledge graphs is released, integrating data from seven public data sources and containing over 30 million triples, along with pretrained models and source code for standard benchmark tasks.

Optimized Table Tokenization for Table Structure Recognition

Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. This paper investigates how table-structure representation can be optimised and proposes a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct.
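OTSL's five tokens describe a rectangular cell grid directly, which is what makes the representation compact and always syntactically valid. A simplified decoder sketch, assuming the token roles described for OTSL ("C" opens a new cell, "L" merges with the cell to the left, "U"/"X" merge upward, "NL" ends a row); the exact grammar and validation rules are the paper's:

```python
def decode_otsl(tokens):
    """Decode a flat OTSL-style token sequence into a grid of cell IDs."""
    grid, row, next_id = [], [], 0
    for tok in tokens:
        if tok == "NL":
            grid.append(row)  # close the current row
            row = []
        elif tok == "C":
            row.append(next_id)  # start a new cell
            next_id += 1
        elif tok == "L":
            row.append(row[-1])  # horizontal span: reuse cell to the left
        elif tok in ("U", "X"):
            row.append(grid[-1][len(row)])  # vertical span: reuse cell above
    return grid

# A 2x2 table whose first row is a single cell spanning both columns:
grid = decode_otsl(["C", "L", "NL", "C", "C", "NL"])
```

Because every position is filled by construction, any sequence over this vocabulary that respects row lengths decodes to a well-formed grid, which is the source of the "always syntactically correct" property.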

ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. This paper presents the results of the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. The competition recorded 45 team registrations and received official submissions from 21 teams. A clear trend towards adoption of vision-transformer based methods is evident, with the results demonstrating substantial progress towards achieving robust and highly generalizing methods for document layout understanding.

PatCID: Large-scale chemical-structure database from images in patent documents

PatCID (Patent-extracted Chemical-structure Images database for Discovery) allows users to find patents mentioning a given molecule and, conversely, all molecules covered by specific patents. Contains 81M chemical-structure images and 14M unique chemical structures sourced from documents from major offices (United States, Europe, Japan, Korea, and China) since 1978. Creation relies on three key steps: document page segmentation, image classification to identify molecular-structure images, and molecular recognition to obtain final chemical structures. A graph-based visual recognition model was developed comprising a deep keypoint detector and graph neural network. Performance on a random set shows PatCID retrieves 56.0% of molecules, higher than automatically-created databases like Google Patents and SureChEMBL, and manually-created databases. The dataset is freely accessible for download.

Information extraction from document corpora

Patent application describing methods and systems for large-scale information extraction from document corpora. Enables automated discovery and extraction of relevant information from collections of documents. Applicable to knowledge discovery, research mining, and enterprise information extraction.

FlowchartQA: The first large-scale benchmark for reasoning over flowcharts

FlowchartQA is a new and unique large-scale benchmark for visual question answering over flowcharts, comprising close to 1M flowchart images and 6M question-answer pairs covering various aspects of geometric and topological information in the charts. Questions have been carefully balanced to minimize biases. A baseline model and comprehensive ablation studies provide a foundation for future work. Experimental results reveal the potential of FlowchartQA as a testbed for flowchart understanding, which has been previously absent from the community. The paper addresses the intersection of visual perception, logical reasoning, and language understanding in the context of technical diagrams.

Efficient ground truth annotation

Patent describing methods for efficient generation and management of ground truth annotations for machine learning datasets. Enables faster and more cost-effective creation of annotated training data. Applicable to computer vision, NLP, and document processing applications.

2022

Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge due to their huge variability in formats and complex structure. This paper focuses on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with strong reliance on machine-learning methods on cloud infrastructure. The key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. The best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
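The reported sustained throughput implies a concrete per-core budget, which is worth making explicit as a back-of-the-envelope check (derived purely from the figures quoted above):

```python
# Figures from the reported result: one million PDF pages per hour
# sustained across 3072 CPU cores (192 nodes).
pages_per_hour = 1_000_000
cores = 3072

pages_per_core_per_hour = pages_per_hour / cores          # ~325.5 pages
seconds_per_page_per_core = 3600 / pages_per_core_per_hour  # ~11.06 s
```

So each core has roughly eleven seconds of budget per page, which bounds how heavy the per-page ML inference in the pipeline can be at that scale.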

TableFormer: Table Structure Understanding with Transformers

Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graphs, etc., since it enhances their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, a variety of separation lines, missing entries, etc. As such, the correct identification of the table structure from an image is a non-trivial task. In this paper, a new table-structure identification model is presented that improves the latest end-to-end deep learning model (encoder-dual-decoder from PubTabNet) by introducing a new object detection decoder for table-cells and replacing LSTM decoders with transformer-based decoders, achieving 98.5% TEDS on simple tables and 95% on complex tables.

Unsupervised term extraction for highly technical domains

Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence-encoders. The annotator is used to implement a weakly-supervised setup, where transformer-models are fine-tuned (or pre-trained) over the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that our setup can improve the predictive performance while decreasing the inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all the cases where annotations are not available.

Unsupervised domain generalization by learning a bridge across domains

The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generalization (UDG) setup of having no training supervision in either the source or target domains. Our approach is based on self-supervised learning of a Bridge Across Domains (BrAD) - an auxiliary bridge domain accompanied by a set of semantics-preserving visual (image-to-image) mappings to BrAD from each of the training domains. The BrAD and mappings to it are learned jointly (end-to-end) with a contrastive self-supervised representation model that semantically aligns each of the domains to its BrAD-projection, and hence implicitly drives all the domains (seen or unseen) to semantically align to each other. In this work, we show how, using an edge-regularized BrAD, our approach achieves significant gains across multiple benchmarks and a range of tasks, including UDG, Few-shot UDA, and unsupervised generalization across multi-domain datasets (including generalization to unseen domains and classes).

Feta: Towards specializing foundational models for expert task applications

arXiv scholar cite view →
Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high-fidelity data synthesis, and out-of-domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car-manual technical illustrations from language queries), whose data is either unseen or belongs to a long-tail part of the data distribution of the huge datasets used for FM pre-training. This underlines the necessity to explicitly evaluate and finetune FMs on such expert tasks, arguably the ones that appear most in practical real-world applications. In this paper, we propose a first-of-its-kind FETA benchmark built around the task of teaching FMs to understand technical documentation, via learning to match their graphical illustrations to corresponding language descriptions. Our FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. FETA is equipped with a procedure for completely automatic annotation extraction (code would be released upon acceptance), allowing easy extension of FETA to more documentation types and application domains in the future. Our automatic annotation leads to an automated performance metric shown to be consistent with metrics computed on human-curated annotations (also released). We provide multiple baselines and an analysis of popular FMs on FETA, leading to several interesting findings that we believe will be valuable to the FM community, paving the way towards real-world application of FMs for practical expert tasks currently 'overlooked' by standard benchmarks focusing on common objects.

Doclaynet: A large human-annotated dataset for document-layout segmentation

arXiv scholar cite view →
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. This paper presents DocLayNet, a new publicly available document-layout annotation dataset in COCO format containing 80,863 manually annotated pages from diverse data sources. For each PDF page, layout annotations provide labelled bounding boxes in 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine inter-annotator agreement. The paper demonstrates that DocLayNet-trained models are more robust and the preferred choice for general-purpose document-layout analysis compared to models trained on PubLayNet and DocBank.
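Because the annotations follow the standard COCO layout, they can be consumed with a few lines of plain Python. A minimal sketch of reading such a payload and tallying boxes per class (the inline payload below is a made-up miniature in the same shape; DocLayNet defines 11 classes such as Text, Title, and Table):

```python
from collections import Counter

# Minimal COCO-style payload mimicking the annotation format
# (illustrative data, not taken from the dataset itself).
coco = {
    "categories": [{"id": 1, "name": "Text"}, {"id": 2, "name": "Table"}],
    "annotations": [
        {"image_id": 7, "category_id": 1, "bbox": [10, 10, 200, 40]},
        {"image_id": 7, "category_id": 1, "bbox": [10, 60, 200, 40]},
        {"image_id": 7, "category_id": 2, "bbox": [10, 120, 300, 150]},
    ],
}

def class_counts(coco_dict):
    """Count annotated bounding boxes per layout class."""
    names = {c["id"]: c["name"] for c in coco_dict["categories"]}
    return Counter(names[a["category_id"]] for a in coco_dict["annotations"])
```

The same two-line lookup pattern works unchanged on the real annotation files, since they share the `categories`/`annotations` structure.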

Accelerating materials discovery using artificial intelligence, high performance computing and robotics

scholar cite view →
In materials discovery, traditional manual, serial, and human-intensive work is being augmented by automated, parallel, and iterative processes driven by Artificial Intelligence, simulation, and experimental automation. This paper describes how these new capabilities enable the acceleration and enrichment of each stage of the discovery cycle. Using the example of developing a novel chemically amplified photoresist, the authors demonstrate how the impact of these technologies is amplified when they are used in concert as powerful, heterogeneous workflows integrating AI-driven optimization, high-performance computing simulations, and experimental robotics.

2021

Robust PDF Document Conversion Using Recurrent Neural Networks

The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input to a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages. First, it can distinguish among more fine-grained labels (typically 10-20 labels, as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take the text flow across pages into account more naturally than visual methods, because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and is computationally less expensive than visual methods, which allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we implemented a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. Our best model is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human-annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.
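The per-command classification idea can be sketched with a tiny recurrent network: each printing command is encoded as a small feature vector (position, font size, ...), and a recurrent state carries the text flow from command to command. The features, dimensions, and untrained weights below are purely illustrative; the paper's production model is far more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-command features, normalized to a US-Letter page.
def encode_command(x, y, font_size, is_bold):
    return np.array([x / 612.0, y / 792.0, font_size / 24.0, float(is_bold)])

# Minimal Elman RNN with a per-step classifier over structural labels.
D, H, L = 4, 8, 3                        # input, hidden, label dims
Wxh = rng.normal(0, 0.1, (H, D))
Whh = rng.normal(0, 0.1, (H, H))
Why = rng.normal(0, 0.1, (L, H))

def classify_sequence(commands):
    """Assign one structural label per printing command."""
    h = np.zeros(H)
    labels = []
    for c in commands:
        h = np.tanh(Wxh @ c + Whh @ h)   # recurrent state carries text flow
        labels.append(int(np.argmax(Why @ h)))
    return labels

seq = [encode_command(72, 700, 18, True),    # e.g. a title-like line
       encode_command(72, 660, 10, False)]   # e.g. a body-text line
labels = classify_sequence(seq)
```

Because the state `h` persists across commands, concatenating the command streams of consecutive pages lets the classifier see text flow across page boundaries, which is the second advantage cited above.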

Racial Representation Analysis in Dermatology Academic Materials

scholar cite view →
Studies have shown that images of skin of color are underrepresented in medical textbooks, academic curricula, and educational resources. This work quantitatively examines the prevalence of darker skin tones in dermatology educational materials and curricula. Findings indicate that representation of diverse skin tones remains critically low, with few textbooks increasing their representation of dermatological diseases in darker skin tones over time. The research highlights the need for improved diversity and inclusion in medical education.

2020

Corpus Processing Service: A Knowledge Graph Platform to perform deep data exploration on corpora

scholar cite view →
Knowledge Graphs have emerged as the standard for modeling and exploring knowledge in weakly structured data. The Corpus Processing Service (CPS) is a scalable cloud platform designed to process large document corpora, extract content and embedded facts, and represent these in a consistent knowledge graph that can be intuitively queried. The platform uses state-of-the-art natural language understanding models to extract entities and relationships from documents, complemented with a newly developed graph engine ensuring performant graph queries and powerful analytics. Designed as a modular microservice system on Kubernetes clusters, CPS is validated in real-world applications in the oil and gas industry.
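The kind of entity/relationship facts CPS extracts, and the graph queries it serves, can be illustrated with a toy triple store and a two-hop traversal (all node names and relations below are hypothetical examples in the platform's oil-and-gas validation domain, not CPS's API):

```python
# Toy triple store: (subject, relation, object) facts of the kind
# extracted from documents into the knowledge graph.
triples = [
    ("WellReport-17", "mentions", "Formation-A"),
    ("Formation-A", "has_lithology", "sandstone"),
    ("Formation-A", "has_age", "Jurassic"),
    ("WellReport-17", "mentions", "Formation-B"),
    ("Formation-B", "has_lithology", "shale"),
]

def neighbors(node, relation=None):
    """Objects reachable from `node`, optionally filtered by relation."""
    return [o for s, r, o in triples
            if s == node and (relation is None or r == relation)]

def lithologies_mentioned_in(doc):
    """Two-hop query: document -> formations -> lithologies."""
    return sorted({lith
                   for fm in neighbors(doc, "mentions")
                   for lith in neighbors(fm, "has_lithology")})
```

A dedicated graph engine answers exactly this style of multi-hop question, but with indexes and parallelism so it stays performant at corpus scale.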

2019

Application of Geocognitive Technologies to Basin & Petroleum System Analyses

scholar cite view →
Eni and IBM developed a cognitive engine exploiting a deep-learning approach to scan documents for basin geology concepts, extracting information about petroleum system elements such as the formation name, geological age, and lithology of source rocks, reservoirs, and seals. The engine enables basin geologists to perform automated queries to collect all information related to a basin of interest, and it identifies the correct formations, lithologies, and geological ages with accuracies in the range of 75 to 90%. The system uses convolutional neural networks for structural element recognition and recurrent neural networks for concept extraction, building a knowledge graph to link entities and relationships.

An information extraction and knowledge graph platform for accelerating biochemical discoveries

arXiv scholar cite view →
Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document ingestion system that integrates data from databases and publications (in PDF format) in a biochemistry knowledge graph (BCKG). The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. After describing the knowledge ingestion framework, we showcase an application of our system in the field of carbohydrate enzymes. The BCKG represents a way to scale knowledge ingestion and automatically exploit prior knowledge to accelerate discovery in biochemical sciences.

2018

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

Over the past few decades, the number of scientific articles and technical literature has increased exponentially. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) and the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms, and ultimately convert any type of PDF or bitmap document to a structured content representation format. We show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us to both gather large amounts of ground-truth in very little time and obtain very good precision/recall metrics, in the range of 99%, with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure, serving more than 250 active users for knowledge-engineering project engagements.
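The scalability argument rests on decoupling the pipeline stages through queues, so each stage can be scaled independently of the others. A schematic single-process sketch of that idea using in-memory queues (the stage bodies are stand-ins, not the service's actual logic or API):

```python
from queue import Queue

# Stand-in stage functions: parse -> annotate -> convert.
def parse(doc):
    return {"doc": doc, "cells": [f"{doc}-cell"]}

def annotate(task):
    task["labels"] = ["text"] * len(task["cells"])
    return task

def convert(task):
    return {"doc": task["doc"], "json": list(zip(task["cells"], task["labels"]))}

def run_pipeline(docs):
    """Push documents through queue-decoupled stages; in the real
    system each queue is a message broker and each stage a pool of
    independently scaled microservice workers."""
    q = Queue()
    for d in docs:
        q.put(d)
    for stage in (parse, annotate, convert):
        next_q = Queue()
        while not q.empty():
            next_q.put(stage(q.get()))
        q = next_q
    return [q.get() for _ in range(q.qsize())]
```

Because no stage calls the next one directly, a slow stage only grows its input queue; adding workers to that stage restores throughput without touching the rest of the pipeline.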