Introduction
Recent advances in Large Language Models have opened up a wide variety of new use cases, notably advanced Retrieval-Augmented Generation (RAG) applications. These use cases mostly involve querying information in complex textual documents such as PDFs.
At Lettria, we are tackling this problem by leveraging knowledge management and graphs through the development of a GraphRAG pipeline.
The process of building a GraphRAG typically involves:
- Extracting a Graph from a document
- Merging to an existing Graph
- Querying the Graph
Each of these steps presents challenges, one of which is the multimodality of source documents.
In particular, we work with clients whose documents contain increasingly complex table structures, such as corporate annual reports and industry documents. Naive solutions such as transcribing each row or cell of the table into a sentence lose cross-references and depend on the quality of the user queries. Others describe the whole table as a single chunk and prompt the LLM with the full table. While promising for small tables, this approach can lead to hallucinations on larger tables and uses the LLM's context window inefficiently.
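To make the limitation of the naive approach concrete, here is a minimal sketch of cell-by-cell transcription. The function name and the sample table are invented for illustration and are not part of our pipeline. Each sentence stands alone, so nothing links "Revenue" in 2022 to "Revenue" in 2023 or to the other rows.

```python
def rows_to_sentences(header, rows):
    """Turn every data cell into a standalone sentence.

    header: list of column names, the first one labelling the row header.
    rows:   list of rows, each starting with its row label.
    """
    sentences = []
    for row in rows:
        label, values = row[0], row[1:]
        for column, value in zip(header[1:], values):
            sentences.append(f"{label}: {column} is {value}.")
    return sentences


# Invented sample table, for illustration only
header = ["Item", "2022", "2023"]
rows = [["Revenue", "100", "120"], ["Costs", "80", "90"]]

for sentence in rows_to_sentences(header, rows):
    print(sentence)
```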
To address these challenges, we focused on integrating the knowledge found in these tables into the graph structure of our GraphRAG solution. Join us as we explore our journey to extract, structure and query tabular data from source documents.
The challenge behind tables
Tables present a significant challenge in written data processing due to their often complex structures, especially in specialized domains (health, finance…). These tables may feature multiple headers, shared headers or cross-references, making algorithmic processing difficult.
A brief online search for tables in PDF documents shows how widely formats and structures vary from one industry to another.
Example of a complex table. The "All industries" header spans the rows that follow it, and the "Companies" header is split into two nested headers.
To begin our work, we collected multiple documents from a diverse set of domains with a large variety of table structures. Our goal is to find a general framework to structure table information as graphs.
Lettria Table Parsing Framework
Based on our exploration and analysis of sample tables, we developed a formalized approach to structure and represent tables as graphs. This formalization process involved describing table features and defining specific node types, relationships, and structural patterns.
We identified two main types of tables: single-entry tables and double-entry tables.
- In single-entry tables, each cell in the column header contains the name of a node. Each cell in the row header contains the name of an attribute. Each intersection contains the value of the attribute specific to each node.
- In double-entry tables, each cell in the row header contains the name of a node of a certain type, and each cell in the column header contains the name of a node of another type. Each of the table's other cells contains a value associated with the relationship uniting the nodes whose intersection it constitutes.
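The two table types can be sketched as follows. This is an illustrative reading of the definitions above, not our production code: the function names, the tuple layout and the sample data are all assumptions.

```python
def single_entry_to_triples(column_header, row_header, cells):
    """Single-entry table: column header cells name nodes,
    row header cells name attributes.

    Returns (node, attribute, value) triples.
    """
    triples = []
    for i, attribute in enumerate(row_header):
        for j, node in enumerate(column_header):
            triples.append((node, attribute, cells[i][j]))
    return triples


def double_entry_to_edges(row_header, column_header, cells):
    """Double-entry table: row and column header cells both name nodes,
    of two different types.

    Returns (row_node, column_node, value) edges, where the value is
    attached to the relationship between the two nodes.
    """
    edges = []
    for i, row_node in enumerate(row_header):
        for j, column_node in enumerate(column_header):
            edges.append((row_node, column_node, cells[i][j]))
    return edges


# Invented data: two products described by two attributes
product_triples = single_entry_to_triples(
    column_header=["Laptop A", "Laptop B"],
    row_header=["Weight (kg)", "Price (USD)"],
    cells=[[1.2, 1.5], [999, 1299]],
)

# Invented data: sales per city (rows) and per quarter (columns)
sales_edges = double_entry_to_edges(
    row_header=["Paris", "Lyon"],
    column_header=["Q1", "Q2"],
    cells=[[10, 12], [7, 9]],
)
```

In the first case the value describes a single node; in the second it only makes sense for a pair of nodes, which is why it is carried by the edge.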
Example of a single-entry table:
Example of a double-entry table:
Moreover, we made a distinction between two types of nodes: concept nodes and dimension nodes.
- Concept nodes represent the core entities or records found in the table.
- Dimension nodes represent variable temporal or spatial elements, such as dates or regions.
This distinction allowed us to better show the hierarchical relationships between concepts and dimensions. For example, in a table of product sales data, the concept nodes would be the individual products, while the dimension nodes could be time periods.
Regarding relationships, we defined a few key ones:
- subsetOf relationship to represent hierarchies of nested elements in the table.
- hasValueRelatedTo relationship to link values with their relevant concept or dimension nodes.
- total relationship to indicate total values.
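As a rough sketch, the node types and relationships above could be modelled as below. The class names, the extra VALUE node type and the direction of the edges are assumptions made for this illustration, and the product-sales data is invented.

```python
from dataclasses import dataclass
from enum import Enum


class NodeKind(Enum):
    CONCEPT = "concept"      # core entities or records found in the table
    DIMENSION = "dimension"  # temporal or spatial elements (dates, regions)
    VALUE = "value"          # assumption: cell values modelled as their own nodes


class Relation(Enum):
    SUBSET_OF = "subsetOf"
    HAS_VALUE_RELATED_TO = "hasValueRelatedTo"
    TOTAL = "total"


@dataclass(frozen=True)
class Node:
    name: str
    kind: NodeKind


@dataclass(frozen=True)
class Edge:
    source: Node
    relation: Relation
    target: Node


# Invented product-sales example
electronics = Node("Electronics", NodeKind.CONCEPT)
laptops = Node("Laptops", NodeKind.CONCEPT)
first_quarter = Node("Q1", NodeKind.DIMENSION)
laptop_sales = Node("120", NodeKind.VALUE)

edges = [
    # "Laptops" is nested under "Electronics" in the table
    Edge(laptops, Relation.SUBSET_OF, electronics),
    # the value is tied both to its concept and to its dimension
    Edge(laptop_sales, Relation.HAS_VALUE_RELATED_TO, laptops),
    Edge(laptop_sales, Relation.HAS_VALUE_RELATED_TO, first_quarter),
]
```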
The final graph representation will respect the following RDF schema.
Example on Amazon financial results
As an example, we focused on a financial report table from Amazon. We converted the first few lines of information in the table into a graph representation:
- Regarding nodes, "Current Assets", "Cash and cash equivalents", "Marketable securities", "Inventories", "Accounts receivable, net and other" are defined as concept nodes, whereas "December 31, 2022" and "September 30, 2023" are dimension nodes.
- Key relationships were also defined in the graph. The subsetOf relationship is used to show that "Cash and cash equivalents", "Marketable securities", etc. are subsets of "Current Assets". Concept nodes are connected to their respective values for each date with hasValueRelatedTo. Finally, the total relationship connects the "Total current assets" node to the sum of all the concept nodes.
- As for values, numerical values are attached to each concept node. For example, "Cash and cash equivalents" has a value of $53,888 million on December 31, 2022.
This step helped us conceptualize two types of graphs: a specific kind with the actual values from the table and then a generic kind to represent the overall structure of the table.
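The two kinds of graphs can be illustrated with plain triples. This is a simplified sketch rather than our actual RDF output: the `value_1` identifier, the `amountUSDMillions` predicate and the direction of the total edge are assumptions, and only the one figure quoted above is included.

```python
CONCEPTS = [
    "Cash and cash equivalents",
    "Marketable securities",
    "Inventories",
    "Accounts receivable, net and other",
]
PARENT = "Current Assets"

# Specific graph: structure plus actual values from the table
specific = [(concept, "subsetOf", PARENT) for concept in CONCEPTS]
# Assumption: the total edge points from the total row to its parent concept
specific.append(("Total current assets", "total", PARENT))
specific += [
    # value_1 is a hypothetical identifier for one table cell
    ("value_1", "hasValueRelatedTo", "Cash and cash equivalents"),
    ("value_1", "hasValueRelatedTo", "December 31, 2022"),
    # hypothetical predicate carrying the number, in millions of USD
    ("value_1", "amountUSDMillions", 53888),
]

# Generic graph: one possible reading, keeping only the structural edges
STRUCTURAL = {"subsetOf", "total"}
generic = [triple for triple in specific if triple[1] in STRUCTURAL]
```

Under this reading, the generic graph describes the layout of the balance sheet and could be reused for another reporting period, while the specific graph carries the figures themselves.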
Conclusion
Through our exploration and formalization process, we have developed an approach for representing tabular data as graphs. By analyzing tables across various domains and formats, we identified common elements and relationships that capture the hierarchical and contextual information present in tables.
Consolidating information from complex tables into graphs offers great value. By integrating the data in the same place as the rest of the document, we provide a unified format and ensure that all the information is connected in context. This offers a clearer breakdown of information than traditional approaches, which makes it easier to analyze and interpret.
Moreover, translating tables into graphs ensures that the tabular data is placed back in context and connected with the overall graph. It allows the aggregation and the comparison of information across different tables.
While our approach has shown potential in representing table data, there are still challenges to address when dealing with large, complex tables and when adding contextual information such as footnote indicators, table titles or captions.
Finally, this approach opens up multiple possibilities for graph querying. The graph representation provides a unified, structured format that can be queried using graph databases or knowledge graph systems, paving the way for new opportunities in information retrieval and data exploration.
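As a simple illustration of such a query, the sketch below looks up the value attached to a given concept and dimension in a small in-memory list of triples. A real deployment would rely on a graph database or a query language such as SPARQL. The `value_1` identifier, the `amount` predicate and the function name are hypothetical.

```python
triples = [
    ("Cash and cash equivalents", "subsetOf", "Current Assets"),
    ("value_1", "hasValueRelatedTo", "Cash and cash equivalents"),
    ("value_1", "hasValueRelatedTo", "December 31, 2022"),
    ("value_1", "amount", 53888),  # in millions of USD
]


def find_values(triples, concept, dimension):
    """Return the amounts of value nodes linked to both the concept and the dimension."""
    # Collect, for each value node, everything it is related to
    linked = {}
    for subject, predicate, obj in triples:
        if predicate == "hasValueRelatedTo":
            linked.setdefault(subject, set()).add(obj)
    # Map each value node to its numeric amount
    amounts = {s: o for s, p, o in triples if p == "amount"}
    return [
        amounts[node]
        for node, targets in linked.items()
        if {concept, dimension} <= targets and node in amounts
    ]


print(find_values(triples, "Cash and cash equivalents", "December 31, 2022"))
# prints [53888]
```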