In the rapidly expanding field of automatic information processing, companies face a major challenge: extracting relevant information from large volumes of unstructured data. Whether the source is financial reports, scientific documents, product sheets or any other type of text, exploiting this data effectively is crucial to making informed decisions. Lettria has developed a cutting-edge tool to meet this challenge: an information extraction system featuring two innovative techniques, VectorRAG and GraphRAG. Offering two independent approaches not only optimizes information extraction, but also gives our customers a direct comparison between two powerful methods.
1. Two extraction methods in one tool
Our tool is distinguished by running two independent approaches in parallel: VectorRAG and GraphRAG. Although both methods serve the same purpose, they operate differently.
1.1 What is VectorRAG
VectorRAG (Retrieval-Augmented Generation via vectors) is a method based on contextualized vector representations of text chunks. Each fragment of text is transformed into an embedding, i.e. a numerical vector that captures the meaning of the text. The model uses these vectors to retrieve and synthesize the most relevant information from large volumes of data. Our VectorRAG is based on Verba by Weaviate, a vector database. It is optimized for speed and efficiency when searching for simple facts or extracting direct information.
1.2 What is GraphRAG
GraphRAG (Retrieval-Augmented Generation via graphs) is an approach that differs in its use of knowledge graphs to establish relationships between concepts. Here, instead of simply converting data into vectors, GraphRAG builds a graph representation of the information by linking entities (distinct elements of the world, concrete or abstract) through their relations. Each node in the graph represents an entity, and the edges indicate the relations between these entities. This enables the model to better understand and link complex information, offering a superior ability to address questions that require a deeper understanding of the context and relationships in the data.
GraphRAG is configured to deal with complex issues that require linking multiple pieces of information scattered throughout the data. It excels in contexts where relationships between entities and the overall context play a crucial role. For example, in the case of a legal analysis, where the relationships between several articles of law need to be understood, or in the context of a scientific investigation, where it is essential to link different studies and concepts.
Several parameters can be tuned to obtain the best results. Among them are the maximum number of relations retrieved (the top 10, for instance) and the number of hops, i.e. the number of steps the system may take from a retrieved node to explore its neighbors in the graph. Increasing the number of hops can enrich the context with more diverse and indirectly related information. Indeed, for complex questions, it is often necessary to obtain intermediate information before finding the piece that answers the initial question. For example, if you're requesting results for a company's aeronautical subsidiary, you'll first need to find the subsidiary's name before searching for its results.
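To make the effect of these two parameters concrete, here is a minimal sketch of hop-limited graph expansion in plain Python. The company names, the `expand` function and its parameter names are hypothetical and chosen only for illustration; they are not Lettria's implementation.

```python
from collections import deque

# Hypothetical toy graph as (subject, relation, object) triples.
TRIPLES = [
    ("Acme Group", "hasSubsidiary", "Acme Aero"),
    ("Acme Aero", "operatesIn", "Aeronautics"),
    ("Acme Aero", "hasRevenue", "120M EUR"),
    ("Acme Group", "hasSubsidiary", "Acme Rail"),
    ("Acme Rail", "hasRevenue", "80M EUR"),
]

def expand(start, triples, max_hops, max_relations):
    """Collect up to `max_relations` triples within `max_hops` steps of `start`."""
    if max_relations <= 0:
        return []
    # Index each triple under both endpoints so edges can be followed either way.
    adjacency = {}
    for subject, relation, obj in triples:
        triple = (subject, relation, obj)
        adjacency.setdefault(subject, []).append((triple, obj))
        adjacency.setdefault(obj, []).append((triple, subject))

    collected = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget spent: do not explore this node's edges
        for triple, neighbor in adjacency.get(node, []):
            if triple not in collected:
                collected.append(triple)
                if len(collected) == max_relations:
                    return collected
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return collected

one_hop = expand("Acme Group", TRIPLES, max_hops=1, max_relations=10)
two_hops = expand("Acme Group", TRIPLES, max_hops=2, max_relations=10)
```

In this toy graph, one hop from the parent company only reaches the subsidiaries' names, while two hops also reach the aeronautical subsidiary's revenue, which mirrors the intermediate-information example above.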
2. Powering information extraction with an optimized pipeline
At Lettria, we've ensured that our document processing pipelines are optimized so that both methods deliver the best results.
2.1 VectorRAG’s pipeline
In the context of VectorRAG, the pipeline is essential for processing large volumes of unstructured data and extracting relevant information. Different stages are involved in transforming raw documents into a vector representation. Here is a step-by-step breakdown of this process.
2.1.1 VectorRAG’s ingestion pipeline
- Document sink: The pipeline begins by loading files from a storage system using S3 Loader. This component retrieves the raw data, which can consist of various types of unstructured text documents.
- Document parsing: Once the files are loaded, they undergo document parsing. This step involves extracting and formatting the content from the files to generate a structured source document that can be further processed.
- Source and split: After parsing, the source document is obtained. Depending on the nature of the data, the source can either proceed as a single unit or be split into smaller sections to optimize processing time.
- Sectioning and text splitting: The document is divided into sections, and each section is processed by the text splitter, which further breaks down the text into manageable vector chunks (V-chunks). This step is crucial as it prepares the data for the vector-building process.
- Batch duplicate filter: This filter removes duplicate chunks from the batch, ensuring only unique data is processed, improving efficiency and reducing redundancy.
- Document embedder: Using a pre-trained embedder model, the text chunks are transformed into vector-embedded chunks, which capture the semantic meaning of the text.
- Indexing: Finally, the embedded chunks are fed into Qdrant, a vector search engine that stores the embeddings and enables fast, similarity-based retrieval during queries.
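The ingestion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hashed bag-of-words `embed` function stands in for a pre-trained embedder, a plain list stands in for Qdrant, and all function names and the `DIM` constant are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # toy embedding size (an assumption for this sketch)

def split_text(text, max_words=12):
    """Split a section into chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def deduplicate(chunks):
    """Keep the first occurrence of each chunk, comparing normalized text."""
    seen, unique = set(), []
    for chunk in chunks:
        key = " ".join(chunk.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

def embed(text):
    """Toy hashed bag-of-words vector, L2-normalized (stand-in for a real model)."""
    vector = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % DIM
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

def ingest(sections):
    """Split, deduplicate, embed and index the given sections."""
    chunks = []
    for section in sections:
        chunks.extend(split_text(section))
    # A plain list plays the role of the vector index here.
    return [{"text": chunk, "vector": embed(chunk)} for chunk in deduplicate(chunks)]

index = ingest([
    "Revenue grew in 2023.",
    "Revenue grew in 2023.",  # duplicate section, filtered out before embedding
    "The subsidiary builds engines.",
])
```

Filtering duplicates before embedding, as in the pipeline description, avoids spending embedding calls on chunks that would add nothing to the index.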
2.1.2 VectorRAG’s inference pipeline
During the inference phase of Lettria’s VectorRAG solution, the process starts by embedding the query into a vector that captures its semantic meaning. This vector is then used to query the vector database, retrieving the most relevant chunks of information aligned with the query. These semantically similar chunks form the foundation for the system’s response. The selected data is then fed into the large language model (LLM), which leverages it to generate a comprehensive, contextually relevant response to the query.
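As a rough illustration of this retrieve-then-generate flow, the sketch below ranks pre-embedded chunks by cosine similarity and assembles the prompt that would be sent to the LLM. The vectors, the sample chunks and the function names are made up for the example, and the LLM call itself is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vector, index, top_k=2):
    """Return the `top_k` chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(question, chunks):
    """Assemble the retrieved chunks and the question into a single prompt."""
    context = "\n".join(f"- {chunk['text']}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical index with hand-written 3-dimensional vectors.
INDEX = [
    {"text": "Acme Aero reported revenue of 120M EUR.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Acme Rail opened a new depot.", "vector": [0.1, 0.9, 0.0]},
    {"text": "The group was founded in 1950.", "vector": [0.0, 0.2, 0.9]},
]

query_vector = [1.0, 0.0, 0.1]  # assumed output of the query embedder
hits = retrieve(query_vector, INDEX, top_k=2)
prompt = build_prompt("What was Acme Aero's revenue?", hits)
```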
2.2 GraphRAG’s pipeline
2.2.1 GraphRAG’s ingestion pipeline
For GraphRAG, the first stages of the pipeline are identical to those of VectorRAG, up to and including text splitting. Then come graph extraction and graph ingestion.
2.2.2 GraphRAG’s graph extraction
Graph extraction is the key step of GraphRAG’s pipeline. For each chunk (G-chunk), the process involves applying graph-based extraction techniques. Here are the steps.
- Duplicate filter: Before creating the graph, a duplicate filter ensures that redundant information is filtered out, improving the efficiency and accuracy of the final graph.
- Graph extraction: The graph extractor then extracts entities and relations from the chunks, creating n subgraphs (one per chunk).
- Saving: Finally, the subgraphs are saved in a Neo4j database.
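The three steps above can be sketched as follows. The `toy_extractor` below is a deliberately naive, hypothetical stand-in that only recognizes one sentence pattern, and a dictionary of sets replaces the Neo4j database; the sketch shows the shape of the process, not the actual extractor.

```python
def toy_extractor(chunk):
    """Return (subject, relation, object) triples for one very specific pattern."""
    person, separator, company = chunk.rstrip(".").partition(" is the CEO of ")
    if not separator:
        return []
    return [(company, "chiefExecutiveOfficer", person)]

def extract_subgraphs(chunks, extractor):
    """Filter duplicate chunks, then build one subgraph per remaining chunk."""
    seen = set()
    subgraphs = []
    for chunk in chunks:
        key = " ".join(chunk.lower().split())
        if key in seen:
            continue  # duplicate filter
        seen.add(key)
        triples = extractor(chunk)
        nodes = {t[0] for t in triples} | {t[2] for t in triples}
        subgraphs.append({"chunk": chunk, "nodes": nodes, "edges": triples})
    return subgraphs

def save(subgraphs, store):
    """Merge all subgraphs into one store (stand-in for the graph database)."""
    for subgraph in subgraphs:
        store["nodes"].update(subgraph["nodes"])
        store["edges"].update(subgraph["edges"])
    return store

chunks = [
    "Andy Jassy is the CEO of Amazon.",
    "Andy Jassy is the CEO of Amazon.",  # duplicate, skipped
    "Amazon sells books.",               # no pattern match, empty subgraph
]
subgraphs = extract_subgraphs(chunks, toy_extractor)
store = save(subgraphs, {"nodes": set(), "edges": set()})
```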
2.2.3 GraphRAG’s graph ingestion
Lettria’s GraphRAG solution uses a hybrid ingestion process that stores the graph obtained from the graph extraction in two systems: a vector database and a graph database. This dual approach improves the system’s ability to handle both semantic meaning and structural relationships.
- Vector embeddings: Nodes (e.g., Amazon, Andy Jassy), relations (e.g., "chiefExecutiveOfficer," "hasRevenue") and document chunks are encoded as high-dimensional vectors, optimized for semantic-based retrieval in a vector database. This enables fast and efficient similarity searches for both entities and their relationships.
- Graph structure: In parallel, the same nodes and relationships are stored in a graph database (e.g., Neo4j, Neptune), which excels at representing and querying complex interconnections. This allows for efficient traversal and relationship analysis, revealing how entities are linked.
By leveraging both databases, the system combines the strengths of vector-based semantic searches with graph-based structural queries, increasing its ability to understand both meaning and relationships.
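A minimal sketch of this dual write is shown below, with in-memory dictionaries standing in for the vector database and the graph database. The `HybridStore` class, its methods and the letter-frequency `toy_embed` function are assumptions made for illustration only.

```python
from collections import defaultdict

def toy_embed(text):
    """Letter-frequency vector; a placeholder for a real embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

class HybridStore:
    """In-memory stand-in for a vector database plus a graph database."""

    def __init__(self, embed):
        self.embed = embed
        self.vectors = {}                   # (kind, text or id) -> vector
        self.adjacency = defaultdict(list)  # node -> [(relation, neighbor)]

    def add_triple(self, subject, relation, obj):
        # Vector side: encode both nodes and the relation label.
        for kind, text in (("node", subject), ("relation", relation), ("node", obj)):
            self.vectors[(kind, text)] = self.embed(text)
        # Graph side: keep the structure so it can be traversed later.
        self.adjacency[subject].append((relation, obj))

    def add_chunk(self, chunk_id, text):
        # Document chunks are only stored on the vector side.
        self.vectors[("chunk", chunk_id)] = self.embed(text)

    def neighbors(self, node):
        return list(self.adjacency.get(node, []))

store = HybridStore(toy_embed)
store.add_triple("Amazon", "chiefExecutiveOfficer", "Andy Jassy")
store.add_chunk("c1", "Andy Jassy is the CEO of Amazon.")
```

Writing each triple to both sides in a single method keeps the two stores consistent, so any node found through a similarity search can also be traversed in the graph.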
2.2.4 GraphRAG’s inference pipeline
During the inference phase of Lettria’s GraphRAG solution, the process starts by embedding the query into a vector that captures its semantic meaning. This vector is then used to query the vector database, retrieving the most relevant chunks and a list of relations closely aligned with the query. These semantically similar relations form the foundation for the system's response.
The system then takes this information and expands it into a graph within the graph database, creating a richer network of related nodes and connections. To ensure relevance, this expanded graph is filtered by cross-referencing it with the original vector data, keeping only the most meaningful and important relationships. From the refined graph, the top K most relevant nodes are selected, and these are combined with the key relations identified earlier from the vector database.
This approach allows Lettria's GraphRAG to seamlessly merge the power of vector-based semantic understanding with the structural depth of the graph database, delivering highly relevant and precise results.
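The inference steps described in this section (vector search over relations, graph expansion, filtering, top K node selection) can be sketched as follows. Everything here is a simplified assumption: the bag-of-words `embed` function replaces a real embedding model, the expansion is limited to one hop, the filter simply keeps the best-scoring expanded relations, and the triples and parameter names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; camelCase labels are split into words first."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return Counter(re.findall(r"[a-z0-9]+", spaced.lower()))

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Acme Group", "hasSubsidiary", "Acme Aero"),
    ("Acme Aero", "hasRevenue", "120M EUR"),
    ("Acme Aero", "chiefExecutiveOfficer", "Jane Doe"),
    ("Acme Group", "hasHeadquarters", "Paris"),
]

def graph_context(question, triples, top_relations=2, max_expanded=3, top_k_nodes=3):
    query = embed(question)
    scores = {t: cosine(query, embed(" ".join(t))) for t in triples}

    # 1. Vector search: the relations most similar to the query.
    seeds = sorted(triples, key=scores.get, reverse=True)[:top_relations]

    # 2. Graph expansion: every relation touching a seed node (one hop).
    seed_nodes = {node for s, _, o in seeds for node in (s, o)}
    expanded = [t for t in triples if t[0] in seed_nodes or t[2] in seed_nodes]

    # 3. Filtering: keep only the expanded relations closest to the query.
    kept = sorted(expanded, key=scores.get, reverse=True)[:max_expanded]

    # 4. Top K nodes, each scored by its best relation.
    node_scores = {}
    for t in kept:
        for node in (t[0], t[2]):
            node_scores[node] = max(node_scores.get(node, 0.0), scores[t])
    top_nodes = sorted(node_scores, key=node_scores.get, reverse=True)[:top_k_nodes]

    return {"relations": seeds, "nodes": top_nodes}

result = graph_context("What is the revenue of the Acme aeronautical subsidiary?", TRIPLES)
```

The returned relations and nodes would then be passed to the LLM as context, in the same way the retrieved chunks are in the VectorRAG pipeline.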
3. Comparing two approaches for enhanced results
The simultaneous use of VectorRAG and GraphRAG allows us to compare results and benefit from the specific strengths of each approach. This comparison is useful for several reasons. By comparing the two, you can identify whether one approach is better suited to your specific needs.
Dual analysis also helps to validate results. If both methods converge on the same information, confidence in the relevance and accuracy of the results increases. On the other hand, if the answers differ, this prompts further exploration of the data to understand the discrepancy, opening the door to more informed decision-making. Thanks to the ability to see the context, i.e. the textual elements behind the answer, it's easy to determine which is the right answer in the event of a discrepancy.
The performance of VectorRAG and GraphRAG can be influenced by various factors, including the length of the provided context and the choice of LLM for ingestion or inference. Shorter contexts may be more suitable for tasks requiring quick responses or focused information retrieval, while longer contexts can be beneficial for tasks that require understanding broader context or complex relationships. Similarly, LLMs with larger parameter counts that were trained on more diverse datasets may exhibit superior performance in certain tasks.
By carefully considering these factors and comparing the performance of VectorRAG and GraphRAG, we can gain valuable insights into their respective strengths and weaknesses, enabling us to optimize our approach to information retrieval and decision-making.