Introduction
Today, generative AI plays a key role in information technology. Despite technical progress, however, the reliability of generated content remains an issue. That's why Lettria has created a high-performance GraphRAG, a hybrid solution that combines the classic vector approach with a graph representation. To ensure the quality of the results, Lettria has also developed an evaluation tool that measures the performance of its solution precisely. Here's a look at how this tool was conceived and how it works.
Greval
Greval is an internal API developed by Lettria to evaluate the performance of our GraphRAG solution. It includes a flexible metric creation tool, enabling teams to design custom evaluation metrics tailored to Retrieval-Augmented Generation (RAG) systems. In addition to using large language models (LLMs) and a library of predefined prompts, Greval integrates key Langchain components, including Langchain Runnables for managing multiple logic blocks sequentially or concurrently, and Langserve for effortless deployment. This integration enables Greval to generate structured evaluations based on task descriptions and other inputs. These evaluations offer a detailed and comprehensive assessment of GraphRAG's performance. The results can also be exported for further annotation and analysis, using standard performance metrics for comparison.
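To make this concrete, here is a minimal sketch of how an evaluation chain could be composed from Langchain Runnables and exposed with Langserve. The prompt wording, model name and route are illustrative assumptions, not Greval's actual internals.

```python
# Hypothetical sketch of an evaluation chain exposed over HTTP.
# Assumes langchain-openai and langserve are installed; the prompt,
# model and route are illustrative, not Greval's real code.
from fastapi import FastAPI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langserve import add_routes

prompt = ChatPromptTemplate.from_template(
    "You are an evaluator.\n"
    "Task description: {task_description}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Return your evaluation."
)

# Runnables compose with the | operator into a single evaluation chain.
evaluation_chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# Langserve turns the chain into REST endpoints (/evaluate/invoke, /evaluate/batch, ...).
app = FastAPI(title="Evaluation API sketch")
add_routes(app, evaluation_chain, path="/evaluate")
# Run with: uvicorn module_name:app
```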
Types of metrics
Greval provides various types of metrics tailored to different GraphRAG evaluation needs. Value Metrics are quantitative, generating scores on various integer scales, such as 0 to 1 or 1 to 5. These metrics are ideal for assessing performance with numerical precision. When using Value Metrics, however, it is important to describe the different ranges in as much detail as possible in order to avoid ambiguity. Categorical Metrics, on the other hand, offer a more qualitative approach, allowing results to be classified into categories and providing a nuanced, descriptive evaluation that goes beyond numbers.
Both metrics can be extended to evaluate multiple elements simultaneously, whether chunks or triples, using the Many Value Metric and Many Categorical Metric options. This streamlines evaluation, saving time and ensuring consistency in more complex tasks.
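As an illustration, the four metric families could be modelled along the lines below; the class names and fields are assumptions made for the example, not Greval's actual schema.

```python
# Illustrative data model for the four metric families described above.
# Names and fields are assumptions for the sake of the example.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValueMetric:
    """Quantitative metric scored on an integer scale, e.g. 1 to 5."""
    name: str
    description: str
    min_value: int = 1
    max_value: int = 5


@dataclass
class CategoricalMetric:
    """Qualitative metric that assigns one label from a fixed set."""
    name: str
    description: str
    categories: List[str] = field(
        default_factory=lambda: ["Relevant", "Indirectly relevant", "Irrelevant"]
    )


@dataclass
class ManyValueMetric(ValueMetric):
    """Value metric applied to every item in a list (chunks or triples)."""


@dataclass
class ManyCategoricalMetric(CategoricalMetric):
    """Categorical metric applied to every item in a list (chunks or triples)."""
```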
Metrics configuration
Setting up an evaluation metric in Greval involves just a few simple steps. First, the LLM to be used as the evaluator is selected from the available models.
Next, the evaluation steps must be defined. These are the criteria and process the tool will follow to analyze the inputs. The criteria should be written in a 'chain of thought' style, encouraging a step-by-step, reasoned approach to the evaluation. Afterward, a classification system needs to be established, which can range from a numerical scale (e.g., 1 to 5) to categorical labels (e.g., Excellent, Good, Poor). Finally, the metric needs a name and a clear description stating its purpose; this helps the LLM understand the metric's intent and how to apply it effectively. Any additional instructions necessary for the evaluation can also be provided.
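Put together, a metric definition might look roughly like the sketch below. The field names, wording and the metric itself are invented for illustration.

```python
# Hypothetical metric configuration following the steps above: evaluator
# model, chain-of-thought evaluation steps, scoring scale, name, description
# and optional extra instructions. Field names and wording are invented.
clarity_metric = {
    "evaluator_model": "gpt-4o",
    "name": "answer_clarity",
    "description": "Rate how clear and well structured the generated answer is.",
    "evaluation_steps": [
        "Read the question and identify what a clear answer should contain.",
        "Read the answer sentence by sentence, noting ambiguities or jargon.",
        "Check that the answer is organized logically from question to conclusion.",
        "Assign a score using the scale below.",
    ],
    "scale": {
        1: "Incomprehensible or off-topic.",
        2: "Hard to follow; key points are buried or ambiguous.",
        3: "Understandable but loosely organized.",
        4: "Clear, with only minor wording issues.",
        5: "Perfectly clear and well structured.",
    },
    "additional_instructions": "Judge clarity only; factual accuracy is covered by other metrics.",
}
```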
As part of the evaluation, Greval processes several input assets: the question, the expected answer if applicable, the actual answer from GraphRAG, and the context, which includes the retrieved chunks and graph triples used to generate the answer.
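For reference, the assets handed to a single evaluation could be packaged along these lines, with invented sample content:

```python
# Hypothetical example of the assets passed to a single evaluation:
# question, expected answer (optional), generated answer, and the context
# (retrieved chunks and graph triples) used to produce that answer.
evaluation_inputs = {
    "question": "In which year was Acme Corp founded?",
    "expected_answer": "Acme Corp was founded in 1999.",  # omitted when no ground truth exists
    "generated_answer": "Acme Corp was founded in 1999 in Lyon.",
    "context": {
        "chunks": [
            "Acme Corp, founded in 1999, develops industrial software...",
        ],
        "triples": [
            ("Acme Corp", "founded_in", "1999"),
            ("Acme Corp", "headquartered_in", "Lyon"),
        ],
    },
}
```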
Structured outputs
One of Greval's major strengths is its ability to produce results in a structured form via Langchain components. Each evaluation provides not only an assigned value or category, but also a detailed reason, offering transparency and a better understanding of the evaluation process. This approach allows results to be interpreted in a more informed way and decisions to be made accordingly.
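Here is a minimal sketch of how such a structured result can be produced with Langchain's structured output support, assuming an OpenAI chat model; the schema and prompt are illustrative, not Greval's own.

```python
# Sketch of structured evaluation output: a category plus a detailed reason.
# Assumes langchain-openai is installed; schema and prompt are illustrative.
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class EvaluationResult(BaseModel):
    """Schema the LLM must fill in for every evaluation."""
    category: str = Field(description="One of the classes defined by the metric")
    reason: str = Field(description="Step-by-step justification for the chosen class")


prompt = ChatPromptTemplate.from_template(
    "Evaluate the answer below against the question.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Classify it as Correct, Acceptable, Not Acceptable or Incorrect."
)

# with_structured_output forces the model to return an EvaluationResult instance.
llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(EvaluationResult)
chain = prompt | llm

result = chain.invoke({"question": "In which year was Acme Corp founded?",
                       "answer": "Acme Corp was founded in 1999."})
print(result.category, "-", result.reason)
```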
Automatic metrics generator
Greval simplifies the creation of new metrics. With its integrated automatic generator, providing a name and a description for a metric is enough for Greval to generate the evaluation steps it needs. This feature is especially useful for quickly creating metrics while ensuring that specific evaluation needs are met consistently.
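One way to sketch such a generator is to ask an LLM to draft the evaluation steps from the metric's name and description, as below; the prompt, schema and example metric are assumptions rather than Greval's actual implementation.

```python
# Hypothetical sketch of an automatic metric generator: given a name and a
# description, an LLM drafts the chain-of-thought evaluation steps.
from typing import List
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class GeneratedMetric(BaseModel):
    evaluation_steps: List[str] = Field(
        description="Ordered reasoning steps the evaluator should follow"
    )


prompt = ChatPromptTemplate.from_template(
    "You design evaluation metrics for RAG systems.\n"
    "Metric name: {name}\nMetric description: {description}\n"
    "Write the ordered evaluation steps an LLM evaluator should follow."
)

generator = prompt | ChatOpenAI(model="gpt-4o-mini").with_structured_output(GeneratedMetric)

metric = generator.invoke({
    "name": "context_faithfulness",
    "description": "Check that every claim in the answer is supported by the retrieved context.",
})
for step in metric.evaluation_steps:
    print("-", step)
```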
Dataset Annotation with Argilla
With Greval's API, results can be easily exported to Argilla, a data annotation platform, whose interface can then be used to annotate the dataset with the same categorical or numerical labels defined in the metrics. Once annotated, traditional performance metrics such as precision, recall, accuracy, and F1 score can be generated.
For categorical metrics, multiple insights such as a full classification report can be obtained. On the numerical side, Greval offers detailed statistics including the mean, median, standard deviation, as well as the minimum, maximum, and total count of values.
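Once human labels sit alongside Greval's predictions, these figures can be reproduced with standard tooling. The sketch below uses scikit-learn and Python's statistics module on made-up labels and scores.

```python
# Sketch of the post-annotation analysis on made-up data:
# a classification report for categorical metrics and descriptive
# statistics for numerical ones. Requires scikit-learn.
import statistics
from sklearn.metrics import classification_report

# Categorical metric: human annotations vs. Greval's predictions.
human_labels = ["Relevant", "Irrelevant", "Relevant", "Indirectly relevant"]
greval_labels = ["Relevant", "Relevant", "Relevant", "Indirectly relevant"]
print(classification_report(human_labels, greval_labels, zero_division=0))

# Numerical metric: descriptive statistics over the assigned scores.
scores = [4, 5, 3, 4, 2, 5]
print({
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),
    "min": min(scores),
    "max": max(scores),
    "count": len(scores),
})
```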
These tools help enhance the dataset's quality by providing a thorough statistical analysis, while Argilla's interface simplifies the annotation process.
Predefined metrics
Several metrics are already available in Greval, enabling GraphRAG results to be evaluated according to different approaches.
Answer alignment
This first metric compares the responses given by the model with ground truth responses. The evaluator assesses each response by assigning it a class – Correct, Acceptable, Not Acceptable or Incorrect – according to predefined criteria.
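Expressed in the configuration style sketched earlier, this metric could look roughly as follows; the class descriptions are paraphrased assumptions, not Greval's exact wording.

```python
# Illustrative definition of the answer alignment metric; the wording of the
# class descriptions is an assumption, not Greval's exact prompt.
answer_alignment = {
    "name": "answer_alignment",
    "description": "Compare the generated answer with the ground-truth answer.",
    "classification": {
        "Correct": "Matches the ground truth on every requested point.",
        "Acceptable": "Minor omissions or wording differences, no factual conflict.",
        "Not Acceptable": "Misses or distorts important information.",
        "Incorrect": "Contradicts the ground truth or answers a different question.",
    },
}
```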
Answer quality
Ground truth responses are not always available. This metric therefore evaluates the relevance of the response to the request. We can't know whether the information in the answer is correct, but we can determine whether it directly addresses the request. In addition, this metric analyzes the clarity and internal consistency of the response: is it fully comprehensible, and does it contain any contradictory elements? The classes are the same as for the previous metric.
Chunk evaluation
For this metric, the model is tasked with critically and objectively evaluating how well the content of each retrieved chunk fulfills the query's information needs. The model first analyzes the meaning and intent of the question in detail, then compares the chunks against several criteria. These include semantic alignment in general and topic alignment in particular, as well as relevance, quality and usefulness.
Finally, the evaluator must assign each chunk one of the following categories: Relevant, Indirectly relevant or Irrelevant. Each criterion and category is described in detail to guide the tool in its analysis.
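A per-chunk result of this kind might be captured with a schema along these lines; the structure and the example content are hypothetical.

```python
# Hypothetical schema for a "many categorical" chunk evaluation: one label
# and one reason per retrieved chunk. Not Greval's actual schema.
from typing import List, Literal
from pydantic import BaseModel, Field


class ChunkVerdict(BaseModel):
    chunk_index: int
    category: Literal["Relevant", "Indirectly relevant", "Irrelevant"]
    reason: str = Field(description="Why the chunk does or does not serve the query")


class ChunkEvaluation(BaseModel):
    verdicts: List[ChunkVerdict]


# Example instance with invented content:
example = ChunkEvaluation(verdicts=[
    ChunkVerdict(chunk_index=0, category="Relevant",
                 reason="Directly states the founding year asked in the query."),
    ChunkVerdict(chunk_index=1, category="Indirectly relevant",
                 reason="Gives background on the company but not the requested fact."),
])
```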
Triple evaluation
This metric is similar to the previous one, but this time applied to the retrieved triples, which together with the chunks form the context used by GraphRAG to generate the response. However, it's very specific to Lettria because of our distinctive graph-based approach compared with other RAG solutions. The guidelines are broadly the same as for chunks, taking into account the particular format of triples, composed of a subject, a predicate and an object. The classification system is identical: each triple is classified as Relevant, Indirectly relevant or Irrelevant.
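The triple format and the per-triple verdict can be sketched as follows; the names are illustrative.

```python
# Minimal sketch of the triple format described above and of a per-triple
# verdict; names and content are illustrative.
from dataclasses import dataclass
from typing import Literal


@dataclass
class Triple:
    subject: str     # always an entity
    predicate: str   # a relation (object is an entity) or an attribute (object is a value)
    object: str


@dataclass
class TripleVerdict:
    triple: Triple
    category: Literal["Relevant", "Indirectly relevant", "Irrelevant"]
    reason: str


verdict = TripleVerdict(
    triple=Triple("Acme Corp", "founded_in", "1999"),
    category="Relevant",
    reason="Provides the exact fact requested by the query.",
)
```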
Lettria has previously carried out work on the evaluation of triples, and you can read more about it here.
Answer completeness
This last predefined metric evaluates the completeness of the response. The evaluation is made by comparison with the request and with the chunks and triples deemed relevant by the two previous evaluators (chunk evaluation and triple evaluation). This is a very comprehensive metric, as the information in the response is compared both to the information requested and to the information available in the sources.
This metric asks the model to list the entities and properties (relations or attributes) cited in the request and in the retrieved chunks and triples. For triples, we assume that the subject is an entity and that the predicate is either a relation, in which case the object is another entity, or an attribute, in which case the object is the value of the attribute.
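This convention boils down to a simple rule, sketched below with an invented helper; in practice the relation-versus-attribute decision is made by the evaluator LLM, so here it is passed in explicitly.

```python
# Sketch of the convention described above: subjects are entities, predicates
# are either relations (object = entity) or attributes (object = value).
# Which predicates are relations is decided by the evaluator LLM; here the
# decision is supplied as an argument for illustration.
from typing import Dict, List, Set, Tuple


def collect_elements(
    triples: List[Tuple[str, str, str]],
    relation_predicates: Set[str],
) -> Dict[str, Set[str]]:
    entities: Set[str] = set()
    properties: Set[str] = set()
    for subject, predicate, obj in triples:
        entities.add(subject)          # the subject is always an entity
        properties.add(predicate)      # the predicate is a relation or an attribute
        if predicate in relation_predicates:
            entities.add(obj)          # relation: the object is another entity
        # otherwise the object is an attribute value, not an entity
    return {"entities": entities, "properties": properties}


triples = [("Acme Corp", "headquartered_in", "Lyon"),
           ("Acme Corp", "founded_in", "1999")]
print(collect_elements(triples, relation_predicates={"headquartered_in"}))
# entities: Acme Corp, Lyon; properties: headquartered_in, founded_in (set order may vary)
```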
For each element of the query, the model must also analyze whether it is identified and real, implicit or necessary to reach the response, or unknown. For elements of chunks and triples, the model must distinguish those that directly provide the information requested in the query from those that help to understand the query or indirectly reach the answer.
The evaluator is then responsible for comparing all these elements with those of the response in order to classify it into one of the following categories: Relevant, Somewhat relevant, Hardly relevant or Irrelevant. Once again, each category is accompanied by a precise description.
In order to improve the results, the model is also asked to provide recommendations for enriching or improving the response.
Conclusion
Greval is a powerful and flexible evaluation tool that helps ensure the reliability of Lettria's GraphRAG. With its structured and customizable metrics, Greval provides precise performance analysis and supports continuous improvement of our solutions. It is mainly used for pre-deployment evaluation by Lettria experts to guarantee the quality of our deliverables. By automating assessments, it speeds up development: we no longer have to evaluate manually every time we test a new method, although for production models we can still rely on manual evaluation to avoid biases.