Introduction
Today, generative AI plays a key role in information technology. Despite technical progress, however, the reliability of generated content remains an issue. That's why Lettria has created a high-performance GraphRAG, a hybrid solution that combines the classic vector approach with a graph representation. To ensure the quality of the results, Lettria has also developed an evaluation tool that measures the performance of its solution precisely. Here's a look at how this tool was conceived and how it works.
Greval
Greval is an internal API developed by Lettria to evaluate the performance of our GraphRAG solution. It includes a flexible metric creation tool, enabling teams to design custom evaluation metrics tailored to Retrieval-Augmented Generation (RAG) systems. In addition to using large language models (LLMs) and a library of predefined prompts, Greval integrates key Langchain components, including Langchain Runnables for managing multiple logic blocks sequentially or concurrently, and Langserve for effortless deployment. This integration enables Greval to generate structured evaluations based on task descriptions and other inputs. These evaluations offer a detailed and comprehensive assessment of GraphRAG's performance. The results can also be exported for further annotation and analysis, using standard performance metrics for comparison.
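To make this concrete, here is a minimal sketch of how an evaluation chain could be composed from Langchain Runnables and exposed with Langserve. The prompt wording, model name and route are illustrative assumptions, not Greval's actual internals.

```python
# Hypothetical sketch of an evaluation chain exposed over HTTP.
# Assumes langchain-openai and langserve are installed; the prompt,
# model and route are illustrative, not Greval's real code.
from fastapi import FastAPI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langserve import add_routes

prompt = ChatPromptTemplate.from_template(
    "You are an evaluator.\n"
    "Task description: {task_description}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Return your evaluation."
)

# Runnables compose with the | operator into a single evaluation chain.
evaluation_chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# Langserve turns the chain into REST endpoints (/evaluate/invoke, /evaluate/batch, ...).
app = FastAPI(title="Evaluation API sketch")
add_routes(app, evaluation_chain, path="/evaluate")
# Run with: uvicorn module_name:app
```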
Types of metrics
Greval provides various types of metrics tailored to different GraphRAG evaluation needs. Value Metrics are quantitative, generating scores on various integer scales, such as 0 to 1 or 1 to 5. These metrics are ideal for assessing performance with numerical precision. When using Value Metrics, however, it is important to describe the different ranges in as much detail as possible in order to avoid ambiguity. Categorical Metrics, on the other hand, offer a more qualitative approach, allowing results to be classified into categories and providing a nuanced, descriptive evaluation that goes beyond numbers.
Both metrics can be extended to evaluate multiple elements simultaneously, whether chunks or triples, using the Many Value Metric and Many Categorical Metric options. This streamlines evaluation, saving time and ensuring consistency in more complex tasks.
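As an illustration, the four metric families could be modelled along the lines below; the class names and fields are assumptions made for the example, not Greval's actual schema.

```python
# Illustrative data model for the four metric families described above.
# Names and fields are assumptions for the sake of the example.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValueMetric:
    """Quantitative metric scored on an integer scale, e.g. 1 to 5."""
    name: str
    description: str
    min_value: int = 1
    max_value: int = 5


@dataclass
class CategoricalMetric:
    """Qualitative metric that assigns one label from a fixed set."""
    name: str
    description: str
    categories: List[str] = field(
        default_factory=lambda: ["Relevant", "Indirectly relevant", "Irrelevant"]
    )


@dataclass
class ManyValueMetric(ValueMetric):
    """Value metric applied to every item in a list (chunks or triples)."""


@dataclass
class ManyCategoricalMetric(CategoricalMetric):
    """Categorical metric applied to every item in a list (chunks or triples)."""
```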
Metrics configuration
Setting up an evaluation metric in Greval involves just a few simple steps. First, the LLM to be used as the evaluator is selected from the available models.
Next, the evaluation steps must be defined. These are the criteria and process the tool will follow to analyze the inputs. The criteria should be written in a 'chain of thought' style, encouraging a step-by-step, reasoned approach to the evaluation. Afterward, a classification system needs to be established, which can range from a numerical scale (e.g., 1 to 5) to categorical labels (e.g., Excellent, Good, Poor). Finally, the metric needs a name and a clear description stating its purpose; this helps the LLM understand the metric's intent and how to apply it effectively. Any additional instructions necessary for the evaluation can also be provided.
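Put together, a metric definition might look roughly like the sketch below. The field names, wording and the metric itself are invented for illustration.

```python
# Hypothetical metric configuration following the steps above: evaluator
# model, chain-of-thought evaluation steps, scoring scale, name, description
# and optional extra instructions. Field names and wording are invented.
clarity_metric = {
    "evaluator_model": "gpt-4o",
    "name": "answer_clarity",
    "description": "Rate how clear and well structured the generated answer is.",
    "evaluation_steps": [
        "Read the question and identify what a clear answer should contain.",
        "Read the answer sentence by sentence, noting ambiguities or jargon.",
        "Check that the answer is organized logically from question to conclusion.",
        "Assign a score using the scale below.",
    ],
    "scale": {
        1: "Incomprehensible or off-topic.",
        2: "Hard to follow; key points are buried or ambiguous.",
        3: "Understandable but loosely organized.",
        4: "Clear, with only minor wording issues.",
        5: "Perfectly clear and well structured.",
    },
    "additional_instructions": "Judge clarity only; factual accuracy is covered by other metrics.",
}
```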
As part of the evaluation, Greval processes several input assets: the question, the expected answer if applicable, the actual answer from GraphRAG, and the context, which includes the retrieved chunks and graph triples used to generate the answer.
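For reference, the assets handed to a single evaluation could be packaged along these lines, with invented sample content:

```python
# Hypothetical example of the assets passed to a single evaluation:
# question, expected answer (optional), generated answer, and the context
# (retrieved chunks and graph triples) used to produce that answer.
evaluation_inputs = {
    "question": "In which year was Acme Corp founded?",
    "expected_answer": "Acme Corp was founded in 1999.",  # omitted when no ground truth exists
    "generated_answer": "Acme Corp was founded in 1999 in Lyon.",
    "context": {
        "chunks": [
            "Acme Corp, founded in 1999, develops industrial software...",
        ],
        "triples": [
            ("Acme Corp", "founded_in", "1999"),
            ("Acme Corp", "headquartered_in", "Lyon"),
        ],
    },
}
```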
Structured outputs
One of Greval's major strengths is its ability to produce results in a structured form via Langchain components. Each evaluation provides not only an assigned value or category, but also a detailed reason, offering transparency and a better understanding of the evaluation process. This approach allows results to be interpreted in a more informed way and decisions to be made accordingly.
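Here is a minimal sketch of how such a structured result can be produced with Langchain's structured output support, assuming an OpenAI chat model; the schema and prompt are illustrative, not Greval's own.

```python
# Sketch of structured evaluation output: a category plus a detailed reason.
# Assumes langchain-openai is installed; schema and prompt are illustrative.
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class EvaluationResult(BaseModel):
    """Schema the LLM must fill in for every evaluation."""
    category: str = Field(description="One of the classes defined by the metric")
    reason: str = Field(description="Step-by-step justification for the chosen class")


prompt = ChatPromptTemplate.from_template(
    "Evaluate the answer below against the question.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Classify it as Correct, Acceptable, Not Acceptable or Incorrect."
)

# with_structured_output forces the model to return an EvaluationResult instance.
llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(EvaluationResult)
chain = prompt | llm

result = chain.invoke({"question": "In which year was Acme Corp founded?",
                       "answer": "Acme Corp was founded in 1999."})
print(result.category, "-", result.reason)
```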
Automatic metrics generator
Greval simplifies the creation of new metrics. With its integrated automatic generator, providing a name and a description for a metric is enough for Greval to generate the evaluation steps it needs. This feature is especially useful for quickly creating metrics while ensuring that specific evaluation needs are met consistently.
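One way to sketch such a generator is to ask an LLM to draft the evaluation steps from the metric's name and description, as below; the prompt, schema and example metric are assumptions rather than Greval's actual implementation.

```python
# Hypothetical sketch of an automatic metric generator: given a name and a
# description, an LLM drafts the chain-of-thought evaluation steps.
from typing import List
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class GeneratedMetric(BaseModel):
    evaluation_steps: List[str] = Field(
        description="Ordered reasoning steps the evaluator should follow"
    )


prompt = ChatPromptTemplate.from_template(
    "You design evaluation metrics for RAG systems.\n"
    "Metric name: {name}\nMetric description: {description}\n"
    "Write the ordered evaluation steps an LLM evaluator should follow."
)

generator = prompt | ChatOpenAI(model="gpt-4o-mini").with_structured_output(GeneratedMetric)

metric = generator.invoke({
    "name": "context_faithfulness",
    "description": "Check that every claim in the answer is supported by the retrieved context.",
})
for step in metric.evaluation_steps:
    print("-", step)
```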
Dataset Annotation with Argilla
With Greval's API, results can be easily exported to Argilla, a data annotation platform, whose interface can then be used to annotate the dataset with the same categorical or numerical labels defined in the metrics. Once annotated, traditional performance metrics such as precision, recall, accuracy, and F1 score can be generated.
For categorical metrics, multiple insights such as a full classification report can be obtained. On the numerical side, Greval offers detailed statistics including the mean, median, standard deviation, as well as the minimum, maximum, and total count of values.
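Once human labels sit alongside Greval's predictions, these figures can be reproduced with standard tooling. The sketch below uses scikit-learn and Python's statistics module on made-up labels and scores.

```python
# Sketch of the post-annotation analysis on made-up data:
# a classification report for categorical metrics and descriptive
# statistics for numerical ones. Requires scikit-learn.
import statistics
from sklearn.metrics import classification_report

# Categorical metric: human annotations vs. Greval's predictions.
human_labels = ["Relevant", "Irrelevant", "Relevant", "Indirectly relevant"]
greval_labels = ["Relevant", "Relevant", "Relevant", "Indirectly relevant"]
print(classification_report(human_labels, greval_labels, zero_division=0))

# Numerical metric: descriptive statistics over the assigned scores.
scores = [4, 5, 3, 4, 2, 5]
print({
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),
    "min": min(scores),
    "max": max(scores),
    "count": len(scores),
})
```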
These tools help enhance the dataset's quality by providing a thorough statistical analysis, while Argilla's interface simplifies the annotation process.
Predefined metrics
Several metrics are already available in Greval, enabling GraphRAG results to be evaluated according to different approaches.
Answer alignment
This first metric compares the responses given by the model with ground truth responses. The evaluator assesses each response by assigning it a class – Correct, Acceptable, Not Acceptable or Incorrect – according to predefined criteria.
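Expressed in the configuration style sketched earlier, this metric could look roughly as follows; the class descriptions are paraphrased assumptions, not Greval's exact wording.

```python
# Illustrative definition of the answer alignment metric; the wording of the
# class descriptions is an assumption, not Greval's exact prompt.
answer_alignment = {
    "name": "answer_alignment",
    "description": "Compare the generated answer with the ground-truth answer.",
    "classification": {
        "Correct": "Matches the ground truth on every requested point.",
        "Acceptable": "Minor omissions or wording differences, no factual conflict.",
        "Not Acceptable": "Misses or distorts important information.",
        "Incorrect": "Contradicts the ground truth or answers a different question.",
    },
}
```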
Answer quality
Ground truth responses are not always available. This metric therefore evaluates the relevance of the response to the request. We can't know whether the information in the answer is correct, but we can determine whether it directly addresses the request. In addition, this metric analyzes the clarity and internal consistency of the response: is it fully comprehensible, and does it contain any contradictory elements? The classes are the same as for the previous metric.
Chunk evaluation
For this metric, the model is tasked with critically and objectively evaluating how well the content of each retrieved chunk fulfills the query's information needs. The model first analyzes the meaning and intent of the question in detail, then compares the chunks against several criteria. These include semantic alignment in general and topic alignment in particular, as well as relevance, quality and usefulness.
Finally, the evaluator must assign each chunk one of the following categories: Relevant, Indirectly relevant or Irrelevant. Each criterion and category is described in detail to guide the tool in its analysis.
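A per-chunk result of this kind might be captured with a schema along these lines; the structure and the example content are hypothetical.

```python
# Hypothetical schema for a "many categorical" chunk evaluation: one label
# and one reason per retrieved chunk. Not Greval's actual schema.
from typing import List, Literal
from pydantic import BaseModel, Field


class ChunkVerdict(BaseModel):
    chunk_index: int
    category: Literal["Relevant", "Indirectly relevant", "Irrelevant"]
    reason: str = Field(description="Why the chunk does or does not serve the query")


class ChunkEvaluation(BaseModel):
    verdicts: List[ChunkVerdict]


# Example instance with invented content:
example = ChunkEvaluation(verdicts=[
    ChunkVerdict(chunk_index=0, category="Relevant",
                 reason="Directly states the founding year asked in the query."),
    ChunkVerdict(chunk_index=1, category="Indirectly relevant",
                 reason="Gives background on the company but not the requested fact."),
])
```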
Triple evaluation
This metric is similar to the previous one, but this time applied to the retrieved triples, which together with the chunks form the context used by GraphRAG to generate the response. However, it's very specific to Lettria because of our distinctive graph-based approach compared with other RAG solutions. The guidelines are broadly the same as for chunks, taking into account the particular format of triples, composed of a subject, a predicate and an object. The classification system is identical: each triple is classified as Relevant, Indirectly relevant or Irrelevant.
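The triple format and the per-triple verdict can be sketched as follows; the names are illustrative.

```python
# Minimal sketch of the triple format described above and of a per-triple
# verdict; names and content are illustrative.
from dataclasses import dataclass
from typing import Literal


@dataclass
class Triple:
    subject: str     # always an entity
    predicate: str   # a relation (object is an entity) or an attribute (object is a value)
    object: str


@dataclass
class TripleVerdict:
    triple: Triple
    category: Literal["Relevant", "Indirectly relevant", "Irrelevant"]
    reason: str


verdict = TripleVerdict(
    triple=Triple("Acme Corp", "founded_in", "1999"),
    category="Relevant",
    reason="Provides the exact fact requested by the query.",
)
```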
Lettria has previously carried out work on the evaluation of triples, and you can read more about it here.
Answer completeness
This last predefined metric evaluates the completeness of the response. The evaluation is made by comparison with the request and with the chunks and triples deemed relevant by the two previous evaluators (chunk evaluation and triple evaluation). This is a very comprehensive metric, as the information in the response is compared both to the information requested and to the information available in the sources.
This metric asks the model to list the entities and properties (relations or attributes) cited in the request and in the retrieved chunks and triples. For triples, we assume that the subject is an entity and that the predicate is either a relation, in which case the object is another entity, or an attribute, in which case the object is the value of the attribute.
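This convention boils down to a simple rule, sketched below with an invented helper; in practice the relation-versus-attribute decision is made by the evaluator LLM, so here it is passed in explicitly.

```python
# Sketch of the convention described above: subjects are entities, predicates
# are either relations (object = entity) or attributes (object = value).
# Which predicates are relations is decided by the evaluator LLM; here the
# decision is supplied as an argument for illustration.
from typing import Dict, List, Set, Tuple


def collect_elements(
    triples: List[Tuple[str, str, str]],
    relation_predicates: Set[str],
) -> Dict[str, Set[str]]:
    entities: Set[str] = set()
    properties: Set[str] = set()
    for subject, predicate, obj in triples:
        entities.add(subject)          # the subject is always an entity
        properties.add(predicate)      # the predicate is a relation or an attribute
        if predicate in relation_predicates:
            entities.add(obj)          # relation: the object is another entity
        # otherwise the object is an attribute value, not an entity
    return {"entities": entities, "properties": properties}


triples = [("Acme Corp", "headquartered_in", "Lyon"),
           ("Acme Corp", "founded_in", "1999")]
print(collect_elements(triples, relation_predicates={"headquartered_in"}))
# entities: Acme Corp, Lyon; properties: headquartered_in, founded_in (set order may vary)
```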
For each element of the query, the model must also analyze whether it is identified and real, implicit or necessary to reach the response, or unknown. For elements of chunks and triples, the model must distinguish those that directly provide the information requested in the query from those that help to understand the query or indirectly reach the answer.
The evaluator is then responsible for comparing all these elements with those of the response in order to classify it into one of the following categories: Relevant, Somewhat relevant, Hardly relevant or Irrelevant. Once again, each category is accompanied by a precise description.
In order to improve the results, the model is also asked to provide recommendations for enriching or improving the response.
Conclusion
Greval is a powerful and flexible evaluation tool that helps ensure the reliability of Lettria's GraphRAG. With its structured and customizable metrics, Greval provides precise performance analysis and supports continuous improvement of our solutions. It is mainly used for pre-deployment evaluation by Lettria experts to guarantee the quality of our deliverables. By automating assessments, it speeds up development: we no longer have to evaluate manually every time we test a new method, although for production models we can still rely on manual evaluation to avoid biases.