Benchmark

Ontology Toolkit: Instructing LLMs to Generate Ontologies

3 models compared · 2 prompting methods · 6 ontologies across domains

Introduction

This benchmark report provides a comprehensive analysis of the performance of various ontology generation models, with a specific focus on the GPT and Sonnet models. The ontologies were generated using our Ontology-Toolkit. By evaluating these models across diverse use cases (UC) and configurations, this report aims to offer insights into their effectiveness in producing accurate and relevant ontologies.

Ontology. We evaluate our approach on a real-world use case by comparing the ontology generated by Ontology-Toolkit with a reference ontology manually developed by professional knowledge engineers. This reference ontology belongs to the financial domain and provides information about companies as well as specific financial events. It comprises 37 classes, 54 object properties, and 48 data properties. It follows an event-based design pattern, primarily created to capture information on events affecting companies, such as mergers, acquisitions, settlements, and legal disputes. The Event superclass plays a central role, with no fewer than 15 subclasses, including MergerEvent, SettlementEvent, and LegalDisputeEvent. The ontology also includes 20 object properties indicating the participation of companies and organizations in these events, such as biddingParticipant, regulatorParticipant, and victimParticipant.
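To make the event-based pattern concrete, here is a minimal sketch of a small slice of it built with rdflib. The class and property names come from the description above; the namespace IRI, and the domain and range chosen for biddingParticipant, are illustrative assumptions, since the reference ontology itself is not reproduced in this report.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Placeholder namespace: the reference ontology's real IRI is not public.
FIN = Namespace("http://example.org/finance#")

g = Graph()
g.bind("fin", FIN)

# The central Event superclass and two of its 15 subclasses.
for cls in (FIN.Event, FIN.MergerEvent, FIN.SettlementEvent, FIN.Company):
    g.add((cls, RDF.type, OWL.Class))
g.add((FIN.MergerEvent, RDFS.subClassOf, FIN.Event))
g.add((FIN.SettlementEvent, RDFS.subClassOf, FIN.Event))

# One of the 20 participation object properties linking events to companies;
# the domain/range shown here are assumptions for illustration.
g.add((FIN.biddingParticipant, RDF.type, OWL.ObjectProperty))
g.add((FIN.biddingParticipant, RDFS.domain, FIN.Event))
g.add((FIN.biddingParticipant, RDFS.range, FIN.Company))

print(g.serialize(format="turtle"))
```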

Experiments. We used the Ontology-Toolkit to generate an ontology from a corpus of 30 documents on financial events from BBC and Reuters, averaging 900 tokens each. The domain is defined as "market-moving events", and the use case (UC) is "Assess and analyze the impact of market-moving events, the parties involved, and their subsequent effects."

In this quantitative evaluation, we set the number of classes and competency questions to generate to 50 and focused the evaluation on a single question: is specifying a use case when modeling an ontology useful or not? The two configurations are named with_use_case_50 and no_use_case_50.

These configurations are based on the results of a previous evaluation, in which 12 ontologies were generated by varying several factors: the number of classes and questions in the prompt (10, 30, 50), the presence or absence of a UC, and the use of a prompt chaining step. The best results were achieved with 50 questions, both with and without a use case, and without the prompt chaining step.

Six ontologies were generated for this experiment, with or without a specified use case, using three LLMs: Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini. This setup allows a comparative analysis of the models' performance based on the presence or absence of an explicit use case, as sketched below.
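As a rough sketch, the experimental grid can be pictured as follows. The generate_ontology function and its parameter names are hypothetical stand-ins for the Ontology-Toolkit call, which this report does not expose; only the configuration values come from the text above.

```python
from typing import Optional

# Hypothetical wrapper around the Ontology-Toolkit generation step;
# this signature is illustrative, not the toolkit's real API.
def generate_ontology(llm: str, n_classes: int, n_questions: int,
                      use_case: Optional[str]) -> str:
    ...

USE_CASE = ("Assess and analyze the impact of market-moving events, "
            "the parties involved, and their subsequent effects.")

configs = {
    "with_use_case_50": {"n_classes": 50, "n_questions": 50, "use_case": USE_CASE},
    "no_use_case_50": {"n_classes": 50, "n_questions": 50, "use_case": None},
}

# 3 LLMs x 2 configurations = the 6 ontologies listed below.
for llm in ("claude-3.5-sonnet", "gpt-4o", "gpt-4o-mini"):
    for name, cfg in configs.items():
        ontology = generate_ontology(llm, **cfg)
```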

Generated Ontologies 

  1. Claude 3.5 Sonnet_with_use_case_50
  2. Claude 3.5 Sonnet_no_use_case_50
  3. GPT-4o_with_use_case_50
  4. GPT-4o_no_use_case_50
  5. GPT-4o-mini_with_use_case_50
  6. GPT-4o-mini_no_use_case_50

Evaluation results. For each generated ontology, a manual analysis was conducted, comparing the presence of classes and properties between the generated and reference ontologies. We also assessed the presence of hallucinations. We allowed a degree of tolerance in the comparison: a missing class or property may correspond to an element with a broader or narrower meaning, and such cases were noted separately. For example, the object property hasLocation is broader than country, while the data property usesCryptocurrency is narrower than currency. Two sets of evaluations were conducted using different metrics: (1) Accuracy, and (2) Precision, Recall, and F1 Score, defined below (a sketch of how they can be computed follows the list).

  • Accuracy: the overall proportion of matching classes and properties between the generated and reference ontologies.
  • Precision: the proportion of concepts and relations in the generated ontology that are also present in the reference ontology.
  • Recall: the proportion of concepts and relations in the reference ontology that are also present in the generated ontology.
  • F1 Score: the harmonic mean of precision and recall.
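As a minimal sketch of how these metrics can be computed, assume each element of the reference ontology has been manually labeled yes, narrower, broader, or no (as in Table 1), with the first three counting as matches under the tolerance described above. The counting scheme follows the definitions in the list; the function names are ours.

```python
# Labels from the manual comparison, one per reference element;
# "yes", "narrower", and "broader" all count as matches.
MATCH = {"yes", "narrower", "broader"}

def accuracy(labels: list[str]) -> float:
    """Proportion of reference elements matched by the generated ontology."""
    return sum(label in MATCH for label in labels) / len(labels)

def precision_recall_f1(n_matched: int, n_generated: int, n_reference: int):
    precision = n_matched / n_generated  # share of generated elements that match
    recall = n_matched / n_reference     # share of reference elements recovered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the Claude 3.5 Sonnet with_use_case_50 class row of Table 1 has
# 13 yes + 2 narrower + 11 broader matches out of 37 reference classes.
labels = ["yes"] * 13 + ["narrower"] * 2 + ["broader"] * 11 + ["no"] * 11
print(f"class accuracy: {accuracy(labels):.2%}")  # -> 70.27%
```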

Accuracy

As seen in Table 1, Claude 3.5 Sonnet shows stable results for ontologies generated both with and without a UC. GPT-4o-mini comes second, particularly for classes; although more limited on properties, it also improves with a UC. Finally, GPT-4o shows weaker results, particularly in the generation of properties with a UC. These results suggest that specifying a UC enhances the consistency of ontologies generated by Claude 3.5 Sonnet and GPT-4o-mini, but less so for GPT-4o.

Comparing the three models, Claude 3.5 Sonnet stands out for its balanced performance, both with and without a use case, offering stable accuracy in the generation of classes and properties.

Model | Configuration | Type | Yes | Narrower | Broader | No | Yes + Narrower + Broader
--- | --- | --- | --- | --- | --- | --- | ---
Claude 3.5 Sonnet | UC 50 | Class | 13 (35.14%) | 2 (5.41%) | 11 (29.73%) | 11 (29.73%) | 70.27%
Claude 3.5 Sonnet | UC 50 | DatatypeProperty | 17 (35.42%) | 0 (0.00%) | 16 (33.33%) | 15 (31.25%) |
Claude 3.5 Sonnet | UC 50 | ObjectProperty | 8 (14.81%) | 0 (0.00%) | 27 (50.00%) | 19 (35.19%) |
Claude 3.5 Sonnet | UC 50 | All properties | 25 (24.51%) | 0 (0.00%) | 43 (42.16%) | 34 (33.33%) | 66.67%
Claude 3.5 Sonnet | UC 50 | All entities | 38 (27.34%) | 2 (1.44%) | 54 (38.85%) | 45 (32.37%) | 67.63%
Claude 3.5 Sonnet | Without UC 50 | Class | 13 (35.14%) | 5 (13.51%) | 10 (27.03%) | 9 (24.32%) | 75.68%
Claude 3.5 Sonnet | Without UC 50 | DatatypeProperty | 9 (18.75%) | 0 (0.00%) | 20 (41.67%) | 19 (39.58%) |
Claude 3.5 Sonnet | Without UC 50 | ObjectProperty | 6 (11.11%) | 0 (0.00%) | 24 (44.44%) | 24 (44.44%) |
Claude 3.5 Sonnet | Without UC 50 | All properties | 15 (14.71%) | 0 (0.00%) | 44 (43.14%) | 43 (42.16%) | 57.84%
Claude 3.5 Sonnet | Without UC 50 | All entities | 28 (20.14%) | 5 (3.60%) | 54 (38.85%) | 52 (37.41%) | 62.59%
GPT-4o | UC 50 | Class | 13 (35.14%) | 1 (2.70%) | 11 (29.73%) | 12 (32.43%) | 67.57%
GPT-4o | UC 50 | DatatypeProperty | 1 (2.08%) | 7 (14.58%) | 0 (0.00%) | 40 (83.33%) |
GPT-4o | UC 50 | ObjectProperty | 4 (7.41%) | 0 (0.00%) | 0 (0.00%) | 50 (92.59%) |
GPT-4o | UC 50 | All properties | 5 (4.90%) | 7 (6.86%) | 0 (0.00%) | 90 (88.24%) | 11.76%
GPT-4o | UC 50 | All entities | 18 (12.95%) | 8 (5.76%) | 11 (7.91%) | 102 (73.38%) | 26.62%
GPT-4o | Without UC 50 | Class | 5 (13.51%) | 1 (2.70%) | 16 (43.24%) | 15 (40.54%) | 59.46%
GPT-4o | Without UC 50 | DatatypeProperty | 0 (0.00%) | 4 (8.33%) | 9 (18.75%) | 35 (72.92%) |
GPT-4o | Without UC 50 | ObjectProperty | 2 (3.70%) | 3 (5.56%) | 29 (53.70%) | 20 (37.04%) |
GPT-4o | Without UC 50 | All properties | 2 (1.96%) | 7 (6.86%) | 38 (37.25%) | 55 (53.92%) | 46.08%
GPT-4o | Without UC 50 | All entities | 7 (5.04%) | 8 (5.76%) | 54 (38.85%) | 70 (50.36%) | 49.64%
GPT-4o-mini | UC 50 | Class | 12 (32.43%) | 1 (2.70%) | 13 (35.14%) | 11 (29.73%) | 70.27%
GPT-4o-mini | UC 50 | DatatypeProperty | 0 (0.00%) | 0 (0.00%) | 9 (18.75%) | 39 (81.25%) |
GPT-4o-mini | UC 50 | ObjectProperty | 1 (1.85%) | 0 (0.00%) | 28 (51.85%) | 25 (46.30%) |
GPT-4o-mini | UC 50 | All properties | 1 (0.98%) | 0 (0.00%) | 37 (36.27%) | 64 (62.75%) | 37.25%
GPT-4o-mini | UC 50 | All entities | 13 (9.35%) | 1 (0.72%) | 50 (35.97%) | 75 (53.96%) | 46.04%
GPT-4o-mini | Without UC 50 | Class | 15 (40.54%) | 0 (0.00%) | 12 (32.43%) | 10 (27.03%) | 72.97%
GPT-4o-mini | Without UC 50 | DatatypeProperty | 0 (0.00%) | 0 (0.00%) | 9 (18.75%) | 38 (79.17%) |
GPT-4o-mini | Without UC 50 | ObjectProperty | 5 (9.26%) | 0 (0.00%) | 8 (14.81%) | 41 (75.93%) |
GPT-4o-mini | Without UC 50 | All properties | 5 (4.95%) | 0 (0.00%) | 17 (16.83%) | 79 (78.22%) | 21.78%
GPT-4o-mini | Without UC 50 | All entities | 20 (14.49%) | 0 (0.00%) | 29 (21.01%) | 89 (64.49%) | 35.51%

(Table 1: Accuracy comparison of the 6 ontologies generated using 3 LLMs with respect to the reference ontology)

Precision, Recall, and F1 Score

Based on F1 Scores (Table 2), the Claude 3.5 Sonnet model offers the best overall results. It outperforms the other models in class generation, reaching a class F1 Score of 0.76 in its best configuration, although its performance on properties remains limited. GPT-4o shows a good class F1 Score with a UC (0.70) but drops drastically without one. Finally, GPT-4o-mini shows stable but weaker results than the other two models, particularly for property generation. Overall, Claude 3.5 Sonnet remains the strongest model on these metrics.

Model | Configuration | Precision | Recall | F1 Score | Class Precision | Class Recall | Class F1 | Property Precision | Property Recall | Property F1
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Claude 3.5 Sonnet | UC 50 | 0.30 | 0.30 | 0.30 | 0.46 | 0.68 | 0.55 | 0.20 | 0.17 | 0.18
Claude 3.5 Sonnet | Without UC 50 | 0.46 | 0.46 | 0.46 | 0.58 | 1.14 | 0.76 | 0.33 | 0.22 | 0.26
GPT-4o | UC 50 | 0.57 | 0.36 | 0.44 | 0.60 | 0.84 | 0.70 | 0.53 | 0.18 | 0.27
GPT-4o | Without UC 50 | 0.88 | 0.10 | 0.18 | 1.00 | 0.11 | 0.20 | 0.83 | 0.10 | 0.18
GPT-4o-mini | UC 50 | 0.39 | 0.25 | 0.31 | 0.52 | 0.73 | 0.61 | 0.22 | 0.08 | 0.12
GPT-4o-mini | Without UC 50 | 0.40 | 0.28 | 0.33 | 0.53 | 0.76 | 0.62 | 0.24 | 0.10 | 0.14

(Table 2: Precision, Recall, and F1 Score comparison of the 6 ontologies generated using 3 LLMs with respect to the reference ontology)

Conclusion

To sum up, our work shows that Claude 3.5 Sonnet offers stable performance whether or not the use case is specified. GPT-4o-mini ranks second, ahead of GPT-4o. We also note that adding the use case is not always beneficial: here, it improves the consistency of the ontologies generated by Claude 3.5 Sonnet and GPT-4o-mini, but has less impact on GPT-4o.
