Benchmark

Ontology Toolkit: Instructing LLMs to Generate Ontologies

3 models compared · 2 prompting methods · 6 ontologies across domains

Introduction

This benchmark report provides a comprehensive analysis of the performance of various ontology generation models, with a specific focus on the GPT and Sonnet models. The ontologies were generated using our Ontology-Toolkit. By evaluating these models across diverse use cases (UC) and configurations, this report aims to offer insights into their effectiveness in producing accurate and relevant ontologies.

Ontology. We evaluate our approach on a real-world use case by comparing the ontology generated by Ontology-Toolkit with a reference ontology manually developed by professional knowledge engineers. This reference ontology belongs to the financial domain and provides information about companies as well as specific financial events. It comprises 37 classes, 54 object properties, and 48 data properties. It follows an event-based design pattern, primarily created to capture information on events affecting companies, such as mergers, acquisitions, settlements, and legal disputes. The Event superclass plays a central role, with no fewer than 15 subclasses, including MergerEvent, SettlementEvent, and LegalDisputeEvent. The ontology also includes 20 object properties indicating the participation of companies and organizations in these events, such as biddingParticipant, regulatorParticipant, and victimParticipant.
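To make the event-based pattern concrete, here is a minimal sketch of a small slice of it built with rdflib. The class and property names come from the description above; the namespace IRI, and the domain and range chosen for biddingParticipant, are illustrative assumptions, since the reference ontology itself is not reproduced in this report.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Placeholder namespace: the reference ontology's real IRI is not public.
FIN = Namespace("http://example.org/finance#")

g = Graph()
g.bind("fin", FIN)

# The central Event superclass and two of its 15 subclasses.
for cls in (FIN.Event, FIN.MergerEvent, FIN.SettlementEvent, FIN.Company):
    g.add((cls, RDF.type, OWL.Class))
g.add((FIN.MergerEvent, RDFS.subClassOf, FIN.Event))
g.add((FIN.SettlementEvent, RDFS.subClassOf, FIN.Event))

# One of the 20 participation object properties linking events to companies;
# the domain/range shown here are assumptions for illustration.
g.add((FIN.biddingParticipant, RDF.type, OWL.ObjectProperty))
g.add((FIN.biddingParticipant, RDFS.domain, FIN.Event))
g.add((FIN.biddingParticipant, RDFS.range, FIN.Company))

print(g.serialize(format="turtle"))
```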

Experiments. We used the Ontology-Toolkit to generate an ontology from a corpus of 30 documents on financial events from BBC and Reuters, averaging 900 tokens each. The domain is defined as "market-moving events", and the use case (UC) is "Assess and analyze the impact of market-moving events, the parties involved, and their subsequent effects."

In this quantitative evaluation, we set the number of classes and competency questions to generate to 50 and focused the evaluation on a single question: is specifying a use case when modeling an ontology useful or not? The two configurations are named with_use_case_50 and no_use_case_50.

These configurations are based on the results of a previous evaluation, in which 12 ontologies were generated by varying several factors: the number of classes and questions in the prompt (10, 30, 50), the presence or absence of a UC, and the use of a prompt chaining step. The best results were achieved with 50 questions, both with and without a use case, and without the prompt chaining step.

Six ontologies were generated for this experiment, with or without a specified use case, using three LLMs: Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini. This setup allows a comparative analysis of the models' performance based on the presence or absence of an explicit use case, as sketched below.
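As a rough sketch, the experimental grid can be pictured as follows. The generate_ontology function and its parameter names are hypothetical stand-ins for the Ontology-Toolkit call, which this report does not expose; only the configuration values come from the text above.

```python
from typing import Optional

# Hypothetical wrapper around the Ontology-Toolkit generation step;
# this signature is illustrative, not the toolkit's real API.
def generate_ontology(llm: str, n_classes: int, n_questions: int,
                      use_case: Optional[str]) -> str:
    ...

USE_CASE = ("Assess and analyze the impact of market-moving events, "
            "the parties involved, and their subsequent effects.")

configs = {
    "with_use_case_50": {"n_classes": 50, "n_questions": 50, "use_case": USE_CASE},
    "no_use_case_50": {"n_classes": 50, "n_questions": 50, "use_case": None},
}

# 3 LLMs x 2 configurations = the 6 ontologies listed below.
for llm in ("claude-3.5-sonnet", "gpt-4o", "gpt-4o-mini"):
    for name, cfg in configs.items():
        ontology = generate_ontology(llm, **cfg)
```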

Generated Ontologies 

  1. Claude 3.5 Sonnet_with_use_case_50
  2. Claude 3.5 Sonnet_no_use_case_50
  3. GPT-4o_with_use_case_50
  4. GPT-4o_no_use_case_50
  5. GPT-4o-mini_with_use_case_50
  6. GPT-4o-mini_no_use_case_50

Evaluation results. For each generated ontology, a manual analysis was conducted, comparing the presence of classes and properties between the generated and reference ontologies. We also assessed the presence of hallucinations. We allowed a degree of tolerance in the comparison: a missing class or property may correspond to an element with a broader or narrower meaning, and such cases were noted separately. For example, the object property hasLocation is broader than country, while the data property usesCryptocurrency is narrower than currency. Two sets of evaluations were conducted using different metrics: (1) Accuracy, and (2) Precision, Recall, and F1 Score, defined below (a sketch of how they can be computed follows the list).

  • Accuracy: the overall proportion of matching classes and properties between the generated and reference ontologies.
  • Precision: the proportion of concepts and relations in the generated ontology that are also present in the reference ontology.
  • Recall: the proportion of concepts and relations in the reference ontology that are also present in the generated ontology.
  • F1 Score: the harmonic mean of precision and recall.
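As a minimal sketch of how these metrics can be computed, assume each element of the reference ontology has been manually labeled yes, narrower, broader, or no (as in Table 1), with the first three counting as matches under the tolerance described above. The counting scheme follows the definitions in the list; the function names are ours.

```python
# Labels from the manual comparison, one per reference element;
# "yes", "narrower", and "broader" all count as matches.
MATCH = {"yes", "narrower", "broader"}

def accuracy(labels: list[str]) -> float:
    """Proportion of reference elements matched by the generated ontology."""
    return sum(label in MATCH for label in labels) / len(labels)

def precision_recall_f1(n_matched: int, n_generated: int, n_reference: int):
    precision = n_matched / n_generated  # share of generated elements that match
    recall = n_matched / n_reference     # share of reference elements recovered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the Claude 3.5 Sonnet with_use_case_50 class row of Table 1 has
# 13 yes + 2 narrower + 11 broader matches out of 37 reference classes.
labels = ["yes"] * 13 + ["narrower"] * 2 + ["broader"] * 11 + ["no"] * 11
print(f"class accuracy: {accuracy(labels):.2%}")  # -> 70.27%
```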

Accuracy

As seen in Table 1, Claude 3.5 Sonnet shows stable results for ontologies generated both with and without a UC. GPT-4o-mini comes second, particularly for classes; although more limited on properties, it also improves with a UC. Finally, GPT-4o shows weaker results, particularly in the generation of properties with a UC. These results suggest that specifying a UC enhances the consistency of ontologies generated by Claude 3.5 Sonnet and GPT-4o-mini, but less so for GPT-4o.

Comparing the three models, Claude 3.5 Sonnet stands out for its balanced performance, both with and without a use case, offering stable accuracy in the generation of classes and properties.

Model | Configuration | Type | Yes | Narrower | Broader | No | Yes + Narrower + Broader
--- | --- | --- | --- | --- | --- | --- | ---
Claude 3.5 Sonnet | UC 50 | Class | 13 (35.14%) | 2 (5.41%) | 11 (29.73%) | 11 (29.73%) | 70.27%
Claude 3.5 Sonnet | UC 50 | DatatypeProperty | 17 (35.42%) | 0 (0.00%) | 16 (33.33%) | 15 (31.25%) |
Claude 3.5 Sonnet | UC 50 | ObjectProperty | 8 (14.81%) | 0 (0.00%) | 27 (50.00%) | 19 (35.19%) |
Claude 3.5 Sonnet | UC 50 | All properties | 25 (24.51%) | 0 (0.00%) | 43 (42.16%) | 34 (33.33%) | 66.67%
Claude 3.5 Sonnet | UC 50 | All entities | 38 (27.34%) | 2 (1.44%) | 54 (38.85%) | 45 (32.37%) | 67.63%
Claude 3.5 Sonnet | Without UC 50 | Class | 13 (35.14%) | 5 (13.51%) | 10 (27.03%) | 9 (24.32%) | 75.68%
Claude 3.5 Sonnet | Without UC 50 | DatatypeProperty | 9 (18.75%) | 0 (0.00%) | 20 (41.67%) | 19 (39.58%) |
Claude 3.5 Sonnet | Without UC 50 | ObjectProperty | 6 (11.11%) | 0 (0.00%) | 24 (44.44%) | 24 (44.44%) |
Claude 3.5 Sonnet | Without UC 50 | All properties | 15 (14.71%) | 0 (0.00%) | 44 (43.14%) | 43 (42.16%) | 57.84%
Claude 3.5 Sonnet | Without UC 50 | All entities | 28 (20.14%) | 5 (3.60%) | 54 (38.85%) | 52 (37.41%) | 62.59%
GPT-4o | UC 50 | Class | 13 (35.14%) | 1 (2.70%) | 11 (29.73%) | 12 (32.43%) | 67.57%
GPT-4o | UC 50 | DatatypeProperty | 1 (2.08%) | 7 (14.58%) | 0 (0.00%) | 40 (83.33%) |
GPT-4o | UC 50 | ObjectProperty | 4 (7.41%) | 0 (0.00%) | 0 (0.00%) | 50 (92.59%) |
GPT-4o | UC 50 | All properties | 5 (4.90%) | 7 (6.86%) | 0 (0.00%) | 90 (88.24%) | 11.76%
GPT-4o | UC 50 | All entities | 18 (12.95%) | 8 (5.76%) | 11 (7.91%) | 102 (73.38%) | 26.62%
GPT-4o | Without UC 50 | Class | 5 (13.51%) | 1 (2.70%) | 16 (43.24%) | 15 (40.54%) | 59.46%
GPT-4o | Without UC 50 | DatatypeProperty | 0 (0.00%) | 4 (8.33%) | 9 (18.75%) | 35 (72.92%) |
GPT-4o | Without UC 50 | ObjectProperty | 2 (3.70%) | 3 (5.56%) | 29 (53.70%) | 20 (37.04%) |
GPT-4o | Without UC 50 | All properties | 2 (1.96%) | 7 (6.86%) | 38 (37.25%) | 55 (53.92%) | 46.08%
GPT-4o | Without UC 50 | All entities | 7 (5.04%) | 8 (5.76%) | 54 (38.85%) | 70 (50.36%) | 49.64%
GPT-4o-mini | UC 50 | Class | 12 (32.43%) | 1 (2.70%) | 13 (35.14%) | 11 (29.73%) | 70.27%
GPT-4o-mini | UC 50 | DatatypeProperty | 0 (0.00%) | 0 (0.00%) | 9 (18.75%) | 39 (81.25%) |
GPT-4o-mini | UC 50 | ObjectProperty | 1 (1.85%) | 0 (0.00%) | 28 (51.85%) | 25 (46.30%) |
GPT-4o-mini | UC 50 | All properties | 1 (0.98%) | 0 (0.00%) | 37 (36.27%) | 64 (62.75%) | 37.25%
GPT-4o-mini | UC 50 | All entities | 13 (9.35%) | 1 (0.72%) | 50 (35.97%) | 75 (53.96%) | 46.04%
GPT-4o-mini | Without UC 50 | Class | 15 (40.54%) | 0 (0.00%) | 12 (32.43%) | 10 (27.03%) | 72.97%
GPT-4o-mini | Without UC 50 | DatatypeProperty | 0 (0.00%) | 0 (0.00%) | 9 (18.75%) | 38 (79.17%) |
GPT-4o-mini | Without UC 50 | ObjectProperty | 5 (9.26%) | 0 (0.00%) | 8 (14.81%) | 41 (75.93%) |
GPT-4o-mini | Without UC 50 | All properties | 5 (4.95%) | 0 (0.00%) | 17 (16.83%) | 79 (78.22%) | 21.78%
GPT-4o-mini | Without UC 50 | All entities | 20 (14.49%) | 0 (0.00%) | 29 (21.01%) | 89 (64.49%) | 35.51%

(Table 1: Accuracy comparison of the 6 ontologies generated using 3 LLMs with respect to the reference ontology)

Precision, Recall, and F1 Score

Based on F1 Scores (Table 2), the Claude 3.5 Sonnet model offers the best overall results. It outperforms the other models in class generation, reaching a class F1 Score of 0.76 in its best configuration, although its performance on properties remains limited. GPT-4o shows a good class F1 Score with a UC (0.70) but drops drastically without one. Finally, GPT-4o-mini shows stable but weaker results than the other two models, particularly for property generation. Overall, Claude 3.5 Sonnet remains the strongest model on these metrics.

Model | Configuration | Precision | Recall | F1 Score | Class Precision | Class Recall | Class F1 | Property Precision | Property Recall | Property F1
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Claude 3.5 Sonnet | UC 50 | 0.30 | 0.30 | 0.30 | 0.46 | 0.68 | 0.55 | 0.20 | 0.17 | 0.18
Claude 3.5 Sonnet | Without UC 50 | 0.46 | 0.46 | 0.46 | 0.58 | 1.14 | 0.76 | 0.33 | 0.22 | 0.26
GPT-4o | UC 50 | 0.57 | 0.36 | 0.44 | 0.60 | 0.84 | 0.70 | 0.53 | 0.18 | 0.27
GPT-4o | Without UC 50 | 0.88 | 0.10 | 0.18 | 1.00 | 0.11 | 0.20 | 0.83 | 0.10 | 0.18
GPT-4o-mini | UC 50 | 0.39 | 0.25 | 0.31 | 0.52 | 0.73 | 0.61 | 0.22 | 0.08 | 0.12
GPT-4o-mini | Without UC 50 | 0.40 | 0.28 | 0.33 | 0.53 | 0.76 | 0.62 | 0.24 | 0.10 | 0.14

(Table 2: Precision, Recall, and F1 Score comparison of the 6 ontologies generated using 3 LLMs with respect to the reference ontology)

Conclusion

To sum up, our work shows that Claude 3.5 Sonnet offers stable performance whether or not the use case is specified. GPT-4o-mini ranks second, ahead of GPT-4o. We also note that adding the use case is not always beneficial: here, it improves the consistency of the ontologies generated by Claude 3.5 Sonnet and GPT-4o-mini, but has less impact on GPT-4o.
