Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in extracting knowledge from, and generating new content based on, various types of resources, particularly text-based ones. Beyond unstructured data, LLMs have also shown promising results when leveraging structured but semantically complex resources such as ontologies, schemas, and knowledge graphs. However, the practical use of large-scale semantic artifacts as direct input to LLMs is constrained by prompt size and token limitations. Retrieval-Augmented Generation (RAG) systems can address this issue by preprocessing and segmenting these large resources into manageable units.
In this paper, we propose a novel RAG-based architecture tailored for large-scale semantic artifacts that incorporates LLM-based Named Entity Recognition and Disambiguation (NERD) and Entity Linking (EL) solutions, using OPC UA information models, an industrial standard, as a foundation. Within this framework, we implement and evaluate three distinct use cases that combine LLMs with the proposed RAG system: (i) semantic artifact validation, (ii) information retrieval, and (iii) information model generation. Each use case demonstrates strong performance, achieving F1-scores of up to 100\% and thereby validating the effectiveness of the approach. Furthermore, we evaluate the generalizability of the system across two different domains, confirming its robustness and applicability in diverse industrial contexts.