ERA-SHACL-Benchmark: A real-world benchmark to assess the performance and quality of in-memory SHACL engines

Tracking #: 3972-5186

Authors: 
Edgar Martinez
Edna Ruckhaus
Jhon Toledo
Daniel Doña
Oscar Corcho

Responsible editor: 
Elena Demidova

Submission type: 
Other
Abstract: 
With the growing use of graph-based data on the web and concerns around the quality of published data, validating knowledge graphs has become increasingly important. The Shapes Constraint Language (SHACL) is a World Wide Web Consortium (W3C) recommendation to validate RDF graphs against predefined constraints. Multiple SHACL engines have been developed that offer overlapping functionalities but also differ in several aspects (in terms of the data formats they can deal with, support of constraints and inference, reporting of constraint violations, and early detection of invalid entities, among others). Some of these engines have been evaluated using performance benchmarks that rely entirely on partial or synthetic datasets, with little to no emphasis on conformance, which limits their applicability to full-scale real-world scenarios. Moreover, as application demands grow in terms of validation processing speed and quality, a good balance between efficiency and reporting correctness, completeness, and comprehensiveness has become critical. In this paper, we present the ERA-SHACL-Benchmark, a comprehensive benchmark for evaluating SHACL engines based on real data and shapes used by the European Union Agency for Railways (ERA) Register of Infrastructure (RINF) System. Our benchmark includes a suite of tests designed to assess engine correctness by comparing generated reports to expected outcomes, measure performance in terms of load time, validation time, and memory usage, and evaluate the completeness and comprehensiveness of the generated validation reports.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 15/Mar/2026
Suggestion:
Major Revision
Review Comment:

The paper presents the ERA-SHACL-Benchmark, a benchmark for evaluating in-memory SHACL validation engines. The benchmark assesses engine performance (load time, validation time, memory) and conformance (correctness, completeness, comprehensiveness). It involves the European Union Agency for Railways (ERA) KG, together with the corresponding SHACL shapes. The shapes graph contains 275 SHACL-core and SHACL-SPARQL shapes used to evaluate the performance, completeness, and comprehensiveness dimensions, while another 44 unit-test shapes are extracted from the 275 for correctness evaluation. Eight open-source engines are evaluated. The benchmark is released on GitHub. This work is a helpful addition to the existing benchmarking of SHACL engines and will likely be of interest to the SHACL engine development community. The use of a production KG with real-world SHACL shapes addresses the gap that existing benchmarks rely on synthetic or partial datasets and largely ignore conformance. The discovery and upstream reporting of multiple engine bugs further demonstrates the practical value of this work.

- Strengths
1. Real-world KG and SHACL shapes. Unlike all prior benchmarks reviewed in Section 2, ERA-SHACL-Benchmark uses a full production KG together with its SHACL shapes.

2. Multi-dimensional conformance assessment. The metrics, including correctness, completeness, and comprehensiveness, add value over purely performance-oriented benchmarks.

3. Breadth of engines covered. The benchmark evaluates eight engines, offering representative coverage of existing SHACL engines.

4. Direct community impact. Five bug reports filed with and acknowledged by upstream engine maintainers demonstrate that the benchmark already has a practical impact.

5. All benchmark resources are publicly available on GitHub and well organized, with a README file.

- Major Comments

1. Incomplete coverage of SHACL target declarations and lack of systematic core constraint coverage report.

Based on my inspection of the benchmark repository on GitHub, the unit tests used for the correctness dimension involve only sh:targetNode, while the shapes used for the completeness and comprehensiveness dimensions involve sh:targetClass. The sh:targetSubjectsOf and sh:targetObjectsOf declarations appear to be absent from the benchmark entirely. This is a substantive gap in coverage that is not adequately discussed in the current version of the paper. The concluding remark that the benchmark is "not exhaustive in terms of constraints variety" is insufficient. The paper needs a systematic account of which core constraint components, including target declarations, are and are not covered. It should provide a clear and structured discussion of this limitation, including an explanation of its impact on the interpretation of the results.
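To make the requested audit concrete: the target-declaration coverage of the shapes files can be checked mechanically. The sketch below is illustrative only (the sample Turtle fragment and the ex:/era: names are my own, not taken from the benchmark repository); it counts occurrences of the four SHACL target declarations in a Turtle shapes text:

```python
import re
from collections import Counter

# The four target declarations defined by the SHACL recommendation.
TARGET_PREDICATES = ["sh:targetNode", "sh:targetClass",
                     "sh:targetSubjectsOf", "sh:targetObjectsOf"]

def target_coverage(turtle_text: str) -> Counter:
    """Count occurrences of each target declaration in a Turtle shapes text."""
    counts = Counter()
    for pred in TARGET_PREDICATES:
        counts[pred] = len(re.findall(re.escape(pred) + r"\b", turtle_text))
    return counts

# Illustrative shapes fragment (not taken from the benchmark repository).
sample = """
ex:TrackShape a sh:NodeShape ;
    sh:targetClass era:Track ;
    sh:property [ sh:path era:length ; sh:minCount 1 ] .
"""
print(target_coverage(sample))
```

Running such a script over the tds, core, and core+sparql shapes graphs and reporting the resulting counts per target declaration would answer the coverage question directly.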

2. Absence of a ground truth for completeness assessment undermines a core claim.

For the results of the completeness assessment, the paper mentions in Section 4.2 (Page 7) that "the large amount of violations prevents us from having a ground truth". This is a significant limitation for the completeness dimension, which is presented as one of the three conformance dimensions. Without a verified expected number of violations for at least a subset of the shapes and KG, the completeness evaluation reduces to a cross-engine consistency check. The paper uses "the most frequent value across engines" as an implicit baseline, but this is not theoretically sound: if all engines share the same bug, the most frequent value is wrong. The paper needs to either (a) provide a manually verified ground truth for a small but representative subset, or (b) reframe the completeness section to state explicitly that it measures consistency, not true completeness, and discuss the impact on interpretation. The current presentation may mislead readers about the rigor of this dimension.
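To illustrate why the implicit baseline is fragile, the sketch below (engine names and violation counts are hypothetical) implements the "most frequent value across engines" rule; note that unanimity among buggy engines would still yield a wrong consensus value:

```python
from collections import Counter

def consensus_count(engine_counts: dict) -> tuple:
    """Return the most frequent violation count across engines and whether a
    strict majority agrees on it. Agreement is evidence of consistency only:
    if most engines share the same bug, the modal value is still wrong."""
    freq = Counter(engine_counts.values())
    value, votes = freq.most_common(1)[0]
    return value, votes > len(engine_counts) / 2

# Hypothetical per-shape results (not real benchmark numbers).
counts = {"engineA": 1021, "engineB": 1021, "engineC": 1021, "engineD": 998}
print(consensus_count(counts))  # (1021, True)
```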

3. Absence of shape complexity characterisation.

The experiments are conducted on three shapes graphs (tds, core, and core+sparql), but the paper provides no quantitative characterisation of the complexity of these shapes. For example, how many JOIN operations are involved in the SHACL-SPARQL constraints? What is the maximum depth of the property paths? How many negations are present? A structured report of such information would help readers understand why certain engines time out on specific combinations of dataset and shapes.
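As a sketch of the kind of characterisation I have in mind, the purely textual approximation below is illustrative only (a proper analysis would operate on the parsed SPARQL algebra, and the example query is invented, not taken from the ERA shapes); it reports triple-pattern, OPTIONAL, and negation counts for a SHACL-SPARQL constraint:

```python
def sparql_complexity(query: str) -> dict:
    """Rough structural metrics for a SHACL-SPARQL constraint query:
    triple patterns (a crude proxy for joins), OPTIONALs, and negations."""
    q = query.upper()
    return {
        "triple_patterns": query.count(" ."),  # crude textual proxy
        "optionals": q.count("OPTIONAL"),
        "negations": q.count("FILTER NOT EXISTS") + q.count("MINUS"),
    }

# Invented constraint query, not taken from the ERA shapes.
query = """
SELECT $this WHERE {
  $this era:track ?t .
  ?t era:length ?len .
  FILTER NOT EXISTS { ?t era:gauge ?g . }
}
"""
print(sparql_complexity(query))
```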

4. Timeout (TO) and Memory Limit (ML/MV) threshold discussion.

In Figure 2 and Tables 8, 9, and 10, some engines triggered TO or ML/MV during the experiments. The paper should briefly explore where the "crash threshold" lies for those failing engines (e.g., does an engine crash when processing 5M or 20M triples?). This would be helpful to users choosing engines for production use.
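Locating such a threshold need not be expensive: a bisection over candidate dataset sizes would pin it down in a handful of runs. A sketch (the validate callable and the sizes are placeholders; in practice validate would invoke the engine on a slice of the KG under the benchmark's time and memory limits):

```python
def crash_threshold(validate, sizes):
    """Binary-search the largest dataset size (in triples) that `validate`
    handles successfully. `sizes` must be sorted ascending and `validate`
    monotone (success at n implies success at all smaller sizes)."""
    lo, hi, best = 0, len(sizes) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if validate(sizes[mid]):
            best = sizes[mid]
            lo = mid + 1
        else:
            hi = mid - 1
    return best

# Simulated engine that succeeds up to 5M triples (purely illustrative).
print(crash_threshold(lambda n: n <= 5_000_000,
                      [1_000_000, 5_000_000, 20_000_000, 40_000_000]))
```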

5. Lack of explanation for pySHACL memory exhaustion.

Section 5.4 mentions that "With pySHACL even exceeding the full capacity of the hardware (around 120GB) for allocating the complete knowledge graph" (Page 11), but does not provide any explanation for this issue. It is unclear whether it is attributable to the inherent memory overhead of Python object representations, to the specific graph storage structure pySHACL uses internally, or to a validation algorithm that requires materialising large intermediate structures. A brief analysis of the possible cause would substantially improve the interpretability of the results.

6. Ambiguous categorization of shapes in Table 2.

The "Core+SPARQL" column in Table 2 is confusing. The table caption does not clearly define what these counts represent. They presumably follow the statement in Section 3.3 that "It consists of 275 shapes, with 215 SHACL-core (19 are sh:NodeShape, 196 are sh:PropertyShape) and 60 SHACL-SPARQL constraints." However, a node shape or property shape can contain both core constraints and SPARQL constraints. It is therefore unclear whether the counts refer to shapes containing only core constraints versus only SPARQL constraints, or to shapes containing only core constraints versus shapes containing both core and SPARQL constraints.

- Minor Comments

1. Conformance dimension terminology deviates from common usage.

The terms correctness, completeness, and comprehensiveness are used differently from their common meanings. For example, "correctness" here denotes whether an engine supports the constraint components present in the shapes, rather than whether its outputs are semantically correct. These definitions should be stated explicitly in the paper, before the terms are first used, to prevent misinterpretation.

2. Graph characterisation metrics in Table 1 are not interpreted.

Table 1 reports a set of graph characterisation metrics, including graph density, degree centrality, pseudo-diameter, and maximum PageRank, but does not explain their significance in the context of SHACL validation. A brief interpretive note should be added.

3. The passed/partial/fail states are not defined at first use.

Section 3.5 introduces the correctness metrics as "the passed, partial or fail states of execution" without defining what these states mean. The clarification that they follow "the same logic as the official SHACL test suite" appears only in Section 4.2, which is too late. The definitions should be provided in Section 3.5, where the metrics are first introduced.

4. The constraints reported in Tables 5 and 6 are a subset of those tested, without explanation.

Tables 5 and 6 report completeness results for a subset of constraint components including Pattern, MaxCount, MinIn/MaxExclusive, Datatype, and Class, among others. However, several constraint components that appear in shapes such as MinLength and MaxLength are not reported in the tables. The reason for this selection is not provided and should be explained.

5. The bug reports are mentioned only in footnotes.

Section 5.1 mentions that "during this process, multiple bugs were found, reported, and processed by the engine's maintainers", but the issues are not summarized in the paper; a brief summary would be helpful.

6. Large tables are placed after the references without being labelled as appendices.

Tables 5, 6, 7, 8, 9, and 10 appear directly after the reference list without any heading to indicate that they form an appendix.

7. The manuscript uses the sagej.cls template rather than the one required by SWJ.

Overall, this work makes a valuable contribution to SHACL benchmarking through its use of a production KG and real-world SHACL shapes. The metrics, covering correctness, completeness, comprehensiveness, and performance, are well designed, and the practical impact demonstrated through upstream bug reports further strengthens the case for this work. The benchmark is likely to be of interest to SHACL engine developers and to users deploying KG validation with SHACL. Addressing the comments raised above would ensure the benchmark and its evaluation are presented in a rigorous and transparent way, and would help users understand the benchmark.

Review #2
By Paolo Pareti submitted on 17/Mar/2026
Suggestion:
Minor Revision
Review Comment:

Overview:

I really appreciate this article, which presents, in my view, a timely and much-needed practical contribution to SHACL research. I believe this could be an impactful article that other research efforts can reuse and build upon. The technical depth of this study is at a sufficient level for a journal publication. The main (though relatively minor) issue I encountered is that, while the writing is usually good and clear in most sections, it becomes unclear and imprecise at a few key technical points, when discussing very important details of the benchmark creation and evaluation. Given that the focus of this article is on presenting a benchmark, I think it is crucial that this part is explained well, both in terms of the details provided and in terms of readability. The article is not overly long, and I think taking the necessary space to make the description of the benchmark clearer would be worth it. An obvious limitation of this study is that it focusses on a single real-world dataset, and thus the results are likely biased towards this specific use case. However, this is not an issue per se, as it is an acceptable and understandable limitation given the scarcity of real-world SHACL examples in the public domain.

More specific comments:

Page 2. It is not clear to what extent the differences between engines reflect different capabilities as opposed to a lack of compliance. Citations are provided, but a couple of in-text examples of significant differences between some of the mature tools would go a long way toward making this argument convincing.

Page 3. It is not clear what correctness, completeness, and comprehensiveness mean when they are introduced. I understand that this might not be the place for a full definition, but I found these one-line descriptions quite confusing. In particular, the terms correctness and completeness are heavily overloaded, and so more prone to generating misunderstandings. For example, one could expect that correctness means "is every violation detected really a violation" and completeness "has every violation case been detected".

How correctness is actually mapped onto the three Passed, Partial, and Failed values is not explicitly stated in the paper; there is just a reference to the SHACL test suite. While the reference is welcome, at least one sentence giving the intuition of what these three states mean would make the article more self-contained and readable.

Page 3. In the section starting with "The benchmark consists of a suite of 44 unit tests", it is confusing that a number is given for the tests, a number (though not in digit format) is given for the experiments, and no number is given for the size of the compliance-report shapes set.

Page 4. I appreciate the effort to give an idea of the size of the shapes, but the way it is written can be a bit misleading. It seems to conflate the notions of shape and constraint, as the 215+60 "constraints" add up to the 275 "shapes". It might be better to say that 215 shapes have core constraint components, while 60 contain SHACL-SPARQL components.

It would be useful to state how many constraint components an average shape has. From a quick look at the actual shapes, it looks like most of them have one or two constraint components, which would give a better idea of how complex a shape is.

Another important piece of information to share in the text is how many of these shapes have targets. As the entry points for SHACL evaluation, targets give a good indication of how the evaluation actually happens, sometimes better than the number of shapes. In fact, one could obtain an artificially high number of shapes simply by splitting constraints that could have been included in a single shape across multiple ones.

It is stated that Correctness is evaluated both manually and automatically. It wasn’t clear to me why and how.
* When is it manually evaluated and when automatically?
* How does the automatic evaluation work? There are some notes in section 3.3 about how certain shapes were chosen for this task, but it is unclear which ones, and why. Also, how was the ground truth (i.e. the expected output) obtained, and how do we know it’s correct?
* How did manual evaluation work? Who evaluated it and how?

Regarding completeness, the authors might want to consider a way to summarise/aggregate the results, maybe visually, to convey the relative completeness of the tools. I appreciate having the details in the tables, but unlike Table 4, which presents a very clear visual overview of the correctness results, Tables 5 and 6 are difficult to read. The rationale of having to compare the "relative" completeness of each report, as a ground truth would be too difficult to obtain, is acceptable in my view. But I would still like to see this relative comparison shown explicitly. Comparing large numbers in textual format across multiple rows of multiple tables is not easy.
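One lightweight aggregation that would make the relative comparison explicit: normalise each engine's violation count per constraint by the per-constraint maximum across engines, then average. A sketch with hypothetical numbers (engine and constraint names are invented, not taken from the paper's tables):

```python
def relative_completeness(table: dict) -> dict:
    """table maps engine -> {constraint: violation count}. For each engine,
    return the mean of its count divided by the per-constraint maximum,
    i.e. a single relative-completeness score in [0, 1]."""
    constraints = next(iter(table.values()))
    scores = {}
    for engine, counts in table.items():
        ratios = []
        for c in constraints:
            top = max(e[c] for e in table.values())
            ratios.append(counts[c] / top if top else 1.0)
        scores[engine] = sum(ratios) / len(ratios)
    return scores

# Invented counts, not real benchmark numbers.
table = {"engineA": {"Pattern": 100, "MaxCount": 50},
         "engineB": {"Pattern": 100, "MaxCount": 40}}
print(relative_completeness(table))
```

A bar chart of such scores per shapes graph would convey at a glance what Tables 5 and 6 currently require row-by-row comparison to see.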

The results for comprehensiveness are also hard to read, as they are relegated to a sideways table after the main text. I think it would be easy to provide an aggregate of the results, for example by simply counting the instances of passed or failed checks.

While GitHub is an acceptable option, I would suggest a permanent repository such as Zenodo, as long-term availability of this repository is crucial to ensure this benchmark can actually be reused.