Review Comment:
The paper presents the ERA-SHACL-Benchmark, a benchmark for evaluating in-memory SHACL validation engines. The benchmark assesses engine performance (load time, validation time, memory) and conformance (correctness, completeness, comprehensiveness). It builds on the European Union Agency for Railways (ERA) KG together with the corresponding SHACL shapes. These shapes comprise 275 SHACL-Core and SHACL-SPARQL shapes used to evaluate the performance, completeness, and comprehensiveness dimensions, while another 44 unit-test shapes extracted from the 275 are used for correctness evaluation. Eight open-source engines are evaluated, and the benchmark is released on GitHub. This work is a helpful addition to existing SHACL engine benchmarking and will likely be of interest to the SHACL engine development community. The use of a production KG with real-world SHACL shapes addresses a gap: existing benchmarks rely on synthetic or partial datasets and largely ignore conformance. The discovery and upstream reporting of multiple engine bugs further demonstrates the practical value of this work.
- Strengths
1. Real-world KG and SHACL shapes. Unlike all prior benchmarks reviewed in Section 2, ERA-SHACL-Benchmark uses a full production KG together with its SHACL shapes.
2. Multi-dimensional conformance assessment. The metrics, including correctness, completeness, and comprehensiveness, add value over purely performance-oriented benchmarks.
3. Breadth of engines covered. The benchmark evaluates eight engines, offering representative coverage of existing SHACL engines.
4. Direct community impact. Five bug reports filed with and acknowledged by upstream engine maintainers demonstrate that the benchmark already has a practical impact.
5. Public availability. All benchmark resources are publicly available on GitHub and well organized, with a README file.
- Major Comments
1. Incomplete coverage of SHACL target declarations and lack of systematic core constraint coverage report.
Based on my inspection of the benchmark repository on GitHub, the unit tests used for the correctness dimension involve only sh:targetNode, while the shapes used for the completeness and comprehensiveness dimensions involve sh:targetClass. sh:targetSubjectsOf and sh:targetObjectsOf appear to be absent from the benchmark entirely. This is a substantive gap in coverage that is not adequately discussed in the current version of the paper. The concluding remark that the benchmark is "not exhaustive in terms of constraints variety" is insufficient. The paper needs a systematic account of which core constraint components, including target declarations, are and are not covered, together with a clear and structured discussion of this limitation and an explanation of its impact on the interpretation of the results.
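For context, the two target declarations that appear to be missing could be exercised with shapes along the following lines (a hypothetical Turtle sketch written for this review, not taken from the benchmark; the ex: terms are placeholders):

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# Targets every subject of ex:length, regardless of its class.
ex:LengthSubjectShape a sh:NodeShape ;
    sh:targetSubjectsOf ex:length ;
    sh:property [
        sh:path ex:length ;
        sh:datatype xsd:decimal ;
    ] .

# Targets every object of ex:inCountry.
ex:CountryObjectShape a sh:NodeShape ;
    sh:targetObjectsOf ex:inCountry ;
    sh:class ex:Country .
```

Engines can differ precisely in how they resolve these property-based target declarations, so their absence leaves a whole class of behaviour untested.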
2. Absence of a ground truth for completeness assessment undermines a core claim.
For the results of the completeness assessment, the paper states in Section 4.2 (page 7) that "the large amount of violations prevents us from having a ground truth". This is a significant limitation for the completeness dimension, which is presented as one of the three conformance dimensions. Without a verified expected number of violations for at least a subset of the shapes and KG, the completeness evaluation reduces to a cross-engine consistency check. The paper uses "the most frequent value across engines" as an implicit baseline, but this is not theoretically sound: if all engines share the same bug, the most frequent value is wrong. The paper needs to either (a) provide a manually verified ground truth for a small but representative subset, or (b) reframe the completeness section to state explicitly that it measures consistency rather than true completeness, and discuss the impact of this on interpretation. The current presentation may mislead readers about the rigor of this dimension.
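To make the objection concrete, here is a minimal sketch of the implicit baseline (the engine names and violation counts are hypothetical, not taken from the paper): when a majority of engines share the same bug, majority voting converges on the wrong value and penalises the one correct engine.

```python
from collections import Counter

def majority_baseline(counts):
    """Return the most frequent violation count across engines."""
    return Counter(counts).most_common(1)[0][0]

# Hypothetical scenario: the true number of violations is 120,
# but three engines share the same bug and all report 100.
violation_counts = {
    "engine_a": 100,  # buggy
    "engine_b": 100,  # same shared bug
    "engine_c": 100,  # same shared bug
    "engine_d": 120,  # the only correct engine
}

baseline = majority_baseline(violation_counts.values())
# baseline == 100: the shared-bug value wins, and engine_d,
# the only correct engine, would be flagged as "incomplete".
```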
3. Absence of a shape complexity characterisation.
The experiments are conducted on three shape sets (tds, core, and core+sparql), but the paper provides no quantitative characterisation of the complexity of these shapes. For example, how many JOIN operations are involved in the SHACL-SPARQL constraints? What is the maximum depth of the property paths? How many negations are present? A structured report of such information would help readers understand why certain engines time out on specific combinations of dataset and shapes.
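Even a crude profile of this kind is cheap to produce. The sketch below counts selected constructs by keyword over a Turtle serialization (a deliberately rough stand-in: a proper implementation would parse the RDF graph, and the `sparql_joins` pattern is only a coarse textual proxy for chained triple patterns):

```python
import re

def shape_complexity(turtle_text):
    """Crude complexity profile of a SHACL shapes document in Turtle.

    Counts occurrences of selected constructs by keyword matching;
    a real implementation would parse the RDF graph instead.
    """
    features = {
        "negations":      r"\bsh:not\b",
        "sparql_selects": r"\bSELECT\b",
        "sparql_joins":   r"\.\s*\?",   # rough proxy: chained triple patterns
        "property_paths": r"\bsh:path\b",
    }
    return {name: len(re.findall(pat, turtle_text))
            for name, pat in features.items()}

# Tiny hypothetical shapes fragment.
shapes = """
ex:S a sh:NodeShape ;
    sh:not [ sh:class ex:Tunnel ] ;
    sh:property [ sh:path ex:length ] .
"""
profile = shape_complexity(shapes)
# -> {'negations': 1, 'sparql_selects': 0, 'sparql_joins': 0, 'property_paths': 1}
```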
4. Timeout (TO) and Memory Limit (ML/MV) threshold discussion.
In Figure 2 and Tables 8, 9, and 10, some engines triggered TO or ML/MV during the experiments. The paper should briefly explore where the "crash threshold" lies for those failing engines (e.g., does an engine crash when processing 5M or 20M triples?). This would be helpful to users choosing engines for production use.
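Locating such a threshold need not require many full runs: assuming failure is monotone in dataset size, a bisection over sample sizes suffices. A hypothetical sketch (`validate_at_size` is a stand-in for running an engine on an n-triple sample under the benchmark's time/memory limits):

```python
def find_crash_threshold(validate_at_size, low, high):
    """Binary-search the largest triple count in [low, high] at which
    validate_at_size(n) still succeeds.

    validate_at_size(n) -> bool: True if the engine completes within
    the limits on an n-triple sample; assumed monotone (success at n
    implies success at every smaller size).
    """
    assert validate_at_size(low), "engine fails even at the lower bound"
    while high - low > 1:
        mid = (low + high) // 2
        if validate_at_size(mid):
            low = mid    # still succeeds: threshold is at or above mid
        else:
            high = mid   # fails: threshold is below mid
    return low

# Hypothetical engine that runs out of memory beyond 12M triples.
threshold = find_crash_threshold(lambda n: n <= 12_000_000,
                                 low=1_000_000, high=20_000_000)
# threshold == 12_000_000
```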
5. Lack of explanation for pySHACL memory exhaustion.
Section 5.4 mentions that “With pySHACL even exceeding the full capacity of the hardware (around 120GB) for allocating the complete knowledge graph” (page 11), but does not provide any explanation for this issue. It is unclear whether it is attributable to the inherent memory overhead of Python object representations, to the graph storage structure pySHACL uses internally, or to its validation algorithm requiring the materialisation of large intermediate structures. A brief analysis of the possible cause would substantially improve the interpretability of the results.
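The first of these hypotheses is at least easy to illustrate. The stdlib sketch below compares the shallow in-memory size of a triple stored as Python string objects against its raw text size (the IRIs are placeholders, and this is an illustration of CPython object overhead in general, not a measurement of pySHACL itself):

```python
import sys

# An RDF triple stored as three Python strings inside a tuple.
triple = ("http://example.org/netElement/123",
          "http://example.org/ns/length",
          "1500.0")

# Shallow size of the container plus each component object.
per_triple = sys.getsizeof(triple) + sum(sys.getsizeof(t) for t in triple)
raw_bytes = sum(len(t) for t in triple)

# per_triple exceeds raw_bytes by a sizeable constant per object; with
# index structures (dicts/sets over these objects) on top, and hundreds
# of millions of triples, total memory can plausibly exceed 100 GB even
# when the serialized graph is far smaller.
```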
6. Ambiguous categorization of shapes in Table 2.
The "Core+SPARQL" column in Table 2 is confusing, and the table caption does not define what these counts represent. Presumably they follow Section 3.3, which states: "It consists of 275 shapes, with 215 SHACL-core (19 are sh:NodeShape, 196 are sh:PropertyShape) and 60 SHACL-SPARQL constraints." However, a node shape or property shape can contain both core constraints and SPARQL constraints. It is therefore unclear whether the reported counts distinguish shapes containing only core constraints from shapes containing only SPARQL constraints, or from shapes containing both.
- Minor Comments
1. Conformance dimension terminology deviates from common usage.
The terms correctness, completeness, and comprehensiveness are used differently from their common meanings. For example, "correctness" here denotes whether an engine supports the constraint components present in the shapes, rather than whether its outputs are semantically correct. These definitions should be stated explicitly before the terms are first used, to prevent misinterpretation.
2. Graph characterisation metrics in Table 1 are not interpreted.
Table 1 reports a set of graph characterisation metrics, including graph density, degree centrality, pseudo-diameter, and maximum PageRank, but does not explain their significance in the context of SHACL validation. A brief interpretive note should be added.
3. The passed/partial/fail states are not defined at first use.
Section 3.5 introduces the correctness metrics as "the passed, partial or fail states of execution" without defining what these states mean. The clarification that they follow "the same logic as the official SHACL test suite" appears only in Section 4.2, which is too late. The definitions should be provided in Section 3.5, where the metrics are first introduced.
4. The constraints reported in Tables 5 and 6 are a subset of those tested, without explanation.
Tables 5 and 6 report completeness results for a subset of constraint components, including Pattern, MaxCount, MinIn/MaxExclusive, Datatype, and Class, among others. However, several constraint components that appear in the shapes, such as MinLength and MaxLength, are not reported in the tables. The reason for this selection is not given and should be explained.
5. The bug reports are mentioned only in footnotes.
Section 5.1 mentions that "during this process, multiple bugs were found, reported, and processed by the engine’s maintainers", but the bugs are only referenced in footnotes and never summarized in the paper. A brief summary of the reported issues would be helpful.
6. Large tables are placed after the references without being labelled as appendices.
Tables 5, 6, 7, 8, 9, and 10 appear directly after the reference list without any heading indicating that they form an appendix.
The manuscript uses the sagej.cls template rather than the one required by SWJ.
Overall, this work makes a valuable contribution to SHACL benchmarking through its use of a production KG and real-world SHACL shapes. The metrics, covering correctness, completeness, comprehensiveness, and performance, are well designed, and the practical impact demonstrated through upstream bug reports further strengthens the case for this work. The benchmark is likely to be of interest to SHACL engine developers and to users deploying KG validation with SHACL. Addressing the comments raised above would ensure that the benchmark and its evaluation are presented in a rigorous and transparent way, and would help users understand its scope and limitations.