Review Comment:
This paper presents the process followed to improve the ShExML processor and overcome its performance limitations in KG construction. It first presents the architecture of the engine, then identifies the bottlenecks, explains how they are solved, and evaluates the successive versions in which the bottlenecks are addressed to show how performance improves from version to version.
In general, I find the paper a good contribution to the issue, and I especially appreciate the statistical analysis in the evaluation, which is often overlooked in this type of paper. All resources are available and properly published in GitHub releases and Zenodo. However, there are several aspects in which the paper needs improvement, which I explain in detail below.
--- 2. Related work ---
I feel that, in a way, this section misses the point. While it highlights where the ShExML engine has been evaluated against other engines, it omits papers presenting optimizations for KGC in similar engines (see, for instance, [1-3]). Most papers presenting new or improved engines in this field include a comparison with other engines, so the selection made in this section is somewhat narrow, the criteria are unclear, and the order is confusing. For instance, the first and third paragraphs discuss engine comparisons, while the second and fourth discuss benchmarks. Paper [5], which presents an engine evaluation, is cited elsewhere in the paper, yet for some reason it is not treated as relevant for this section. I would suggest reordering the section and enriching it with more similar papers, of which there are plenty. As a side note, the challenge in the KGCW workshop has two editions now: https://w3id.org/kg-construct/workshop/2024/challenge
[1] Arenas-Guerrero, J., Chaves-Fraga, D., Toledo, J., Pérez, M. S., & Corcho, O. (2024). Morph-KGC: Scalable Knowledge Graph Materialization with Mapping Partitions. Semantic Web, 15, 1–20.
[2] Iglesias, E., Jozashoori, S., & Vidal, M.-E. (2023). Scaling up Knowledge Graph Creation to Large and Heterogeneous Data Sources. Journal of Web Semantics, 75, 100755.
[3] Iglesias, E., Vidal, M., Jozashoori, S., Collarana, D., & Chaves-Fraga, D. (2022). Empowering the SDM-RDFizer tool for scaling up to complex knowledge graph creation pipelines. Semantic Web.
--- 3. ShExML engine algorithm ---
This section could be improved in terms of clarity. The author takes too many concepts for granted, which makes the section hard to follow. For instance, concepts like the “pipes and filters architectural pattern” or ANTLR are not explained, and neither the text nor Figure 1 makes explicit what the inputs and outputs of the process are (is it the data, the mapping, both, or one or the other depending on the component?).
On reaching Section 3.2, I realized that the concept of “shape” in ShExML mappings has not been explained. Perhaps the paper could add a brief background section explaining the basics of the ShExML language (as many [R2]RML papers do for their respective languages), so that the reader can better understand the implications and components of the language.
I like the example in the listings; it really helps to understand how the engine processes data. I also think it would be even better to add another listing with the input JSON data file from the link mentioned in Listing 1, so that the reader does not have to look it up manually.
--- 4. Profiling the ShExML engine and performance improvements ---
As with the previous section, this one could also be clearer in its descriptions, starting by explaining what a profiling methodology is and why it is suitable. My major concern, however, is with the presentation of the bottlenecks, about which I have mixed feelings. On the one hand, it makes sense to present the changes per version, given the evaluation presented in the next section. On the other hand, the paper claims throughout that the improvements could help other engines improve. However, since the descriptions of the bottlenecks and solutions are encapsulated within the versions, they are harder to distinguish. The claim is fair and solid, as these bottlenecks may apply to other engines. That is why I would recommend rearranging the section to properly describe each bottleneck and the proposed solution, and only then link each one to the version in which it is addressed. This way the paper can prove more useful to other practitioners, because at present it is more “engineerish” and focused on the versions rather than on the problems themselves. I believe this rearrangement and change of focus can also enrich how the results are discussed and presented.
--- 5. Evaluation ---
In general, the evaluation seems solid; I only have a few remarks.
- It would be useful to include the reasons for choosing the statistical tests presented in the paper. They are not that common, so it is worth explaining the particularities that make them the most suitable for this case.
- Has the author considered measuring not only the performance time, but also the CPU and RAM usage?
- I also miss a general reflection on which bottlenecks were the most critical, or which fixes were the most beneficial in terms of the balance between effort and improvement.
- Is the engine, after the improvements, now competitive with similar KGC engines? I understand that the scope of the paper is to check that the optimizations work with respect to previous versions, but it would also be beneficial to check how it now performs with respect to state-of-the-art engines.
Minor:
- Figure 1 could use space more efficiently and make the text bigger; adding numbers to the steps might also help the reader follow the process in the text.
- Section 3.2, composeIterationQuery: what happens with tabular data? It may be worth explaining this as well, rather than focusing only on hierarchical data.
- Tables 1 and 2 could be box or violin plots, which would make visual comparison easier, although the tables give all the information in any case. The engine names can be shortened to the version only (ShExML-v0.3.2.jar --> v0.3.2), as the rest is the same in all of them.
- Throughout the paper, sentences are generally too long, and commas are often missing.