Review Comment:
Summary:
=======
The paper describes a system that implements the RDF Mapping Language (RML) specification while optimising scalability, improving duplicate handling, and lowering memory usage, with special attention to the processing of complex mappings.
The motivation in Sec.2 explains key issues with performant mapping execution and what is required for a KG creation pipeline. The state of the art of the field is covered in Sec.3, highlighting some of its limitations. Sec.4 describes the contribution proper, namely SDM-RDFizer in its latest version, presenting its architectural components and some of the key algorithms used for optimisation. Sec.5 provides experimental details and results. Sec.6 finally analyses the characteristics of the tool that make it worthwhile.
Main remarks:
============
As a scientific paper, the work is well described, the structure is clear, the state of the art is well covered, the experiments are based on a well-known benchmark, and the conclusions are consistent with the results shown. The technical description of the tool's components and algorithms is a little hard to follow, but with careful attention it is clear enough. The final analysis provides evidence that the work is worth publishing as a "system paper" at SWJ.
As a tool, the files provided are available at Zenodo, which should make them stable. They are well organised, with clear documentation. Installing the tool is straightforward. However, when I executed it with the example files, I got an error saying "FileNotFoundError: [Errno 2] No such file or directory: './files/sampleSource1.csv'" because the "files" directory was not in the current directory but inside the "main_directory" provided in the config.ini file. This should be easy to fix and did not prevent me from being able to test the tool.
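A likely fix would be to resolve source paths against the "main_directory" value from config.ini instead of the current working directory. A minimal sketch follows; the section and key names ("datasets", "main_directory") and the helper function are my own assumptions for illustration, as I have not inspected the tool's actual configuration schema:

```python
import configparser
import os

def resolve_source_path(config_file, relative_source):
    """Resolve a mapping source path against main_directory from
    config.ini, so execution does not depend on the caller's cwd.

    NOTE: the section/key names below are assumptions for
    illustration; the tool's real config schema may differ.
    """
    config = configparser.ConfigParser()
    config.read(config_file)
    main_dir = config["datasets"]["main_directory"]
    # Joining here avoids the FileNotFoundError seen when the tool
    # is launched from outside main_directory.
    return os.path.join(main_dir, relative_source)
```

With this, a source declared as "./files/sampleSource1.csv" would be found regardless of where the tool is launched from.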
However, I only tested the tool on the sample mappings in the data files, which are trivial. The resource files themselves do not describe how to reproduce the experiments of the paper; the instructions can be looked up in a separate Github repository. Reproducing the experiments completely requires a number of operations and manipulations that are not straightforward. Reproducibility relies on the existence and maintenance of two benchmarks that are independent of this paper. This does not perfectly guarantee reproducibility in the long term, but it is acceptable in the medium term, and exact reproduction of these experiments in the distant future may be irrelevant anyway.
Minor comments:
==============
- Intro:
* SPARQL Generate: the "official" spelling is "SPARQL-Generate" with a dash
* last paragraph: "Section[new line]3" -> make space non-breaking "Section~3"
- Sec.2:
* 2.1: "into an RDF knowledge graph data" -> "into an RDF graph"
* 2.2: "referencing to" -> "referring to"
* Fig.2: instead of putting "Time (secs)" at the top of the table, there could be explicit durations in the cells with, e.g. "11,961.81 s". "Timeout" is not a duration in seconds, so this would avoid a semantic mismatch (also, at first, I did not notice the numbers were in seconds)
- Sec.3:
* 3.1: "virtual knowledge graph creation process (formerly known as ontology-based data access)" -> I disagree with this phrasing: OBDA is a different concept from the virtual KG creation process. OBDA involves different things and techniques that may or may not include virtual KG creation (although it is typically part of it)
* 3.1 "MorphCSV [8] propose" -> proposes
- Sec.4:
* 4.3.1: "a hash table where the key is the RDF resource" -> the key is the IRI or literal, not the resource itself (which could be a physical object or an abstract thing)
* 4.3.1: "A PPT" -> "A PTT"
* Algorithm 1, line 3 and line 4; Algo 2, line 3 and 4; Algo 3, line 6 and 7: "PPT" -> "PTT"
- Sec.5:
* 5.2: "The Figure 7a" -> "Figure 7a"
- References:
* Refs 5 and 20 refer to the same paper.
* The level of detail varies greatly from one reference to another. This should be homogenised, and some references should be completed.