Review Comment:
TL;DR:
- Very important idea and promising implementation with some serious limitations
- Relatively well documented.
- Good adoption, but the tool should be demonstrated with one of the use cases in more detail.
- The capabilities and limitations are not adequately described. Reformulating and substantiating the requirements may help.
# (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).
The system aims to address an important gap in the lifecycle of knowledge graphs. A configurable framework supporting the entire lifecycle of knowledge graphs would be a great contribution to the literature, so the importance of the system (or at least of its intention) is quite high. However, I have some rather major points about the presented tool:
- In the abstract, the authors claim that Helio reduces the effort required to perform knowledge graph lifecycle tasks. This claim is not supported by any qualitative or quantitative evaluation. By this, I do not necessarily mean a table of numbers about the performance of the tool; a demonstration with a "real-world" use case would be very effective (see also my next point).
- The primary argument for the tool's impact is that it satisfies a list of requirements and has been adopted by various projects and academic works. The second part of the argument is substantiated to some extent by a large list of projects using the tool, but all appear to be research projects in which the authors (or their research groups) are involved. If there are use cases outside this circle, they should be stated prominently, as adoption beyond the authors' own environment turns out to be an important criterion. It would also be beneficial to describe one use case in detail to demonstrate the tool's impact on implementing the entire lifecycle; the DELTA use case looks like a good candidate for this.
- A rather subjective point: the first paragraph of the introduction gives the impression that KGs can only be expressed in the RDF data model, which rules out the significant body of work on property graphs.
# (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
Overall the paper is very well written and easy to follow.
There are some issues with the description of the requirements, i.e., the first part of the impact argument mentioned above. The requirements are not properly substantiated in terms of their source: do they come from the use cases provided in the paper? Moreover, their wording is quite strong, and I was not completely convinced that the tool covers all of them as claimed. For example, R06 gave me the impression that the tool allows plugging in different mapping engines, but in fact it just channels externally created RDF data into a triplestore. Similarly, R11 says that the system must support at least one mechanism for various knowledge curation tasks, but the tool does not actually provide any such mechanism intrinsically; it merely exposes a SPARQL endpoint through which other SPARQL-capable tools can interact with the data. The formulation of the requirements also affects the literature review, which leaves out curation tools completely because they are not part of the system. Moreover, many triplestores have SHACL support, which would make them satisfy R11 to some extent; the sketch below illustrates this point.
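To make the R11 point concrete: any store or library with SHACL support already provides a basic curation mechanism. Below is a minimal sketch using the Python libraries rdflib and pyshacl (my own choice for illustration, not something Helio ships); all namespaces, shapes, and data are invented for the example.

```python
# Minimal SHACL-based curation check, assuming rdflib and pyshacl
# (pip install rdflib pyshacl). All URIs, shapes, and data below are
# invented for illustration.
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:sensor1 a ex:Sensor ;
    ex:hasReading "not-a-number" .   # violates the shape below
""", format="turtle")

shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:SensorShape a sh:NodeShape ;
    sh:targetClass ex:Sensor ;
    sh:property [
        sh:path ex:hasReading ;
        sh:datatype xsd:decimal ;
    ] .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: the reading is not an xsd:decimal
print(report)     # human-readable validation report
```

If a triplestore runs such shapes natively, it covers much of the same curation ground R11 describes, which is why the requirement needs a sharper formulation.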
Overall, I think the capabilities and especially the limitations of the tool are not adequately defined at this stage:
* The potentially interesting parts, like linking with different knowledge graphs, are only very briefly mentioned. Does the tool also allow linking with knowledge graphs outside Helio's control? How does this work? These points could be explained in more detail. The focus is too heavily on creation, which I think misses the intention of the tool.
* How curation tools are integrated into the lifecycle is not described. How is this different from running them on a triplestore with a SPARQL endpoint? Can I run them periodically, or after an RDF import? (See the first sketch after this list.)
* No mention of how conflicting property values are handled during generation, e.g., when an incoming instance carries a URI that already exists in the KG (see the second sketch after this list).
* No mention of quality assessment (which in my opinion is an important part of the lifecycle)
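On the curation-integration point above: from the paper alone, it is unclear what Helio adds beyond what any SPARQL 1.1 endpoint already permits. A hedged sketch of what I mean, using Python's requests against a hypothetical endpoint (the URL, query, and trigger logic are all assumptions for illustration):

```python
# Running a curation check against a plain SPARQL endpoint via the
# standard SPARQL 1.1 Protocol. The endpoint URL and query are
# hypothetical; any SPARQL-capable curation tool could do the same.
import requests

ENDPOINT = "http://localhost:8080/sparql"  # hypothetical Helio endpoint

# Find resources typed as ex:Sensor that lack a mandatory property.
CURATION_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?s WHERE {
    ?s a ex:Sensor .
    FILTER NOT EXISTS { ?s ex:hasReading ?v }
}
"""

def run_curation_check():
    resp = requests.post(
        ENDPOINT,
        data={"query": CURATION_QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print("incomplete resource:", row["s"]["value"])

# Whether this can be scheduled periodically, or hooked to run right
# after an RDF import, is exactly what the paper leaves unanswered.
run_curation_check()
```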
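And on the conflicting-values point: with a naive graph merge, a regenerated instance that reuses an existing URI silently accumulates contradictory statements. A small rdflib sketch of the situation the paper leaves unaddressed (all data invented):

```python
# What happens when generation emits a URI that already exists in the
# KG? With a plain graph merge, conflicting values simply pile up.
from rdflib import Graph

kg = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:building1 ex:height "42.0" .
""", format="turtle")

# A later generation run emits the same subject with a new value.
kg.parse(data="""
@prefix ex: <http://example.org/> .
ex:building1 ex:height "45.5" .
""", format="turtle")

# Detect subjects with more than one value for a property that should
# be functional. Does Helio keep, replace, or flag such conflicts?
conflicts = kg.query("""
PREFIX ex: <http://example.org/>
SELECT ?s ?v1 ?v2 WHERE {
    ?s ex:height ?v1, ?v2 .
    FILTER (?v1 < ?v2)
}
""")
for s, v1, v2 in conflicts:
    print(f"conflict on {s}: {v1} vs {v2}")
```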
Some other minor issues:
* Page 3, lines 45-47: how does embedded RDF data in an HTML page target humans and not machines? Search engines extract such annotations from web pages all the time (a short sketch after this list illustrates the point).
* Linking is presented first as part of the curation task, then as part of the creation module.
* Page 6, line 36: why "despite"?
* Page 6, line 29: Stardog is also a triplestore.
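On the page 3 remark above: embedded structured data is first-class machine input. As a quick illustration, the payload below is the kind of schema.org JSON-LD that sits in a `<script type="application/ld+json">` tag and that search engines routinely harvest (content invented; assumes rdflib >= 6, which bundles a JSON-LD parser):

```python
# Embedded structured data is trivially machine-readable: a few lines
# of Python turn a page annotation into plain RDF triples.
from rdflib import Graph

# Invented example in the style of schema.org annotations; the inline
# @vocab context avoids fetching a remote context document.
embedded_annotation = """
{
  "@context": { "@vocab": "http://schema.org/" },
  "@type": "Product",
  "name": "Example Widget",
  "price": "9.99"
}
"""

g = Graph().parse(data=embedded_annotation, format="json-ld")
for s, p, o in g:
    print(s, p, o)   # RDF triples, ready for any machine consumer
```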
# (3) Assessment of the data file provided by the authors under “Long-term stable URL for resources”.
## (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
The tool is very well documented, and the toy examples make it easy to get started.
## (B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
One important part is missing, though: the use cases. Use case descriptions are listed as "to do", yet in my view they are the most important material for verifying the arguments made by the authors.
## (C) whether the chosen repository, if it is not GitHub, Figshare, or Zenodo, is appropriate for long-term repository discoverability,
GitHub is used.