Helio: a framework for implementing the life cycle of knowledge graphs

Tracking #: 2984-4198

Authors: 
Andrea Cimmino
Raúl García-Castro

Responsible editor: 
Aidan Hogan

Submission type: 
Tool/System Report
Abstract: 
Building and publishing knowledge graphs (KG) as Linked Data, either in the Web or in private companies, has become a relevant and crucial process in many domains. This process requires that users perform a wide number of tasks conforming the life cycle of a KG and these tasks usually involve different unrelated research topics, such as RDF materialisation or link discovery. There is already a large corpora of tools and methods designed to perform these tasks; however, the lack of one tool that gathers them all leads practitioners to develop ad-hoc pipelines which are not generic, and thus, non re-usable. As a result, building and publishing a KG is becoming a complex and resource consuming process. In this paper a generic framework called Helio is presented. The framework aims at covering a set of requirements elicited from the KG life cycle and providing a tool capable of performing the different tasks required to build and publish KGs. As a result, Helio reduces the effort required to perform this process and prevents the development of ad-hoc pipelines. The Helio framework has been applied in a wide number of contexts, from European projects to research work.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 12/Jan/2022
Suggestion:
Minor Revision
Review Comment:

--------
Overview
--------

The paper proposes the Helio framework as a standard pipeline that generalizes the building and publishing of KGs around the following main tasks: KG Creation, Hosting, Curation, and Deployment. The paper elicits a set of core requirements for the Helio framework around the KG lifecycle, which are also shown to be incorporated. The usage of Helio in various endeavors, including research projects, scholarly articles, and bachelor theses, shows its impact and usefulness in the community. Such a framework is also the first of its kind.

---------
Strengths
---------

*It explicitly elicits the stages of knowledge graph creation and concretizes them as software that implements the pipeline. This work is, in a sense, the first of its kind.

*The pipeline handles the following four stages: Knowledge graph creation; Knowledge graph hosting; Knowledge graph curation; and Knowledge graph deployment. Each step carefully addresses and satisfies the set of requirements elicited in prior work [1] from its survey-based observations of the KG life cycle.

*The software is quite user-friendly even though it handles a set of tasks that have generally been implemented via ad-hoc implementations in the community. This goes to show that the authors have surveyed the field well in order to engineer and design the essential characteristics of this software.

------------------------
Questions to the authors
------------------------

*Could the input file formats that hold the specifications for the various modules be explicitly specified somewhere? It would be beneficial to see a few samples (maybe in an appendix) of use cases where one module is already satisfied, i.e., perhaps no data conversion is needed and just one other module needs to be implemented, e.g., data publishing.

*Are there modules with one-way dependencies, where one module could work without the other but not the other way around? E.g., data hosting and data publishing, where data hosting would work without data publishing, but not vice versa.

*Are there some modules in Helio that are mandatory to specify?

*Are there unit tests in place to check that the specifications provided conform to those programmed in Helio? If so, a new section discussing the Helio Test Suite would also be very informative to the reader in my view.

-----------------
Typos and writing
-----------------

The paper needs to be proofread for the final version. I noted a few typos in the text; some of them are listed below.

Line 38, column 1, page 1: “growth” -> “grown”
Line 6, column 1, page 2: “pipe lines” -> “pipelines”
Line 41, column 1, page 2: “relayed” -> “relied”
Line 25, column 2, page 2: the semicolon seems misplaced
Line 1, column 1, page 3: “during” -> “During”
Line 8, column 1, page 3: “practitioners” -> “Practitioners”
Note that Line 3, column 1, page 7 reads “Hosting Module,” but Figure 2 reads “Host Module.” Similarly, Line 12, column 1, page 7 reads “Curation Module,” but Figure 2 reads “Curator Module.” My suggestion would be to make the corresponding labels match.
Line 39, column 2, page 8: “uniquely” -> “unique”
I stop here, but again my recommendation to the authors would be to have their paper proofread thoroughly for minor grammar errors and typos.

----------
References
----------
1. Fensel, Dieter, et al. Knowledge Graphs. Springer International Publishing, 2020.

Review #2
Anonymous submitted on 12/Jan/2022
Suggestion:
Minor Revision
Review Comment:

The authors present the Helio framework for building and publishing KGs as Linked Data. The framework is built on several requirements that establish the life cycle of KGs, meeting these requirements and also allowing practitioners to publish KGs following the Linked Data principles. Furthermore, the framework includes a plugin system that prevents the generation of ad-hoc, non-reusable code when addressing novel challenges identified in new scenarios.

--This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

As discussed in the Discussion section of the paper, the tool seems to have had impact already and has been adopted. The tool is clearly of importance, and the work is of high quality. On criterion (2), I think the paper could use some proofreading. In particular, too much passive voice is used, which can sometimes make the writing hard to follow. Also, there is the occasional typo or grammatical error, e.g., 'cicle' rather than 'cycle' in the conclusion, or phrases like 'conforming the life cycle' in the abstract. Therefore, I encourage the authors to correct these problems and proofread the article as a minor revision.

Finally, although the authors explain the lack of experimentation, I would like to see (perhaps in a future work section) how this could potentially be addressed.

--Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The long-term stable link to resources is adequate from what I have been able to see.

Review #3
By Umutcan Serles submitted on 28/Jan/2022
Suggestion:
Major Revision
Review Comment:

TL;DR:

- Very important idea and promising implementation, with some serious limitations.
- Relatively well documented.
- Good adoption, but the tool should be demonstrated with one of the use cases in more detail.
- The capabilities and limitations are not adequately described. Reformulating and substantiating the requirements may help.

# (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).

The system aims to address an important gap for the lifecycle of knowledge graphs. A configurable framework for supporting the entire lifecycle of knowledge graphs would be a great contribution to the literature. So the importance of the system (at least the intention of the system) is quite high. However, I have some rather major points I'd like to make about the presented tool:

- In the abstract, the authors make a claim that Helio reduces the effort required to perform knowledge graph lifecycle tasks. This is not supported by any qualitative or quantitative evaluation. By this, I do not necessarily mean a table with numbers about the performance of the tool. Its demonstration with a "real-world" use case would be very effective (see also my next point).

- The primary argument for the tool's impact is that it satisfies a list of requirements and is adopted by various projects and academic works. The second part of the argument is substantiated to some extent with a large list of projects making use of the tool. All appear to be research projects in which the authors (or their research groups) are involved. If there are use cases outside of this circle, they should be prominently stated, as this turns out to be an important criterion. I think it would be beneficial to describe one use case in detail to demonstrate the impact of the tool for implementing the entire lifecycle. The DELTA use case looks like a good candidate for this.

- A rather subjective point: the first paragraph of the introduction gives the impression that KGs can only exist in the RDF data model, which rules out the significant work on property graphs.

# (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

Overall the paper is very well written and easy to follow.

There are some issues with the description of the requirements, the first part of the impact argument mentioned above. These requirements are not properly substantiated in terms of their source. Are they coming from the use cases provided in the paper? Moreover, the wording of the requirements is quite strong and I was not completely convinced that the tool covers all of them as claimed. For example, R06 gave me the impression that the tool allows plugging in different mapping engines, but in fact it just channels externally created RDF data into a triplestore. Similarly, R11 says that the system must support at least one mechanism for various knowledge curation tasks, but the system does not actually provide any such mechanism intrinsically; instead, it provides a SPARQL endpoint that allows interaction with other tools that support SPARQL. The formulation of the requirements also affects the literature review, as it leaves out curation tools completely because they are not part of the system. Moreover, many triplestores have SHACL support, which would make them satisfy R11 to some extent.
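To make this point concrete, here is a minimal sketch of what such endpoint-based curation could look like with generic off-the-shelf tooling (SPARQLWrapper and pySHACL; the endpoint URL and shapes file are hypothetical placeholders, and none of this is Helio-specific API). Any store that exposes a SPARQL endpoint can be curated this way, which is why the requirement as worded does not clearly distinguish Helio from a plain triplestore:

```python
# Hypothetical sketch: SHACL-based curation against any SPARQL endpoint,
# using generic tooling (SPARQLWrapper + pySHACL), not Helio's own API.
from SPARQLWrapper import SPARQLWrapper, RDFXML
from pyshacl import validate
from rdflib import Graph

ENDPOINT = "http://localhost:8080/sparql"  # placeholder endpoint URL

# Pull the (sub)graph to be curated out of the store.
sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10000")
sparql.setReturnFormat(RDFXML)
data_graph = sparql.queryAndConvert()  # rdflib.Graph for CONSTRUCT queries

# Validate against externally maintained SHACL shapes.
shapes = Graph().parse("shapes.ttl", format="turtle")  # placeholder shapes
conforms, _, report_text = validate(data_graph, shacl_graph=shapes)
print(report_text if not conforms else "Graph conforms to the shapes.")
```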

Overall, I think the capabilities and especially limitations of the tool are not adequately defined at this stage:

* The potentially interesting parts like linking with different knowledge graphs are very briefly mentioned. Does it also allow linking with knowledge graphs that are outside of the influence of Helio? How does it work? These can be explained a bit more. The focus is too much on creation, which I think misses the intention of the tool.

* How the curation tools are integrated into the lifecycle is not described (how is it different from running them on a triplestore with a SPARQL endpoint? Can I run them periodically, or after an RDF import?).

* No mention of what to do with conflicting property values during generation (in case an instance comes with a URI that already exists in the KG).

* No mention of quality assessment (which in my opinion is an important part of the lifecycle).

Some other minor issues:

* Page 3, lines 45-47: How is RDF data embedded in an HTML page targeting humans and not machines? Search engines extract annotations from web pages all the time.

* Linking is presented first as part of the curation task, then as part of the creation module.
* Page 6, line 36: Why "despite"?
* Page 6, line 29: Stardog is also a triplestore.

# (3) Assessment of the data file provided by the authors under “Long-term stable URL for resources”.

## (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,

The tool and the toy examples for getting started are very well documented.

## (B) whether the provided resources appear to be complete for replication of experiments, and if not, why,

There is an important part missing, though: use cases. Use case descriptions are listed as "to do", which I think are the most important material for verifying the arguments made by the authors.

## (C) whether the chosen repository, if it is not GitHub, Figshare, or Zenodo, is appropriate for long-term repository discoverability,

GitHub is used.