Review Comment:
In this paper, the authors observe the steady take-off of cultural heritage knowledge graphs (CH KGs) and examine the typical problems of using them in more general contexts. In particular, they argue that KGs are hard to query with SPARQL. To address this, they propose a novel approach for exploiting CH KGs through virtual assistants (VAs), formulated as a KGQA (knowledge graph question answering) task. To this end, they propose a CH KGQA methodology for VAs, where the goal is to enable direct, voice-driven question answering for users over CH KGs, which are typically exposed through SPARQL endpoints. In addition, they propose a lightweight framework for developers to build VA interfaces for KGQA both automatically and manually. After an extensive survey on the publication of CH KGs, the authors evaluate their KGQA-over-VA approach quantitatively, through the QALD-9 benchmark challenge, and qualitatively, through a number of use cases that highlight VA-driven user interaction with CH KGs behind SPARQL endpoints.
The paper is well written and relevant to the special issue in (i) providing an extensive survey of the CH KGs published on the Web so far; and (ii) providing a methodology for interrogating CH KGs automatically and manually via voice commands, instead of the canonical, expert-based execution of SPARQL queries. In this sense, the paper is highly novel and addresses a problem that is especially profound in CH and DH, given the traditional need of these fields for more supportive tools. The KGQA approach is based on the survey and work by [38], which is the reference in question answering over Linked Data (QALD). The survey of published CH KGs is also an interesting and much-needed contribution.
However, this survey also raises the first of the paper's important issues, which are the following:
- Survey. It is completely unclear why Section 3 is needed, in light of the contents of the rest of the paper. How is this extensive survey needed for, or related to, testing the KGQA approach proposed later on? If the purpose was to gather CH KG data for the experiments, any subset would have been a perfectly valid option. Performing such a large-scale survey opens up many questions that have very little to do with the (already challenging) task of enabling automated CH KG QA over VA. For example: why was [34] (which is quite outdated) chosen as a method? What drives the methodology of the section as a whole? Why is the availability of data in SPARQL endpoints deemed so important? E.g., the LOD-a-lot dump [3] is also queryable through SPARQL. An alternative would have been to simply consider a CH aggregator, e.g. Europeana. I think a justification for setting aside these alternatives, as well as a motivation for the survey itself, is very much needed in the paper.
- Methodology. The authors should clarify whether the methodological framework of the paper is taken entirely from, e.g., [38], partly adapted from it, or a proper original contribution of the paper. In addition, I think the paper spends a great deal of effort justifying technical choices, very much in a technical-report style, but lacks a deeper intellectual justification of those choices. For example: in general I agree that focusing on Alexa as the VA works without loss of generality, but what motivates this choice versus plausible alternatives? On the same note: how do the various VA services achieve interoperability on the "skills" VA extensions? Is this a standard, an ontology? A small paragraph clarifying this would be very useful. It is also mentioned that "few CH KGs are provided with APIs", and this is not entirely true. Tools like [1] make this very easy and have been used to publish a great number of KG APIs, in particular CH KGs (see, e.g., [2]); a sketch of how such a tool works is given after this list.
- Evaluation. The evaluation is perhaps the most concerning part of the paper in its current state. Regarding the design of the Q questions: where do these come from? If they are research questions: why and how are they relevant for research in KGQA, and in KGQA over VA in particular? In Q1: what does "comparable" mean? Is it response time, accuracy of results, coverage? This should be clarified. Besides this, an analysis of the results at the end of Section 6, and more specifically an interpretation of Tables 9 and 10, is missing and very much needed. How do these results answer the posed questions? What implications does the evidence have for how we do KGQA (and specifically for CH and over VA)? The same point applies to Section 7. It is interesting to learn the details and configurations of the use cases, but without a commentary on the results (preferably coming from the users/recipients of the use cases themselves) it is very hard to see how this section provides any relevant evidence. Moreover, beyond these points, I am not entirely sure these are the relevant experiments to run in light of the proposed contribution: this is not just a KGQA system, but a CH KGQA over VA system. As such, shouldn't the evaluation test the CH and VA aspects of the system, via e.g. user surveys, rather than the KGQA aspect, which is apparently not the main contribution of the paper?
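To illustrate the API point raised under Methodology: below is a minimal sketch (not taken from the paper) of how a tool like grlc [1] turns a plain SPARQL query stored in a GitHub repository into a Web API. The file name, endpoint URL and Dublin Core vocabulary are illustrative assumptions on my part; the "#+" decorator comments follow grlc's documented convention:

    # objects.rq -- a grlc-decorated SPARQL query (file name and endpoint are placeholders)
    #+ summary: Titles of objects in a cultural heritage collection
    #+ endpoint: https://example.org/sparql
    #+ method: GET

    PREFIX dc: <http://purl.org/dc/elements/1.1/>

    SELECT ?object ?title
    WHERE {
      ?object dc:title ?title .
    }
    LIMIT 100

Once such a file is pushed to a repository, grlc serves it as a GET operation of an auto-generated API (e.g., at /api/<user>/<repo>/objects), which is exactly the kind of low-cost API publication over CH KGs that the claim "few CH KGs are provided with APIs" overlooks.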
Minor issues:
- p4, line 40 “they query proprietary”
- p5, line 19 undefined reference
- p12, line 37 “you have to define” -> “one has to define” (and similar subsequent informal uses of “you”)
- Fig. 5 is really small; consider spanning it across both columns
- The title suggests something different, namely portable-device (e.g. smartphone) applications. Although smartphones do include VAs, that is not the only context in which VAs operate
- QALD should be referenced, at least the first time it is mentioned
[1] Meroño-Peñuela, A. and Hoekstra, R., 2016, May. grlc makes GitHub taste like linked data APIs. In European Semantic Web Conference (pp. 342-353). Springer, Cham.
[2] https://github.com/CLARIAH/wp4-queries
[3] Fernández, J.D., Beek, W., Martínez-Prieto, M.A. and Arias, M., 2017, October. LOD-a-lot: A Queryable Dump of the LOD Cloud. In International semantic web conference (pp. 75-83). Springer, Cham.