Move Cultural Heritage Knowledge Graphs in Everyone's Pocket

Tracking #: 2764-3978

Authors: 
Maria Angela Pellegrino
Vittorio Scarano
Carmine Spagnuolo

Responsible editor: 
Special Issue Cultural Heritage 2021

Submission type: 
Full Paper
Abstract: 
In recent years, we have witnessed a shift from the potential utility of digitization to a crucial need to enjoy activities virtually as an alternative to in-person experiences. The Cultural Heritage domain has invested heavily in digitization campaigns, mainly modeling data as Knowledge Graphs, and has become one of the most successful application domains of Semantic Web technologies. Despite the vast investment in Cultural Heritage Knowledge Graphs, the syntactic complexity of RDF query languages, e.g., SPARQL, negatively affects and threatens data exploitation, risking leaving this enormous potential untapped. Thus, we aim to support the cultural heritage community (and everyone interested in cultural heritage) in querying knowledge graphs without requiring technical skills in Semantic Web technologies. Our goal is an engaging exploitation tool accessible to all, without losing sight of developers' technological challenges. This article first analyzes the effort invested in publishing cultural heritage knowledge graphs to quantify the data on which developers can rely when designing and implementing data exploitation tools in this domain. Moreover, we point out data aspects and challenges that developers may face when exploiting them in automatic approaches. Second, it presents a domain-agnostic knowledge graph exploitation approach based on virtual assistants, as they naturally enable question-answering features where users formulate questions in natural language directly from their smartphones. Then, we discuss the design and implementation of this approach within an automatic community-shared software framework (a.k.a. generator) of virtual assistant extensions and its evaluation on a standard benchmark of question-answering systems. Finally, according to the taxonomy of the cultural heritage field introduced by UNESCO, we present a use case for each category to show the applicability of the proposed approach in the Cultural Heritage domain. In overviewing our analysis and the proposed approach, we point out the challenges that a developer may face in designing virtual assistant extensions to query knowledge graphs, and we show the effect of these challenges in practice.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Jul/2021
Suggestion:
Major Revision
Review Comment:

In this paper, the authors observe the steady take-off of cultural heritage knowledge graphs (CHKG), and look at the typical problems of using them in more general contexts. In particular, they argue that KGs are hard to query with SPARQL. To address this, they propose a novel approach based on exploiting CHKGs through virtual assistants (VA), formulated as a KGQA (knowledge graph question answering) task. Therefore, they propose a CH KGQA methodology for VAs, where the goal is to enable direct, voice-driven QA for users over CH KGs typically exposed through SPARQL endpoints. In addition, they propose a lightweight framework for developers for automatically and manually building VA interfaces for KGQA. The authors, after including an extensive survey on the publication of CHKGs, quantitatively evaluate their KGQA-over-VA approach through the QALD-9 benchmark challenge, and qualitatively through a number of use cases that highlight VA-driven user interaction with CHKGs via SPARQL endpoints.

The paper is well written, and is relevant to the special issue in (i) providing an extensive survey of CHKGs so far published on the Web; and (ii) providing a methodology for interrogating CHKGs automatically and manually via voice commands, instead of the canonical, expert-based execution of SPARQL queries. In this sense, the paper is highly novel and addresses a problem that is especially profound in CH and DH, given that these fields have a traditional need for more supportive tools. The KGQA approach is based on the survey and work by [38], which is the reference work on question answering over Linked Data (QALD). The survey of published CHKGs is also an interesting and much-needed contribution.

However, this survey is also the first important issue of the paper, along with the following ones:

- Survey. It is completely unclear what the need for Section 3 is, in light of the contents of the rest of the paper. How is this extensive survey needed for, or related to, testing the KGQA approach proposed later on? If the purpose was to gather CHKG data for the experiments, just any subset would have been a perfectly valid option. Performing such a large-scale survey opens up many questions that have very little to do with the (already challenging) task of enabling automated CHKG QA over VA. For example, why was [34] (which is quite outdated) chosen as a method? What drives the whole methodology of the Section? Why is the availability of data in SPARQL endpoints deemed so important? E.g., the LOD-a-lot dump [3] is also queryable through SPARQL. An alternative would have been to just consider a CH aggregator, e.g., Europeana. I think a motivation for neglecting all these alternatives, as well as for the survey itself, is very much needed in the paper.

- Methodology. The authors should clarify if the methodological framework of the paper is taken entirely from e.g. [38], or whether it is partly adapted, or is a proper original contribution of the paper. In addition to this, I think the paper spends a great deal of effort in justifying technical choices very much in a technical-report style, but it lacks a deeper intellectual justification of choices. For example: in general I agree that focusing on Alexa as VA works without losing generality, but what is the motivation of this choice vs plausible alternatives? On the same note: how do the various VA services achieve interoperability on the ‘skills’ VA extensions? Is this a standard, an ontology? A small paragraph clarifying this would be very useful. It is also mentioned that “few CH KGs are provided with APIs”, and this is not entirely true. Some tools like [1] make this very easy, and have been used to publish a great deal of KG APIs, in particular CH KGs (see e.g. [2]).

- Evaluation. The evaluation is perhaps the most concerning part of the paper in its current state. Regarding the design of the Q questions: where do these come from? If these are research questions: why and how are they relevant for research in KGQA, and in KGQA over VA in particular? In Q1: what does “comparable” mean? Is it response time, accuracy of results, coverage? This should be clarified. Besides this, an analysis of results at the end of Section 6, and more specifically an interpretation of Tables 9 and 10, are missing and very much needed. How do these answer the posed questions? What implications does the evidence have in terms of how we do KGQA (and specifically for CH and over VA)? The same point applies to Section 7. It is interesting to learn the details and configurations of the use cases, but without a commentary on the results (preferably coming from the users/recipients of the use cases themselves) it is very hard to see how this section provides any relevant evidence. Moreover, beyond these points I am not entirely sure these are the relevant experiments to do in the light of the proposed contribution: this is not just a KGQA system, but a CH KGQA over VA system. As such, shouldn’t the evaluation test the CH and VA aspects of the system, via e.g. user surveys, rather than the KGQA aspect, which is apparently not the main contribution of the paper?

Minor issues:
- p4, line 40 “they query proprietary”
- p5, line 19 undefined reference
- p12, line 37 “you have to define” -> “one has to define” (and similar subsequent informal uses of “you”)
- Fig. 5 is really small; consider spanning it across both columns
- The title suggests something different, towards portable device (i.e. smartphone) applications. Although smartphones do include VAs, that is not the only context in which VAs operate
- QALD should be referenced, at least the first time it is mentioned

[1] Meroño-Peñuela, A. and Hoekstra, R., 2016, May. grlc makes GitHub taste like linked data APIs. In European Semantic Web Conference (pp. 342-353). Springer, Cham.
[2] https://github.com/CLARIAH/wp4-queries
[3] Fernández, J.D., Beek, W., Martínez-Prieto, M.A. and Arias, M., 2017, October. LOD-a-lot: A Queryable Dump of the LOD Cloud. In International semantic web conference (pp. 75-83). Springer, Cham.

Review #2
Anonymous submitted on 14/Jul/2021
Suggestion:
Major Revision
Review Comment:

SWJ criteria/advice for evaluation:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Evaluation:

This research paper presents a system for NL question answering related to different kinds of cultural heritage content. The goal is to create a question answering (QA) system for the public that understands natural language queries; the authors argue that learning complex query languages such as SPARQL is then not needed.

Section 1 introduces the research domain and the goal of the paper. After this, related work on QA is discussed (Section 2).

Section 3 presents a substantial analysis of the CH linked data repositories available, based on manually inspecting 710 datasets registered in Datahub (https://datahub.io/).

In Section 5, the idea of mapping Amazon Alexa QA system intents to SPARQL templates is presented. Eight question types, listed in Table 8, have been implemented. An open question is to what extent this approach generalizes to real questions posed by end users in practice. Section 6 extends the idea by presenting an approach to automatically customizing/extending the system for new use cases. This section remains a bit generic, as no real examples of using the system are given.
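[For readers unfamiliar with the intent/template pattern the review refers to, the following is a minimal, hypothetical sketch of how a recognized intent and its slots could be turned into a SPARQL query against an endpoint. The intent names, templates, slot values, and endpoint URL are illustrative assumptions, not the paper's actual implementation.]

```python
# Hypothetical sketch of the intent-to-SPARQL-template idea summarized above.
# The intent names, templates, slot values, and endpoint URL are illustrative
# assumptions, not the paper's actual implementation.
import requests

# Each Alexa-style intent is paired with a parameterized SPARQL template;
# double braces are literal braces in Python's str.format().
TEMPLATES = {
    # "What is the <property> of <entity>?"
    "PropertyOfEntityIntent":
        "SELECT ?value WHERE {{ <{entity}> <{property}> ?value }} LIMIT 10",
    # "How many <class> are there?"
    "CountClassIntent":
        "SELECT (COUNT(?s) AS ?count) WHERE {{ ?s a <{cls}> }}",
}

def answer(intent_name: str, slots: dict, endpoint: str) -> list:
    """Fill the template bound to the recognized intent and query the endpoint."""
    query = TEMPLATES[intent_name].format(**slots)
    response = requests.get(
        endpoint,
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    # Flatten the SPARQL JSON results into plain strings a VA could read aloud.
    return [next(iter(b.values()))["value"] for b in bindings]

if __name__ == "__main__":
    # Example: "What is the author of the Mona Lisa?" against DBpedia.
    print(answer(
        "PropertyOfEntityIntent",
        {"entity": "http://dbpedia.org/resource/Mona_Lisa",
         "property": "http://dbpedia.org/ontology/author"},
        "https://dbpedia.org/sparql",
    ))
```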

Section 6 presents an evaluation of the system using three questions. The code is available on GitHub. The most important question, whether the system is fit for its purpose from an end-user point of view, is, however, not addressed or evaluated.

Summary

(1) originality. The paper has some originality in its attempt to map Alexa intents to SPARQL templates. However, the paper should document in more detail how this kind of system was actually implemented; a reference to GitHub is not enough.

(2) significance of the results. The wide analysis of CH linked datasets available via SPARQL endpoints was interesting. However, it remains unclear how useful the QA system would actually be to end users. How well does the template approach generalize to free questions posed by the public? A more focused study related to one particular dataset would be useful and provide the reader with some deeper understanding of how the system actually works. I have doubts about how well the system would perform in real life, as there are several deep challenges in using linked data even when SPARQL is used by professional users. Cf. e.g. the papers in JASIST and at ISWC 2021 regarding the MMM data used as a case study, and the Semantic Web Journal paper on the WarSampo knowledge graph, used as example datasets in the evaluation. Even if many QALD questions can be transformed into the eight basic queries, it is not clear how well this can be done for free end-user questions in a real use case; real Digital Humanities questions are far more complex than those used in the paper. However, NL understanding is an important topic even if challenging, and in this sense the paper has some significance.

(3) quality of writing. In general the paper is well written, but there were typos etc.

Review #3
By Jacco van Ossenbruggen submitted on 09/Aug/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

General:

The paper presents a software framework to automatically generate extensions for Alexa that enable it to answer questions over a SPARQL endpoint, and provides this software framework with use case data on GitHub and Zenodo. It evaluates the framework by comparing the manually configured and auto-configured skills against other systems on the QALD question sets. I also appreciate the overview of currently available KGs in the CH domain in Section 3; this is very useful!

However, the current paper seems divided into two halves, where the first contains the title, abstract, intro, related work, and the overview of available CH knowledge graphs. This part is about cultural heritage KGs, as you would expect from the title, and fits the special issue. This part made me moderately enthusiastic.

But then the second part is about a software framework for general QA over KGs, has no specific CH aspects (even the use cases in Section 7 are just pretending to be about CH), and nothing CH-specific has been evaluated. There seems to be no real user testing on the use of the framework to generate skills, nor on the use of the skills (while you promise the reader "an engaging exploitation tool accessible to all without losing sight of developers' technological challenges"). This part made me feel really disappointed, as it reads as a badly written QA-over-KGs workshop paper.

Recommendation: I think in the revision you either need to tone down your claims and make it clear you developed an auto-configurable QA system for open-ended KGs, evaluated with QALD on general-purpose linked data, or you need to do a user study that proves that your system is usable by cultural heritage users on specific cultural heritage tasks. If you opt for the first, I would re-think whether this special issue is the best venue. If you decide to resubmit, you still need to motivate why CH users need your system: for which common CH tasks is a system as proposed by [30] insufficient, and why?

Novelty/Originality
While the authors mention some other work on extending commercial VAs to make them work with linked data, I think the topic is sufficiently novel to be in scope for this special issue.

Significance of the results:

I have some doubts about your UNESCO classification, since this is not the definition provided by the UNESCO thesaurus, as you suggest. This is not just a small error, because the thesaurus is one of your datasets! I assume your definition is taken from the UNESCO page on "illicit-trafficking-of-cultural-property", which explains the special attention for heritage in the event of conflict. However, this is not a category orthogonal to the other categories and cannot be used as such.

I fail to understand why the two processes depicted in Fig. 4 are not identical... Also, in Table 8, all SPARQL queries seem to be based on general QA templates, nothing specific for either VA or CH... (?!) And in the evaluation you use the generic QALD-7/9 question sets. Also, the MMM use case depicted in Fig. 7 seems not very specific to tangible CH... This all suggests you are doing generic QA over KGs, with a very small VA and CH skin... or am I missing something?

You seem to have cherry-picked the questions from QALD that fit your query patterns, and then hand-crafted the corresponding utterances ...

The example use cases both suggest that end users need to construct rather schema-driven, not-so-natural-English utterances ("Which database has modified equals to 2020?", "what is the creation with the maximum value of had participant"... who came up with these spontaneous questions?)
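[To make this point concrete, here is a hypothetical sketch (not taken from the paper) of why such utterances read as schema-driven: each slot in the utterance is substituted verbatim into a generic triple-plus-filter SPARQL template, so the user must phrase the question with the KG's own class and property labels. The utterance wording, template, IRIs, and labels are assumptions for illustration.]

```python
# Hypothetical illustration (not taken from the paper) of a schema-driven
# utterance pattern and the generic SPARQL template it is substituted into.
# The utterance wording, IRIs, and labels below are assumptions.
UTTERANCE_TEMPLATE = "which {class_label} has {property_label} equals to {literal}"

SPARQL_TEMPLATE = """
SELECT ?s WHERE {{
  ?s a <{class_iri}> ;
     <{property_iri}> ?o .
  FILTER (STRSTARTS(STR(?o), "{literal}"))
}}
"""

# "Which database has modified equals to 2020?" only works because "database"
# and "modified" mirror the labels of the underlying class and property:
# every slot is copied verbatim into the query.
print(SPARQL_TEMPLATE.format(
    class_iri="http://www.w3.org/ns/dcat#Dataset",
    property_iri="http://purl.org/dc/terms/modified",
    literal="2020",
))
```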

"4.3. Discussion of Strengths and Limitations" discusses only strengths, no limitations ...!

Quality of writing:
Motivation: The authors claim that the main "novelty" in their contribution is the software framework that enables non-technical users to create Alexa skills for QA over CH KGs. But even after reading the first two sections, it is not clear to me why these users need to be able to do so. Which types of questions cannot be answered by state-of-the-art VA technology?

Minor:
- General: text needs checking by a native speaker (e.g. check all sentences with "We desire" or "behave as")
- p1: I'm not a fan of "scientifically empty" sentences like the first sentence in the abstract: "In the last years, we have witnessed a shift from the potential utility in digitization to a crucial need to enjoy activities virtually as an alternative to in-person experiences." What is a "crucial need to enjoy activities" and how do I check there indeed is such a need?
- p1, L47: [5] suggests a link to the UNESCO thesaurus, but it is not (!). The UNESCO thesaurus defines CH quite differently from this article: http://vocabularies.unesco.org/thesaurus/concept269
- p3, L3: "The process starts with a SPARQL endpoint provided by users." I assume it is just the URL of the endpoint that has to be provided, not the endpoint itself
- p3: readers not familiar with Alexa might not understand your usage of the word "skill", also because you use it in the more general sense on page 2
- p3, L45: I do not understand this sentence:"We noticed a particular interest in taking care of CH terminology and modeling approaches by thesaurus and model by the performed analysis."
- p5, L19, citation ref missing [? -> 34]