Making Linked-Data Accessible: A Review

Tracking #: 3463-4677

Authors: 
Omar Mussa
Omer Rana
Benoît Goossens
Pablo Orozco-terWengel
Charith Perera

Responsible editor: 
Katja Hose

Submission type: 
Survey Article

Abstract:
Linked Data (LD) is a paradigm that utilises the Resource Description Framework (RDF), typically stored in a triplestore, to describe numerous pieces of knowledge linked together. When an entity is retrieved in LD, the associated data becomes immediately accessible. SPARQL, the query language facilitating access to LD, has a complex syntax and requires prior knowledge of the underlying concepts. End-users may feel intimidated when faced with using LD and adopting the technology in their respective domains. Therefore, to promote LD adoption among end-users, it is crucial to address these challenges by developing more accessible, efficient, and intuitive tools and techniques that cater to users with varying levels of expertise. Users can employ query formulation tools and interfaces to search and extract relevant information rather than manually constructing SPARQL queries. This paper investigates and reviews existing methods for searching and accessing LD using query-building tools, identifies alternatives to these tools, and highlights their applications. Based on the reviewed works, we establish 22 criteria for comparing query builders to identify the weaknesses and strengths of each tool. Subsequently, we identify common usage themes for current solutions employed in accessing and searching LD. Moreover, we explore current techniques utilised for validating these approaches, emphasising potential limitations. Finally, we identify gaps within the literature and highlight future research directions to further advance LD accessibility and usability for end-users.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 10/Jul/2023
Suggestion:
Accept
Review Comment:

The paper gives a comprehensive overview of user interfaces to support the creation of SPARQL queries. The authors define 22 criteria based on a review of relevant papers, which were identified by a well-documented selection process. Based on these criteria, a brief introduction to and comparison of papers published between 2006 and 2021 is provided. Additionally, the paper broadens its scope by providing examples of alternative user interfaces that do not use SPARQL, and presents selected Semantic Web solutions. This overview is followed by a brief summary of common evaluation methods and a discussion of challenges not addressed in the summarized works.

* Suitability as introductory text: This work is well suited as an introductory text, because it succeeds in summarizing the existing state of the art. Additionally, the discussion of evaluation methods and challenges encourages new and well-evaluated work in this area.
* Comprehensiveness and balance of the presentation: The given overview of interfaces to support the creation of SPARQL queries is comprehensive and seems to cover the state of the art in this field. The presentation of Semantic Web solutions seems more limited in scope. Of course, giving an exhaustive overview would not make sense given the number of solutions. However, the identification of common usage themes could be explained: what is the rationale behind these five usage themes?
* Readability and Clarity: The readability is excellent and supported by well-structured figures and tables.
* Importance for broader Semantic Web community: This work discusses the usability of SPARQL and, even more generally, ways to access the Semantic Web for a range of user groups (lay users, experts). Consequently, it is of high importance to the broader community. Increasing the usability of SPARQL, one of the fundamental standards for accessing the Semantic Web, helps to improve the accessibility of the whole Semantic Web.

Comments:
* The Web of Things is not the only domain that uses many-to-one relations. It might be better to reorder this challenge: first mention the kinds of patterns that are currently not supported, and then give sensor data from Web of Things applications as an example (see the first sketch after this list). In the given order it sounds like a very specific requirement of Web of Things applications, when in fact it is far more generic.
* The research challenge mentions the idea of avoiding empty results. For me this is immediately connected to the drawback of reduced expressivity. In some cases an empty result is useful, e.g. a query asking about the people who have been to Mars should return no entities (see the second sketch after this list). It might be useful to explicitly state this limitation.
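To make the first point concrete, here is a minimal sketch of such a many-to-one pattern using the W3C SOSA vocabulary (the sensor IRI is a hypothetical example): many observations point to a single sensor, a shape that goes well beyond Web of Things applications:

    # Many observations linked to one sensor (W3C SOSA vocabulary).
    # The sensor IRI is a hypothetical example.
    PREFIX sosa: <http://www.w3.org/ns/sosa/>
    SELECT ?obs ?value
    WHERE {
      ?obs sosa:madeBySensor <http://example.org/sensor/42> ;
           sosa:hasSimpleResult ?value .
    }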
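And for the second point, a minimal sketch (with a hypothetical :hasVisited property) of a query whose correct answer is the empty result:

    # No person has been to Mars, so zero rows is the correct answer here,
    # not a failure the query builder should help the user avoid.
    PREFIX : <http://example.org/>
    SELECT ?person
    WHERE {
      ?person a :Person ;
              :hasVisited :Mars .
    }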

Minor comments:
* The reference [43] has a formatting issue.
* Page 2 line 16+17 contains an unnecessary repetition. The first sentence states "have been numerous efforts to develop a" and the following sentence repeats "Several efforts have been made to". I would suggest removing the first sentence.
* The description of Figure 4 uses the term "subject" to refer to predicates in the resulting RDF triple pattern. It would be useful to use a different term to avoid confusion, especially since the RDF triple pattern subject - predicate - object is introduced later on (see the sketch after these comments).
* On page 14 in line 34 the tool name is misspelled. Instead of "NITELIGHT" it is spelled "INITELIGHT".
* On page 24 in line 19 the term "single-based classifier" is not entirely clear to me. What kind of classifier is this? Adding a brief explanation might be helpful.
* On page 30 the title "Avoiding Non-empty Results" seems to contain an unwanted negation. I think it should be "Avoiding empty results", because the users are especially interested in non-empty results.
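To make the terminology point about Figure 4 concrete, here is a minimal triple pattern with the three positions labelled (the vocabulary is illustrative):

    PREFIX : <http://example.org/>
    SELECT ?city
    WHERE {
      #  subject    predicate     object
      ?city         :locatedIn    :Germany .
    }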

Overall, the paper is well written and provides a good summary of interfaces used to access the Semantic Web with a special focus on SPARQL queries. As such this work is clearly useful for the community and should be published.

Review #2
Anonymous submitted on 22/Jul/2023
Suggestion:
Reject
Review Comment:

The paper presents a review of visual query builders and tools that can help users formulate SPARQL queries.
The focus is mostly on available prototypes of query builders.
Overall, the contribution is limited since (a) it is mostly a list of existing software tools and neither provides an in-depth comparison of their methodologies nor abstracts common winning strategies, and (b) it misses a wide range of research directions and methods that have less well-known prototypes or are still mostly theoretical, but which could have impact if integrated into existing systems.

More detailed comments follow.

C1) The motivating example seems out of place, since it is presented from the side of the data publisher, and it fails to provide clear examples of different users and different search/information needs: exploratory, search, analytics.

C2) The criteria for the survey are too restrictive. Keywords like "knowledge graph" or "query suggestions" are missing, and the venues considered are also limited, omitting EDBT, VLDB, ICDE, SIGMOD, CIKM, and The Web Conference. The work fails to recognize important surveys and tutorials that could be relevant for identifying methods to survey, for example:

[A] Martins, Denis Mayr Lima. "Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities." Information Systems 83 (2019): 89-100.

[B] Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, and Themis Palpanas. 2019. Exploring the Data Wilderness through Examples. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 2031–2035. https://doi.org/10.1145/3299869.3314031

This leads to the survey missing seminal works like:

[C] Arenas, Marcelo, Gonzalo I. Diaz, and Egor V. Kostylev. "Reverse engineering SPARQL queries." Proceedings of the 25th International Conference on World Wide Web. 2016.

[D] Metzger, Steffen, Ralf Schenkel, and Marcin Sydow. "QBEES: query-by-example entity search in semantic knowledge graphs based on maximal aspects, diversity-awareness and relaxation." Journal of Intelligent Information Systems 49 (2017): 333-366.

And many more! Note that for each of the above, and more, there are also relevant associated demo papers.

C3) It does not concretely explain the difference from previous surveys and the gap this survey closes w.r.t. [19, 20, 22–24].

C4) The 22 criteria are very interesting, but it is hard to obtain a clear summary of which options are actually actionable to address them. Figure 8 is not easy to understand in this regard.

C5) The work does not provide an in-depth taxonomy and clear guidelines (a) for practitioners: which method should be chosen, and based on what? Which method could be considered the 'state of the art'? Which limitations exist? (b) for researchers: what are the actual open research problems? What are the best datasets and procedures for evaluating future solutions? Which methods are promising but not implemented in existing tools?

C6) The current analysis of efficiency, in terms of scalability but also of how easily these methods "adapt" when the data changes, is too limited. It is fundamental for these methods to provide fast answers but also to remain available when the data is updated; these requirements should be cross-referenced with general design decisions that go beyond any single method and are shared across multiple classes of methods.

Review #3
Anonymous submitted on 27/Aug/2023
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

* Making Linked-Data Accessible: A Review

In this paper the authors present a review, or survey, of user interfaces for accessing Linked Data, or RDF databases, on the Web. The authors present a list of criteria for comparing these interfaces and an exhaustive list of existing systems, describing and classifying them according to those criteria. At the end of the paper the authors provide a summary of the methods used to evaluate the interfaces and some challenges that arise from their analysis.

*** Strengths
- S1 Exhaustive list of systems (even though some famous ones are missing, such as the QA system Aqua or SemFacet)
- S2 Clear criteria for comparing the systems
*** Weaknesses
- W1 Some tool analyses seem shallow, since it appears the authors did not look at the actual interface but only at the paper describing it.
- W2 I miss engagement with specific user interface theory, such as the relation between mental models and user interfaces, an understanding of how users actually interact with the interface, etc. That would make for a really interesting set of conclusions.

Detailed review:

Regarding the shallowness of the systems analysis, I have several questions about specific systems that make me think the analysis is shallow:
- Konduit VKB: why is the system considered to focus on expert and technical users? Why is it considered potentially too cryptic for lay users? Did the authors ask the users? I think there is no proof of this, especially looking at the Konduit paper.
- BioGateway is a plugin for Cytoscape, which is a system for biological data visualization. This is missing from the paper.
- VizQuery: this system seems to be a student project; it just asks for a vocabulary property, and the interface does not check anything else. I do not understand why this interface is in the list. It could be that the system is important for historical reasons (i.e. it was the first to do X), but I am not sure about that.
- For SPARQLFilterFlow, I miss what data it is possible to visualize, i.e. what data can you load into the system? What tasks did the users work on? Were they hard? They looked at it and quickly used the first result from the interface (Food).
- Are tabular queries faceted browsing systems?
- Where is the linked data in these systems? Most of them only query a single SPARQL endpoint or dataset, and thus only access one dataset without linking its data to others (see the sketch after this list).
- ExConQuer seems to me to be a faceted browsing system; shouldn't faceted browsing be an entire category of its own, given that it is widely used in commercial systems [1,2]? Also, there is neither code nor a demo.
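To make concrete what "linked" access would mean, here is a minimal sketch of a federated query that joins two endpoints via the SPARQL SERVICE keyword (the specific properties and the direction of the sameAs links are illustrative assumptions):

    # Intended to run against the Wikidata endpoint: find cities in Wikidata
    # and fetch their abstracts from DBpedia via a federated SERVICE call.
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?city ?abstract
    WHERE {
      ?city wdt:P31 wd:Q515 .                # ?city is an instance of "city"
      SERVICE <https://dbpedia.org/sparql> {
        ?dbCity owl:sameAs ?city ;
                dbo:abstract ?abstract .
      }
    }

As far as I can tell, none of the reviewed tools lets a user build such a cross-dataset query.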

Regarding graph-based systems, I also have comments about some of them:
- GQL is an old system without code or a demo. What is the LD ontology the authors are referring to? Also, the authors say that "The authors thus claim that the tool can be used to create very complex queries": can you verify that? From my point of view you are only restating what is in the paper; in a survey paper I would expect some verification. If there is no verification, I still have to go to the original paper and read it through.
- SPARQLinG: from my point of view, the point of this work is that it defines a complete visual query language, rather than a user interface, which is the demo they provide. I think this is one of the few exceptions that actually provide a language. It should be in the paper for that reason, not for providing a working system, since I think none is available. Also, was this paper cited by others that later also provided a language?
- QueryVOWL: the tool clearly says that "The web demo provides a prototypical implementation of QueryVOWL. It has mainly been developed to demonstrate the QueryVOWL approach and should not be considered a mature tool (e.g., it contains some known bugs).". I think that the authors should highlight that.
- RDFExplorer: the process the authors describe for drawing queries is not accurate. The paper says "the user must then start by adding a new node, be it a variable node (η(G)) or a constant node (η(G, x));" and it is possible to query DBpedia or other datasets too (https://dbpedia.rdfexplorer.org/) (see the sketch below). Also, does the difficulty of using the interface come from the data or from the interface itself? The authors claim at the beginning of the paper that it is due to the data, yet here they claim that "any difficulty using the tool initially tends to decrease over time". In the end, the most important question for me is whether the difficulty comes from the data, and why. Note also that the idea of RDFExplorer is "to simultaneously navigate and query knowledge graphs". This system also raises a question: do the other systems scale with the data?
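For reference, a minimal sketch of the basic graph pattern that "a variable node connected to a constant node" corresponds to, assuming the DBpedia instance linked above (the film/director example is my own):

    # Visual graph: variable node ?film --dbo:director--> constant node
    # dbr:Stanley_Kubrick; the equivalent SPARQL basic graph pattern is:
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?film
    WHERE {
      ?film dbo:director dbr:Stanley_Kubrick .
    }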

NLP-based systems
Regarding these systems, why do the authors not classify them by their use of modern NLP architectures (i.e. neural networks) versus the rest? The accuracy of these techniques is considerably higher, and if no such system exists, that signals a possible improvement over the state of the art, which is what I personally look for in this type of paper. Also, I would guess that the more accurate the tool, the more satisfied users will be when using it.

Alternative User Interfaces
In this section I do not understand why YASGUI is compared with WQS, since WQS also appears in another section. YASGUI is the most popular user interface for SPARQL endpoints, since it ships with Apache Fuseki and is part of WQS. From my point of view, either YASGUI should be in another section, or WQS should be in this one.

Linked Data browsers
Again, why do the authors not single out the systems that use neural networks, which offer better identification of sentence components than the others? The same goes for Question Answering systems, which are also part of virtual assistants.

Why are Web APIs present in this paper? These are not user interfaces. I think this section belongs in another paper.

User study
From my point of view this is the most important category in the paper, since it allows us to understand why an interface is usable, i.e. why users actually use it. Rather than listing whether a specific technique was used to evaluate each paper, it would be great to understand whether the evaluation helps in that regard. For instance, NASA-TLX helps in understanding the user's workload when using an interface. Which interfaces measured that, and how do they compare with those that did not? Furthermore, only a few interfaces were actually evaluated, and thus the comparison should be brief. The same applies to the other evaluation techniques.

Findings and discussion
This section is more a summary of what the interfaces do, in aggregate. It is nice to have; however, I do not see many findings or much discussion. Also, the term user-friendly should be replaced by the term usable [3].

Summary
In general this is a very interesting work: it compiles in a single place many of the user interfaces developed so far for accessing RDF data. Note that this is RDF data, not linked data, since none of the reviewed interfaces accesses more than a single dataset (this should be corrected in the paper).

However, I think the system descriptions are too shallow (as described above). What I would like to learn from a survey paper is which problems the surveyed interfaces have, and why. Statements like "the authors of paper X say..." do not belong in a survey paper; there should be a bit more analysis.

Regarding the systems in general, I miss knowing which datasets they can access, whether I can load DBpedia or Wikidata into them (or point them at these datasets), and whether there are limitations on the amount of data they can visualize. I would like to know whether the systems consulted end users during the interface design phase, i.e. whether the systems actually solve someone's problem and whether users provided feedback about the interface.
Also, the authors assume that the difficulty of accessing and querying RDF data is due to SPARQL and the intricate structure of the data. Do any interfaces address these weaknesses, specifically how to overcome the intricate data structure? Did any of the described studies ask users about the difficulty of accessing and understanding the data?
And last but not least, how many of these systems are available online? Do they provide any code? Any license? User interfaces are meant to be used; so, can I use them?

Another point of view that could be useful is to present the systems in a historical context, showing how the interfaces have evolved over the last 15 years.

[1] Marti Hearst, Ame Elliott, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ka-Ping Yee. 2002. Finding the flow in web site search. Commun. ACM 45, 9 (September 2002), 42–49. https://doi.org/10.1145/567498.567525
[2] Hearst, M. (2006, August). Design recommendations for hierarchical faceted search interfaces. In ACM SIGIR workshop on faceted search (pp. 1-5).
[3] Nielsen, J. (1996). Usability metrics: Tracking interface improvements. IEEE software, 13(6), 1-2.