A Prospective Analysis of Security Vulnerabilities within Link Traversal-Based Query Processing

Tracking #: 2814-4028

Authors: 
Ruben Taelman
Ruben Verborgh

Responsible editor: 
Sabrina Kirrane

Submission type: 
Full Paper
Abstract: 
The societal and economic consequences surrounding Big Data-driven platforms have increased the call for decentralized solutions. However, retrieving and querying data in more decentralized environments requires fundamentally different approaches, whose properties are not yet well understood. Link-Traversal-based Query Processing (LTQP) is a technique for querying over decentralized data networks, in which a client-side query engine discovers data by traversing links between documents. Since decentralized environments are potentially unsafe due to their non-centrally controlled nature, there is a need for client-side LTQP query engines to be resistant against security threats aimed at the query engine’s host machine or the query initiator’s personal data. As such, we have performed an analysis of potential security vulnerabilities of LTQP. This article provides an overview of security threats in related domains, which are used as inspiration for the identification of 10 LTQP security threats. Each threat is explained, together with an example, and one or more avenues for mitigations are proposed. We conclude with several concrete recommendations for LTQP query engine developers and data publishers as a first step to mitigate some of these issues. With this work, we start filling the unknowns for enabling querying over decentralized environments. Aside from future work on security, wider research is needed to uncover missing building blocks for enabling true decentralization.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 26/Aug/2021
Suggestion:
Reject
Review Comment:

Summary
=======
The authors make the case for 10 points/vulnerabilities/security threats in the context of link-traversal-based querying that can lead to problems during query execution. The authors do so by making arguments based on literature and technologies from semantic-web and non-semantic web technologies. They conclude with recommendations for developers.

Strong points
=============
* The paper is timely because, with Solid, the substrate for link-traversal-based querying is starting to be deployed.
* The compilation of the points is original
* Link-traversal-based querying on the web, that is on data at least partially from potentially untrusted sources, indeed has peculiarities that need mitigation strategies.

Weak points
===========
* I miss a definition of the main point of this paper: security, with a delineation from safety. Some definitions of the two terms I found on the web make the difference in whether what goes wrong has been *deliberate*. Some of the paper's _security_ vulnerabilities are more safety problems for me, like Section 5.8 syntax errors (unintentionally put in by a human), or Section 5.6 links gone wrong (unintentionally put in by an RDF export from a database, see e.g. the DyLDO study [1]).
* Unclear attack surface / vectors:
  * For instance, Section 5.5 assumes a query engine that executes JavaScript before RDFa and JSON-LD are extracted from HTML. As far as I know, such engines are rare. Maybe it would be good to first introduce different classes of engines and their components. Engines that, e.g., only operate on Turtle/RDF+XML/N-Triples would not be affected.
  * In the introduction, the authors say that the paper is about the integrity of a user's data. What is the notion of integrity applied here? Which vulnerabilities would result in changes to a user's data?
  * Are the authors talking about the vulnerability of engines for federated SPARQL (Section 5.2) or engines for dereferencing URIs (Section 5.8)?
* Missing related work or state-of-the-art technologies:
  * Section 5.8 talks about document corruption, to which the authors add unavailable sources. Here, the authors again refer to the DyLDO study [1]. Thus, we have two points here: the meaning of an unavailable source and the meaning of a syntactically wrong document. Both require research in my opinion, and a pointer to HTML browsers' quirks modes is missing. On top of that, the authors miss a third point: inconsistent data, for which there are mitigation strategies as well.
  * Section 5.1 talks about unauthoritative statements. There are papers on authoritative statements that have not been cited in the paper, e.g. [2]. I would be interested in the notion of trust that the authors apply in this section.
  * Crawling the web of data has been investigated in [3,4,5], which would be relevant in related work. Moreover, the claim "HTTP delays typically form the bottleneck in LTQP" should be substantiated, e.g. using those works. What is an HTTP delay? From my experience, HTTP can be fast, and PLD starvation (I think see [6]) is a bigger issue if you want to crawl politely. FWIW, polite crawling is mentioned at the end of 2.1.
  * The authors extensively refer to their Comunica engine, but miss out on other engines that could process queries and follow links, including their own RESTDesc [7] and Linked Data-Fu [8].
  * Agents that work on Linked Data, e.g. [9], also make use of link following.
  * Maybe federated SPARQL with and without automated source selection is also in scope for related work (Section 2.1).
* The target audience of the paper is not researchers but developers and data publishers (see the recommendations in the conclusion and the abstract), so maybe the paper should be submitted to a non-academic venue?
* How did the authors assess the difficulty of their mitigation strategies?
* Regarding writing quality: While the English is well-written, for my taste the paper could use a more down-to-earth style.

Verdict
=======
I recommend that the editors reject the paper, as it is at a very premature stage. To me, most points raised are not really security vulnerabilities, though they are interesting points. Maybe involving a security expert, clearly defining security and safety, and then working on the attack vectors for different engines and different interfaces would help to find out what are "real" threats to system security and what are "just" very important and interesting peculiarities in link-traversal-based querying that need mitigation. Moreover, important related work is missing. Yet, I highly encourage the authors to continue their work on *all* 10 points.

Minor points
============
* "its own personal" -> "their own personal" (p.1)
* I am unsure why query processing helps to *find* data (p.1)
* "LTQP is a relative young area of research" vs. "More than a decade ago, [...] LTQP has been introduced" (both on p.2 - a contradiction?)
* What are the "global semantics" of RDF? Please add a reference or explain (Section 2.1).
* Reference [23] does not support the paragraph in which it is cited (Section 2.2)
* Reference [58] does not support the paragraph in which it is cited (Section 5.3)
* "Unauthorized Statements" (Table 4) vs. "Unauthoritative Statements" (the corresponding heading of Section 5.1)
* The open-world assumption in my opinion does not imply free speech (Section 5.1)
* HTTP GET Parameters would need a reference or a definition (Section 5.4)
* "limit duration" -> "limited duration" (Section 5.5)
* Keeping track of all visited URIs is common practice in crawlers (Section 5.6)
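
To illustrate the last minor point, here is a minimal sketch of how a crawler-style traversal typically tracks visited URIs so that each document is dereferenced at most once, even when links form a cycle. The `LINKS` dictionary, the `traverse` function, and the `max_documents` cap are hypothetical names for illustration and are not taken from the paper or from any existing engine.

```python
from collections import deque

# Hypothetical in-memory "web": document URI -> URIs it links to.
# Documents a and b link to each other, forming a cycle.
LINKS = {
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/a", "http://example.org/c"],
    "http://example.org/c": [],
}


def traverse(seed, links, max_documents=100):
    """Breadth-first link traversal that visits each URI at most once."""
    visited = set()        # URIs that have already been dereferenced
    queue = deque([seed])  # URIs still waiting to be dereferenced
    order = []             # URIs in the order they were processed
    while queue and len(order) < max_documents:
        uri = queue.popleft()
        if uri in visited:
            continue       # already seen: this is what breaks cycles
        visited.add(uri)
        order.append(uri)
        for target in links.get(uri, []):
            if target not in visited:
                queue.append(target)
    return order


result = traverse("http://example.org/a", LINKS)
```

Note that a visited set only protects against cycles over a finite set of URIs. A source that keeps generating new, distinct URIs would still need a separate bound, which is why the sketch also assumes a `max_documents` cap.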

[1] Käfer et al. "Observing Linked Data Dynamics", ESWC 2013
[2] Hogan et al. "Scalable authoritative OWL reasoning for the web", IJSWIS 2009
[3] Isele et al. "LDSpider" P&D ISWC 2010
[4] Röder et al. "Squirrel", ISWC 2020
[5] Käfer et al. "DyLDO", LDOW 2012
[6] Hogan et al. "SWSE", JWS 2011
[7] Verborgh et al. WS-REST, 2012
[8] Stadtmüller et al. "Data-Fu", WWW 2013
[9] Käfer et al. "Programming User Agents...", LDOW 2018

Review #2
Anonymous submitted on 07/Sep/2021
Suggestion:
Reject
Review Comment:

The authors present a study of potential security concerns when using an LTQP query engine on the web, where documents can be noisy and malicious. The authors draw a parallel between LTQP and web browsers, categorize a few of the attacks common against web browsers, and discuss how such attacks could be executed in the LTQP framework. Overall, the manuscript is readable, with a few typos. As LTQP is a new research area and still under exploration (as mentioned by the authors), the manuscript presents no quantitative or qualitative results by which to judge the impact of the work.

As described below, the work needs to be improved to (1) position it well with respect to the security aspect, and (2) provide motivations for some of the choices.

Weaknesses:

0. The work seems to be hypothetical in nature, and to support it the authors draw a connection between LTQP and web browsers. Most of the work is borrowed from already known flaws in web browsers and their potential mitigations. Due to the high resemblance to existing work and its hypothetical nature, the contribution seems incremental.

1. As mentioned by the authors (page 6), one of the assumptions is that the query engine makes use of Alice's identity for authentication. Such a "known identity" assumption is too strong. In the absence of such an assumption, the query should not be executed, nor should it have access to the data. But how this authentication would be handled in the LTQP framework is unclear. Authentication work is mentioned under related work, but it is not clear how it would be used by the LTQP engine.

2. There is no mention of the criteria used to decide the difficulty level (high/medium/low) of the attacks or of the mitigations. Moreover, how do "high", "medium", and "low" translate quantitatively?

3. The paper lacks motivation for why a fixed set of 10 attacks was chosen from a pool of attacks. What criteria were used for selection? Are these attacks more dangerous than the others, and do they therefore have higher priority? Such high priority might hold true for web browsers, but does it translate equally to LTQP? Is the presented list of attacks exhaustive with respect to LTQP?

Strength:

0. Depending on the popularity and community willingness to accept the LTQP framework, the current manuscript can serve as an initial work/baseline/reference.

Some suggestions:

I believe the following points might help the authors make the manuscript stronger.

0. To organize and prioritize attacks for the LTQP engine, the authors can refer to the cybersecurity data sources mentioned in the Unified Cybersecurity Ontology [1]. These data sources are publicly available and are of good quality. Each attack has an associated score that indicates its nature and severity, along with many additional features.

1. Judging the difficulty of the mitigations can be hard, as there is no known implementation. Judging based on perception might be vague. As a suggestion, it would be fine to list only the difficulty of the attacks, using established data sources.

Typos:

1. "more decentralized" -> "decentralized", as it is not clear how to compare a more decentralized system with a less decentralized one

2. "true decentralization" -> "decentralization"

3. On page 1, the second sentence has a guardian.com link associated with it, but the link is neither cited nor hyperlinked. I am not sure whether that was intentional or a mistake. If intentional, it would be nice to make it more explicit.

4. "Solid leads to a a " -> "Solid leads to a"

5. Page 2, "illustrate difference threats with." -> "illustrate difference threats."

6. What does "..." mean in Tables 1, 2 and 3?

References 

[1] Syed, Zareen, et al. "UCO: A unified cybersecurity ontology." Workshops at the thirtieth AAAI conference on artificial intelligence. 2016. https://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/viewPaper/12574

Review #3
Anonymous submitted on 16/Oct/2021
Suggestion:
Major Revision
Review Comment:

The paper tackles the problem of link traversal-based query processing and presents a prospective assessment of the potential security vulnerabilities of this type of query processing. The authors analyze ten security threats and propose mitigation strategies, each with a level of difficulty. Each threat is discussed in a use case where data vaults store data of any type, are published on the Web, and are completely controlled by their owner. One of the users has malicious intentions that are unknown to the others, and the vulnerabilities are defined based on existing cases in other domains. The article concludes with recommendations for link traversal query processing developers and data publishers.

Positive Points
-) An exhaustive analysis of vulnerabilities that may exist whenever link traversal query processing is performed over distributed linked data.
-) A clear illustration of each analyzed case with the running example.
-) Conscientious recommendations to avoid and mitigate the discussed vulnerabilities.

Negative Points
-) Although the paper resorts to simple examples to explain the potential security issues, the reported analysis relies on a group of vague concepts. For example, in section 2.2, the vulnerability of RDF query processing is presented in terms of injection attacks, parameterized queries, and query parse trees. A detailed description of these concepts is required to enhance readability.

-) There is no justification of the methodology followed to identify these ten vulnerabilities. It is not clear whether a systematic literature review process was followed to uncover them. The evaluation methodology must be defined to ensure reproducibility and an understanding of the level of completeness of the analyzed cases.

-) Although the paper refers to link traversal query processing, it does not concretely show which of the existing approaches are in danger from these vulnerabilities. The authors should also indicate whether these vulnerabilities threaten existing real-world methods, e.g., SPARQL federated query engines or SPARQL endpoints. If so, include references.

-) The criteria followed in deciding the degree of difficulty are not discussed. Moreover, the meaning of the values Easy, Medium, and Hard is not defined. The paper needs to clearly describe the process to be followed to mitigate each vulnerability and how the difficulty values are determined from these processes.

-) The execution of SPARQL queries comprising the triple pattern ?s ?p ?o is usually limited by timeouts specified in the configuration of the SPARQL endpoint. The categories of injection attacks considered in this analysis are not clear. Also, it is not justified for which types of query engines this query constitutes an injection attack. Please indicate concrete examples.

-) The article contains several imprecise statements and ignores related work conducted over several decades in graph databases. For example, the paper states that “LTQP is a relatively new area of research”. However, query processing over graph databases has been studied for more than four decades. Please check the vast amount of work by Alberto Mendelzon or Claudio Gutierrez. The authors should make a more specific statement about the query processing problem to which they refer in this paper.

Giansalvatore Mecca, Alberto O. Mendelzon, Paolo Merialdo:
Efficient Queries over Web Views. IEEE Trans. Knowl. Data Eng. 14(6): 1280-1298 (2002)

Gustavo O. Arocena, Alberto O. Mendelzon, George A. Mihaila:
Query Languages for the Web. QL 1998

Alberto O. Mendelzon, George A. Mihaila, Tova Milo: Querying the World Wide Web. PDIS 1996: 80-91
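
As an illustration of the kind of concrete example requested above for injection attacks (Section 2.2) and the ?s ?p ?o pattern, the following minimal sketch contrasts a query built by naive string concatenation with one whose user-supplied value is escaped first. Escaping is used here as a simpler stand-in for parameterized queries, which bind values separately from the query text. The function names, the FOAF name property, and the attack string are assumptions chosen for illustration and are not taken from the paper.

```python
# Hypothetical query template: find people with a given name.
PREFIX = 'SELECT ?person WHERE { ?person <http://xmlns.com/foaf/0.1/name> "'
SUFFIX = '" }'


def escape_literal(value: str) -> str:
    """Escape a value for use inside a double-quoted SPARQL string literal."""
    replacements = [
        ("\\", "\\\\"),  # backslash first, so later escapes are not doubled
        ('"', '\\"'),    # a bare quote would otherwise terminate the literal
        ("\n", "\\n"),
        ("\r", "\\r"),
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


def naive_query(name: str) -> str:
    """Vulnerable: the user input is pasted into the query text as-is."""
    return PREFIX + name + SUFFIX


def safe_query(name: str) -> str:
    """Safer: the user input can no longer break out of the string literal."""
    return PREFIX + escape_literal(name) + SUFFIX


# The input closes the literal, appends a catch-all triple pattern,
# closes the group, and comments out the remainder of the template.
malicious = '" . ?s ?p ?o } #'

injected = naive_query(malicious)
escaped = safe_query(malicious)
```

In `injected`, the pattern `?s ?p ?o` becomes part of the query itself, whereas in `escaped` the same characters remain inside the string literal and are only ever compared as a name.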