Review Comment:
Summary
=======
The authors present 10 points/vulnerabilities/security threats in the context of link-traversal-based querying that can lead to problems during query execution. They do so by arguing from the literature and from semantic-web and non-semantic-web technologies. They conclude with recommendations for developers.
Strong points
=============
* The paper is timely, because Solid, the substrate for which link-traversal-based querying is intended, is starting to become deployed
* The compilation of the points is original
* Link-traversal-based querying on the web, that is, over data that comes at least partially from potentially untrusted sources, indeed has peculiarities that need mitigation strategies.
Weak points
===========
* I miss a definition of the main notion of this paper: security, with a delineation from safety. Some definitions of the two terms that I found on the web draw the distinction based on whether what goes wrong is *deliberate*. Some of the paper's _security_ vulnerabilities are rather safety problems to me, such as the syntax errors of Section 5.8 (unintentionally introduced by a human) or the broken links of Section 5.6 (unintentionally produced by an RDF export from a database, see e.g. the DyLDO study [1]).
* Unclear attack surface / vectors:
* For instance, Section 5.5 assumes a query engine that executes JavaScript before RDFa and JSON-LD are extracted from HTML. As far as I know, such engines are rare. Maybe it would be good to first introduce different classes of engines and their components; engines that, e.g., only operate on Turtle/RDF+XML/N-Triples would not be affected (see the first sketch after this list).
* In the introduction, the authors say that the paper is about the integrity of a user's data. What notion of integrity is applied here? Which vulnerabilities would result in changes to a user's data?
* Are the authors talking about the vulnerability of engines for federated SPARQL (Section 5.2) or engines for dereferencing URIs (Section 5.8)?
* Missing related work or state-of-the-art technologies
* Section 5.8 talks about document corruption, to which the authors add unavailable sources. Here, the authors again refer to the DyLDO study [1]. Thus, we have two points here: the meaning of an unavailable source and the meaning of a syntactically incorrect document; both require research in my opinion, and a pointer to HTML browsers' quirks modes is missing. On top of that, the authors miss a third point: inconsistent data, for which there are mitigation strategies as well.
* Section 5.1 talks about unauthoritative statements. There are papers on authoritative statements that have not been cited in the paper, e.g. [2]. I would be interested in the notion of trust that the authors apply in this section.
* Crawling the web of data has been investigated in [3,4,5], which would be relevant as related work. Moreover, the claim "HTTP delays typically form the bottleneck in LTQP" should be substantiated, e.g. using those works. What is an HTTP delay? From my experience, HTTP can be fast, and PLD starvation (see [6], I think) is a bigger issue if you want to crawl politely (see the second sketch after this list). FWIW, polite crawling is mentioned at the end of Section 2.1.
* The authors extensively refer to their Comunica engine, but miss out on other engines that could process queries and follow links, including their own RESTDesc [7] and Linked Data-Fu [8]
* Agents that work on Linked Data, e.g. [9], also make use of link following
* Maybe federated SPARQL with and without automated source selection is also in scope for related work (Section 2.1)
* The target audience of the paper is not researchers but developers and data publishers (see the recommendations in the conclusion and the abstract), so maybe the paper should be submitted to a non-academic venue?
* How did the authors assess the difficulty of their mitigation strategies?
* Regarding writing quality: while the English is good, for my taste the paper could use a more down-to-earth style.
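To illustrate the attack-surface point on Section 5.5 above: a minimal sketch (my own TypeScript, with an invented example document) of an engine class that only extracts literal JSON-LD script blocks from the raw HTML source. Such an engine never executes the page's JavaScript, so script-injected JSON-LD never materializes for it:

```typescript
// Invented example document: one literal JSON-LD block, plus a script
// that would inject a second block if (and only if) JavaScript ran.
const html = `
  <script type="application/ld+json">
    { "@id": "http://example.org/alice", "http://schema.org/name": "Alice" }
  </script>
  <script>
    const s = document.createElement("script");
    s.type = "application/ld+json";
    s.text = '{ "@id": "http://example.org/eve", "http://schema.org/name": "Eve" }';
    document.head.appendChild(s);
  </script>`;

// Static extraction over the raw source: no DOM construction, no JS engine.
const jsonLdBlocks = [...html.matchAll(
  /<script type="application\/ld\+json">([\s\S]*?)<\/script>/g,
)].map((match) => JSON.parse(match[1]));

console.log(jsonLdBlocks); // only Alice's description; Eve's never appears
```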
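And to illustrate the PLD-starvation point: a minimal sketch of per-PLD politeness, where the fixed 1-second delay and the naive getPld() helper are my own illustrative assumptions, not the paper's (a real crawler would consult the public-suffix list and use a proper frontier queue; this sketch is also not concurrency-safe):

```typescript
const lastFetch = new Map<string, number>(); // PLD -> time of last request

function getPld(uri: string): string {
  // Naive: take the last two host labels; real code would use the
  // public-suffix list to determine the pay-level domain.
  return new URL(uri).hostname.split(".").slice(-2).join(".");
}

async function politeFetch(uri: string): Promise<Response> {
  const pld = getPld(uri);
  const wait = (lastFetch.get(pld) ?? 0) + 1000 - Date.now();
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  lastFetch.set(pld, Date.now());
  return fetch(uri); // the HTTP round trip is often far below the polite delay
}

// If most frontier URIs share one PLD (plausible in LTQP over a single
// pod provider), throughput degrades to ~1 request per second per PLD,
// regardless of how fast the individual HTTP round trips are.
```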
Verdict
=======
I recommend that the editors reject the paper, as it is at a very premature stage. To me, most of the points raised are not really security vulnerabilities, though they are interesting points. Maybe involving a security expert, clearly defining security and safety, and then working out the attack vectors of different engines for different interfaces would help to determine which are "real" threats to system security and which are "just" very important and interesting peculiarities of link-traversal-based querying that need mitigation. Moreover, important related work is missing. Yet, I highly encourage the authors to continue their work on *all* 10 points.
Minor points
============
* "its own personal" -> "their own personal" (p.1)
* I am unsure why query processing helps to *find* data (p.1)
* "LTQP is a relative young are of research" vs. "More than a decade ago, [...] LTQP has been introduced" (both on p.2 - a contradiction?)
* What are the "global semantics" of RDF? Please add a reference or explain (Section 2.1).
* Reference [23] does not support the paragraph in which it is mentioned (Section 2.2)
* Reference [58] does not support the paragraph in which it is mentioned (Section 5.3)
* "Unauthorized Statements" (Table 4) vs. "Unauthoritative Statements" (the corresponding heading of Section 5.1)
* In my opinion, the open-world assumption does not imply free speech (Section 5.1)
* HTTP GET Parameters would need a reference or a definition (Section 5.4)
* "limit duration" -> "limited duration" (Section 5.5)
* Keeping track of all visited URIs is commonly done in crawlers (Section 5.6; see the sketch below)
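A minimal sketch of that standard deduplication (names invented):

```typescript
// A crawler frontier admits each URI at most once; this also cuts link cycles.
const visited = new Set<string>();

function admit(uri: string): boolean {
  if (visited.has(uri)) return false; // already dereferenced: skip
  visited.add(uri);
  return true;
}
```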
[1] Käfer et al.: "Observing Linked Data Dynamics", ESWC 2013
[2] Hogan et al.: "Scalable authoritative OWL reasoning for the web", IJSWIS 2009
[3] Isele et al.: "LDSpider", ISWC Posters & Demos 2010
[4] Röder et al.: "Squirrel", ISWC 2020
[5] Käfer et al.: "DyLDO", LDOW 2012
[6] Hogan et al.: "SWSE", JWS 2011
[7] Verborgh et al.: WS-REST 2012
[8] Stadtmüller et al.: "Data-Fu", WWW 2013
[9] Käfer et al.: "Programming User Agents...", LDOW 2018