On Link Traversal Querying for a diverse Web of Data

Paper Title: 
On Link Traversal Querying for a diverse Web of Data
Jürgen Umbrich, Aigan Hogan, Axel Polleres, Stefan Decker
Traditional approaches for querying the Web of Data involve centralised warehouses that replicate remote data. Conversely, Linked Data principles allow for answering queries live over the Web by dereferencing URIs to traverse remote data sources at runtime. A number of authors have looked at answering SPARQL queries in such a manner; these link-traversal based query execution (LTBQE) approaches for Linked Data offer up-to-date results and decentralised (i.e., client-side) execution, but must operate over incomplete dereferenceable knowledge available in remote documents, thus affecting response times and “recall” for query answers. In this paper, we study the recall and effectiveness of LTBQE, in practice, for the Web of Data. Furthermore, to help bridge data heterogeneity in diverse sources, we propose lightweight reasoning extensions to help find additional answers. From the state-of-the-art which (1) considers only dereferenceable information and (2) follows rdfs:seeAlso links, we propose extensions to consider (3) owl:sameAs links and reasoning, and (4) lightweight RDFS reasoning. We then estimate the recall of link-traversal query techniques in practice: we analyse a large crawl of the Web of Data (the BTC’11 dataset), looking at the ratio of raw data contained in dereferenceable documents vs. the corpus as a whole and determining how much more raw data our extensions make available for query answering. We then stress-test LTBQE (and our extensions) in real-world settings using the FedBench and DBpedia SPARQL Benchmark frameworks, and propose a novel benchmark called QWalk based on random walks through diverse data. We show that link-traversal query approaches often work well in uncontrolled environments for simple queries, but need to retrieve an unfeasible number of sources for more complex queries. We also show that our reasoning extensions increase recall at the cost of slower execution, often increasing the rate at which results returned; conversely, we show that reasoning aggravates performance issues for complex queries.
Full PDF Version: 
Submission type: 
Full Paper
Responsible editor: 
Major Revision

Solicited review by Luciano Serafini:

This paper describes a massive work in the area of query processing in the semantic web. So I think that it provides really an interesting message to the semantic web community. The main features introduced by the semantic web and linked open data is the fact that data are managed by different repositories, and in principle they can be accessed only via sparql endpoints. So the hypothesis of collecting all the relevant data before answering a question is unfeasible in the semantic web. The authors take seriously this perspectives and start working on it.

The negative point of this paper, is that it is very difficult to read. The first reason is because it's too long. I think that the message could be encoded in half of the pages. I believe that the authors should fine a more coincide way to state things.

Two examples, on what I mean. The description of the closure w.r.t., the rules is basically three times the same thing, first deref, then seeAlso, then sameAs, then subClass and subProperty, but apply essentially the same principle. So why not finding a way to put things in a simpler manner. The formalization is quite awkward, and I believer it can be simplified and reduced. After all this is an experimental paper, and the objective of the formalization is to clarify what they are computing.

A second example is the description of the benchmark. The authors provides so many details, like the number of uri in the object, property and subject, but it's not clear to me how these data affects the performance of the algorithm.

Finally I was expecting a comparison between the approach proposed by the authors and the one that determines the set of resources relevant fro the query at the beginning, it downloads them, and then apply the standard query algorightms.

My suggestion is to improve the paper, in the readibility by making it more compact and focused.

Solicited review by Günter Ladwig:

In this paper the authors present LiDaQ, an approach for Linked Data
query processing based on link traversal. Compared to previous link
traversal approaches, LiDaQ also incorporates reasoning extensions
that help increase the recall of Linked Data queries. LiDaQ and its
various extensions are thoroughly evaluated in an uncontrolled
environment, which is a first in Linked Data query processing

Overall, I think the paper is an important contribution. The reasoning
extensions are well motiviated and are shown to significantly improve
the recall (as well as query time, unfortunately). The evaluation is
systematic and provides insights into the effects of various design
decisions. It also validates some assumptions that were made in
earlier works, but never evaluated explicitly (such as that
dereferencing predicates or classes does not help much).

One thing I think is missing and would like to see added is a
discussion of early result reporting, i.e. the time it takes to reach
the first or some percentage of all results. Given that the overall
query times are quite high, focusing on early results might make
Linked Data query processing approaches more feasible to use. The
evaluation also measures the time for the first query result, but it
is never discussed.

On that note, does the source selector perform any kind of source
ordering/ranking? Do you use multiple threads to retrieve sources,
e.g. one per pay-level domain?

Section 5.2 says that LiDaQ uses Hartig's iterator-based algorithm as
well as a symmetric hash join, which seems to contradict
itself. Please make this more clear, in particular does the LiDaQ
implementation have the completeness issues of the non-blocking

The work on completeness classes by Harth et al. [1] should be added
to the related work.

Minor issues:

- Query shapes: entity query usually means that the query asks for
entities, here the it is used the other way around (i.e. it asks for
attributes of a given entity). This was a bit confusing at first.
- In Section 8.1 the measure 'lookups' is introduced, but the result
tables all use 'HTTP' for the number of HTTP lookups.

[1] http://www.dblp.org/rec/bibtex/conf/aaai/HarthS12

Solicited review by Saeedeh Shekarpour:

This paper extends the approach employed by Hartig as Linked Traversal Based Query Execution (LTBQE). Basically in this approach SPARQL queries are run in a live mode (using dereferenceable URIs).
The extensions are as follows: 1. Following rdfs:seeAlso. 2. Following owl:sameAs and 3. Applying lightweight reasoning. The paper is well-motivated and written. The state-of-the-art is up to date.
In general, this paper lacks any significant novelty and originality but the analysis on the extension directions is interesting. It revealed that using owl:sameAs and a lightweight rdf reasoning can increase the number of results while following rdfs:seeAlso is not so effective. Obviously runtime significantly degrades.
Interesting discussion points missed which should be added are:
1. Up to how far can following owl:sameAs links or rdfs reasoning increase the recall?
2. What is the redundancy introduced through added triples by following owl:sameAs links or rdfs reasoning?
The main rationale for the approach is that centralized datasets contain stale data and are not always updated. So, this raises the question (which would be interesting to be addressed in the paper) whether the cost of running a crawler to refresh the knowledge base is lower than running live queries?
Section 6:
In the first experimental studies, authors measured the derefrensability of LTBQE based on corpus of Billion triple challenge 2011. Also they showed that the derefrencability of rdfs:seeAlso is quite trivial. Conversely, following owl:sameAs and RDFS inference to some extend increases the availability of raw data. Also URIs in the predicate or rdf:type object positions are unlikely to be derefrencable. These results are the basis of the variations of the extended version LTBQE. As they skip dereferencing URIs bound to either predicate or objects of triple patterns.
Explanation of figure 4, 5 and 6 is unclear. It has been stated that these figures "plot distribution of the number of triples in the dereferenced document that contain the dereferenced term in the predicate (object) positions", while axis titles are referring to the number of URIs versus number of relevant dereferenceable triples. Authors should provide better explanation or axis titles.
Section 7:
At the beginning authors surveys different SPARQL benchmark frameworks
The evaluation framework contains 3 datasets and 10 different configurations.
1. FedBench contains 11 queries. From the DBpedia benchmark just 5 queries can be processed by this approach and finally – queries generated by random walk based on introduced query shapes. Table 7, 9 show comparisons between result size over running queries for different configurations. As it observed using RDF reasoning yields a larger result set.
The benchmarks chosen for evaluation are Fedbench, DBpedia Benchmark and a set of queries generated by Randomwalk approach.

Minor issues:
• Linked Data principles are repeated in the introduction and section 3.3. I would suggest removing them from introduction.
• With respect to the query shapes, entity queries shape seems is a special kind of star queries.
• Definition 4, if L stands for literal, subject of a triple pattern can not be literal and VUL * VU * VUL is wrong.