Warehousing Linked Open Data with Today’s Storage Choices

Tracking #: 1573-2785

Timm Heuss

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
This paper compares the performance of current storage technologies when warehousing Linked Open Data. This involves common CRUD operations on relational databases (PostgreSQL, SQLite-Xerial and sqlite4java), NoSQL databases (MongoDB and ArangoDB) and triple stores (Virtuoso and Fuseki). Results indicate that relational approaches perform well or best in most disciplines and provide the most stable operation. Other approaches show individual strengths in rather specific scenarios that may or may not justify their deployment in practice.

Solicited Reviews:
Review #1
Anonymous submitted on 11/Jul/2017
Review Comment:

The submitted manuscript proposes another RDF benchmark with a predefined dataset and adaptive workload.
The benchmark evaluates RDF, NoSQL and RDBMS systems and provides some interesting insights into their performance for different CRUD operations.

Overall, the idea of the benchmark is nice. The author proposes a new dataset which has not been used before, provides a clear setup, and the experiments are reproducible and documented in a Git repository.
Adding many different systems and comparing them across different workloads is also a nice advantage.

However, for a fully fledged benchmark description the author should add details about the schema and the specifics of loading RDF data into the RDBMS and NoSQL systems.
I could not find much detail besides the array data type for the RDBMS systems.
For instance, how are the statements stored in MongoDB, etc.?
It should also be discussed how the schema affects the queries (e.g. the number of self-joins, or what a query looks like in MongoDB).
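To make this request concrete, one plausible per-statement document layout and the self-join it forces for entity retrieval can be sketched in Python; all collection contents and field names here are purely hypothetical, invented only to illustrate the kind of schema detail the paper should document:

```python
# Purely illustrative: one plausible way to store RDF statements in a
# document store, with one document per triple. All data is hypothetical.
triples = [
    {"s": "ex:book1", "p": "dcterms:identifier", "o": "011363517"},
    {"s": "ex:book1", "p": "dcterms:title",      "o": "A Study of Storage"},
    {"s": "ex:book2", "p": "dcterms:title",      "o": "Unrelated Work"},
]

def find(collection, query):
    """Tiny in-memory stand-in for a document-store find():
    exact match on every key in the query document."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]

# Retrieving a whole entity by identifier needs two passes over the
# collection -- the "self-join" a schema description should discuss:
matches = find(triples, {"p": "dcterms:identifier", "o": "011363517"})
subjects = {d["s"] for d in matches}
entity = [d for d in triples if d["s"] in subjects]
print(len(entity))  # 2
```

With such a layout, every additional join in a query means another pass over the collection, which is exactly the cost profile the review asks the author to report.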

Another improvement would be to relate this work to existing benchmarks and add coverage of related work.
The author should review existing benchmarks and outline how this benchmark differs.

The dataset itself also seems to be problematic. It seems that there are syntax errors in the dataset which prevent it from loading into each store.
The author should clean up the dataset and make sure that the same number of statements can be loaded into each system.
Otherwise, the benchmark only tests the parser used, and further benchmarks are not comparable.

Considering that this submission is a full paper, I would suggest rejecting it and encouraging the author to resubmit.
The originality is rather low (besides comparing different systems), and the relevance of the results is hard to judge without knowing how the data was stored in the RDBMS and NoSQL systems (e.g. which indices were created, or whether all data was loaded).

Some minor details:

Section 2.9
The benchmark should be executed in an isolated setup. Using a personal laptop with 16 GB of RAM and setting the maximum RAM to 16 GB does not work out. I assume that many other programs and services are running on the same laptop, which disturbs the benchmark.
As such, I recommend using a dedicated server and checking before each run that the amount of available memory is equal for all setups.
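Such a pre-run check could be scripted; the following is a Linux-only sketch reading /proc/meminfo, with an illustrative threshold (a 16 GB setup as in the paper would want something closer to 14000 MB):

```python
# Linux-only sketch: read MemAvailable before each benchmark run and
# report whether the machine matches the expected baseline.
# The 1024 MB threshold is illustrative, not a recommendation.
def available_mb(meminfo_path="/proc/meminfo"):
    """Return available memory in MB, or None if it cannot be determined."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) // 1024  # kB -> MB
    except OSError:
        pass
    return None

def memcheck(threshold_mb=1024):
    avail = available_mb()
    if avail is None:
        return "memcheck: cannot determine available memory"
    if avail >= threshold_mb:
        return f"memcheck: OK ({avail} MB available)"
    return f"memcheck: INSUFFICIENT ({avail} MB available)"

print(memcheck())
```

Running this (and aborting on INSUFFICIENT) before every test series would make the memory conditions comparable across setups.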

Figure 1. The readability of the figure is rather low. I would suggest using different line styles. It is also unclear why there are marks 'start', '2.5m' and 'end'. Is that related to the dataset size?

Review #2
By Giorgos Stoilos submitted on 08/Aug/2017
Review Comment:

The paper presents a benchmark and an evaluation of storage systems for linked data. The paper attempts to compare different storage models and approaches for RDF (linked) data, such as well-established and modern RDBMSs, graph DBs, and triple stores.

Although the proposed topic (contrasting all these different storage models) is indeed important and interesting, and quite some engineering work has been conducted in collecting data and evaluating various systems, in my opinion the paper does not succeed in providing a sufficiently fundamental or scientifically deep comparison or results.

On the one hand, the proposed benchmark is ill described. There are no details about the queries that have been designed (how they have been designed, how large or complex they are, how many joins, etc.). A similar description is missing for the data model and the complexity of the RDF dataset. Is the dataset highly interconnected, or is it just small disconnected parts? On the other hand, with the massive number of RDF and SPARQL benchmarks out there and the evaluations that have been conducted, it is very difficult to justify why this is not just another collected dataset together with some queries, and to show originality and novelty.

Another key issue missing is a description about how the RDF data were converted and stored in the RDBMS and MongoDB. This is an important issue that needs a better description. It is especially important for MongoDB since key-value pairs are significantly weaker than triples.

In the evaluation, one should also present the number of tuples returned by each system. Perhaps I missed it, but are they all returning the same number of answers for all queries?

Comparing RDBMS systems with triple stores is slightly unfair in the sense that the latter are also supposed to perform some kind of RDFS reasoning either at loading or at query time (or at least they are supposed to be able to query interconnected graph-like data), hence it is not surprising that the RDBMS systems are faster. Especially if the dataset is quite loosely interconnected and the data can be easily mapped to the relational model, then this is indeed the case. Overall, it is not clear what the results of this experiment are. If one required some kind of RDFS reasoning then the RDBMS systems would definitely be useless (even though faster), but if one does not need any kind of reasoning then obviously the RDBMS systems are the choice to go with.

It is not clear which figures the observations in section 4 are referring to. Some conclusion is made, but it is hard to figure out how and why this conclusion is produced. It would be good to add pointers, e.g., Fuseki did this (see Fig 4 (x)).

The equation on page 6 should be clarified and made more precise. What is the difference between queryscenario and testseries?

Review #3
By Mirko Spasić submitted on 02/Sep/2017
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed
along the usual dimensions for research contributions, which include:

(1) originality: The paper is not outstandingly original but provides
some insights into the benchmarking topics

(2) significance of the results: Results presented in the article are
almost useless (will be explained later)

(3) quality of writing: Quite good, a couple of typos

General remarks: The paper proposes a new technology-agnostic
benchmark that tests fundamental data operations that are of interest
for a warehouse scenario, and that should be used for the evaluation
of different storage solutions. So, the main feature of this
benchmark should be fairness, in order to facilitate the
developers' selection of the most suitable technology. To
achieve that goal, the benchmark must not favor any of the storage
solutions through a wrongly chosen performance metric or through
selected queries that are not equivalent across storage solutions. As
the benchmark is designed to evaluate relational DBMSs, NoSQL stores,
and triple stores, the queries have to be in different languages (SQL,
SPARQL), but still equivalent in their semantics and complexity. This
is not the case here, as will be explained in detail.

[Sec 1]:
In the Introduction, there is a statement that there is no benchmark
that combines the 4 mentioned properties. Actually, there is: the LDBC
Social Network Benchmark. It tests fundamental data operations, it is
technology-agnostic, it evaluates relational DBMSs, NoSQL stores,
triple stores, graph database systems, etc., and it operates on
synthetic datasets that mimic all real-world characteristics.

[Sec 2.8]:
If a computer has 16 GB of RAM, it is not a good idea to give all of
it to the database system.

In order to start the Virtuoso server, it is necessary to have the
virtuoso.ini file in the current directory. If that is not the case,
and you start the server in the foreground (just as the author
mentioned, with the +foreground option), it is not true that there is
no error message. You will see: "There is no configuration file
virtuoso.ini". Some of the parameters are used with '+', but some of
them are supposed to be used with '-', e.g. -f, which is the same as
+foreground.

[Sec 3]: The performance metric doesn't make sense. I don't see the
reason why the preparation time should affect the performance score in
the following equation: performance(database, queryscenario,
testseries) = (prepare + execution1 [+ execution2 + execution3]) / 3.
For example, in an RDBMS, the preparation step includes the creation
of indices, and there is no use case scenario in which we would drop
an index before the execution of each query and build it over and over
again. Usually, these indices are built once, before or after loading
the data, and these times should affect loading times, not query
execution times. On the other hand, the preparation phase for triple
stores does not exist for almost all query scenarios, and all of these
measurements for Fuseki and Virtuoso are almost 0, while building
indices takes a lot of time (a couple of seconds for the MEDIUM test
series). This is not fair and it is triple-store biased. This is the
reason why the author considered Virtuoso "the best aggregation
performer" in Section 4.5, and it is not true at all that "Virtuoso
already stores atomic field information instead of complete records",
as the author stated. For example, in one query scenario the
execution times are:

SQLite-Xerial 1112.13 ms
PostgreSQL 1592.18 ms
Virtuoso 3018.93 ms

but in figure 4b you presented PostgreSQL as the best performer (1.0),
followed by Virtuoso (1.11) and then by SQLite-Xerial (2.18). The
reason for this is the preparation time. It is very similar in all the
other query scenarios. For example, in another scenario Virtuoso is
faster than SQLite-Xerial, and one order of magnitude faster than
ArangoDB, but that cannot be seen from the performance metric:
Virtuoso (1.0), SQLite-Xerial (3.63) and ArangoDB (7.05).
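The effect of folding preparation time into the score can be reproduced with a few lines of Python. The execution times are the measurements quoted above; the preparation times are invented solely to demonstrate how the metric flips the ranking:

```python
# Execution times (ms) are the measurements quoted above; the
# preparation times are HYPOTHETICAL, chosen only to show how the
# paper's metric can invert a raw-execution ranking.
execution = {"SQLite-Xerial": 1112.13, "PostgreSQL": 1592.18,
             "Virtuoso": 3018.93}
prepare = {"SQLite-Xerial": 4800.0, "PostgreSQL": 1100.0,
           "Virtuoso": 0.0}  # triple store: essentially no preparation

def performance(db):
    # The paper's formula, single-execution case:
    # performance = (prepare + execution1) / 3
    return (prepare[db] + execution[db]) / 3

by_execution = sorted(execution, key=execution.get)  # raw speed
by_metric = sorted(execution, key=performance)       # paper's score

print(by_execution)  # ['SQLite-Xerial', 'PostgreSQL', 'Virtuoso']
print(by_metric)     # ['PostgreSQL', 'Virtuoso', 'SQLite-Xerial']
```

With zero preparation for the triple store and index-building costs for the relational systems, the metric ranks Virtuoso above SQLite-Xerial despite its slower raw execution, which is exactly the distortion criticized above.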

[Sec 4]:
A lot of observations from this section cannot be valid because of
the wrongly chosen performance metric.

[Sec 4.1]:
Errors_Virtuoso_SMALL.txt: This is not a bug in Virtuoso, this is
the configuration issue. You should increase max vector length setting
in virtuoso.ini file. It is the same problem reported in
Errors_Virtuoso_MEDIUM.txt. Virtuoso is well known because of its
scalability, so the issue reported in Errors_Virtuoso_LARGE.txt stops
it from competition on this scale factor. It would be better to fix
the syntax of RDF file, and repeat the experiment than excluding
Virtuoso from this part of game.

[Sec 4.3]:
In the entity retrieval query scenario, there are two main
problems. The first one lies in the fact that the SQL queries executed
against relational DBMSs are not equivalent to the SPARQL queries,
while the second one is the use of DESCRIBE query statement, which is
not strictly specified in the W3C specification. DESCRIBE may produce
quite different results depending on describe-mode. I would not
recommend using constructs that are not strictly defined by the
standard. The author uses the following query:

describe * where {
  ?s ?p ?o .
  ?s dcterms:identifier ?identifier .
  FILTER( ?identifier IN ( ##ids## ))
}

This is similar to:

select ?s ?p ?o where {
  { ?s ?p ?o .
    ?s dcterms:identifier ?identifier .
    FILTER( ?identifier IN ( ##ids## )) }
  UNION
  { ?s ?p ?o .
    ?o dcterms:identifier ?identifier .
    FILTER( ?identifier IN ( ##ids## )) }
}

which is much more complicated than the relational query:

select * from justatable where dcterms_identifier in (?);

So, this is unfair against triple stores, and favors relational
DBMSs. The equivalent query should be:

select ?s ?p ?o
where {
  ?s ?p ?o .
  ?s dcterms:identifier ?identifier .
  FILTER( ?identifier IN ( "011363517" ))
}

All of these queries will be executed by Virtuoso (on my computer,
which has power similar to the one used, same configuration, Test
Series MEDIUM) in 1-2 ms, while the author's proposed SELECT statement
in Listing 1 will take about 7 s. So, this is very unfair to
Virtuoso. In this query scenario, the ordering is not mentioned
anywhere, so Virtuoso's bug referenced in [9] does not affect this
query at all.

[Sec 4.4]:
In the Conditional Table Scan scenario, the relational DBMSs are
favored in the same way as in the previous section. The needed query
should be:

select ?s ?p ?o
where {
  ?s rdf:type <...> .
  ?s ?p ?o
}

instead of:

describe *
where {
  ?s ?p ?o .
  optional { ?s rdf:type ?type . }
  ?s rdf:type <...> .
}

The first query will be executed by Virtuoso in 300 s (on my computer,
as explained before), which is comparable to the relational systems.

The second conditional query should be:

select ?s ?p ?o
where {
  ?s dcterms:title ?title .
  filter regex(?title, 'stud(ie|y)', 'i') .
  ?s ?p ?o .
}

which will run much faster than the query executed against Virtuoso.

Queries executed against Fuseki are not correct either. The pattern:
optional { ?s rdf:type ?type . }
is not needed at all, while the pattern
optional { ?s dcterms:title ?title . }
should not be optional, as there is the following filter:
filter regex(?title, 'stud(ie|y)', 'i') .

Similar remarks apply to the 3rd conditional query.

[Sec 4.5]:
In the Aggregation section, the queries are comparable, but the
conclusions are not (see the remarks about the performance metric).

[Sec 5]:
Because all of the aforementioned remarks, this section is quite
wrong. The author said that Virtuoso was well in the certain deletion
MEDIUM, but the reason for that lies in the fact that
UPDATE_LOW_SELECTIVITY_PAPER_MEDIUM finished with an error, and there
was no triple that should be deleted in this scenario.

Minor technical issues:
page 3: Do not reference pages (e.g. 'see page 4'); instead,
refer to tables, figures, etc.
page 5: rephrase the following: "Table 3 provides an overview of
characteristic properties these databases"