Speeding up on-disk RDF index lookups using B+Hash Trees
Solicited review by Peter Boncz:
This paper proposes to add a hash table to B+ tree indexes used to store RDF triples, in order to reduce the log(N) lookup cost to constant cost.
One of the prime arguments used to underline the relevance of the distinction between log(N) and constant cost is the supposed larger key space found in RDF (the authors state that the size of this keyspace is |s|*|p|*|o|).
This argument, however, is flawed, as the lookup cost in B+ trees does not depend on the key space but on the actual occurring amount of keys (N).
The amount of tree levels, and the cost off lookup (both O(N)), does therefore not increase more in RDF applications compared to other uses of this lookup structure.
The authors in the start of the paper are not clear about what lookup cost they focus on: CPU cost, CPU cache miss cost, or I/O.
In section 3.3 finally they state that reducing I/O (disk lookup cost) is their main aim. This is puzzling, since normally, both the B+tree and the Hash-enriched variant should have the same I/O cost: namely one I/O per lookup. The cost of B+tree lookup (without Hash) is also one I/O because B+-trees have a wide fan-out: a few hundred triples per page. This means that all layers of the B+tree except the leaves take less than 0.5% of the total space and in all practical configurations therefore are memory resident. Hence, only the leaf level access causes an I/O: one I/O per lookup.
The advantage of this hash-based index is reduced CPU computation cost - not I/O cost.
This observation is immediately the main criticism of the complexity analysis: it does not take into account RAM residency and is therefore not representative of reality.
Additionally, there is amzing little detail on the kind of hashing and its parameters that the authors propose. For instance, the amortized cost of rehashing depends on
the sizing and resizing policy of the hash table, which is not described. Also, hash table performance is very well understood in literature, but hard to talk about without
discussing the load factor (load factor is especially crucial in linear hashing).
(type: 3.3.3 "for cases" => four)
Formula (4) which is supposed to describe the size of the hash map and which might have shed light on its shape and configuration is an empty shell, as it states that it
takes pagesize * nchunks bytes, but fails to define how one could compute nchunks.
In 3.4 the argument is made that re-hash cost could be mitigated by using two disks, servicing requests from one, while the other is being rebuilt. This is not really a good ar ument, ecause competing approaches would gain a two-fold advantage of using the two disks in RAID such that twice as many requests per second could be serviced when compared to the proposed approach.
The evaluation section regrettably also has some problems:
* very small data sizes. The authors state that they could only fit a 30M dataset in 72GB, which amounts to 2KB per triple. Even if they mean that they follow the (not very realistic) approach of 15 permuted indices, we are still consuming more than 120 bytes per triple. This is a factor 10 too much. Normally this should be 12 bytes (24 bytes if one uses 8 bytes ID's which is not needed on these small data sizes). Best practice, also in the described as "simple" RDF-X system, make aggressive use of compression, such that the number of bytes per triple is very low. All experiments in this paper, also the supposedly "large" ones on 150M triples should easily fit the 72GB memory.
* unclear description of implementation. The authors suggest that they have created two different implementations of B+ Hash Trees but they do not describe how they are different. It also not clear in what the the classic B+ tree is different (are all based on Tokyo Cabinet?).
* non-standard queries. The authors make use of traces derived from a SPARQL system; which allows them to test the performance of the lookup generated by SPARQL systems without having to test their trees in a full system. This is fine. It does not explain however, why the creation of special queries was necessary. If this new index has some benefits, than we would hope these would be visible in the normal BSBM and LUBM queries. Now, the reader is left wondering why the authors choose not to tell us about the performance of these accepted queries that belong to these benchmarks which does not make the story more convincing.
* "one can observe that the speedup factor consistently grows as the dataset size increases", but the speedupo in BQ3, LQ1, LQ2 and LQ5 actually decreases slightly from 100M and 150M
* the on-disk experiments are supposed to measure disk interaction. The authors could have used performance measurements to support this, and to break down cost in I/O cost and CPU cost. I already pointed out that this would be crucial for understanding the (true) benefits of the proposed approach. Like this, the results are impossible to interpret, though it is obvious that what is measured is not much I/O because many average results are close or even significantly below a milliscond, while the quoted hardware would lead to at least 10msec per I/O (and the authors argued that each lookup in the normal B+ tree case takes quite a few, log(N), I/Os); plus the fact that these queries even though they are simple lookups consist all of multiple lookups and thus should cause multiple I/Os.
(typo: inballance)
As a whole, this paper does not describe the proposed idea in sufficient depth, and the evaluation has flaws, as well as the (often hand-waiving) reasoning around it. There might be some advantages in having a quick hash-lookup, but these would be mostly in reducing CPU work rather than I/O. What the authors do not touch upon much is memory consumption. They do admit that their approach does take extra memory (but not to which extent - the "nchunk" issue I pointed out). This can be very bad as it is well-known in RDF workloads that avoiding I/O is a matter of life and death, and in case of large datasets therefore the compactness of a representation is crucial. Thus the trade of less CPU for more space is not a wise one, espcially if one thinks that normal B+ trees are highly compressed, and compression is hard to realize in a B-tree as there is no oreder locality in the bucket array that can be exploited
Solicited review by Spyros Kotoulas:
In this paper, the authors are proposing an extension of B+Trees with a hash-table. In principle, the authors propose using a hash-table to avoid having to traverse the nodes of the B+Tree to reach the leaf node containing the data. To keep the hash-table to a manageable size, a prefix of the keys is used.
In general, this is a simple solution, which is, by all means, a positive thing. Although B+Trees present good scalability properties (O(log(n))), reducing this to (O(1)) is very positive, even at the expense of additional storage space. The presentation of the paper is good.
Nevertheless, I have some significant objections to the evaluation of this work:
-Why is the B+Hash Tree compared with a no-sql key-value store (Tokyo Cabinet)? How can we know that the no-sql store does not introduce overhead? In addition, since the approach of the authors includes a B+Tree, why did they not evaluate against that B+Tree?
-Being able to store only 30M triples in a machine with 72GB of RAM seems very memory-inefficient. Each (dictionary-encoded) triple requires more than 2KB of memory, which is very high. Why is this?
-It would be very interesting to report the loading times of the two approaches presented.
-It is worth noting that the B+Tree and the Hashtable in the authors' approach compete for memory. How is this resolved? This is not mentioned anywhere in the paper and the datasets are very small compared to the size of the memory. I would assume that, in this case, the entire hash-table fits in memory.
-The datasets are quite small. They authors claim that their approach is efficient for large datasets, yet only present results for small datasets, although they use a machine with quite a high amount of RAM. I would suggest that they include experiments for larger datasets.
-It would be very useful to include results for the entire query execution life-cycle, since the B+Hash Tree may have effects on system level. This would be easy to do, using, for example, the BSBM driver code.
-The authors do not report on the warm-up procedure. Which queries are posted and how long does it take?
Overall, this is a solid contribution, and I recommend accepting the paper, provided that the evaluation section is improved, as stated above.
Minor issues:
-In the abstract, it is mentioned that Linked Open Data amount to 25 billion triples while in the introduction to 31 billion triples.
-The introduction would benefit from additional references. For example, the authors describe the ways RDF can be mapped to relational databases, but do not provide enough references.
-On page 2, the sentence "Despite the fact... ...dismissed" is very odd. Please rephrase.
-The section 3.3.2 is not understandable; please rewrite. To the "update complexity" of what are the authors referring to?
-On page 11, "reads less disk pages" -> "reads fewer disk pages".
Solicited review by Gerd Groener:
Summary of the Paper
The paper presents an index structure called B+ Hash Trees,
which is an extension of B+ trees.
The aim of this index is to combine the benefits of both data structures:
(i) the constant lookup time of hash functions, and
(ii) the sorted representation of elements in B+ trees combined with the relatively low update costs.
Besides the formal description of the index,
the paper discusses the costs of operations like
insert, delete, lookup, as well as limitations of the approach.
The index is evaluated with four data sets, comparing the B+ Hash
Tree with the B+ Tree.
Evaluation
The paper is well written and organized. It presents an interesting approach, which is simple but seems to be quite efficient.
The strong points of the paper are the comparison of the approach with related work, complexity considerations and the elaborated
valuation.
Furthermore, I appreciate the discussion of limitations.
Unfortunately, several terms and notations are either not properly explained or not introduced at all,
i.e., clear definitions or descriptions are missing.
This could be better addressed in Section 3,
especially in Section 3.1.
- clarify the index structure compared to RDF-3X [10]:
How many B+ Trees and how many Hash Maps are contained in the index in total (permutations of s,p,o)?
- explain in more detail the architecture in Fig.1 (nodes in the B+ Tree)
Grammar:
p.3 second column: an approach which --> an approach, which
p. 5 second column: However in --> However, in

