Review Comment:
In this paper, the authors address an interesting problem: handling RDF graphs at scale using Pig and Hadoop. Standard Pig and the distributed file system for Hadoop are extended to support storing RDF data, answering RDF queries, and performing rule-based inference. This work is promising but not yet ready for publication, for the following reasons:
The most critical problem is novelty. Several existing works [1-3] already consider how to handle large-scale RDF graphs in a distributed environment. The proposed storage scheme is vertical partitioning, which was originally proposed in [4] and was later criticised and improved upon by [5,6]. [1] employs [6] as its underlying storage system, and I doubt whether the system presented in this paper can outperform [1]. I suggest that the authors conduct a comparative experiment.
Optimization should be another important topic in this paper, but its presentation is very vague. For example, the second paragraph of Section 3.3.2 says "any other extended optimization for RDF data processing can be applied to the original Pig query engine". However, no citation is given for these optimizations, so the statement is unsupported. Join ordering optimization is also discussed only vaguely. I also suggest comparing the join ordering optimization employed in this work with [7].
This paper claims to support rule-based reasoning, but only the transitive closure operation is presented. Existing work [8-11] has shown that MapReduce is a powerful tool for reasoning tasks. A discussion of how this work relates to that existing work is recommended.
Finally, the experimental evaluation is unconvincing. For example, in the query performance evaluation, the authors compare their prototype with several single-node RDF databases. However, the prototype runs on a cluster of 13 nodes, each with 8 GB of memory, whereas the single-node RDF databases run on a machine with only 3 GB of memory. This comparison is not fair. Furthermore, the proposed method should be compared with other distributed RDF stores such as AllegroGraph, [1], and [2].
Minor:
The distributed file system for Hadoop is HDFS. Although they are similar, HDFS and GFS are different systems.
References:
[1] Spyros Kotoulas, Jacopo Urbani, Peter Boncz and Peter Mika. Robust Runtime Optimization and Skew-Resistant Execution of Analytical SPARQL Queries on Pig. In Proc. of ISWC, 2012
[2] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable SPARQL Querying of Large RDF Graphs. In PVLDB, 2011
[3] Alexander Schätzle, Martin Przyjaciel-Zablocki, and Georg Lausen. PigSPARQL: mapping SPARQL to Pig Latin. In Proc. of SWIM, 2011
[4] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. Scalable semantic web data management using vertical partitioning. In Proc. of VLDB, 2007
[5] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. Hexastore: sextuple indexing for semantic web data management. In PVLDB, 2008
[6] Thomas Neumann, Gerhard Weikum. RDF-3X: a RISC-style Engine for RDF. In PVLDB, 2008
[7] Foto N. Afrati, Jeffrey D. Ullman. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering. 23(9): 1282-1298 (2011)
[8] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal. WebPIE: A Web-scale Parallel Inference Engine using MapReduce. Journal of Web Semantics. 10: 59-75 (2012)
[9] Jacopo Urbani, Frank van Harmelen, Stefan Schlobach, Henri E. Bal. QueryPIE: Backward Reasoning for OWL Horst over Very Large Knowledge Bases. In Proc. of ISWC, 2011
[10] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen. Scalable Distributed Reasoning Using MapReduce. In Proc. of ISWC, 2009
[11] Chang Liu, Guilin Qi, Haofen Wang, Yong Yu. Reasoning with Large Scale Ontologies in Fuzzy pD* Using MapReduce. IEEE Computational Intelligence Magazine. 7(2):54-66 (2012)