Signal/Collect: Processing Large Graphs in Seconds

Tracking #: 820-2030

Philip Stutz
Daniel Strebel
Abraham Bernstein

Responsible editor: 
Andreas Hotho

Submission type: 
Full Paper
Both researchers and industry are confronted with the need to process increasingly large amounts of data, much of which has a natural graph representation. Some use MapReduce for scalable processing, but this abstraction is not designed for graphs and has shortcomings when it comes to both iterative and asynchronous processing, which are particularly important for graph algorithms. This paper presents the Signal/Collect programming model for scalable synchronous and asynchronous graph processing. We show that this abstraction can capture the essence of many algorithms on graphs in a concise and elegant way by giving Signal/Collect adaptations of algorithms that solve tasks as varied as clustering, inferencing, ranking, classification, constraint optimisation, and even query processing. Furthermore, we built and evaluated a parallel and distributed framework that executes algorithms in our programming model. We empirically show that our framework efficiently and scalably parallelises and distributes algorithms that are expressed in the programming model. We also show that asynchronicity can speed up execution times. Our framework can compute a PageRank on a large (>1.4 billion vertices, >6.6 billion edges) real-world graph in 112 seconds on eight machines, which is competitive with other graph processing approaches.
Full PDF Version: 

Minor revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Sang-Goo Lee submitted on 09/Oct/2014
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors have made necessary changes to resolve the concerns I had with the original manuscript. Mainly, comparison with other models and a detailed discussion about the asynchronous mode. The changes are satisfactory.

Review #2
By Jacopo Urbani submitted on 22/Nov/2014
Minor Revision
Review Comment:

I’m pleased to see that the authors have addressed most of my previous points and that the new version of the paper has improved significantly.

I believe there are still a few problems, which are either related to the new additions, or refer to points that I missed in my previous review. Some of them are rather important (esp. the ones about the pseudocode), but they should all be quite easy to fix. Therefore, my recommendation is for a minor revision.

Below I describe them in more detail:

- page 4, l: I suggest to remove the text within brackets “(for practical reasons an implementation…”. It’s not necessary and it’s quite confusing to refer to implementation guidelines when you describe the model.

- page 4, r: I appreciate the example, but I think friendship is not the best example since it is best modelled with undirected edges. I would replace it with family relations (like fatherOf), which can be modelled with directed edges.

- page 5, r: The paragraph that starts with “Note: “ is also quite confusing, and I’m not sure it’s really needed.

- page 5, r: “A scheduler can assume…” I think you are referring to the implementation of your default scheduler, but I’m not sure which scheduler you are talking about. I would rephrase this sentence, adding an initial “For example, a scheduler can assume” or by removing the “can”.

- algorithm 2: if the user gives unreasonable values to s_threshold and c_threshold, then the signal/collect program can miss some signals/collects and therefore avoid some computation. I believe that the user/reader should be informed that arbitrary values of these two parameters might lead to incomplete or erroneous computation.

- algorithm 2: according to your explanation, scoreSignal() returns 0 if there are no changes in the state of the vertex. If I follow the pseudocode, at the very beginning v.scoreSignal() returns 0 because no state has yet changed (because it’s the first iteration) and consequently the program ends. You need to:
a) explain how states are initialised when the computation starts. For example, do they receive a default value? (notice that on sect. 5.1 you explicitly refer to initial states)

b) make sure that scoreSignal() returns 1 not only when the state changes, but also when a state is in its initial state. Otherwise, following your pseudocode the program will exit immediately.

- page 7, end of sect 3.3: how do you handle modifications in asynchronous settings? A brief sentence would make the explanation more complete.

- page 7, section 3.4: I suggest to remove the subsection about modularity. It does not add much to the discussion, and it is quite short.

- page 8, l: “There are implemntations that support features such as” do you refer to your framework or to other frameworks? Here either more explanation or some citations are needed.

- page 8,r : “To improve on this we could adopt some of the optimizations”. Did you implement them or not? If so, then clarify. If not, then add this point to the limitations section.

- page 18, l: I would explicitly mention why you chose the synchronous computation of Powergraph instead of the asynchronous. Looking at the experiments, I believe that you chose it because it is the fastest one, right?

- page 19, r: why did you measure the memory consumption on the coordinator machine, which should be only marginally involved in the computation? You should measure it on the workers since these are the ones that do the real work, and use this value to compare against powergraph.

- page 21, fig. 20: To improve readability, I would add a citation next to the name of the approaches mentioned in the figure.

Minor points:
- (general): some of the URIs reported on the footnotes are very long. I suggest to use a URI shortener to reduce the space.
- (page 2, l): “lies in the realization, that” -> remove comma
- (page 2, r): “ - We implemented and evaluated a framework that implements” -> “we evaluated a framework that implements” (implementing an implementation is obvious)

Review #3
By Michael Granitzer submitted on 17/Dec/2014
Review Comment:

The manuscript improved from its first version and i see all my remarks answered. The convergence of algorithms is discussed including more details on the score guided asynchronous mode. The comparison with other approaches, although hard to compare directly, shows the validity and competitiveness of the approach.

Hence i recommend to accept the paper.

Some small issues:
- The formatting of algorithm 3 should be improved. The while loop head does not look nice.
- pg 7. Column1, Data-Graph Vertex: "If a multiple signals" singular vs. plural.