Signal/Collect: Processing Large Graphs in Seconds

Tracking #: 566-1772

Authors: 
Philip Stutz
Daniel Strebel
Abraham Bernstein

Responsible editor: 
Andreas Hotho

Submission type: 
Full Paper
Abstract: 
Both researchers and industry are confronted with the need to process increasingly large amounts of data, much of which has a natural graph representation. Some use MapReduce for scalable processing, but this abstraction is not designed for graphs and has shortcomings when it comes to both iterative and asynchronous processing, which are particularly important for graph algorithms. This paper presents the Signal/Collect programming model for scalable synchronous and asynchronous graph processing. We show that this abstraction can capture the essence of many algorithms on graphs in a concise and elegant way by giving Signal/Collect adaptations of algorithms that solve tasks as varied as clustering, inferencing, ranking, classification, constraint optimisation, and even query processing. Furthermore, we built and evaluated a parallel and distributed framework that executes algorithms in our programming model. We empirically show that our framework efficiently and scalably parallelises and distributes algorithms that are expressed in the programming model. We also show that asynchronicity can speed up execution times. Our framework computes a high-quality PageRank on a large (>1.4 billion vertices, >6.6 billion edges) real-world webgraph in merely 136 seconds – achieved with only twelve commodity machines.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jacopo Urbani submitted on 20/Feb/2014
Suggestion:
Major Revision
Review Comment:

The paper presents a new framework -- signal/collect -- for graph
processing in a parallel and distributed scenario. The article first
describes the programming model, then moves to the description of the
implementation, and finally shows the evaluation over standard graph
problems.

Overall I liked the work. Clearly there is much engineering inside the
framework, which is reflected in a fairly good implementation. However,
the paper lacks significantly in its presentation. In my opinion, the
biggest problems are in the formalization of the programming model
and in the evaluation of the system. In the first case, the paper does
not formalize precisely enough what the basic building blocks of the
computation are, and often the feeling is that the authors are simply
describing the APIs of the system. In the second case, the paper
suffers from not showing how its performance compares against other
systems. There are plenty of mature systems that perform the same
operations as Signal/Collect, and the paper should compare its approach
with at least some of them to give the reader a complete picture of
the contribution.

I suggest that the authors significantly revise their manuscript and
re-submit an improved version. I report below some concrete
points which should be fixed or addressed in such new revision.

Programming model:

- One very important missing part that you haven't defined is what a
signal is. This is a crucial concept and it is fundamental that you
define it, since other parts of your definitions require it
(e.g. "v.signalMap").

- You state that every vertex has several edges, but you don't say what
an edge "is". Is it perhaps a pair of the type (e.sourceId, e.targetId)?
But then what is the difference between "e.source" and "e.sourceId"?

- Similarly, the definition of the vertex is also incomplete. First of
all, you don't define how a vertex is represented in your model. Is it
perhaps defined by an object? A variable? Also, what are the
computational intermediate results contained in "v.state"?

- Furthermore, the paper mentions that there is a "default vertex
type", which contains abstract methods. From this I assume that a
vertex is represented by a predefined object, but this is not stated
anywhere. At the end of this section, I had the vague idea that a
signal/collect program consists of a number of specified vertices,
each having "signal" and "collect" methods. However, it is
completely unclear how I should map a graph onto your data
structures. For example, suppose that I have a friend-of-a-friend
network where each node contains the name of a person. Where do I
store such information? In v.id?
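To make the ambiguity concrete, here is a minimal sketch of what I, as a
reader, would expect a FOAF vertex to look like in a vertex-centric
model. The names (PersonVertex, the Int-valued signals, the separate
name field) are purely my illustrative assumptions, not the paper's
actual API:

```scala
// Hypothetical sketch only: PersonVertex and its members are guesses at
// how domain data (a person's name) could live next to the computational
// state, separately from the vertex id.
class PersonVertex(val id: Int, val name: String, var state: Int = 0) {
  // collect: fold the incoming signals into a new vertex state
  def collect(uncollectedSignals: Iterable[Int]): Int =
    state + uncollectedSignals.sum

  // signal: the value this vertex would send along an outgoing edge
  def signal: Int = state
}
```

If something along these lines is intended, the paper should say so
explicitly; if the name is instead meant to go into v.id, that should be
stated as well.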

- Section 3.2 describes the computational model, but already the first
paragraph refers to an attribute "target" which is not specified
anywhere. Also, why is it written in italics? Shouldn't it be written
in bold?

- The sentence "Note that...message bus" is completely unclear to
me. I re-read it a few times, but I still could not understand which
operations/messages you are talking about.

- You continue this discussion by defining two additional methods
(doSignal and doCollect) on every vertex. Why haven't you defined
them in Section 3.1, next to the signal/collect methods?

- One more concern about these two methods: can they be overridden by
the user? If not, then why do you include them as part of the model?
Aren't they an implementation detail?

- You mention "uncollectedSignals". I really could not find the
definition of it (I guess this is because "target" is not
defined). The same applies for "lastSignalState".

- Section 3.2: The advantages of having an asynchronous computation
are purely an implementation matter (lower latency, etc.). They have
nothing to do with the description of the programming model. I would
move such a discussion to another part of the paper.

- I disagree that an asynchronous system better handles system
overloads. You should either elaborate further or cite some works on
this (and in any case this concerns the execution, not the
programming model).

- The difference between a data-graph vertex and a data-flow vertex is
not so clear to me. From the text, I understand that both take the
signals (the first from v.signals, the second from
v.uncollectedSignals) and process them according to some logic. What
is the difference between the two?

- About Section 3.4: I fail to see how the "extract" and "reduce"
functions fit in your programming model. I mean: signal/collect is a
'vertex-centric' approach where most of the computation is defined
inside the vertex. Now, these two functions are applied to the
entire graph, but it is unclear who executes them and how. To me, it
appears as if you wanted to include the map/reduce primitives in your
environment. If that is the case, then this should be a feature
offered by the system, not by the programming model.
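For illustration, the reading I arrived at is a MapReduce-style global
aggregation over vertex states, along these lines (a sketch under my
own assumptions about the signatures, not the paper's definitions):

```scala
// Sketch: extract maps each vertex to a partial value, reduce combines
// two partial values; the system, not any single vertex, would run the
// aggregation over the whole graph.
case class Vertex(id: Int, state: Double)

def extract(v: Vertex): Double = v.state

def reduce(a: Double, b: Double): Double = a + b

// Global aggregation, e.g. the sum of all vertex states:
def aggregate(graph: Seq[Vertex]): Double =
  graph.map(extract).reduce(reduce)
```

If this is what is meant, then running aggregate is inherently a
system-level operation, which is exactly why it sits oddly inside a
vertex-centric programming model.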

System Implementation:

- First of all, I question your decision to use Scala. You obviously
target problems where there are large amounts of data and heavy
computation. In this case performance is important, if not
crucial. In this context, why did you choose a language that is
objectively slower than others that are currently used to develop
similar frameworks (e.g. C++)?

- I really miss a figure in Section 4.1 showing all the actors that
contribute during the execution of a program. You describe such a
figure with words, so it should be fairly easy to translate them
into a picture (which is in my opinion much more readable).

- Why do you store your vertices in a hash map? It is true that the
average access time is O(1), but the space consumption is much
higher. Can you at least provide details on the implementation of
such a data structure? Did you optimise it? (I assume so, since it's
such a crucial component in your system.)

- "...For large graphs it usually leads to balanced
partitions...". Doesn't a partitioning based on vertices generate
high load imbalances? For example, the vertex "foaf:Person" might
have much more work to do than a less popular concept... You must
elaborate more on this, perhaps reporting some numbers about it.
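To illustrate the concern: a hash-based placement along the following
lines (a hypothetical sketch, not necessarily the paper's actual
scheme) balances vertex counts across workers, but not per-vertex work:

```scala
// Illustrative hash-based vertex-to-worker assignment.
def workerFor(vertexId: Int, numWorkers: Int): Int =
  math.abs(vertexId.hashCode) % numWorkers

val numWorkers = 4
val assignment = (0 until 1000).groupBy(id => workerFor(id, numWorkers))
// Every worker receives the same number of vertices, but a single
// high-degree vertex (e.g. "foaf:Person") still lands entirely on one
// worker, together with all of its signalling work.
```

Some measured numbers on per-worker load for a skewed real-world graph
would make the balance claim much more convincing.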

About the evaluation:

- I really don't know what you mean by the "simplicity of a
program". Also, later on you explain that you implemented "simplified"
versions of the algorithms. What does this mean? Did you drop some
features? Are they incomplete/not sound/approximations?

- Your implementation of a transitive closure is really not
efficient. I would drop this example: there is little value in
showing that your programming model is useful for implementing
inefficient algorithms.

- Your analysis of the scalability is incomplete. You do not report any
evaluation where you only increase the size of the input and keep the
resources constant. These experiments are necessary to show up to
what limit signal/collect is capable of handling large
inputs. (E.g. if I use only one machine, what is the largest graph I
can process?)

- "We are well-aware that comparing different systems on different
hardware is problematic". Why haven't you tested them on your
machines? GraphLab/PowerGraph is open source and fairly well
documented. GPS is also open source. Nothing stops you from
installing them on your machines, testing their performance against
your approach, and showing the results. I see the lack of comparison
against other systems as one of the major problems of this article
(especially because most of them already provide implementations of
standard algorithms like PageRank).

- "The synchronous model of Signal/Collect is similar to BSP". I see
the synchronous implementation of signal/collect exactly as an
implementation of BSP. Under which circumstances do you differ?

MINOR COMMENTS (l stands for left column, r for right):

- page 1, abstract: what is a web graph?
- page 1, abstract: "only twelve machines" -> I would remove the word "only". In certain scenarios 12 machines is a very high number.
- page 1, l: retweet-> perhaps "tweets" is better?
- page 1, l: "graphs are one of the most versatile data structures." => I find it a rather strong statement. I suggest to downgrade it...
- page 1, r: "graph processing program scales only [...] takes advantage additional resources" => This is only one definition of the many definitions of scalability. I would rephrase it.
- page 2, r: "We implemented a open source framework" => I don't see this as a contribution of the paper.
- page 3, r: you define "compute graph" in section 3.1, but use it before.
- page 3, r: what does it mean that the vertices are the computational units? I would rephrase this sentence...
- page 4, l: why sometimes terms are in bold while other times they are in italics? Do you follow a specific convention?
- page 4, r: In section 3.2.1 you should add a citation for BSP
- page 5, l: At the end of section 3.2.1 I would mention that scoring will be the topic of the next section. It was not completely obvious to me until I read it.
- page 5, l: the method v.scoreSignal returns a double, but it is either 0 or 1. Why don't you use a boolean?
- page 5, l: similarly as before, v.scoreCollect() returns an integer. They why do you use a double?
- page 5, l: "now that we have extended the basic model with scoring _,_ we specify"
- page 5, l: "computational steps" => loops?
- page 6, l: what is an "execution environment"? Do you mean a program? operating system?
- page 7, r: I find the term commodity cluster quite informal. I suggest to drop the commodity, which does not add much to the sentence.
- page 8, r: "...and even query processing". Remove "even".
- page 8, r: "without much boilerplate" is too informal. Rephrase it.
- page 9, l: "have a look" => I would rephrase it with another expression.
- page 11, r: "Granovetter uses..." I would repeat the citation or use another mechanism to make clearer who Granovetter is.
- page 13, r: "The presented algorithms were chosen". This sounds like you are trying to sell your approach. I would stick to the facts and remove this paragraph entirely.
- page 13, r: "6174 processors"? Is this a typo? What kind of processors are you talking about?
- page 16, r: unless I missed something, I see figures 16 and 17 referenced here for the first time, after figure 18.
- page 18, r: I know what the actor model is, but perhaps a citation would be good, given the community you target with this work.

Review #2
By Michael Granitzer submitted on 27/Feb/2014
Suggestion:
Minor Revision
Review Comment:

The paper presents the signal/collect programming model for large
scale graph processing. The processing model has been implemented in
Scala (using Akka). As evaluation, the authors present implementations
of a broad range of graph problems in their framework, as well as
scalability estimates for PageRank and Single Source Shortest Path.

The paper is very well written and structured. I really enjoyed
reading it. The evaluation results are impressive, with PageRank
calculated on a billion-vertex web graph in under 2 minutes on a
cluster of commodity hardware. So I see a high-impact contribution in
terms of future large-scale graph processing here. Although the paper
seems a little bit off topic for the Semantic Web Journal, it makes
contributions towards possible future large-scale inferencing using
the framework.

There are two minor points the authors should address from
my point of view:

- Although the authors show a broad range of algorithms, the guarantee
of algorithmic properties like convergence remains unclear. This is
especially true when using score-guided evaluation, where some
updates get lost. The methods should each be extended with a
paragraph on the algorithmic aspects.

- The framework seems to be efficient for sparse/scale-free
graphs. For dense graphs it remains unclear to me how the framework
improves, in terms of performance, over a pure matrix-based
representation, which is simply more efficient. The authors should
discuss more explicitly the properties of graphs for which their
framework shows good/bad behavior.

Two syntactical things:

- pg 9: "Another interesting feature for Linked Open Data"... LOD and
graph processing are not motivated before. A better transition from
PageRank to LOD would be needed here (just 1 sentence or so).
- Syntax: Sec. 6. "One way is to interpret a graph is as "

Review #3
By Sang-Goo Lee submitted on 05/Mar/2014
Suggestion:
Minor Revision
Review Comment:

The paper presents the Signal/Collect framework, a programming model for graph processing. A diverse set of example codes and discussions of its implementation are presented. The implementation is a practical graph processing system with impressive performance.

The strong points of the paper are:
- the programming model and its extensions
- a good collection of example codes and implementation details
- relatively clear presentation

The weak points of the paper are:
- it is unclear where the novel contributions are. General contributions, such as what this model can do, are presented. But what are the strengths of the new model compared to other models such as Pregel? In Section 6, there is a part discussing Pregel and other models, but the authors fail to explicitly argue how their model is better than these.
- It seems the asynchronous mode is a feature that differentiates the proposed model from Pregel. There needs to be a more detailed discussion of its need, challenges and risks, and how they are overcome in the implementation.
- If there are other aspects that differentiate the proposed model from previous works such as Pregel, they should be explicitly stated and discussed.
- The experiments should be designed to verify these novelties, as well as to show that the model/system works correctly.

Overall, the authors should focus more on presenting, with justifications, the features that are novel to their model/system.