Review Comment:
The paper presents a new framework -- signal/collect -- for graph
processing in a parallel and distributed scenario. The article first
describes the programming model, then moves to the description of the
implementation, and finally shows the evaluation over standard graph
problems.
Overall I liked the work. Clearly there is much engineering inside the
framework which is reflected in a fairly good implementation. However,
the paper lacks significantly in its presentation. In my opinion, the
biggest problems are in the formalization of the programming model,
and in the evaluation of the system. In the first case the paper does
not formalize precisely enough what are the basic building blocks of
the computation, and often the feeling is that the authors are simply
describing the APIs of the system. In the second case, the paper
suffers from not showing what are the performance against other
systems. There are plenty of mature systems that do the same
operations of signal/collect and the paper should compare its approach
with at least some of them to give the reader a complete picture of
the contribution.
I suggest that the authors significantly revise their manuscript and
re-submit an improved version. I report below some concrete
points which should be fixed or addressed in such new revision.
Programming model:
- One very important missing part that you haven't defined is what a
signal is. This is a crucial concept and it is fundamental that you
define it, since other parts of your definitions require it
(e.g. "v.signalMap").
- You state that every edge has several edges, but you don't say what
an edge "is". Is perhaps a pair of the type (e.sourceId, e.targetId)?
But then what is the difference between "e.source" and "e.sourceId"?
- Similarly, also the definition of the vertex is incomplete. First of
all, you don't define how a vertex is represented in your model. Is
perhaps defined by a object? a variable? Also, which are the
computational intermediate results contained in "v.state"?
- Furthermore, the paper mentions that there is a "default vertex
type", which contains abstract methods. From this I assume that a
vertex is represented by a predefined object, but this is not stated
anywhere. At the end of this section, I had the vague idea that a a
signal/collect program consists in a number of specified vertices,
each having a "signal" and "collect" methods. However, it is
completely unclear how I should map a graph in your data
structures. For example, suppose that I have a friend-of-a-friend
network where each node contains the name of a person. Where do I
store such information? In v.id?
- Section 3.2 describes the computational model, but already the first
paragraph refers to an attribute "target" which is not specified
anywhere. Then, why is it written in italics? Shouldn't be written in
bold?
- The sentence "Note that...message bus" is completely unclear to
me. I re-read it a few times, but still I could not understand which
are the operations/messages you are talking about.
- You continue this discussion defining two additional methods
(doSignal and doCollect) on every vertex. Why haven't you defined
them in section 3.1, next to the signal/collect methods?
- One more concern about these two methods: can these methods by
overriden by the user? If not, then why do you put them as part of
the model? Aren't they an implementation detail?
- You mention "uncollectedSignals". I really could not find the
definition of it (I guess this is because "target" is not
defined). The same applies for "lastSignalState".
- Section 3.2: The advantages of having an asynchronous computation
are purely an implementation matter (lower latency, etc.). They have
nothing to do with the description of the programming model. I would
move such discussion in another part of the paper.
- I disagree that an asynchronous system better handles system
overloads. You should either elaborate better or cite some works
about it (and anyway this concerns the execution, and not the
programming model).
- It is not so clear to me the difference between data-graph vertex
and data-flow vertex. From the text, I understand that both takes
the signals (the first from v.signals, while the second from
v.uncollectedSignals) and process them according to some logic. What
is the difference between the two?
- About section 3.4. I fail to see how the "extract" and "reduce"
functions fit in your programming model. I mean: signal/collect is a
'vertex-centric' approach where most of the computation is defined
inside the vertex. Now, these two functions are applied to the
entire graph, but it is unclear who executes them and how. To me, it
appears like you wanted to include the map/reduce primitives in your
environment. If that is the case, then this should be a feature
offered by the system, and not by the programming model.
System Implementation:
- First of all, I question your decision to use Scala. You obviously
target problems where there is a large amounts of data and heavy
computation. In this case performance is important, if not
crucial. In this context why did you choose a language that is
objectively slower than others that are currently used to develop
similar frameworks (e.g. C++)?
- I really miss a figure in section 4.1 where it is shown all the
actors that contribute during the execution of a program. You
describe such figure with words, so it should be fairly easy to
translate them in a picture (which is in my opinion much more readable).
- Why do you store your vertex in a hashmap? It is true that the
average access time is O(1), but the space consumption is much
higher. Can you at least provide details on the implementation for
such data structure? Did you optimise it? (I assume so, since it's
such a crucial component in your system)
- "...For large graphs it usually leads to balanced
partitions...". Doesn't a partitioning based on vertex generates
high load unbalances? For example, the vertex "foaf:Person" might
have much more work to do than an less popular concept... you must elaborate more on this,
perhaps reporting some numbers about it.
About the evaluation:
- I really don't know what you mean with "simplicity of a
program". Also later on to explain what you implemented "simplified"
versions of the algorithms. What does this mean? Did you drop some features? Are they
incomplete/not-sound/approximations?
- Your implementation of a transitive closure in really not
efficient. I would drop this example: there is little value in
showing that your programming model is useful to implement
inefficient algorithms.
- Your analysis of the scalability is incomplete. You do not report any
evaluation when you only increase the size of the input and keep the
resources constant. These experiments are necessary to show up to
what limit signal/collect is capable of handling large
inputs. (e.g. if I use only one machine, what is the largest graph I
can process?)
- "We are well-aware that comparing different systems on different
hardware is problematic". Why haven't you tested them on your
machines? Graphlab/powergraph is opensource, and fairly
well-documented. Also GPS is opensource. Nothing stops you from
installing it on your machines, testing the performance against your
approach, and showing the results. I see the lack of comparison
against other systems as one of the major problems of this article
(especially because most of them provide already the implementation
of standard algorithms like pagerank).
- "The synchronous model of Signal/Collect is similar to BSP". I see
the syncronous implementation of signal/collect exactly as an
implementation of BSP. Under which circumstances do you differ?
MINOR COMMENTS (l stands for left column, r for right):
- page 1, abstract: what is a web graph?
- page 1, abstract: "only twelve machines" -> I would remove the word "only". In certain scenarios 12 machines is a very high number.
- page 1, l: retweet-> perhaps "tweets" is better?
- page 1, l: "graphs are one of the most versatile data structures." => I find it a rather strong statement. I suggest to downgrade it...
- page 1, r: "graph processing program scales only [...] takes advantage additional resources" => This is only one definition of the many definitions of scalability. I would rephrase it.
- page 2, r: "We implemented a open source framework" => I don't see this as a contribution of the paper.
- page 3, r: you define "compute graph" in section 3.1, but use it before.
- page 3, r: what does it mean that the vertices are the computational units? I would rephrase this sentence...
- page 4, l: why sometimes terms are in bold while other times they are in italics? Do you follow a specific convention?
- page 4, r: In section 3.2.1 you should add a citation for BSP
- page 5, l: At the end of section 3.2.1 I would mention that scoring will be the topic of the next section. It was not completely obvious to me until I read it.
- page 5, l: the method v.scoreSignal returns a double, but it is either 0 or 1. Why don't you use a boolean?
- page 5, l: similarly as before, v.scoreCollect() returns an integer. They why do you use a double?
- page 5, l: "now that we have extended the basic model with scoring _,_ we specify"
- page 5, l: "computational steps" => loops?
- page 6, l: what is an "execution environment"? Do you mean a program? operating system?
- page 7, r: I find the term commodity cluster quite informal. I suggest to drop the commodity, which does not add much to the sentence.
- page 8,r: "...and even query processing". Remove even.
- page 8, r: "without much boilerplate" is too informal. Rephrase it.
- page 9, l: "have a look" => I would rephrase it with another expression.
- page 11, r: "Granovetter uses..." I would repeat the citation or use another mechanism to make clearer who Granovetter is.
- page 13, r: "The presented algorithms were chosen". This sounds like you are trying to sell your approach. I will stick to the facts, and remove this paragraph entirengly.
- page 13, r: "6174 processors"? Is this a typo? What kind of processors are you talking about?
- page 16,r: unless I missed something I see figure 16 and 17 referenced here for the first time after figure 18.
- page 18, r: I know what the actor model is, but perhaps a citation would be good, given the community you target with this work.
|