Trace-Based Analysis of Large Rule-Based Computations

Tracking #: 577-1783

Authors: 
Terrance Swift

Responsible editor: 
Guest Editors RR2013 Special Issue

Submission type: 
Full Paper
Abstract: 
Knowledge representation systems based on the well-founded semantics can offer the degree of scalability required for semantic web applications and make use of expressive semantic features such as Hilog, frame-based reasoning, and defeasibility theories. Such features can be compiled into Prolog tabling engines that have good support for indexing and memory management. However, due both to the power of the semantic features and to the declarative style typical of knowledge representation rules, the resources needed for query evaluation can be unpredictable. In such a situation, users need to understand the overall structure of a computation and examine problematic portions of it. This problem, of {\em profiling} a computation, differs from debugging and justification which address why a given answer was or wasn't derived, and so profiling requires different techniques. In this paper we present a trace-based analysis technique called {\em forest logging} which has been used to profile large, heavily tabled computations. In forest logging, critical aspects of a tabled computation are logged; afterwards the log is loaded and analyzed. As implemented in XSB, forest logging slows down execution of practical programs by a constant factor that is often small; and logs containing tens or hundreds of millions of facts can be loaded and analyzed in minutes.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By François Bry submitted on 12/Dec/2013
Suggestion:
Accept
Review Comment:

The article makes a very good impression. It describes a specific form of run time analysis or logic programs evaluated after SLG resolution.

The srticle is well written, pleasant to rad, and addresses all issuers a reder would expect to be addressed: Description of the approach, recal of the underlying techniques/evaluation method, üpresentation of the analysis, report on an experimental evaluation.

Review #2
By Maurice Bruynooghe submitted on 03/Feb/2014
Suggestion:
Major Revision
Review Comment:

Review of
Trace-Based Analysis of Large Rule-Based Computations
paper by Terrance Swift

This paper describes a logging tool for XSB Prolog (forest logging)
that can help users in profiling a computation and in identifying the
parts of code that are at the origin of bottlenecks in the
computation. The relevance for the Semantic Web Journal is that tools
intended for the semantic web, such as Flora, Silk, and Fidji have XSB
Prolog underneath and hence that these tools can help users of those
systems in analysing applications that consume unreasonable amounts of
computational resources. The paper includes some convincing examples of
the usefulness of the tool.

That being said, this is a technical paper about the XSB system and
SWJ is not the first place I would look for a paper about profiling
XSB programs. However, I leave the judgement about the scope for the
editors.

Overal, the paper is a valuable technical contribution that addresses
a problem that is relevant for XSB users and deserves to be published
somewhere. However, there are some glitches here and there and the
paper needs a careful revision. Below I give a list of detailed
comments.

One more high level remark: In section 3 a morphism H is defined on a
forest, however, it looks as the concept is not used to reconstruct a
forest from a log file (theorem 3.2). This needs clarification. More
general it is unclear whether every analysis starts with
reconstructing the forest from the log, or whether many questions can
be answered by a more direct interpretation of the info in the
log. Moreover, it looks as the proof (and the code used in it) of
theorem 3.2 is incomplete (or I missed something). This need to be
clarified in a second reviewing round.

DETAILS

Perhaps add "forest logging" to the title, e.g., "Forest Logging: A
Trace-Based Analysis of Large Rule-Based Computations".

p1 right bottom:
effort for the SIlk project.
-->
effort for the Silk project.

In ex 1.1: What stands "AP" for in AP biology questions ?

p 2 right top:
and A analyze logs-->
and analyze logs

footnote 3:
While it is fair to refer to other sources for more details about SLG,
I would recommend to recall in one sentence the idea of local
scheduling.

p3 right
(e.g. children -->
(e.g., children (twice)

Ex 2.1, Fig 1:
After step 16: why 17 is next and not 21?
There is a mutual dependency between both trees: how is it decided in
which tree an extra branch is started? Is it determined by the
scheduling or is it arbitrary in this case?

p4 left def 2.1
is there is a path from -->
if there is a path from

p4 right
unique maximal independent SCC
By "maximal" you mean maximal in the partial order over SCC's?

in ex 2.2
scheduling; and as --->
scheduling; as

Fig 2 (p6)
In the body of the second p clause: nor --> not

The code of the tree seems to be inconsistent with the forest:

In the third clause, to be consistent with the forest, the body should be?
:- t(X,Y,Z), not p(Y), not p(Z). (there is also a superfluous "(")

and the facts for t/3 should be (wrong order and incomplete)?
t(a,a,b).
t(a,b,a).
t(a,c,b).

This makes example 2.2 hard to follow

In the tree for p(b):
If you ignore the early completion and create node 9, you should
continue with it and extend it with a fail node? Only then, p(b) is
completed?

The failure of nodes 6 (node 11) should be before the creation of node
10?

End of example 2.2 (p5) "Since p(a) is completely evaluated with no
answers, conditional or otherwise, the evaluation determines it to be
failed"

This is the "ANSWER COMPLETION" operation of p7?

p5 right
top
"If S_called is the first tabled subgoal called in an evaluation, then
S_called is set to null"
It is S_caller which is set to null.

ANSWER RESOLUTION
you refer to the answer S_called\theta by A but A was not introduced.

NEGATIVE REGURN -->
NEGATIVE RETURN

bottom
(i.e. N is -->
(i.e., N is

p6
COMPLETION
By SCC_index, you mean a unique identifier for the SCC to which S
belongs?

p 7 left
ANSWER COMPLETION
An example of this is the failure of p(a) at the end of ex 2.2.
As said above, use the term ANSWER COMPLETION in the example.

Fig 3 (the log file)
As far as I understand, some things go wrong beyond node 15:

next should be
- answer resolution on node 11, leading to the creation of node 16
- na([1],reach(3,_v0),11) is associated with node 18, not node 17
- answer resolution on node 11, leading to the creation of node 19
- na([1],reach(1,_v0)... associated with node 19
....

p7 right
e.g. computations -->
e.g., computations

p8 def 3.2
"closest parent-child ancestor"
the concept has not been defined.
Why not saying "the closest ancestor of n whose selected literal is
tabled" And give at least one example e.g. in fig 1: H(2) = 1 or
perhaps show the whole condensed forest H(F)

The paragraph "In order to reconstruct an SLG ..."

"the parent of each logged fact f needs to be determined"
you mean "each entry in the log file"?

The whole paragraph is not very clear, I wonder whether you can use Figure 1
and its log file in Figure 3 to point out more precisely what is the
issue.

Also, the theorem claims you can reconstruct the trees of the original
forest and does not mention H(F). Neither does the proof. So, what is
the point of introducing H(F) if there is not a two phase process that
first uses the log to reconstruct H(F) en then H(F) to reconstruct(F)?

Definition 3.3
It is not very intuitive and hence not very elegant to say that two
empty bodies are distinguishable.
you would better say e.g. "and the sequences L2, ..., Ln and L′2, ..., L′n
are either distinguishable or empty" ??

case 1 and 2 in the definition: this is "either 1 or 2" (?)

Perhaps you can illustrate with an example what can go wrong when Li
unifies with a literal in Body'?
(as to motivate the requirement "distinguishable")

Theorem 3.2: You are using S without introducing it (Guess it refers
to the root node)

So, the theorem says that, (if enough is tabled) the forest can be
reconstructed.
But in the next sections, it is unclear for which parts of analysis it
is needed to reconstruct the full forest, or whether many answers can
be given by a more direct analysis of the log.

p 9 right
Proposition 3.2 -->
Theorem 3.2

p10
fig 4
582754 were calls to new subgoals
4460609 were calls to incomplete subgoals
3594936 were calls to complete subgoals
right aligning the numbers ie.
582754 were calls to new subgoals
4460609 were calls to incomplete subgoals
3594936 were calls to complete subgoals
would enhance readability

Discussion of analyze_an_scc(39):
one can read between the lines that edges correspond to calls to
predicates with incomplete tables, hence does not refer to the forest,
but it is not said to what it refers: to the subgoal dependency graph?
to H(F)?

p11 left
conists -->
consists

p12 left
it can be useful understand -->
it can be useful to understand

right
written into a internal buffers -->
written into internal buffers

p13 right
for the both the-->
for both the

p14
itself, when logging was turned on the left-recursive-->
itself, when logging was turned on, the left-recursive

footnote
but its queries not as instantiated. -->
but its queries are not as instantiated.

Was not Mireille Ducasse one of the pioneers of trace based analysis
for Prolog?
starting with: Mireille Ducassé: Opium+, a Meta-Debugger for
Prolog. ECAI 1988: 272-277
She deserves a few more refs?

p15
"under certain conditions has the
information available to construct a homomorphic im-
age of an SLG forest. "
I do not understand: none of your theorems mentions "homomorphic
image"

ref [12]
Justificatinos -->
Justifications

p18 right
you are mentioning "Body'" (four times)
You should introduce Body' as the part of the body following L
Body' is also used on p 20.

The code on p 19 has similar problems:
Body_{Right} is introduced, but further on Body' is used

line 33: N should be Node
line 36 and line 38:
you are adding an element to a set, so you should write
S :- S \cup \{element\}

line 42: "Delay - Lit"
You mean "Delays \setminus \{S_called\}" ?

l 45 should be S := S \setminus \{f\}

p20
The next case (lines 24-32 of Figure 8) L is
-->
In the next case (lines 24-32 of Figure 8), L is

I see two cases in the code on p20:
- there is a leftmost tabled literal in the body
- the body is empty

But the case where the body has only non-tabled literals is
missing (and at some point, only those literals are left in the
created "Child", they can become further instantiated but are never
removed?)

If you really reconstruct the original forest as the theorem claims,
you should handle the non-tabled calls and use resolution with the
program clauses to solve them? If the aim is to reconstruct H(F),
things might be different but need also clarification.

Review #3
Anonymous submitted on 06/Feb/2014
Suggestion:
Accept
Review Comment:

In this article, the author introduces so-called forest logging, a means to analyze tabled Prolog computations upon log information acquired during evaluation. In particular, in SLG resolution an evaluation amounts to a sequence of forests of SLG trees. The central contribution of this work is the design of forest logs, that is, a selection of logging information that is (i) amenable to analysis in Prolog itself, (ii) as informative as possible about the computation without (iii) slowing down the latter too much.

The proposed approach achieves (i) resorting to Prolog facts. As for (ii) formal results establish isomorphisms between the forest log and so-called Subgoal Dependency Graphs, as well as between the log and SLG trees for a (sub-)goal. By means of an implementation in XSB, experimental evidence is provided for (iii) based on (scaled) versions of the running examples, and a small set of Prolog rules for personal preference reasoning which serve as additional benchmarking examples (and are publicly available).

The problem addressed is a significant one, as the analysis of long lasting computations not only regarding termination but also regarding efficiency in case of eventual termination is important. The adoption of the presented techniques in the XSB system may be considered another witness of significance of the work, while I would rate the foundation of the log design in terms of formal properties establishing isomorphisms with computational structures most appealing. In this regard I just regret a bit the still-a-bit-sketchy nature of the proofs provided in an Appendix which, in my point of view, might have been stated at more technical detail. Also concerning technical quality, I also wonder whether a more formal account of (iii) could be given in terms of computational analysis (respectively stated more prominently).

The work is put into context of most closely related work in a corresponding section. While providing a correct digest of the state-of-the-art, this section might also serve the purpose of giving account of related issues and problems in a broader context, which I think readers of the journal would appreciate. Maybe it would also be a good place to expand a bit on the differences between profiling and debugging.

To sum up, the work in this article constitutes a significant and original contribution in the area of analyzing tabled Prolog computations. It is highly readable and the presented results appear to be technically sound, respectively plausible in case of experiments. I therefore recommend to accept the paper and in addition to my remarks above just list some minor comments and suggestions for improving a final version below.

Minor remarks and typos:

p1: * 'unpredictability': why not use 'undecidability'
* 'SIlk' should be 'Silk'

p2: * 'queries like those of Example 1': give an instance (repeat the query)
* profiling problem: the differences to debugging or justification
remains unclear, please exemplify. Can it be stated more formally?
* 'overhead of logging is a constant factor': which result establishes
this; what is the factor; partial logging or not?

p4: * In Definition 2.1 'is there is' should be 'if there is'
* 'number associated with each [...] correspond' should be
'number associated with each [...] corresponds'

p9: * In Definition 3.3: Clarify the connective for Conditions 1. and 2.
(they only make sense as alternatives, i.e., disjunctively
connected, but the way the definition is stated suggests a conjunctive
reading)
* In Theorem 3.2: 'reconstruct_tree(S)', here 'S' should be 'Subgoal'
p13: * 'KEs': presumably Knowledge Engineers but please introduce the
abbreviation.

p16: * 'Theorem 1' should be 'Theorem 3.1'; respectively 'Theorem 3.2' should
be used for subsequent occurrences of 'Theorem 2' (on p. 17).

p17: * 'a results' should be 'a result'
* 'such and any' should be 'such that any'

Review #4
By Erwan Jahier submitted on 16/Mar/2014
Suggestion:
Minor Revision
Review Comment:

* General remarks

This article is well-written and structured, albeit the numerous
typos.

It tackles the following important and difficult problem: how to
instrument a high-level programming language run-time system to
obtain an insightful trace of what's going during a program
execution, while being faithful to the operational semantics, and not
sacrificing performance. It does so on a tabled logic programming
language, and the result seems convincing (I've never programmed
using such kind of languages tough). This why I think this article
should be published.

However, I have one concern, that at least should be discussed: do
you really need to store the trace? In other words, why didn't you
try to perform all your analysis on the fly?

Indeed the performance results are sometime a little bit scary: 3.6GB
of log for a 30s execution! So what would happen for execution that
last for several minutes (not to mention hours)? Not sure your
partial logging solution would be enough.

Suggestion: instead of logging your events (or facts), you could talk
(e.g., via sockets) to another process that would run in co-routine
with your program. This process would be made of a
read-eval-print-loop command interpreter that would let one inspect
the program internals, or ask to move one or several facts
forwards. You would then have for free an embryonic debugger. You
could even base your read-eval-print-loop command interpreter on an
existing one (say, your favorite Prolog one). Now your debugger is
programmable, and you can perform all the analysis mentioned in this
article on-the-fly from this Prolog process.

nb : I'm not saying you should do that for this article.

* Minor remarks

- all acronyms (SLG,SDG,KE) should be expanded at least once
- you made inconsistent use of "Figure " versus "Fig. "
- p2,c2,ex2.2: it would be nice to have figure 2 in page 4 or 5
(versus 6) to ease the reading.
- p4,c2,l-10: "(e.g., in node 2 and 10)" : there is no multiple literals in node 2.
- p5 : please expand "tc" and "nc" (tabled call and negative call
I've finally guessed).
- p5 : "If L = not(S_called), a fact" : in my first reading, I could not
parse this sentence. Well, but finally I get it. reword ?
- p5: Note that what you call a Forest log is often referred to as a
trace, and facts as events. Maybe it is worth mentioning this
alternative terminology (?).
- p9,c2,sec4.1, par 2: I'd like to see the exec time of the call to
"forest_log_overview/0"
- p10,c2: lookupSentence/3, forwardSentence/3: not explained
- p12,c2:
"
tc(T1,T2,incmp,_Ctr),
check_variant(cmp(T1,S,_),1),
check_variant(cmp(T2,S,_),1)
"
why do you (seem to) suggest it is not Prolog code ?

* Some (quite a lot!) typos:
- p2,c2,l1 (ie, page2, column, line 1) : a weird " A " appears in the
middle of the sentence.
- there is a paragraph 2.1.1 but no 2.1.2, which looks weird. Maybe
you meant \subsection instead of \subsubsection ?
- p4,c1,def2.1+10l : "S2 in F is" -> "S2 in F if"
- p4,c2,l13: "Figure 1 correspond"->"Figure 1 corresponds"
- P4,c1,sec2.2+2l : "and produce" -> "and to produce"?
- p5,sec 2.2: "A call to a tabled subgoal When" -> "A call to a tabled subgoal. When"
and ditto for all items (\paragraph(?) of sec 2.2
- p5,c2: "REGURN"->"RETURN" please, use a spell checker. From now on,
I don't mention typos catchable by a spell checker.
- p6, Fig. 2 : "p(c):- nor p(a)." -> "p(c):- not p(a)."
"p(X):- t(X,Y,Z),not p(X),not(p(Y)." -> "p(X):- t(X,Y,Z),not p(X),not(p(Y))."
also 3 terminating "." are missing.
- p7,c1, ex2.3: "Table 3" -> "Fig. 3"
- p10,c2, last sentence is weird ; maybe you meant "to provide" -> "that provide"?
- p11,c1: wrong line space + "conists"->"consists"
- p12,c1,l19: "understand"->"to understand"
- p13,c1,l3: "need be"->"need to be"
- p13,c1,l7: "may"->"may be"
- p13,c1,sec5.1: "Figure 1"->"Table 1"
- p15,c1,sec7: "show how"-> "shows how"
- p15,c2: "Acknowledgements"->"Acknowledgments." (2 typos)