Reasoning with Data Flows and Policy Propagation Rules

Tracking #: 1347-2559

Authors: 
Enrico Daga
Aldo Gangemi
Enrico Motta

Responsible editor: 
Guest Editors Linked Data Security Privacy Policy

Submission type: 
Full Paper
Abstract: 
Data oriented systems and applications are at the centre of current developments of the World Wide Web. In these scenarios, assessing what policies propagate from the licenses of data sources to the output of a given data-intensive system is an important problem. Both policies and data flows can be described with Semantic Web languages. Although it is possible to define Policy Propagation Rules (PPR) by associating policies to data flow steps, this activity results in a huge number of rules to be stored and managed. In a recent paper, we described how it is possible to reduce the size of a PPRs database by using an ontology of the possible relations between data objects, the Datanode ontology, and applying the (A)AAAA methodology, a knowledge engineering approach that exploits Formal Concept Analysis (FCA). In this article we check whether this reasoning is feasible in realistic scenarios. To this purpose, we study the impact of compressing a rule base associated with an inference mechanism on the performance of the reasoning process. Moreover, we report on an extension of the (A)AAAA methodology that includes a coherency check algorithm, that makes this reasoning possible. We show how this compression, in addition to being beneficial to the management of the rule base, also has a positive impact on the performance and resource requirements of the reasoning process for policy propagation.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Simon Steyskal submitted on 07/May/2016
Suggestion:
Major Revision
Review Comment:

=== Summary ===
In the present article, the authors elaborate on a strategy for compressing rule bases containing policy propagation rules by utilizing ontologies and particular knowledge engineering methodologies. More specifically, they report on performance results obtained when using the Datanode ontology for expressing relations between data objects as well as applying the (A)AAAA methodology, where the latter was extended by a coherency check algorithm that is required to allow reasoning over the compressed rule base.
The authors further argue that reported results demonstrate that using a compressed rule base allows for more efficient reasoning over stored policy propagation rules.

== Final Remarks ==
First of all, I definitely think that the present article has potential and discusses some interesting aspects, especially when it comes to making use of semantic relations/dependencies among actions (= relations between data objects), which is somewhat similar to the idea of explicit/implicit dependencies among ODRL actions reported in [5]. However, I'm not really convinced by the article's originality or its "quality". Although the authors clearly state that their article is based on [0] enriched with a coherency check algorithm and evaluations, I'm not really sure whether those extensions constitute a significant advancement wrt. [0]. It reads like [0] with [0]'s Sections 3 & 4 merged together, including all the problems such a merge (when not performed thoroughly enough) comes with. Additionally, there are various parts throughout the entire article that contain either colloquial phrases, typos, or simply wrong statements that could have been easily detected had it been proof-read at least once (and/or more thoroughly).
In particular, Section 5 (Experiments) needs some heavy reworking/polishing given that it represents one of the two major contributions of the present article.

Long story short, I suggest the authors (i) to proof-read/revise their article, (ii) especially extend/polish Section 5, and (iii) to focus more on the extended functionality wrt. [0] (including "validation of data flows with respect to policies") rather than rehashing (and sometimes poorly paraphrasing) the content of [0].

In the following, you find detailed comments for each section:

== Detailed Comments ==
------------------
0) Abstract
------------------
0.1) "In a recent paper, we described how it is possible to reduce.." -> "In a recent paper, we introduced strategies for reducing.."
0.2) "In this article we check .." -> coll.; use investigate/elaborate/...
0.3) "To this purpose, .." -> "For this purpose, ..."

------------------
1) Introduction
------------------
1.1) "Developers can access a large variety of (open) data, and publish the result of their processing on the Web." -> What's " the result of their processing"? rephrase!
1.2) ".. collection, integration, processing, and redistribution" -> oxford comma
1.3) ".. in order to satisfy the needs of users through remote applications[9]" -> what are "needs" of users? would appreciate an example here
1.4) "Differently from a closed.." -> "Different .."
1.5) ".. on the WWW the ownership and licensing of the data" -> "of data"
1.6) ".. the ownership and licensing of the data do not belong to .." -> someone can be authorized to license data, but "licensing of data" can't belong to someone
1.7) ".. do not belong to the owner of the end user application, and sometimes even to the entity .." -> ".. and sometimes not even to the entity .."
1.8) I found the transition to the paragraph starting with "In this complex scenario, .." a bit sharp. Where do the terms "policy" and "policy propagation" suddenly come from? How do those two terms relate to the previously introduced "complex scenario"?
1.9) The entire following paragraph is missing a clear motivation for using/defining PPRs. Why do I need PPRs? How could PPRs support resolving previously introduced "complex scenario"?

------------------
2) Related Work
------------------
2.1) First paragraph is redundant wrt. Section 1, however, the latter two sentences could be used for addressing 1.8) & 1.9) if introduced earlier.
2.2) "Providers of Smart City Data Hubs .. with data resulting from complex pipelines [..]" -> may want to have a look at [1,2] in that context too.
2.3) "It is important to develop technologies that allow policies to be negotiated [3]" -> why is that so? just citing [3] isn't sufficient; "that allow policies to be negotiated, since [3],"
2.4) ".. on the web [21]." -> ".. on the Web [21]."
2.5) "Policies can be represented on the Web." -> too short; what does "can be represented on the Web." mean? In a machine-readable format?
2.6) ".. ODRL Community Group work.." -> "works"
2.7) ".. software, services and data" -> oxford comma
2.8) "ODRL is an emerging information model .." -> only an information model? Although ODRL itself states that "The ODRL Policy Language provides a flexible and interoperable information model to support transparent and innovative use of digital assets .." [3] it clearly defines ODRL's aim to be ".. to develop and promote an open international specification for Policy Language expressions." which in my opinion goes beyond simply providing an information model.
2.9) In this context, I think it would be worth mentioning recent advancements wrt. an official W3C standard for defining permissions and obligations - the W3C Permissions & Obligations Expression Working Group [4].
2.10) ".. to support the exchange of formal descriptions of policies [15]." -> Although ODRL is also available as ontology, it generally defines semantics only in terms of natural language descriptions (see e.g. [5]), so I wouldn't claim that ODRL supports the exchange of "formal" descriptions of policies per se.
2.11) "establish a database of licence descriptions based on RDF and the ontology provided by ODRL (among others)." -> among what? other ontologies? other license descriptions? other providers? // licence, license, .. use one consistently
2.12) ".. what we aim to do here is to study the reasoning on the .." -> coll.; ".. in the present paper we aim at .."
2.13) "Defeasable logic is necessary to reason with deontic statements .." -> Defeasible; not necessary but certainly helpful.
2.14) "This problem has been extensively studied in the literature.." -> what problem?
2.15) "A Policies Propagation Rule (PPR)" -> sing.; "A Policy Prop. Rule"
2.16) "Reasoning on Horn rules is one way of dealing with policies, .." -> "Reasoning over/with Horn rules is one way.."
2.17) ".. , particularly because they allow a tractable defeasable reasoning [1]." -> which means? again, also include justification of claim, not only citation; defeasible; "allow tractable"
2.18) ".. and this is exactly how we decided to tackle our problem here." -> too coll.; rephrase
2.19) "More recently, problem solving methods have been studied in relation to the task of understanding process executions [11]" -> so what? how does this relate?
2.20) "The problem of .. has been deeply studied, and focused on.." -> not anymore?
2.21) ".. our problem is not one of policy enforcement but one of providing.." -> ".. not one of policy enforcement, but providing.."

------------------
3) Reasoning on policies propagation
------------------
3.1) "Reasoning on policies propagation" -> "Reasoning on policy propagation"
3.2) "In this Section we.." -> "In this section, we .."
3.3) "We define the problem of policies propagation as the one of identifying the set of policies associated with .." -> "We define the problem of policy propagation as identifying the set of policies associated to .."
3.4) ".. and they have an RDF representation by the means of the ODRL ontology." -> ".. and that they are expressed in RDF according to the ODRL ontology [cite https://www.w3.org/ns/odrl/2/ODRL21]".
3.5) "A policy expressed with the ODRL model includes.." -> "A policy expressed in ODRL includes.."
3.6) ".. a deontic aspect .." -> only one? an ODRL policy can contain 1..* permissions or prohibitions
3.7) As of ODRL 2.1, an odrl:duty can only be expressed as part of an odrl:permission (cf. relevant discussions on the ODRL mailing list [6,7]).
3.8) ".. associated to a set of odrl:Actions." -> "which are defined for a set of.." to clarify that only permissions/prohibitions are directly associated with assets, not the policies themselves.
3.9) "Set of policies can be associated with assets. " -> Neither is there any concept of "policy sets" in ODRL, nor are assets directly associated with policies.
3.10) I'm not sure why some concepts/properties are in \tt (e.g. odrl:duty, rdfs:range, ..), in \emph (e.g.describes/describeBy,..), and others having no style at all (e.g. relatedWith);
3.11) ".. under the same umbrella." -> coll.; rephrase
3.12) "containmnet" -> "containment"
3.13) "PPR Definition" -> 1. use \begin{definition}; 2. why is there a definition of PPRs all of a sudden? respective transition that's introducing the need for a definition of PPRs is missing.
3.14) ".. when a policy holds for a data object, and this is linked to another with that relation, .." -> linked to another policy? linked to another data object?
3.15) ".. then the policy will also hold for the second one." -> the second data object? what was the first one?
3.16) ".. is a Horn Clause .." -> ".. is a Horn clause .."
3.17) Again, how does the paragraph starting with "EventMedia[16].." relate to the previous one? Sharp transition..
3.18) "Figure 1 displays.." -> depicts/illustrates/exemplifies/...
3.19) "Table 1 lists the licenses or terms of use documents associated with the data sources." -> which data sources? Why is there suddenly a table listing data sources and their T&C? explanation is missing.
3.20) "..being the Upcoming service not available" -> ".. with the Upcoming service not being available.."
3.21) Footnotes of Table 1 are missing
3.22) Listing 1 -> There is no "odrl:asset" property in ODRL, only "odrl:target"; "odrl:sublicense" never made it into the final vocabulary specification; odrl:duty can only be part of an odrl:permission (see 3.7); use an appropriate listing environment such as \usepackage{listings}.
3.23) ".. and produces as output an ODRL set like the one in Listing 2." -> if you differentiate between odrl:Policy and odrl:Set you have to elaborate on their differences.
3.24) Listing 2: see 3.22)
3.25) "Having a description of policies and data flow steps implies a huge number of propagation rules to be managed and computed (number of policies times number of actions)." -> that's a bit handwaving, why's that so?
3.27) "Our hypothesis is that compressing the size of the rule base by enabling some sort of inference mechanism.. " -> some sort of inference mechanism? again, too handwaving.
3.28) ".. would not negatively impact the efficiency of the computation." -> the computation of what? propagated policies?
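
For illustration, the reading of a PPR questioned in 3.14)-3.16) can be sketched as a Horn clause applied to a fixpoint. This is a minimal sketch, not the authors' implementation: the function name `propagate`, the triple layout, the data objects `:x` and `:y`, and the assumption that a policy flows from the first to the second argument of a relation are all hypothetical.

```python
# Hypothetical sketch of applying Policy Propagation Rules (PPRs) as Horn clauses:
#   has_policy(X, p) AND relation(X, Y) AND propagates(p, relation) -> has_policy(Y, p)
# Assumption: a policy flows from the first data object (X) to the second (Y).

def propagate(policies, links, rules):
    """Return the policies of every data object after applying the rules to a fixpoint.

    policies: dict mapping data object -> set of policy names
    links:    set of (first, relation, second) triples between data objects
    rules:    set of (policy, relation) pairs, read as propagates(policy, relation)
    """
    result = {obj: set(ps) for obj, ps in policies.items()}
    changed = True
    while changed:
        changed = False
        for first, relation, second in links:
            for policy in list(result.get(first, ())):
                target = result.setdefault(second, set())
                if (policy, relation) in rules and policy not in target:
                    target.add(policy)
                    changed = True
    return result


# Illustrative data only: the object names and the second policy are made up.
rules = {("odrl:attribute", "dn:isCopyOf")}
policies = {":x": {"odrl:attribute", "odrl:shareAlike"}}
links = {(":x", "dn:isCopyOf", ":y")}
derived = propagate(policies, links, rules)
```

In this toy run only `odrl:attribute` reaches `:y`, because no rule states that the second policy propagates over `dn:isCopyOf`; stating explicitly which object is "the first" and which "the second" removes the ambiguity raised above.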

------------------
4) (A)AAAA Methodology
------------------
4.1) "The approach for compressing a knowledge base of policy propagation rules relies on the Datanode ontology, .." -> THE approach or YOUR approach?
4.2) ".. that organizes the possible dataflow steps in a hierarchy." -> "which organizes possible data flow steps .."
4.3) ".. PPRs.The.." -> ".. PPRs. The .."
4.4) "2) an ontology is available to organise data flow steps in a semantic hierarchy - the Datanode ontology." -> does it only work with the Datanode Ont.?
4.5) "in a semantic hierarchy .. For example, this ontology would tell us that the relation isCopyOf is a kind of isDerivationOf. " -> ".. in a semantic hierarchy, e.g., for expressing the fact that relation isCopyOf is a sub-relation of isDerivationOf".
4.6) "We provide here a journey through the methodology .. how it has been applied." -> coll.; rephrase
4.7) I'm not really convinced that the brief introduction of A1-A5 on p.5 is necessary, given that all phases are discussed in more detail later on anyway. Additionally, due to the low level of detail of A1-A5 on p5 more questions are raised than actually answered.
4.8) "The output of the process is an ordered lattice of concepts: policies that propagate with the same set of relations.", however [8] says: "The output of the process is a set of concepts: clusters of policies that propagate with the same set of relations, ordered in a lattice." -> the term "clusters of policies" is crucial given that you refer to it in A4 without ever mentioning/introducing it before.
4.9) "identifies what are the partial matches between the clusters .." -> "identifies all partial matches between clusters and.."
4.10) ".. depending on whether the policy propagates or not with the given relation." -> does the policy propagate itself? or shall it be propagated given a particular relation?
4.11) ".. we can generate a PPR, and populate the set of rules R." -> on p.4, R is used to denote a Datanode relation between two data objects.
4.12) "We used the Datanode Ontology [7].." -> I don't think that needs to be cited every time you mention it.
4.13) "Thanks to the Contento tool.." -> coll.; remove
4.14) ".. , it was possible to edit manually the matrix with a reasonable effort [6,8]." -> ".. to manually edit.."; what's a "reasonable effort"? reasonable for whom?
4.15) "At the end of this process, the matrix had 3363 cells marked as true." -> were they all defined manually?
4.16) ".. with respect to policies propagation." -> ".. with respect to policy propagation."
4.17) ".. binary matrix representation of the rule base R." -> consisting of policy propagation rules? or any other type of rule?
4.18) "In FCA terms, each concept groups a set of objects .. and maps it to a set of attributes .. An FCA Concept groups a set of objects all having a given set of attributes (and vice-versa). " -> redundant; remove one
4.19) ".. while the bottom concept B all the policies but no relations." -> ".. while the bottom concept B includes all the policies but no relations."
4.20) "The reader can deduce that a large part of Datanode included relations that do not propagate any policy,.. " -> why is the reader able to "deduce" that? "included" -> aren't they included anymore?
4.21) ".. and all the others sub-relations.. " -> "and all the other sub-relations"
4.22) ", for example." -> remove
4.23) "We expect the branches of the ontology.." -> "We expect branches of the ontology"
4.24) Definition of "3. Matching" is poorly paraphrased from [8], e.g.: (1) variables c and h aren't explicitly introduced (only implicitly mentioned in Listing 6), (2) present definition says "We search for (partial) overlaps between branch h in H.." which implies that only one branch (namely h) is evaluated, whereas [8] actually says "We search for (partial) overlaps between branches in H .." which (in correlation with Listing 6) matches the fact that all branches in H are investigated.
4.25) ".. whose intersection with the concept is made of all 7 sub-relations." -> "over all 7 sub-relations"
4.26) Listing 7: use a table like Table 1 of [8] for illustrating the results (including respective captions "c=Concept ID,..")
4.27) "A coherency check process is necessary to identify whether this assumption does hold for all the relations." -> for all relations in the extent?
4.28) "Given a concept c, the algorithm extracts the relations (extent) of each of any superconcept (S)." -> S denotes the set of all super concepts s of c.
4.29) Concept 71/concept 71/concept (71) -> choose one
4.30) "Quasi matches. The result of the Abstraction phase.." -> What are "quasi matches"? I also assume \subsection{Abstraction} is missing here.
4.31) "Here it is worth mentioning some general considerations that can be made by inspecting these measures." -> "Some general considerations can be made by inspecting these measures." and put it below Table 2.
4.32) "The size of the matrix that was prepared in the Acquisition phase is pretty large, and it is possible that errors have been made at that stage of the process." -> how large is "pretty large"? "pretty large" compared to what? what errors could have been made? due to what circumstance?
4.33) ".. referring to [8] for the details of it, and provide three examples." -> three examples of what? merge it with "In what follows we illustrate three examples of changes performed during our application of the methodology."
4.34) Given the amount of content taken from [8] already, I don't see any reason for not including figures for Fill, Wedge, Merge, .. too.
4.35) "After a change to the ontology, .." -> the article revolves around policy propagation, how do changes to the ontology (supposedly to e.g. Datanode?) relate to that? what changes?
4.36) "As shown in Figure 2, after the Adjustment phase we restart a new iteration." -> and loop forever? (only at p.11 conditions for terminating the process are mentioned)
4.37) "Example 3. A branch with similar scores is.." -> similar to ..?
4.38) "and having different policies while containing the same data!" -> "and may have different policies while containing the same data."
4.39) ".. with a more focused semantic:" -> ".. with precise semantics: "
4.40) ".. and a reasonably good compression factor is reached, or no more meaningful changes are possible." -> reasonably/meaningful according to whom? what's a reasonably good compression factor? what are meaningful changes (or what changes aren't meaningful?)
4.41) "are provided in Table 3 (+). This includes .." -> "It/Table 3/,which includes.."
4.42) Table 3 has 2 captions
4.43) "Apart from being mandatory to be able .. " -> rephrase
4.44) "Thanks to this methodology .. " -> coll.; remove
4.45) "we have been able to fix many errors in the initial data, to refine Datanode by clarifying the semantics of many properties and adding new useful ones." -> too fuzzy/handwaving; if you mention those adaptations you also have to elaborate on/explain them in more detail.
4.46) "The version of the ontology at the beginning of this work can be found at.." -> ? I reckon prior to performing said changes?
4.47) Fig 3/4 -> s/Progresss/Progress/
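
The idea behind 4.5) (a rule stated for a relation can stand in for the rules of its sub-relations) can be sketched as follows. This is a minimal sketch under stated assumptions, not the (A)AAAA procedure itself: the single-parent hierarchy, the function names, and the policy names `p1` and `p2` are hypothetical; only the isCopyOf/isDerivationOf pair is taken from the text.

```python
# Hypothetical sketch: compressing a PPR rule base with a relation hierarchy.
# Assumption: each relation has at most one direct super-relation.
SUB_RELATION_OF = {
    "dn:isCopyOf": "dn:isDerivationOf",
}


def ancestors(relation):
    """Return the relation itself followed by all of its super-relations."""
    chain = []
    while relation is not None:
        chain.append(relation)
        relation = SUB_RELATION_OF.get(relation)
    return chain


def compress(rules):
    """Drop every (policy, relation) rule implied by a rule on a super-relation."""
    return {
        (policy, relation)
        for policy, relation in rules
        if not any((policy, sup) in rules for sup in ancestors(relation)[1:])
    }


def propagates(policy, relation, compressed):
    """A rule holds if it is stated for the relation or for any super-relation."""
    return any((policy, r) in compressed for r in ancestors(relation))


# Illustrative rule base; p1 and p2 are placeholder policy names.
rules = {
    ("p1", "dn:isDerivationOf"),
    ("p1", "dn:isCopyOf"),
    ("p2", "dn:isCopyOf"),
}
compressed = compress(rules)
```

The compressed base only reproduces the original one if a policy stated on a super-relation really holds for all of its sub-relations, which is the assumption the coherency check discussed in 4.27) is concerned with.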

------------------
5) Experiments
------------------
5.1) "The methodology described in the previous Section allows to reduce the number of rules to be stored .." -> "The methodology described in the previous section allows to reduce the number of rules that need to be stored .."
5.2) ".. from a previous work [7].." -> ".. from previous work [7].."
5.3) "Each data flow represents a data manipulation process, from an input data source (sometimes multiple sources), resulting in one principal output node." -> rephrase
5.4) "The task of the reasoner is to find all the policies associated with the output of the data flow, according with the ones associated with the input. " -> what? I assume you mean something along the lines of "Given a set of policies P associated to the input of a data flow, a reasoner tries to find a set of policies P' wrt. P that's associated to the respective output of the data flow".
5.5) ".. when using an Uncompressed or a Compressed rule base." -> ".. when using an Uncompressed and Compressed rule base respectively."
5.6) Listing 12/13 have a different font size than the other listings.
5.7) "SPARQL Language" == "SPARQL Protocol And RDF Query Language Language" -> "SPARQL"
5.8) "These use cases were formalized before the present work (in [7]). " -> which use cases? are data flows == use cases?
5.9) "The related data flow descriptions were not altered for the task at hand, except that the part about the policies of the input was added." -> related to what? s/the part about the/information about/
5.10) There is way too much text in the caption of Table 4.
5.11) REXPLORE-4 is missing in Table 4
5.12) Table 4 only reports on quantitative differences between data flows; do qualitative differences (e.g., complexity) also matter?
5.13) Chosen scale for charts in Fig. 5 isn't really suitable; try a log y-axis to depict changes (even if they are only marginal) more clearly
5.14) "Each data flow describes a process executed within one of five applications." -> what applications? (how) do they differ from each other?
5.15) "The table lists the basic properties of these use cases." -> s/The table/Table 4/; again, what use cases? are data flows == use cases? stick with data flows!
5.16) "The has relation column reports the number of statements about policies." -> "The has policy column .."; I would also explain the columns from left to right, i.e., has policy -> has relation -> relations -> ..
5.17) ".. we feed the reasoner .." -> coll.; ".. we provide xxx as input for the reasoning process .."
5.18) "It is worth noting that the (A)AAAA Methodology is also an evolution method, .." -> context? what's an evolution method? why is it "also an evolution method"?
5.19) "The experiments was executed on a MacBook Pro with processor Intel Core i7/3 .." -> "The experiments were executed on a MacBook Pro with an Intel Core .. processor and .."
5.20) "In case a process was not completed in five minutes, it was forcely interrupted." -> "In case a process was not completed within five minutes, it was interrupted"
5.21) "What we are showing here is the .. " -> coll.; rephrase or remove
5.22) ".. accuracy of the average measure computed from the twenty executions of the same experiment, .. " -> ".. accuracy of the computed average measures, we .."
5.23) What level of CV is considered to be good/bad? Is <0.1 good/bad? Is [0.2,0.4] good/bad? What are the ranges? Why aren't all computed CV's reported (e.g., in a chart/table)?
5.24) ".. except the Query time .." -> either 1) "query time", 2) Q, or 3) \emph{Query time} (preferably 1) or 2))
5.25) ".. but to observe the impact of our compression methodology on a Naive and an Optimized implementations, .." -> methodology != method; ".. but to observe the impact of our compression approach/method/strategy on a .. implementation .."
5.26) "For each use case, .." -> "For each data flow .."
5.27) ".. of the experiment could not complete in five minutes, .." -> ".. of the experiment could not finish within five minutes, .."
5.28) "The execution time of the experiments with the Optimized reasoner .. having the maximum execution time .." -> "The execution time of the experiments with the Optimized reasoner .. having a maximum execution time .."
5.29) There are a lot of inconsistencies wrt. naming/usage of introduced performance measures/dimensions, e.g., on p.14 the following performance measures are introduced: L (=Resources load time), S (=Setup time), Q (=Query time), T (=Total duration), Pa (=Average CPU usage), M (=Maximum memory required by the process). However, in the remainder of p.14 those measures aren't referred to anymore; instead, "execution time" is mentioned several times -> what's "execution time"? is T == "execution time"? In 6a, T is suddenly referred to as "total execution time (T)" -> is "total execution time" = "execution time"? 6b reports "Setup/Query execution time"; is "execution time" == "Setup execution time" == S? => is S == T? Be precise and consistent!
5.30) Captions of Figures 6/7 aren't consistent -> While 6 mentions acronyms of respective measures, 7 only does it for 7a (where 6a says ": total execution time (T)" and 7a says ": Total execution time (T)"; total/Total?)
5.31) 6e defines max memory consumption as (M), whereas 7e defines it as (RSS)
5.32) Keep coloring of individual bars consistent, e.g., uncompressed always "light red" and compressed always "light green" (introducing a distinct pattern for both approaches could further improve readability for bw prints [8,9])
5.33) 6b/7b are not very intuitive: 1) "S/Q" could be interpreted as "Setup time per Query time" instead of "Setup and Query time"; it's also not obvious which bars report results for which performance measure (use a legend or see, e.g., [10])
5.34) In general, I would suggest to order the charts wrt. the order their respective performance measure was introduced (i.e., L -> S -> Q -> T -> Pa -> M).
5.35) ".. we report a costant increase .." -> ".. we report a constant increase .."
5.36) ".. increase in performance for all the use cases, in some cases significant (DBREC-3, DBREC-4)." -> how's the overall performance calculated? where's it reported? why is the performance increase for D-3 and D-4 significant, but not for the others? what's the threshold for being significant?
5.37) "We can divide the Total time (T).." -> divide by what? I guess you meant something along the lines of "Total time (T) can be broken up into setup time (S).."
5.38) "The cost of the query time in the Naive reasoner is very large compared.." -> "The costs of querying using the Naive reasoner.."
5.39) ".. showed a larger setup time (S) with a very low query execution (Q) cost." -> ".. requires more time for setup, but less for querying .."
5.40) "The reason is that the second materializes all the inferences at setup time.." -> What second? Assuming changes of 5.39) ".. , but less for querying since all inferences are materialized during the setup."
5.41) "We did not observed changes.." -> "We did not observe changes.."
5.42) ".. while changes in .. (M) looks significant." -> s/looks/look/; "look significant"? are they or are they not significant?
5.43) "The boost in space consumption.."-> coll.; "An increase in .."
5.44) ".. has been also observed in the Optimized reasoner .." -> you haven't observed any results in the reasoner itself, but when using it
5.45) The sentence starting with "Such a reasoner could be implemented in several ways, .." is way too long and bulky.
5.46) Y-axis of Fig 8/9 is missing a unit (0.05-0.7 what? seconds/%o/minutes/hours/times faster/apples/..); what's a "boost" analysis? how shall Fig 8/9 be interpreted/read?
5.47) "this is why we bring the experiments with the Optimized implementation.." -> coll.; rephrase
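
Regarding 5.23), reporting the CV of every measure would be cheap. A minimal sketch, assuming the CV is the sample standard deviation divided by the mean (the manuscript may use a different definition); the function name and the timing values are hypothetical.

```python
import statistics


def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the mean of the samples."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return statistics.stdev(samples) / mean


# Made-up total durations (seconds) of repeated runs of one experiment.
timings = [1.0, 2.0, 3.0]
cv = coefficient_of_variation(timings)
```

Listing such a value per data flow and per measure (L, S, Q, T, Pa, M), together with the threshold the authors consider acceptable, would answer the question above.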

------------------
6) Conclusions
------------------
6.1) "The contributions of this article are two: " -> doesn't really seem to fit here; "The present article provides two major contributions, namely: .."
6.2) ".. reasoning on policies propagation.." -> ".. reasoning on policy propagation.."
6.2) "Future work includes methods to support or automate the generation of data flow descriptions and to study.." -> "Future work includes exploring methods for supporting and automating .. as well as studying the.."
6.3) "..support or automate the generation of data flow descriptions.." -> why is this relevant? why do we want to support/automate the generation of such descriptions? what's the benefit? what does such a description look like?
6.4) "Finally, we are setting up.." -> "Finally, we are currently setting up.."

------------------
7) References
------------------
7.1) "R. Iannella, S. Guth, D. PÃd’hler, and A. Kasten. ODRL: Open Digital Rights Language 2.1." -> encoding

== References ==

[0] E. Daga, M. d’Aquin, A. Gangemi, and E. Motta. Propagation of policies in rich data flows. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015
[1] Stefan Bischof, Christoph Martin, Axel Polleres, Patrik Schneider: Collecting, Integrating, Enriching and Republishing Open City Data as Linked Data. International Semantic Web Conference (2) 2015: 57-75
[2] Stefan Bischof, Christoph Martin, Axel Polleres, Patrik Schneider: Open City Data Pipeline - Collecting, Integrating, and Predicting Open City Data. KNOW@LOD 2015
[3] https://www.w3.org/community/odrl/model/2.1/
[4] https://www.w3.org/2016/poe/charter
[5] Simon Steyskal, Axel Polleres: Towards Formal Semantics for ODRL Policies. RuleML 2015: 360-375
[6] https://lists.w3.org/Archives/Public/public-odrl/2015Jan/0013.html
[7] https://lists.w3.org/Archives/Public/public-odrl/2016Feb/0000.html
[8] http://tex.stackexchange.com/questions/24964/how-to-combine-fill-and-pat...
[9] http://tex.stackexchange.com/questions/243945/bar-chart-patterns-in-legend
[10] http://tex.stackexchange.com/questions/152143/grouped-bar-chart-with-pgf...

Review #2
By David Corsar submitted on 16/May/2016
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper is relevant to the “Data licensing and policy propagation in linked data” topic of the special issue. In terms of originality, this paper is an extended version of [1]; in addition to an alternative motivation and slightly expanded descriptions in certain sections, the two original contributions of this work are: definition of a coherency check stage within the fourth (assessment) phase of the (A)AAAA methodology defined in [1]; and an evaluation that compares performance of two reasoners (in terms of time (load, setup, query), CPU usage, maximum memory consumption) working with an uncompressed or compressed policy propagation rule base with 15 data flows.

In terms of the paper, section 2 (Related Work) consists mainly of preliminary material: further motivation for the use of policies in open environments; introduction to the ODRL ontology for policies and the RDF Licenses Database; justification why PROV-O is not used; requirements for representing policy propagation rules; a brief introduction to Formal Concept Analysis and the Contento tool; justification that the (A)AAAA methodology is a knowledge engineering task (which I feel is more relevant for [1] than this paper); and acknowledgement that compression of propositional knowledge bases has been investigated by others. However, the section lacks a satisfactory discussion of previous approaches to the problem tackled by this paper - that of determining which policies propagate from data sources to an output of a computational process, how the (A)AAAA methodology improves on them, and how the coherency check contributed in this work improves on that.

Section 3 provides a welcome description of the authors’ approach for describing policies, data flows, and policy propagation rules. Given the importance of the Datanode ontology’s property hierarchy in the (A)AAAA methodology, I found the description of it insufficient to support the reader through the remainder of the paper – some of the properties are mentioned, but no details of the hierarchy are provided. This section also introduces an example based on the EventMedia system; however, this is not used as effectively as it could be and, as the paper currently stands, offers little value. This could be resolved by better linking with the text – for example, it is unclear if the :outputPset of Listing 2 maps to a node in Fig 1 (which requires namespaces to be defined, along with details of the representation (what do the circles represent?)); similarly the relation to the example policy propagation rule (PPR) - propagates(attribution; isCopyOf) – could be made explicit through the inclusion of namespaces – i.e. change to propagates(odrl:attribute, dn:isCopyOf). The clarity of Fig 1 would also be improved by rearranging some of the property labels.

In terms of reproducibility of this work, no references are provided to the resources used (other than the Datanode ontology) – e.g. the policy rule bases, the FCA concept graph, data flows used in the evaluation or reasoner implementations.

Section 4 presents the (A)AAAA methodology as previously published with the addition of the coherency check in the fourth (assessment) phase. The inclusion of examples throughout this section potentially improves the quality of writing; however, the examples provided in the listings and tables throughout the paper use different concepts which, without the availability of the source resources, do not provide a clear illustration of the data flow throughout the (A)AAAA methodology or the information necessary to manually “run” the code listings. For example, Listing 5 describes Concept 71, this is followed by a description of the abstraction algorithm, sample results of which are shown in Listing 7 for Concept 74; the assessment stage is then described, with Listing 9 providing the results of the coherency check for Concept 71. Having a consistent example, ideally with the full example available online, that is derived from the EventMedia mash-up introduced in Section 3 would improve this section and the paper overall by increasing the understandability of the approach and allowing the reader to verify that the presented material (and their understanding of it) is correct.

There appears to be an error with Listing 4 – should the final PPR, prop(dn:metadata, prohibition odrl:transform), be included here? In Listing 3 that cell has a value of 0, which from my understanding means it should not be included in the set of PPRs – this is consistent with the other rules in Listing 4.

Table 2 appears to be out of place – it presents results of the abstraction algorithm, when the section it is included in (4.4) describes the assessment phase; the text introducing it refers to an “example concept obtained applying the approach” but Table 2 lists several concepts.

Section 5 discusses experiments into the impact of the methodology on reasoner performance. Two reasoners are used, a “naïve” Prolog reasoner and an “optimized” SPIN based reasoner. The paper currently lacks explanation as to why the two reasoners are classified as “naïve” and “optimized” – in what way is the second one optimized? It is also not clear that the compressed rule base is a compressed version of the uncompressed one – this should be made explicit.

The initial description of the evaluation could be confusing: initially it states that “the task of the reasoner is to find all the policies associated with the output of the data flow…. This task is performed two times: the first providing the full set of propagation rules, the second providing the compressed rule base…” then immediately after the stated objective of the experiment is to “compare (the) performance of the reasoner when using an Uncompressed or a Compressed rule base.” Is it the case that when first describing the task, it should read “providing the reasoner with the full/compressed set of propagation rules as input”?
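To make my reading of the task concrete, it can be sketched as follows. This is a toy illustration under my own assumptions and not the authors' Prolog or SPIN reasoner: every relation name other than dn:isCopyOf, the small hierarchy, the data flow, and all function names are hypothetical, and the flow is assumed to be acyclic.

```python
# Toy policy propagation over a data flow; NOT the authors' reasoner.
# Hypothetical: the ex: names, the hierarchy, the flow, and all function names.

# Property hierarchy: child relation -> parent relation (single parent assumed).
HIERARCHY = {
    "dn:isCopyOf": "ex:sameContentAs",
    "ex:isSnapshotOf": "ex:sameContentAs",
}

# Data flow: (target, relation, source) reads "target <relation> source".
FLOW = [
    ("ex:cache", "dn:isCopyOf", "ex:source"),
    ("ex:output", "ex:isSnapshotOf", "ex:cache"),
]

# Policies attached to the input data source.
SOURCE_POLICIES = {"ex:source": {"odrl:attribute"}}

# Uncompressed rule base: one propagates(policy, relation) fact per pair.
UNCOMPRESSED = {
    ("odrl:attribute", "dn:isCopyOf"),
    ("odrl:attribute", "ex:isSnapshotOf"),
}

# Compressed rule base: a single rule stated on the common parent relation.
COMPRESSED = {("odrl:attribute", "ex:sameContentAs")}


def ancestors(relation):
    """Yield the relation itself and every relation above it in the hierarchy."""
    while relation is not None:
        yield relation
        relation = HIERARCHY.get(relation)


def propagates(policy, relation, rules):
    """True if a rule for this policy holds on the relation or one of its ancestors."""
    return any((policy, r) in rules for r in ancestors(relation))


def policies_of(node, rules):
    """Collect the policies reaching `node` by walking the (acyclic) flow back to its sources."""
    found = set(SOURCE_POLICIES.get(node, ()))
    for target, relation, source in FLOW:
        if target == node:
            found |= {
                p for p in policies_of(source, rules)
                if propagates(p, relation, rules)
            }
    return found
```

Under this reading, the two runs differ only in the rule base passed as input (`policies_of("ex:output", UNCOMPRESSED)` versus `policies_of("ex:output", COMPRESSED)`), and the compressed run additionally relies on the property hierarchy. If that is what is meant, the text should say so.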

All the graphs in Section 5 require labels on the X and Y axes to describe what is presented. Many, such as Fig 6 (c) and Fig 6 (e), would also benefit from additional labels on the Y axis to indicate the scale used, as this is not mentioned.

It is also unclear what the Query execution time introduced on page 14 relates to – which queries are being executed here? Is it the query to identify the policies relevant to a piece of data?

Given that the main novel contribution of this paper over [1] is the coherency check, I would have expected some evaluation of its impact. As it is, the evaluation of it is limited to one example described in Table 3. Table 3 would be improved by clarification as to whether the same input was used for rows 0-15 of Table 3 (i.e. results of the process without the coherency check), and rows 16-26 (i.e. with the check) or whether rows 16-26 used the compressed rule base produced at the end of row 15 and reduced it further using the coherency check. Related to this, the authors should consider evaluating the reasoner performance with the compressed rule base created without the coherency check against a compressed rule base created with the check to further illustrate the value of the check. Similarly, the evaluation only focuses on the end product (i.e. the compressed rule base), but does not consider the differences in time and computational resources of the extended (A)AAAA methodology of this paper with that of [1], nor provide details of the effort required to produce the materials necessary for the (A)AAAA methodology to be applied, which is useful to know from the knowledge engineering perspective for those considering using this approach.

Overall, the results will be of interest not just to those directly involved in this area of policy related work, but for those whom it should be a concern – i.e. anyone using data with license restrictions. However, the results would be more significant if the evaluation was extended as discussed above (and, if possible, a comparison with alternative tools/methods that focus on this task), and the examples and code for the reasoners were made available to allow their reuse/guide future implementations.

The quality of writing is generally acceptable, although it would be improved by consideration of the points above. There are a few typos and other minor comments:
Abstract – the phrase “realistic scenarios” is used, however the case studies used in the evaluation are not sufficiently described in the paper to justify that they are indeed realistic scenarios.
Table 1 – T&C column, “foursqaure Developers Policies” -> foursquare
Listing 2 – odrl:prohibition cc:ommercialUse; - should this be cc:commercialUse?
Pg 3, right column, “containmnet” -> containment
Pg 7 – the description of step "3. Matching": Is this performed for each branch h and each concept c? Is “c” a concept from “C” (the set defined in 1. Concepts)?
Pg 8 - In the description of compression, the text refers to “use the relation h”, but h is a branch; should this be “use the relation of h”?
Pg 8, left column - has the text “Concept 74 in Listing 5” but Listing 5 states Concept 71.
Pg 10 – the phrase “pretty large” is informal, please quantify.
Pg 14 – the phrase “to be very accurate” – please quantify what is meant by this.
Pg 17 – “costant” should be “constant”; “observed” should be “observe”
There are inconsistencies with the use of listing and tables – for example Listing 7 provides the same data as Table 6 (but for different concepts).

[1] Daga, E., d'Aquin, M., Gangemi, A. and Motta, E. (2015) Propagation of Policies in Rich Data Flows 8th International Conference on Knowledge Capture (K-CAP 2015), Palisades, NY, USA

Review #3
Anonymous submitted on 28/May/2016
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper is well written. As far as the originality goes, the problem is well known (minimizing access control policies) but the idea of aggressively minimizing policies via "careless abstraction" and then repairing the result - if I understood correctly - looks interesting to me. I am not entirely convinced of its validation, though, and the significance of the results looks correspondingly low. But maybe I am missing something here.
The authors argue that abstraction may lead to policy compression: for example, if I have two rules (underage customers cannot order beer) and (children cannot order beer) I can discard the latter rule as all children are underage, and save some time at policy evaluation.
Of course, if I try to compress a policy by abstracting carelessly, I may end up with an unwanted effect ("underage customers cannot order beer" and "overweight customers cannot order desserts" could become "customers cannot order menu items") or even with a conflict. All this is well known from the literature. This paper's attempt to define a methodology to "repair" the policy after careless abstraction so that the carelessly-compressed-and-then-repaired policy is equivalent to the original one looks attractive on paper, but - while the authors do a good job in showing that some enforcement time is saved - the real validation would require comparing their methodology with the complexity of direct policy minimization. Also, it is unclear to me what happens with policy updates. I suspect that, depending on update frequency, one may spend more time in running the methodology than one would save when evaluating the policy.
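The two cases above (safe removal of a subsumed rule versus careless abstraction) can be sketched as follows. This is only a toy model of my restaurant example, not the authors' methodology: the hierarchy, the extra "adult" class, and all function names are hypothetical.

```python
# Toy model of the beer/dessert example; all names are hypothetical.

# Child term -> more general term (customers and menu items share one map).
PARENT = {
    "child": "underage",
    "underage": "customer",
    "overweight": "customer",
    "adult": "customer",
    "beer": "menu_item",
    "dessert": "menu_item",
}


def is_a(term, general):
    """True if `term` equals `general` or lies below it in the hierarchy."""
    while term is not None:
        if term == general:
            return True
        term = PARENT.get(term)
    return False


def denied(who, what, rules):
    """A request is denied if some prohibition covers both the customer and the item."""
    return any(is_a(who, w) and is_a(what, i) for w, i in rules)


def drop_subsumed(rules):
    """Safe compression: drop each rule that a different, more general rule already covers."""
    return {
        (w, i)
        for (w, i) in rules
        if not any(
            (w2, i2) != (w, i) and is_a(w, w2) and is_a(i, i2)
            for (w2, i2) in rules
        )
    }


original = {("underage", "beer"), ("child", "beer"), ("overweight", "dessert")}
safe = drop_subsumed(original)

# Careless abstraction: lift both remaining rules to their common ancestors.
careless = {("customer", "menu_item")}
```

Here `safe` keeps two rules and denies exactly the same requests as `original`, whereas `careless` also denies an adult ordering beer. That over-generalization is what the repair step has to undo, and its cost is what I would like to see compared against direct minimization.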
A second - perhaps idle - curiosity: after repairing the policy, can we estimate how far we are from the minimal one?