A Five-Star Rating Scheme to Assess Application Seamlessness

Tracking #: 951-2162

Authors: 
Timothy Lebo
Nick Del Rio
Patrick Fisher
Chad Salisbury

Responsible editor: 
Guest editors linked data visualization

Submission type: 
Full Paper
Abstract: 
Analytics is a widespread phenomenon that often requires analysts to coordinate operations across a variety of incompatible tools. When incompatibilities occur, analysts are forced to configure tools and munge data, distracting them from their ultimate task objective. This additional burden is a barrier to our vision of seamless analytics, i.e. the use and transition of content across tools without incurring significant costs. Our premise is that standardized semantic web technologies (e.g., RDF and OWL) can enable analysts to more easily munge data to satisfy tools’ input requirements and better inform subsequent analytical steps. However, although the semantic web has shown some promise for interconnecting disparate data, more needs to be done to interlink user- and task-centric analytic applications. We present five contributions towards this goal. First, we introduce an extension of the W3C PROV Ontology to model analytic applications regardless of the type of data, tool, or objective involved. Next, we exercise the ontology to model a series of applications performed in a hypothetical but realistic and fully implemented scenario. We then introduce a measure of seamlessness for any ecosystem described in our Application Ontology. Next, we extend the ontology to distinguish five types of applications based on the structure of data involved and the behavior of the tools used. By combining our 5-star application rating scheme and our seamlessness measure, we propose a simple Five-Star Theory of Seamless Analytics that embodies tenets of the semantic web in a form which emits falsifiable predictions and which can be revised to better reflect and thus reduce the costs embedded within analytical environments.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Roberto García submitted on 09/Feb/2015
Suggestion:
Major Revision
Review Comment:

The paper covers a very interesting topic, related to estimating the cost of data analysis based on the tools used, the 7 munging operations identified in the paper, and the characterization of the consumed/generated data in terms of the five-star open data classification.

The proposal is very interesting and illustrated with very detailed examples. However, it keeps the whole discussion at the theoretical level, failing to provide real evidence about the suitability of the proposal under real conditions, and especially in the presence of real users who might not have a profound knowledge of semantic web technologies. In this regard, the paper makes too many assumptions about semantics-is-better without enough empirical support.

For instance, in Section 3.2.2 the authors claim that it is better to go back to a semantic representation and use SPARQL and Sgvizler to generate a plot visualization. However, from a user perspective and after a user study, it might well turn out that most data analysts find it easier to paste the HTML table into Tableau and generate the visualization that way.

Consequently, my impression is that the paper lacks some sort of empirical study that incorporates the user experience dimension into the cost estimation proposal.

Review #2
By John Howse submitted on 12/Feb/2015
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper describes a framework for rating applications by codifying how easy it is to re-use the output of the application in other contexts. The work is motivated and illustrated using an example workflow, in this particular case attempting to learn about the man-made satellites of Earth. In addition to a star-rating, each ecosystem (set of applications) is assigned a pair of scores which (it is claimed) capture the cost of that ecosystem.

I cannot give much insight into the originality of the work, as this is somewhat outside my field of expertise. However, I can comment on the significance of the work (or, at least, the applicability of the work), and the quality of writing. It will be on these aspects, therefore, that I will focus my review.

The mathematical description and notation used throughout was problematic. The problems become more pronounced further into the paper. Some can be easily rectified (the intention of the authors is still clear), yet others leave the intentions of the authors unclear. This is clearly a problem for the applicability of the result, and contradicts the assertion that the framework provides falsifiable claims to test.

Firstly, the subscript after a dataset D is overloaded. On page 2, we are told that D_\alpha is the dataset generated by application \alpha. On page 3, however, we find that the subscript on D tells us the star-rating of a dataset, so that tbl(D_s) = s. We can also see that D_[1,3] is a set of datasets having a star-rating of 1, 2 or 3. In other words, when we come to page 5 and see D_1,0, we do not know what kind of object we are looking at: 1,0 is neither a star-rating nor an application (the application, we are told in section 3.1.1, is \alpha_1,1). We can probably still work out what is going on, but in Fig. 7 we see tbl(D_4,1) = 1^5, which is confusing. This notational issue should be tidied up somehow.
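
To make the clash concrete, the incompatible readings, as best I can reconstruct them, are:

  D_\alpha - the dataset generated by application \alpha (page 2);
  D_s - a dataset whose star-rating is s, so that tbl(D_s) = s (page 3);
  D_[1,3] - the set of datasets rated 1, 2 or 3 stars (page 3);
  D_1,0 - apparently an index of position in the scenario (page 5), matching none of the above.

(The gloss of D_1,0 as a scenario index is my guess; the paper never says.)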

In section 4.1, there are a few layout issues which hamper understandability. Firstly, "analytical ecosystem" is used as a term on page 10, but not explained until halfway down page 11. We are told that a sequence induces an analytical ecosystem, but are unable to determine what that means. Later on, it is described, but this makes the paper more difficult to read than it needs to be. Secondly, on page 11, we have S_1 and S_* defined, and also "current" and "prospective", but we are not explicitly told that S_1 refers to current and S_* refers to prospective. I assumed they did, but it was not clear from the text (and, therefore, I may be wrong).
We wait until page 11 for the formal definition of an application \alpha, even though it has been used informally beforehand. I understand that the separation of informal and formal definitions may be for a reason, but the formal definition is surrounded by no explanation or intuition as to how it encapsulates the informal concept of an application.

The discussion of cost is missing some information. We're told that the cost of an application is greater than the sum of the individual munge costs. (Note, the text and the displayed maths do not match: the former says at least equal, the latter says strictly greater than.) When do we have equality between the sum of munge costs and the application cost? What would cause the costs to differ? This is never addressed. We are also told that the munge-level cost is bounded, and yet it is not: there is nothing stopping cost(shim) from being arbitrarily large; all we are told is that it has to be bigger than some number. We are also told that an application must have a strictly positive cost (displayed formula on page 11, col 1), and yet we are then told that if no munges are required the munge-cost must be 0 (page 11, col 2). The displayed ordering of munge costs (top of page 11, col 2) is presented with reference to a partial ordering from section 3, which contains no such ordering, and otherwise has no explanation. We are told any function which satisfies this ordering will work, and then given a function which does not satisfy it: cost(shim) = 19, which is not greater than cost(lift) + 2cost(align) + cost(cast). Further, is there any guarantee that choosing a different function will maintain the same ranking of ecosystem scores? In other words, is it possible that given two ecosystems E_1 and E_2, and two different seamlessness measures based on cost functions c and c', we could have S_1(E_1) < S_1(E_2) but S'_1(E_1) > S'_1(E_2)? This needs to be addressed: if such a situation arose then it would undercut the claim of applicability.
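
To illustrate the ranking worry with purely hypothetical numbers (none of these values are from the paper, and I ignore the intended ordering constraints for simplicity): suppose ecosystem E_1 needs a single shim, while E_2 needs a lift and a cast. Under a cost function c with

  c(shim) = 19, c(lift) = c(cast) = 9,

E_2 is cheaper (18 < 19); under an equally plausible c' with

  c'(shim) = 19, c'(lift) = c'(cast) = 12,

E_1 is cheaper (19 < 24). If both c and c' are admissible, any seamlessness score derived from cost would rank the two ecosystems in opposite orders, which is exactly the situation that would undercut applicability.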

There is no explanation of why the authors have chosen to represent S_1 with the numerator being the worst case: is it not more normal to have the denominator be the worst case, so that the score would be a proportion between 0 and 1 of the worst score? It does not really matter; it was just unexpected.
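
That is, the paper's S_1 appears to take the form

  S_1(E) = worst-case cost / cost(E),

which is unbounded above, whereas the alternative cost(E) / worst-case cost would lie in (0, 1] and read directly as a fraction of the worst score. (The exact formula here is my reconstruction; only the placement of the worst case in the numerator is stated in the paper.)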

I did not know how to read pot(D_\alpha). I see no reason to believe that it is even a function (can more than one condition be satisfied at once?). It would also be much clearer if the authors wrote conneg(D_\alpha) \neq \emptyset, too.

I was surprised that, with the cost functions now (sort of) defined, the anti-patterns were not revisited to show why the house top was bad but the inverted house top was good. The pot function (again, if it is tidied up) gives us the explanation of what makes each part bad or good. We are just told that anti-patterns are expensive, but the reader is left to supply the explanation as to why.

Section 5 suffers from some mathematical problems too, along with an unusual issue that the authors should perhaps explain more thoroughly. The ratings for stars 2-5 are based purely on a tool or application. That makes sense, since it gives an indisputable rating of how easy the application is to integrate into a system. However, the criterion for 1-star does not give the same indisputable rating: it is entirely possible (although maybe not probable) for the developer, analyst and provider to be non-disjoint without the application being inefficient. Similarly, their being disjoint doesn't guarantee anything. In other words, an application being 1-star doesn't tell me anything about it that will necessarily affect how I use that application.

The cost bounds at the bottom of page 14 and onwards are a bit of a mess. Firstly, we are told that they are intervals, but the first thing that is written is an inequality between two costs. This will evaluate to a Boolean (true or false), which is then written as equal to an interval. Or, we could read it as saying that the cost of \alpha_* is less than a cost which is an interval, even though we know costs are single-valued from the definition on an earlier page. Either way, it does not parse. Even worse, because equality is transitive we have the equality [0,\infty] = [6,\infty]. Later on we have [6,24] = [6,9]. I found it very difficult to read this section because of the errors.
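
To spell the parsing problem out (the exact line is reconstructed from memory, so treat it as illustrative): a statement of the shape

  cost(\alpha_*) \leq cost(\alpha_1) = [6, \infty]

either equates a Boolean with an interval, or equates a single-valued cost with an interval; neither reading type-checks. If the intent is that an application's cost can fall anywhere in a range, then a membership statement such as cost(\alpha_1) \in [6, \infty) says so unambiguously, and a chain of such memberships never licenses absurdities like [0,\infty] = [6,\infty] by transitivity.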

My overall recommendation is revise and resubmit. The use of mathematics in an article should reduce the ambiguity in the authors' words, whereas in this case it increases the ambiguity. As such, it needs to be very carefully rewritten, and a complete overhaul of the mathematics appearing is necessary.

Review #3
By Aba-Sah Dadzie submitted on 14/Feb/2015
Suggestion:
Major Revision
Review Comment:

The paper describes an approach to aid more fluid analysis across multiple tools and using (input and output) data in various formats, by harnessing SW technology.
The authors posit that to get the best returns from visual analytics and research/analysis using Linked Data, there is a need to calculate the costs associated with the analytical process. They term their proposal "Seamless Analytics", described using a five-star model.

There is fairly detailed coverage of relevant literature, especially from the Visual Analytics viewpoint, and this provides good grounding for the work reported.
The scenario the authors detail, while a bit difficult to follow, illustrates an analyst's path to (serendipitous) information discovery. Some of the examples are, however, not very realistic.

Overall, the paper identifies a valid challenge for data analytics, whether visualisation-based or not. It is, however, difficult to follow, in part because the arguments could be more compactly summarised. Also, how the costs were calculated is not very clear to me. It would also be useful to provide the ideal cost value (for 5 stars) - this would help in evaluating the relative cost for the scenario presented; 2.x compared to infinity is a bit difficult to interpret.

The argument for promoting RDF (and at the next level, Linked Data) as the preferred method for encoding/formatting data, as a route to easing data analysis, could be stronger. While Linked Data is mentioned several times, the authors really focus on RDF as the ideal data set. I could give a few examples in the use case where it would have been useful to have emphasised the particular benefits of LD, e.g., in switching from one perspective to another, especially where this necessitated obtaining additional data that could be seen as attributes of that in use. In fact, the most detailed explanation of why LD is a good candidate is found in the further work section.

******

I have a few questions about the future work:
"For example, an organization with a strong workforce of linked data researchers might further widen the cost between shimming and aligning to better emphasize their efficiency with semantic technologies." - don't understand this - is this saying that such researchers might deliberate RAISE their costs just because they can, or to prove they can, build tools to accommodate this? Or is this saying this would be a route to reducing overall cost?

Why does the pot function not already return "actual cost savings" - was this a deliberate design decision, or simply what can be calculated for now? In either case, why change it? Further, unless the current function does not serve a useful purpose (which begs the question of why use it at all), why replace it?

The proposal for further work is actually almost a whole (position) paper in itself; too many points are addressed in more detail than is necessary. I would suggest that (only) a few key ones, directly related to challenges raised in the main content, be addressed here.

A few very specific terms not in wide usage need to be defined at first use, or at least a forward pointer given to where the definition can be found. E.g., "munge" is not defined till p.3.
"Munging has been recognized in the field for decades …" - WHICH field - the submission is to a Semantic Web journal.
Also, all acronyms, especially those not in common use, need to be expanded at first use. E.g., "… stumbles upon a KML file provided by Analytical Graphics Inc. (AGI)" -> points to a footnote with a description of KML - if this footnote is necessary then the implication is that the reader may also need an expansion of the acronym.
Also, capitalisation of acronyms should be used correctly and consistently - problem mainly in reference list.

"it remains difficult to easily reuse those tools in evolving environments such as the world of linked data analytics – perhaps because they rely on more mundane representations that make it difficult to establish and maintain connections across analyses." - what exactly does "mundane" mean?

"This relation is also shown in Figure 2 using PROV, but we further relate munging activities as also being part of the application." -> points to footnote 3, which points to DC. It's not obvious to me how this is relevant.

"Amy does not know, however, which countries are most responsible for the resulting environmental condition. Is it her home country launching the majority of junk, or some other developing country new to space exploration and less sensitive to environmental awareness?" - Is there some justification for assuming that a "developing country … [IS] … less sensitive to environmental awareness?" While someone may make such an assumption it is probably not a good example for a scientific paper, not for a scenario/use case for a research analyst. Of course, the results further on show this not to be the case, even after being "normalised" - which also highlights why the statement is problematic/contentious.

The scenario describes, initially, Amy's use of certain TYPES of tools, but then specifically names some, e.g., Sgvizler, Aduna. That she used Sgvizler or Aduna specifically is not really the point, is it? Shouldn't the focus be on the type of functionality she could access (with Sgvizler, Aduna, etc. given as examples if necessary)? Especially as the aim of the work is to improve analysts' ability to work across the most efficient tool(s) and data for their task, not restrict them to a set of tools.

"Amy uses the selection data obtained from the previous application" -> points to footnote 12. Why is the text in the footnote not worked into the text - it isn't extra info describing the point, but a separate statement that is part of the discussion.

S3.6 - much as I agree going back to merge two sets of data is inefficient, I can't see why Amy would need to keep the "ClusterMap visual" in memory to do this. Surely, opening multiple windows on a computer screen is not unusual?

"To facilitate linking, analysts’ ecosystems force them to perform anti-patterns such as house tops and hill slides that are associated with higher costs munges. Tools that generate gleanable datasets have the potential to reduce such costs." - this is contradictory.

S4.1 - "The cost to perform an application is at least equal to the cost incurred by its munges. This inequality is based …" - why inequality? " - at least equal" means it may be equal. The equation doesn't shed much more light on what exactly is meant here.

In S5.1: "application ↵1,2 earns one-star " - but Fig. 11 shows two. In 5.2 "application ↵1,2 earns two stars" - which of these two is correct? Further, who is "Mary" in the second discussion?

In S5.3: "The cost bound for three-star applications is not only tighter than one- and two-star applications, but also lower since the upper cost is reduced from 38 to 34." 34 or 24? The equation has the latter.

S5.4 - the inset seems a bit out of place - would this not have been more useful in the case description?

In the discussion in S5.6: "Amy would have been able to use the result generated by Aduna ClusterMap in the previous application \alpha_5,1, rather than resorting to an earlier, less evolved dataset. To accommodate, Google Earth would need to be modified to accept geospatial RDF and produce gleanable geospatial visualizations."
I really cannot see how resolving this issue becomes the problem of Google Earth (see also point above about named tools). The argument really should be that Amy makes use of a tool that can take this input and provide her with the map view she needs.

S6 ends "We believe our work embodies the community’s assumptions, claims, and hypothesis as a simple theory that can be used to assess, predict, and refute the tenets of Linked Data that have been advertised for nearly a decade." Was "refute" really meant?

S7 "These extensions should draw strength from the on-going work in the area of Linked Data quality, while at the same time inform future work in that area." references for work on Linked Data quality needed here.

FIGURES

Wrt Fig. 6 and 7… I'm a bit confused - if the charts are "gleanable" why is there a hill slide from the RDF data input?

Fig. 10 - I cannot see either grey or bold face

The labels on the map in Fig. 8 are truncated such that the reader can't interpret it - it would be useful to annotate it to provide this info.

I initially spent quite a bit of time hunting for where in the text figures were cited - it would be useful to place figures on the same page or, at worst, following the point where they're cited. Placing them earlier, as is done in some cases, makes it difficult to find the relevant context for interpretation.

******* Minor Points

S5.6 - "Both scores S1 and S⇤ when we incorporated a greater number of five-star applications." - sentence is incomplete.

"Semantic Automated Discovery and Integration (SADI)" needs to be cited. This is actually finally done, hidden in a reference to another point.

The text on some of the smaller figures is difficult to read in print - needs higher resolution. It would also, especially for this reason, be useful to annotate the figures to highlight the four countries of interest. Esp. China… it took a bit of searching for me to realise PRC was the matching entry - some countries use abbreviations, others don't - some consistency here would help with reading the charts.

Footnote 22 is incomplete

"Amy’s cognitive model [14] that is casted into a mundane sequence " - cast is irregular - past of "cast" is "cast"

" non-trivial analyses that span across multiple applications." -> "non-trivial analyses that span [] multiple applications." - no "across"

Wrt language, the paper is generally well written. There are, however, a number of minor grammatical errors/typos which will be caught by an auto check and/or proofread. Section 7 especially changes in style and has a lot of errors - this really needs a single author to proofread and homogenise.