Review Comment:
The paper describes an approach to aid more fluid analysis across multiple tools and using (input and output) data in various formats, by harnessing SW technology.
The authors posit that to get the best returns from visual analytics and research/analysis using Linked Data there is the need to calculate the costs associated with the analytical process. They term their proposal "Seamless Analytics", described using a five-star model.
There is fairly detailed coverage of relevant literature, especially from the Visual Analytics viewpoint, and this provides good grounding for the work reported.
While a bit difficult to follow, the authors detail a scenario that illustrates an analyst's path to (serendipitous) information discovery. Some of the examples are, however, not very realistic.
Overall, the paper identifies a valid challenge for data analytics, whether visualisation-based or not. It is however difficult to follow, in part because the arguments could be more compactly summarised. Also, how the costs were calculated is not very clear to me. It would also be useful to provide the ideal cost value (for 5 stars) - this would help to evaluate the relative cost for the scenario presented. 2.x compared to infinity is a bit difficult to interpret.
The argument for promoting RDF (and at the next level, Linked Data) as the preferred method for encoding/formatting data, as a route to easing data analysis, could be stronger. While Linked Data is mentioned several times, the authors really focus on RDF as the ideal data set. I could give a few examples in the use case where it would have been useful to have emphasised the particular benefits of LD, e.g., in switching from one perspective to another, especially where this necessitated obtaining additional data that could be seen as attributes of that in use. In fact, the most detailed explanation of why LD is a good candidate is found in the further work section.
******
I have a few questions about the future work:
"For example, an organization with a strong workforce of linked data researchers might further widen the cost between shimming and aligning to better emphasize their efficiency with semantic technologies." - don't understand this - is this saying that such researchers might deliberately RAISE their costs just because they can, or to prove they can, build tools to accommodate this? Or is this saying this would be a route to reducing overall cost?
Why does the pot function not already return "actual cost savings" - was this deliberate design? Or was this simply what can be calculated for now? In any case, why change it? Further, unless the current function does not serve a useful purpose (which begs the question why use it at all), why replace it?
The proposal for further work is actually almost a whole (position) paper in itself; too many points are addressed in more detail than is necessary. I would suggest that (only) a few key ones directly related to challenges raised in the main content be addressed here.
A few very specific terms not in wide usage need to be defined at first use, or the text should at least point forward to where the definition can be found. E.g., "munge" is not defined till p.3.
"Munging has been recognized in the field for decades …" - WHICH field - the submission is to a Semantic Web journal.
Also need to expand all acronyms, especially those not in common use, at first use. E.g., "… stumbles upon a KML file provided by Analytical Graphics Inc. (AGI)" -> pointing to a footnote with a description of KML - if this is necessary then the implication is the reader may need an expansion of the acronym.
Also, capitalisation of acronyms should be used correctly and consistently - problem mainly in reference list.
"it remains difficult to easily reuse those tools in evolving environments such as the world of linked data analytics – perhaps because they rely on more mundane representations that make it difficult to establish and maintain connections across analyses." - what exactly does "mundane" mean?
"This relation is also shown in Figure 2 using PROV, but we further relate munging activities as also being part of the application." -> points to footnote 3, which points to DC. It's not obvious to me how this is relevant.
"Amy does not know, however, which countries are most responsible for the resulting environmental condition. Is it her home country launching the majority of junk, or some other developing country new to space exploration and less sensitive to environmental awareness?" - Is there some justification for assuming that a "developing country … [IS] … less sensitive to environmental awareness?" While someone may make such an assumption, it is probably not a good example for a scientific paper, nor for a scenario/use case for a research analyst. Of course, the results further on show this not to be the case, even after being "normalised" - which also highlights why the statement is problematic/contentious.
The scenario describes, initially, Amy's use of certain TYPES of tools, but specifically names some, e.g., Sgvizler, Aduna. That she used Sgvizler or Aduna specifically is not really the point, is it? Shouldn't the focus be on the type of functionality she could access (and maybe Sgvizler, Aduna ... could be given as examples if necessary)? Especially as the aim of the work is to improve analysts' ability to work across the most efficient tool(s) and data for their task, not restrict them to a set of tools.
"Amy uses the selection data obtained from the previous application" -> points to footnote 12. Why is the text in the footnote not worked into the text - it isn't extra info describing the point, but a separate statement that is part of the discussion.
S 3.6 - much as I agree going back to merge two sets of data is inefficient, I can't see why Amy would need to keep the "ClusterMap visual" in memory to do this. Surely, opening multiple windows on a computer screen is not unusual?
"To facilitate linking, analysts’ ecosystems force them to perform anti-patterns such as house tops and hill slides that are associated with higher costs munges. Tools that generate gleanable datasets have the potential to reduce such costs." - this is contradictory.
S4.1 - "The cost to perform an application is at least equal to the cost incurred by its munges. This inequality is based …" - why inequality? "At least equal" means it may be equal. The equation doesn't shed much more light on what exactly is meant here.
In S5.1: "application α1,2 earns one-star " - but Fig. 11 shows two. In 5.2 "application α1,2 earns two stars" - which of these two is correct? Further, who is "Mary" in the second discussion?
In S5.3: "The cost bound for three-star applications is not only tighter than one- and two-star applications, but also lower since the upper cost is reduced from 38 to 34." 34 or 24? The equation has the latter.
S5.4 - the inset seems a bit out of place - would this not have been more useful in the case description?
In the discussion in S5.6 "Amy would have been able to use the result generated by Aduna ClusterMap in the previous application α5,1, rather than resorting to an earlier, less evolved dataset. To accommodate, Google Earth would need to be modified to accept geospatial RDF and produce gleanable geospatial visualizations."
I really cannot see how resolving this issue becomes the problem of Google Earth (see also point above about named tools). The argument really should be that Amy makes use of a tool that can take this input and provide her with the map view she needs.
S6 ends "We believe our work embodies the community’s assumptions, claims, and hypothesis as a simple theory that can be used to assess, predict, and refute the tenets of Linked Data that have been advertised for nearly a decade." Was "refute" really meant?
S7 "These extensions should draw strength from the on-going work in the area of Linked Data quality, while at the same time inform future work in that area." - references for work on Linked Data quality are needed here.
FIGURES
Wrt Fig. 6 and 7… I'm a bit confused - if the charts are "gleanable" why is there a hill slide from the RDF data input?
Fig. 10 - I cannot see either grey or bold face
The labels on the map in Fig 8 are truncated such that the reader can't interpret it - would be useful to annotate it to provide this info.
I initially spent quite a bit of time hunting for where in the text figures were cited - it would be useful to place figures on the same page as, or at worst following, the point where they're cited. Placing them earlier, as is done in some cases, makes it difficult to find relevant context for interpretation.
******* Minor Points
S5.6 - "Both scores S1 and S* when we incorporated a greater number of five-star applications." - sentence is incomplete.
"Semantic Automated Discovery and Integration (SADI)" needs to be cited. This is actually finally done, hidden in a reference to another point.
The text on some of the smaller figures is difficult to read in print - needs higher resolution. It would also, especially for this reason, be useful to annotate the figures to highlight the four countries of interest. Esp. China… it took a bit of searching for me to realise PRC was the matching entry - some countries use abbreviations, others don't - some consistency here would help with reading the charts.
Footnote 22 is incomplete
"Amy’s cognitive model [14] that is casted into a mundane sequence " - cast is irregular - past of "cast" is "cast"
" non-trivial analyses that span across multiple applications." -> "non-trivial analyses that span [] multiple applications." - no "across"
Wrt language, the paper is generally well written. There are however a number of minor grammatical errors/typos - will be caught by an auto check and/or proofread. Section 7 especially changes in style and has a lot of errors - this really needs a single author to proofread and homogenise.