Review Comment:
The authors propose intuitive support, especially for non-expert users, to increase perception and reduce cognitive load, through the use of "ad-hoc", online, visual and statistical analysis of very large, dynamic, heterogeneous, hierarchical data - such as Linked Data - using a tree visualisation model derived from data properties.
The paper starts with a good introduction to features provided by visual analysis tools to support exploration, and help to provide useful overviews for large datasets (such as LD) that do not overwhelm the end user.
The authors propose the use of hierarchical visualisation as a solution - while this is a good/reasonable option in and of itself, it would be useful if the introduction also took a brief look at other options, to justify why this is a better choice than any other.
The proposal made is interesting, and could make a good contribution to libraries for visual analysis of RDF or LD. However, I have a couple of reservations:
1. The model and tool are (system) tested, and the authors report “very good" results. However, for a visualisation tool, user evaluation is vital - without proof of usability, the most amazing system test results may not translate into a usable tool or one that will be adopted in practice. While it may be argued as out of scope for this paper, I expected at least some plan for evaluation with end users to be given, especially as the authors state at the start the need to support esp. non-tech users. Also, the future work section states plans to develop additional visual techniques; evaluation with end users to determine requirements for these is, at the least, a good idea.
2. Again, esp. wrt non-tech end users - the authors state that the contribution is not to develop some new hierarchical vis technique, but to provide a model that may be integrated into other vis systems.
Two quotes are of particular relevance here (in order of appearance):
“We are primarily interested on enhancing the visualization and user exploration functionality by providing statistical properties of the visualized datasets and objects, making use of existing computation techniques." - how? I can see the point in the statistical info, but how does this influence/enhance the visualisation or visual analysis?
“In contrast to above approaches, our work does not introduce a new hierarchical visualization technique, instead it proposes a model that can be adopted by the existing non-hierarchical visualization techniques, in order to provide multilevel visualizations." - the argument could be made that some other hierarchical vis model, or even implementation of this, could be used to achieve the same. As above, what in this model makes it preferable over such an option?
Obviously, integrating this model into some other tool cannot be done by a non-tech user, and even for tech experts, this would need some development expertise. FYI, I couldn’t find the config panel - maybe this would help to answer my question? It would be useful to provide some guidelines, however brief, about how this would be done. Essentially, the proposal, in this respect, follows principles of reuse. However, it’s difficult to imagine how this would be done in practice - rdf:SynopsViz has ONE hierarchical visualisation technique implemented on top of the model - a treemap. These are not particularly intuitive, and navigation especially becomes quite complex as interim nodes and leaves increase. Having watched the video, and tried out the online tool, I cannot see where else the hierarchical model is being reused, except, maybe, in the facet panel?
An example of this outside rdf:SynopsViz would be useful - unless, of course, this was an existing tool where exactly this was done to prove the point?
*** Detailed review
“The same also holds for any additional information (e.g., statistical information) that is computed for each group of objects. This information must be recomputed when the groups of objects (i.e., data organization) are modified." - does this have any impact on response? - see also point at bottom about response times reported.
p.3 - “Also, the proposed structure aims at organizing the data in a practical manner for a (visual) exploration scenario, rather than for indexing and querying efficiency." - thinking out loud… would optimising for querying not affect (improve) response during exploration? I suppose it could go either way, depending on how this is implemented, but it would be worth clarifying this.
Don’t understand footnote 1 - what exactly does “uniformly handles" mean?
I do not follow this: D = S, i.e., “for each tr ∈ D iff tr ∈ S" - ∈ denotes memberOf - if S is derived from D, this still makes it a sub-set that may or may NOT be the complete dataset, no? Unless this reads “iff for each tr ∈ D, tr ∈ S"
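For comparison, the standard set-theoretic statements are written out below. This is only a sketch of what I presume the sentence intends (the second line), using the paper's symbols D, S and tr:

```latex
\begin{align*}
D \subseteq S &\iff \forall\, tr\; (tr \in D \Rightarrow tr \in S) \\
D = S &\iff \forall\, tr\; (tr \in D \Leftrightarrow tr \in S)
\end{align*}
```

If S is merely drawn from D, only S ⊆ D is guaranteed, which is the source of my confusion.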
p.5 - in fig.2 node f has ‘p3 age 37’ - fig.1 doesn’t have this value, but 40 for p3
“Each leaf node contains λ or λ − 1 triples, where λ = ceil(|D|/l)^3" - this is simply an artefact of the formatting, but I read this as “cubed/to the power 3" - it wasn’t till I reached ex.3 and found a mismatch in the equations that I realised this is footnote 3. I’d suggest moving the footnote into the text or attaching the pointer to text, rather than the equation. Same for footnote 4.
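To show the reading I eventually arrived at (λ as a plain ceiling, with no cube), here is a minimal sketch. It assumes the |D| triples are spread as evenly as possible over l leaves; the function name `leaf_sizes` and its parameters are mine, and this is not the authors' construction procedure.

```python
def leaf_sizes(num_triples: int, num_leaves: int) -> list:
    """Spread num_triples over num_leaves as evenly as possible.

    Hypothetical helper for illustration only: every leaf ends up with
    lam or lam - 1 triples, where lam = ceil(num_triples / num_leaves).
    """
    if num_leaves <= 0 or num_triples < num_leaves:
        raise ValueError("need at least one triple per leaf")
    lam = -(-num_triples // num_leaves)  # integer ceiling, no cube involved
    # Number of leaves that must hold lam triples so the sizes sum to num_triples.
    full = num_triples - (lam - 1) * num_leaves
    return [lam] * full + [lam - 1] * (num_leaves - full)


if __name__ == "__main__":
    print(leaf_sizes(10, 4))
```

Under this reading the leaf sizes always sum to |D|, which is the check that exposed the mismatch for me in ex.3.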
“The third part is the ConstructInternalNodes procedure, which requires… “ on what basis is the approximation made? It’s not obvious to me how LHS = RHS
Similarly, “the overall computational cost for the HETree-C construction in the worst case is O(κlogκ + κ + κ) = O(κlogκ)" - again, how is the approximation reached? Considering the final cost calculation is the same as that for the HETree-R, even though they each start from a different point, this clarification is necessary - for both.
I can’t see any difference between algorithms 1 and 2 beyond one being R and the other C - unless I’m missing the obvious, I would suggest they both refer to the same (base) algorithm, with the (different) procedures expanded from there. On the same point, there are a few other instances where identical information (or nearly so) is repeated for C and R. Since this makes reading tedious, I’d suggest each such point be made once, in reference to both.
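I assume the intended step is the usual absorption of lower-order terms. Spelled out as a sketch, assuming κ ≥ 2 and a base-2 logarithm (neither is stated in the paper):

```latex
\begin{align*}
\kappa\log\kappa + \kappa + \kappa
  &= \kappa\log\kappa + 2\kappa \\
  &\le \kappa\log\kappa + 2\kappa\log\kappa
     && (\log\kappa \ge 1 \text{ for } \kappa \ge 2) \\
  &= 3\kappa\log\kappa \in O(\kappa\log\kappa)
\end{align*}
```

If this is the argument, the paper should say so, and should state which construction step contributes which term.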
For ex.5, why are λmin = 25 and λmax = 50 ideal/optimal/preferred values? Note, I’m not saying they’re not, but it’s not clear whether they’re random values, based on a specific screen size, based on a [named] set of visualisation guidelines… Along the same lines, why is d = 2 rejected? - what makes it “extreme"? To illustrate, if D had a value, say, 4, 2 would be perfectly acceptable, if not the preferred value of d.
What data can be input - does this have to be a pointer to a (downloadable) RDF file? I ask because the paper specifically addresses linked data, not static RDF. I tried a couple of pointers to RDF data sets but just ended up with the everlasting wheel - one is http://www.bbc.co.uk/nature/life/Mammal.rdf
To that end, in S5.2 it would be more useful to point to the actual datasets used, rather than the top level site. Esp as the online tool does not provide a way to browse the source data (or maybe I just didn’t find it?)
FYI, I managed to get the tree and chart views to display for the default dataset loaded, but not the timeline, just the wheel of death.
Some of the tools listed in the review are quite old; I would suggest checking to make sure they all still work (some of the URLs fail, some don’t return anything for the examples provided). Alternatively or additionally, distinguish which are early prototypes that may still be in use, and which are simply an illustration of where the field has been.
“Payola [45] … The framework offers a variety of domain-specific analysis and visualization plugins (e.g., graphs, tables, etc.)." - I wouldn’t say graphs and tables were domain-specific, or is this referring to analysis (non-visual) alone? If so, what are examples of these?
“Balloon Synopsis [62] … it supports automatic information enhancing, similarity analysis and ontology templates. “ - how exactly is information enhanced here? And from a vis viewpoint, what contribution do the ontology templates make?
How does VISU present the university data? That it is domain-specific is interesting, but no other information is provided.
“In our evaluation scenario, the numeric and temporal properties induced in the employed datasets, are visualized using our hierarchical model. " - what is meant by “induced"? - the properties exist or they don’t. At best they may be derived by linking with another dataset. Unless this has some specific meaning, which needs explaining, I’d suggest deleting it -> “ …properties in the employed datasets are visualized …"
“As the number of input triples increases, the construction time slightly increases, too." - in reference to Tables 4 & 5 - however, these do not provide tripleCount, therefore there’s nothing to compare with.
“From Tables 4 & 5, we can see that the areaWater property requires the minimum construction time (8.7 msec), in HETree-R case; while birdDateP requires the maximum time (346.6 msec). Overall, the HETree structure takes reasonable time, even for properties with 4396 triples, allowing real-time user interaction." … and… without knowing the data structure or how important either property is in relation to the other, the statement isn’t meaningful. The numbers are finally presented later - this should be done at the first point, otherwise the reader’s job is made unnecessarily difficult.
For areaWater - 58 triples (small) -> 2% constructing the HETree
squadNumber - 198 triples (medium) -> 5% for HETree
birthDateP - 4396 triples (largest) -> 52% for HETree
The conclusion that on average “only 9% for constructing the HETree" is a bit skewed. Statistically correct, yes, but meaningful, not so much - the variance is HUGE. The comparison should be more like for like.
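To make the spread concrete, here is a minimal sketch using only the three percentages quoted above. It does not cover the full set of properties behind the 9% average, which I do not reproduce here, and the variable names are mine.

```python
from statistics import mean, median, pstdev

# Share of total time spent constructing the HETree, in percent,
# for the three properties quoted in this review (not the paper's full set).
hetree_share = {"areaWater": 2, "squadNumber": 5, "birthDateP": 52}

values = list(hetree_share.values())
summary = {
    "mean": mean(values),
    "median": median(values),
    "pstdev": pstdev(values),          # population standard deviation
    "range": max(values) - min(values),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Even on these three points the median and the mean sit far apart, which is why I would prefer a like-for-like comparison (e.g., grouped by triple count) over a single average.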
*****
FIGURES & TABLES
Fig. 1 appears before it is referred to, a bit lost/confusing. The same happens with Fig.2 - in fact the section describing the layout is that following the one in which it’s placed! - it took another half page for me to find answers to my questions about how the layout was determined.
Same with Tables 2, 3, 4 & 5. Worse, most of the description in the text is on the following page so it is tedious going back and forth to map content to description.
“Regarding the most detailed level (i.e., RDF triples), several visualization types are offered; e.g., area, column, line, spline, areaspline, etc. (Figure 7)." - the three examples are so similar visually as not to be a good example set. I’d suggest three sufficiently different visualisation types on the same data, so the user can see how each type contributes to analysis (from a different perspective).
Fig 9 - the lines are so close together that even with the annotation it’s impossible to tell which belongs to C or R (even with the closeness noted in the text). I’d suggest one line be broken and the other solid, or some such, or that colours distinguishable in monochrome be used. In addition, the y-axis could be staggered/split to pull apart and increase space between the two sets of results.
Further, quantitative data as a measure of “very good" really needs a benchmark or threshold of some sort. Compared to [STATED] expected/recommended maximum response times for interactive analysis, how good is 0.7 sec for 4K objects? Is 4K typical, average, low, for this kind of analysis?
*** Very large number of typos and grammatical errors - most if not all would be caught by an auto-check and proofread. Several, but not all, examples:
“Given an RDF dataset R consisted of a set of RDF triples. " - not a complete sentence. Further, “CONSISTING"
Ditto, “Regarding the time required for the construction of the HETreee structure."
“user-friendly" is not encouraged much any more; I would suggest the term “usable" instead
“evaluation of our system is presented in Section 16" - doesn’t go to 16
“ The level of a node is defined by letting the root be at level zero. If a node is at level l, then its children are at level l + 1." - the second sentence is redundant
S 5.1 - mixture of different tenses
deadDate => deathDate & birdDateP => birthDateP - importantly, these are correct in the tables but not in the text; for consistency alone, this check should have been made.
“overloading is a common issue in large datasets visualisations" -> “overloading is a common issue in large [dataset visualization]"
“It also enriches groups with statistical information regarding their contents," -> CONTENT - no ’s’
“ insights on" -> “ insights INTO"
“The remaining of this paper“ -> REMAINDER
“without setting any constraint at the way “ -> “without setting any CONSTRAINTS ON the way "
“Assume that S [i] denote to the i-th triple, with S[1] be the first triple." -> “DENOTES THE …" & “WITH S[1] THE FIRST…"
“the ordered set S is resulted by ordering the triples" -> “the ordered set S RESULTS FROM ordering the triples"
“Let I− and I+ denote the lower or upper bound of the interval I, respectively" -> “Let I− and I+ denote the lower AND upper bound of the interval I, respectively"
“level of each leaf node differs at most one from the level of other leaves nodes" -> “level of each leaf node differs at most one from the level of other [LEAF OR LEAVES’] nodes"
“we present in more details the" -> “we present in more DETAIL the", similarly,
“a four stages workflow" -> “a four STAGE workflow", “several visualizations techniques" -> “several VISUALIZATION techniques",
“Squarified Treemaps [17] use shades in order to provide insight in the" -> “Squarified Treemaps [17] use SHADE in order to provide insight INTO the"
“Each internal node, has at most d children nodes." -> comma redundant, there is no natural pause here
another e.g., “Assume the scenario in which, a user wishes to (visually) explore and analyse the historic events from DBpedia, per decade. “ - the first comma is redundant and makes it more difficult to read, while the second is appropriate.
Also, every instance of “Note that," should NOT have the trailing comma - the comma would be fine with the short form (Latin) “N.B., … " OR “Note, …"
“For an internal node, its interval is bound by the union" -> BOUNDED - as in boundary, otherwise this is past tense of bind, which I don’t think is what is meant here.
“where the objects values of each group" -> “where the OBJECT values of each group"
While maybe not incorrect, “Opposite to HETree-C" is a bit unusual, this would more typically be written “In opposition to HETree-C, in HETree-R…" or “conversely, HETree-R…"
“a HETree" -> “AN HETree"
“in a button-up fashion" - should this be “bottom-up"?
“In case, where more than one settings satisfy the" -> “In THE case where more than one SETTING SATISFIES the"
“with the children nodes g and h" -> “with the CHILD nodes g and h"
“based on user’s preferences" either “based on USERS’ preferences" OR “based on THE USER’S preferences"
“Facets Generator" -> “FACET Generator" (whether plural or singular)
“The problem of ontology visualization and exploration have been extensively … In existed works, “ -> “The problemS of ontology visualization and exploration have been extensively … In EXISTING works, “
ect. -> etc.
“superscripts indicating the dataset are used in properties names." -> “propertY names" OR “properties’ names"