Hierarchical Visual Exploration and Analysis on the Web of Data

Tracking #: 957-2168

Authors: 
Nikos Bikakis
George Papastefanatos
Melina Skourla
Timos Sellis

Responsible editor: 
Guest editors linked data visualization

Submission type: 
Full Paper
Abstract: 
The purpose of data visualization is to offer intuitive ways for information perception and manipulation, especially for non-expert users. The Web of Data has realized the availability of a huge amount and variety of datasets; most of them offer SPARQL endpoints for online access and analysis. However, most traditional visualization tools and methods operate on an offline way without offering the ability for ad hoc visualization and analysis of large dynamic sets of data. In this work, we present a model for building, visualizing, and interacting with hierarchically organized Linked Data (LD). Our model is build on top of a lightweight tree-based structure which can be easily constructed on-the-fly for a given set of data. This tree structure organizes input data objects into a hierarchical model based on the values of their properties that they exhibit. Additionally, we define two versions of this structure, which adopts different data organization approaches, well-suited to visual exploration and analysis context. Furthermore, statistical computations can be efficiently performed on-the-fly in the proposed structure. The presented model is realized in a web-based prototype tool, called rdf:SynopsViz that offers hierarchical visual exploration and analysis over LD datasets. Finally, we provide an evaluation of our approach employing LD datasets.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Tomi Kauppinen submitted on 04/Feb/2015
Suggestion:
Minor Revision
Review Comment:

My review of the paper is organized according to suggested dimensions of (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality

The paper is very good and clear. The idea (hierarchical visual exploration) is rather simple but the development of it is done and communicated in a very nice way. The paper properly explains the motivation, provides algorithms (that are surprisingly easy to follow), gives examples and presents the evaluation results. It would have been nice to see some evaluation of the system also with human users but I suppose for the needs of the paper the current evaluation (time elapsed with different methods) should be sufficient.

(2) significance of the results

The results are clearly good and support well the requirements (need for the real-time exploration of data).

(3) quality of writing.

The paper is very well written. Some minor issues:

- there are almost no spelling issues, except in the abstract "our model is build …" -> "our model is built"
- Fig. 8 is quite fuzzy. It would be nice to be able to read the text in it. Figures 6 & 7 should also be improved if possible.

I would recommend accepting the paper, provided that the few issues above are corrected.

Review #2
By Heiko Paulheim submitted on 16/Feb/2015
Suggestion:
Reject
Review Comment:

The paper introduces a semantic web visualization technique based on the creation of hierarchies for numerical values, which allow for exploring such data in a drill-down fashion.

I have two major concerns about this paper. First, the creation of hierarchies for numerical values, which fills the major portion of this paper, is actually a well-researched problem, but the state of the art is not respected in this paper; instead, the authors rather present a re-invention of the wheel. Second, the evaluation is not appropriate for the problem at hand. For those two reasons, I cannot recommend the acceptance of this paper.

With respect to my first concern: the authors, in short, want to create a hierarchic discretization of a numeric attribute. Discretizing numerical attributes is a well-researched field in data mining (see any data mining textbook, such as [1], and an introductory article [2]). Hence, the two characteristics of HETrees introduced by the authors have their counterparts in the literature: the discretization created by HETree-C is known as "equal frequency binning"; the one created by HETree-R as "equal width binning". Furthermore, sophisticated implementations exist in tools such as Weka or RapidMiner. More advanced hierarchical variants for discretization have also been proposed in the past, see, e.g., [3]. Here, the authors should argue why their work is new, or build on existing works.
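For illustration, the two textbook discretization schemes named above can be sketched in a few lines. This is a minimal sketch of generic equal-width and equal-frequency binning, not the paper's HETree construction; the function names and the sample values are hypothetical.

```python
from typing import List, Tuple


def equal_width_bins(values: List[float], k: int) -> List[Tuple[float, float]]:
    """Split the range [min, max] into k intervals of identical length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]


def equal_frequency_bins(values: List[float], k: int) -> List[List[float]]:
    """Split the sorted values into k groups whose sizes differ by at most one."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` groups get one more
        groups.append(ordered[start:start + size])
        start += size
    return groups


# Hypothetical numeric property values (not taken from the paper).
sample = [20, 30, 35, 37, 40, 50, 55, 80, 110]
width_bins = equal_width_bins(sample, 3)      # intervals of equal length
freq_bins = equal_frequency_bins(sample, 3)   # groups of equal size
```

Equal-width binning fixes the interval length and lets the group sizes vary, while equal-frequency binning fixes the group size and lets the interval lengths vary.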

With respect to my second concern: evaluating the tree characteristics and runtimes to build the trees is interesting, but does not prove the utility of the approach proposed by the authors. What is required here is a user study showing that end users benefit from the proposed visualization technique, i.e., they can address an information need faster, or with greater satisfaction.

Another remark about the evaluation is that I cannot retrace the triple counts in table 2. The actual numbers in DBpedia are much larger. For example, for areaWater (the first example), there are 55,501 in DBpedia 2014, which is three orders of magnitude larger than the 58 in the example. The same holds for the other properties as well.
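As a quick check of the "three orders of magnitude" figure, using only the two counts quoted above:

```latex
\frac{55{,}501}{58} \approx 957 \approx 10^{2.98}
```

The ratio is just under a factor of one thousand.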

Minor comment: Placing the related work section between the system description and the evaluation is a bit unusual. It commonly goes after the introduction or before the conclusion.

[1] Tan et al.: Introduction to Data Mining.
[2] Dougherty et al.: Supervised and unsupervised discretization of continuous features. ICML 1995.
[3] Shen and Chen: A dynamic-programming algorithm for hierarchical discretization of continuous attributes. In: European Journal of Operational Research 184(2), 2008

Review #3
By Aba-Sah Dadzie submitted on 20/Apr/2015
Suggestion:
Major Revision
Review Comment:

The authors propose intuitive support to especially non-expert users, to increase perception and reduce cognitive load, through the use of "ad-hoc", online, visual and statistical analysis of very large, dynamic, heterogeneous, hierarchical data - such as Linked Data - using a tree visualisation model derived based on data properties.

The paper starts with a good introduction to features provided by visual analysis tools to support exploration, and help to provide useful overviews for large datasets (such as LD) that do not overwhelm the end user.
The authors propose the use of hierarchical visualisation as a solution - while this is a good/reasonable option in and of itself, it would be useful if the introduction also took a brief look at other options, to justify why this is a better choice than any other.

The proposal made is interesting, and could make a good contribution to libraries for visual analysis of RDF or LD. However, I have a couple of reservations:
1. the model and tool are (system) tested, and the authors report “very good" results. However, as a visualisation tool, user evaluation is vital - without proof of usability, the most amazing system test results may not translate into a usable tool or one that will be adopted in practice. While it may be argued as out of scope for this paper, I expected at least some plan for evaluation with end users to be given, especially as the authors state at the start the need to support esp. non-tech users. Also, the future work (FW) section states plans to develop additional visual techniques; evaluation with end users to determine requirements for these is at the least a good idea.

2. Again, esp. wrt non-tech end users - the authors state that the contribution is not to develop some new hierarchical vis technique, but to provide a model that may be integrated into other vis systems.
Two quotes are of particular relevance here (in order of appearance):

“We are primarily interested on enhancing the visualization and user exploration functionality by providing statistical properties of the visualized datasets and objects, making use of existing computation techniques." - how? I can see the point in the statistical info, but how does this influence/enhance the visualisation or visual analysis?

“In contrast to above approaches, our work does not introduce a new hierarchical visualization technique, instead it proposes a model that can be adopted by the existing non-hierarchical visualization techniques, in order to provide multilevel visualizations." - the argument could be made that some other hierarchical vis model, or even implementation of this, could be used to achieve the same. As above, what in this model makes it preferable over such an option?

Obviously, integrating this model into some other tool cannot be done by a non-tech user, and even for tech experts, this would need some development expertise. FYI, I couldn’t find the config panel - maybe this would help to answer my question? It would be useful to provide some guidelines, however brief, about how this would be done. Essentially, the proposal, in this respect, follows principles of reuse. However, it’s difficult to imagine how this would be done in practice - rdf:SynopsViz has ONE hierarchical visualisation technique implemented on top of the model - a treemap. These are not particularly intuitive, and navigation especially becomes quite complex as interim nodes and leaves increase. Having watched the video, and tried out the online tool, I cannot see where else the hierarchical model is being reused, except, maybe, in the facet panel?
An example of this outside rdf:SynopsViz would be useful - unless, of course, this was an existing tool where exactly this was done to prove the point?

*** Detailed review

“The same also holds for any additional information (e.g., statistical information) that is computed for each group of objects. This information must be recomputed when the groups of objects (i.e., data organization) are modified." - does this have any impact on response? - see also point at bottom about response times reported.

p.3 - “Also, the proposed structure aims at organizing the data in a practical manner for a (visual) exploration scenario, rather than for indexing and querying efficiency." - thinking out loud… would optimising for querying not affect (improve) response during exploration? Suppose could be either way, depending on how this is implemented, but it would be worth clarifying this.

Don’t understand footnote 1 - what exactly does “uniformly handles" mean?

I do not follow this: D = S, i.e., “for each tr ∈ D iff tr ∈ S" - ∈ denotes memberOf - if S is derived from D, this still makes it a sub-set that may or may NOT be the complete dataset, no? Unless this reads “iff for each tr ∈ D, tr ∈ S"
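The point can be stated with the standard definition of set equality. This is a sketch that assumes, as the comment above does, that S is derived from D and so contains no triples outside D:

```latex
D = S \;\iff\; \forall\, tr \,\bigl( tr \in D \iff tr \in S \bigr)
      \;\iff\; D \subseteq S \,\wedge\, S \subseteq D
```

Under that assumption $S \subseteq D$ already holds, so the only remaining condition is $D \subseteq S$, i.e. "for each $tr \in D$, $tr \in S$". This is what the suggested rewording expresses.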

p.5 - in fig.2 node f has ‘p3 age 37’ - fig.1 doesn’t have this value, but 40 for p3

Each leaf node contains λ or λ − 1 triples, where λ = ceil(|D|/l)^3 - this is simply a factor of the formatting, but I read this as “cubed/to the power 3" - it wasn’t till I reached ex.3 and found a mismatch in the equations that I realised this is footnote 3. - I’d suggest moving the footnote into the text or attaching the pointer to text, rather than the equation. Same for footnote 4.
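To make the intended reading concrete (a ceiling, not a cube), the sketch below computes leaf sizes for λ = ceil(|D|/l). It only shows that a distribution with λ or λ − 1 triples per leaf always exists; the function name is hypothetical and the paper's own assignment of triples to leaves is not reproduced here.

```python
from typing import List


def leaf_sizes(num_triples: int, num_leaves: int) -> List[int]:
    """Return one possible list of leaf sizes, each equal to lam or lam - 1."""
    lam = -(-num_triples // num_leaves)  # integer ceiling of |D| / l
    # Number of leaves that must hold lam triples so the sizes sum to |D|.
    full = num_triples - (lam - 1) * num_leaves
    return [lam] * full + [lam - 1] * (num_leaves - full)


sizes = leaf_sizes(10, 4)  # lam = ceil(10 / 4) = 3
```

With 10 triples and 4 leaves, λ = 3 and the leaves hold 3, 3, 2 and 2 triples; reading the formula as a cube would give 27, which cannot fit.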

“The third part is the ConstructInternalNodes procedure, which requires… “ on what basis is the approximation made? It’s not obvious to me how LHS = RHS
Similarly, “ the overall computational cost for the HETree-C construction in the worst case is O(κlogκ + κ + κ) = O(κlogκ)" - again, how is the approximation reached? Considering the final cost calculation is the same as that for the HETree-R, even though they each start from a different point, this clarification is necessary - for both.
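For reference, the simplification in the quoted expression is the usual absorption of lower-order terms. The sketch below takes the three terms exactly as quoted and assumes a base-2 logarithm; it does not show how the paper arrives at each term.

```latex
\kappa \log \kappa + \kappa + \kappa
  \;\le\; \kappa \log \kappa + 2\,\kappa \log \kappa
  \;=\; 3\,\kappa \log \kappa
  \qquad \text{for } \kappa \ge 2 \ (\text{since } \log_2 \kappa \ge 1),
```

so the sum is in $O(\kappa \log \kappa)$. Two constructions with different linear-time steps can therefore share the same bound whenever both include a sorting step costing $\kappa \log \kappa$.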

I can’t see any difference between algorithms 1 and 2 beyond one being R and the other C - unless I’m missing the obvious, I would suggest they both refer to the same (base) algorithm, and then the (different) procedures expanded from there. On the same point, there are a few other instances where identical information (or nearly so) is repeated for C and R. Simply because this makes reading tedious I’d suggest the point simply be made, in reference to both.

For ex.5, why are λmin = 25 and λmax = 50 ideal/optimal/preferred values? Note, I’m not saying they’re not, but it’s not clear whether they’re random values, based on a specific screen size, based on a [named] set of visualisation guidelines… Along the same lines, why is d = 2 rejected? - what makes it “extreme"? To illustrate, if D had a value, say, 4, 2 would be perfectly acceptable, if not the preferred value of d.

What data can be input - has this got to be a pointer to a (downloadable) RDF file? Considering the paper specifically addresses linked data, not static RDF. I tried a couple of pointers to RDF data sets but just ended up with the everlasting wheel - one is http://www.bbc.co.uk/nature/life/Mammal.rdf
To that end, in S5.2 it would be more useful to point to the actual datasets used, rather than the top level site. Esp as the online tool does not provide a way to browse the source data (or maybe I just didn’t find it?)
FYI, I managed to get the tree and chart views to display for the default dataset loaded, but not the timeline, just the wheel of death.

Some of the tools listed in the review are quite old, would suggest checking to make sure they all still work (some of the URLs fail, some don’t return anything for the examples provided). Alternatively or additionally, distinguish which are early prototypes that may still be in use, or are simply an illustration of where the field has been.

“Payola [45] … The framework offers a variety of domain-specific analysis and visualization plugins (e.g., graphs, tables, etc.)." - I wouldn’t say graphs and tables were domain-specific, or is this referring to analysis (non-visual) alone? If so, what are examples of these?

“Balloon Synopsis [62] … it supports automatic information enhancing, similarity analysis and ontology templates. “ - how exactly is information enhanced here? And from a vis viewpoint, what contribution do the ontology templates make?

How does VISU present the university data? That it is domain-specific is interesting, but no other information is provided.

“In our evaluation scenario, the numeric and temporal properties induced in the employed datasets, are visualized using our hierarchical model. " - what is meant by “induced"? - the properties exist or they don’t. At best they may be derived by linking with another dataset. Unless this has some specific meaning, which needs explaining, I’d suggest deleting it -> “ …properties in the employed datasets are visualized …"

“As the number of input triples increases, the construction time slightly increases, too." - in reference to Tables 4 & 5 - however, these do not provide tripleCount, therefore there’s nothing to compare with.

“From Tables 4 & 5, we can see that the areaWater property requires the minimum construction time (8.7 msec), in HETree-R case; while birdDateP requires the maximum time (346.6 msec). Overall, the HETree structure takes reasonable time, even for properties with 4396 triples, allowing real-time user interaction." … and… without knowing the data structure or how important either property is in relation to the other the statement isn’t meaningful. The numbers are finally presented, later - should be done at the first point, otherwise the reader’s job is made unnecessarily difficult.

For areaWater - 58 triples (small) -> 2% constructing the HETree
squadNumber - 198 triples (medium) -> 5% for HETree
birthDateP - 4396 triples (largest)-> 52% for HETree
The conclusion that on average “only 9% for constructing the HETree" is a bit skewed. Statistically correct, yes, but meaningful, not so much - the variance is HUGE. The comparison should be more like for like.
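The skew can be illustrated with the three shares quoted above. This is a small sketch over those three values only: the paper's 9% average is taken over a larger set of properties that is not listed here, so the mean below is not expected to match it.

```python
from statistics import mean, pstdev

# Share of total time spent constructing the HETree, as quoted above (percent).
shares = {"areaWater": 2, "squadNumber": 5, "birthDateP": 52}

values = list(shares.values())
avg = mean(values)                      # arithmetic mean of the three shares
spread = pstdev(values)                 # population standard deviation
value_range = max(values) - min(values)
```

For these three properties the standard deviation is larger than the mean, so a single average describes none of them well; reporting the share per dataset size would be closer to a like-for-like comparison.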

*****

FIGURES & TABLES

Fig. 1 appears before it is referred to, a bit lost/confusing. The same happens with Fig.2 - in fact the section describing the layout is that following the one in which it’s placed! - it took another half page for me to find answers to my questions about how the layout was determined.
Same with Tables 2, 3, 4 & 5. Worse, most of the description in the text is on the following page so it is tedious going back and forth to map content to description.

“Regarding the most detailed level (i.e., RDF triples), several visualization types are offered; e.g., area, column, line, spline, areaspline, etc. (Figure 7)." - the three examples are so similar visually as not to be a good example set. I’d suggest three sufficiently different visualisation types on the same data, so the user can see how each type contributes to analysis (from a different perspective).

Fig 9 - the lines are so close together that even with the annotation it’s impossible to tell which belongs to C or R. (Even with the closeness noted in the text) I’d suggest one line be broken and the other solid, or some such, OR colour that can be distinguished in monochrome used. AND, the y-axis be staggered/split to pull apart and increase space between the two sets of results.
Further, quantitative data as a measure of “very good" really needs a benchmark or threshold of some sort. Compared to [STATED] expected/recommended maximum response times for interactive analysis, how good is 0.7 sec for 4K objects? Is 4K typical, average, low, for this kind of analysis?

*** Very large number of typos and grammatical errors - most if not all would be caught by an auto-check and proofread, several, but not all examples:

“Given an RDF dataset R consisted of a set of RDF triples. " - not a complete sentence. Further, “CONSISTING"
Ditto, “Regarding the time required for the construction of the HETreee structure."

“user-friendly" is not encouraged much any more, would suggest the term “usable" instead.

“evaluation of our system is presented in Section 16" - doesn’t go to 16

“ The level of a node is defined by letting the root be at level zero. If a node is at level l, then its children are at level l + 1." - the second sentence is redundant

S 5.1 - mixture of different tenses

deadDate => deathDate & birdDateP - importantly, also, these are correct in the tables but not in the text, for consistency alone, this check should have been made.

“overloading is a common issue in large datasets visualisations" -> “overloading is a common issue in large [dataset visualization]"

“It also enriches groups with statistical information regarding their contents," -> CONTENT - no ’s’

“ insights on" -> ON-> INTO

“The remaining of this paper“ -> REMAINDER

“without setting any constraint at the way “ -> “without setting any CONSTRAINTS ON the way "

“Assume that S [i] denote to the i-th triple, with S[1] be the first triple." -> “DENOTES THE …" & “WITH S[1] THE FIRST…"

“the ordered set S is resulted by ordering the triples" -> “the ordered set S RESULTS by ordering the triples"

“Let I− and I+ denote the lower or upper bound of the interval I, respectively" -> “Let I− and I+ denote the lower AND upper bound of the interval I, respectively"

“level of each leaf node differs at most one from the level of other leaves nodes" -> “level of each leaf node differs at most one from the level of other [LEAF OR LEAVES’] nodes"

“we present in more details the " -> “we present in more DETAIL the ", similarly,
“a four stages workflow" -> “a four STAGE workflow", “several visualizations techniques" -> “several VISUALIZATION techniques",
“Squarified Treemaps [17] use shades in order to provide insight in the" -> “Squarified Treemaps [17] use SHADE in order to provide insight INTO the"

“Each internal node, has at most d children nodes." -> comma redundant, there is no natural pause here
another e.g., “Assume the scenario in which, a user wishes to (visually) explore and analyse the historic events from DBpedia, per decade. “ - the first comma is redundant and makes it more difficult to read, while the second is appropriate.
Also, every instance of “Note that," should NOT have the trailing comma - the comma would be fine with the short form (Latin) “N.B., … " OR “Note, …"

“For an internal node, its interval is bound by the union" -> BOUNDED - as in boundary, otherwise this is past tense of bind, which I don’t think is what is meant here.

“where the objects values of each group" -> “where the OBJECT values of each group"

While maybe not incorrect, “Opposite to HETree-C" is a bit unusual, this would more typically be written “In opposition to HETree-C, in HETree-R…" or “conversely, HETree-R…"

“a HETree" -> “AN HETree"

“in a button-up fashion" - should this be “bottom-up"?

“In case, where more than one settings satisfy the" -> “In THE case where more than one SETTING SATISFIES the"

“with the children nodes g and h" -> “with the CHILD nodes g and h"

“based on user’s preferences" either “based on USERS’ preferences" OR “based on THE USER’S preferences"
“Facets Generator" -> “FACET Generator" (whether plural or singular)

“The problem of ontology visualization and exploration have been extensively … In existed works, “ -> “The problemS of ontology visualization and exploration have been extensively … In EXISTING works, “

ect. -> etc.

“superscripts indicating the dataset are used in properties names." -> “propertY names" OR “properties’ names"