Hierarchical Visual Exploration and Analysis on the Web of Data

Tracking #: 1139-2351

Authors: 
Nikos Bikakis
George Papastefanatos
Melina Skourla
Timos Sellis

Responsible editor: 
Guest editors linked data visualization

Submission type: 
Full Paper
Abstract: 
The purpose of data visualization is to offer intuitive ways for information perception and manipulation, especially for non-expert users. Most traditional visualization tools and methods operate offline, requiring users to access data locally. They also restrict themselves to small datasets, which can be easily analysed visually with conventional visualization techniques. However, the Web of Data has made available a great number and variety of big interlinked datasets that are dynamic in nature; most of them offer SPARQL or API endpoints for online access and analysis. Modern visualization techniques must address the challenge of ad-hoc visualization of large dynamic sets of data, offering efficient data organization and exploration techniques. Moreover, they must take into account user-defined exploration scenarios and visualization preferences. In this work, we present a model for building, visualizing, and interacting with hierarchically organized numeric and temporal Linked Data (LD). Our model is built on top of a lightweight tree-based structure which can be efficiently constructed on-the-fly for a given set of data. This tree structure organizes input objects into a hierarchical multi-level model based on the objects’ values. Additionally, we define two versions of this structure, which adopt different data organization approaches, well-suited to the visual exploration and analysis context. Furthermore, statistical computations can be efficiently performed on-the-fly in the proposed structure. Considering different exploration scenarios over large datasets, the proposed model enables efficient multi-level exploration, offering incremental construction via user interaction, and dynamic adaptation of the hierarchies based on the user’s preferences. The proposed model is realized in a Web-based prototype tool, called rdf:SynopsViz, that offers multi-level visual exploration and analysis over LD datasets.
Finally, we provide a performance evaluation and an empirical user study employing real datasets.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Tomi Kauppinen submitted on 31/Jul/2015
Suggestion:
Accept
Review Comment:

This version of the paper has been carefully revised according to all comments from the reviewers and editors. Thus I would accept it as it is.

Review #2
By Heiko Paulheim submitted on 01/Sep/2015
Suggestion:
Minor Revision
Review Comment:

The paper has been substantially extended, both in the description of the algorithms and through the inclusion of a user study. With this, the paper has improved significantly; however, there are still a few open issues.

My first concern is that some parts of the paper still discuss something which is, in principle, trivial data mining knowledge, i.e., the distinction between equal-width and equal-height discretization. This does *not* hold for the interactive construction of those hierarchies, but at least the basic distinction is rather well known in computer science and should be treated more briefly.
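For readers unfamiliar with the distinction, the following is a minimal sketch of the two discretization schemes, not the authors' algorithm. The function names and the sample values are hypothetical and chosen only for illustration; equal-height discretization is implemented here as equal-frequency binning over the sorted values.

```python
def equal_width_bins(values, k):
    """Split [min, max] into k intervals of identical width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate input: all values are identical, so use the first bin.
        return [list(values)] + [[] for _ in range(k - 1)]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical sample values, not taken from the paper.
values = [1, 2, 3, 4, 20, 30, 35, 100]
width_bins = equal_width_bins(values, 4)
freq_bins = equal_frequency_bins(values, 4)
print(width_bins)
print(freq_bins)
```

With skewed values such as these, the equal-width intervals hold very different numbers of objects, whereas the equal-frequency groups hold the same number of objects but cover intervals of very different widths.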

For the interactive construction, it would be interesting to see what the workload on the SPARQL endpoint is. How many requests does the client have to send? My assumption is that the amount of data to be transferred is much bigger for content-based than for range-based trees (as the latter only requires the min and max initially, while the former requires *all* values), but a more thorough analysis would be appreciated.

With respect to the runtimes reported in table 4, the authors should make a statement about how they influence the usability of an interactive tool. Actually, 1min26s is way too long for a web page to load or a tree node to open. It looks like the approach works for properties roughly below 10,000 values, where the response time is below a second. The authors should also address some ideas for improving those times in future work, e.g., which basic pre-computations can be performed to scale up the tool.

Finally, the authors have acknowledged some previous works in hierarchical discretization. At least for inspecting the resulting trees for some DBpedia properties (not necessarily for the user study), the authors should contrast their approach with the existing approaches from the literature.

In the algorithmic part, some questions remain open, which should be clarified by the authors.
* In the example shown in Fig. 2, f could as well be a child of c. Of course the solution presented is more balanced (i.e., the interval widths of b and c are distributed more equally), but this bias towards equal width intervals as a tie breaker is not mentioned.
* In the example in Fig. 3, what would happen if p7 was not in the dataset? Would c only have h as its child (it cannot be a leaf, since all leaves need to be on the same hierarchy level)? In general, do outliers always lead to degenerated trees (think: what would the tree look like if the value of p1 was 1,000 by mistake)?
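The outlier question can be illustrated with a small sketch. It assumes that leaves correspond to equal-width intervals over the value range, which is an assumption made here for illustration and not a statement about the authors' construction; the values and the function name are likewise hypothetical.

```python
def equal_width_counts(values, k):
    """Count how many values fall into each of k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first interval.
        return [len(values)] + [0] * (k - 1)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # Clamp the maximum value into the last interval.
        counts[min(int((v - lo) / width), k - 1)] += 1
    return counts


# Hypothetical values; in the second list the first value is
# recorded as 1,000 by mistake.
clean = [20, 35, 45, 50, 55, 60, 80, 100]
with_outlier = [1000] + clean[1:]

clean_counts = equal_width_counts(clean, 4)
outlier_counts = equal_width_counts(with_outlier, 4)
print(clean_counts)
print(outlier_counts)
```

Under this assumption, a single erroneous value stretches the range so that almost all objects collapse into one interval and the intermediate intervals stay empty, which is the kind of degenerate structure the question above is concerned with.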

In section 3.2, there are quite a few complex algorithms, while the overall idea of what is going on is fairly easy to understand, so my feeling is that this level of algorithmic detail is not required. My suggestion is that the authors describe the algorithms briefly in the text and move the actual algorithms to the appendix. That way, the paper could be read more smoothly, without any information getting lost.

Minor remarks:
* Procedure 3, line 12: the <- is used for appending to a list, while before, it was used for assigning to a variable
* p.10: range-based exploration is the third, not the second scenario introduced
* DBpedia misses a reference or footnote

Review #3
By Aba-Sah Dadzie submitted on 11/Sep/2015
Suggestion:
Minor Revision
Review Comment:

The authors have addressed the review comments in detail. I have only a few minor additional comments - below. There are however some areas where the detail provided in the response needs to be included in the paper itself.

In "2. The HETree Model": "The proposed model can be adopted by various existing visualization techniques (e.g., charts, scatterplots, timeline, etc.), offering scalable and multi-level visual representations over non-hierarchical data."
- contradictory - how is a hierarchical model supposed to support non-hierarchical data? This is actually explained in the response, I suggest that bit be put in the paper itself.

In S2.3 - Computational Analysis - why the approximation is valid is not explained in the text (but in the response) - it needs to be explicitly stated in the text. Ditto for other relevant sections.

In Ex.4 - Table 1 gives 4 options, not 2 - the largest height in light grey is for 5, for d=32, not 27.

I'd suggest S6.3.1 be expanded based on the relevant part of the discussion in the response.

EVALUATION RESULTS ANALYSIS

"Regarding the time required for the construction of the HETree structure, from Table 4 we can observe the following. The performance of both HETree structures is very close for all examined properties, with the HETree-R performing slightly better than the HETree-C."
- For the temporal properties - yes, but the differences for esp. the top half for the numeric are quite big.

"Finally, for the largest property for which the construction time is dominated by the other costs (i.e., powerOutput, 5.453 triples), 42% of the time is spent on constructing the HETree." - not possible - if construction alone is 42 it cannot be dominated by any other, let alone a set - that leaves only 58%. Was the converse meant?

"Overall, our hierarchical approaches exhibit reasonable time performance (i.e., sub-linear w.r.t. number of triples), handling properties with 762K objects in about 1min 26s." - 'reasonable' is a bit vague - 1m26s is exact, but relative to what? What would be a benchmark or threshold value?

It could be argued that in Fig.11b the plots for the HETrees approximate an exponential rather than a sub-linear curve.

Is age of participants relevant? Stating only that out of several potentially relevant demographics draws attention to age.
What types of visualisations were the participants familiar with? Also, what exactly is "familiar"? BOTH influence interpretation of the results.

Maybe a bit pedantic, but I can't see how this conclusion is reached: "Essentially, we try to avoid easily guested answers like 5, 10, 50, etc. " - what makes them easy guesses? And why specifically 10 for T2.1?

"As a result, except for the long time required for this process, it is also very difficult to find the correct solution. " - 'except' doesn't make sense - should this be 'apart from'?

GRAMMAR & PRESENTATION

Quite a few errors - a few examples given below.

Strange change to present tense in S5.3.2

"However, offering an overview of a large dataset, is an extremely challenged task." -> "However, offering an overview of a large dataset is an extremely challengING task."

endpoits -> endpoints

"In this case, user orders historic events by their dates and organizes…" -> DATE - no 's'

"In opposition to HETree-C, in HETree-R "… should be "as opposed to…" - the two expressions don't mean the same thing

"The procedure takes as input an ordered set of RDF triples S, as well as the number of leaves nodes l. " -> "leaF nodes"

"such that the resulted tree avoids overloaded and scattered visualizations." -> resulted -> resultING

"we have computed the statistical significant of the results" -> "we have computed the statistical significanCE of the results"

"presented approach are demonstrated via a through performance evaluation " - 'via' implies 'through', or was the latter meant to be 'thorough'?