Eye Tracking the User Experience - An Evaluation of Ontology Visualization Techniques

Tracking #: 672-1882

Authors: 
Bo Fu
Natasha Noy
Margaret-Anne Storey

Responsible editor: 
Guest editors linked data visualization

Submission type: 
Full Paper
Abstract: 
Various ontology visualization techniques have been developed over the years, offering essential interfaces to users for browsing and interacting with ontologies, in an effort to assist with ontology understanding. Yet, few studies have focused on evaluating the usability of existing ontology visualization techniques. This paper presents an eye-tracking user study that evaluates two commonly used ontology visualization techniques, namely, indented list and graph. The eye-tracking experiment and analysis presented in this paper complement the set of existing evaluation protocols for ontology visualization. In addition, the results of this study contribute to a greater understanding of the strengths and weaknesses of the two visualization techniques, and in particular, how and why one is more effective than the other. Based on approximately 500 MB of eye-tracking data containing around 30 million rows, we found indented lists to be more efficient at supporting information search, and graphs to be more efficient at supporting information processing.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 16/Jun/2014
Suggestion:
Minor Revision
Review Comment:

The manuscript provides an unbiased evaluation, independent of the authors of the original algorithms, that uses eye trackers to estimate different criteria, such as accuracy and time, in order to identify the more appropriate approach for visualizing ontology data. The two approaches tested were indented lists and graph visualizations.

The manuscript provides an overview of techniques that are used for visualizing ontologies. It explains the eye-tracking-based testing method well. At times it is a little wordy and repetitive, so it could be shortened in parts.

The authors present a thorough analysis of the data captured using the eye tracker, which provides very useful insight. The presentation and discussion based on this data are very appropriate. It would be interesting to see additional data, for example EEG or fMRI data, to get a better feel for which regions of the brain are used more during the individual tasks, which the eye tracker can only provide indirectly.

Overall, I think this is a useful contribution to the field. The manuscript is well written. It would probably be useful to include more visualization techniques than just those two, though. There are some issues and inaccuracies that I will list in the detailed comments below.

Detailed comments:

The InfoVis community (I would consider the acronym InfoVis as the one typically used instead of InfoViz) does usually test its approaches based on user studies or other means and does not merely list what the additional features are, as indicated by the authors.

"Shneiderman point out" should be "Shneiderman points out".

When discussing the results and pointing out significant differences, the p-value should be listed. I assume an ANOVA test was performed? The manuscript lists p and R values but is not very specific as to what test was used exactly. This should be clarified.

The figures list the p-values but use upper and lower case for the letter "p". The authors should stick with just one version.

Figures 19 and 20 use a rather unfortunate color scheme. Colors that are more visibly different should be used, preferably a color palette that works for people with color blindness as well.

Why is figure A-3 used in an appendix? It seems odd to just add it to the back of the manuscript. Instead it should probably be more integrated.

Review #2
By Emmanuel Pietriga submitted on 23/Jun/2014
Suggestion:
Major Revision
Review Comment:

This paper reports on a user study that compared the relative performance of two commonly-used visual representations of ontology class hierarchies: the now-ubiquitous indented lists whose nodes can be collapsed, popularized many years ago by, e.g., MS-Windows' file explorer; and networks represented as node-link diagrams. A paper by the same authors was published at ISWC 2013, running a similar(*) study (same tasks, same data), but focusing on qualitative feedback from the participants. The work reported in this submission is quite different. It takes an original approach, analyzing eye-tracker data and correlating it with quantitative measures (task completion time and error count) to provide new insights about the comparative advantages and drawbacks of the two visualization techniques considered.

(*) Whether the eye-tracker data was measured during the same study or separate-but-similar studies should be clarified.

This is quite an interesting piece of work, and I think this is definitely a promising research direction. I fully agree with the authors that we need more evaluations of ontology visualization techniques, and the idea of looking at eye-tracking data to get a better understanding of the strengths and weaknesses of the different visualization techniques is highly relevant. The authors have barely scratched the surface of what can be done in this direction, and I think this paper could inspire more work and open the way for more research that would help us design better visualizations of ontologies. That said, I have some strong reservations about the experiment that lead me to recommend acceptance only after major revisions are made, as detailed below.

My first and strongest reservation is about the interpretation of results and conclusions drawn from it. Throughout the paper, the authors treat the different eye-tracking measures and corresponding interpretations in an almost completely orthogonal manner. Information search is associated with saccades and scanpath length; information processing with fixation duration; and cognitive workload with pupil dilation. This is essentially based on results drawn individually from different studies about eye tracking, namely [34,36,49]. My first problem with this is that the authors of this submission transform the results from these prior studies, which were performed in very specific conditions on very specific tasks (**), into more general "expectations" about what visualizations of ontologies should optimize. I have two problems with that: 1. I do not buy some of the associations made and expectations drawn from them, like relative angles as an indicator of cognitive load (Section 4.3), but most importantly 2. when asking participants to perform tasks as complex as those involved in this study, there is a significant chance that there will be some interplay between those different measures. Treating them in an orthogonal manner and drawing conclusions from their individual analysis sounds a bit naive to me. There is a good chance that things are not that simple. And beyond the internal interplay between those different elements, just to take one example, there are many factors that can influence pupil dilation, including external ones, as mentioned in [49]. Granted, such factors can be controlled to some extent, and the fact that this is a comparative study of two visualization techniques evaluated in the same environment should minimize the impact/bias of such external factors. Still, my point is that I am not convinced by some of the hypotheses and analysis of study results. I am not claiming that the analysis is wrong or that there is absolutely no truth in them. I am just saying that I need some more evidence that they are indeed reliable. I need to be convinced that these different elements (saccades, fixations, scanpaths, pupil dilation, angles) and their mapping can indeed be treated orthogonally. And if there is no strong evidence that they can be treated orthogonally, this does not nullify the results of this study, in my opinion. But then this limitation has to be acknowledged and discussed, and the claims made have to be toned down, more clearly indicating that the observations made _suggest_ something rather than _show_ or _demonstrate_ it.

(**) Comparing the effectiveness of quantitative data charting techniques in [34], UI for drawing tool selection in [36], some sort of game in [49].

A side comment w.r.t pupil dilation is that as stated in [49], pupil size as seen by the eye tracker depends on the person's gaze angle. Did you apply the calibration method of [49] to compensate for this? As the two visualization techniques lay out data on screen in a very different manner, thus having a significant impact on where users are looking, there is a strong potential for confounding factors here if the distortion is not eliminated prior to analysis.

Another reservation I have is about the graph visualization technique chosen. There are many ways to lay out a network. The authors seem to have chosen a spring-based layout with straight edges between nodes; nodes that all have the same size and whose labels can potentially overlap as the layout gets reconfigured. The chosen layout strategy and the interactive features associated with the visualization would undoubtedly have an impact on the performance of this technique. Much more so than in the case of the indented list, since there are very few design variations in the latter case. I would like to better understand the rationale behind the choice of this particular network visualization technique. Why is this one representative? Is it more relevant to choose a representative technique, or one that tries to optimize some set of criteria (readability, edge crossing minimization, screen real-estate consumption, ...)?

Finally, Section 5 is doing a fairly poor job at reporting on the statistical analysis. Please provide more information about the statistical tests involved (ANOVA, post-hoc pairwise comparisons, others?) and report effect size. The latter is especially important given that according to the bar charts and box plots, and as acknowledged in parts of the text (like Section 5.3), differences are relatively small. Please also put error bars in Figures 8, 10, 12, 14, 16 and 18. It would also be nice to have charts illustrating success rate.

Minor comments:

- "500MB of eye-tracking data" and even "30 million rows of data" does not tell us much. This certainly does not belong to the abstract, and if the authors want to keep this information in the paper (first paragraph of Section 5), I would suggest providing a measure of dataset size that "speaks" more to the average reader, like the sampling frequency and duration of eye tracking sessions. Whether this represents 500MB or 1TB of data is totally irrelevant to me, to be honest.

- Section 4.1: how do you segment the continuous stream of saccades and fixations into scanpath sequences?

- Section 4.4: I guess it should be R \geq 0.9 rather than R \geq 0.09

- Section 5. "To automatically process this large volume of raw data, we generated and ran a script on them.": This is not telling the reader much... Most data analysis requires running scripts to process the data. What is the purpose of this particular script?

- Section 7: it makes little sense to use pixels as the unit to discuss areas and distances in the visualization. Clearly, what matters here is the physical distance between, and size of, elements on screen. It should be expressed in centimeters or inches. Expressing it in pixels makes it dependent on screen resolution, which can vary dramatically from one screen to another (depending, e.g., whether you have a standard monitor or a HiDPI one).

- There are some grammar mistakes and typos throughout the paper that should get fixed.

Review #3
By Roberto García submitted on 22/Jul/2014
Suggestion:
Minor Revision
Review Comment:

First of all, this paper contributes to a very interesting topic for the Semantic Web: the evaluation of the user experience. In this case, it is the user experience when interacting with ontology visualization components, and particularly the most common ones: indented lists and graphs. Additionally, the evaluation of the user experience involves not just the classical measures of efficiency and effectiveness. The evaluation includes eye tracking and relevant measures of how the user looked at the user interface, especially from a cognitive point of view.

Overall, the paper's motivation and contribution are clearly stated. The evaluation method and the results are then presented, though there are some points where minor revisions are needed to clarify them. Finally, the discussion, conclusions and future work are well sustained and motivate further work.

The only caveat from a general standpoint might be that the evaluation techniques presented and used in the paper should be more clearly organized in the context of what has become a standard in the quality community, a quality framework. In this case, as what is evaluated is the user experience, it should be a Quality in Use framework. For instance, a Quality in Use framework like the one proposed for Semantic Web Exploration Tools in:
González, J.L.; García, R.; Brunetti, J.M.; Gil, R.; Gimeno, J.M.: “Using SWET-QUM to Compare the Quality in Use of Semantic Web Exploration Tools”. Journal of Universal Computer Science, 19(8), 1025-1046, 2013

Detailed Comments
-----------------

Section 2.1 (p.2):
"Notably, the majority of recent advancement in ontology visualization has incorporated or is building upon indented lists and graphs, which makes these two most commonly used visualization techniques relevant and interesting subjects for our evaluation." --> To support this claim it might be interesting to show a table showing some of the main ontology visualization tools together with the kinds of visualizations they provide.

Section 3.2 (p.5):
"74 and 100 classes" --> From Table 1 this seem to be referring to entities instead of classes and the numbers should be 74 and 110 respectively.
"by loading ontologies into Protégé" --> Why hasn’t an specific visualization being developed like in the graph case? It should be easier to implement and make the comparison between both fairer.
"dark-colored nodes illustrate the existence of subclasses whereas light-colored nodes illustrate nonexpandable classes" --> Did users find this approach easy to catch? Have been alternatives been considered? For instance making expandable nodes look like Web links (blue & underline), this is a quite common metaphor for actionable elements…

Section 3.3 (p.6):
"biomedical, biochemistry" --> Has it being taken into account the influence of users background in task understanding? For instance it should be easier for participants with a biology background to perform better in the BioMed scenario.
"This protocol ensured that a participant did not become overly familiar with a particular visualization" --> The impression is that this is not the intention of this protocol for each individual participant, but for the set of all participants in average…

Section 4.1 (p.7):
"we would also expect to see small saccade counts given low fixations" --> Do you refer to low fixation counts?

Section 4.2(p.8):
This section needs thorough rewriting as it is hard to follow. For instance:
"the duration of each saccade can be determined, as the time between two succeeding fixation timestamps" --> Is this time minus the duration of all the involved fixations? Otherwise, if the timestamp is just at the beginning of each fixation, the sum of the times between each pair of consecutive timestamps would be the total user attention time (including also fixations).
"The higher the ratio, the more time the participant spent processing information" --> It would be clearer if all these measures were accompanied by formulas showing how they are calculated. In this particular case the explanation seems to be referring to the reverse case. A formula should clarify…

Section 4.4(p.9):
"Finally, time on task is the length of time it took a participant to complete the task (i.e. after the participant has finished both identification and creation activities)." --> Are help requests from the user measured? Or they are not allowed during task performance?

(p.9):
"4.4. Data Cleaning, Validation & Statistical Tests" --> should be numbered 4.5


Comments

Review #1 Roberto García

1.1. Reviewer Comment: The only caveat from a general standpoint might be that the evaluation techniques presented and used in the paper should be more clearly organized in the context of what has become a standard in the quality community, a quality framework. In this case, as what is evaluated is the user experience, it should be a Quality in Use framework.

Author Comment: Several definitions for usability exist [72]. The definition used in this paper is taken from the ISO standard 9241-11, as ‘‘the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use” [31]. Effectiveness is typically measured by task success, efficiency is measured by time on task, and satisfaction is measured through user feedback. We argue these are indirect measures that tell you what has happened, but not necessarily why something has happened. Thus we use eye tracking to complement our previous study [9] (that focused on the above indirect measures) in order to find out if eye movement data can provide additional cues as to why users may find one visualization more effective/efficient/satisfying to use than the other. We have added further details and clarification in the revised submission on the usability standard in section 1 (third paragraph) and section 2.2 (last paragraph).

1.2. Reviewer Comment: Section 2.1 (p.2): To support this claim it might be interesting to include a table listing some of the main ontology visualization tools together with the kinds of visualizations they provide.

Author Comment: This claim is supported by the 34 visualization tools surveyed, which are listed in footnotes 2, 3 and 4. Further survey details can be found in [3]. Also, in the revised submission we have explicitly discussed the type of visualization technique (whether indented list or graph) that each of the tools discussed in the last paragraph of section 2.1 builds upon.

1.3. Reviewer Comment: Why hasn't a specific visualization been developed, as in the graph case? It should be easier to implement and would make the comparison between both fairer.
Did users find this approach easy to grasp? Have alternatives been considered? For instance, making expandable nodes look like Web links (blue & underlined), which is a quite common metaphor for actionable elements…

Author Comment: There are two considerations for the setup & implementation, which are clarified in section 3.2 (last three paragraphs) in the revised submission.
1) Ideally, it would be nice to have participants who have never used indented lists or graphs take part in our study. However, indented lists have obvious similarities with computer file directories, and it is highly difficult to recruit participants who have never browsed a file directory prior to taking part in our study. To minimize bias, Protégé is sufficient for us to practically achieve what is essentially the closest arrangement to the ideal setup, since none of the participants had used it before and this environment is as novel as it can be for them.
2) With regard to the specific visualization implementations, there can be many variations/ways to implement an indented list or a graph, thus the key to choosing the appropriate visualizations for our study lies in the representativeness of the chosen ones. A discussion on what qualifies an indented list and what characteristics a graph should have is presented in section 2.1 (first paragraph), which motivates why the chosen implementations are sufficient for the purpose of the study. A further discussion is presented in section 3.2 (last paragraph) with additional evidence for why the two types of visualization are sufficient (i.e. since they both support the necessary features identified in [13]).

1.4. Reviewer Comment: Has the influence of users' backgrounds on task understanding been taken into account? For instance, it should be easier for participants with a biology background to perform well in the BioMed scenario.

Author Comment: To minimize bias that a participant's background may have on the study outcome, we have used two knowledge domains in the experiment. This is clarified in section 3.2 (fourth paragraph) in the revised submission. Also, we have a mixture of participants with various backgrounds besides biomedical engineering, as discussed in section 3.3. Furthermore, the BioMed ontologies used in the study (see footnote 10) are about organisms (discussed in section 3.2, third paragraph), and it is likely that most people would have encountered this topic in high school biology.

1.5. Reviewer Comment: Section 4.2 (p.8): It would be clearer if all these measures were accompanied by formulas showing how they are calculated.

Author Comment: We have added equations in the revised submission.
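For illustration, one plausible formulation of the ratio measure described in section 4.2 is sketched below in LaTeX; this is a hedged sketch and not necessarily the exact equation added to the revision (t_j and d_j denote the timestamp and duration of the j-th fixation):

    % Illustrative only: one possible formulation of the fixation-to-saccade
    % ratio from section 4.2; the revised paper contains the authoritative equations.
    \[
      d^{\mathrm{sac}}_{j} = t_{j+1} - \bigl(t_{j} + d^{\mathrm{fix}}_{j}\bigr), \qquad
      \mathit{ratio} = \frac{\sum_{i=1}^{n} d^{\mathrm{fix}}_{i}}{\sum_{j=1}^{n-1} d^{\mathrm{sac}}_{j}}
    \]

Under this formulation, a higher ratio corresponds to relatively more time spent in fixations, i.e. processing information, which is consistent with the sentence quoted by the reviewer.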

1.6. Reviewer Comment: Section 4.4 (p.9): "Finally, time on task is the length of time it took a participant to complete the task (i.e. after the participant has finished both identification and creation activities)." --> Are help requests from the user measured? Or are they not allowed during task performance?

Author Comment: As discussed in section 3.3, no interactions (e.g. questions, discussions) were allowed once the eye tracker had started recording, as they would otherwise interfere with the time-on-task results.

Review #2 Anonymous

2.1. Reviewer Comment: It would be interesting to see additional data, for example EEG or fMRI data, to get a better feel for which regions of the brain are used more during the individual tasks, which the eye tracker can only provide indirectly.

Author Comment: To generate EEG/fMRI data, one will need an EEG system to track brain activities. We used a Tobii 2150 (see footnote 6), which is an eye tracking system that only generates eye movement data, as discussed at the beginning of section 4. Studying brain activities is outside the scope of this study, but can be another avenue to explore in future research. We have included this research opportunity in section 7 (second last paragraph) in the revised submission.

2.2. Reviewer Comment: It would probably be useful to include more visualization techniques than just those two, though.

Author Comment: As discussed in section 2.1, there are many visualization tools that can be classified into many different categories depending on their presentation, interaction, functionalities, dimensions etc. The research impact of a study would be low if it compares techniques that are not being used by many. The decision of selecting which ones to include in our study thus depends on how frequently these visualization techniques are being used by existing tools. As indented lists & graphs are used most often (and typically used together in one tool), they are most relevant for us as a starting point for ontology visualization evaluation. However, we agree that the evaluation of other visualization techniques can further inform and contribute to the body of knowledge in this field, as discussed in section 7.

2.3. Reviewer Comment: The InfoVis community (I would consider the acronym InfoVis as the one typically used instead of InfoViz) does usually test its approaches based on user studies or other means and does not merely list what the additional features are, as indicated by the authors.

Author Comment: Section 2.3 states that one approach - among two others - is to benchmark one tool against a set of others, but we do not claim this is the only methodology used by the community.

2.4. Reviewer Comment: When discussing the results and pointing out significant differences, the p-value should be listed. I assume an ANOVA test was performed? The manuscript lists p and R values but is not very specific as to what test was used exactly. This should be clarified.

Author Comment: A school of thought on p values is that the exact values should not be reported, since these values alone do not permit any direct statement about the direction or size of a difference/relative risk between different groups.* Significance tests of null hypotheses are based on whether the p value is less than a chosen significance level. P-values themselves must not be interpreted as a measure of the magnitude of the difference between groups, and hence are reported in the format p < α.
As discussed in section 4.5, p values are generated using mixed models [16], and r values are generated using the Pearson correlation coefficient test [43].
* Jean-Baptist du Prel, Gerhard Hommel, Bernd Röhrig, Maria Blettner, Confidence Interval or P-Value?: Part 4 of a Series on Evaluation of Scientific Publications. Dtsch Arztebl Int. 2009 May; 106(19): 335–339. doi: 10.3238/arztebl.2009.0335
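As an illustrative sketch of the kind of analysis described above (a linear mixed model for p values and a Pearson correlation test for r values), the Python snippet below uses hypothetical column names (visualization, participant, time_on_task, fixation_count) rather than the paper's actual data or scripts:

    # Illustrative sketch only: the CSV file and column names are hypothetical,
    # not the study's actual dataset or analysis code.
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import pearsonr

    df = pd.read_csv("eye_tracking_measures.csv")  # one row per participant/task

    # Mixed model: visualization type as a fixed effect, participant as a random effect.
    model = smf.mixedlm("time_on_task ~ visualization", data=df, groups=df["participant"])
    result = model.fit()
    p_value = result.pvalues["visualization[T.indented_list]"]  # assumed factor level
    print("significant at alpha = 0.05:", p_value < 0.05)       # reported as p < 0.05

    # Pearson correlation coefficient between two eye-movement measures.
    r, p = pearsonr(df["fixation_count"], df["time_on_task"])
    print(f"r = {r:.2f}")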

2.5. Reviewer Comment: Figures 19 and 20 use a rather unfortunate color scheme. Colors that are more visibly different should be used, preferably a color palette that works for people with color blindness as well.

Author Comment: We have updated these figures in the revised submission.
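For reference, one example of a colour-blind-safe scheme is the Okabe-Ito palette; the matplotlib sketch below is illustrative only and not necessarily the palette used in the revised figures:

    # Okabe-Ito colour-blind-safe palette; hypothetical bar chart for illustration.
    import matplotlib.pyplot as plt

    okabe_ito = ["#E69F00", "#56B4E9", "#009E73", "#0072B2",
                 "#D55E00", "#CC79A7", "#F0E442", "#000000"]

    labels = ["A", "B", "C", "D"]      # hypothetical categories
    values = [12, 7, 9, 15]            # hypothetical values
    plt.bar(labels, values, color=okabe_ito[:len(values)])
    plt.show()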

2.6. Reviewer Comment: Why is figure A-3 used in an appendix? It seems odd to just add it to the back of the manuscript. Instead it should probably be more integrated.

Author Comment: As discussed in section 4.1, the saccade count is simply the fixation count minus one (since saccades occur between successive fixations) for any recording. When statistically significant differences are found in fixation counts between groups (see Figs. 7 & 8), from a mathematical point of view, the same statistically significant differences are expected in saccade counts. Fig. A-3 (now Fig. A-1 in the revised submission) therefore presents what a statistician would consider redundant statistical information, and is placed in the appendix for completeness.

Review #3 Emmanuel Pietriga

3.1. Reviewer Comment: I need to be convinced that these different elements (saccades, fixations, scanpaths, pupil dilation, angles) and their mapping can indeed be treated orthogonally. And if there is no strong evidence that they can be treated orthogonally, this does not nullify the results of this study, in my opinion. But then this limitation has to be acknowledged and discussed, and the claims made have to be toned down, more clearly indicating that the observations made _suggest_ something rather than _show_ or _demonstrate_ it.

Author Comment: We have rephrased our claims as advised in the revised submission. It should be noted that Goldberg et al.’s research in eye tracking and how to interpret eye movement data is the state of the art in this field. These measures and how they should be understood [33-35, 6] are the current leading standards used by the community as well as our paper. To date, there is no evidence to suggest these analyses/interpretations are unreliable. If the reviewer is aware of any eye tracking research that suggests otherwise, please point us to the relevant papers. In this study, we aim to provide a basis for future research in this direction and we hope to inspire methods/analyses that challenge or build on the results/approach discussed in this paper.

3.2. Reviewer Comment: A side comment w.r.t pupil dilation is that as stated in [49], pupil size as seen by the eye tracker depends on the person's gaze angle. Did you apply the calibration method of [49] to compensate for this? As the two visualization techniques lay out data on screen in a very different manner, thus having a significant impact on where users are looking, there is a strong potential for confounding factors here if the distortion is not eliminated prior to analysis.

Author Comment: Calibration is a must in any experiment involving an eye tracker. The calibration process specific to the Tobii 2150 we used is discussed in section 3.3. A recording session only begins after the participant has calibrated her/his eyes.

3.3. Reviewer Comment: I would like to better understand the rationale behind the choice of this particular network visualization technique. Why is this one representative? Is it more relevant to choose a representative technique, or one that tries to optimize some set of criteria (readability, edge crossing minimization, screen real-estate consumption, ...)?

Author Comment: As discussed in section 3.2, the graph visualization uses a force directed layout, which minimizes edge crossing, supports drag-and-drop to improve readability of labels and tailored use of screen space. Please also see comment 1.3 above.

3.4. Reviewer Comment: Please provide more information about the statistical tests involved (ANOVA, post-hoc pairwise comparisons, others?) and report effect size.

Author Comment: We have included effect size in the revised submission. Please also see comment 2.4 above.

3.5. Reviewer Comment: Please also put error bars in Figures 8, 10, 12, 14, 16 and 18. It would also be nice to have charts illustrating success rate.

Author Comment: These figures have been updated in the revised submission.

3.6. Reviewer Comment: "500MB of eye-tracking data" and even "30 million rows of data" does not tell us much. This certainly does not belong to the abstract, and if the authors want to keep this information in the paper (first paragraph of Section 5), I would suggest providing a measure of dataset size that "speaks" more to the average reader, like the sampling frequency and duration of eye tracking sessions. Whether this represents 500MB or 1TB of data is totally irrelevant to me, to be honest.

Author Comment: Additional frame rate information has been added in section 3.1 in the revised submission. As discussed in section 5, recordings vary from 10 minutes to well over an hour. We would also like to argue that the data size speaks to the average reader, since it provides an understanding of how much data is generated per participant/recording, and of the volume of data processed and analyzed in this study.

3.7. Reviewer Comment: how do you segment the continuous stream of saccades and fixations into scanpath sequences?

Author Comment: As discussed at the beginning of section 4, ClearView generates basic data such as the timestamp and duration associated with each fixation. As saccades are the quick eye movements between successive fixations, the difference between two fixations’ timestamps is the saccade duration. Since a scanpath is the complete saccade-fixate-saccade sequence, its duration is determined as the total duration of fixations and saccades, as discussed in section 4.1. Please also see footnote 15 for additional engineering information.
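A minimal sketch of this derivation is shown below, assuming a table of fixation timestamps and durations as exported by the eye tracker; the column names are hypothetical and the authoritative processing code is the one referenced in footnote 15. The sketch subtracts the earlier fixation's duration when computing each saccade, which is the interpretation the reviewer's question about section 4.2 refers to:

    # Illustrative sketch: derive saccade durations and the scanpath duration
    # from a fixation table with columns "timestamp" and "duration" (ms).
    # Column names are hypothetical; see footnote 15 for the study's actual code.
    import pandas as pd

    fix = pd.read_csv("fixations.csv").sort_values("timestamp").reset_index(drop=True)

    # Saccade j is the gap between the end of fixation j and the start of fixation j+1.
    saccade_durations = (fix["timestamp"].shift(-1)
                         - (fix["timestamp"] + fix["duration"])).dropna()

    saccade_count = len(fix) - 1                       # fixation count minus one
    scanpath_duration = fix["duration"].sum() + saccade_durations.sum()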

3.8. Reviewer Comment: "To automatically process this large volume of raw data, we generated and ran a script on them.": This is not telling the reader much... Most data analysis requires running scripts to process the data. What is the purpose of this particular script?

Author Comment: The measures discussed in sections 4.1-4.3 are not automatically generated by the eye tracker/ClearView (only basic information is provided, as discussed at the beginning of section 4), hence the code included in footnote 15 solves the engineering problem this study had to overcome first.

3.9. Reviewer Comment: it makes little sense to use pixels as the unit to discuss areas and distances in the visualization. Clearly, what matters here is the physical distance between, and size of, elements on screen. It should be expressed in centimeters or inches. Expressing it in pixels makes it dependent on screen resolution, which can vary dramatically from one screen to another (depending, e.g., whether you have a standard monitor or a HiDPI one).

Author Comment: Likewise, physical distances expressed in centimeters or inches can also vary dramatically from one screen to the next, e.g. the same visualization on a 21” monitor vs. on a 30” monitor. In the eye tracking community, saccade lengths are typically reported in pixels accompanied by the monitor size & resolution - in our study, we used a 21.3” TFT with 1600*1200 resolution as discussed in section 3.1.
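As a concrete illustration of how the reported monitor specification resolves this, pixel distances on the study's 21.3" TFT at 1600*1200 can be converted to physical distances directly (assuming square pixels); the snippet below is illustrative arithmetic, not part of the paper:

    # Convert pixel distances to millimetres for a 21.3-inch 1600x1200 display
    # (the monitor reported in section 3.1), assuming square pixels.
    import math

    diag_inch = 21.3
    diag_px = math.hypot(1600, 1200)      # = 2000 px on the diagonal
    px_per_inch = diag_px / diag_inch     # ~93.9 ppi
    mm_per_px = 25.4 / px_per_inch        # ~0.27 mm per pixel

    saccade_px = 100                      # example saccade length in pixels
    print(f"{saccade_px} px ~= {saccade_px * mm_per_px:.1f} mm on this screen")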