Inferring Editor Roles in Ontology Engineering Projects - A Large-Scale Study of WebProtégé Change Logs

Tracking #: 2015-3228

Authors: 
Patrick Kasper
Simon Walk
Matthew Horridge
Denis Helic
Mark A. Musen

Responsible editor: 
Marta Sabou

Submission type: 
Full Paper
Abstract: 
Previous research attempting to analyze and understand the processes involved in ontology creation was often limited in scope and generality due to the lack of available data. In this paper, we shed light on the editing behavior of users creating ontologies by investigating change logs for a large number of ontology engineering projects. To that end, we analyze a corpus of nearly five hundred ontology engineering project change logs, extracted from the Web-based online ontology editing tool WebProtégé. The change logs contain over four million edits made by over one thousand users. In our analysis, we cluster users with similar editing behavior by applying k-means clustering on their editing sequences. We infer and describe five distinct editor roles, revealing that individual users concentrate on specific tasks for extended periods. We further investigate these individual clusters by (i) analyzing their distributions over the projects, and by (ii) tracking editor role changes over the complete lifespan of the individual projects. Our results indicate that the majority of projects have one leading editor role and that there are regular patterns of how users switch between roles during different phases of an ontology engineering project. Moreover, our results reveal valuable insights into the engineering processes and the editor role distribution and evolution of nearly five hundred real-world ontology engineering projects, which can potentially be leveraged for improving existing ontology editing tools by, for example, creating automatically adapting interfaces to support the individual editor roles.
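A minimal sketch of the clustering step described above, assuming each user's editing sequence is summarised as a vector of relative edit-action frequencies (the action names and numbers are illustrative, not taken from the paper):

    # Cluster users by editing behaviour with k-means; every row is one
    # user, every column the fraction of one (hypothetical) edit action.
    import numpy as np
    from sklearn.cluster import KMeans

    # columns: create class, add annotation, add individual, move class
    X = np.array([
        [0.80, 0.10, 0.05, 0.05],   # mostly creates classes
        [0.05, 0.85, 0.05, 0.05],   # mostly annotates
        [0.10, 0.10, 0.75, 0.05],   # mostly adds individuals
        [0.78, 0.12, 0.05, 0.05],
        [0.07, 0.83, 0.05, 0.05],
    ])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # inferred role per user
    print(kmeans.cluster_centers_)  # centroid = prototypical editing profile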
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Daniel Garijo submitted on 24/Nov/2018
Suggestion:
Major Revision
Review Comment:

This paper describes an extensive data analysis of WebProtégé logs in order to identify the main user activities when developing ontologies, and how they vary over time.

The paper is well written, structured and easy to follow. I think the authors tackle an interesting problem, i.e., how to capture common patterns in ontology design to categorize users and facilitate the creation of future ontologies.

Although I think the results of the analysis may be useful for the Semantic Web community, the contribution of this research relates more to the area of user interfaces than to the Semantic Web. In fact, the paper has been submitted as a full research paper, but the research contribution to the Semantic Web community is rather small. I believe that the paper should be turned into a tool report, as all the logs belong to WebProtégé. I list my rationale and further questions below:

- Contribution: The main editor roles detected by the analysis reflect the main activities of users in WebProtégé but, as the authors note in their limitations section, this may be due to the way the interface has been designed. I wonder if the title should specify that these are the roles in the WebProtégé platform, as the editor logs belong only to that system. I also found the lack of detail about the axiomatization of classes, as opposed to just adding classes, annotations, and individuals, a little unexciting. Don't users add this kind of knowledge? At the moment the roles are pretty much what one would expect users to do in an ontology editor...

- Also, it is unclear to me how detecting these roles will actually help modify or adapt other current GUIs (e.g., http://editor.visualdataweb.org/, https://app.gra.fo/login), besides maybe merging operations (rename, change language, etc.), which the current WebProtégé GUI lacks. Current systems support all types of roles. Why would it then be necessary to adapt the system to a particular role?

- There are many ontologies being developed on GitHub. I think a comparison with their commit histories would help generalize the study, but I understand this would imply significant work.

- Is the data used for the analysis available for inspection? I think it could be useful in case other researchers wanted to find patterns at a finer level of granularity.

- I found two aspects of the analysis very curious: 1) Editors who add classes very rarely annotate them. Is this behavior consistent? How do users then provide definitions for their classes? 2) It is usually good practice to keep individuals separate from the ontology. What type of individuals do users add to their projects? Examples? Do they actually populate ontologies manually?

- Aren't there any actions related to the creation of object properties and data properties? Or is this included in the Class editor role?

- The related work (last p.) does not clarify whether the method used in this work differs from that of other studies. Is the only difference the data preparation and the scale of the study?

- Table 1 is never explained in detail in the text.

- I don't understand the first paragraph on page 6 very well. Why should every state be reachable from every state? And I do not understand why a "teleportation factor" with value 0.15 has been added there.
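For reference, the usual construction behind such a factor (as in PageRank, where a damping factor of 0.85 leaves alpha = 0.15 for teleportation) mixes the empirical transition matrix with a uniform jump, making every state reachable from every state and hence guaranteeing a unique stationary distribution. A minimal sketch, assuming, without confirmation from the paper, that this is the construction used:

    import numpy as np

    # Mix the empirical row-stochastic transition matrix P with a
    # uniform jump so that the resulting chain is irreducible.
    def smooth_transitions(P, alpha=0.15):
        n = P.shape[0]
        return (1.0 - alpha) * P + alpha * np.ones((n, n)) / n

    P = np.array([[0.0, 1.0],
                  [0.0, 1.0]])     # state 0 is unreachable from state 1
    print(smooth_transitions(P))   # every entry > 0: fully connected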

- What is the definition of a "weak cluster"?

Review #2
By Elena Simperl submitted on 30/Nov/2018
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper describes an empirical analysis of user roles in ontology engineering projects on the web. Collaborative knowledge engineering at a large scale has been gaining traction in recent years, as shown by thriving projects such as Wikidata, which makes this topic very important and timely.

The breadth of the analysis, which covers a large number of projects from the WebProtégé platform, makes this work useful and relevant in its field. Furthermore, it is well-written, pleasant to read, and clearly structured.

Nonetheless, a few points still need to be addressed; therefore, we advise major revisions. We truly believe the work could have substantial impact for the community and that the proposed changes would make it better.

In particular, the paper presents very promising results but, after raising some interesting issues that we would have expected to find addressed therein, veers off into an analysis of the transitions between roles. Furthermore, although the paper is intended to address the gap in the literature on large-scale studies of collaborative ontology engineering projects, the discussion falls short of explaining how its findings compare to prior work, e.g. whether they contradict or support it. More detailed comments are listed per section below.

We look forward to reading the revisions and responses of the authors.

Section: Related work
p. 3, Detecting user roles paragraph: This paragraph covers three aspects of your approach: the type of features used, the clustering algorithm, and the scale of the study. At the moment, these three aspects are covered together by listing a number of studies investigating user roles in ontology engineering projects. The rationale behind your choices is not clear.

In order to clarify this point, we suggest that you: 1. separate the three aspects mentioned above (or at least make them emerge clearly from the text); 2. explain why your choices (especially with respect to the clustering algorithm and the features) were suitable for your study; 3. articulate the added value resulting from carrying out an analysis at a large scale.

Section: Materials and methods
The data collection and preprocessing are adequately described, and the methods adopted are sound and sufficiently grounded in previous research. Some improvements could be made, though.

p.3, 18-26: a screenshot of the WebProtégé interface would give those who are not familiar with it a visual reference. Furthermore, you could add a graphical representation of the relationships between projects, ontologies, and metadata.

p.3, 42: ‘high-level action’: it would be good to define what this term refers to.

p. 4, Figure 1: this could be larger, spanning over two columns.

p.4, line 29: ‘we remove all projects with fewer than 250 total log entries’: why 250? How did you determine this threshold?

p.4, lines 33-36: ‘we define a lower threshold of two edit actions and remove all users that contributed only a single change to their project.’ This is a necessary step, considering the approach you follow, and casual users are often left out in similar studies. Nonetheless, some authors [1] include editors with a small number of contributions in order to take ‘marginal profiles’, such as occasional users and vandals, into account. It would be interesting to quantify these marginal users (compared to the total number of users) and provide details about the actions they perform.
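A quick sketch of such a quantification, using pandas; the file and column names are hypothetical:

    import pandas as pd

    log = pd.read_csv("changelog.csv")             # one row per edit action
    edits_per_user = log.groupby("user_id").size()
    marginal = edits_per_user[edits_per_user < 2]  # the filtered-out users

    print(f"{len(marginal)} of {len(edits_per_user)} users "
          f"({len(marginal) / len(edits_per_user):.1%}) made a single edit")
    # which actions do these marginal users perform?
    print(log[log.user_id.isin(marginal.index)]["action"].value_counts())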

Section: Results
p. 7, Figure 2: we understand the need to plot several variables, but 3D charts are hard to read. We suggest exploring the possibility of visualising the differences between plots using a different approach.

p.7, lines 28 onwards: please add further details about how you have defined the edit actions for each principal component, to help someone reproduce what you have done. Moreover, the clusters would be more rigorously defined by applying a significance test to the edit actions of their members, e.g. Tukey's test.
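A sketch of the suggested test, comparing the frequency of one edit action across three clusters with Tukey's HSD (the data here is randomly generated for illustration):

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    freq = np.concatenate([rng.normal(0.8, 0.05, 30),   # cluster A
                           rng.normal(0.1, 0.05, 30),   # cluster B
                           rng.normal(0.1, 0.05, 30)])  # cluster C
    cluster = ["A"] * 30 + ["B"] * 30 + ["C"] * 30

    # pairwise mean differences with family-wise error control
    print(pairwise_tukeyhsd(freq, cluster))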

p.9, lines 29-31: ‘Our results indicate that in 62% of all projects, there exists a single editor role which all users of the project assume.’ This is a very interesting insight and we would have expected some further analysis to understand its implications. A possible way to expand the analysis would be to look at what being a ‘single-role’ project implies in terms of, e.g., structural features of the resulting ontology, such as average or maximum depth or inheritance richness.
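One of the suggested structural features, the maximum depth of the class hierarchy, could be computed along these lines (a sketch with rdflib and networkx; the file name is hypothetical, the ontology is assumed to be RDF/XML, and a cyclic subClassOf graph would need extra handling):

    import networkx as nx
    from rdflib import Graph, RDFS

    g = Graph().parse("ontology.owl", format="xml")
    # one directed edge subclass -> superclass per rdfs:subClassOf triple
    h = nx.DiGraph((s, o) for s, _, o in g.triples((None, RDFS.subClassOf, None)))
    print(nx.dag_longest_path_length(h))   # maximum hierarchy depth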

p.9, Figure 5: The figure is hard to read, partly because of the overlapping cluster labels. One solution could be to make it larger, spanning over two columns, and to place the labels externally.

Section: Discussion and conclusion
Several works already exist around user roles in collaborative ontology development projects, some of which you cite in your work. You motivate your study with the lack of comprehensive insights about how the community at large works on ontology engineering projects in the wild. How do your insights connect to (and differ from) previous findings in the field? Without this framing in previous literature it is difficult for the reader to fully appreciate the contribution of your work.

The limitations section could be expanded as well: e.g., what does using k-means imply, compared to other clustering algorithms?

[1] Arazy, O., Daxenberger, J., Lifshitz-Assaf, H., Nov, O., & Gurevych, I. (2016). Turbulent stability of emergent roles: The dualistic nature of self-organizing knowledge coproduction. Information Systems Research, 27(4), 792-812.

Review #3
By Oscar Corcho submitted on 25/Jan/2019
Suggestion:
Minor Revision
Review Comment:

This paper describes the method used by the authors to get a better understanding of the different profiles or roles that people who participate in an ontology engineering project may have. The method is sound and provides very interesting insight into this envisaged characterisation, which may be used to improve ontology engineering tools in the future, offering features better suited to the role a user is playing, or allowing a clearer division of the roles and tasks that different profiles carry out in an ontology engineering project.

The work presented here can also be considered a novel piece of work, even though there are previous approaches that have aimed at characterising different types of roles in ontology engineering (and which are included as references in the paper). This work positions itself well within the state of the art and provides interesting conclusions.

That said, I have several small concerns about the work presented in the paper, which should be easy to address in a revised version:
- First, I do not agree with the title that the authors have given the paper. The title is too general (and, of course, there is a subtitle, but this may not be enough and may be missing from a camera-ready version of the paper). I think that the title should explicitly reflect the main bias that the authors are introducing in their analysis, namely the fact that they rely only on the change logs from WebProtégé. As a result, some of the editing actions may be biased by the user interface of this online tool and not be applicable to other kinds of ontology engineering projects (neither those that use other online tools nor, especially, those that use offline tools). I would prefer a more precise title: "Inferring Editor Roles in WebProtégé-based Ontology Engineering Projects", or something similar that clearly reflects this bias. This concern seems to be addressed explicitly only at the end of the paper, and I think that it should also be made very explicit in the abstract and in the introduction.
- Second, there is yet another bias that may be relevant in this context. The authors select a good number of ontologies using some basic restrictions. However, this does not weed out projects that may be large but still educational. Indeed, I would like the list of ontologies or ontology URIs to be made available somewhere, with a link in the paper, so that this list can be used by other researchers (I will comment on this later). Why not filter this list a bit further by taking into account only those ontologies that are published online at the corresponding URI, and/or registered in registries like LOV or BioPortal, to name a few? The results would probably be similar, and the method is applicable anyway, but it would be good to have confirmation that this bias is not introducing any problem into the classification. I understand that this may require the authors to rerun all the experiments, but as long as this is well automated it does not seem to be a major task.
- Third, I would really like to see all the resources associated with this work made available in software and/or data repositories, so that others can reproduce the experiments, improve them, etc. I know that this is not an explicit requirement of this journal, but it is recommended, and it would make this work much more visible and useful for others (for instance, people who may try to make some of the changes to user interfaces that are hinted at).

These are the three main concerns that arise from reviewing this paper. Next, a few more detailed comments that may be taken as recommendations for improving the text:
- The first paragraph of the introduction is too limited. Why don't you refer to other ontology tools or approaches that are focused on the Web? There are available editors (WebVOWL, gra.fo) and approaches that also exploit collaboration by using other tools/platforms (OnToology, VoCol), which may be worth mentioning and referring to in order to show how ontology engineering is moving back towards a more collaborative effort that exploits online facilities.
- Would it make sense, and/or would it make a difference given the method used for the characterisation, to consider more abstract changes that may be derived from the change logs, such as the editing/evolution actions proposed in ontology change ontologies (e.g., those from Palma, Haase, et al.)? This is just an idea, which may not need to be implemented, since my intuition is that the use of Markov chains as applied here may actually ensure that this is not a major problem. Indeed, this seems to me to be very related to the Full set that you refer to.
- In the related work you comment on the change logs in software repositories, and the commonalities and differences with this approach. Wouldn't it make sense to mention that in software repositories the change logs also contain commit messages, which are commonly exploited? Would it make sense to use the messages that users may provide in WebProtégé?
- Tools and visualisations like the one in Figure 5 may be very relevant if made available online, so that it is easy to interact with them. Does it make sense to have more steps in the Sankey diagram?
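An interactive, multi-step version could be sketched with plotly, for instance; the role labels and flow values below are purely illustrative:

    import plotly.graph_objects as go

    labels = ["Class editor (early)", "Annotator (early)",
              "Class editor (mid)",   "Annotator (mid)",
              "Class editor (late)",  "Annotator (late)"]
    fig = go.Figure(go.Sankey(
        node=dict(label=labels),
        link=dict(source=[0, 0, 1, 2, 3, 3],    # role at phase t
                  target=[2, 3, 3, 4, 4, 5],    # role at phase t+1
                  value=[8, 2, 5, 6, 3, 4]),    # number of users moving
    ))
    fig.write_html("role_transitions.html")     # shareable interactive file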

In summary, this paper provides a very interesting contribution to the state of the art, which may help us better understand how ontology engineering is done nowadays, especially if extended to other types of approaches (besides WebProtégé); although, in my opinion, the current work would already deserve publication if the previous recommendations are taken into account.