A decade of Semantic Web research through the lenses of a mixed methods approach

Tracking #: 1850-3063

Authors: 
Sabrina Kirrane
Marta Sabou
Javier D. Fernandez
Francesco Osborne
Cécile Robin
Paul Buitelaar
Enrico Motta
Axel Polleres

Responsible editor: 
Christoph Schlieder

Submission type: 
Full Paper
Abstract: 
The identification of research topics and trends is an important scientometric activity. In the Semantic Web area, initially topic and trend detection was primarily performed through qualitative, top-down style approaches, that rely on expert knowledge. More recently, data-driven, bottom-up approaches have been proposed which can offer a quantitative analysis of the research field’s evolution. In this paper, we aim to provide a broader and more complete picture of Semantic Web topics and trends by adopting a mixed methods methodology, which allows a combined use of both qualitative and quantitative approaches. Concretely, we build on a qualitative analysis of the main seminal papers, which have adopted a top-down approach, and on quantitative results derived with three bottom-up data-driven approaches (Rexplore, Saffron, PoolParty) on a corpus of Semantic Web papers published in the last decade. In this process, we both use the latter for “fact-checking” on the former and also derive key findings in relation to the strengths and weaknesses of top-down and bottom-up approaches to research topic identification. Overall, we provide a reflectional study on the past decade of Semantic Web research, however the findings and the methodology are relevant not only for our community but beyond the area of the Semantic Web to other research fields as well.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Yingjie Hu submitted on 21/Jun/2018
Suggestion:
Major Revision
Review Comment:

This paper presents a study using mixed methods for investigating the topics and trends in Semantic Web research. The authors identified top-down research topics from four seminal papers and used three data analysis tools to extract bottom-up research topics from Semantic Web research papers. The top-down and bottom-up topics are then compared and the results are discussed.

Pros:
This paper presents an original and interesting study on the main topics in the field of Semantic Web using mixed methods. The insights summarized and discovered from the seminal papers and data-driven analysis can provide useful references for scholars in this field. This paper is also well-written.

Cons:
There are two major issues with this paper. First, the authors didn't discuss the impact of parameter selection on the data-driven analysis results. Do the three tools require some input parameters, such as the number of topics? What parameters were selected for your experiments, and why did you select these values? It is also possible that the tools do not require any parameters from the users. In either case, the authors may need to provide some explanations to enhance the reproducibility of this work. Second, this paper overall is presented in a qualitative manner although it took a mixed methods approach. Is there any way that the authors can, e.g., quantitatively compare the rankings of the topics output by the three tools (e.g., using Spearman's correlation coefficient), and explain why the rankings differ?
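
To illustrate the kind of quantitative comparison meant here, the following is a minimal sketch of Spearman's rank correlation between the topic rankings of two tools. It is not taken from the paper: the function name, the topic labels, and the two example rankings are hypothetical, and the sketch assumes rankings without ties, restricted to the topics that both tools report.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings (lists of topics, best first).

    Only topics present in both rankings are compared; each ranking is
    re-ranked over that shared subset before the coefficient is computed.
    """
    set_b = set(rank_b)
    shared = [t for t in rank_a if t in set_b]
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared topics")
    pos_a = {t: i for i, t in enumerate(shared)}
    pos_b = {t: i for i, t in enumerate(t for t in rank_b if t in pos_a)}
    # Sum of squared rank differences over the shared topics.
    d2 = sum((pos_a[t] - pos_b[t]) ** 2 for t in shared)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical rankings, for illustration only (not results from the paper).
tool_a = ["linked data", "ontology", "sparql", "web service"]
tool_b = ["ontology", "linked data", "sparql", "web service", "reasoning"]

print(round(spearman_rho(tool_a, tool_b), 3))  # prints 0.8
```

A value close to 1 would indicate that two tools order the shared topics similarly, while a value near 0 or below would flag the pairs of tools whose disagreement most needs an explanation.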

Some more detailed comments are listed below:

- Abstract: "Overall, we provide a reflectional study on the past decade of Semantic Web research, however the findings and the..." should be put into two sentences: "...past decade of Semantic Web research. However the findings and the..."

- Page 3: "Another common solution is the adoption of probabilistic topic models, such as Latent Dirichlet Analysis (LDA)". LDA stands for "Latent Dirichlet Allocation", not "Latent Dirichlet Analysis".

- PoolParty seems to be a commercial service. Did you purchase their service for this research, or do they have a free academic version? Some explanation is necessary here.

- Figure 10: what do the different colors of the nodes represent? If you use a metric (e.g., degree of centrality) to define the colors, you will need to add a legend here.

- Page 19: "Among the topics experiencing strong variations through time, Web Service is a declining one. After experiencing a peak of use in 2008 with a 40% distribution in the documents, it then dropped to less than 20% in 2015." It is important to note that fewer explicit mentions do not necessarily mean that a topic is less popular. In the early days of the Web, people often said "World Wide Web", but nowadays they just say "Web". A one-sentence explanation should be added here.

- Paper length: overall, this paper is too long. The authors may consider shortening some discussions.

Review #2
Anonymous submitted on 03/Jul/2018
Suggestion:
Major Revision
Review Comment:

The paper presents empirical work on the development of the Semantic Web field in terms of the coverage and importance of topics. In contrast to previous work, the authors attempt to combine a meta-survey with the application of different automatic topic detection methods.

In principle this approach is interesting as it promises to test the claims made by the 'experts' about the development of the field. Being an empirical approach, it has to withstand a number of questions:
(1) Are the hypotheses novel and important?
(2) Is the methodology well founded?
(3) Does the data support the claims?

In order to be accepted, these questions have to be answered in a positive way.

(1) Question one already reveals one of the major weaknesses of the approach: there are no hypotheses. The study remains almost completely descriptive and does not really aim at drawing substantial conclusions about either the state of the field or the validity of the previous surveys. I think the paper would be much stronger if the authors formulated a number of hypotheses and tested them.

(2) The overall idea of the proposed methodology is interesting and promising. Contrasting expert opinions with statistical results is very interesting. A well-founded approach, however, requires that these two ways of analyzing the data are as independent as possible. In the way the study is currently designed, I am not so sure that this is always the case. In particular, the results of the automatic analyses, at least for some approaches, strongly depend on manual input by the authors. This means that there could easily be a bias in the choices made. Maybe it would be cleaner to either resort to completely automatic methods or first consolidate a shared taxonomy that is the basis for all analyses, manual and automatic.

(3) Getting the data right is clearly a critical step in this study. I acknowledge that this is a difficult problem and that there are clearly different ways to approach it, but I am still not really happy with the choices made by the authors, both with respect to the selection of survey papers to compare and the document corpus to analyse. With respect to the surveys, the authors mix two types of papers. The first type are vision papers that are not based on factual information but rather aim at convincing people to follow a certain path. The second type are actual surveys that describe the research field a posteriori. Mixing these two types is very questionable. The problems become apparent when looking at the vision paper from the database community, which names completely different topics than all the other papers. The selection of the corpus is also not unproblematic in my opinion. The authors use papers from ISWC and ESWC, which is somewhat reasonable as these two conferences can be seen as the core events of the community. I don't really understand why the 'semantics' conference has been included. I don't think it has a status that is anywhere near ISWC and ESWC. Further, it is unclear why other sources have not been included (e.g. the Journal of Web Semantics or the Semantic Web Track at WWW). I miss clear selection criteria. This makes the selection look rather arbitrary.

In summary, I think the idea of the paper is very interesting and worthwhile, but there are a number of weaknesses (see above) that should be addressed by the authors before publication.