Watson, more than a Semantic Web search engine

Paper Title: 
Watson, more than a Semantic Web search engine
Mathieu d’Aquin and Enrico Motta
In this tool report, we present an overview of the Watson system, a Semantic Web search engine providing various functionalities not only to find and locate ontologies and semantic data online, but also to explore the content of these semantic documents. Beyond the simple facade of a search engine for the Semantic Web, we show that the availability of such a component brings new possibilities in terms of developing semantic applications that exploit the content of the Semantic Web. Indeed, Watson provides a set of APIs containing high-level functions for finding, exploring and querying semantic data and ontologies that have been published online. Thanks to these APIs, new applications, developed by our group and others, have emerged that connect activities such as ontology construction, matching, sense disambiguation and question answering to the Semantic Web. In addition, we also describe Watson as an unprecedented research platform for the study of the Semantic Web, and of formalised knowledge in general.
Submission type: 
Tool/System Report
Responsible editor: 
Jérôme Euzenat

This is the final version to be published. The reviews below are for the original submission.

Review 1 by Laura Hollink

This is a well written paper about a tool that has a clear impact on semantic web application development and search.

I do have a few remarks that I think should be addressed before the paper can be published.

The main point is that the authors are often vague about what Watson does exactly. They give examples and use phrases like "certain characteristics", while the reader would like to know things exactly and completely. I'll give some examples:

1. On p2 it is said that: "Different sources are used by the crawler of Watson to discover ontologies and semantic data (Google, Swoogle, PingTheSemanticWeb, etc.) Specialized crawlers were designed for these repositories, extracting potential locations by sending queries that are intended to be covered by a large number of ontologies. For example, the keyword search facility provided by Swoogle is exploited with queries containing terms from the top most common words in the English language."

Could you clarify which sources are used exactly? How did you determine the queries "that are intended to be covered by a large number of ontologies"? (Giving one example of how it is done in Swoogle suggests that it is done differently for other sources).

More generally, it would be interesting to add to the paper how much data is indexed by Watson. Also, related to section 6, does Watson also index RDFa? I guess not, but that is not explicit at present.

2. On p3 it says: "By combining these elements of information, Watson can decide whether or not a particular document should be treated as a semantically rich ontology."

I am curious to know how exactly this is done: what are the criteria for a semantically rich ontology?

3. Also on p3: "For each collected semantic document, Watson provides a page that summarizes essential information such as". What essential information does Watson provide exactly?

4. On p4/p5 you write: "This allows applications to define filters and selection criteria ensuring certain characteristics from the elements they exploit." Which characteristics? Are these the same as the 'essential information' mentioned on page 3?

5. Page 7 says: "With many users, … , Watson is now a mature system". I think readers would like an indication of how many users (or requests from systems) approximately. Thousands of users per year? per day?

Another point is that quite a few papers have already been published about Watson. Can you make clear what is new information in the current paper, and what is a summary of existing work? (e.g. citation [12])

Minor things:
p1. you forgot an s in "for application to find and exploit"
p2. "the crawler eliminates any document that cannot be parsed by Jena. In that way, only RDF based documents are considered." Only VALID RDF documents are considered. This is a different thing, and potentially quite a limitation on the number of documents that are considered.
p2. "When communicating with users and applications, these identifiers are transformed into common, non-ambiguous URIs" Do you mean transformed into the original URIs that were used in the original document?
p3. You write: "The keyword search feature of Watson is similar in its use with usual Web or desktop search systems." I'm a non-native English speaker, but I think it should be "similar to".
p3. Figure 2 is not readable.
p3. You write: "One principle applied to the Watson interface is that every URI is clickable." Is clickable the same as dereferenceable here?
p4. you forgot an s in "for application to find, access and exploit…"
p5. "Gracia et al. in [17], exploits" should be "Gracia et al. in [17], exploit", without s, I think.
p7. you consistently use "engines" where it should be "engine"
p8. "they are related to each others" should be "They are related to each other" with capital T and without s.

Review 2 by Philipp Cimiano

This paper provides an overview of the Watson semantic web search engine developed at KMI. It describes the rationale and goals for its development as well as services provided to the community. It also describes some applications built on top of Watson.
It is a well-written and round paper that provides a gentle overview of Watson and its current status.
Overall, Watson as a system has had a significant impact in the community, thus I propose to accept this paper as a systems and tool paper.

Anonymous Review 3

This is a well-written overview of Watson.

My only concern about the article is that it does not really provide or explain new results concerning research around Watson that have not been reported before. This is only a short overview of the system and related applications.

However, I believe that an overview article like this would be useful to many readers of the journal.

Some comments:

1. Intro

Here I would explain in more detail what new insights this paper brings that have not already been reported earlier.

p. 2 Specialized -> . Specialized

p. 2 RDF based -> RDF-based

The print quality of Fig. 2 is too poor.

Figure 2 should also be explained in the text; at present it is only referred to.

p. 6 a the -> choose "a" or "the"

p. 7 It would be better to explain how your system relates to the listed systems when they are first mentioned in the list.
Now the explanations come afterwards, and not all systems are even analysed.

number results -> number of results

p. 8 . they -> . They

Ref [11] Capital letters missing.