TheSoz: A SKOS Representation of the Thesaurus for the Social Sciences

Benjamin Zapilko, Johann Schaible, Philipp Mayr, Brigitte Mathiak
The Thesaurus for the Social Sciences (TheSoz) is a Linked Dataset in SKOS format, which serves as a crucial instrument for information retrieval based on, e.g., document indexing or search term recommendation. Thesauri and similar controlled vocabularies build a bridge to other datasets from the Linked Open Data cloud, even between different domains. The information and knowledge exposed by such links can be processed by Semantic Web applications. In this article, the conversion of TheSoz to SKOS is described, including the analysis of the original dataset and its structure, the mapping to adequate SKOS classes and properties, and the technical conversion. Furthermore, mappings to other datasets and applications of TheSoz are presented. Finally, limitations and modeling issues encountered during the creation process are discussed.
Dataset Description

Revised paper after an accept pending minor revisions, now accepted for publication. The original reviews are beneath the second round reviews.

Solicited review by Ivan Herman:

This is just to complete my earlier review: the answers and changes in the manuscript satisfy my earlier comments and questions.

Solicited review by Christophe Gueret:

In this revised submission the authors addressed most of the comments that were made about their initial submission. I think the paper is now clear and complete enough to be published.

Just two minor things to consider, if possible:
* Some sentences could be improved. A final proofread by a native English speaker would benefit the manuscript.
* It would be preferable, IMHO, to define the extensions of SKOS in another domain than "thesoz". If other thesaurus publishers would like to use them, they may prefer not to depend on TheSoz.

Solicited review by Danh Le Phuoc:

The revised version and authors' response addressed most of my concerns. The paper could be published as it is.

First round reviews:

Solicited review by Ivan Herman:

Similar thesauri (like the STW you refer to, or the LoC datasets) combine a human-readable interface with the SKOS terms via RDFa. This is a great way of using the data; indeed, if an article author would like to use the terminology to annotate their article, such an interface makes it easy. Do you plan to provide a similar interface? The DBpedia-like HTML interface is not very friendly to laypersons...

That being said, and looking at the Web interface, I realized that you also publish the data (at least for a specific term) in Turtle. I applaud this, and it may be good to mention this in the paper. At first reading one gets the impression that RDF/XML is the only serialization available... Or is it so that the full dataset can be downloaded in RDF/XML only?

It is not clear what the curation process of the dataset is. You do say that there is a separate data management system at GESIS, but does that system maintain the SKOS version directly? Or is the SKOS version a regular dump of the dataset? If yes, how up-to-date is the SKOS thesaurus?

Tiny bit: on page 3, lower part of the left column it says "descriptors has are represented as "skos:Concept", but...". I would guess the 'has' is to be removed...

Solicited review by Christophe Gueret:

The data set being presented is a thesaurus for social sciences.

* Quality of the dataset
The end point works and the resources can be dereferenced as announced. From a modelling perspective, the choices are sound and well motivated in the paper. I was only surprised by two aspects:
1) It is said in 2.2 that DC, OWL and CC elements have been used for provenance purposes. One could wonder why OPM or PROV-O was not used instead but, in fact, I could not find any resource using any of these terms. It would be good to provide a reference to a resource that has provenance tracking and to explain why the others lack such information.
2) The description of "" contains some loops with "thesoz:hasTranslation". It seems that predicate is used for both the direct and inverse property. Is it the intended behaviour or a modelling mistake?
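
The loop described in point 2 can be checked mechanically. The following is a minimal sketch, assuming the dataset has already been loaded as plain (subject, predicate, object) string tuples. The `ex:` resource names and the helper `find_symmetric_pairs` are hypothetical and only illustrate the idea; they are not taken from the TheSoz data.

```python
def find_symmetric_pairs(triples, predicate):
    """Return sorted (a, b) pairs linked by `predicate` in both directions."""
    # Collect every (subject, object) pair that uses the predicate.
    forward = {(s, o) for s, p, o in triples if p == predicate}
    pairs = set()
    for s, o in forward:
        # A loop exists when the reversed pair is also asserted.
        if s != o and (o, s) in forward:
            pairs.add(tuple(sorted((s, o))))
    return sorted(pairs)


# Hypothetical sample data; the resource names are placeholders.
triples = [
    ("ex:term/1-en", "thesoz:hasTranslation", "ex:term/1-de"),
    ("ex:term/1-de", "thesoz:hasTranslation", "ex:term/1-en"),
    ("ex:term/2-en", "thesoz:hasTranslation", "ex:term/2-de"),
]

symmetric = find_symmetric_pairs(triples, "thesoz:hasTranslation")
print(symmetric)
```

If every pair that uses the predicate shows up in the result, the property is effectively being used as its own inverse, which is the situation the question above asks the authors to confirm or correct.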

* Usefulness (or potential usefulness) of the dataset
TheSoz is argued to be widely used, and a generic thesaurus about the social sciences surely has great potential. I would have appreciated some concrete examples in the paper though, or at least some more indication of the size of the user base for TheSoz.
The presence of the links is also motivated merely by the fact that they are required in order to list a data set in the LOD cloud. Concrete examples of how these links could be used, and of why the ones created are of particular importance, would give a clearer indication of the motivation behind this (time-consuming and typically non-trivial) linking work.

* Clarity and completeness of the descriptions
The paper is clear and well motivated. There is some writing that could be improved for clarity, like Section 4 which is the hardest to read, but nothing really important. Two sentences to fix: "in the TheSoz descriptors has are" (page 3) and "there stated as" (page 5).
Apart from the textual considerations, the description of the data set would be improved with:
- some indication of the number of users (as mentioned earlier);
- the frequency of updates of the data set;
- a summary of the different themes covered by the thesaurus, so that readers can get an idea of what the data set actually contains.
Finally, it would be interesting to know what differs between what is described and the (failed?) attempt [17] that is mentioned in the introduction.

Typos in the bibliography:
[2] "Assem" -> "van Assem"
[16] Weird sequence of characters between "Silk" and "A link", there should be a ":"

Solicited review by Danh Le Phuoc:

This paper presents the Linked Dataset TheSoz, a thesaurus for the Social Sciences represented in SKOS. I do not find the following aspects of the dataset convincing:

-Modeling patterns and proposed classes and properties are over-simplified. The dataset only uses trivial sets of existing vocabularies such as owl:version, rdfs:label, etc.

-The dataset only contains 8000 descriptors, but the paper does not reveal how many triples/links the dataset has. There are only around 10,000 links to external datasets.

-The paper claims that TheSoz is used for indexing data in a portal, but the technical details are vague and descriptive. It is not clear how other systems/datasets can benefit from this dataset.
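
On the second point above (the missing triple and link counts), such figures could be produced from a dump along the following lines. This is a rough sketch, assuming the triples are available as (subject, predicate, object) string tuples and that "external link" means a SKOS mapping property whose object lies outside the dataset's own namespace. The base namespace `http://example.org/thesoz/` and the function `summarize` are placeholders, not the actual TheSoz namespace or tooling.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# The five SKOS mapping properties used for cross-vocabulary links.
MAPPING_PROPERTIES = {
    SKOS + name
    for name in ("exactMatch", "closeMatch", "broadMatch",
                 "narrowMatch", "relatedMatch")
}


def summarize(triples, base):
    """Count all triples and the mapping links that leave the `base` namespace."""
    total = 0
    external = 0
    for _subject, predicate, obj in triples:
        total += 1
        if predicate in MAPPING_PROPERTIES and not obj.startswith(base):
            external += 1
    return {"triples": total, "external_links": external}


# Hypothetical sample; the base namespace and concept IDs are placeholders.
BASE = "http://example.org/thesoz/"
sample = [
    (BASE + "concept/1", SKOS + "prefLabel", "Family"),
    (BASE + "concept/1", SKOS + "broader", BASE + "concept/2"),
    (BASE + "concept/1", SKOS + "exactMatch",
     "http://dbpedia.org/resource/Family"),
]

stats = summarize(sample, BASE)
print(stats)
```

Reporting both numbers side by side would let readers judge the ratio of external links to overall dataset size.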