Linked Open Vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web

Tracking #: 974-2185

Authors: 
Pierre-Yves Vandenbussche
Ghislain Atemezing
Maria Poveda
Bernard Vatant

Responsible editor: 
Tania Tudorache

Submission type: 
Tool/System Report
Abstract: 
One of the major barriers to the deployment of Linked Data is the difficulty that data publishers have in determining which vocabularies to use to describe the semantics of data. This system report describes the Linked Open Vocabularies (LOV), a high quality catalogue of reusable vocabularies for the description of data on the Web. The LOV initiative gathers and makes visible indicators that have not previously been harvested, such as interconnection between vocabularies, version history, maintenance policy, along with past and current referent (individual or organization). The LOV goes beyond existing Semantic Web search engines and takes into consideration the value's property type, matched with a query, to improve terms scoring. By providing an extensive range of data access methods (SPARQL endpoint, API, data dump or UI), we try to facilitate the reuse of well-documented vocabularies in the linked data ecosystem. We conclude that the adoption in many applications and methods of the LOV shows the benefits of such a set of vocabularies and related features to aid the design and publication of data on the Web.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Robert Danitz submitted on 04/Mar/2015
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

The paper describes a curated catalogue for registering vocabularies, which assesses vocabulary metadata and the quality of vocabularies in terms of interlinking with other vocabularies.

An overview of the system architecture of LOV is given, followed by a description of how LOV can be incorporated into an ontology development methodology. An evaluation is given of how the system is embedded into the Linked Open Data ecosystem, and the paper concludes with a discussion comparing LOV to other vocabulary catalogues and with proposed future work.

The paper is well-structured, readable and leaves an overall well-rounded impression.

# Quality, Importance, and Impact:
Chapter 4 gives evidence on the extent of the data sets and on usage figures, and indicates the benefits for other LOD tools. It can be deduced that the tool has a positive impact on the quality of the publication of Linked Data.

It is appreciated that the authors' tool is openly accessible, that the data is published under an appropriate open license, and that the authors compiled and published their findings and experience on the process of creating vocabularies.

# Clarity, Illustration, and Readability:
The abstract states that maintenance policies are gathered per vocabulary. While this would be useful, no evidence of it could be found in either the paper or the tool.

A SPARQL endpoint is advertised, but there seems to be only an endpoint by means of the web interface. One can find out the actual endpoint, but it is advertised neither in the paper nor on the corresponding website. This hinders the use of LOV by software agents.
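
To illustrate why a documented endpoint address matters for software agents, the following minimal sketch builds a standard SPARQL Protocol GET request in Python. The endpoint URL is a placeholder (an assumption, not the actual LOV address), and the query is only an illustrative example that assumes vocabularies are typed as voaf:Vocabulary.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): NOT the real LOV address,
# which is exactly what the paper and website should state.
ENDPOINT = "https://example.org/lov/sparql"

# Illustrative query (assumption): list a few resources typed voaf:Vocabulary.
QUERY = """
PREFIX voaf: <http://purl.org/vocommons/voaf#>
SELECT ?vocab WHERE {
  ?vocab a voaf:Vocabulary .
} LIMIT 5
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    # The SPARQL Protocol passes the query text in the `query` parameter.
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

An agent would pass such a request to `urllib.request.urlopen` once the real endpoint address is known; the sketch deliberately stops before any network call.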

Due to this inconsistency with older versions of LOV, neither Listings 1 and 2 nor footnotes 16 and 17 (p7) work: the listings do not run properly and the footnote links produce an HTTP 404, respectively. They should be updated.

The factors for computing a match score are stated by referring to “label properties”. The paper notes that certain vocabularies result in a disparate evaluation, but does not substantiate this.

# Minor remarks:
- p3. The use of Inlinks/Outlinks is inconsistent (cf. Figure 2) with the use of Incoming/Outgoing link in the web interface. Is it mixed up?
- p5. RESTful instead of Restful
- p5. missing full stop in last paragraph and two following “For instance, …”
- Listing 1: content sticks out of box
- p7. Both goo.gl links in the footnote yield a 404 (see above).
- Figure 9 is hardly readable, and not comprehensible without further description.
- p11. typo in last sentence

Review #2
By Christian Bizer submitted on 10/Mar/2015
Suggestion:
Minor Revision
Review Comment:

The paper nicely describes the scope, goals and current development state of the Linked Open Vocabularies initiative. It clearly motivates the need for a curated vocabulary catalog and explains the design decisions taken concerning the semi-automatic cataloging workflow, which make a lot of sense. The related work section nicely compares LOV to the more automated approaches of Swoogle, Watson, and Falcons, again highlighting the benefits for the community in having a rich curated catalog in addition to the shallow automatically generated catalogs. For the practitioner, the paper gives a good overview of the wide range of different access modes that LOV offers and which should satisfy all needs of data publishers and vocabulary/ontology engineers.

Thus, I recommend accepting the paper as a Tools and Systems paper, provided that the following minor changes are included in order to further improve the paper:

1. One of the key features of LOV is to make the relationships between vocabularies explicit using voaf: terms such as extends, specializes, hasEquivalencesWith … In order for the reader to better understand the meaning of these terms, it would be good to explicitly state in the paper which terms from other vocabularies are considered for generating these relationships. For example, are skos:exactMatch and skos:closeMatch considered for generating the hasEquivalencesWith relationship, or is it generated from some other relationship?

2. As links between vocabularies are central for LOV, it would be great to add a table with statistics about the number of vocabulary links that are currently cataloged, so that the reader can relate these numbers to the overall number of vocabularies.

3. Please describe in the paper how the number of datasets that use a specific vocabulary is determined and maintained (checking the website, LOV seems to rely on LODStats, but it is unclear how the numbers are calculated. For instance, LOV states that void is used by 77 datasets while LODStats states 47 and an alternative namespace).

4. In the related work section, more dataset-centric related efforts such as LODStats or the Mannheim Linked Data catalog should also be mentioned, as they also provide vocabulary usage statistics.

5. Please also compare your usage statistics to the ones presented in the ISWC paper "Adoption of the Linked Data Best Practices in Different Topical Domains" http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/ as the numbers reported in this paper seem to be higher (for instance 137 occurrences of void in the crawled data, and 160 occurrences in the whole catalog including manually generated dataset descriptions, see http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?sort=sco...).

6. Please also state whether the usage statistics aim at covering the usage of vocabularies in the RDFa and Microdata context. If yes, the usage number for schema.org should be 4 orders of magnitude larger (see Meusel, et al: The WebDataCommons Microdata, RDFa and Microformat Dataset Series). Sorry for being so picky about the usage statistics, but I think that the adoption of a vocabulary is one of the core indicators for an ontology engineer deciding whether to reuse the vocabulary.

7. According to the Linked Data best practices, it is enough for each vocabulary term to dereference to its definition; it is not strictly required to provide a single web document that defines the whole vocabulary (for instance, DBpedia only provided single dereferenceable terms for a long time and just recently started to provide a single document summarizing the definition of the DBpedia ontology, but not the DBpedia terms in the properties namespace). It would be interesting to know if LOV plans to support such “decentralized” Linked Data vocabularies in the future.

8. As part of the future work section, it would be interesting to know the opinion of the LOV effort about extending the catalog with additional links between the vocabularies that are, for instance, generated using transitive closure or schema matching techniques. Such information would clearly be relevant to the LOV users, but its quality will likely be lower than the quality of the current explicitly set vocabulary links. Would the LOV community be open to including such data in the catalog or not?

Review #3
By Irene Celino submitted on 29/Mar/2015
Suggestion:
Major Revision
Review Comment:

Generally speaking, I'm very much in favour of accepting a paper that describes LOV. Nonetheless, I think that this paper, in its current form, is not yet ready for publication and would need a serious revision in both form and content. I hope that my review can help in improving the paper for a final acceptance.

I would recommend that the authors insert a section just after the introduction to give an initial overview of LOV in terms of its content. I would expect basic statistics: number of vocabularies, terms, properties and classes and their trend over time (there are some details in section 4, but they come too late I think), most frequent knowledge domains represented in vocabularies, min/max number of properties or classes in a vocabulary, statistics on the different inter-vocabulary relationships (e.g. is extension more frequent than import?), etc. In short, I think that some more details about the actual content of LOV would make the paper more comprehensive.

The other missing bit in the paper is the explanation of the "LOD popularity": in some places (and in evidence in LOV search results) there are references to the occurrences of a term in LOD datasets. Where do those numbers come from? How are they computed? How often are they updated? Are they included in the LOV dumps? Are those occurrence numbers "reliable" (i.e. are they actual indicators of vocabulary reuse)? More importantly, is this sort of "popularity" measure included in the LOV ranking when returning search results?

More specifically regarding the current content of the various sections of the paper:
- Figure 1: this is a minor comment, but it would have been more natural to me if the figure were "upside down", with the Web at the bottom and the community on top, since this is the usual way to represent things that get data from some sources and process them to present them to final users. As a consequence, I would also reverse the order of the following sections.
- Section 2.1: while I understand that the community plays an important role, it sounds a bit odd to consider it a "component", also identified by a square box in Figure 1 like the software components. At least a different graphical representation would be advisable.
- Section 2.2, curators: please add more information about curators and their work. Who are the curators? How many are there? What is their background? Are they sufficient, or do you envision expanding this "editorial board" as more vocabularies are inserted in LOV? What is the process they follow? What are the criteria for accepting or rejecting the insertion of a vocabulary in the catalogue? How long does it take on average for a new vocabulary to be added to LOV? What happens if a vocabulary suggested for insertion is about a domain that is unknown/unfamiliar to the curators (e.g. a very specialized biomedical vocabulary)? What if a suggested vocabulary contains hundreds of thousands or millions of terms? I recommend that the authors add those (and maybe other) details; maybe the LOV curation team is worth a "component" in Figure 1 as well.
- Section 2.2, inlinks/outlinks: I may be wrong, but it seems to me that the definitions of inlinks and outlinks are reversed (I assume that inlinks are those on the left and outlinks those on the right in Figure 2).
- Section 2.2, associated metadata: the authors say that the curators added creator information in about 85% of the cases (which is a lot). Do they also contact the creators to suggest that they complement/complete their vocabulary metadata to improve future versions? If not, I think this would be a very useful service.
- Section 2.2, last sentence: "one needs to know in which vocabularies and datasets a particular vocabulary term is referenced", I understand that in LOV each term has a link to its respective vocabulary but not to datasets that use it. Or am I missing something? (This is also linked to the LOD popularity comment above.)
- Section 2.3: can you add a link to the LOV code and indicate its respective software license? Creative Commons licenses are data/content licenses, not usually applicable to code.
- Section 2.3.1 and Figure 3: what are duplicate terms? If it simply means the same term searched by another LOV user, it seems to me that it is irrelevant information. Moreover, are multi-term queries included in the count? Can we consider Figure 3 as a display of the total number of queries on LOV over time?
- Section 2.3.1 and Table 2: I would recommend adding a line with numbers for single term queries, just for reference and comparison. Moreover, can the authors comment on the numbers in Table 2? What do those numbers tell us about user searches?
- Section 2.3.2: do the dumps also contain the full vocabularies with all the terms or only their description according to VOAF? Can the authors add some statistics on number of triples, dimensions of dump files, etc.? This is another place where they mention the number of occurrences in LOD without fully explaining what this means.
- Section 3, ontology assessment: here the authors mention a term score. Is this the same score explained in Section 2.3.1 with respect to the local name and various labels or is it something else?
- Section 3, listings: both SPARQL examples are missing the WHERE clause.
- Section 3, ontology localization: while I am fully convinced of the importance of localization, the example provided by the authors sounds misleading, since it seems that they are saying that words in different languages sharing the same "root" have the same meaning. This is of course false (try with "burro" in Italian and in Spanish...).
- Section 3 and Figure 9: that figure is too complex; therefore either the authors insert a full explanation or they remove it, since it does not convey much more than what is already written in the section (I suggest the latter option).
- Section 4: as said above, some of the things written in the section introduction should be moved earlier in the paper and expanded. Minor comment on Figure 10: what is the reason for the tiny decrease in vocabulary number in October 2014?
- Section 4.2, last paragraph: "Giovanni" doesn't seem an author of reference [10].
- Section 5: when comparing the number of results between Swoogle and LOV, the authors wrote that in LOV "the term only appears in 1,562 vocabularies". I think this is incorrect, since LOV has a total of fewer than 500 vocabularies; that is probably the number of terms matching the query.
- Section 6: again the authors present as a contribution the "terms search scoring". While it is indeed useful to distinguish between primary labels and other properties, I'd say that an average vocabulary "seeker" would also like to know who used a specific vocabulary and in which context, as a metric/indicator/clue to choose between competing vocabularies. Therefore, as a possible future extension and improvement of LOV, I would recommend thinking about ways to support users in vocabulary selection: besides the current scoring/ranking system (which is about relevance of the result w.r.t. the user query), LOV could offer other "scores" or additional supporting information for a more informed selection of vocabularies.

As a final remark, I would recommend that the authors have their paper checked by a native speaker (I'm not...), because some expressions sound a bit weird to me. For example, I think that LOV "alone" should not have the article ("LOV is" instead of "the LOV is"); on the other hand, when used in an attributive manner it needs the article ("the LOV architecture/the LOV curators/..." instead of "LOV architecture/LOV curators/..."). Native speakers can of course prove me wrong.