JRC-Names: Multilingual Entity Name variants and titles as Linked Data

Tracking #: 1087-2299

Authors: 
Maud Ehrmann
Guillaume Jacquet
Ralf Steinberger

Responsible editor: 
Philipp Cimiano

Submission type: 
Dataset Description
Abstract: 
Since 2004 the European Commission's Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser used name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamin/Biniamin/Беньямин/بنيامين Netanyahu/Netanjahu/Nétanyahou/Netahny/Нетаньяху/نتنياهو). This entity name variant data, known as JRC-Names, has been available for public download since 2011. In this article, we report on our efforts to render JRC-Names as Linked Data (LD), using the lexicon model for ontologies, lemon. Besides adhering to Semantic Web standards, this new release goes beyond the initial one in that it includes titles found next to the names, as well as date ranges when the titles and the name variants were found. It also establishes links towards existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking. JRC-Names is publicly available through the dataset catalogue of the European Union's Open Data Portal.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 19/Jul/2015
Suggestion:
Major Revision
Review Comment:

This paper presents the development of a new dataset that is intended for multilingual entity names, and represents the first time that this resource has been made available as linked data. This paper is in general excellently written and presents a resource that has already proved useful to a large number of use cases. However, as a primary criticism I was not quite sure what the added benefit would be of making this available as linked data. It would be helpful if the authors could expand on the use case of this dataset as a linked data resource. In particular this is currently only mentioned in a short paragraph at the bottom of page 6, which is too vague to be convincing.

In addition to the description of the dataset, this paper also includes a section describing the linking between this dataset and several other datasets, including DBpedia. This is very well carried out and could potentially greatly increase the value of the dataset. It would be great if the authors could further comment on the applications of these links. However, I do miss an evaluation of the quality of these links and would recommend that the authors follow the pattern used by many other authors in the field: take a small sample of links (say 50-100) and evaluate them manually.
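The suggested sample-based check could be sketched roughly as follows. This is only a minimal illustration, not the authors' procedure: the link identifiers, the helper names `sample_links` and `precision`, and the placeholder judgments are all assumptions made for the example.

```python
import random


def sample_links(links, k=100, seed=0):
    """Draw a reproducible simple random sample of up to k links for manual checking."""
    rng = random.Random(seed)          # fixed seed so the sample can be re-drawn
    ordered = sorted(links)            # stable order, independent of input ordering
    return rng.sample(ordered, min(k, len(ordered)))


def precision(judgments):
    """Fraction of sampled links judged correct (judgments maps link -> bool)."""
    if not judgments:
        raise ValueError("no judgments supplied")
    return sum(judgments.values()) / len(judgments)


# Hypothetical (entity, DBpedia resource) pairs; these are NOT taken from JRC-Names.
links = [(f"jrc-names:entity_{i}", f"dbpedia:Resource_{i}") for i in range(1000)]
sample = sample_links(links, k=50)

# Placeholder judgments standing in for a human annotator's decisions.
judgments = {link: True for link in sample}
print(len(sample), precision(judgments))
```

With only 50-100 judged links the resulting precision is a rough estimate, so reporting the sample size alongside it would help readers interpret the figure.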

p2. "can as well" => "can also"
p2. "anew" => "again"
p5. Please make lemon consistently italic
p7. Byzance => Byzantium
p8. rdfs:label and rdfs:seeAlso
p9. 1.7 million (not a comma)

Review #2
By Jorge Gracia submitted on 07/Aug/2015
Suggestion:
Minor Revision
Review Comment:

This article explores the new linked data version of "JRC-Names", a multilingual dataset with entity names and variants collected by the European Commission's Joint Research Centre since 2004. The variants have been collected through a (mostly) automatic extraction process from the online version of printed media. The paper describes the representation model (based on lemon) and the other aspects related to the generation of the linked data version.

The paper is very well written and structured. It is also well illustrated with examples and supported by relevant external references. The motivation discussed at the beginning is strong. In my view the potential impact of this resource is high, and moving it into the Web of Data makes it even higher and more useful for the community. Especially interesting is the potential of this dataset for cross-lingual linkage and cross-lingual information access. The authors demonstrate a good understanding of the lemon model and the other underlying technologies. Here are some comments that I hope will help the authors to improve the quality of the submission:

- The paper is of "dataset description" type. The authors should cross-check the length restrictions for this type of submission (up to 10 pages, I think).

- I would clarify better the notion of "prior probabilities", introduced in page 5.

- In section 3.3, they say "This base name is not marked with a language..." but I do not see why not. If the language of the preferred label is known, reporting it can only be beneficial!

- In figure 2, why is there no "lemon:reference" relation between "jrc-names:Claude'owi_Junckerowi__pl#sense77" and "jrc-names:Jean=Claude_Juncker"?

- The model should be made available and dereferenceable online at http://open-data.europa.eu/jrc-names#

- Something I miss in the paper is a quality-oriented evaluation of the extracted names and variations. In fact, they describe (section 2.2) some strategies to reduce the noisy terms, for instance applying a threshold and filtering out those variations with a low frequency. But a quantitative measure of the improvement is not reported. I understand, however, that this is not essential in this type of paper (and possibly this issue is more related to the general EMM framework), but adding a few lines about any quantitative evaluation already performed, or plans for future ones, would make the submission even stronger.

- The resource is freely available online and the relevant pointers are included in the paper. However, the entry describing JRC-Names at http://datahub.io should be updated, as it currently refers to the old MLODE'12 version.

- lemon:LexicalVariant is not correctly capitalised; it should be lemon:lexicalVariant. This has to be corrected both in the text and in figure 2.

- Some references (e.g., 23, 29) give the authors' complete names rather than initials, unlike the other citations.

- Finally, a very minor one: lemon is written sometimes in italics, sometimes not. I would recommend making it consistent and always writing it in italics.

Review #3
By Gabi Vulcu submitted on 25/Aug/2015
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies; ideally using the 5 star rating provided here.

The reviewer's input:
This journal paper describes a Language Resource, which is the conversion of JRC-Names to Linked Data using primarily the lemon vocabulary.
It takes as input the proper names JRC dataset which includes not just proper names but also titles for these names.

The paper starts with the motivation for the work and presents the needs for integration and interoperability that the LD approach supports.

Then the JRC-Names process is described, but it is not clear whether this is a contribution of the paper or a description of what exists already.
The main section of the work, section 3, describes the data model used to model the JRC-Names dataset. The modelling is explained in a long section which is difficult to follow. The content is fine but the presentation can be improved. Maybe a better quality illustration could help. I found myself lost a few times in the crowded Figure 2.
However, the motivation for these modelling decisions is missing.
Conceptually, it is not clear what happens with the names that do not have any links to an external knowledge base. I understand that the multilinguality link is made between senses in different languages which point to the same DBpedia (or other knowledge base) entity. What happens when there is no such entity?
Also, I am not sure if Occurrence is a good name for a concept relating the title senses and the DBpedia concept.
It is not clear why the property jrc-model:hasTitle goes from the DBpedia concept to the title sense and not from the lexical entry sense to the title sense.
I think the figure is missing the following:
- link between the Polish 'Claude Juncker 77 sense' and the DBpedia concept
- the relation lemon:lexicalVariant between the two Polish lexical entries
- missing sense of the "presidente_dell'eurogruppo_it" lexical entry. Is it omitted for space reasons, or do not all titles have a sense and therefore a connection to the external DB?

The overall impression about this paper is that it fits very well to the context of this journal, however there are things that need to be addressed:
- more motivation on the design decisions
- better and more complete illustration of the model (maybe create 3 figures: one for variants, one for titles and one high-level view of the relations between the two)
- improve section 6 by showing how the dataset addresses the potential uses. Show some queries and discuss if they are actually feasible for the use-case.