Review Comment:
Review of
Countering language attrition with PanLex and the Web of Data
Submitted to Semantic Web Journal
Overall recommendation: Accept with minor revisions (though on the borderline between major and minor revisions)
Summary review
This paper describes the efforts of the PanLex project to make its data available in Linked Data form. PanLex represents one of the largest efforts (if not the largest) to create a cross-linguistic lexical database where elements with similar meanings are linked together across thousands of languages. As such, a paper describing these efforts seems completely appropriate for a collection of papers on Multilingual Linked Open Data--and indeed, it would seem important to include it. At the same time, I think the paper could be improved in various ways, and probably the linked data itself could be improved too. I am not accustomed to reviewing papers on Linked Data datasets like this. So, I am not sure if it is appropriate to suggest changes to the actual RDF, but it does seem to me that, at least, the paper could address why certain choices in representing the data were made when others were available (see below for more details).
I will give detailed comments below, but first I will mention the most global concern I had about the paper: The paper does a good job of going through the mechanics of making the PanLex dataset available as Linked Data, but it was hard for me, on the whole, to understand the underlying conceptual model of linked lexical data that PanLex assumes. Because of this, it was not easy for me to see how similar or different it is from other projects aiming for interoperation of lexical data (e.g., for NLP). One of the apparent selling points of this database is its potential applications for translation, but I didn't see anywhere in the paper discussion of how precisely this would be facilitated. I gather this is somehow done via the Meaning relation, but too little was said about how meanings were encoded for me to see how this might work. Is this essentially an "interlingual pivot" approach? If so, it would be good to know more about the "interlingua". If not, it would be good to make the nature of the intertranslatability mechanism clearer. More succinctly, I could summarize this comment as: The paper would have more impact, I believe, if the discussion was balanced more towards the conceptual side and less toward the technical one.
Detailed comments
Title: The title of the paper describes the goals of PanLex more than the contents of this specific paper.
p.2, section 2.1, first paragraphs: Can more be said about how meanings are represented? I quickly visited the link in footnote 4 and learned that meanings are approver-specific and that they have identifiers, but I still didn't know what they look like and, in any event, I probably shouldn't have to leave the paper to find this out.
p.2, section 2.1, second column: Can PanLex really extract lemmas reliably from so many sources?
p.2, section 2.1, towards the bottom: The label "approver" is very confusing for this object! Also, can the logic of making this pairing a foundational object for PanLex be made clearer? On its surface, it would seem to hinder potential applications for intertranslatability.
p.2, figure 1: The use of the singular "Metadatum" seems odd here. Is there only one metadata field being encoded?
p.3, section 2.1: As with meanings, I don't have a good sense of what kind of content is associated with a denotation. How are similar denotations across languages identified as such? Also, the end of this section implies an annotation schema for things like register, but can the possible feature-value pairs be consulted somewhere?
p.3, section 2.2: The discussion of how PanLex meanings are encoded in RDF clarified some aspects of the structure of meanings that would have been better stated explicitly earlier, e.g., that they can include "definitions" represented as strings. Also, why is rdfs:label considered an appropriate way to encode these definitions? A definition does not strike me as a "label", or do I misunderstand? Also, I wonder if the use of SKOS terms, such as skos:definition, might be more appropriate here. (In general, my sense is that this dataset could make better use of available data encoding schemes like SKOS, TEI, etc.)
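To make the rdfs:label versus skos:definition point concrete, here is a minimal sketch of the two encodings side by side, written out as N-Triples. The namespace http://example.org/panlex/, the meaning identifier, and the definition text are all hypothetical placeholders, not taken from the paper or from the PanLex RDF; only the rdfs:label and skos:definition property URIs are the standard ones.

```python
# Hypothetical namespace: the real PanLex URIs may well differ.
PLX = "http://example.org/panlex/"
# Standard property URIs from RDF Schema and SKOS.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_DEFINITION = "http://www.w3.org/2004/02/skos/core#definition"


def literal(text, lang):
    """Serialize a language-tagged string literal in N-Triples syntax."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def triple(subject, predicate, obj):
    """Format one N-Triples statement; obj is an already serialized term."""
    return f"<{subject}> <{predicate}> {obj} ."


meaning = PLX + "meaning/12345"  # made-up meaning identifier
definition = "a domesticated carnivorous mammal"  # made-up definition text

# Encoding the definition as a label, as the paper appears to do.
as_label = triple(meaning, RDFS_LABEL, literal(definition, "en"))
# Encoding the same string with the SKOS property suggested above.
as_definition = triple(meaning, SKOS_DEFINITION, literal(definition, "en"))

print(as_label)
print(as_definition)
```

The two statements differ only in the predicate, so switching would presumably be inexpensive, while a consumer that already understands SKOS could then tell definitions apart from display labels.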
p.3, section 2.2: Why use a plx:wordClass concept for denotations when there must be other places where this concept has already been defined for Semantic Web purposes?
p.4, bottom of second column: "vocubulary"->"vocabulary"; also I was not familiar with the acronym ETL. Can it be expanded?
p.5, section 3: Can a brief note be made about the reason why linking to DBpedia is deemed desirable in the context of this dataset? What specific applications is this linking expected to support?
p.5, table 3: Is there some way for us to evaluate if the number of links obtained is "good" or not? For example, how many potential links would one have estimated as possible? What percentage of potential links were realized? How was the linking better or worse across languages? In what cases may "missing" links be due to problems with PanLex, when might DBpedia be the problem, etc.?
p.6, section 6: This section describes related work in very general terms, but it would be good to talk more specifically about how standards for lexical data interoperation relate to this project. Some relevant work was (relatively recently) summarized in this paper, http://www.aclweb.org/anthology-new/W/W10/W10-2101.pdf, and it may serve as a good starting point. It would be welcome, in particular, if there was explicit discussion about why existing coding systems weren't used for the RDF version of PanLex. For instance, lemon is mentioned, but that makes one wonder why it wasn't used to encode the data. In general, merely stating that related work exists is less interesting than describing why the conventions of that work were or were not adopted and/or placing this project more explicitly with respect to the goals of other projects.
p.6, section 7: Discussion of weaknesses seems relevant, but do these weaknesses actually affect the ability of this resource to be used effectively as linked data? Those are the kinds of weaknesses that I would think would be most relevant here.
Addendum to earlier review to ensure coverage of key areas of review:
---------------------------------------------------------------------
URL for the data: Various URLs to PanLex are given, but it was not clear to me if there was a standard URL for the RDF data. Also, I tried various links to get at the RDF in the paper but couldn't access any of them (e.g., the /sparql URL gave an error and http://ld.panlex.org/rdf.html, found on the project's main webpage, never resolved). This should be checked before publication.
Version date and number: I didn't see this explicitly discussed. It may be that there is no single version date and number since each source is associated with its own version date and number. As far as I know, this seems fine, but it would be good to make this clear.
Licensing: There was discussion of licensing of the data in PanLex in general, but I didn't see discussion of how licensing applies to the RDF data. Is it merely derivative of the licensing used for each of the PanLex sources or does the RDF version have its own license? (Is the licensing information encoded in the RDF if it is different for each set of data?)
Availability: See above. I could not access the URLs where the data was supposed to be available. Certainly wide availability is intended, and the PanLex data has long been available. So, I imagine this is merely a technical glitch.
Authorship: I am not sure of the relationship of the first two authors to the dataset, but the third author is director of the PanLex project, meaning at least one author has a close connection to the data.
Use of vocabularies: As noted in my review, I have concerns that, more often than necessary, PanLex internal concepts are used in the RDF when equivalent concepts may already be available in, for instance, SKOS.
I believe my earlier review covered the core review criteria, but I repeat my evaluation of them explicitly here for ease of editorial decision making.
1. Quality of the dataset: PanLex represents one of the largest efforts (if not the largest) to create a cross-linguistic lexical database where elements with similar meanings are linked together across thousands of languages. I have not explored the dataset thoroughly, but, from what I know, it is a high-quality dataset--moreover, there is nothing comparable of higher quality that I am aware of.
2. Usefulness of the dataset: I am not sure it is explicitly described in the paper (but if it is not, it should be), but the coverage of this dataset, in terms of number of languages, is at a level that goes well beyond any other publicly available resource, to the best of my knowledge. Moreover, its attempts to interlink lexical items give it clear applications for (certain kinds of) multilingual translation and interoperation, making it clearly useful.
3. Clarity and completeness of the descriptions: I raised this point explicitly in my initial review. It is hard to understand key aspects of the PanLex conceptual data model from this paper. I think the paper can be readily revised to address this since this is almost certainly an issue regarding presentation of the project rather than resulting from a lack of internal understanding on this point.