Semantic representation of annotation involving texts and linked data resources

Tracking #: 1175-2387

Jin-Dong Kim
Karin Verspoor
Michel Dumontier
Kevin Bretonnel Cohen

Responsible editor: 
Andreas Hotho

Submission type: 
Survey Article
Abstract:
The explosive growth in web documents presents a rich opportunity to use this information to support knowledge discovery, by synthesizing and reasoning over statements and relationships expressed in those documents. However, the use of natural language in these documents means that the information cannot be directly used for computational analysis; it must be transformed into a computable representation. Natural Language Processing systems are being developed to perform this transformation, and the results of their processing can be stored and made available for knowledge discovery applications via structured annotations over documents. In this work, we sought to examine models to capture text-mined annotations. We first examine the utility of existing community-based models for representing annotations, primarily that of the Open Annotation Data Model (OA) and the NLP interchange format (NIF). We then propose a new model consisting of named graphs that separate annotations from resource descriptions. Our work overcomes limitations of existing models, provides interoperability between OA and NIF, and can be deployed to describe any kind of text annotation.

Solicited Reviews:
Review #1
By John McCrae submitted on 15/Oct/2015
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

This paper describes the representation of annotations on the Web, in particular in the form of annotation of Web documents with RDF. The paper focuses on the direct comparison of two recent models: the Open Annotation format and the NLP Interchange Format. This paper, in spite of being submitted as a survey article, seems more keen to push the use of named graphs for metadata. As a survey, this paper misses out some very important proposals and is too brief in its description of the challenges for annotating the Web. The suggestion of a two-layer approach is unfortunately naive.

As a survey paper for annotation, I would have expected the first comment to be about the choice between 'stand-off annotation' and 'in-line annotation' and while stand-off methods such as NIF and OA are more flexible, in-line methods such as RDFa should be mentioned in the context of this work. There are also a couple of major works that are missing, notably the POWLA model of Chiarcos has been proposed for this task[1]. Secondly, the authors mention LAF briefly, but the relationship to the GrAF model (part of LAF) is missing, and we have recently used this for publishing annotations on the Web[2]. The authors should also discuss in more detail the challenges of providing stable identifiers for text spans on the Web.

Reading this paper I do not get the feeling that the authors intend this paper to be a survey paper, but instead to push the idea of using named graphs over OA. The authors do not consider, however, the obvious drawback to this approach: named graphs are not part of the RDF standard, although they are used in some other specifications, e.g., SPARQL. The proposed serializations for RDF with named graphs (TriX and TriG) are not recommended by W3C or widely understood. 'Named graphs' are essentially the equivalent of 'databases' in the relational database world, in that their primary purpose is to allow for multiple datasets to coexist in the same triple store. The proposal by the authors to put all provenance annotation on the graph URI would require some use cases to assign a different graph URI for each annotation (say, if I were to assign an exact time and date to each annotation). Moreover, another important use case of OA is linking annotations together (for example as a phrase structure tree), which would be beyond the normal use case of a named graph. While adding provenance about datasets is a valid use of named graphs, named graphs were not intended to be a replacement for reification, and using them as such in practice turns out to be impractical, due to there being no standard serialization and named graphs already being used for other purposes (organizing datasets in triple stores).
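For illustration, the per-annotation graph URI pattern the reviewer describes might look like the following TriG sketch (the ex: names and the ex:denotes property are invented for illustration):

```trig
@prefix ex:   <http://example.org/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Each annotation lives in its own named graph.
ex:anno1 {
  ex:span1 ex:denotes ex:NN .
}

# Provenance attached to the graph URI (in the default graph):
# a distinct timestamp per annotation forces a distinct graph
# URI for every single annotation.
ex:anno1 prov:generatedAtTime "2015-10-15T09:30:00Z"^^xsd:dateTime ;
         prov:wasAttributedTo ex:posTagger .
```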

On p5. the authors try to claim that a text span should be typed as penn:NN; this argument is false: a text span is a symbol that signifies a word which is a noun, while penn:NN is the value of an annotation for a singular noun. The authors are thus changing the meaning of penn:NN in the OLiA model significantly!

On p6. the authors claim that OA cannot distinguish between annotating text as a noun phrase or referring to the concept of a noun phrase. In fact, OA does allow the type of a 'body' to be specified and this can be used to distinguish these cases.
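A minimal sketch of how body typing can make this distinction (the ex: names are invented, and olia:NounPhrase stands in for whatever concept URI would actually be used):

```turtle
@prefix oa:   <http://www.w3.org/ns/oa#> .
@prefix ex:   <http://example.org/> .
@prefix olia: <http://purl.org/olia/olia.owl#> .

# Typing the body as oa:SemanticTag states that the span is
# annotated AS a noun phrase, rather than merely mentioning
# the noun-phrase concept in a free-form body.
ex:anno1 a oa:Annotation ;
  oa:hasTarget ex:span1 ;
  oa:hasBody olia:NounPhrase .

olia:NounPhrase a oa:SemanticTag .
```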

On p9/10. the authors attempt to deal with deletion of an annotation and the open world assumption, but the solution is not correct. Declaring ex:anno4 as 'exhaustive' does not work, as under the OWA any triple could potentially be missing, including the one the authors wish to mark as deleted. A solution that would work, for example, is to explicitly say which triple was deleted OR to indicate the number of triples in the graph, allowing the system to deduce when all triples in a graph have been seen. Also, in general 'recall' is more difficult with the OWA (as it is impossible to know how many things exist unless this is explicitly stated), while 'precision' is easier (out of how many facts my system proposes, was there sufficient evidence to say whether this is true or not), although in many settings still impossible under OWA.
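The two workarounds the reviewer sketches could look roughly like this in Turtle (the ex:status property is hypothetical, and applying void:triples to a graph URI is an assumption rather than standard VoID usage):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# (a) Say explicitly WHICH triple was deleted, via RDF reification.
ex:del1 a rdf:Statement ;
  rdf:subject   ex:span1 ;
  rdf:predicate ex:denotes ;
  rdf:object    ex:NN ;
  ex:status     ex:Deleted .   # hypothetical status property

# (b) State the number of triples in the graph, so a consumer can
# deduce when it has seen all of them despite the open world assumption.
ex:anno4 void:triples "3"^^xsd:integer .
```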

Natural language processing should not be capitalized in the abstract but NLP Interchange Format should be.
p8. "we advocate putting" (not to put)
p8. "demonstrates how" (no that)
p8. "necessry"
p9. "hard, unless impossible" do you mean 'almost'?
p11. Semantic Web should be capitalized
The references are a part of your paper and should be prepared with the same amount of care. There should not be encoding errors, bad capitalization and other errors.

[1] Chiarcos, Christian. "A generic formalism to represent linguistic corpora in RDF and OWL/DL." LREC. 2012.
[2] B. Siemoneit, J. P. McCrae and P. Cimiano. "Linking Four Heterogeneous Language Resources as Linked Data." Proceedings of the 4th Workshop on Linked Data in Linguistics, 2015.

Review #2
By Paul Groth submitted on 04/Nov/2015
Major Revision
Review Comment:

#Brief Summary
This article describes a new representation for annotations applied to natural language text. It looks primarily at the Open Annotation (OA) model, discusses the inadequacies of the model, and devises a named-graph-based approach for representing such annotations.

#Overall thoughts
Overall, I think the area of representing annotations over text documents is of particular interest to the community, as automated knowledge base construction has become a bigger topic, in particular for sourcing Linked Data. Furthermore, I think the analysis done by the paper and its subsequent contribution are good ones. Using named graphs as a mechanism for precisely separating the annotation from the outcomes of annotation is a good idea. I think the closest work here is the work on the Grounded Annotation Framework; it would be good to compare and contrast with this. While I think the paper has a technical contribution, it was submitted as a survey paper, which I don’t think it qualifies as.

Thus, I would suggest either reworking the paper to be a true survey or submitting it as another type of contribution. I think both would be a good addition to the literature, but the article needs to pick. If it were up to me, I would resubmit as a technical paper, but I think the authors need to decide what they want to contribute.

# Qualification as a survey paper

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper would need more background material. For example, the role of annotation in NLP in particular, why the Semantic Web changes that role (e.g., cross-document coreference as an important activity), and a bit more on annotation formats. Furthermore, it would be useful to orient the reader with respect to the larger body of work on information extraction. In particular, much of the work there is beginning to ignore the boundaries of documents, and thus flexible provenance becomes particularly important within an annotation framework. Here I’m thinking of the DeepDive work [1].

(2) How comprehensive and how balanced is the presentation and coverage.
Again, for a survey paper, I think there is more that would need to be covered. Also, for a survey paper the presentation is tilted towards the authors’ solution.

(3) Readability and clarity of the presentation.
The paper is very readable. As a survey paper, though, it lacks a bit of a framework with which to categorize the relevant material.

# Other comments

2.4 - I think the text span discussion is interesting and is actually something that people encounter consistently in practice. It would be nice to have a forwarding pointer to more information about span identification. Or is the claim that NIF URI schemes and OA selectors are enough to cover all cases?

Given that “aboutness” is such a central notion within the paper, it would be good to provide a definition in paragraph 2 of section 2.1 instead of just the forwarding pointer.

It would be good to provide small RDF example snippets of NIF and OA when they are introduced, to help the reader in the comparison that comes later.

In section 4.1 you begin with an example from NIF, which is somewhat confusing because the discussion has been primarily about the OA model thus far. Furthermore, the discussion of rdf:type and the nature of NIF in this section seems rather tangential. Is this a downside of NIF?

With respect to ambiguous annotation (e.g. Figure 4), isn’t an option to subclass oa:Annotation to provide a more precise definition (e.g. ex:SyntacticAnnotation rdfs:subClassOf oa:Annotation)? It would be good to discuss this option.
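The subclassing option mentioned above might be sketched like this (the ex: names are invented for illustration):

```turtle
@prefix oa:   <http://www.w3.org/ns/oa#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# A more precise annotation class, so the nature of the annotation
# is carried by its rdf:type rather than left ambiguous.
ex:SyntacticAnnotation rdfs:subClassOf oa:Annotation .

ex:anno1 a ex:SyntacticAnnotation ;
  oa:hasTarget ex:span1 ;
  oa:hasBody ex:NounPhrase .
```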

Section 4.3: I think the incremental development argument needs to be better explained. If one represents all annotations in OA in the first place, then adding additional provenance information doesn’t require the “destruction” of a statement. I think the authors are assuming that statements need to be represented both in an “RDF natural form” (i.e., as triples) and in an OA form. This may be a reasonable assumption but needs to be stated. I can imagine writing transformations that expose a view of a dataset represented in OA as a set of triples.
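One such transformation could be sketched as a SPARQL CONSTRUCT query (the ex:denotes property is invented for illustration; real OA bodies and targets may involve selectors and would need additional patterns):

```sparql
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX ex: <http://example.org/>

# Flatten OA annotations into direct triples: each annotation's
# target becomes the subject and its body the object.
CONSTRUCT { ?target ex:denotes ?body }
WHERE {
  ?anno a oa:Annotation ;
        oa:hasTarget ?target ;
        oa:hasBody   ?body .
}
```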

The authors state “NIF ones are still semantically very thin”. I can buy that but would be interested in more details on what would count as “thick” semantics according to the authors.

I like the comparison of the conciseness of the queries. I wonder if having a table corresponding to Table 1 that shows a purely OA view would help further evidence the model.

I would suggest adding a figure that shows the various layers of the model for understandability.

# Minor comments
* p. 2 “the LD” -> LD
* p. 2 - it be good to provide a pointer to definitions of the deep web. e.g. Barbosa, Luciano, and Juliana Freire. "Searching for Hidden-Web Databases." WebDB. 2005.
* section 4.1 “OA model wants every piece” -> “The OA model”
* You should be clear about the notation you are using for named graphs, in this case N-Quads.
* p.8 “If necessry,” - typo
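On the named-graph notation point above: an N-Quads line simply appends the graph IRI as a fourth term (example IRIs invented for illustration):

```
<http://example.org/span1> <http://example.org/denotes> <http://example.org/NN> <http://example.org/anno1> .
```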

[1] Ce Zhang. DeepDive: A Data Management System for Automatic Knowledge Base Construction. Ph.D. Dissertation, University of Wisconsin-Madison, 2015.

Review #3
Anonymous submitted on 08/Jan/2016
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.


This paper starts with a brief introduction to NIF and the Open Annotation Data Model (OA), followed by a discussion of where each of them falls short. Then an approach based on named graphs is proposed, in order to overcome the shortcomings of NIF and the OA model.

In spirit, this paper is therefore closer to a research paper than a survey paper, since it is not concerned with being a comprehensive introductory text or with providing balanced presentation and coverage of the topic. In particular, for a more balanced coverage, more in-depth discussions are needed of LAF/GrAF, UIMA CAS, JSON-LD, and the most recent LAPPS exchange vocabulary. Also, the introductions to OA and NIF need to be made more detailed.

In order to improve the coverage and depth of the comparison amongst the various annotation models, I would recommend that the authors aim to create a set of requirements for semantic annotation formats and then review each existing format in that context. This will enable researchers, PhD students, and practitioners to easily grasp the differences between them. In addition, named graphs and the notation being used in the examples need to be introduced to the reader, since it's aimed as an introductory survey text.

With respect to the proposed named graphs approach itself, it would be better presented in a research, not a survey paper, in my view. Also, it needs to be introduced in more depth, using samples from existing datasets annotated already with OA, NIF, JSON-LD, or LAF. A more formal mapping to and from each of those models should also be provided. I think it's also worth discussing the issue of whether one should encode also annotation IDs and offset information (for Regions/annotations), as in LAPPS and some other models. In addition, more complex linguistic annotation examples than those shown in the paper should be included, e.g. relations, syntax trees, dependencies.

In summary, I believe this is not a survey paper meeting the criteria of this journal for such papers. I recommend accordingly that the authors either re-work it completely into a survey paper (that will be a worthwhile contribution to the community) or to re-work it into a more in-depth research paper for a journal. Alternatively, given its current length, it could be submitted as a conference paper to ISWC or ESWC or a similar venue.

Review #4
By Nancy Ide submitted on 20/Jan/2016
Major Revision
Review Comment:

The paper is not broad enough to be called a survey article, as there are many aspects of this topic and many solutions that are not included. It is however a perfectly fine regular article proposing a means to deal with limitations in the OA model (and to some extent, NIF). I found it interesting and thoughtful, and I like the proposal of the two-layer model (although it should be made clearer in the introduction that this is a major focus of the paper). I feel the paper should be published, but suggest major revisions because of some fundamental problems that should be addressed first.

(1) In terms of being useful as an introductory text, there is quite a bit of assumed knowledge here. The general form of statements in OA and RDF should be explained. Also, some terms are introduced and not really defined, especially “aboutness”, which appears repeatedly with no explanation of what it means besides that it is “the (broad) semantics of the connection”.

(2) A particular concern is a seeming lack of understanding of some of the proposed schemes for annotation and a resulting lack of awareness of their relationships. For example, in section 2.3, it is claimed that LAF “employs the representation language of XML”, and it is then assumed that LAF is not a graph-based model. The distinction between a data model and a serialization is apparently not understood; LAF is a graph-based model which CAN BE serialized in XML, among many other formats, and it is also isomorphic to other graph-based models such as RDF. Interestingly, the authors refer to a paper (Cassidy, 2010) that makes this very point within the same paragraph. Overall, the authors seem to be relatively unaware of a lot of work that is going on in the Computational Linguistics world (as opposed to the bioinformatics/BioNLP world) that embraces RDF-like models, and in particular the increasingly widespread use of JSON-LD (another way to serialize RDF) to represent linguistic annotations in projects such as the LAPPS Grid and Cassidy’s Alveo project, and interest in its adoption in major annotation projects such as DKPro and CLARIN.

(3) The point made about the problem of the OA model is an important one, and the preferred approach boils down (more or less) to the ability to provide different named relations (properties) instead of everything being cast as the target or body of an annotation, i.e., the inability of the OA model to handle what the authors call “multifaceted annotation”. It feels a bit like the paper is making more of this difference than is warranted, but at the same time, the fact that many people don’t seem to see this difference means it is probably worth discussion.

(4) There are some pieces of the puzzle that this paper attempts to address that I had hoped to hear more about, but didn’t feel that the paper quite reached. The authors say on page 8 that “vocabularies with richer semantics have to be developed”. In the NLP world, this is a problem that several projects are addressing, but we need not only vocabularies, but also a model of which things are the so-called objects and which are the (named) relations. This may seem simple for cases such as the paper’s NP example or the association of a text span with an object in a database, but once one dives deeper there are tricky cases and sometimes no right or wrong solutions, but simply the need to make a consistent choice that everyone can live with. For example, is “part of speech” a property of a word (or token, which of course might not be a word) or is it an object in its own right? Decisions about this seemingly trivial distinction ultimately affect the ways in which the information is processed, so once you make a decision as to which you prefer, your software is wedded to a basic model that might be hard to adapt. Things become even more slippery for “relational” annotations such as coreference and temporal annotation that relate two words/tokens/text spans/entities, in terms of what is reified as an object and what is a named property.

(5) On page 6 the authors say “the type of annotation shown in Figure 6 is called a named entity grounding or normalization, which often means linking named entities in text with corresponding database entries”. This is made clear with the protein example, but could this not be the same for NP? If not why not? If so please make it clear.

(6) The English needs some work, notably, there are many places where articles are dropped (e.g., “OA model”) and some awkward phrasing. Some suggestions:

p.2 : “these two knowledge resources” — referent (in the preceding paragraph) was not clear to me at first
: “The practice of creating links between these two Webs is often referred to as annotation; in which some portion …” replace semi-colon with comma
: “as has been done in the CRAFT corpus of scientific literature, [20,4,49,21], as well as automatically, such as in the CALBC project for the scientific literature” > “as in the CRAFT corpus of scientific literature, [20,4,49,21], as well as automatically, as in the CALBC project for the scientific literature” (or use “for example” instead of “as in”)

p.3 : reference for LAF is “ISO 24612, 2012” not “ISO 2008”. The best paper citation is Ide, N., Suderman, K. (2014). The Linguistic Annotation Framework: A Standard for Annotation Interchange and Merging. Language Resources and Evaluation, 48:3, pp. 395-418.

p. 8 : “relevant resources should be able to associated without limitation” > “relevant resources should be able to be associated without limitation”
: “to implement the step 5” > “to implement step 5”
: “necessry” > “necessary”
: what does the “ex” prefix in “textspan:a-synuclein ex:refers_to uniprot:P37840 ex:anno3 .” mean or refer to? This is the kind of detail that would need explanation for those unfamiliar with RDF etc.

p.9 : “However, we are also free to group relevant statements into same graphs” > do you mean to say “However, we are also free to group relevant statements into common graphs” or “the same graphs”?
: TM annotation — what is this? I do not find it defined earlier in the paper.
: “Even with such a kind of annotation” > “Even with annotations like”

p. 10 : “does not make much sense if open world assumption is applied” > “does not make much sense if the open world assumption is applied”
: “Thanks to the separation, queries over the annotation, particularly those for annotation content become tidier” > insert comma after the word “content”
: “when searching for specific content of annotation” > “when searching for the specific content of an annotation”
: “In this section, we demonstrate it using the annotation example in Figure 5” > what is the “it” referring to?

p. 11 : “To make it compatible with OA model, however, we need a little bit of cost.” > “To make it compatible with OA model, however, we incur some cost”
: “ little bit of modification is required to OA model” > “ some modification of the OA model is required”

p. 11 : “we surveyed existing approaches to annotation representation in semantic web” > “we surveyed existing approaches to annotation representation in the semantic web” Also maybe say “we surveyed some existing approaches”