The Document Components Ontology (DoCO)

Tracking #: 822-2032

Alexandru Constantin
Silvio Peroni
Steve Pettifer
David Shotton
Fabio Vitali

Responsible editor: Oscar Corcho

Submission type: Ontology Description
The description of document layers, as well as of the document discourse (e.g. the scientific discourse in scholarly articles), in machine-readable form is crucial to facilitating semantic publishing and the overall comprehension of documents by both users and machines. In this paper we introduce DoCO, the Document Components Ontology, an OWL 2 DL ontology that provides a general-purpose structured vocabulary of document elements to describe document parts in RDF. In addition to the formal description of the ontology, its utility in practice is showcased through several in-house solutions and other works of the Semantic Publishing community that rely on DoCO to annotate and retrieve document components of scholarly articles.

Minor revision

Solicited Reviews:
Review #1
By Almudena Ruiz submitted on 06/Oct/2014
Minor Revision
Review Comment:

The authors describe DoCO, the Document Components Ontology, an ontology for describing the component parts of a document and their rhetorical connotations. The paper is well written and clearly describes the ontology, the design process and rationales. The article gives a very good overview of DoCO, what it does, its structure and its strengths. The paper also presents some examples of the adoption of the ontology in different scenarios.

While the methodology used to create the ontology is described and interesting, the introduction needs to be revised to state clearly the goal of DoCO and the problem it addresses. The authors say "...the number of distinct vocabularies adopted by publishers to describe these requirements is quite large, and a need arises to integrate these different languages into a single, unifying framework that may be used for all content, regardless of provenance and scientific context". How does DoCO solve this?

I think it would have been helpful for the example of the use of DoCO to appear in the main text. I checked the URL for the example, but the link does not work (Firefox and Chrome on Windows 7; tried on 1, 2 and 6 October 2014).

The following is a list of minor corrections:
- Missing reference (Sect. 1, p. 1, l. 22).
- The RDF acronym is expanded only in Sect. 2 (p. 1), but it is already used in Sect. 1 (p. 3).

Review #2
By Francesco Ronzano submitted on 20/Oct/2014
Minor Revision
Review Comment:

This paper introduces the Document Components Ontology (DoCO), which defines a formal framework to characterize the structural and rhetorical elements of a document. After discussing relevant related ontologies and document annotation schemas, the authors provide a detailed description of the most important DoCO elements useful to mark the components of scientific papers. The description focuses on the way DoCO combines and takes advantage of both the Pattern Ontology and the Discourse Element Ontology to characterize, respectively, the structural and rhetorical traits of a document. The authors provide examples of the adoption of DoCO in tools (PDFX, Utopia, an algorithm to map XML markup to DoCO elements), projects (Biotea, Alighieri's Convivio, SLOR), external ontologies (HuCit, modelling of scholarly documents and math expressions) and RDF datasets (ParlBench).

The paper is clear and well written. It provides an accurate description of the main components of the DoCO ontology.

Minor remarks:
- in Section 3.2, some of the rhetorical classes of the Discourse Element Ontology (DEO) are listed and described (Reference, Material, Method, RelatedWork, etc.). It would be great to add some detail on how (by which methodology or approach) this set of DEO rhetorical classes has been identified and validated.
- in Section 4, concerning the Adoption and use of DoCO, if you have any data, it could be interesting to add an analysis of the feedback on the adoption of DoCO: are there situations in which DoCO has been extended to support peculiar or domain specific needs when modelling the structure of a document? (for instance, extended to annotate peculiar structural markups or rhetorical classes that are relevant to a domain, but not modelled by DoCO)
- it could be helpful to include, after Section 3.3, a simple example or image of a document annotated with the Document Components Ontology (DoCO)

- Page 1, Introduction: "...XML vocabulary of scientific journals to be acceptable for inclusion in PubMed Central [superscripted link?]": replace [superscripted link?] with a link or footnote.

Review #3
Anonymous submitted on 14/Nov/2014
Minor Revision
Review Comment:

This article presents an ontology for describing document components, both in terms of their structural and rhetorical behaviour. This work is clearly a relevant contribution to the Semantic Publishing community: it has already been used in some applications, and it is getting the attention of researchers in different communities. The ontology, called DoCO, provides a generic vocabulary that can be used for the annotation of academic document elements in RDF, facilitating the processing and understanding of the document content by both humans and machines. The exploitation of such semantic data can have multiple applications and benefit the different parties involved in the process of writing and publishing academic content (scientists, researchers, publishers, etc.), as pointed out by the authors. In fact, the potential applications of the data generated with DoCO are what make it highly compelling. They range from improved navigation of documents, retrieval of particular document parts and selective text mining to automatic document validators, comparators, plagiarism detectors, etc. Besides, the proposed vocabulary has the potential to become a reference model that can be used to map and interoperate between the different XML vocabularies used by publishers. Thus, overall this is a relevant and high-quality work.

Regarding the presentation of the article, the content is clear and well structured, which makes it easy to read. The article provides a good overview of the related work, and points out the gap that DoCO is filling in the state of the art. The figures and tables are clear and relevant, and the description of the ontology itself is good enough. Finally, the discussion regarding the adoption of the ontology is fair.

There are a few points that may be improved or discussed in a final version of the manuscript.
Something I miss in the article is a discussion of the design principles used for the development of the ontology. For instance, did you follow any principles in the selection of axioms for the description of resources, in the choice of labels, or in the identification of properties? Similarly, it would be useful to mention whether you followed any particular methodology during the ontology engineering process, at least partially. For instance, did you determine competency questions that were later used for validating the final ontology, and what were the criteria for choosing the ontologies you reused, e.g., DEO over ORB?
In the description of the structures in DoCO, the table class is defined in terms of containers (identifying a row). However, as mentioned in the previous section, a po:Block may be a cell in a table. Why was the table not defined using cell elements? In the example DoCO annotation of your own paper, I would appreciate an example of a table as well.
The description of the figure class says that it is modelled as a flat element without textual content. The formal definition, however, is in terms of po:Milestone and po:Meta, which are Markers, not Flat elements.
In the definition of captioned boxes, you use dcterms:hasPart for FigureBox and po:contains for TableBox. Was that a typo? In the ontology source both use dcterms:hasPart (as does the Bibliography class). In fact, it is a bit unclear when you use one property or the other. Of course po:contains is a subproperty of dcterms:hasPart, and in footnote 28 you mention that po:contains is used in particular for elements of type po:Structured. But since it is also used for hybrid or rhetorical elements, the criteria for choosing one over the other are not that clear.

Regarding the applications using or applying DoCO, unfortunately the PDFX service was unavailable, so I could not use it and see the actual output of the document organisation analysis. I also tried Utopia, but its use of PDFX to analyse the document structure is hidden from the user, and one cannot see the output of that analysis, so again I could not see what I was hoping for. In addition, viewing your article in Utopia did not result in the identification of tables, only figures, so I was not able to see or export the table data as you mention. The prototypical implementation of the algorithm that aims at associating a particular DoCO class with each markup element used in these XML article sources looks promising. I look forward to the release of that application. However, what I think is still missing, and I do not see in your future plans, is a more holistic solution that could process and annotate (PDF) documents (semi-)automatically, in order to generate an output like the example annotation you provided for your article. Obviously, such annotation would enable all the benefits mentioned, but doing it manually is definitely not a viable practical approach. Do you have such an application in your plans? Finally, I would be glad to read your thoughts on how such annotations may be complemented with those produced with related ontologies, for instance, for the description of the scientific discourse or for the description of bibliography and citations.

Two additional minor comments:
Section 3 first paragraph, “such at” -> “such as”
Section 3.3, figure class paragraph: “In DoCO, it is disjoint with the previous classes is modelled…” -> missing “and”?