Review Comment:
This article presents an ontology for describing document components, in terms of both their structural and rhetorical behaviour. This work is clearly a relevant contribution to the Semantic Publishing community: it has already been used in some applications, and it is attracting the attention of researchers in different communities. The ontology, called DoCO, provides a generic vocabulary that can be used for annotating the elements of academic documents in RDF, facilitating the processing and understanding of document content by both humans and machines. The exploitation of such semantic data can have multiple applications and benefit the different parties involved in the process of writing and publishing academic content (scientists, researchers, publishers, etc.), as pointed out by the authors. In fact, the potential applications of the data generated with DoCO are what make it highly compelling. They range from improved navigation of documents, retrieval of particular document parts, and selective text mining to automatic document validators, comparators, plagiarism detectors, etc. Besides, the proposed vocabulary has the potential to become a reference model that can be used to map and interoperate between the different XML vocabularies used by publishers. Thus, overall this is a relevant and high-quality work.
Regarding the presentation of the article, the content is clear and well-structured, which makes it easy to read. The article provides a good overview of the related work, and points out the gap that DoCO is filling in the state of the art. The figures and tables are clear and relevant, and the description of the ontology itself is good enough. Finally, the discussion regarding the adoption of the ontology is fair.
There are a few points that may be improved or discussed in a final version of the manuscript.
Something I miss in the article is a discussion of the design principles used for the development of the ontology. For instance, did you follow some principles in the selection of axioms for the description of resources, or labels, or the identification of properties? Similarly, it would be useful to mention whether you followed any particular methodology during the ontology engineering process, at least partially. For instance, did you determine some competency questions that were later used for validating the final ontology, and what were the criteria for choosing the ontologies you reused, e.g., DEO over ORB?
In the description of the structures in DoCO, the table class is defined in terms of containers (identifying a row). However, as mentioned in the previous section, a po:Block may be a cell in a table. Why was the table not defined using cell elements? In the example DoCO annotation of your own paper, I would appreciate also including an example of a table.
The description of the figure class says that it is modelled as a flat element without textual content. However, the formal definition is in terms of po:Milestone and po:Meta, which are Markers, not Flat elements.
In the definition of captioned boxes, you use dcterms:hasPart for FigureBox and po:contains for TableBox. Was that a typo? In the ontology source both use dcterms:hasPart (as with the Bibliography class). In fact, it is a bit unclear when you use one property or the other. Of course po:contains is a subproperty of dcterms:hasPart, and in footnote 28 you mention that po:contains is particularly used in elements having type po:Structured. But since it is also used for hybrid or rhetorical elements, the criteria for using one over the other are not that clear.
Regarding the applications using or applying DoCO, unfortunately the PDFX service was unavailable, so I couldn’t use it and see the actual output of the document organisation analysis. Additionally, I tried Utopia, but the use of PDFX to analyse the document structure is hidden from the user, and one cannot see the output of such analysis, so again, I couldn’t really see what I was hoping for. Moreover, viewing your article in Utopia did not result in the identification of tables, only figures, so I was not able to see or export the table data as you mention. The prototypical implementation of the algorithm that aims at associating a particular DoCO class to each markup element used in these XML article sources looks promising. I look forward to the release of such an application. However, what I think is still missing, and what I don’t see in your future plans, is a more holistic solution that could process and annotate (PDF) documents (semi-)automatically, in order to generate an output like the example annotation you provided for your article. Obviously, doing such annotation would enable all the benefits mentioned, but doing it manually is definitely not a viable practical approach. Do you have such an application in your plans? Finally, I would be glad to read your thoughts on how such annotations may be complemented with the ones produced with related ontologies, for instance, for the description of the scientific discourse or for the description of bibliography and citations.
Two additional minor comments:
Section 3 first paragraph, “such at” -> “such as”
Section 3.3 figure class paragraph: “In DoCO, it is disjoint with the previous classes is modelled…” -> missing “and”?