LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain

Tracking #: 3108-4322

Authors: 
Vasile Pais
Maria Mitrofan
Carol Luca Gasan
Alexandru Ianov
Corvin Ghiță
Vlad Silviu Coneschi
Andrei Onuț

Responsible editor: 
Harald Sack

Submission type: 
Dataset Description

Abstract: 
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. GeoNames identifiers are also provided for location entities where linking was possible. The resource is available in multiple formats, including span-based, token-based and RDF. The Linked Open Data version, in RDF-Turtle format, is available both for download and for querying through a SPARQL endpoint.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 15/Apr/2022
Suggestion:
Minor Revision
Review Comment:

This new version of the paper addresses my main remarks about the ontology description and the RDF mapping of the extracted entities. To meet the requirements for a "Data Description" paper, I suggest that the authors group together and give more prominence to the information about the quality and stability of the dataset, as well as how to access and reuse it.

Review #2
Anonymous submitted on 22/Apr/2022
Suggestion:
Major Revision
Review Comment:

Thank you for revising and resubmitting your manuscript. The revised version addresses most of my comments, but various aspects still need to be taken into account.

My main comments regarding the revised version relate to aspects of presentation.

1. Introduction: Please shorten the introduction by clearly stating what this paper presents. All remarks and projects that belong to the category of related work should be moved into the Related Work section (especially page 1, right column, lines 27 ff. until page 2, left column, line 9).

4. Corpus description: like many other parts of the paper, Sections 4.1 and 4.2 are very verbose and detailed. In places the level of detail is far too fine-grained and can be trimmed down accordingly (also see my last remark below).

Figure 1 and Figure 2 seem to be screenshots. Please recreate these in LaTeX to improve the quality and use both columns to present the resulting figures (the current versions are simply too small).

Section 4.2: This subsection is too long and too detailed (see below), and its presentation also needs to be improved. As it essentially presents the different vocabularies used in the annotation, my suggestion would be to use a description list, with one vocabulary per \item and one paragraph per vocabulary, in order to (a) present this information to the reader in a more structured way and (b) keep the amount of detail at a reasonable level.

Figure 3: please include this graphic not as a bitmap but as a vectorised PDF so that the quality is improved.

Appendix A: In its current form, this appendix is not helpful. If you want to keep it, then the appendix needs one or two paragraphs of explanation. You should also consider including comments in the actual code.

Furthermore, please note that a "Data Description" paper at SWJ should be "a concise description of a Linked Dataset". While most of the aspects of your dataset, as specified in the definition of a "Data Description" paper, are properly addressed in the manuscript (title, repository, use cases etc.), most of the paper is simply too verbose and too detailed. Many of the very detailed aspects can simply be removed. As rough guidance, at least two to three pages can be safely removed from the revised version without compromising the key informational aspects of your work, i.e., your core research results. I suggest this especially because the paper is a rather straightforward dataset/corpus paper, i.e., nearly all readers of this article will be familiar with the main approaches followed in it. The suggestion is to remove all unnecessary detail that does not directly relate to your core research results. For example, the short history of NER at the beginning can be trimmed, as can the explanation of what GeoNames is, the information about which columns were added to which files to accomplish certain annotation layers, and many other aspects that are too detailed, too operational or too technical.

There appears to be an over-use of quotation marks in the paper. Whenever a term is already marked in itself (for example, when a prefix is used that is separated from the rest with a colon), then quotation marks do not really need to be used.

There are various overfull boxes that need to be fixed.

Finally, the English needs to be thoroughly checked, ideally by a native speaker.

Review #3
Anonymous submitted on 14/Jun/2022
Suggestion:
Accept
Review Comment:

The authors have addressed the concerns that were raised in the last review in the updated version of their paper. The addition of the federated query and the example has improved the comprehensibility of the paper. The SPARQL endpoint also worked properly this time.

However, the readability of the paper could be further improved by adopting the following minor suggestions:
1. Explicitly state the contributions of the paper as bullet points at the end of the introduction.
2. Although Section 4.2 gives a short description of the key concepts and relationships shown in Figure 3, an explanation of the meaning of the concepts and relations is missing. It would be nice to have a brief description of each concept and relation, presented separately in the form of a table.
3. It would be nice to fit the text properly within the columns, unlike at lines L28 and R10 on page 5.