Linked Legal Data: A SKOS Vocabulary for the Code of Federal Regulations

Authors: 
Núria Casellas
Abstract: 
This paper describes the application of Semantic Web and Linked Data techniques and principles to regulatory information for the development of a SKOS vocabulary for the Code of Federal Regulations (in particular of Title 21, Food and Drugs). The Code of Federal Regulations is the codification of the general and permanent enacted rules generated by executive departments and agencies of the Federal Government of the United States, a regulatory corpus of large size, varied subject-matter and structural complexity. The CFR SKOS vocabulary is developed using a bottom-up approach for the extraction of terminology from text based on a combination of syntactic analysis and lexico-syntactic pattern matching. Although the preliminary results are promising, several issues (a method for hierarchy cycle control, expert evaluation and control support, named entity reduction, and adjective and prepositional modifier trimming) require improvement and revision before it can be implemented for search and retrieval enhancement of regulatory materials published by the Legal Information Institute. The vocabulary is part of a larger Linked Legal Data project that aims at using Semantic Web technologies for the representation and management of legal data.
Submission type: 
Full Paper
Decision/Status: 
Reject
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-is...

Solicited review by Stefan Dietze:

This manuscript describes a SKOS vocabulary for the Code of Federal Regulations. The paper is very interesting, well-written and highly relevant to the scope of the special issue. However, some general criticisms have to be raised and addressed. The paper contains a very thorough analysis of the state of the art, which is worth reading in itself but sometimes even gives the impression of reading a survey paper (which was not intended by the authors, I suppose). The process of creating the vocabulary is also described very exhaustively, but the paper falls short on key issues such as the design rationale, or an actual evaluation and elaboration of the vocabulary itself. In particular with respect to the evaluation, the authors introduce some sections (e.g. 3.3) which assess the generated data and extraction mechanism from a quantitative point of view (i.e., how many concepts etc. were extracted), but the authors do not provide an actual evaluation of the accuracy and usefulness of the extracted vocabulary. What was the added value of using the SKOS vocabulary? To what extent did it improve interoperability of the data? Section 4 contains some considerations in this direction, but these seem fairly preliminary. For instance, one wonders why the authors suggest relating the DrugBank dataset to their vocabulary but do not actually realise this worthwhile idea.

In the motivation part, the authors introduce the idea of an entity-centric view on different regulations concerning related subject matter. That seems very worthwhile and provides a good context for this work. My suggestion would be to further elaborate this aspect, e.g. by introducing some sort of motivating scenario, which might also help in identifying evaluation criteria and settings.

Minor:
- Formatting is not consistent (e.g. footnotes, some text expands beyond margins)
- Overly extensive reference list (for a traditional journal submission)
- Spelling of "thesaurus" (sometimes upper-case, sometimes lower)
- All listings should have a caption. Without, reading becomes cumbersome.
- p.2: "futher"
- p.3: "to search and retrieval" => "...retrieve"
- p.5: "Dbpedia" => "DBpedia"
- p.7: "sa", "see", etc. are introduced but only explained later. Please explain on first introduction.

Solicited review by Marta Sabou:

This paper describes a set of approaches used in order to create a SKOS vocabulary for the Code of Federal Regulations, including: (1) vocabulary reuse; (2) the conversion of existing thesauri and (3) extraction from text. All the approaches have been inspired by previously published work, and the novelty here relates to their use within the legal domain. The paper concludes with some initial results of linking the created vocabulary with the DrugBank linked dataset.

On the positive side, the paper presents an interesting case of using SW technologies in the legal domain and therefore is well suited for this special issue. Additionally, the material is very well documented, providing a thorough overview of existing legal vocabularies, taxonomies and ontologies.

A major concern with the paper relates to the fact that the techniques it describes, besides having been adopted from earlier works, have led to no or suboptimal results. Therefore, the contribution of the paper is not so much in describing a successful approach to the problem of SKOS vocabulary creation but rather in functioning as an "experience report" from which other practitioners faced with a similar problem could learn. As such, the paper's appropriateness for a journal publication is doubtful.

Additionally, there are several issues that weaken the contribution of the paper, primarily related to the lack of clear evaluation for the text-based learning part that would support the conclusions drawn by the authors. Although the text-based learning approach is a major part of the paper, the evaluation is performed only from a SKOS perspective (in terms of structural features such as orphan concepts, hierarchy cycles etc.) but there is no insight given into the quality of the produced vocabulary (i.e., correct/incorrect labels, correct/incorrect relations, coverage of the domain etc.) besides the author stating that the output was "defective and uneven". Additionally, the author provides no details about the number of extracted concepts and relations. The author should significantly improve this part of the paper by providing more details as well as a clear evaluation of the resulting vocabulary in terms of ontology learning evaluation metrics.
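As a rough illustration of the kind of label-level evaluation requested here, the sketch below computes precision, recall and F1 of a set of extracted labels against a reference label set. It is not taken from the paper: the function name `term_precision_recall`, the case-insensitive exact comparison, and the sample labels are all assumptions made for illustration.

```python
def term_precision_recall(extracted, gold):
    """Compare extracted labels with a reference (gold) label set.

    Labels are compared after trimming whitespace and lower-casing;
    this normalisation is an illustrative assumption.
    Returns (precision, recall, f1).
    """
    ext = {t.strip().lower() for t in extracted}
    ref = {t.strip().lower() for t in gold}
    true_positives = len(ext & ref)
    precision = true_positives / len(ext) if ext else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only.
extracted_labels = ["Drug", "food additive", "label "]
gold_labels = ["drug", "food additive", "color additive", "device"]
p, r, f = term_precision_recall(extracted_labels, gold_labels)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same comparison could be applied to relations by treating each (narrower, broader) pair as one item in the two sets.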

Chapter 4, which describes the linking approach, applies a very simplistic exact-match strategy. It is not clear how many links have been established with this approach, and Table 7 is difficult to interpret given that only the drug IDs are shown, but not the labels of these concepts.
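For readers unfamiliar with the strategy, a minimal sketch of exact-match linking that also reports labels and a link count might look as follows. The function `exact_match_links`, the `example.org` URIs and the drug identifiers are hypothetical and do not reproduce the paper's implementation or any real DrugBank data.

```python
def exact_match_links(concept_labels, drug_names):
    """Link vocabulary concepts to drug entries whose names match exactly.

    concept_labels: dict mapping concept URI -> preferred label
    drug_names:     dict mapping drug ID -> drug name
    Matching is case-insensitive after trimming whitespace (an assumption).
    Returns a list of (concept URI, concept label, drug ID, drug name).
    """
    # Index drug IDs by normalised name so each concept needs one lookup.
    index = {}
    for drug_id, name in drug_names.items():
        index.setdefault(name.strip().lower(), []).append(drug_id)

    links = []
    for uri, label in concept_labels.items():
        for drug_id in index.get(label.strip().lower(), []):
            links.append((uri, label, drug_id, drug_names[drug_id]))
    return links


# Hypothetical data, for illustration only.
concepts = {
    "http://example.org/cfr/aspirin": "Aspirin",
    "http://example.org/cfr/labeling": "Labeling",
}
drugs = {"X1": "aspirin", "X2": "Caffeine"}

links = exact_match_links(concepts, drugs)
print(len(links), "link(s) established")
for uri, label, drug_id, name in links:
    print(f"{label} <{uri}> matches {drug_id} ({name})")
```

Reporting the label next to each identifier, and the total number of links, addresses the two readability points raised above.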

In terms of paper presentation, it would help to add captions to the various listings included in the paper. Also, section 3 is overly large and could easily be split into 3 individual sections corresponding to the 3 approaches taken by the author.

Solicited review by Rinke Hoekstra:

General remarks:
* The paper and individual sections do not properly introduce what they are about. This makes the paper harder to read than necessary.
* The SKOS vocabulary is not adequately evaluated. Not with respect to the quality of the extraction method, nor with respect to its usefulness for improving access to the CFR.
* The SKOS vocabulary does not take into account the highly specific meaning of legal terms depending on their location in legal texts (e.g. deeming provisions). Also, the authors are far too optimistic about harmonization of terms across different jurisdictions. Of course, this may or may not be very problematic depending on the audience of the vocabulary (legal experts vs. laymen). But even so, you wouldn't want to confuse citizens with articles that use the same terms, but are about different things.

introduction:
* It would help if the author were more precise about the purpose of the SKOS vocabulary. The current introduction is too general, and introduces the use of SW technology and vocabularies in general for the legal domain.

section 2/p2:
* Introduction of Linked Data principles is not really necessary for a paper in the SW journal.
* "on one hand" -> "on the one hand", "on the other" -> "on the other hand"
* I don't see why RDF/RDFS/OWL/SPARQL should be contrasted with URI naming & linked data principles. Consider rephrasing.

section 2.1/p2,3:
* The first sentence is too long for me to parse
* 'initially' suggests a temporal relation, but CLIME is much older than LRI Core/LKIF Core. Also, FOLaw is not really a core ontology (rather more foundational/upper... or even an epistemology).
* The overview of legal ontologies is very comprehensive, but does not really serve a purpose. Consider introducing only those ontologies relevant to the development of the SKOS vocabulary. Perhaps simply refer to the author's PhD thesis. The distinction between core, domain and "heavily targeted" may be useful, but only if necessary in positioning the SKOS vocabulary.
* "Currently, the Simple...." -> why the 'currently'?

section 2.2/p4
* "on one hand" -> "on the one hand" (... but there's no 'other hand' in this paragraph!)
* The first sentence of the last paragraph of 2.2 is too long

section 3/p4,5,6
* The SKOS vocabulary promises a lot for making the code of federal regulations more accessible. This makes me wonder whether it lives up to this expectation.
* Footnote 26 is nowhere to be found. Also, the most important feature of Linked Life Data is that it includes UMLS (the Unified Medical Language System), incorporating MeSH, NCI Thesaurus, MedDRA, and many others.

section 3.2/p6,7
* Consider upgrading this subsection to a full section. It seems to have formed the bulk of your work (or change the title of section 2)

section 3.3/p8,9
* It is interesting to see that the conversion to SKOS (even though SKOS is very limited in expressiveness) already allows you to debug the existing thesaurus. I feel that the issues raised in this section are not necessarily problems of the translation, but rather of the existing thesaurus. I therefore also think that the SKOS conversion and a 'digital curation' should not be seen as separate steps, but that the conversion itself is one of the methods through which the index can be curated.
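One structural check of the kind mentioned in these reviews is the detection of hierarchy cycles. The sketch below runs a depth-first search over a mapping of concepts to their broader concepts and reports a cycle whenever it meets a concept that is still on the current path. The function name `find_broader_cycles` and the input format are assumptions; the paper's own cycle control method is not described here.

```python
def find_broader_cycles(broader):
    """Find cycles in a 'broader' hierarchy.

    broader: dict mapping a concept to an iterable of its broader concepts.
    Returns a list of cycles, each given as a path that starts and ends
    on the same concept. One cycle is reported per back edge found, so
    this is not an enumeration of every elementary cycle.
    Uses recursion, so very deep hierarchies may need an iterative variant.
    """
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {}
    cycles = []

    def visit(node, path):
        state[node] = ON_PATH
        path.append(node)
        for parent in broader.get(node, ()):
            status = state.get(parent, UNSEEN)
            if status == ON_PATH:
                # parent is an ancestor on the current path: cycle found
                cycles.append(path[path.index(parent):] + [parent])
            elif status == UNSEEN:
                visit(parent, path)
        path.pop()
        state[node] = DONE

    for node in list(broader):
        if state.get(node, UNSEEN) == UNSEEN:
            visit(node, [])
    return cycles


# Hypothetical hierarchy: a -> b -> c -> a is a cycle; d hangs off a.
hierarchy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
for cycle in find_broader_cycles(hierarchy):
    print(" -> ".join(cycle))
```

A check like this can be run on the source thesaurus as well as on the converted vocabulary, which is in line with treating conversion as part of curation.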

section 3.4.2/p11,12
* I expected a more elaborate evaluation that discussed the use of existing technology/grammars/vocabularies/taggers for legal texts (i.e. legal texts differ substantially from the newspaper articles that are typically used to train POS taggers). Could you show some figures on how well the technology performs on your corpus? How does this automatic extraction compare to a manual effort or some other gold standard? For instance, what proportion of the relevant parts of the converted thesaurus is also extracted using your method?
* Also, do you really need full parsing to extract the relations? Focusing only on noun and verb phrases should provide you with the bulk of interesting relations & concepts.
* Your extraction method glosses over the fact that legal terms can have a very precise meaning depending on where they occur (e.g. consider deeming provisions). Have you given any thought to this, or is it irrelevant for your purposes? If so, why?

section 5/p14
* The conclusion summarizes the paper. It would have been nice if this summary appeared more to the front of the paper, allowing the reader to know what to expect.
* The conclusion/evaluation does not give a satisfactory answer to the promised potential of the SKOS vocabulary. Does the SKOS vocabulary perform better in retrieving CFR articles than the existing thesauri? How does the SKOS vocabulary relate to existing other ontologies (introduced in 2.1)... if you don't relate it, perhaps you shouldn't list these ontologies as extensively.
