Tools for building an event-based knowledge graph from legal decisions

Tracking #: 2833-4047

Authors: 
María Navas Loro
Víctor Rodríguez-Doncel

Responsible editor: 
Guest Editors Event-centric Analytics

Submission type: 
Tool/System Report
Abstract: 
This paper describes a toolset to transform a legal decision in English language into a collection of events represented in RDF supported by an ontology. Two different sources for judgments have been used for demonstration: the European Court of Human Rights (ECHR) and the European Court of Justice (ECJ). Text documents, preferably structured, go through a pipeline where they are analyzed, annotated and finally ingested in a triple store that can be queried through an open SPARQL endpoint. A translation service permits transforming time related information from/to different formats. The related ontology is publicly available online, the source code is accesible in an open modality and a web portal demonstrates the toolset. The adoption of standards and the service-oriented architecture favor the interoperability and extensibility of this framework respectively. A set of predefined queries facilities retrieving information from the knowledge graph.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 13/Aug/2021
Suggestion:
Major Revision
Review Comment:

The authors propose a pipeline which, given a set of documents describing legal cases, generates a knowledge graph about the events mentioned in such cases. The different steps of this pipeline - rule-based event extraction, a time ontology, a conversion between XML and the time ontology, and the knowledge graphs are described. While this is an impressive amount of work and a well-thought combination of several approaches (partly already described in the authors' previous work), I see several issues with respect to the motivation, the paper structure and the overuse of technical details. Moreover, the suggested tools are not yet easily ready to use due to missing code documentation.

== Paper ==

1) Motivation: The paper starts with a reference to Wittgenstein about the importance of events which is kind of nice but also very high-level and not very close to your actual goals. After that paragraph, you immediately jump into the definition of event knowledge graphs. There is no proper motivation for why the legal domain plays an important role in knowledge modelling and what are the actual goals.

2) Intuition: Personally, it was not easy for me to follow the idea of the paper mainly because of my missing domain knowledge in the legal domain and the missing examples. Throughout the whole article, you are not giving a single example of a law case. Later, when mentioning the EventsMatter corpus, I kind of get an idea of what it is about, but it is still rather vague (even in Section 3.1, you hide the actual example behind "Decision" placeholders). I would appreciate seeing an extensive example (e.g., as part of the introduction) of what such documents look like and what you actually want to achieve by modelling them as knowledge graphs.

3) Overview: The article introduces a pipeline consisting of several components/tools and input and output data. A clear overview of them and their dependencies is not provided, though. Specifically, the "WhenTheFact" component is not even named before the caption of Table 1. I would suggest extending Fig. 17 and show/describe it even before Section 3 starts. Similarly, Fig. 7 should be shown at the beginning of Section 3. Then, it is more clear how the sub-components (structure extractor, the trained resources) are connected.

4) Use of the knowledge graph: At the end of Section 6 (i.e., too late for providing intuition), you mention several use cases for the extracted knowledge graph, e.g., specific queries (e.g., "Give me cases ... where the driver was a man"). Later in the conclusion, however, you give the impression that such queries are not yet executable given the lack of semantic annotation in the knowledge graph ("Once this is achieved, queries will be able to retrieve for instance the timeline of one actor's involvement in a case"). Clear information about what is possible and what is not yet possible would be great. To do so, you might even add a section that discusses the current possibilities of using your tools vs the current challenges and how to solve them in the future.

5) Technical details: The paper mentions many technical details during the "approach" sections. Examples include Java data types ("HashMap", "ArrayList"), Java classes ("readFrames.java"), file names ("events.ser", "frames.ser"), file types ("a txt file") and libraries ("nltk", "framenet"). This distracts from the actual ideas and research that should be independent of the code. I would like to see more abstraction here. Implementation details should be provided in a different section.

6) Structure Extraction: Section 3.1 on structure extraction has four large figures which do not contribute much to the understanding of the approach behind the structure extraction. The actual method behind that subtask is conveyed only in two brief steps discussed at the end of Section 3.1, which do not actually explain how it works ("detects the structure and divides it into parts" and "looks for the most relevant section"). More detail about how this works is needed.

7) FrameNet training: You write that "we found the most general ones [frames]" "to our task". But what is the task here, and how did you proceed to select the frames?

8) FT3 Ontology: Before describing the developed time ontology, I would appreciate some clear examples and reasons why the existing ontologies are not sufficient for your specific case.

9) Legal knowledge graph: This section should provide information about a knowledge graph that can be extracted with your tools: the actual data used for the creation, the number of generated triples, example queries, etc.

10) Event types: Can you describe in more detail what are the "procedure" and the "circumstance" event types? How come some events such as "marry" (see Fig. 5) are seen in both types? What are you doing with this information; how do you decide which "ft3:hasType" you use in the end?

11) Simple sentences: At the beginning of Section 3.2.1, you mention that you generate "simple sentences", some information regarding specific words ("lodge"), and the inclusion of the "They" term. Without any examples, this procedure is rather unclear to me.

Figures: Several figures need an update:
- Fig. 1 - Fig. 4: no axis labels
- Fig. 2: y axis has too many overlapping numbers
- Fig. 7: What does "WORD" mean? Why the capital letters? I do not understand the "NO" branch from "special case?" to "deppar" and from "has event?" to "deppar".
- Fig. 8: too small
- Fig. 9: too small

Minor:
- Abstract: "acces[s]ible"
- Section 1: "Different[ from]..." (twice)
- 1: Explain "journalistic event"?
- 1: "corpus temporal annotation work"?
- 2.1.1: "a[] specific"
- 2.1.1: You could remove the part about TIMEX2 if it is not used anyway.
- 2.1.1: "literature[ ][13, 14]"
- 2.1.1: "and focus[es] on"
- 2.1.1: "MUC[ ][21]"
- 2.1.1: "based [o]n" (twice)
- 2.1.2: The section starts with ontologies about time but then also discusses events. Make that clear at the beginning of the section.
- 2.1.2: "LKIF[ ][32]"
- 2.1.2: "LegalRuleML[]19"
- 2.1.2: "knowledge.[ ]It"
- 2.2: "reviewed 150 events extracted 18 sentences from"?
- Fig. 2: "Number[]"
- 3: "Based on a previous work[]"
- 3.2.1: "adding has generic subject 'they'"?
- 3.2.2: "all [this] information"
- 4: "On the [other] hand"
- 4.2: "[All] these examples are discussed"
- 4.2: "the problematic existing between"?
- 4.2: The link does not break the line.
- 5: "in a different format[]"
- 5: "in ou[r] ontology"
- Table 4: "information contained different types"?
- 6: "The junction ... allow[s]"
- 6: "WhenTheFact process[es]"
- 6: "events f[ro]m a specific year"
- 6: "exploit the [k]nowledge"

== Code and Data ==

Your main contribution of the paper is the provision of services/tools. However, the website mainly focusses on demo use cases and the tools are not properly documented and structured to be ready to use for a broader audience (even though you state "The code ... can be freely adapted").

a) The GitHub repository (https://github.com/mnavasloro/FromTimeToTime/) does not have any documentation in the readme file.
b) The code documentation could be extended.
c) There are no dedicated entry points to the tools (e.g., Main class with CLI input). The class oeg.eventextractor.Main is empty.
d) File paths are personalised (see https://github.com/mnavasloro/FromTimeToTime/blob/main/whenthefact-core/...).
e) It would be good to have human-readable versions of the frames.ser and events.ser file so users can easily update it and use it in the code.
f) Resources are duplicated across projects (e.g., the events.ser).
g) The knowledge graph itself is not available for download.
h) The text about footnote 25 mentions that the zenodo link also contains the documentation, but it does not.

== Website ==

The website looks nice and gives a good impression of the work. However, for the same reasons I mentioned about the code, the website is also important for the provision of tools and needs to be updated:
i) The "QUERY" part does not work (Error message: "JSON.parse: unexpected end of data at line 2 column 1 of the JSON data").
j) The GitHub link on the bottom links to the Annotador repository.
k) The Pipeline page (footnote 6) is not linked on the website. In general, I like that one can test the whole pipeline, but I don't think the result should be published in the public knowledge graph.
l) It would be nice to have an introductory paragraph about the whole idea/goal of the website before jumping into the tools (WHENTHEFACT etc).

Review #2
Anonymous submitted on 14/Oct/2021
Suggestion:
Reject
Review Comment:

Despite the consideration made by the authors in the cover letter, this paper must be reviewed as "Tools and Systems Report". The paper is far from a full research paper and does not add relevant scientific contributions to the Semantic Web field; instead, it uses existing tools to create an event-based knowledge graph.

The topic addressed in the paper is very relevant, timely, well-introduced and well-motivated, which makes it an inspiring read. On the other hand, the related work section is long, does not explain the paper's contributions in relation to the state-of-the-art and is difficult to follow.

The authors mention "first step for building a knowledge graph was to decide the source of the documents, since there are important differences among jurisdictions, even when they share the language". A suggestion to improve readability is to provide an example. I can only guess what the differences are.

The authors also mention "From the analysis performed in the EventsMatter corpus, we can confirm the importance of the sections in identifying which events are relevant and which are not". It would be better if the authors would let the readers 'confirm' by providing in-depth analysis and supporting data. The figures presented do not seem enough to confirm the importance of the sections rather than the frequency of facts.

The authors also mention that the "Structure Extractor is currently able to handle the structure of the ECHR and ECJ documents, but in such a way that a new document type can be easily added". Looking at the code and the lack of documentation/comments, it does not seem trivial to change a single line in the code provided. Paths are hard-coded, filenames are hard-coded and no design pattern was found to handle the extensibility claimed in this paragraph. An example is the method "parseAndTag". In general, Section 3 is confusing and terms are mentioned without a proper introduction. For example, the authors mention the EventsMatter corpus, but it is 'properly' explained only in a subsection (the same holds for FrameNet). Section 3 requires restructuring. As mentioned before, a running example would help the authors to explain the process.

The presented ontology is also not well-explained. For example, the class "ComposedTemporalExpression" can be represented in other ontologies. Simplification can lead to other issues. Again, examples would allow readers to see the 'importance' of such a class.

The remaining sections suffer from similar issues.

Another main concern about the paper is the evaluation of the tool. The authors did not provide any evaluation/experiments—a critical point for a tool.

As previously mentioned, the code is available but not easy to run, lacks documentation and coding standards are non-existent (https://www.oracle.com/java/technologies/javase/codeconventions-contents...).

Review #3
By Simon Steyskal submitted on 09/Nov/2021
Suggestion:
Major Revision
Review Comment:

## Abstract
---

```
[p.1, left, 15-16]: "legal decision" == "judgements" ?
[p.1, left, 17]: "preferably structured" as in?
[p.1, left, 20]: "is accesible in an open modality" -> "is available as open source" or s/accesible/accessible/
```

## Introduction
---

```
[p.1, right, 41]: how exactly is the event-centric approach different from your event-based one?
[p.1, right, 51]: isn't [2] addressing event processing?

[p.2, left, 19]: what are "nationals"? do you mean national courts?
[p.2, left, 22]: here Knowledge Graph is written in uppercase while e.g. in [p.2, right, 5] it's lowercase.. be consistent!
[p.2, left, 31]: an event's relevance?
[p.2, right, 3]: remove "expressly"
[p.2, right, 7]: "familiar with semantic web"
[p.2, right, 20]: what are time-related formats?
[p.2, right, 31-32]: what European Courts? also, you most likely don't extract events from the courts themselves but from documents, legislation, ... those courts "create" right?
[p.2, right, 35]: we created for translating between different ... formats. (what formats?)
```

## Related Work
---

```
[p.3, left, 3]: what are those "several tasks" involved?
[p.3, left, 10]: has been tackled in what contexts/domains? any?
[p.3, left, 12]: missing some initial references for those ontologies/schemata
[p.3, left, 13]: "top approach" == "top-down approach" ?
[p.3, left, 14-17]: a bit wishy-washy.. what about OWL-Time (https://www.w3.org/TR/owl-time/)? what "real world realizations" are you talking about here?
[p.3, left, 19]: "focus on identifying predefined temporal patterns"
[p.3, left, 19-24]: but can't you use ontologies (maybe together with SHACL) to "specify subtypes and expected arguments for each kind of event"
[p.3, left, 30-35]: missing references
[p.3, left, 35]: a specific use case
[p.3, left, 36]: ISO TimeML standard
[p.3, left, 37-38]: according to whom?
[p.3, left, 43]: references to what?
[p.3, left, 44]: what are examples of day times larger than a day? (just curious as I didn't find a clear answer in ISO 24617 directly)
[p.3, left, 45]: "the lasting of something" -> rephrase; "repetitive" -> repeating (or interval?)
[p.3, left, 46]: who's "them" that TimeML marks up?
[p.3, left, 49-51]: what relevance do those tags have for the present paper? what do they mean? why mentioning them if you don't elaborate on them further?
[p.3, right, 1-2]: remove "we find"
[p.3, right, 2-3]: "in which was partially based TimeML" -> rephrase
[p.3, right, 4]: corpora such as? add refs
[p.3, right, 5]: no longer used because ..?
[p.3, right, 11-12]: There are also TimeML extensions for specific domains such as the THYME project for the medical domain.
[p.3, right, 16]: what challenges are you talking about?
[p.3, right, 17]: them == ?
[p.3, right, 18]:
[p.3, right, 19]: what literature? add refs
[p.3, right, 20]: focuses; events such as ..?
[p.3, right, 21]: ERE == ?
[p.3, right, 31]: , and annotates
[p.3, right, 33-34]: add reference
[p.3, right, 34]: aimed -> tasked?
[p.3, right, 35]: events of what?
[p.3, right, 39]: year's edition
[p.3, right, 43]: expose -> mentioned

[p.4, left, 4]: remove "in this line"
[p.4, left, 10]: "Inside the wide universe of" -> rephrase or remove
[p.4, left, 12]: "protest-event representation options" -> what?
[p.4, left, 14]: on previous approaches; what approaches? add refs
[p.4, left, 14-16]: according to whom? what makes projects/phd theses so special? I've seen journal papers that had more substance than some phd theses ;) also plural thesis -> theses
[p.4, left, 18-22]: add refs
[p.4, left, 29]: why's ACE a "challenge" ? aren't those guidelines?
[p.4, right, 49]: http://dhlab.fbk.eu/Timeline_events/ redirects to http://dh-server.fbk.eu/Timeline_events/ which doesn't load
```
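
A minimal sketch of what the SHACL remark on [p.3, left, 19-24] could look like in practice. Everything in the `ex:` namespace (the event class, its properties and the shape itself) is hypothetical and not taken from the paper or from the FT3 ontology; only the `sh:`, `rdfs:` and `xsd:` terms are standard vocabulary.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/legal#> .

# Hypothetical subtype of a generic event class.
ex:HearingEvent rdfs:subClassOf ex:Event .

# Hypothetical shape: every hearing event is expected to carry
# exactly one date and at least one participating agent.
ex:HearingEventShape a sh:NodeShape ;
    sh:targetClass ex:HearingEvent ;
    sh:property [
        sh:path ex:hasDate ;
        sh:datatype xsd:date ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path ex:hasParticipant ;
        sh:class ex:Agent ;
        sh:minCount 1 ;
    ] .
```

The ontology declares the subtype, while the shape states which arguments an instance of that subtype is expected to have. Whether this fits the authors' extraction setting is for them to judge.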

## Event Extraction
---

```
[p.5, right, 31]: share the same language
[p.5, right, 46]: the remainder

[p.6, Fig. 1]:
.) certain bars are basically not visible at all.. neither printed out nor digital;
.) why does the y-axis start from the top down?
.) x-axis is missing its label
.) events per paragraph per section

[p.6, left, 39-40]: I highly doubt that the EventsMatter corpus is "the >only< available corpus of judgments annotated with events".. a quick google search returned e.g. https://www.coli.uni-saarland.de/conf/linc-04/grover.pdf
[p.6, left, 41]: represents
[p.6, left, 50]: no such paragraph
[p.6, left, 51]: light-blue is basically not visible at all.. neither printed out nor digital
[p.6, right, 1]: remove "form"
[p.6, right, 2]: why at most 6? because there were at most 6 events per paragraph?
[p.6, right, 3]: what is "This is" referring to?
[p.6, right, 7]: paragraphs
[p.6, right, 10]: what "Chamber"?
[p.6, right, 17]: s/appreciated/seen/
[p.6, right, 19]: this section in more detail
[p.6, right, 48]: "belies it" -> what? do you mean https://www.merriam-webster.com/dictionary/belie? how does this fit in this context?
[p.6, right, 49-51]: so FINAL DECISION is not uniform then? ;)

[p.7, Fig. 2]:
.) y-axis is not readable
.) light-blue is not readable
.) again why does the y-axis start from the top?
.) caption: s/Numbert/Number/

[p.7, Fig. 3-4]:
.) light-blue is not readable

[p.7, right, 31]: what structure? the 5 sections shown in the previous figures?
[p.7, right, 34]: what? what types? what parents?
[p.7, right, 35-36]: most relevant in what context?
[p.7, right, 42]: what kind of document types?
[p.7, right, 51]: "semantic and syntactic considerations" -> what does that mean? what are examples of "syntactic considerations" you used?

[p.8, left, 18]: "with it" -> with what?; "adding has generic subject" -> what? rephrase!
[p.8, left, 21,40-41]: what's a frame?

[p.8, right, 7]: so for general kinds of texts it's less than 14% passive verbs?
[p.8, right, 8]: that 14%
[p.8, right, 10]: s/couples/pairs/
[p.8, right, 16-17]: adding new sentences and their respective types to the files.
[p.8, right, 18]: what main class?
[p.8, right, 19]: events.ser?!
[p.8, right, 26]: double underlined is barely readable
[p.8, right, 28]: how the frame would be
[p.8, right, 31-32]: what array?
[p.8, right, 33]: In passRels
[p.8, right, 39]: "plays on" ?!
[p.8, right, 40]: what percentages?

[p.9, right, 39]: most relevant == ?
[p.9, right, 41]: the frames == the most relevant frames?
[p.9, right, 43]: what Python script?
[p.9, right, 50]: lexical units?!

[p.10, left, 21-29]: the whole paragraph needs to be rephrased;
[p.10, left, 23]: it is just needed -> rephrase
[p.10, left, 26]: and it is this file
[p.10, left, 27]: remove "latter"
[p.10, left, 29]: what's the "pos"?
[p.10, left, 34]: s/pipeline/workflow/
[p.10, left, 48]: what's "application lodgement" and why is it a special case?
[p.10, right, 27]: similarly to the events
[p.10, right, 49]: "annotated xml and as a visual HTML" -> what's a visual HTML and how is it different from a "non-visual" HTML? why is xml lowercase while HTML is uppercase?
```

## FT3 Ontology
---

```
[p.11, left, 46]: s/double-folded/twofold/

[p.12, left, 26-28]: what implementations?
[p.12, left, 28-29]: abstract classes? so you are talking about classes in your implementation and not ones of the ontology?
[p.12, left, 30]: such as for the class temporal..
[p.12, left, 32]: remove "as an exemplary"
[p.12, left, 34]: according to whom?
[p.12, left, 44]: there's a whitespace in `ft3:Temporal Expression`
[p.12, left, 47]: s/Time./time./
[p.12, left, 48]: "the Time ontology" -> OWL Time?
[p.12, left, 51]: also add
[p.12, right, 40-44]: I would appreciate some .ttl that go along with your claim
[p.12, right, 50]: s/about/between/

[Fig. 8-9]: Printed out, both figures are barely readable as the labels are super small.. Maybe try spanning them over both columns to make them bigger?

[p.13, left, 12]: s/correference/coreference/ (-> fix throughout the whole paper!)
[p.13, left, 14]: an annotation attached
[p.13, left, 17]: what's a "midpoint" in that context? union, intersection, hybrid, merge, ..?
[p.13, left, 20]: "actual happening" -> rephrase
[p.13, left, 25]: remove "it is"

[p.13, right, 39]: what are "periodic temporal"?
[p.13, right, 40]: what's the "only expression"?

[p.14, left, 5]: specific use of what?
[p.14, left, 13]: s/Another way/
[p.14, left, 27]: what are the two possible results?
[p.14, left, 30-34]: please shorten and rephrase the entire paragraph..
[p.14, left, 34-38]: but there >is< a correct way of interpreting an event.. it just depends on the situation and particularities, the context, and requirements as you state.
[p.14, left, 39]: guarantee sounds a bit too ambitious.. even with the best and most extensive documentation out there, there is no guarantee that people will use your ontology.

[p.14, left, 44]: link overflows the columns
[p.14, left, 45-46]: "mainly consisting about minor coments and with no critical pitfalls" -> what?
[p.14, left, 46]: checked on

```

## FromTimetoTime Converter
---

```
[p.14, right, 1]: on [p.11, left, 45] you introduce FT3 as "fromTimeToTime" -> fix section header
[p.14, right, 3]: "lacks" is not correct here, rephrase!
[p.14, right, 6]: what further tasks?
[p.14, right, 7]: options for what?
[p.14, right, 9-10]: bridge between DL and pure NLP tasks?
[p.14, right, 11]: "lack" is also the wrong word here; also, "this lack" == ?
[p.14, right, 17]: s/in/into/
[p.14, right, 18-40]: try using a \begin{description} environment.. maybe this makes the list a bit more readable
[p.14, right, 44]: what's a "pivot class"? are you talking about the java implementation again?
[p.14, right, 45]: "interlingua" means what in that context?
[p.14, right, 47]: and what's the value of the map?
[p.14, right, 48]: s/each/an/
[p.14, right, 49]: how, where and why are metatypes "assigned"? what are those "metatypes" in the first place?

[Fig. 12]: s/xsd:String/xsd:string/

[p.15, right, 46]: what's the "Document format"?
[p.15, right, 48]: what "new format"?
[p.15, right, 51]: "ontology format" == ? ft3?

[Fig. 14]: In contrast to Fig.12, `ft3:hasID` uses """ (BigString?) here instead of xsd:string.

[p.16, left, 37]: where did you mention that?
[p.16, left, 44]: "we created one individual in out ontology" -> what?

[Fig. 15-16]:
.) I'm pretty sure 3^^xsd:.. is not correct.. -> "3"^^xsd:..
.) is a property?
.) why are the values of time:weeks, time:months xsd:decimal while ft3:repetitionTimes is xsd:nonNegativeInteger?
.) alternative values -> alternative to what? s/values/representations/

[p.17, left, 27]: on the; add link to webpage via footnote
```
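
To make the literal-related remarks on Fig. 12 and Fig. 14-16 concrete, here is a small Turtle sketch. The `ft3:` namespace IRI is an assumption taken from the ontology comments quoted further below, and the `ex:` resources are purely hypothetical; `time:weeks`, `ft3:repetitionTimes` and `ft3:hasID` are the properties named in the notes above.

```turtle
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix time: <http://www.w3.org/2006/time#> .
# Assumed namespace IRI for ft3 (not confirmed by the paper).
@prefix ft3:  <https://fromtimetotime.linkeddata.es/ontology/fromtimetotime#> .
@prefix ex:   <http://example.org/> .

# A typed literal needs a quoted lexical form: "3"^^xsd:decimal.
# The unquoted form 3^^xsd:decimal is not valid Turtle.
ex:duration1 a time:DurationDescription ;
    time:weeks "3"^^xsd:decimal .

ex:expression1 ft3:repetitionTimes "3"^^xsd:nonNegativeInteger .

# A plain quoted string and a triple-quoted (long) string denote the
# same xsd:string value; the long form only matters for multi-line text.
ex:expression2 ft3:hasID "t1" .
ex:expression3 ft3:hasID """t1""" .
```

A bare `3` without any datatype suffix is also valid Turtle, but it abbreviates `"3"^^xsd:integer`, which would not match a property whose range is `xsd:decimal` or `xsd:nonNegativeInteger` as written.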

## Legal KG
---

```
[p.17, right, 1]: processes
[p.17, right, 2]: which are?
[p.17, right, 7]: "updated to the KG" == ? added to?
[p.17, right, 10]: queried via
[p.17, right, 11]: On this endpoint
[p.17, right, 14]: rephrase the last sentence
[p.17, right, 18]: what security reasons?
[p.17, right, 21]: the way triples are stored
[p.17, right, 24]: URL
[p.17, right, 26]: s/exploit/utilize/
[p.17, right, 27]: knowledge graph vs. Knowledge Graph (e.g. in [p.17, right, 35-36])
[p.17, right, 32]: s/alabi/alibi/
[p.17, right, 37]: for time-related
[p.17, right, 38]: compared to what?
[p.18, left, 1-8]: Something like this doesn't exist already?
```

## Conclusions
---

```
[p.18, left, 17]: rephrase
[p.18, left, 31-32]: rephrase
[p.18, left, 37]: not related to temporal
[p.18, right, 4]: foreign -> not familiar with

```

## References
---

```
[2,4,59]: I was about to remark that the E in ESWC stands for Extended (since 2010 when it changed from European to Extended, cf. https://2022.eswc-conferences.org/history/) however, both the official ESWC twitter account https://twitter.com/eswc_conf as well as e.g. ESWC on Springer https://link.springer.com/conference/esws do call it European.. So I'm not sure anymore :D
```

## fromtimetotime.owl
---

1. what's the purpose of namespace `rdf1`? I guess that's a typo?

```turtle
@prefix rdf: .
@prefix rdf1: .
...
rdf:type owl:ObjectProperty ,
owl:SymmetricProperty ;
rdfs:domain ;
rdfs:range ;
rdf1:type owl:ObjectProperty ;
...
```
2. why not use `ft3:` as a prefix? this would make the whole ontology way less verbose
```turtle
@prefix ft3: .
...
ft3:and rdf:type owl:ObjectProperty , owl:SymmetricProperty ;
rdfs:domain ft3:ComposedTemporalExpression ;
rdfs:range ft3:ComposedTemporalExpression ;
...
```
3. Wrong namespace for `vann` -> the leading `:` makes Protege believe the URIs actually belong to the `:` namespace (which can be seen from the automatically generated comments)
```turtle
### https://fromtimetotime.linkeddata.es/ontology/fromtimetotime#vann:prefer...
:vann:preferredNamespacePrefix rdf:type owl:AnnotationProperty . #wrong
vann:preferredNamespacePrefix rdf:type owl:AnnotationProperty . #correct

### https://fromtimetotime.linkeddata.es/ontology/fromtimetotime#vann:prefer...
:vann:preferredNamespaceUri rdf:type owl:AnnotationProperty . #wrong
vann:preferredNamespaceUri rdf:type owl:AnnotationProperty . #correct
```
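
A sketch of how point 3 could look once `vann` is declared as its own prefix. The `vann` namespace IRI is the standard one; the ontology IRI and the `"ft3"` prefix value are assumptions inferred from the generated comments above, not confirmed by the authors.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix vann: <http://purl.org/vocab/vann/> .

# With the prefix declared, the annotation properties resolve to the
# vann vocabulary instead of the ontology's default namespace.
vann:preferredNamespacePrefix rdf:type owl:AnnotationProperty .
vann:preferredNamespaceUri rdf:type owl:AnnotationProperty .

# Assumed ontology IRI, shown only to illustrate usage.
<https://fromtimetotime.linkeddata.es/ontology/fromtimetotime> rdf:type owl:Ontology ;
    vann:preferredNamespacePrefix "ft3" ;
    vann:preferredNamespaceUri "https://fromtimetotime.linkeddata.es/ontology/fromtimetotime#" .
```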