World War 1 as Linked Open Data

Tracking #: 458-1635

Juha Törnroos
Eetu Mäkelä
Thea Lindquist
Eero Hyvonen

Responsible editor: 
Jens Lehmann

Submission type: 
Dataset Description
The WW1LOD dataset is primarily a reference dataset meant to bind together collections dealing with the First World War. For this purpose, the dataset gathers events, places and agents related to the war from various authoritative sources. These are then made available for indexing and other use through a variety of interfaces and APIs. Additional information on the entities is also collected, in order to be able to answer more complex questions relating to them. The approach is being evaluated using a concrete WW1 online collection.

Major Revision

Solicited Reviews:
Review #1
By Francois Scharffe submitted on 03/May/2013
Minor Revision
Review Comment:

The paper presents a dataset focused on World War 1 events in Belgium. The paper is well presented. The dataset is well structured using CIDOC-CRM. The dataset is relatively small, with about three thousand instances.
DBpedia links to events were found automatically but not verified. Verifying 100 sameAs links is a matter of a few hours (I did it recently). This should be done.
Geonames and DBpedia sameAs links are mentioned, but a query to the SPARQL endpoint only returns sameAs links between events inside the dataset.
As the dataset is focused on Belgium, the title could be more precise, e.g. "World War 1 in Belgium as Linked Open Data".

"This seemd to fit well"
"ffor the Battle of"
missing ref:
"1914-1918 [? ],"

Review #2
By Aidan Hogan submitted on 13/May/2013
Major Revision
Review Comment:

This paper discusses the authors' efforts to seed structured data about WW1 as Linked Data. Given the wealth of diverse historical information about WW1 locked away in various textual resources, the authors propose Linked Data as a means of collating together relevant information into machine-readable formats. To organise this collection, the authors use CIDOC-CRM (an existing standard for integrating historical data) as inspiration, whereby events play a central role in the modelling and where diverse information can then be related to these events, allowing the integrated data to be "clustered" in effect. The authors propose a Linked Data model with a similar event-centric aim and instantiate it with 261 wartime events associated with various places and times. These events form the core around which other data can then be linked by various parties. The authors go into detail on one such facet of WW1, creating a dataset linked to the higher-level events: atrocities committed by the German army in Belgium in 1914. These data are then exported as RDF, with search and some faceted browsing features enabled, an RDF dump and a SPARQL endpoint.

What I most like about the paper is that it gives interesting insights into the difficulty of structuring and interlinking cultural and heritage data: for example, the types of sources that are available, expectations of historians, difficulties in linking places since some no longer exist, dealing with uncertain time-frames, etc. I also like the premise of building a high-level description of WW1 events as the core basis for a (possibly decentralised) Linked Data knowledge-base to "grow" from. Finally, I can appreciate that this is not just "some data" for the authors, who have given time and thought into how best to represent WW1 knowledge as Linked Data.

My concerns for the paper are then primarily on the technical side and also with respect to the preliminary nature of the work.

From a technical perspective, I have four main concerns:

1) Unless I'm mistaken, the URIs used by the authors in the dataset are not dereferenceable for RDF content. I'm taking for example:

I cannot seem to dereference RDF content from this URI using typical Accept headers. This lack of suitably dereferenceable URIs means that there is no Linked Dataset here (although a dump and a SPARQL endpoint are indeed offered, the core point of Linked Data is precisely to move away from dumps of RDF).
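The dereferencing check described here can be sketched in a few lines of Python: request the URI with RDF media types in the Accept header and inspect the Content-Type of the response. The URI in the example is a placeholder (the actual URI is not reproduced in this review), and the list of media types is an illustrative selection, not an exhaustive one.

```python
# Sketch of a Linked Data content-negotiation check; the URI used at the
# bottom is hypothetical, not the actual WW1LOD URI.
from urllib.request import Request, urlopen

RDF_TYPES = ("application/rdf+xml", "text/turtle",
             "application/n-triples", "application/ld+json")

def is_rdf_content_type(content_type):
    """True if the Content-Type header names a common RDF serialization."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in RDF_TYPES

def dereferences_to_rdf(uri):
    """Request the URI with RDF Accept headers and inspect the response type."""
    req = Request(uri, headers={"Accept": ", ".join(RDF_TYPES)})
    with urlopen(req) as resp:
        return is_rdf_content_type(resp.headers.get("Content-Type", ""))

if __name__ == "__main__":
    # Network call against a real URI would be:
    #   dereferences_to_rdf("http://example.org/ww1lod/event/123")  # hypothetical
    print(is_rdf_content_type("text/turtle; charset=utf-8"))  # an RDF type
    print(is_rdf_content_type("text/html; charset=utf-8"))    # not an RDF type
```

A server that only ever answers with `text/html` to such a request is serving a web page, not Linked Data, regardless of what its dump or endpoint contains.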

2) Similarly, looking up the vocabulary namespace:

I couldn't even get a reasonable HTML description of the terms.

3) I had difficulty accessing the SPARQL endpoint from:

There is indeed a Fuseki installation there, but no HTML interface with which to play around. I got one of the queries to work using an external API, but again, a Web-based SPARQL form would help usability.

4) As a dataset description, I would welcome more statistics about the data: how many triples in total, (aside from the core model) how many entities in total, how many classes and properties (internal/external), how many links to external datasets and which datasets, what external vocabulary terms are re-used, etc.

My final concern is that the work is still preliminary. For example, in the dump, I count approximately 20,000 triples. The authors indeed acknowledge that there is more work to do, and although I value the direction of the work, I am admittedly doubtful as to whether or not there is enough concrete contribution and impact *yet* to warrant a journal publication describing the dataset. I'm sitting on the fence and I'll leave this to the other reviewers and editors to also judge.

As a minimum requirement for acceptance, however, I think the four technical points must be addressed as major revisions.

* "The origins of the dataset are in user needs research" ... rephrase or hyphenate?
* Fig 1. "type of event" edge points the wrong way?
* Some references did not compile (Principal Events, 1914-1918)
* Better to use an en-dash for intervals ("1914--1918" in TeX)
* "reasearch" / "ffor" / ... spell check.

Review #3
By Michael Martin submitted on 29/May/2013
Review Comment:

*** Summary
The authors describe in the paper a dataset dealing with events, locations and agents from the First World War, which is presented as a reference dataset.
This dataset was published using various interfaces and applications, such as a SPARQL endpoint.

In the following, I will give a detailed review along the dimensions published in the call:

##################### Usefulness (or potential usefulness) of the dataset,
The dataset is described as a reference dataset. A specific use case is not given, but readers can imagine that interesting queries would be possible,
at least if a few more links to other datasets were included (which is also noted in Section 5).
I think historians would be happy to have this and similar datasets. This is the first time I have heard of datasets addressing historical war events.
Maybe such a dataset could act as a starting point in this domain.
The given examples show that the dataset can be used to extract information about agents, locations and events from the First World War.

##################### Clarity and completeness of the descriptions.
Overall, the paper is well written, but it must be enhanced on a few points. There is still room for improvement on the sixth page.
As described in the next section, the descriptions of the dataset itself, its concepts, and the process undertaken to create it must be enhanced.

Regarding the criterion "completeness of the dataset", the dataset can be rated as rather incomplete, or better: as a reader I am not sure about its completeness.
It would be very helpful to know, for instance, approximately how many events happened during WW1, and to see a comparison to those included in the dataset.
I only get the information that 690 events are addressed in the dataset.

##################### Quality of the Dataset

**** Name, URL, versioning, licensing, availability, source for the data ****
I looked for a landing page / project page announced in the paper but was not able to find one. In my opinion, this is mandatory to provide further descriptions and
possibly link the interfaces and describe the maintainers and maintenance. The examples could be listed there as well.

The name of the publication is "World War 1 as Linked Open Data".
Having a look into the dataset (the linked dump file) gives me no information about that.
I was not able to find a resource of type owl:Ontology or similar where such information would be expected (label, versioning, licensing, authors, maintainers, etc.).

As described, the dataset is published under CC-BY-SA 2.0, which is almost fine. I only wondered why the authors did not publish the dataset using the current/latest version of the license.
Having a look gives me the hint that there is a newer version of the license available.
This is not a problem to me; it is only a hint to think about updating the licensing.

All given URLs in the Data Access section (Section 3) returned results quickly. The content seems to be correct.
After I downloaded the dump of the dataset, I tested whether a few of the properties used in it are dereferenceable.
So I tried, for instance:
and the results were in both cases a 404. This should be changed in order to publish the dataset properly.

**** Purpose of the Linked Dataset, e.g. demonstrated by relevant queries or inferences over it ****
The description contains a section about example queries, which are well described. I tested them using the SPARQL endpoint and both return the desired results.
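Testing such example queries against a SPARQL endpoint can be done over plain HTTP, as the SPARQL 1.1 Protocol specifies. The endpoint URL and query below are illustrative placeholders, not the paper's actual endpoint or examples.

```python
# Sketch of running a SELECT query against a SPARQL endpoint over HTTP GET;
# the endpoint URL and query are hypothetical placeholders.
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def sparql_select_url(endpoint, query):
    """Build the GET URL for a SPARQL SELECT query (SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urlencode({"query": query})

def run_select(endpoint, query):
    """Execute the query and return the JSON result bindings."""
    req = Request(sparql_select_url(endpoint, query),
                  headers={"Accept": "application/sparql-results+json"})
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

QUERY = "SELECT ?event ?label WHERE { ?event rdfs:label ?label } LIMIT 10"

if __name__ == "__main__":
    # run_select would issue the actual network request; here we only show
    # the request URL that would be sent to a (hypothetical) endpoint.
    print(sparql_select_url("http://example.org/ww1lod/sparql", QUERY))
```

Because this path needs nothing beyond HTTP, copy-and-paste problems like the PDF quoting issue mentioned below are easy to diagnose: the query string is visible verbatim in the request URL.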

**** Applications using the dataset and other metrics of use ****
This dataset is intended as a reference dataset, but I could not find any application that uses / reuses it at the moment.
Maybe this should be described / discussed in the paper. A few use cases would show the usefulness / impact of the dataset.

**** Creation, maintenance and update mechanisms as well as policies to ensure sustainability and stability ****
The creation of the dataset was sketched and the contributors were referenced. A description of the maintenance and update mechanisms is missing and
would be nice to include in the paper.

**** Quality, quantity and purpose of links to other datasets ****
Section 2 contains a description of the instances in the dataset.
An overall count of triples is not included, but using rapper shows that the dataset is rather small:
rapper -i turtle -c ww1lod.ttl
rapper: Parsing returned 20254 triples

The given Table 1 is nice to have and shows the instances of the respective types. What I do not understand is the acronym given in the headline of column two.

The authors describe that they automatically created a little over 100 owl:sameAs links to DBpedia. Given that this is not very much and can surely be improved, it would be interesting
to know how these automated links were created. There are many tools available, such as Silk or LIMES.
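The simplest family of techniques such tools implement is label-based matching. As a hedged sketch (the paper does not say how its links were actually made, and Silk/LIMES offer far richer metrics), sameAs candidates can be generated by comparing entity labels with a string-similarity threshold; all URIs and labels below are illustrative.

```python
# Illustrative label-similarity linking, NOT the authors' actual method.
from difflib import SequenceMatcher

def normalize(label):
    """Lowercase a label and collapse whitespace for comparison."""
    return " ".join(label.lower().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link_candidates(local, remote, threshold=0.9):
    """Yield (local_uri, remote_uri) pairs whose labels exceed the threshold."""
    for l_uri, l_label in local:
        for r_uri, r_label in remote:
            if similarity(l_label, r_label) >= threshold:
                yield (l_uri, r_uri)

# Hypothetical entities: one local event, two DBpedia resources.
local = [("ww1:battle_of_liege", "Battle of Liege")]
remote = [("dbpedia:Battle_of_Liège", "Battle of Liège"),
          ("dbpedia:Battle_of_Mons", "Battle of Mons")]

for a, b in link_candidates(local, remote):
    print(a, "owl:sameAs", b)
```

Candidates above the threshold would still need the manual verification that Review #1 asks for; a fuzzy match like "Liege" / "Liège" is exactly the kind of pair that is usually right but occasionally disastrous.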

**** Domain modeling and use of established vocabularies ****
Within the description (paper), a core data model is given, illustrating the vocabulary on an abstract level. I would prefer an illustration, or at least a description,
that denotes not only the abstract concepts but also the namespaces they come from. This would give readers an impression of what is published on your
side and what was re-used.
For instance, the concept Agent could perhaps be re-used from the FOAF vocabulary, but the one used is not dereferenceable.
The concept Place could be re-used from LinkedGeoData / DBpedia / spatialHierarchy, etc.
Longitude and latitude are taken from the WGS84 vocabulary, which is a perfect choice, but this is not described in the paper.

**** Examples and critical discussion of typical knowledge modeling patterns used ****
Examples are given in Section 4, which I was able to use for testing. I tested them using the public SPARQL endpoint given in Section 3.
One smaller issue was copying them from the paper; I had to refine them in order to get them working (but this could be a problem of my PDF reader).
Maybe it would be an improvement to add links (PURLs / tiny URLs) to the project page and the paper to make these examples easier to run.

A few discussions (but not critical ones) are included, such as modeling places/locations and their temporal relations.
The authors discussed that this is a difficult topic, especially in the domain of war events.
It would be nice to add more description here of how it was solved. For instance, the DBpedia ontology contains a few concepts to encode locations, and vocabularies
such as spatialHierarchy address the same topic of how to encode a temporal dimension in a spatial domain.

**** Known shortcomings of the dataset ****
Shortcomings are not explicitly described, but as written in Section 5, "Discussion and Future Steps", the creation and enrichment of the dataset is still ongoing.

##################### Minor remarks
Page two, Section 2, first column:
"This need was discovered early on in indexing the primary sources" -> "This need was previously discovered in indexing the primary sources"

Page two, Section 2, second column:
[?] -> citation not well rendered

Page three, Section 2, second column:
reasearch -> research

Page four, Section 4, second column:
"Here, the concept selection function ... are key." -> something is missing here, so I am not sure what the semantics of this sentence are.

Page 3, footnote:
Instead of using a DBpedia link, I recommend using a citation. There are many that can be taken from