The debates of the European Parliament as Linked Open Data

Tracking #: 1106-2318

Authors: 
Astrid van Aggelen
Laura Hollink
Max Kemman
Martijn Kleppe
Henri Beunders

Responsible editor: 
Natasha Noy

Submission type: 
Dataset Description
Abstract: 
The European Parliament represents the citizens of the member states of the European Union (EU). The accounts of its meetings and related documents are open data, promoting transparency and accountability, and are used as source data by researchers. This paper presents LinkedEP, a Linked Open Data translation of the verbatim reports of the plenary meetings of the European Parliament. These data are integrated with a database of political affiliations of the Members of Parliament and linked to three other Linked Open Datasets. The resulting data of over 25 million triples are available through a user interface and a SPARQL endpoint, enabling queries about the monthly sessions of the European Parliament, the agenda of the debates, the spoken words and their translations into other EU languages, and information about the speakers such as affiliations to countries, parties and committees. The paper discusses the design and creation of the vocabulary, data and links, as well as known use of the data.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Konrad Höffner submitted on 11/Jun/2015
Suggestion:
Major Revision
Review Comment:

The paper describes a Linked Open Data version of the debates of the European Parliament. The significance of the data is well motivated, both for scientific purposes and for EU citizens as voters. Evidence for corresponding third-party uses needs to be expanded, however. While web page visits are a positive indicator for the usefulness of the data, graphs of page visits and number of searches performed by date are unnecessary and can be removed to comply with the page limit. The existing sentences describing the frequency of the page visits should be removed or compressed to one sentence. Instead, a list of specific use cases should be added, ideally accompanied by existing third-party applications. The material is mostly there, so it just needs to be extracted from the given examples, reworded (e.g. for Example 2 the use case is "study of financial discussion by social scientists") and properly referenced.

Basic attributes such as name, URL and topic coverage are given, but version date and number, the specific license (it is only described as "open"), data stability and quality analysis, method of creation and maintenance are missing and need to be added.

The benefit of the conversion to Linked Data compared to the source data regarding the provided use cases needs to be demonstrated as well.

Both the original 5-star rating and the Linked Data vocabulary rating are employed, resulting in 5 and 4 stars, respectively.

With 11 (10 1/2) pages the submission is above the 10 page limit. I suggest compacting some of the figures so that they fit in one column. Some trivial sentences and examples can be left out as well, for example "To verify this hypothesized match, it was embedded as the subject of an ASK query...", "leading us to adopt, for instance, the term session instead of part-session." and the Polish DBpedia URI.

At the same time, other, non-trivial parts need to be more explicit, for example "all triples that could be generated ... by a reasoning engine ... such as inverse properties" (please list all of them), and "a number of properties are introduced which (should be "that" or "which," with a comma, depending on the intended meaning) are redundant" (please give an example here).

All in all, the dataset and its description are highly relevant and well constructed but the missing and incomplete parts noted above lead me to recommend a rating of "major revision".

Further comments:
- Please clarify how you arrived at the error margin of 8%: was it calculated using a 95% confidence interval or a different method?
- The evaluation of the link set only uses 50 instances. Please use a larger sample size (at least 100) to get a more accurate result.
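For what it's worth, a normal-approximation 95% confidence interval for the reported precision of 90% on the 50-instance link sample reproduces the quoted 8% almost exactly; whether this is actually how the authors computed it is an assumption they should confirm:

```python
import math

# 95% normal-approximation margin of error for an estimated proportion:
# z * sqrt(p * (1 - p) / n), with z = 1.96 for 95% confidence.
p = 0.90   # estimated precision of the links, as reported in the paper
n = 50     # sample size used in the link evaluation
z = 1.96   # z-value for a 95% confidence interval

margin = z * math.sqrt(p * (1 - p) / n)
print(round(100 * margin, 1))  # prints 8.3, i.e. roughly the 8% reported
```

Note that the normal approximation is shaky at n = 50 with p near 1, which is another reason to use the larger sample requested above.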

Review #2
By Alvaro Graves submitted on 30/Jun/2015
Suggestion:
Reject
Review Comment:

The authors describe how they converted the transcripts of the debates of the European Parliament into RDF. They provide details on different aspects of the URI creation, the vocabulary used and how it has been published.

My impression is that this could have been a much more interesting project. So far, the conversion of the data looks pretty straightforward and does not bring novelty in techniques or procedures that could be applied to other data conversion projects. I'm afraid that, in general, we should stop publishing papers for every converted dataset unless there is a novel contribution.

Now, the particular case of this dataset presents several problems. First, it is a more or less direct translation of the dataset already available. While it is always helpful to have an RDF version of a dataset, it is not clear how that helps final consumers; it is important to note that this dataset was converted with users with a background in the humanities in mind, who aren't necessarily experts in semantic technologies. My guess is that they could benefit much more from CSV files. Speaking of users and usage, I was expecting a more complete discussion in Section 6 of who uses the data and for what; showing some raw numbers doesn't tell much, I'm afraid.

In terms of modeling (Section 3) there are several things that are incorrect. First, in Section 3.1 the authors mention that "the parts of the plenary sessions are treated as events and archived documents at the same time. That is, one item is simultaneously assigned document-like properties, such as textual content, and event-like properties, such as speaker, or properties ambiguous in this respect, e.g. has part. This choice is supported by the nature of the source materials, since verbatim reports are a direct account of reality." This is wrong, since an event, such as a plenary session, is disjoint with a document. In this case it would have been better to model two instances: a document (say a foaf:Document) that was generated based on an event (maybe a subclass of lode:Event). The link between the two could be a PROV-O predicate (prov:wasDerivedFrom comes to mind, but a more suitable one may be available). Mixing both is not only incorrect in terms of modeling but could also lead to strange results when using a reasoner.
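The alternative the reviewer proposes could look roughly like this in Turtle; the instance URIs and the choice of lode:involvedAgent and prov:wasDerivedFrom are illustrative assumptions, not the paper's actual model:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lode: <http://linkedevents.org/ontology/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# The debate itself, modelled as an event with event-like properties.
ex:debate-2015-06 a lode:Event ;
    lode:involvedAgent ex:speaker-jane-doe .

# The verbatim report, modelled as a separate document
# derived from (not identical to) the event.
ex:report-2015-06 a foaf:Document ;
    prov:wasDerivedFrom ex:debate-2015-06 .
```

Keeping the event and the document as distinct resources avoids asserting membership in two (plausibly disjoint) classes for a single resource.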

In Section 3.4 the authors mention the use of unclassifiedMetadata as a generic annotation. It would have been much more interesting to model these events (e.g., applause) and describe them as part of the speech structure. Another idea that would have made this paper far more interesting is modeling the topics of each speech (using Latent Dirichlet Allocation or another technique [2]). Then, instead of adding regular expressions to SPARQL queries to find themes, final users could simply dereference topic URIs and apply a follow-your-nose approach to discover speeches related to those topics.

In Section 4.1 the authors mention that they tried to match http://dbpedia.org/resource/firstname_lastname. They issued an ASK query for that URI and, if the result was true, counted it as a match. That is a _terrible_ idea in my opinion, since URIs should be opaque [1]: what if we look for John Smith, but dbpedia.org/resource/john_smith is about a different person, or is a disambiguation page? As described later, the authors found "a precision [matching] level of 90 % (± a margin of error of 8%)" on a sample of 50. It is not clear why this wasn't done manually: dereferencing 1455 URIs may be boring, but it is perfectly doable in a day (assuming a human needs 10 seconds to look at a page and understand what it is about, you need around 4 hours). Another thing the authors did not explain is why they also chose to include the Polish DBpedia (and why only that one and not others). Finally, it would have been nice to see a discussion of how to expand and maintain this dataset (do you need to reconvert the whole dataset every time? do you have incremental conversions?).
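If the ASK-based heuristic is kept, the disambiguation-page risk could at least be screened for with a second query; a sketch, using DBpedia's dbo:wikiPageDisambiguates property (the example URI is hypothetical):

```sparql
# A true result means the candidate URI is a disambiguation page,
# so it should be rejected rather than counted as a match.
PREFIX dbo: <http://dbpedia.org/ontology/>

ASK {
  <http://dbpedia.org/resource/John_Smith> dbo:wikiPageDisambiguates ?target .
}
```

This does not address the deeper problem of a name-pattern URI denoting a different person with the same name; only manual or evidence-based verification catches that.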

At the implementation level, a simple request using curl -iLH 'Accept: text/turtle' http://purl.org/linkedpolitics returns only HTML, and curl -iLH 'Accept: text/turtle' http://purl.org/linkedpolitics/void returns no content (same happens when asking for RDF/XML). I would expect the site at least to be explorable by RDF agents.
Also, in my opinion, an important part of a submission of this nature is the source code used to convert the data, including notes and other documentation produced during the process; this not only makes the contribution more transparent for others, but also allows them to replicate the effort and reuse existing tools. Sadly, I couldn't find any mention of the code or documentation used.

[1] http://www.w3.org/DesignIssues/Axioms.html#opaque
[2] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

PS: If you feel I did not understand part of your paper correctly, or simply have questions regarding this review, please feel free to write to me at alvaro[AT]graves[DOT]cl

Review #3
By Adegboyega Ojo submitted on 14/Sep/2015
Suggestion:
Accept
Review Comment:

Summary

The article describes the process employed for creating a Linked Dataset (LinkedEP) based on the verbatim reports of meetings of the European Parliament (EP). The authors discuss in detail the importance and richness of the contents of the EP proceedings (e.g. the availability of references to online databases of Members of the EP). The paper further discusses in sufficient depth the developed vocabulary (LinkedPolitics) and the use of existing vocabularies (FOAF, DC, VoID, OMV, PROV) in modelling the different hierarchies and contents of the proceedings. The authors also describe their approach to simplifying access to the available information in the constructed dataset, e.g. through additional relations ordering speeches and agenda items. Furthermore, they discuss their approach to handling multilingual contents, the enrichment of their dataset with external resources (DBpedia, for additional information on Members of Parliament) and exemplar queries for their dataset.

Comments:

The LinkedEP dataset, comprising speeches and presentations at the EP, constitutes a valuable resource for scholars in domains such as computing, political science and journalism, as well as for ordinary citizens. As the authors rightly point out, it is well situated in the growing ecosystem of machine-readable legal, regulatory and policy information in the government domain. The authors adequately describe the dataset in terms of name, URL, provenance of the data, how it was created, usage metrics and information on links to external resources as well as known inward links to the dataset. In addition, the authors clearly describe their vocabularies and modelling choices and how these implicitly support the use of simple queries to obtain information of interest, such as the link between politicians and speeches or the temporal order of agenda items or speeches.

However, the authors' argument for the quality of their dataset hinges directly on the quality of the underlying data source, since they claim that the dataset is a direct translation. They also fail to discuss whether specific patterns or best-practice guidelines were used. Neither is information available on the stability and maintenance of the dataset, e.g. how often is the LinkedEP dataset updated based on the published proceedings? There is also no information on the shortcomings of the dataset. The authors are encouraged to provide this information for completeness.
Further ideas worth considering from a pragmatic angle:
The example queries provided in the article are definitely not the strongest use cases for the LinkedEP dataset. It may be useful for the authors to develop some scenarios and user stories for the dataset by interacting with current and potential users of the EP proceedings' verbatim reports. For instance, citizens may be interested in knowing whether their MEPs are actively engaged in discussions on specific policy topics of interest. Considering the usefulness of competency questions in ontology development, scenarios and user stories could ensure that the dataset is useful and not just "yet another Linked Dataset in the LOD Cloud". Such scenarios may also suggest interesting links to external resources such as regulations and laws (http://eur-lex.europa.eu/homepage.html) passed by the parliament and policy topics in the EU (http://europa.eu/pol/index_en.htm). The policy topics in particular appear very important in linking the speeches to the concrete EU policy arena.