Review Comment:
The authors describe how they converted the transcripts of the debates of the European Parliament into RDF. They provide details on different aspects of URI creation, the vocabulary used, and how the dataset has been published.
My impression is that this could have been a much more interesting project. So far, the conversion of the data looks fairly straightforward and does not introduce novel techniques or procedures that could be applied to other data conversion projects. I'm afraid that, in general, we should stop publishing papers for every converted dataset unless there is a novel contribution.
Now, the particular case of this dataset presents several problems. First, it is a more or less direct translation of the dataset already available. While it is always helpful to have an RDF version of a dataset, it is not clear how that helps final consumers; it is important to note that this dataset was converted with users with a background in the humanities in mind, who aren't necessarily experts in semantic technologies. My guess is that they could benefit much more from CSV files. Speaking of users and usage, I was expecting a more complete discussion in Section 6 in terms of who uses the data and for what; showing some raw numbers doesn't tell much, I'm afraid.
In terms of modeling (Section 3) there are several things that are incorrect. First, in Section 3.1 the authors mention: "the parts of the plenary sessions are treated as events and archived documents at the same time. That is, one item is simultaneously assigned document-like properties, such as textual content, and event-like properties, such as speaker, or properties ambiguous in this respect, e.g. has part. This choice is supported by the nature of the source materials, since verbatim reports are a direct account of reality." This is wrong, since an event, such as a plenary session, is disjoint from a document. In this case it would have been better to model two resources: a document (say, a foaf:Document) that was generated from an event (perhaps a subclass of lode:Event). The link between both could be a PROV-O predicate (prov:wasDerivedFrom comes to mind, but a more suitable one may be available). Mixing both is not only incorrect in terms of modeling, but could also lead to strange results when using a reasoner.
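A minimal sketch of the split I have in mind (the resource names are illustrative, and prov:wasDerivedFrom is only one candidate predicate; a more specific one may exist):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lode: <http://linkedevents.org/ontology/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# The plenary session itself: an event, carrying event-like properties.
ex:session-2011-05-10 a lode:Event ;
    lode:involvedAgent ex:some-speaker .

# The verbatim report: a document derived from that event, carrying
# document-like properties such as textual content.
ex:report-2011-05-10 a foaf:Document ;
    prov:wasDerivedFrom ex:session-2011-05-10 .
```

This keeps the disjoint classes separate while preserving the link between the account and the reality it reports.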
In Section 3.4 the authors mention the use of unclassifiedMetadata as a generic annotation. It would have been much more interesting to model these events (e.g., Applause) and describe them as part of the speech structure. Another idea that would have made this paper considerably more interesting is modeling the topics of each speech (using Latent Dirichlet Allocation or another technique [2]). Thus, instead of adding regular expressions to SPARQL queries to find themes, final users could simply dereference topic URIs and apply a follow-your-nose approach to discover speeches related to those topics.
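To illustrate what I mean (the property and topic URIs below are hypothetical, since no such vocabulary exists in the dataset yet), users could then write queries such as:

```sparql
PREFIX ex: <http://example.org/vocab/>

# Instead of FILTER regex(?text, "climate", "i") over the full speech
# text, link speeches to dereferenceable topic resources and query them
# directly; following the topic URI then reveals related speeches.
SELECT ?speech WHERE {
  ?speech ex:hasTopic <http://example.org/topic/ClimateChange> .
}
```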
In Section 4.1 the authors mention that they tried to match http://dbpedia.org/resource/firstname_lastname. They ASKed for that URI and, if the result was true, they had a match. That is a _terrible_ idea in my opinion, since URIs should be opaque [1]: what if we look for John Smith, but dbpedia.org/resource/john_smith is about a different person, or is a disambiguation page? As described later, the authors found "a precision [matching] level of 90 % (± a margin of error of 8%)" on a sample of 50. It is not clear why this wasn't done manually: dereferencing 1455 URIs may be boring, but it is perfectly doable in one day (assuming a human needs 10 seconds per page to understand what it is about, you need around 4 hours). Another thing the authors did not explain is why they also chose to include the Polish DBpedia (and why only that one and not others). Finally, a discussion on how to expand and maintain this dataset would have been nice (do you need to reconvert the whole dataset every time? do you have incremental conversions?).
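As I understand it, the heuristic amounts to something like the following (reconstructed from the paper's description; the exact query is not given):

```sparql
# Guess a DBpedia URI from the MEP's name, then test for its existence.
# A positive answer only proves that *some* resource lives at that URI,
# not that it describes the right person: it may be a namesake or a
# disambiguation page.
ASK { <http://dbpedia.org/resource/John_Smith> ?p ?o }
```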
At the implementation level, a simple request using curl -iLH 'Accept: text/turtle' http://purl.org/linkedpolitics returns only HTML, and curl -iLH 'Accept: text/turtle' http://purl.org/linkedpolitics/void returns no content (the same happens when asking for RDF/XML). I would expect the site to be at least explorable by RDF agents.
Also, in my opinion, an important part of a submission of this nature is the source code used to convert the data, including notes and other documentation produced during the process; this not only makes the contribution more transparent for others, but also allows them to replicate the effort and reuse existing tools. Sadly, I couldn't find any mention of the code or documentation used.
[1] http://www.w3.org/DesignIssues/Axioms.html#opaque
[2] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
PS: If you feel I did not correctly understand part of your paper, or simply have questions regarding this review, please feel free to write to me at alvaro[AT]graves[DOT]cl