The LinkedUp Data Catalogue: A Meta-Dataset of Linked Datasets in the Education Domain

Tracking #: 860-2070

Authors: 
Mathieu d’Aquin
Alessandro Adamou
Stefan Dietze
Besnik Fetahu

Responsible editor: 
Tania Tudorache

Submission type: 
Dataset Description
Abstract: 
The LinkedUp Catalogue of Web datasets for education is a meta-dataset dedicated to supporting people and applications in discovering, exploring and using Web data for the purpose of innovative, educational services. It is also an evolving dataset, with most of its content being contributed by automatically extracting relevant information from external descriptions and the included datasets themselves. In this paper, we describe the purpose and content of this dataset, as well as the way it is being created, published and maintained.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ruben Verborgh submitted on 20/Nov/2014
Suggestion:
Major Revision
Review Comment:

This article presents a dataset with metadata about datasets in the educational domain. This dataset has been developed in the context of the LinkedUp initiative. The authors describe how the dataset was created, how it has been / can be used, and how the dataset can be updated in the future.

The dataset by itself is interesting, and the quality is good, although important URLs don't dereference. It is useful for dataset discovery, because it lists the main partitions (i.e., used types and properties) of datasets. However, the approach is not specific to educational data, so the restriction to this scope seems artificial. My main issue with the article is that the description is long, vague, and sometimes irrelevant or inaccurate. In general, the information density is too low. I therefore recommend either making the description more accurate and to the point, or shortening the article to the minimum of 5 pages.

Below, I will first discuss the three focus points of the journal regarding Linked Dataset Descriptions (quality of the dataset, usefulness, clarity and completeness). Finally, I will detail some issues I found in the article.

(1) Quality of the dataset

Overall, the quality of properties and values in the dataset is high. It uses common ontologies (such as VoID) in the correct way.

The only serious problem I encountered is dereferenceability.
The URLs used to identify datasets do not dereference, for example:
- http://data.linkededucation.org/linkedup/dataset/data-southampton-ac-uk
- http://data.linkededucation.org/linkedup/dataset/ege-university-linked-o...

The URLs which identify dataset partitions are also not dereferenceable, for example:
- http://data.linkededucation.org/linkedup/dataset/nobelprizes/cp/04b45caa...
- http://data.linkededucation.org/linkedup/dataset/nobelprizes/cp/cc2430bc...
(which should be Nobel Prize Awards and Categories).
Why can't those dataset partitions have URLs that lead to data about them? For example:
- http://data.nobelprize.org/sparql?query=CONSTRUCT%20WHERE%20%7B%20%3Fx%2...
- http://data.nobelprize.org/sparql?query=CONSTRUCT%20WHERE%20%7B%20%3Fx%2...
The effort to generate such a URL is identical (if not less), but the result is much more useful for clients, and in line with the Linked Data principles.
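
For illustration, the query encoded in such a URL could be as simple as the following sketch (I am guessing the Nobel Prize class IRI here; the point is the shape of the query, not the exact terms):

  CONSTRUCT WHERE {
    ?x a <http://data.nobelprize.org/terms/LaureateAward> ;
       ?p ?o .
  }

URL-encoded and appended to http://data.nobelprize.org/sparql?query=, this gives a URI that both identifies the partition and returns its triples when dereferenced.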

Finally, keywords etc. are attached as human-readable labels instead of machine-interpretable concepts. It would be worthwhile to invest in machine-interpretable keywords.

(2) Usefulness (or potential usefulness)

The main purpose of this data is to find datasets; i.e., it thus does not expose data that was not available before, but rather serves as a guide into existing data. As such, it is useful for dataset discovery by automated (and, through the website, also manual) processes. The authors also hint at use for federation, but one then wonders whether federation-specific approaches (dataset summarization) would not be appropriate. Actually, I'm curious whether the authors have tried such summarization algorithms and whether they give similar or better results than the currently used services.

The scope of this dataset is restricted to datasets explicitly related to education, and the authors claim that “extending it would […] decrease the value of the dataset, making it less appropriate for discovery.” I fail to see why this would be the case, since all of the discussed techniques, and the resulting dataset, are independent of the education domain. In fact, if “aiiso:School” in Query 3 is replaced by, let's say, “example:Business” or “example:Car”, the mechanism would work equally well. Therefore, I disagree with the authors that the dataset's scope would influence its usefulness.
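
For illustration, a class-based discovery query over the catalogue's VoID descriptions could look roughly like the one below (this is my sketch of the standard VoID layout, not a quote of Query 3); nothing in it is education-specific, and swapping aiiso:School for any other class leaves it unchanged:

  PREFIX void:  <http://rdfs.org/ns/void#>
  PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>

  SELECT ?dataset ?endpoint WHERE {
    ?dataset void:sparqlEndpoint ?endpoint ;
             void:classPartition ?partition .
    ?partition void:class aiiso:School .
  }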

Furthermore, I disagree that use cases such as data discovery and access federation would be “critical in areas such as education, where very disparate and scarce data are available from many different sources.” It is not necessary to limit their applicability to education scenarios; i.e., this need is not education-specific, and thus not a specific advantage of this dataset.

(3) Clarity and completeness of the descriptions

The description is the weak point of this article. It is written from a very LinkedUp-centric point of view, and few attempts are made to relate the topic to the reader, which is important for any article. A major problem is the focus on the “what”, i.e., descriptions of what steps were taken, as opposed to the “why” and “how” of the decisions. In some parts, the description is frustratingly vague (“an external service”) or unnecessarily long. I would recommend a much more reader-oriented approach that enables readers to actually use the dataset or the techniques behind it. I.e., the description of the dataset should be an invitation with concrete pointers for usage. At the moment, it sticks too much to (incomplete) details that are not helpful to the reader.

As far as the other criteria for Linked Dataset Descriptions are concerned, the following are missing:
- metrics and statistics on external and internal connectivity
- growth (partly)
- examples and critical discussion of typical knowledge modeling patterns used
- known shortcomings of the dataset (partly)

Below are details on issues I found in the article.

Section 1
- Does better cohesion really lead to better reusability, and how are both defined?
- How is “explicit relevance to learning” assessed?
- Why is SPARQL endpoint accessibility a requirement? As your reference [5] indicates, SPARQL endpoints are not the most reliable sources of data. Why wouldn't a data dump be sufficient (and why are they not linked in the dataset?) The “Web standards” argument doesn't cut it here, because data dumps and Linked Data documents are (plain) Web standards as well. Given [5], I would also strongly doubt that it really “facilitates the building of applications that draw from several of these datasets.”

Section 2
- “The primary aim […] was to support participants […]” => to what extent did you succeed?
- Reference [5] is currently the demo article of a corresponding main track article; in this context, the main track article (“SPARQL Web-Querying Infrastructure: Ready for Action?”) seems more relevant.
- How exactly is the small graph built?
- When probing with the exact same query every time, how do you know the result is not cached? The endpoint might be down for useful queries. (See the sketch after this list.)
- How did you choose Query 2 and why is it a good match?
- The references to “external services” are rather frustrating. The fact that they are external is irrelevant; if you mention them, readers need to know what they do, how, and why.
- The links and mappings are insufficiently described. Furthermore, the claim that they enable federated queries is not grounded. This claim is then weakened by the (correct) statement that endpoint stability is an issue (and it remains unknown, unfortunately, to what extent the “hope” is justified).
- Query 3 is supposed to show how something is simple, but the query itself is actually quite complex. Are users supposed to come up with this themselves?
- What are the Prod and KIS datasets?
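
On the caching point above: one simple way to rule out a cached answer is to vary the probe text on every run, e.g. (just a sketch, not a claim about how the LinkedUp monitor actually works):

  # the probe-id comment changes on every run, so a cache keyed on the
  # query string cannot serve a stale answer
  # probe-id: 2014-11-20T12:00:00Z
  ASK WHERE { ?s ?p ?o }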

Section 3
- Please quantify “regular basis”.
- Figure 4 has little usefulness besides detailing VoID. Perhaps Figures 3 and 4 would be better represented as an example listing?
- The part about mapping is extremely vague. “Existing link/mapping” => Why? How? What? Where?
- Suggesting new mappings => How?
- Manually creating mappings => How? Where?
- Please detail the references to external services, and justify their use.
- How will you make the infrastructure easier to maintain?

Section 4
- As mentioned above, I don't agree with anything in the approach being specific to education, except the choice of datasets.
- Also, why would an extension decrease the value?
- How would you measure the quality of the summarized data?

Miscellaneous
Typo: “a rational for” => “a rationale for”
Typo: “the summary graph are” => “the summary graphs are”
Typo: “it as a limited” => “it has a limited”
Spelling: Various references have capitalization problems, including words such as TEL, SPARQL (x2), and URI.

Once the above issues are addressed, or if the paper is shortened to circumvent them, I think it can be a worthy addition to this special issue.

Review #2
By Guillermo Vega Gorgojo submitted on 11/Dec/2014
Suggestion:
Major Revision
Review Comment:

This paper presents the LinkedUp Data Catalogue, a meta-dataset of educational datasets. The catalogue is automatically created and can play a key role as a hub of datasets in the educational domain.

With respect to the contents of LinkedUp, Table 1 presents some aggregated figures, while a rough overview of the datasets is given in the last paragraph of Section 1. However, I was expecting more details, presented in a more systematic way. In order to do this, the authors should prepare another table with the main content types (course information, educational resources by type, etc.) and then provide a description using examples when needed. The authors should also include the number of triples, datasets and mappings for each content type (the employed vocabularies can also be helpful here). In summary, the authors have to make more effort to explain the contents of LinkedUp.

I have another concern with the mappings: since this is a dataset paper, the details of the actual method for generating the mappings are not so relevant (although a reference is of course welcome). However, it is important to better communicate the motivation for the mappings in LinkedUp, to give an overview of the mappings and, please, real examples. In fact, this part should be better connected with the federated query example in Section 2.1, i.e. Query 3 should be explained in relation to the mappings. In addition, the mappings are a valuable asset for the educational domain; can you offer these mappings in LinkedUp in an easy way? (This can be quite useful for promoting the development of an educational Web of Data).

Concerning the possible uses of LinkedUp, discoverability is OK, but not very exciting. I challenge the authors to present/foresee other possible uses of LinkedUp (the federation has a lot of potential, especially if connected with real problems from the educational domain).

About the structure, I find it awkward to present the usage before the creation of the dataset. I suggest that the authors switch Sections 2 and 3 and make the necessary adjustments (taking into account the previous concerns in this process).

Finally, I have some minor comments and typos that should also be tackled:
- The last phrase of the abstract is quite generic; please elaborate on what the purpose and content are, and how it is created…
- “[…] the LinkedUp Project and is being used for data discovery by developers in the education sector other than participants to the competitions” -> any evidence of this?
- “[…] the LinkedUp Project has carried out a number of transformations of existing, non- RDF-based datasets into RDF and linked data, so that they can be provided through a SPARQL endpoint and included in the catalogue.” -> which transformations?
- “obviously cached” -> why is this obvious? A more thorough analysis is recommended
- Reference all the queries in the text, e.g. query 2 instead of the query below
- “The summary graph are”
- “The additional information … are”
- “As such, it as a limited purpose”

Review #3
By Andreas Harth submitted on 11/Dec/2014
Suggestion:
Major Revision
Review Comment:

The paper describes the LinkedUp Catalogue of ~50 datasets in the educational domain.
The dataset descriptions appear to be managed in datahub.io, with the "catalogue" consisting of an auto-generated HTML site which lists the datasets.

The authors use the term "metadataset" to refer to the catalogue data.
I'm not sure that the distinction between data and metadata is a fruitful one - some people might say RDF is metadata (and actually, there is the "Open Courseware Consortium metadata in RDF" entry in the catalogue), so your catalogue data would be metadata about metadata.
Personally I would just avoid the "meta" prefix, even if it might sound nifty.

My view of Linked Data is dereferenceable URIs, and not necessarily datasets accessible via SPARQL.
So I'd rather have dereferenceable URIs than a SPARQL endpoint, also given the brittleness of SPARQL endpoints (which the authors also mention: "We however hope that with the increased development of SPARQL-related technologies, these issues will eventually disappear and make query federation a realistic use case for the LinkedUp Catalogue.").
The authors seem to prefer the reverse (2 - it had to be accessible at least through a SPARQL endpoint.)
I have to concede that the four principles are open to interpretation when it comes to SPARQL.

As a side note: Doug Engelbart had the concept of a dataset catalogue in his original hypertext proposal (http://www.w3.org/Architecture/NOTE-ioh-arch).
Berners-Lee's hypertext system got rid of the catalogue, and instead introduced hyperlinks for resource discovery (incidentally, linking is #4 of the Linked Data principles).
So, a catalogue in the web spirit is more like Google - data publishers interlink their datasets, and search engines collect the data and offer query functionality over the data (which the current system does not provide - is it possible to write a query that performs keyword search across all datasets?).
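
For what it is worth, SPARQL 1.1 federation can express such a keyword search, at least on paper; a rough sketch (variable SERVICE targets and SILENT are in the spec but unevenly supported, and I am assuming the catalogue exposes endpoints via void:sparqlEndpoint):

  PREFIX void: <http://rdfs.org/ns/void#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?endpoint ?resource ?label WHERE {
    ?dataset void:sparqlEndpoint ?endpoint .
    SERVICE SILENT ?endpoint {
      ?resource rdfs:label ?label .
      FILTER(CONTAINS(LCASE(STR(?label)), "physics"))
    }
  }

In practice this is exactly where the brittleness of public endpoints bites, which reinforces the case for dereferenceable data.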

In sum, the write-up is ok (modulo the comments below), but does not offer new insights.
My recommendation is to shorten the paper considerably.

(1) Quality of the dataset

It is nice to have vocabulary mappings done as part of the catalogue activity.
Instance mappings (via owl:sameAs) would also be useful.

I'm not sure why DBLP is in there.
FWIW, the original DBLP site seems to provide RDF now, so it might be a good addition to the catalogue.

(2) Usefulness (or potential usefulness) of the dataset

The dataset descriptions are in datahub.io already, so the only value-add is the vocabulary mappings.

(3) Clarity and completeness of the descriptions

The description could be shorter, the SPARQL queries do not add much, and the use of VoID (Figures 3/4) is standard.

"As a project, LinkedUp had for objective to encourage the use of Web Data standards." -> rewrite sentence ("for objective")