Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset

Tracking #: 830-2040

Authors: 
Davide Taibi
Stefan Dietze
Mathieu d’Aquin

Responsible editor: 
Claudia d'Amato

Submission type: 
Dataset Description
Abstract: 
The Learning Analytics and Knowledge (LAK) Dataset represents an unprecedented corpus which exposes a near complete collection of bibliographic resources for a specific research discipline, namely the connected areas of Learning Analytics and Educational Data Mining. Covering over five years of scientific literature from the most relevant conferences and journals, the dataset provides Linked Data about bibliographic metadata as well as full text of the paper body. The latter was enabled through special licensing agreements with ACM for publications not already available through open access. The dataset has been designed following established Linked Data pattern, reusing established vocabularies such as the SWRC ontology or BIBO and providing links to established schemas and entity coreferences in related datasets. Given the temporal and topic coverage of the dataset, it facilitates scientometric investigations, for instance, about the evolution of the field over time, or correlations with other disciplines, what is documented through its usage in a wide range of scientific studies and applications.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Maria Keet submitted on 01/Nov/2014
Suggestion:
Major Revision
Review Comment:

The paper describes the LAK dataset, consisting of LOD of bibliographic data and papers in the field of learning analytics and educational data mining of the past 5 years, taken from various venues and sources. The paper is easily readable and describes the motivation, creation, usage, and uptake clearly. The dataset was created semi-automatically, i.e., with some manual intervention, and given that it is used for various activities and software, may seem to have sufficient quality for its purposes. I have some reservations on this, primarily due to the 'schema' used (see below); also, some links/resources were not working.

Concerning usefulness, the dataset is rather focussed and limited, and its construction is rather narrowly geared to just this case instead of approaching it in a broader context. To make the effort put into creating the resource useful beyond this particular dataset (one may like to create a similar dataset on, say, conceptual data modeling research, or some other research field), more detailed guidelines on how to construct it would have been useful. Moreover, one would want to be able to reuse the annotation model, which is not possible at the moment (at least, not based on the information provided in the paper or the website). How is the mishmash of annotation resources useful (if at all) for anyone wishing to create their own extended dataset of bibliographic and paper data, building upon the authors' efforts?
There is indeed a section on "schema/ontology", but I could not find the actual schema used, other than that the paper gives the impression it is patchwork, mixing a bit of bibo, swrc, foaf etc. The data provided in Table 4 does not induce confidence either: foaf:maker has as range foaf:Agent in the FOAF file, but the table says foaf:Person; bibo:content is deprecated in BIBO (hence, ought not to be used, but is, according to Table 4); swrc:affiliation does have domain and range restrictions in the ontology (swrc:Person and swrc:Organization), yet foaf:Person is used according to Table 4, but no mention is made about an alignment between the two. p3, first column, states there are some "implicit mappings" between those ontologies, but just one example is meagre, and, given the content of Table 4, questionable. And why leave them "implicit"? Section 4 also mentions "e.g. by frequently adding new alignments with emerging vocabularies", but I could not find those alignments; in fact, http://lak.linkededucation.org/ doesn't seem to have the file with the schema. It would be useful to have, and its URI could be included in Table 1.
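To illustrate, explicit alignments of the kind asked for here could be published as a few triples alongside the dataset. The sketch below is purely hypothetical: whether swrc:Person and foaf:Person are in fact intended as equivalent is exactly what the paper should state, rather than leave implicit.

```sparql
# Hypothetical sketch only: the actual alignments are not published,
# which is the point of the criticism above. Whether this particular
# mapping holds is for the authors to state explicitly.
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX swrc: <http://swrc.ontoware.org/ontology#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

INSERT DATA {
  swrc:Person owl:equivalentClass foaf:Person .
}
```

A handful of such triples, served from a stable URI listed in Table 1, would make the mappings reusable by anyone extending the dataset.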

Another shortcoming is that, although various resources have been used, the paper does not discuss similar endeavours, how the authors' effort differs, and what (if anything) could have been reused from them; e.g. the OCLC [1], some other domain going the LOD way [2], or BibBase [3], as well as other datasets (see, e.g., [4], which includes SwetoDblp [5]), to name but a few that were easily found by a simple online search. Now the authors just assert they have more, which isn't convincing (that the dataset is the first of its kind, being bibliographic + full text). Note: this criticism doesn't mean I'd expect the authors to add exactly these references; just at least add some related ones. Even if the authors created the dataset in isolation and from the ground up without looking at other efforts, the dataset does not exist in isolation, and it's evidently not the only way of creating a bibliographic dataset.

[1] http://oclc.org/developer/develop/linked-data.en.html
[2] http://www.semantic-web-journal.net/content/migrating-bibliographic-data...
[3] http://www.semantic-web-journal.net/content/publishing-bibliographic-dat...
[4] http://datahub.io/dataset?q=Bibliography
[5] http://knoesis.wright.edu/library/ontologies/swetodblp/

I did try out a few links from Table 1, with mixed results. Some of it may be just coincidence and bad timing (a Friday during work hours), but it does not give a good impression of dataset availability.
http://data.linkededucation.org/resource/lak/conference/lak2013/paper/93 gave a http status 500
http://data.linkededucation.org/request/lakconference/sparql and http://data.linkededucation.org/request/lak-conference/sparql (not clear from the table which one it is): the first one returned a 404, the second one an 'unable to connect' (to the l3s it was redirected to).
Trying to go to the dataset via http://lak.linkededucation.org/ - 'spiralling to the core', then clicking around does allow browsing access. Clicking the 'blue canary' option gave a connection-reset. DEKDIV works.
On datahub.io/dataset/lak-dataset, the pointers to the example and SPARQL endpoint don't work, and clicking the 'source' link under 'additional info' [http://www.solaresearch.org/resources/lak-dataset/] gives a 404 page not found. Further, the last activity on the dataset was over a year ago, which makes me assume the dataset is not kept as up-to-date as Section 4 (p5, bottom) of the paper suggests.
The R-dump link works, which is the one I actually did not expect to work, as it is on a person's homepage at an affiliation, and people tend to change affiliations, so it is quite prone to link rot over time (though, admittedly, so are EU project URLs).

Other infelicities
There are multiple footnote numbers in the text that have a space between the word and the number, which should not be there.
Table 1 and Table 4 go outside of the text area.
Table 3 is spread over two columns, and Listing 1 over two pages; neither should be.
"references and full text is missing" -> are.
"particularly about," -> incomplete sentence.
there is a reference "[8][8]", which probably should be to refs 8 and 9; there are several of those.
last section "and beyond," -> "and beyond."
footnotes 22 and 23 are redundant; either remove them or move them into the references.

Review #2
By Vojtěch Svátek submitted on 13/Nov/2014
Suggestion:
Major Revision
Review Comment:

The paper describes a bibliographic+fulltext dataset focused on the domains of learning analytics and educational data mining.

The strong side of the dataset is its active use by the respective community, even if this use looks a bit unfocused for the moment (judging by the large variability in the styles of LAK Challenge papers).
I would love to see the two usage scenarios, "facilitating access to literature" and "scientific analysis of the community", more elaborated, including the distinction of which parts of the dataset are most valuable for each.
It seems to me that for the former, metadata plus links to PDFs (and probably a keyword index) are mostly sufficient. For the latter, which might involve some heavyweight analysis of the text content, more substantial 'lifting of text data into RDF' might make sense. However, doing scientific analysis of a single community in isolation (as the whole LAK Challenge seems to indicate, with its emphasis on the sole LAK dataset) seems of limited value; communities evolve in parallel and influence one another.

Consequently, I am a bit unsure if the choice of serving large blobs of free text inside RDF files (within the same graph) is an ideal model. This brings unnecessary processing overheads for apps that do not want to handle these blobs.

The challenging process of pre-processing the free-text content is described only rather vaguely in Section 4 (e.g., it speaks partly about 'unstructured' and partly about 'semi-structured' documents - do these denote the same thing? - and there are neither algorithms nor examples of their application). Depending on how important the authors judge this part of the work (at the current stage, the usage examples such as SPARQL queries do not seem to refer much to it), it should either be made more detailed and exemplified, or reduced to an ongoing-work statement.

Among the standard requirements on a data description paper, I was not able to find dataset licensing information. This is particularly important since the full-text availability relies on 'special licensing agreements with ACM'.

The structure of sections does not provide very good guidance. Section 2 reads "The LAK Dataset", which is actually the topic of the whole paper, and 'Usage' sections are interleaved with 'Creation' sections.

The English of the paper is a bit cumbersome in places. It should be edited by a native speaker. Also the typography needs curation (large blank spaces between words due to unbroken URIs, tables cut across pages, blanks before footnote signs, etc.).

Minor details:
- "we specifically discuss the RDF dataset" - but the dumps are RDF datasets as well, aren't they?
- "near complete corpus of research works" - this looks too ambitious, even if you cover the major journals and conferences, i.e. probably nearly all the *influential* works, there might be hundreds of other topically related papers at local or less visible venues.
- Fig. 1 does not seem to be explicitly referenced in the text.
- Why call an (informal) concept 'Journal Paper' in Table 3? 'Article', as in bibo, seems the more traditional term.
- Ibid: should bibo:Journal really be the right class for a Journal *Volume*?
- Table 4 lists the occurrence of entities for pairs of properties that are either equivalent or inverse; the counts are then clearly identical. These trivial counts should be omitted, and the table should instead list some less obvious, though less frequent, properties (since you speak about 'enriching the limited metadata with additional properties', what is in the table rather looks like the traditional, limited metadata).
- The reference to [8] is repeatedly doubled: [8][8]
- The first sentence of Sect. 2.3 does not read well.
- Missing link to Semantic Web Dog Food.
- The first SPARQL query in Sect. 3 is broken: there is '?papers' in the SELECT clause but '?paper' in the graph pattern
- "frequently adding new alignments with emerging vocabularies" - this should definitely be elaborated, as vocabulary reuse is a crucial point for a dataset - concrete vocabularies, ideally all of them, should be listed in a table
- As you show the DEKDIV interface in Fig. 4, it should be explained at least in one sentence.
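Regarding the broken first SPARQL query in Sect. 3: since the query itself is not reproduced in this review, the sketch below is only illustrative (the dc:title property is an assumption about the paper's modeling); the point is simply that the variable in the SELECT clause must match the one bound in the graph pattern.

```sparql
# Illustrative sketch, not the paper's actual query. The fix is to use
# ?paper consistently, rather than ?papers in SELECT and ?paper in the
# pattern (an unbound ?papers would silently return no useful column).
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?paper ?title
WHERE {
  ?paper dc:title ?title .
}
```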

In summary:
- The quality of the dataset seems reasonable (even if the authors admit that occurrence of errors follows from the use of heuristic methods in data preparation).
- The usefulness is also obvious, although the narrow domain focus is a limitation.
- The description is not written in an optimal manner - neither entirely clear nor sufficiently complete, as indicated above. This is the main reason for opting for 'Major revisions'.

Review #3
By Agnieszka Lawrynowicz submitted on 16/Nov/2014
Suggestion:
Minor Revision
Review Comment:

The paper describes the Learning Analytics and Knowledge (LAK) Dataset, which represents a corpus of bibliographic resources for a research discipline combining the fields of Learning Analytics and Educational Data Mining.
The dataset covers more than five years of literature from the conferences and journals in the area and provides Linked Data about bibliographic metadata as well as full texts of papers (plain text without figures).
Publishing the texts was enabled via licensing agreements with ACM for publications not yet available through open access. In this way, this work contributes to bridging the gap between unstructured publication formats (such as PDF) and structured Linked Data, which is currently a widely discussed topic in the Semantic Web community.
The dataset reuses established ontologies and vocabularies (e.g. SWRC, BIBO), and provides links to related datasets (e.g. DBpedia).

***Quality of the dataset***

The LAK Dataset qualifies as a five stars dataset.
Referring to Section 2.2, I would be interested to see more information on the criteria for choosing these and not other schemas/ontologies. For example, why have the authors chosen the NPG ontology for representing citations and not, for instance, CiTO or any other?

I am not sure about modeling the concept "Author" with foaf:Person. "Author" is only one role of a person, who may have several different roles.

I would be also happy to see more information on any methodology followed or lessons learnt or guidelines after the experience of selecting and using schemas/ontologies for the dataset.

***Usefulness (or potential usefulness) of the dataset***

The usefulness of the dataset is shown primarily in the context of the LAK Data Challenge. This is a notable context, supported by the materials on the website, and it is also growing (the challenge in 2015 will take place at the main ACM LAK conference rather than during workshop sessions).

The link to publications on usage 'beyond the LAK Data Challenge' (http://lak.linkededucation.org/?page_id=7) shows publications by the co-authors of this paper, except for one publication (by Balacheff & Lund), in whose content I cannot easily find any description of the dataset usage.
Are there notable applications beyond the LAK Data Challenge?

There are several entities involved in maintaining the dataset, and it seems that primarily it used to be the LinkedUp project that has just ended.
What is the current governance model for maintaining the dataset, and how will the dataset be maintained beyond the LinkedUp project?

***Clarity and completeness of the descriptions***

The paper is written rather clearly.
The aspects of the dataset, recommended to be included in papers for the SWJ dataset track, such as name, URL, version date and number, licensing etc. are covered.

It would be good if the authors included an example of an annotated publication, with all the common properties. It could be illustrated in a figure.

The website of the dataset provides interesting visualization services, and other supplementary material for the paper that adds value.

Minor remarks:

Authors:
* „Knowledge Media Institute, The Open University“ is marked as institution „c“, but no author is associated with „c“.
Abstract:
* not already -> not yet
* following established Linked Data pattern -> following an/the established Linked Data pattern?
Section 2:
* „the the” ->repetition (more than once)
* I am not sure what „R” format means. Is this related to the R project for statistical computing? It lacks a reference or a link.
* the acronym SWRC is not „Semantic Web Conference Ontology” but „Semantic Web for Research Communities”
* it is not clear what „property type associations” are and what „implicit mappings” are
* „the following table provides”->please refer to the table by its number
* “particularly about,“->something is missing here
* „ (a) enriching the limited metadata with additional properties […]For instance, in DBLP and Semantic Web Dogfood, LAK publications are not exhaustively represented, references and full text is missing in both cases and, in the case of DBLP, affiliations are not reflected as explicit entities“->are these (references, full text, affiliations as explicit entities) all additional properties or there are other additional properties?
* „The following figure depicts the total amount of links of resolved respectively enriched LAK entities.“ ->something is missing in this sentence
* The text in Fig. 2 is barely readable
* „[8][8]“ -> repetition (more than once in the text)
Section 3:
* points i) and ii) were presented in different order previously in the text
Section 5:
* „LinkedUp and associated organisations is an annual competition“ ->there is something wrong with this sentence
Section 6:
* last sentence does not end with a dot