Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset

Tracking #: 1080-2292

Authors: 
Stefan Dietze
Davide Taibi
Mathieu d’Aquin

Responsible editor: 
Claudia d'Amato

Submission type: 
Dataset Description
Abstract: 
The Learning Analytics and Knowledge (LAK) Dataset is an unprecedented corpus that exposes a near-complete collection of bibliographic resources for a specific research discipline, namely the connected areas of Learning Analytics and Educational Data Mining. Covering more than five years of scientific literature from the most relevant conferences and journals, the dataset provides Linked Data about bibliographic metadata as well as the full text of the paper bodies, the latter enabled through special licensing agreements with ACM for publications not yet available through open access. The dataset has been designed following established Linked Data patterns, reusing well-known vocabularies and providing links to related schemas and entity coreferences in related datasets. Given its temporal and topical coverage as a near-complete corpus of research publications in a particular discipline, it facilitates scientometric investigations, for instance into the evolution of a scientific field over time or its correlations with other disciplines, as documented by its usage in a wide range of scientific studies and applications.
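As a rough illustration of the scientometric use described in the abstract, the sketch below queries a bibliographic Linked Data endpoint for the number of papers published per year. The endpoint URL is a placeholder, and the vocabulary terms (dcterms:title, dcterms:issued) are assumptions about how such metadata might be modelled, not a statement of the LAK Dataset's actual schema; the paper itself documents the real endpoint and vocabularies.

    # Minimal sketch of a scientometric query against a bibliographic Linked Data
    # dataset such as the LAK Dataset. The endpoint URL and the vocabulary terms
    # used here are illustrative assumptions, not the dataset's actual schema.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://example.org/lak/sparql"  # hypothetical endpoint URL

    QUERY = """
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?year (COUNT(DISTINCT ?paper) AS ?papers)
    WHERE {
      ?paper dcterms:title  ?title ;
             dcterms:issued ?year .
    }
    GROUP BY ?year
    ORDER BY ?year
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Print the number of papers per year -- a simple view of how a field grows over time.
    for row in results["results"]["bindings"]:
        print(row["year"]["value"], row["papers"]["value"])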
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Vojtěch Svátek submitted on 08/Jun/2015
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage, etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the vocabularies used, ideally using the 5-star rating provided here.

Review #2
By Agnieszka Lawrynowicz submitted on 01/Jul/2015
Suggestion:
Accept
Review Comment:

Based on the revised paper as well as on the work the authors have done for this revision and their answers I am satisfied to accept the paper.

Review #3
Anonymous submitted on 06/Aug/2015
Suggestion:
Reject
Review Comment:

I gave a ‘revise paper’ verdict because I thought the problems raised in the previous two reviews were fixable. The authors, however, did not address all of the raised issues satisfactorily. As this is the third round, I am going to reject the paper.

Additional details are reported below:
- The issue concerning the lack of a methodology, guidelines, or at least a set of practices remains open: providing an RDF file and some data points is not enough to address the lack of best practices. A practice implies a process that should be spelled out, whereas here the authors seem to be oriented towards the resulting artifact rather than the process.

- The claim of following a best-practices document is not explicitly substantiated in the paper
- The way vocabularies are reused is not convincing
- Evidence of the validity of the design process that was followed is still missing

- When a schema is changed, this generally affects the whole dataset, yet an analysis of the effect of schema changes on the dataset is missing from the paper. For instance, do the annotation of the data, the data analyses, etc. stay the same regardless of what is in the schema? What is the effect on tools and queries that use the modified dataset and schema? (See the sketch after this list.)
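To make the last point concrete, the following hedged sketch shows how a query written against one schema version silently returns no results after a property is renamed, and one defensive query pattern that keeps tools working across both versions. All namespaces and property names here are hypothetical placeholders, not actual LAK Dataset terms.

    # Illustration of the concern about schema changes: data republished under a
    # renamed property breaks a query written against the old schema version.
    # The schema namespaces and property names are hypothetical placeholders.
    from rdflib import Graph, Literal, Namespace, URIRef

    V1 = Namespace("http://example.org/schema/v1/")
    V2 = Namespace("http://example.org/schema/v2/")

    g = Graph()
    paper = URIRef("http://example.org/paper/1")
    # Data after a (hypothetical) schema change: v2:bodyText replaces v1:fullText.
    g.add((paper, V2.bodyText, Literal("Full text of the paper ...")))

    # A tool still querying the v1 property silently gets an empty result.
    old_query = """
    PREFIX v1: <http://example.org/schema/v1/>
    SELECT ?paper ?body WHERE { ?paper v1:fullText ?body . }
    """

    # One defensive pattern: accept either property so existing tools keep working.
    safe_query = """
    PREFIX v1: <http://example.org/schema/v1/>
    PREFIX v2: <http://example.org/schema/v2/>
    SELECT ?paper ?body WHERE {
      { ?paper v1:fullText ?body . } UNION { ?paper v2:bodyText ?body . }
    }
    """

    print(len(list(g.query(old_query))))   # 0 -- the renamed property breaks the old query
    print(len(list(g.query(safe_query))))  # 1 -- the UNION pattern still finds the data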