A Linked Dataset of Medical Educational Resources

Paper Title: 
A Linked Dataset of Medical Educational Resources
Authors: 
Hong Qing Yu, Stefan Dietze, Davide Taibi, Daniela Giordano, Eleni Kaldoudi and John Domingue
Abstract: 
Sharing and reusing educational resources is increasingly important for enhancing learning and teaching, particularly in the medical education domain, where these resources are expensive to reproduce. Many efforts have therefore been made to federate resources for sharing and reuse, which has led to a fragmented landscape of competing metadata schemas, such as IEEE LOM or OAI-DC, and interface mechanisms, such as OAI-PMH or SQI. However, the major issue in federating educational resources is the heterogeneity of metadata and data. In this paper, we describe a medical educational dataset (the mEducator Linked Educational Resources dataset) that is published as part of the Linked Open Data cloud following Linked Data principles. The dataset contains educational resource metadata federated from ten different (medical) educational institutes, together with enriched links to related information obtained by using Linked Data techniques and datasets. We introduce a Semantic Web Service based data extraction mechanism that is used for service and data integration to address the problem of heterogeneous metadata. The paper also discusses the APIs for accessing the dataset, statistics, and existing applications that use the mEducator dataset.
Submission type: 
Dataset Description
Decision/Status: 
Reject
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised manuscript after a "reject and resubmit", then rejected. First round reviews are beneath the second round reviews.

Solicited review by Natasha Noy:

The paper is much improved and is significantly more clear now than in the original submission.

My main concern remains however: the paper still does not describe what is actually in the dataset. Section 2.1 gives generic discussion of why linking is a good thing. After reading the paper again though, I still did not see a single example of what a medical educational resource is. More important, there are no examples of queries or tasks that users would want to do that would require the linking. I was expecting something like "Users would like to find out X. For that they need information Y from resource A and information Z from resource B. We need integrated metadata describing both A and B to make this request work. Here is how our work makes it possible". The dataset description still does not provide any information about the dataset content.

I would suggest focusing much less on the process itself, but rather on the content, as the Special Issue Call requires. It is of course necessary to understand the process by which the dataset was created if I want to use the dataset. But this special issue asks authors to focus on the content. And for now, I don't see what is in it.

Quality of the dataset:

Because the dataset content is not described, it is hard to assess its quality. There is no discussion of any formal evaluation of quality of the annotation process (the one manual -- and most critical -- step).

Usefulness of the dataset:

There is one brief section on the use of the dataset (Section 5). But it is very short and very general. It is not clear at all how the dataset is used, how the interlinking is used, what the project actually does with the data.

Clarity of the description:

The English is much much improved. The description is too vague though. Here are a few examples:
- Section 2.2, resource schema: there are general statements on how the schema was created, but no "hard data". How was it evaluated? How large is it? How much of the mEducator resource metadata does it cover?

- Section 2.3 (critical to this special issue): there is a list of different types of queries but no description of what the queries actually do. I think this section needs to be expanded significantly. Side note: putting queries in footnotes makes for a very hard read. A table would have been better.

- Section 3: Enriching with DBpedia and BioPortal data. Again, there is no discussion of what actually happened here. What types of queries were made? What type of data was added? Why these specific vocabularies (listed in Table 2)? Why not others? A search in BioPortal for the example in the paper ("thrombolysis") returns other vocabularies. What data was taken from DBpedia? Diseases? Organizations? Something else? This goes back to my point earlier that it is still not clear at all what is actually in the dataset.

Solicited review by anonymous reviewer:

Unfortunately the paper still has many typos and not small ones either. In fact several sentences and full paragraphs would have to be rewritten (e.g., the full Section 2, which consists of one paragraph). I believe this is not the job of the reviewer.

In comparing the replies to the reviewers, the initial review, and the revised manuscript, again many questions arise. For example, in reply to Reviewer 1, the authors point repeatedly to Section 2.1. For example, they write: "A paragraph is added at beginning of Section 2.1." This is really not possible because Section 2.1 has only one paragraph, which seems even shorter than the paragraph of the initial version.

Then, also in reply to Reviewer 1, the authors write: "We revised Section 3.2 to clearly explain the metadata mapping and lifting process that is a semi-automatic process." The problem here is that there is no Section 3.2.

There are several more problems with the paper and discrepancies between what the authors claim they have fixed and the revised version of the paper.

In summary, the first version of the paper confounded the reviewers and, unfortunately, the revised version fixes some problems but also introduces a whole new set of problems, which cannot be fixed easily. My suggestion is that the authors rewrite the paper from scratch, ask for help from a very good editor, and after this is all done (which may take several iterations) submit the paper elsewhere.

Solicited review by Michiel Hildebrand:

=
"However, the three steps mentioned in 3.2 are not completely clear. Can you illustrate the steps with an example? The lifting example is not very convincing and it is not referenced and described in the text. I suggest to enhance or replace with an illustration of the entire "generation" process."
=>
We significantly revised this section to illustrate the entire data processing
approach with a clearer Figure 2. We deleted the previous Figure 2 which only
presented an example of lifted results and replaced it with a URI reference to
show the services annotation examples.
=

Figure 2 is not an example; it is a schematic representation of the process. It is still not clear what happens in each step. What is in the existing API and what kind of annotations need to be added? What comes out of this step and what is then still required by the service invocation engine? I suggest using an example to illustrate what happens in each step of the process. Note that this issue is related to the other comments about the content of the datasets. What is in them?

=
"The topics of the different repositories remains unclear. Therefore, it is difficult to get a feeling for the heterogeneity problems and potential new usage scenarios that are enabled by linking the datasets."
=>
We listed different repositories at the beginning of Section 3 and discussed
the heterogeneity problem related to educational resources data on the Web at
the beginning of Section 2.1 and abstract. We also demonstrated the usage of
the dataset by pointing to a number of additional references that highlight
the use case scenarios facilitated by the improved data.
=
In section 3 I still only see a listing of the data sources. The content of the individual data sources is not explained. Therefore, it is very difficult to understand the value of integrating these sources. It is also difficult to determine if the mEducator schema makes sense, or if it is a good idea at all to have a single unifying schema.
Section 2.1 describes in very general terms why interoperability is desired. It is not clear which exact heterogeneity issues are solved in your project. If I understand it correctly, interoperability is achieved by three approaches: (i) schema mapping, (ii) lifting and (iii) internal linking. Can you illustrate by means of examples the different types of heterogeneity among the original data sources and how each of the three approaches increases the interoperability?
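
To make concrete the kind of example requested here, the following is a minimal sketch of the first approach (schema mapping) only. It is not the authors' implementation: the two records, the source field names, the unified property names and the helper functions are all hypothetical and are not taken from the mEducator dataset.

```python
# Hypothetical records from two source repositories that describe
# similar resources with different metadata field names.
lom_style_record = {
    "general.title": "Thrombolysis in acute stroke",
    "general.description": "Lecture slides",
    "technical.location": "http://example.org/a/1",
}
dc_style_record = {
    "dc:title": "Thrombolysis case study",
    "dc:description": "Virtual patient",
    "dc:identifier": "http://example.org/b/7",
}

# Hypothetical mappings from each source schema to one unified schema.
MAPPINGS = {
    "lom": {
        "general.title": "title",
        "general.description": "description",
        "technical.location": "location",
    },
    "dc": {
        "dc:title": "title",
        "dc:description": "description",
        "dc:identifier": "location",
    },
}


def to_unified(record, schema):
    """Rename the fields of a source record to the unified property names."""
    mapping = MAPPINGS[schema]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search_title(records, keyword):
    """Return the locations of unified records whose title contains the keyword."""
    return [r["location"] for r in records if keyword.lower() in r["title"].lower()]


unified = [to_unified(lom_style_record, "lom"), to_unified(dc_style_record, "dc")]
hits = search_title(unified, "thrombolysis")
```

With both records expressed in the same properties, one title search covers both sources. An example of this kind, built on real records from the source repositories, would show which heterogeneity the mapping removes.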

First round reviews:

Solicited review by Natasha Noy:

The paper describes a linked dataset that integrates a number of educational resources.

The paper suffers significantly from poor English and it is possible that some of the critique in my review is because I missed something in the description.

Ironically, the description of the dataset contains very little about what is in the dataset. Given that most SWJ readers would not be familiar with medical educational resources, it would have been helpful to provide more details on what is in those resources. Why would users want to search multiple resources from different countries simultaneously? What are example questions or use cases that can be addressed by having this common metadata in the resources? This matters in particular because the resources seem to share very little in common (the paper mentions title, description, and a couple of others). So, having a better idea of why this metadata needs to be linked and what users of the dataset can get out of it would have been very helpful.

It appears that the mapping between the metadata fields and the authors' ontology was done manually. In order to understand what was and was not captured and how heterogeneous the metadata was, it might be useful to show how much of the metadata from each resource was mapped and integrated and how much of it did not find its representation in the authors' ontology.
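
A report of this kind could be as simple as the following sketch, which computes, for one source, the fraction of metadata fields that have a target in the unified schema and lists the fields that are lost. The field names and the mapping are hypothetical placeholders, not data from the paper.

```python
def mapping_coverage(source_fields, mapping):
    """Return (fraction of source fields that are mapped, list of unmapped fields)."""
    unmapped = [field for field in source_fields if field not in mapping]
    if not source_fields:
        return 0.0, unmapped
    mapped_count = len(source_fields) - len(unmapped)
    return mapped_count / len(source_fields), unmapped


# Hypothetical source with four metadata fields, three of which are mapped.
example_fields = ["title", "description", "url", "difficulty"]
example_mapping = {"title": "title", "description": "description", "url": "location"}

coverage, lost_fields = mapping_coverage(example_fields, example_mapping)
```

In this made-up case the coverage is 0.75 and the field "difficulty" is lost. One such row per source repository would answer the question above.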

The authors mention that their document enrichment populates some of the metadata properties. It is not clear from the description, though, which (seemingly automatic) step determines which property the enriched terms go into. In general, the description of the enrichment would have been better if it had stated more clearly what the goals of the process were. It was hard to understand how the data for enrichment was being selected and which properties from the metadata are used.

The importance of the dataset is one of the criteria for this special issue, and without good scenarios on what such integration would help achieve, it is hard for me to see the use of the dataset beyond a very small set of users.

Solicited review by anonymous reviewer:

Summary: This paper is extremely difficult to read and would benefit from considerable rewriting. More details are provided below.

Below are the answers to specific points mentioned in the submission
guidelines, followed by more detailed comments.

1) Name, URL, version date and number, licensing, availability, etc.

Only Name and URL are included.

2) Topic coverage, source for the data, purpose and method of creation
and maintenance, reported usage etc.

Some usage is reported, but because of the way the paper is written, the statistics provided cannot be (easily) understood.

3) Metrics and statistics on external and internal connectivity, use of
established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language
expressivity, growth.

I believe growth is not discussed.

4) Examples and critical discussion of typical knowledge modeling
patterns used. Known shortcomings of the dataset.

The paper focuses more on what has been done (using several APIs) than on modeling proper. Potentially interesting aspects, such as enrichment and clustering, are not introduced or explained in sufficient detail.

5) Quality of the dataset

Not discussed. Enrichment could be a way of improving the dataset, but the discussion is not sufficient.

6) Usefulness (or potential usefulness) of the dataset

Who is using the dataset? Which queries are being made to it?

7) Clarity and completeness of the descriptions

This is clearly the weakest point of the paper. It is as if the paper was extracted from a much longer one where the background was provided and discussed. However, the background was not included in this paper.

----------------------------------------------------
Detailed comments, including typos

Capitalize Section, for example:

in section 5 -> in Section 5

rewrite: describe the USING discipline

dataset3. -> dataset.3 (footnote should follow the punctuation mark here and elsewhere)

in "The number of enrichment triples in the data store is
1352. Table 2 shows that the enrichment involves a total
of 509 distinct terms from DBpedia."

1) Should it be "enriched triples"? Or maybe "enrichment triples" are
added new triples. Give at least one example.

2) The number "509" (or even "1352") seems low. Or maybe it is not, but
since Table 2 is not well explained, one cannot understand exactly
what is happening.

Section 4.2 starts with "The clustering functionalities have been
integrated in the RDF store to allow the interlinking of resources
originating from different repositories." Not clear what the first
"THE" refers to, because clustering was only mentioned in the
introduction without explanation.

Section 5. Data Usages -> Data Usage

to search and browser -> to search and browse

"The average number of triples per educational resource is 27, ranging
from a minimum of 6 to a maximum of 68." the numbers seem rather
low. Which triples do these numbers exactly refer to?

Here are more examples of hard to read sentences/paragraphs:

For example: the caption of Figure 5 is impossible to understand:
"Figure 5 Frequency of total number of completed fields per resource
(excluding fields that pertain only to repurposed resources)." The
text that refers to it is not very helpful either:

"Figure 5 shows that even though the
metadata imported from external stores usually is very
limited, often covering only less than three properties
(e.g. title, description and resource location), based on our
automated and semi-automated enrichment techniques,
substantially large numbers of properties are provided for
the majority of resources, where all resources have a
minimum of 5 described properties."

"Table 3 provides an overview of property usage frequency across
educational resources." Why is property usage important? Introduction
to this topic is needed earlier.

(1) investigating methods to enable integrate data from
other educational domains;
->
(1) investigating methods to enable the integration of data from
other educational domains;

Solicited review by Michiel Hildebrand:

The paper gives a clear and concise description of the medical educational dataset created in the mEducator project.

The strong points of this paper (and minor suggestions to improve them):

- An important aspect of this dataset is its generation from the original web services (section 3). The use of existing technologies for this process is a plus. However, the three steps mentioned in 3.2 are not completely clear. Can you illustrate the steps with an example? The lifting example is not very convincing and it is not referenced and described in the text. I suggest to enhance or replace with an illustration of the entire "generation" process. Section 3.3 is vague. Is it necessary?

- Another aspect of the dataset is the enrichment (section 4). The use of the two different services, DBpedia Spotlight and the BioPortal API (and the difference in results), is interesting. I suggest that you mention the use of these services at the beginning of the section. Currently, we only "accidentally" find out about them at the end of section 4.1, in your analysis of the results. Do you have any idea about the quality of the links?

- The additional clustering functionality is nice and the representation of the results in RDF using PROV is even nicer.

- The statistics (section 6) are clear and useful.

The shortcomings of the paper:

- The paper does not discuss in detail to what extent the heterogeneity challenge is solved. Section 2 on data modeling only describes the mEducator data schema. It does not discuss the consequence of these modeling decisions. For example, could all original metadata fields be represented or is information lost? What was the cause of the heterogeneity in the first place? What were the main outcomes of the studies of existing metadata standards mentioned in section 2.1?

- The effect of the enrichment step, in terms of heterogeneity, is unclear (section 2.1). How many additional links between documents can be made by the enrichment?

- The paper describes two applications that use the data, but the paper does not show how "real world" use cases can be solved with these applications, or by the dataset in general.

- The topics of the different repositories remains unclear. Therefore, it is difficult to get a feeling for the heterogeneity problems and potential new usage scenarios that are enabled by linking the datasets.
