Improving Quality in the Publication of LD

Tracking #: 3551-4765

Authors: 
Alex Randles
Declan O'Sullivan

Responsible editor: 
Katja Hose

Submission type: 
Full Paper
Abstract: 
A significant proportion of Linked Data (LD) is created through mapping of data from a variety of data sources. These mappings define the transformation rules from non-graph-based data into graph-based Resource Description Framework (RDF) data. The definition of mappings is a complex and time-consuming task which is prone to errors. Oftentimes, the resulting linked data datasets have varying levels of quality. In addition, quality issues are commonly detected after the mapping artefact has been executed and the linked data has already been published. Quality issues in mappings can result in an exponential growth of issues in the resulting dataset, thus greatly decreasing overall quality. In addition, linked data has been described as highly dynamic in nature, with source data changing continuously, which could impact the quality of the linked data and related mapping artefacts. Changes which have occurred in the source data of linked data datasets should be propagated into the resulting dataset to provide an accurate representation of the underlying data sources. These changes can occur at an extremely fast rate, which can result in difficulties propagating each change in a timely manner. Surprisingly, despite the growth of linked data publication on the web of data, there exists no standard to address the dynamics of the data. An approach which captures changes in the source data used by mapping artefacts to create linked data datasets will help to address the dynamics involved in the publication process. In addition, capturing information detailing mapping quality and source changes in a machine-readable format will allow software agents to automatically process it and take appropriate actions to preserve the quality of mappings used to create the linked data datasets. It is argued in this article that addressing quality issues within the mapping artefacts will positively improve the quality of the resulting dataset that is generated. Evaluating an approach designed to improve and maintain declarative uplift mappings involved in the publication of linked data is important to provide evidence of sufficient usability. In addition, evaluation allows the requirements of the approach to be validated with domain experts. This article describes the evaluation of the Mapping Quality Improvement (MQI) Framework, which aims to guide linked data producers in producing high-quality datasets by enabling the quality assessment and subsequent improvement and maintenance of the mapping artefacts. The evaluation of the MQI framework and associated ontologies used diverse instruments and involved over 100 participants with varying levels of background knowledge.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Jose Emilio Labra-Gayo submitted on 10/Jan/2024
Suggestion:
Major Revision
Review Comment:

The paper tackles the important problem of improving the quality of the publication of linked data when this data comes from external sources and needs to be mapped.

The authors propose a framework based on two main components: mapping quality assessment and refinement, and source change detection. Both components are supported by two ontologies, MQIO and OSCD, which are published and, in my opinion, follow good practices for ontology publication.

The authors also present several evaluations to assess the quality of their proposal: a system evaluation, followed by accuracy and usability experiments and two real-world use cases.

In my opinion, the problem addressed by the paper and the framework proposed are interesting, and I think the authors have done good work. With regards to originality, as far as I know the contents are original; although some aspects have already been presented at conferences, I think the content of the paper is original. With regards to the significance of the results, I also think the results are interesting and significant. In my opinion, the authors missed some related work, which I point to later, needed to do a proper review of possible alternatives, but in any case, the framework presented and its evaluation are good.

The authors also provide a long-term URL for the ontologies they present, as well as a github repository for some of the results. Something I think the authors should improve is the use of many footnotes that refer to google drive documents, which may not be so stable. I suggest the authors include that content in a github repo or somewhere like zenodo/figshare to make it more stable.

In my opinion the paper contains several issues that need a major revision before publication.

In the following, I will comment on some of the issues that I found:

- Title: in my opinion, the title shouldn’t contain acronyms (although in our community most people know what LD refers to, that acronym may not be so well known in other communities or in the future). I also think the current title is too general, because the paper is mainly targeted at the publication of linked data which comes from external sources. I would just suggest something like: “Improving quality of linked data that comes from mapping external sources”.

- Some sentences in the abstract and the introduction are not falsifiable or are not academic enough to appear in a research journal. For example: “A significant proportion of Linked data is created through mapping data..”, what is the proportion? How can the authors know that? I would suggest the authors start with a more objective motivating sentence. In a similar way: “Quality issues in mappings can result in exponential growth of issues in the resulting dataset…” Is that true? Why exponential? Later in the abstract, the authors say: “These changes can occur at an extremely fast rate…”, what is “extremely fast”? How can you measure that or know that?

- I am not sure the meaning of the sentence “there exists no standard to address the dynamics of the data” in the abstract is clear… I would suggest the authors improve that wording.

- In the introduction, the authors include similar non-falsifiable sentences like “Linked data datasets are being published onto the web of data at an exponential rate with…”, are you sure? Do you have any reference showing that it is really exponential?

- In several parts of the paper, the authors repeat references… for example, in the introduction, the authors repeat references [3,9] in three places. This also happens in other parts of the paper, where I think the authors overuse citations, especially when those citations have been made previously. Another example is OSCD [31], which appears several times… I would suggest the authors review and clean up repeated references.

- Introduction: “due these changes…”

- In the first paragraph of the related work, the authors say that it is divided into “Approaches to support mapping quality assessment and refinement” and “Approaches to support mapping quality alignment”, and I think the titles of sections 2.1 and 2.2 should correspond to those titles. Section 2.1 does, but section 2.2 is “Dynamics of LD”; should it be changed?

- I am not sure if the authors are aware of the paper [1], which compares several mapping approaches and presents a usability study. The paper should probably take that reference into account in the related work because, in fact, one possible approach to increasing the quality of the mappings is to use better tools and languages, and ShExML could be a better tool which can solve some of the issues found with the mapping tools employed by the authors. Indeed, using a high-level and declarative language like ShExML, which could have static analysis tools, could probably be a future line of research in this field.

- “...which enable triples to be add/delete or suggest”, should it be “added/deleted” ?

- In section 2.3, the authors claim a lack of user testing, but as I pointed out before, there is a paper in this domain that includes usability testing, so that claimed limitation should probably be revised.

- I tried to access the document in footnote 3 and it says that I don’t have permission to do so… I suppose you need to open it up. In that sense, I wonder if it is really necessary to include all those documents as google drive documents; maybe, as the authors already included a github repository, it would be better to put all the extra material in that repository?

- Some of those google drive links are just a simple text page, would it be better to include them as annexes?

- In section 3.3, the MQIO contains references [18, 31]; I think reference 31 is not necessary for MQIO. Later, in section 4.1, MQIO is presented with references [15, 17, 18, 36]; why are all those references necessary for something that has already been presented in the paper?

- I am not sure I understand footnote 15; I don’t see the relationship with xsd:anyURI.

- “The data was famialiarized by the authors of by repeated..”

- I found sections 4.2, 4.3, 4.4, 4.5 and 4.6 very repetitive. I wonder if those sections could be shortened, as they include a bit too much detail about the usability studies, which is not really that interesting for a reader.

- I found section 5.1, about the possible use of SHACL and its replacement by SPARQL, interesting. It seems the decision is based on some technical limitations of SHACL, but I wonder if those limitations (for example, the use of blank nodes in the validation report) are just limitations of a specific SHACL implementation or are more profound. I also wonder if the authors considered ShEx, which, instead of a validation report, provides its results as shapemaps, which map each validated/non-validated node to its shape.
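For illustration, the blank-node behaviour can be checked directly; the following is only a hypothetical sketch using rdflib and pySHACL (node kinds in the report are implementation-dependent, since the SHACL specification does not mandate IRIs for validation results):

```python
# Hypothetical sketch: run a SHACL validation with pySHACL and inspect whether
# the sh:ValidationResult nodes in the report graph are blank nodes (they
# typically are, which makes individual results hard to reference across reports).
from rdflib import Graph, BNode
from rdflib.namespace import RDF, SH
from pyshacl import validate

data = Graph().parse(data="""
    @prefix ex: <http://example.org/> .
    ex:alice ex:age "not a number" .
""", format="turtle")

shapes = Graph().parse(data="""
    @prefix ex:  <http://example.org/> .
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    ex:AgeShape a sh:NodeShape ;
        sh:targetSubjectsOf ex:age ;
        sh:property [ sh:path ex:age ; sh:datatype xsd:integer ] .
""", format="turtle")

conforms, report, _ = validate(data, shacl_graph=shapes)
for result in report.subjects(RDF.type, SH.ValidationResult):
    print(result, isinstance(result, BNode))  # usually a blank node
```

A ShEx result shapemap, by contrast, pairs each focus node directly with its shape and conformance status, so no intermediate result node is needed.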

[1] ShExML: improving the usability of heterogeneous data mapping languages for first-time users, Herminio García González, Iovka Boneva, Sławek Staworko, Jose Emilio Labra Gayo, Juan Manuel Cueva Lovelle, PeerJ Computer Science, 2020, doi: http://dx.doi.org/10.7717/peerj-cs.318

Review #2
Anonymous submitted on 11/Feb/2024
Suggestion:
Reject
Review Comment:

This work proposes a framework called MQI (Mapping Quality Improvement) to support the identification and improvement of quality issues in declarative uplift mappings. The framework includes two ontologies, MQIO and OSCD, designed to represent mapping quality assessment and refinement information and source data changes, respectively. A lack of standard procedures to address dynamic changes in the data can result in difficulties in maintaining the quality and accuracy of Linked Data publications. The authors provide a comprehensive evaluation that includes usability experiments, an accuracy experiment, and expert validation experiments. The results achieved include the validation of the MQI framework with end users; 100 participants were involved in the evaluation.

The problem studied in the paper is interesting in my opinion. Moreover, the topic fits perfectly within the scope of the journal. The structure of the paper is clear, and the introduction describes the topic addressed in the paper well enough.
While the topic is undoubtedly relevant, the manuscript requires substantial improvements in its presentation.

I see a few negative aspects, which I describe below, and then provide details for each section.

A problem with this work is the total lack of justification for the proposed quality metrics for mappings, vocabularies or data. Those metrics are described only in a document outside the paper, a document that does not have a permanent location for those resources. So, if the authors decide to deactivate the link, those metrics become unavailable.

Furthermore, why were these specific metrics chosen, and what criteria guided their selection? The rationale behind the choice of these quality metrics remains unclear. Moreover, what provisions are in place for incorporating additional metrics if deemed necessary? Implementing a formalization to define the function of each metric could prove beneficial, facilitating a unified understanding and enabling clear visualization of the input and output generated for each metric.
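For example, a formalization along the following lines (a hypothetical sketch, not the authors' definitions) would make the input and output of each metric explicit:

```latex
% Hypothetical sketch: each metric m_i scores a mapping M against a vocabulary V,
% and an aggregate quality score weights the individual metrics.
\[
  m_i : \mathcal{M} \times \mathcal{V} \to [0,1], \qquad
  Q(M, V) = \sum_{i=1}^{n} w_i \, m_i(M, V), \quad \sum_{i=1}^{n} w_i = 1
\]
```

Additional metrics could then be incorporated simply by defining a new function and adjusting the weights.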

A third critical concern is the absence of a comparative analysis. A comparison with the authors' previous works could elucidate the evolution and novelty of the MQI framework. Additionally, benchmarking MQI against existing tools would offer insights into its strengths, weaknesses, and distinctive features, thereby establishing its novelty and relevance more convincingly.

Furthermore, the unavailability of the tool repository raises questions about reproducibility and accessibility, essential for validating and building upon the proposed framework.

RQ: To what extent can the detection of declarative mapping quality issues and source data changes facilitate the creation and maintenance of high-quality Linked Data (LD) datasets?

* Introduction
The introduction lacks clarity regarding the applied approach, abruptly introducing the two ontologies without adequately contextualizing their necessity alongside the framework.
What purpose do these two ontologies serve alongside the framework?
What are the scientific contributions? They must be listed explicitly.
Reference [5] seems to be more about completeness; why does it consider 27 metrics? [Please check]

Remove repetition: for instance, in the introduction the same references [3,9] appear three times in three consecutive sentences.

The related work section is generally acceptable, although I found an error in subsection 2.1, which states that approach [2] extends the Luzzu framework, when it is more likely the other way around: Luzzu came after Zaveri et al.'s work.

Instead of having section 2.3, I would prefer to integrate those limitations as an overview within sections 2.1 and 2.2, respectively. I'm uncertain if this approach makes sense.

Section 2.4 seems misplaced; it would be more appropriate to include it in the introduction.

Notably, the expression "is hoped the framework" should be avoided in a scientific paper. Instead, a statement like "we believe" would be more appropriate. Additionally, clarification is needed regarding the cause of the quality issues. How did they arise? Furthermore, in what way did the framework facilitate the validation of the design? What design is being referred to here?

* Evaluation Strategy
The MQI framework incorporates multiple methods, but what exactly are these methods and why are multiple methods necessary? While the inclusion of multiple metrics is understandable, clarification is needed regarding the variety of methods utilized.

Additionally, there is a repetition of the phrase "uplift mappings used in the publication of linked data." It remains unclear how the mapping quality aspect differs from data quality or vocabulary quality aspects. For instance, considering the metric "Valid term type definition," it's essential to understand who performs this task and when it is executed. Is it done manually or automatically?

The paper presents a list of mapping quality issues that would benefit from further explanation. For example, when encountering an issue like "A class defined in the domain is not included in the mapping," how is this problem identified and addressed? What is the process for resolving such issues?
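For illustration, a check like that could in principle be automated with a query over the mapping and the target ontology; the following is only a hypothetical sketch (rdflib, assuming the R2RML constant-shortcut form and example file names), not the authors' implementation:

```python
# Hypothetical sketch: flag triples maps that use a predicate whose rdfs:domain
# class is never asserted via rr:class on the subject map.
from rdflib import Graph

graph = Graph()
graph.parse("mapping.r2rml.ttl", format="turtle")   # the mapping document (example name)
graph.parse("ontology.ttl", format="turtle")        # the target vocabulary (example name)

query = """
PREFIX rr:   <http://www.w3.org/ns/r2rml#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?triplesMap ?predicate ?domainClass WHERE {
  ?triplesMap rr:predicateObjectMap/rr:predicate ?predicate ;
              rr:subjectMap ?subjectMap .
  ?predicate rdfs:domain ?domainClass .
  FILTER NOT EXISTS { ?subjectMap rr:class ?domainClass }
}
"""
for row in graph.query(query):
    print(f"{row.triplesMap}: {row.predicate} used without rr:class {row.domainClass}")
```

Stating explicitly whether such checks run automatically, and whether their resolution is manual, would answer these questions.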

Regarding refinement, what level of expertise is required to address the quality issues identified?

Finally, in the "Select from suggested values" process, what is the time required to choose the correct value? Clarity on these points would enhance the understanding of the evaluation strategy.

* Experiment 6
In the context of auxiliary usage of the mapping quality assessment and refinement component of the framework, where inconsistencies in ontologies are detected, it is important to clarify the process of identifying and resolving such issues. The identification of problems where a property should have been defined as an objectProperty instead of a datatypeProperty can be attributed to a combination of manual inspection by domain experts and automated support provided by the system.

To enhance clarity in the discussion, it would be beneficial to provide more details on how the system supports the identification of such issues, the specific checks or validations employed, and the collaboration between automated processes and manual inspection by experts.
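For example, one possible automated check (again only a hypothetical sketch, not necessarily what the authors implemented) would flag properties declared as owl:DatatypeProperty in the ontology but mapped so that they produce IRI objects:

```python
# Hypothetical sketch: find predicates typed owl:DatatypeProperty whose object
# maps produce IRIs (rr:termType rr:IRI or a join to another triples map),
# suggesting they should have been declared owl:ObjectProperty instead.
from rdflib import Graph

graph = Graph()
graph.parse("mapping.r2rml.ttl", format="turtle")   # example file name
graph.parse("ontology.ttl", format="turtle")        # example file name

query = """
PREFIX rr:  <http://www.w3.org/ns/r2rml#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?predicate WHERE {
  ?pom rr:predicate ?predicate ;
       rr:objectMap ?objectMap .
  ?predicate a owl:DatatypeProperty .
  { ?objectMap rr:termType rr:IRI } UNION { ?objectMap rr:parentTriplesMap ?parent }
}
"""
for row in graph.query(query):
    print(f"Possible misdeclaration: {row.predicate} is owl:DatatypeProperty but maps to IRI objects")
```

Describing which of these checks are automated and which rely on expert inspection would make the division of labour in the use case clear.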

Regarding the discussion of the two use cases in terms of efficiency and effectiveness, it would be relevant to analyze and compare aspects such as the time taken for mapping quality assessment and refinement, the accuracy of issue detection and resolution, the impact on data quality, and the overall performance of the framework in each use case scenario. By evaluating the efficiency and effectiveness of the framework in different contexts, researchers can provide valuable insights into its practical utility and performance across diverse applications.

Typos need to be checked throughout the paper
* Table 1: check "Experient 1" (presumably "Experiment 1")

In its current state, I am not confident it can be published. I hope the authors can take my comments into consideration.

Review #3
By Ben De Meester submitted on 13/Feb/2024
Suggestion:
Major Revision
Review Comment:

- originality: partially, but it is not clearly described which parts are original and which aren't
- significance: reasonable, impact does not seem to be huge (many 'it is hoped' statements in the conclusion but no explicit impact statements) but there's impact nonetheless, evidenced by real use cases
- quality of writing: fluctuating, some sections are good, others are poorly structured and hard to follow
- resources URL: good, some minor improvements (especially concerning google drive links) are expected

### High-level review

As the paper is written now, it reuses substantial pieces of text from published work,
without explicitly notifying the reader about it.
As it is written, it even feels as if this self-plagiarism is actively hidden away from the reader.
For instance, the authors' previous version of their work was called MQV, whilst their extended version is called MQI. This is nowhere explicitly stated in this SWJ paper (MQV isn't even mentioned). I listed a couple of duplicated passages below (initial page numbers are those of the SWJ paper under review), but I'm pretty sure there are more (some related work paragraphs are also copied, but I don't mind that).

https://doi.org/10.3233/SSW220006 evaluates MQV
- p5-6 is almost verbatim from https://doi.org/10.3233/SSW220006 p25
- p12 is almost verbatim from https://doi.org/10.3233/SSW220006 p26
- p13-15 is almost verbatim from https://doi.org/10.3233/SSW220006 p29-32
https://openreview.net/pdf?id=R4LtDKSj6Fb evaluates MQI
- p15-18 is almost verbatim from https://openreview.net/pdf?id=R4LtDKSj6Fb p8-12

I totally understand that journal papers include existing published works to be self-contained (i.e., a delta of 1/3 is enough),
but I would expect this to be explicitly stated.
Without such explicit statements, this personally feels a bit unethical.
I believe in giving the benefit of the doubt that it was not the intention to mislead the reader, and believe that the journal paper can contain a sufficient delta.
For the next revision, I expect an explicit contextualization of previous work to be integrated, very clearly described and well argued.
Other high-level remarks:

- The structure of your introduction needs a major revision (and the broad title doesn't help; I would also suggest revising the title). It's unclear what the actual contribution is, but the paper reads as if it's both an approach and an evaluation (whilst the conclusion only mentions the evaluation, so I'm confused). I have the feeling you should more properly introduce the rationale behind the approach (MQI, OSCD and MQIO) before introducing the evaluation. Now it's backwards. After reading the introduction, it should be clear which specific problems the MQI should solve, and how your evaluation plan properly evaluates those specific problems. Right now, it's unclear why you need, e.g., a usability experiment: it didn't read as if being able to _easily_ assess quality was the problem here.
- The argumentation for the need for a user study for mapping quality assessment and refinement is lacking. Neither the introduction nor the related work provides sufficient arguments to clarify the importance of this need. Also, it's very unclear what kind of user evaluation is needed and why: I would expect a usability study, but that's not clarified.
- Related work is generally poor: [2] is very poorly summarized, references use wrong DOIs. This needs a thorough revision.
- Presenting your contributions as part of the related work section makes very little sense in my opinion.
- Listing of Quality Metrics at https://drive.google.com/file/d/1vCxcCK0BMuG4ujwSMnulyea6TPTkrT-J is not accessible
- You are referring to google drive links for tables. This isn't sustainable. You could just as well include them as appendices.
- same for https://forms.gle/FMzH9fmFcyyKi5AH6 : this is a live form. Add the actual questions as an appendix.
- section 3 gives almost no arguments but just describes components and functionalities without providing any rationale. I see very little academic value in this section, certainly not enough to prove this is a contribution
- 3.3 and 3.4 introduce an ontology (a contribution), but nowhere is mentioned how this ontology was created, which method was followed, etc.
- "The results indicated sufficient ability to detect quality issues as a total of 228 quality issues were detected in the mappings." --> This is not an accurate statement: Would you say this is sufficient if there turned out to be 2000 quality issues? I expect metrics such as precision and recall to validate such claims
- The evaluation method is a contribution, but there's no assessment that this method is in fact valuable outside the presented work. A threats to validity analysis should be the very least to be included.

### Detailed review

#### Introduction

- Please revise your first sentences: first of all, LOD cloud statistics do not suggest an exponential rate at all; it has rather remained almost stationary since 2017. Also, the number 1200 is at least a year out of date, so I would timestamp your claim. Also, the LOD cloud allows for publishing dataset _metadata_, not actual data.
- I would start a new paragraph at "Data quality is often referred to as “fitness for use” [1,2] ..."
- "Interestingly, metrics such as undefined terms (54%), incorrect domain/range (60%), licensing (11%) and basic provenance (12%) scored worse." Why is this interesting?
- You state "Research [5] [demonstrated poor LOD cloud quality levels]. More recent research [4,5] [demonstrated LOD cloud quality levels remained poor]". How can [5] be more recent than [5]?
- p2 column 1: You mention a lot of times [3,9] as reason for your claims. But these are your own works. I'm pretty sure [21] makes the same claims. So you could add [21] to [3,9], and thus in my opinion raise the believability of your claims since you're not only basing them on your own work.
- In general p2 column 1 is a huge paragraph with subsequent 'in addition' statements. Meanwhile, you're making your most important points in this paragraph. Please revise
- "Freshness relates to the age and occurrences of changes in data and has been described as one of the most important aspects of linked data quality [12].": [12] does _not_ mention linked data/rdf/semantic web in any way. The wording is very misleading and must be improveed. https://www.semantic-web-journal.net/system/files/swj773.pdf (for example) provides some argumentation on why timeliness is important (p23 column 1), and I'm sure version-aware RDF publications such as Ostrich, TailR, R&Wbase, Memento, or LDES will provide good argumentation why freshness is important when publishing RDF.
- "This article describes an evaluation which was undertaken to validate an approach designed to improve the quality of mappings used in the linked data domain, in addition, to resolving identified state of the art limitations." clarify: is your contribution only the evaluation, or also the approach? In both cases, the title should be improved to better cover the actual content of your work (it is too general at this moment)
- "Usability testing, a method for collaboration between computer scientists and domain experts [13], was used to iteratively refine and validate the design of the developed MQI Framework [9,10,14–16] and two associated ontologies: Ontology for Source Change Detection (OSCD) [10,14] and Mapping Quality Improvement Ontology (MQIO) [17,18] were designed to facilitate the generation of high-quality mappings." --> revise this sentence, it's too dense and unclear phrasing
- properly (re-)introduce the MQI acronym, you only did that in the abstract

#### Related work

- please rework the introductory paragraph so that referring to section 2.3 comes after introducing 2.1 and 2.2. And if you refer explicitly to one section, refer explicitly to all
- "The state of the art in mapping quality frameworks for linked data has been reviewed. Evaluating the quality of linked data tools with potential end users will demonstrate the usefulness of the design [6]." --> The design of what? Also, [6] in no way (nor R2RML nor the disjoint DOI) back up this claim. As it feels this claim is very important to claim your user study as a contribution, I would expect much more (and more accurate) argumentation for this.
- your "Mapping Quality Assessment & Refinement" section structure is backwards: start with introducing the SotA, thén provide discussion. Now you start with "none of the approaches described have published an evaluation", whilst you haven't described any approach yet. Also, please describe each approach consistently: if you mention user studies for EvaMap, mention them also for Luzzu/ResGlass/...
- "The approach presented in [2] extends an existing linked data quality assessment framework named Luzzu framework [20]." --> That's a very inaccurate summarization of the survey. The survey itself in fact does not extend anything. It... surveys existing work. And at the time [2] was published, [20] was not. So this is very poor phrasing at the least. Did you mean to refer to something else than [2]? Also the remainder of the section seems to refer to [20] as topic and not [2], but then the topic of the first sentence of this paragraph should have been [20], not [2].
- "Luzzu generates two machine-readable reports, however, the problem report is the focus of the work" --> if there are only 2 types of reports, please describe both instead of 1.
- "However, there was certain cases where ontologies could not be retrieved and queried" --> so what?
- "within the rules used to generate linked data datasets" --> given the paper's focus on mapping rules, I would expect consistently used terminology, properly introduced in the introduction. Personally, I would not use "rules used to generate linked data datasets" but rather "RDF mapping rules": Linked Data assumes that the used URIs are dereferencable, whilste (R2)RML etc. only generate RDF, they don't publish RDF.
- "The inconsistencies are detected using a rule-based reasoning system [9]" --> [9] is not the right citation for this
- It's unclear that "Approaches to support mapping alignment" is meant by the section with title "Dynamics of LD"
- Similar to above, your "Dynamics of LD" section structure is backwards
- I have the feeling you should have a 'background' section in your related work section; now you introduce R2RML at p3 bottom-right but already mention it at p2 middle-left. Currently the related work section reads quite unstructured.
- "which models source data changes instead of resources" --> it's unclear what you mean by this
- DELTA-LD is introduced as both an approach and a change model.
- sparqlPuSH is 14 years old: is this still relevant? nothing newer that does something similar? If not: do you have any proof that this is in fact relevant SOTA?
- "While approaches exist to support the improvement of mapping quality and maintenance." -> that's not a sentence
- "The evaluation strategy applied to MQI framework includes multiple methods, metrics and participants" --> the strategy includes participants?

#### Design and Implementation

- I would suggest painting the broad picture first before diving into the components: how does source change detection relate to mapping quality assessment: are both needed at the same time, or is source change detection something you do after the quality assessment?

#### Overview of Evaluations

- footnote 10 is not accurate at all. Using this as a reference for your work makes me doubt the rigor of your work in general.
- 4.2.1 it's not clear what these lessons entail: are these limitations of the presented contributions?
- Although via drive links, all evaluation data and results are well-provided
- "It was estimated based on manual examination that resolution of these quality issues potentially positively impacting 1750 triples" --> improve grammar

### Minor

- Some wording in the paper is a bit too subjective for me, making claims that are not really substantiated. E.g. the abstract begins with 'A significant proportion', but that wording is no longer in the introduction, nor is that claim argued for. The introduction begins with "Linked data datasets are being published [...] at an exponential rate": is it in fact exponential? Do you have a reference? I would suggest trying to revise those kinds of statements into a more neutral tone
- [7] and [53] are duplicate references
- I prefer consistent Oxford comma usage (but feel free to ignore this comment :) )
- The DOI of [6] does not relate to the R2RML spec.
- "The work uses YARRRML [8] mappings, which are a human readable representation of RDF mappings" --> "The work uses YARRRML [8] mappings, which are a human readable representation of **RML** mappings [7]"
- "An evaluation has not been completed on the framework from what we have seen published." --> you mean a usability evaluation?