DCQE: A RDF Dataset Quality Evaluation Mechanism for Decentralized Systems

Tracking #: 1985-3198

Li Huang
Zhenzhen Liu
Fangfang Xu
Jinguang Gu

Responsible editor: 
Jens Lehmann

Submission type: 
Full Paper
The current decentralized system has developed rapidly, especially with the development of blockchain technology. The quality evaluation of RDF data sets in the decentralized system has also received extensive attention. Therefore, from the perspective of data quality evaluation, this paper proposes a RDF data quality evaluation model in decentralized environment, and points out the new dimension of RDF data quality. The blockchain is used to record the data quality evaluation results and the update plan of the quality evaluation results is designed in detail. Finally, the feasibility of the above system is verified and the quality evaluation model is verified. The purpose of this paper is to study how the decentralized system can provide users with better cost performance when the knowledge is independently protected. This paper named this scheme DCQE.

Solicited Reviews:
Review #1
By Damien Graux submitted on 11/Sep/2018
Review Comment:

[Groups like "at page XX line YY column (left|right)" will be shortened using the following pattern pXXlYY(l|r).]


This article aims at tackling the problem of data quality in the context of RDF datasets loaded in a decentralized system thanks to the use of the Blockchain technology. The authors named their system DCQE. The article is structured in 5 sections. The authors first present the general context of their study in the Introduction, citing related articles dealing with data quality for RDF datasets. They then describe their quality evaluation model in Section2 before presenting their system in Section3. They finally present some experiments in Section4 before concluding in Section5.

Nowadays, the evaluation of RDF data quality is a hot topic since more and more RDF datasets are available and prone to change (dynamicity of the triples which can often be updated). This article proposes a novel approach to offer a data quality evaluation strategy in the context of a decentralized architecture by using the Blockchain to record transaction information and quality evaluation result information.

Major Comments

I have several *big & major* concerns:

I/ The paper is not self-contained.
> Indeed, it is hard to follow for readers who do not have prior knowledge of the Semantic Web, Decentralized Architecture and Blockchain Technologies. I admit that the Semantic Web Journal mainly deals with semantic-web oriented readers; nonetheless, key concepts could be recalled briefly, e.g. RDF and SPARQL (which are so far not cited in the paper)… More generally, a preliminary/background section is missing to recap what RDF and SPARQL are, to describe the decentralized strategy and, *most important*, to present the blockchain technology, which is not yet obvious to everybody.

II/ There are oddities in the formulæ presented in Section2.
> My concern mainly deals with formula number (2) and, as a consequence, with all the formulæ which involve it. The authors define the "number of subject average attributes" as VP=1-SPO/DS where SPO and DS are respectively the "number of triples" and the "number of unique subjects". My problem with this formula is that its values lie in ]-∞;0], which then leads to strange results (e.g. QRDF could be lower than zero)… Indeed, let's consider the two extreme cases:
- the dataset contains k triples with k different subjects, thus SPO/DS=k/k=1 and then VP=0.
- the dataset contains k triples which *all* have the same subject, thus SPO/DS=k/1=k and then VP=1-k, which has -∞ as a limit when k goes to infinity…
As a consequence, the following sentence "The larger the value, the more data sets use the triples to describe the subject." (see p2l17r) is false!
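
The two extreme cases can be checked numerically with a minimal Python sketch. It assumes that formula (2) is exactly VP=1-SPO/DS as quoted above; the helper names and the toy triples are hypothetical and are not taken from the paper or from DCQE.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def subject_average_attributes(triples: Iterable[Triple]) -> float:
    """Return VP = 1 - SPO/DS, with SPO the number of triples and
    DS the number of unique subjects (formula as quoted in the review)."""
    triples = list(triples)
    if not triples:
        raise ValueError("VP is undefined for an empty dataset (DS = 0)")
    spo = len(triples)
    ds = len({subject for subject, _, _ in triples})
    return 1 - spo / ds


def all_distinct_subjects(k: int) -> list:
    """Toy dataset: k triples, each with its own subject."""
    return [(f"s{i}", "p", f"o{i}") for i in range(k)]


def single_subject(k: int) -> list:
    """Toy dataset: k triples that all share one subject."""
    return [("s0", f"p{i}", f"o{i}") for i in range(k)]


if __name__ == "__main__":
    for k in (1, 5, 1000):
        print(k,
              subject_average_attributes(all_distinct_subjects(k)),
              subject_average_attributes(single_subject(k)))
```

Under this reading, the first column of results stays at 0 while the second decreases as 1-k, so a dataset that describes its subject with more triples receives a lower value, not a larger one.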

III/ There is no Related Work section!
> The only paragraph discussing briefly some previous studies is located in the Introduction. In my opinion, such topics like Data Quality and Decentralized Systems (for RDF or not) should be presented apart to properly present how the current study is providing novel aspects to research.

IV/ The system is not presented completely.
> The section describing DCQE is, in my opinion, suffering from a lack of details. First of all, it would be interesting to have access to the code of the system if it is open source, and if it is private then I would have liked a justification. Indeed, having access to the system's code would allow reviewers to glance at the project and to test it (even to reproduce the experiments, see V/ in this regard).
> In addition, the system description is too high-level and should be more detailed, for instance through several examples explained step-by-step.
> Finally, I also have some specific remarks such as:
- What is the query language used, I suppose SPARQL?
- What is the decentralized system used? Is it for instance IPFS?
- The authors do not seem to take monetary aspects into consideration because "price factor is different in different systems"; it would have been great to have a comparison between several of them…

V/ The experimental section must be reset completely in my opinion.
> First of all, none of the experiments are reproducible by readers of the paper, which is to me really problematic. (This remark should be considered together with the remarks dealing with the sources of DCQE - see prior.) For example, the authors declare "The experimental data sets use the ArchiveHub data set." but without providing a citation, a reference or a footnote; I typed this in my favorite search engine and was not able to find any relevant websites… Along the same lines, I was not able to understand properly how the test protocol was set up, nor, for instance, how the updates are done. The authors say they are using "100 queries"; even though the query language is not specified here, it would also be interesting to be able to see those queries.
> Second, the authors test their system using a dataset of 431,088 triples, which is in my opinion not enough to reveal performance bottlenecks, since they ran their experiments on a computer having "16GB 2133MHz LPDDR3 memory".
> Third, using only one computer to test a *decentralized* system is in my opinion a weakness, especially since authors did not provide a fair description of how the decentralized system is working.

VI/ The quality of the paper should be improved.
> Some sentences are hard to follow and I could find some typos; in addition, there are some figures which are not referenced in the text, and some others are hard to read because of the font size…

Minor Comments

Please find here some minor remarks (in comparison to my first 6 ones):

* The Introduction could be better motivated thanks to a use-case or an example which could be described.
* More generally, the claims of the paper could be stated more clearly in the Introduction.

* This Section suffers from a lack of examples during the description of the concepts to help readers understanding what's happening.
* The first paragraph of Section2 describes the restriction on which the paper will then focus. I'm a bit disappointed that at the end of the discussion/paper, the "node service quality" is not considered again, even briefly.
* In p2l23r, what is the "RDF medical report", is it something new or could the authors provide a citation for this concept?
* In p2l24r, a "certain number", what could be an estimation?
* In p2l51r, "data updates" and "modifications" should be better introduced and formalized.
* In table1, maybe the third column could be removed.
* In (4), what happens if it is always the same triple which is modified, does it mean that Verifiability goes to +∞?
* In (6), what is "Eachother" since it hasn't been introduced?

* "DCQE" -> what does it mean? Why such a name?
* I do not see the interest of Fig1.b in the rest of the paper development…
* The title of Fig.2 isn't perfectly matching the content since there is no "Blank Node" written in the figure itself.
* More generally, it's confusing to use "Blank Node" to name a "temporarily creating node" (see p5l33l) since there also exists a concept of blank nodes in the RDF specifications.
* I could not find in the text a reference to Fig.3.
* In p5l38r, what are those "previous studies" the authors are referring to?
* Section3.3 would require an example to make the understanding easier.

* The authors seem to use SPARQL as a query language; however, I couldn't find any occurrence of the term "SPARQL" in the article, and the canonical citation associated with it is missing too.
* The authors have set k1=k2=k3=k4=1; why such a choice?
* In table3, what are the "masters"?
* In table3, what does the 6th column, which has no title, represent?
* In p8l35r, "each node generates its own RDF entity record table"; it would have been interesting to see examples of such record tables.
* In p9l13l, "Different systems" -> which ones?
* In p9l23l, the authors said that "51 percent of attacks can still be launched"; I think they wanted to say that the 51%-attack can still be launched. This also means that it would be good to have a reference pointing to a description of this attack, since it is really specific to the Blockchain area.
* In Section4.2, Fig.8 is in the end not really described.

* In p10l18l, the authors mention "new dimensions"; it would be interesting if they provided some of them.
* In p9l38r, what is the typical size of those "backups"?

* [3] does not give the journal, the conference or the book where it has been accepted
* [13] has some problem with the encoding of special letters


Here are some typos I found:
- title: "A RDF Dataset"->"An RDF Dataset"
- p1l51r: "de centralized"->"decentralized"
- p7: Fig.7 "hash0-3"->"hash1-0"?
- p7l35r: is it "physical" or "medical"?
- p7l46r: "verify the verifiability"
- From p8l39r to p8l48r: this paragraph is hard to follow.
- p9l28l: "there are 5 common attributes"; no, there are 4 according to the line before, i.e. 9, 5, 31 and 7.
- p9l29r: "Descrip"?

Review #2
By Anisa Rula submitted on 01/Oct/2018
Review Comment:

The work describes a data quality evaluation model for decentralized systems and introduces new quality dimensions that are related to the new environment. The authors propose the usage of blockchain for managing the quality assessment results of several datasets included in the system because it ensures the truth of the results over a decentralized network structure.

This manuscript was submitted as 'full paper' and I will review along with the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Regarding the writing of the paper, there is room for improvement, such as rephrasing sentences and correcting grammar errors. Moreover, from the scientific point of view, the terminology should be coherent and needs to be revised. I will give some examples in the Minors.

The work is on topic and interesting to the Semantic Web community. However, I have a few concerns about the clarity and the originality of the paper which makes the paper difficult to accept.

First, I haven't found a related work section in this paper and, related to that, I saw a very limited number of citations in the paper. There is no in-depth comparison with the state-of-the-art approaches.
Second, I would suggest motivating better why the quality assessment of decentralized systems is so relevant. Could you explain whether this can be of benefit to any practical use case or to more than one application? This can also lead to a definition of the problem that you are trying to tackle, which would be of great value in a journal paper.
Third, there is no clear definition of the main concepts, like a triple, a blank node and other RDF terms. I know that these things are already defined in this community, but I am afraid that some terminology here is confusing. What do you mean by the attributes of the subject? Suddenly, I found "The RDF medical report model".
Fourth, but no less relevant, I didn't see clearly from the introduction what the contributions of this work are.

Regarding the experiments, there are missing explanations that may seem clear to the author but not to the reader. I don't understand which criteria are used for splitting the dataset. Why 6 copies? Why do you replicate the information in AH1 and AH6? Why do you simplify by choosing all ki = 1? According to which criteria are these parameters adjusted?
Then, the experiments continue with something like "verify the verifiability of the model". What is the verifiability of the model? How do you measure it?

Minors (only a short list of examples):
*DCQE -> can you provide a better explanation of this acronym? It is not so intuitive.
*rephrase -> Compared to RDF data quality evaluation in decentralized systems, previous RDF data sets must publish RDF data sets to the internet for sharing, and their quality evaluation is expensive to maintain and potentially contaminate Internet data
**sharing what?
**"contaminate Internet data" -> is not an appropriate scientific way of describing this problem.
*decentralization or de centralization?

Review #3
By Amrapali Zaveri submitted on 12/Nov/2018
Major Revision
Review Comment:

(1) Originality
The article “DCQE: A RDF Dataset Quality Evaluation Mechanism for Decentralized Systems” proposes using blockchain technology for performing quality assessment in a decentralized setting. The authors describe the design of a quality evaluation model, the system design and the results of an experiment.

Even though the idea of using blockchain for data quality assessment, specifically in a decentralized setting is new, the overall aim of the paper is unclear. It is proposed to use blockchain technology to store quality evaluation so that “the centralization effect of the authority can be reduced” - but why is that important?

Also, in the abstract it is mentioned “and points out the new dimension of RDF data quality” - which is this new dimension? Additionally, the mention of “and update the plan of the quality evaluation results is designed in detail” is unclear. In the conclusion, it is mentioned that this paper “discusses how to evaluate and update RDF data in the context of the rapid development of Semantic Web and decentralized systems”. However, updating RDF data is not discussed in detail, nor is it the focus of the paper.

Moreover, a discussion of related work is missing, which should highlight current data quality assessment methodologies and their possible drawbacks to further motivate the use of blockchain technology.

(2) Significance
In a decentralized setting and with use cases where there might be tampering with data quality evaluations, the proposal to use blockchain technology is justified. However, the significance of the proposed system, method and experiments is marginal.

The authors mention that there is a need for “authoritative central agencies” - but where are there currently trust issues? I would like to see a strong real-world case for using blockchain in Linked Data quality assessment versus how it is currently being handled. The second reason mentioned is so that the quality results are not tampered with. Where is the proof that currently they are being tampered with? The third argument is that this will provide users with better cost-effective results. Where is the evidence that it is expensive now?

Then the second contribution is that the authors “design and implement a quality reporting model for RDF data”. I don’t see a “model” per se, and what about the existing W3C RDF Data Quality Vocabulary https://www.w3.org/TR/vocab-dqv/? The third contribution is the set of proposed metrics. Why are only those metrics listed in Table 1 chosen? What about the other metrics listed in reference Zaveri et al.?

Here are further questions that arise while reading the manuscript, which need to be clarified either with more explanation, adding sufficient evidence via references and/or by providing examples:
- “have greatly been improved” - add references backing this claim
- “The quality of an RDF data set means the correctness and availability of data.” - are “correctness and availability” the only two dimensions? See reference Zaveri et al. for more dimensions
- “RDF data quality evaluation has been favored by many researchers” - what is meant by “favored”?
- “five types of Linked Data quality evaluation principles.” - which are those?
- “Many domains use RDF data structure for transaction processing, so it is very important to carry out RDF quality evaluation in different fields.” - the argument is unclear
- The metric ‘subject average attributes” is unclear. The argument that having more attributes in a dataset makes the knowledge more complete is not necessarily true in all cases. This claim needs an example and evidence.
- The proportionality mentioned at the end of Section 2.1 needs to be explained with examples and numbers.
- In Section 2.3, the sentence “In contrast, part of the entities, all nodes are owned and the contribution of the entity in the node is smaller” is unclear
- The “contribution model” needs to be explained by a real-world example
- The “RDF medical report” column in Table 3 is unclear and the last column needs a heading
- In Section 4.1.1, the selection of the specified URIs in Statement 1 and 2 needs to be explained further
- In Section 4.1.1, which is the statement 5.2 that is mentioned?
- How and what exactly is the user feedback that is added?
- Section 4.2 needs to be explained in further detail with an example

(3) Quality of writing
The paper needs to be proofread by a native English speaker. I list the errors I encountered throughout the paper:
“the new dimension” - “a new dimension”
“The blockchain” - You haven’t mentioned it before so “The” should be removed (In the Abstract and Introduction sentences)
“RDF” add the full form and reference at its first occurrence
Add references for “applications that use RDF as a data framework”
“Zaveri et al. [5] summarized more than a dozen articles” - there were 30 core articles that were part of this survey
“Systemically speaking,” - please rephrase
“de centralization” - remove space
“system explode” - “the availability of decentralization systems exploded”
“Many domains use RDF data structure for transaction processing,” - please add references
“some symbols” - “a list of symbols”
In Table 2, “The only number of” - should be rephrased
“Relatively, Uniqueness” - “Uniqueness”
Please add full form of DCQE at its first occurrence in the paper
“an blank” - “a blank”
“needs to be recertification” - “needs to be recertified”
Avoid starting a sentence with “Because”
Please explain “Merkel tree” and add a reference
Please provide a link to the “ArchiveHub” dataset and describe it in brief
In Section 4.1.2, the second paragraph, first sentence needs to be re-written.
“Descrip” - “Description of”
“in detail” - “is provided in detail”
All the references need to be formatted correctly
The paper mentions a reference to Zaveri et al [5] but this is missing from the reference list