Review Comment:
This paper is very poorly written, both in terms of English and structure, but I believe there is some really interesting work backing up the article. However, the two-repository limitation implies that the approach cannot be generalised to an arbitrary set of distributed data sources.
If I have understood the paper correctly, the authors have developed an approach to improve the execution of distributed queries. The minimally connected graph for the BGPs (basic graph patterns) in the query is computed using a graph-theory result involving eigenvectors. However, the paper should be substantially rewritten to explain the approach the authors have developed. Figure 2 is given to demonstrate the approach, but it is not explained what the nodes and edges represent, nor why some edges are drawn with a bold line.
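If the result in question is the standard algebraic-connectivity result, it would help to state it: for a graph with adjacency matrix A and degree matrix D, the Laplacian L = D - A has second-smallest eigenvalue lambda_2 > 0 if and only if the graph is connected, and the corresponding eigenvector (the Fiedler vector) indicates how the graph partitions. I am guessing here, which only underlines how much an explicit explanation (and citation) is needed.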
The authors claim that their approach would work for arbitrary queries in any domain, but have only demonstrated it within the life sciences (this is not a major problem). However, the queries behind the current implementation have not been presented, nor sufficient details of how they are then expanded to span all the available data sources.
The approach seems to depend on mappings to a global schema, and it appears to be from these mappings that the minimum spanning graph is created. If this is the case, the authors should say so explicitly.
Listing 2 shows a query that is the output of the expansion process, but the input to that process is unclear: what is the specific form of the query before it is expanded to cover all the datasets? Do the instance-level mappings take into account the differences in chemical structure between datasets?
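To make this concrete, I would expect the paper to walk through one example pair: the query as posed against the global schema, and the federated query produced by expansion. Something like the following (the endpoints and vocabulary here are my own invention, purely to illustrate the kind of listing I am asking for):

    # Input: query over the global schema (hypothetical vocabulary)
    PREFIX gs: <http://example.org/global-schema#>
    SELECT ?target WHERE {
      ?drug gs:name "aspirin" .
      ?drug gs:hasTarget ?target .
    }

and its expanded form:

    # Output after expansion: source-annotated federated query (SPARQL 1.1)
    # Endpoint URLs and predicates are placeholders, not the paper's actual sources
    PREFIX db:  <http://example.org/drugbank-schema#>
    PREFIX c2b: <http://example.org/chem2bio2rdf-schema#>
    SELECT ?target WHERE {
      SERVICE <http://example.org/drugbank/sparql> {
        ?drug db:name "aspirin" .
      }
      SERVICE <http://example.org/chem2bio2rdf/sparql> {
        ?drug c2b:target ?target .
      }
    }

Showing how the mappings drive the rewriting from the first form to the second would answer most of my questions about the input.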
The presentation of the experimental evaluations is very poor. They do not provide sufficient details of the experimental setup. That is:
- What is the purpose of the experiment?
- What data sources, and specifically version of data sources have been used?
- What queries have been used as input?
- What are the dependent and independent variables?
- What is the baseline that you are comparing against? That is, how do you know if you are improving over the current state of the art?
In particular, Table 3 presents numbers from two different translations of DrugBank. It is not surprising that Chem2Bio2RDF does not contain as many targets, since it is derived from a very old version of DrugBank (~2009).
Figure 3 presents a line graph for discrete data points, which is not appropriate. The caption should also explain the ordering that has been applied. I am also unclear as to which line represents your system and which substances have been used.
How does the runtime of the whole expansion process compare with the time taken to execute the query itself? And how do the answers generated compare with those of other integration systems, i.e. are the answers correct?
The SUS evaluation is meaningless for the main contribution of the paper, i.e. it does not evaluate the distributed query processing engine. I would expect to see an evaluation that investigates:
1. The efficiency of the distributed query processing engine: by centralising the datasets and comparing against systems such as FedX or DARQ on the speed of result response, thereby eliminating network delays. It would also be good to perform the same experiment over the remote endpoints and compare.
2. The correctness of the answers generated.
There are a variety of typos and missing citations throughout the paper, too numerous to list.