Review Comment:
This paper discusses how a SPARQL execution engine can be implemented on top of Façade-X to answer Façade-X queries. Façade-X specifies how to map data in different formats (e.g., CSV, JSON, HTML, etc.) as if they were RDF. The paper studies two strategies: a complete materialization strategy, where the data is fully transformed to RDF to answer a query, and a sliced materialized view strategy, where the data is segmented and RDF views are generated for each segment. Both strategies are optimized by filtering out the triples which do not match the triple patterns of the query. The two strategies and their optimizations are compared with an on-disk alternative where the data is temporarily stored.
The paper still needs significant improvement before it can be published. The related work needs to be reworked to better cover the relevant state of the art. The evaluation needs to be extended to position the system and its results with respect to the state of the art. The strategies section needs to be revised to better describe the different algorithms. The paper also needs thorough proofreading due to the excessive number of typos and grammatical errors. Lastly, it would be good if the authors created Zenodo entries for the resources they used, both for the system they developed and for the evaluation they performed.
Regarding the evaluation in particular, I suggest running further evaluations to cover the following aspects. The rationale behind each of these points is explained in my more detailed comments under the relevant sections:
- Different Façade-X representations, to assess the impact of the chosen representation on the implementation of the queries.
- More deeply nested JSON data sources, as well as XML data sources.
- More slice sizes, to assess the impact of the slice size on the query answers.
- More benchmarks, to obtain a broader view of the results.
- A comparison with state-of-the-art systems, e.g., Ontop, SPARQL-Generate and SANSA.
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
(1) The originality and significance of the evaluation results are questionable, as state-of-the-art solutions have already proven these results for other systems. The strategies are applied for the first time in the case of Façade-X, but it is not clear, based on the current version of the paper, what is novel in the way these strategies are implemented.
Long-term stable URL for resources
(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data:
The URL for the resources points to the adjusted benchmark, but there is no URL that points to the SPARQL Anything system which was used to implement the different strategies. The repository with the benchmark contains a README file with some basic instructions. The data is available via a link to the repository of the original benchmark.
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
At least a link to the release of SPARQL Anything that implements the different strategies is still required in order to reproduce the results.
Introduction
-----------------
The introduction is very focused on Façade-X and reflects neither on the bigger picture, i.e., what the actual problem is, nor on how other approaches address it, e.g., which strategies were followed by non-Façade-X approaches to address similar issues.
The introduction presents the different strategies, as well as the data formats and the benchmarks that are considered. These comments will come back in the following sections, but I will already outline my concerns:
- The paper only looks at the Façade-X case, and the performed evaluation fails to position the Façade-X implementation strategies with respect to the state of the art. These strategies are interesting, but what if the best-performing strategy is still less performant than other approaches which perform the same task without Façade-X?
- It is hard to assess how innovative these strategies are, as the paper does not discuss the state of the art. Were these strategies considered by other approaches? Were other strategies considered, and if yes, which ones? Why were existing approaches not considered, and why are these strategies considered the most promising for Façade-X?
- JSON files are considered, but when it comes to hierarchical data, XML brings more challenges than JSON. In particular for the slicing strategy, creating slices over XML data sources may be significantly more challenging than over JSON data sources, as XML documents may contain both deeply nested elements and attributes.
- Why was only GTFS considered and not other benchmarks, such as LUBM (https://github.com/oeg-upm/lubm4obda), BSBM (http://wbsg.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/), the COSMIC testbeds (https://github.com/SDM-TIB/SDM-RDFizer-Experiments/tree/master/cikm2020/...), or the NPD benchmark (https://github.com/ontop/npd-benchmark), for which the original data is available as well?
Section 2: Façade-based data access
------------------------------------------------
It is mentioned on page 3, lines 20-21, that g_ds,q is the minimal, optimal graph. Is there a proof of this statement? How are "minimal" and "optimal" defined? Are the minimal set of triples and the optimal set of triples the same? Is there a proof for this?
On the same page, lines 31-32, it is mentioned that the graph used to answer the query does not have to be the minimal/optimal one but may be any superset of it. Would that mean that more triples could be returned than the triples needed to answer the query?
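For instance, a precise statement could take roughly the following form (my notation, not necessarily the paper's):

    ans(q, g_ds,q) = ans(q, g_ds), and for every g' ⊊ g_ds,q: ans(q, g') ≠ ans(q, g_ds)

together with a proof. Such a statement would also settle the superset question: for monotone queries any g with g_ds,q ⊆ g ⊆ g_ds yields the same answers, but this is not obvious for queries with, e.g., negation or aggregation.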
On page 4, lines 20-21, it is mentioned that a Façade-X engineer can use either IRIs or blank nodes for the containers. I wonder what the impact of this choice is on the evaluation of the queries, and whether it was considered in the evaluation.
On page 4, lines 23-24, it is mentioned that a Façade-X engineer can design connectors to an open-ended set of resource types. Which resource types are meant here? And how does this affect a potential implementation?
On page 4, lines 27-28, it is mentioned that, instead of using container membership properties, the first row of the CSV file can be used to create named properties for the inner lists. I wonder what the impact of this choice is on the implementation of the SPARQL engine. Did the authors take this into consideration in their evaluation? If not, which variant did they use and why? I suggest that both options be evaluated.
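To make the previous two points concrete, the kind of variation I have in mind is sketched below (a minimal illustration with rdflib; the namespaces follow the Façade-X convention of SPARQL Anything as far as I recall, and the exact IRIs are secondary to the point). A query written for variant A has to use rdf:_n membership properties, while the same query over variant B uses the header-derived properties, so the choice is not neutral for the engine or the evaluation.

```python
from rdflib import Graph, Namespace, BNode, URIRef, Literal, RDF

FX = Namespace("http://sparql.xyz/facade-x/ns/")      # Façade-X vocabulary (assumed)
XYZ = Namespace("http://sparql.xyz/facade-x/data/")   # header-derived properties (assumed)
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

row = {"stop_id": "S1", "stop_name": "Central"}  # one CSV row; header: stop_id,stop_name

# Variant A: blank-node containers and rdf:_n container membership properties.
g_a = Graph()
root_a, row_a = BNode(), BNode()
g_a.add((root_a, RDF.type, FX.root))
g_a.add((root_a, URIRef(RDF_NS + "_1"), row_a))
for i, value in enumerate(row.values(), start=1):
    g_a.add((row_a, URIRef(RDF_NS + f"_{i}"), Literal(value)))

# Variant B: IRI containers and named properties taken from the CSV header.
g_b = Graph()
root_b = URIRef("http://example.org/ds#root")   # hypothetical IRIs
row_b = URIRef("http://example.org/ds#row1")
g_b.add((root_b, RDF.type, FX.root))
g_b.add((root_b, URIRef(RDF_NS + "_1"), row_b))
for header, value in row.items():
    g_b.add((row_b, XYZ[header], Literal(value)))

print(g_a.serialize(format="turtle"))
print(g_b.serialize(format="turtle"))
```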
Section 3: Strategies for executing Façade-X queries
--------------------------------------------------------------------
Page 5, lines 29-30 and 34-35: It is mentioned that, in Figure 2, the components in gray are given, whereas the components in green are part of the proposed system. But then, in lines 34-35, it is mentioned that the system creates the query plan. Is the query planner part of this system or of another system? I would expect it to be part of this one, given that the paper describes how SPARQL queries are answered over a Façade-X graph.
- I suggest that the authors clearly indicate which components in the figure are part of this system and which are not, and, for those that are not, discuss on which systems the complete solution depends.
- As the query plan is an integral part of query answering, I suggest that the authors include the algorithm that produces it.
- I also suggest that the authors provide the algorithm for triple-filtering, which is not discussed at all in the text, as well as the algorithm for streaming the results and assembling them, in particular for SPARQL queries that contain, e.g., aggregations. The sketch below illustrates the level of detail I have in mind.
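A naive reading of triple-filtering could look like the following (my reconstruction, not the authors' algorithm); the paper should explain how the actual implementation differs from this and what overhead it introduces:

```python
from rdflib import Variable

def matches(triple, pattern):
    """A triple matches a pattern if every non-variable position of the pattern
    equals the corresponding position of the triple."""
    return all(isinstance(p, Variable) or p == t for t, p in zip(triple, pattern))

def filter_triples(triples, patterns):
    """Keep only the triples that match at least one triple pattern of the query,
    discarding the rest before they reach the in-memory (or sliced) graph."""
    for triple in triples:
        if any(matches(triple, pattern) for pattern in patterns):
            yield triple
```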
Page 5, lines 48-49: It is mentioned that the user can indicate how CSV and JSON data sources can be segmented. However, it is not explained how the user expresses this for CSV data sources, as opposed to JSON data sources, where the user can indicate it with JSONPath expressions. It is also not specified at which point users indicate the segmentation method (e.g., when the query is executed?), nor whether these users are the data consumers, i.e., the ones who query the Façade-X graph, or the data owners, i.e., the ones who own the original data. If it is the data consumers, that implies that they know the structure of the original data. And what is the system's own strategy? Does the user's choice overrule the system's choice? What is the fallback strategy if the user does not indicate anything? How does the system decide on the segmentation strategy? All these questions should be answered in the text of the paper, and the corresponding algorithms should be provided.
--> This comment is partially answered in the evaluation section, but the request for clarification here still holds (see the sketch below for the CSV/JSON asymmetry I mean).
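To be explicit about that asymmetry: segmenting a CSV file is naturally expressed as "blocks of N rows", whereas segmenting a JSON document requires naming the array to iterate over, e.g., with a JSONPath expression. A hypothetical sketch of the two cases (not the authors' implementation):

```python
import csv
import json
from itertools import islice

def csv_slices(path, rows_per_slice=1000):
    """Segment a CSV file into blocks of rows; the header travels with each slice."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            block = list(islice(reader, rows_per_slice))
            if not block:
                break
            yield header, block

def json_slices(path, array_keys=("stops",)):
    """Segment a JSON document by iterating over one named array; array_keys plays
    the role of a JSONPath expression such as $.stops[*]. Note that this still loads
    the whole document, so true streaming would need an incremental parser."""
    with open(path) as f:
        doc = json.load(f)
    for key in array_keys:
        doc = doc[key]
    for item in doc:
        yield item
```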
Figure 4 barely extends Figure 3; thus, Figure 4 alone suffices and Figure 3 is not needed in the paper. The difference between the two can be indicated in the caption.
Section 3.3: This section makes some strong statements, some of which are debatable and others not fully correct.
- I suggest that the authors cite the papers which prove some of these statements, e.g., that systems fail on large data sources when they load them entirely in memory.
- The authors should discuss the difference between their triple-filtering approach and solutions like Ontop, which opt for query rewriting.
- Lines 33-34: The statement that filtering should be beneficial for reducing the resource requirements should either be supported by references or be turned into a research question. I question this statement, as the filtering process itself may add overhead. Are the authors aware of its impact, e.g., on performance?
- Lines 42-43: The authors mention that the slicing strategy might be less efficient for answering queries as it adds overhead. That is true, but if this is a discussion section, the impact on performance should also be discussed. As before, I suggest that the authors turn this into a research question to be answered. What is the impact of the slice size on the overhead?
- Lines 48-49: The authors claim that the memory footprint should favor the sliced approach. Again, I suggest that the authors add a reference for this claim or turn it into a research question to be answered by this paper. The memory footprint may still depend on the size of the slices: a large slice might again challenge the system's memory, while small slices might also affect it, given that intermediate results need to be kept in memory.
- State-of-the-art solutions, such as Morph-KGC (https://doi.org/10.3233/SW-223135), RMLStreamer (https://doi.org/10.1007/978-3-031-19433-7_40) and SANSA (https://ceur-ws.org/Vol-3471/paper8.pdf), consider parallelization rather than slicing to improve performance. Have the authors considered a parallelization strategy? I suggest that they include such a strategy in their evaluation and compare their system with these state-of-the-art solutions; a rough sketch of what I mean follows below.
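To be concrete about what such a strategy could look like once slices exist (a rough sketch under my own assumptions, not a claim about how SPARQL Anything works or should work):

```python
from concurrent.futures import ProcessPoolExecutor
from rdflib import Graph

def evaluate_slice(slice_triples, query):
    """Materialize one slice as an in-memory RDF graph and evaluate the SPARQL query on it."""
    g = Graph()
    for triple in slice_triples:
        g.add(triple)
    return list(g.query(query))

def evaluate_parallel(slices, query, workers=4):
    """Evaluate the slices in separate processes and concatenate the partial results.
    This naive merge is only correct for queries whose results can simply be unioned,
    i.e. not for aggregates, ORDER BY or DISTINCT across slices."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(evaluate_slice, s, query) for s in slices]
        return [row for f in futures for row in f.result()]
```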
Section 4: Evaluation
----------------------------
What are the segmentation sizes that were used for the evaluation? I suggest that the authors run the experiments with more segmentation sizes and present the results, so that the impact of the segmentation on the query results can be assessed.
The results for size 1 are trivial. The columns for size 1 could be removed from all figures and replaced by a single sentence stating that all of them were fine. This way, the tables could be grouped in pairs, which would make the comparison of the results easier.
Figure 6 shows a peak for sliced+triple-filtering which should be discussed in the paper. The same holds for the results of q12, q13, and q14. This comment applies to all outliers shown in the result figures.
Page 8, lines 28-29: It is mentioned that a benchmark was designed, but in fact a benchmark was reused. This should be rephrased.
Page 8, lines 30-31: It is mentioned that queries of varying complexity were executed. What is meant by complexity should be clarified.
Page 8, line 49: It is mentioned that Python scripts were used to analyse the results of the benchmark. Why did the authors not reuse the scripts of the benchmark?
Page 9, lines 29-30: It is mentioned that the strategy has to be indicated in the SPARQL query. Besides the fact that this should be clarified earlier, where the strategies are presented, I find it odd that the user who wants to query some data has to specify how the query will be resolved.
Page 9, line 28: It is mentioned that the query results produced by the system were compared with the results of the GTFS benchmark. I suggest that the authors clarify how this comparison took place: was it based on the number of triples, on graph isomorphism, or on some other criterion?
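For reference, when the outputs are RDF graphs, the kind of check I have in mind can be as simple as the following (file names are placeholders):

```python
from rdflib import Graph
from rdflib.compare import isomorphic

expected = Graph().parse("expected.ttl")   # reference output of the benchmark
produced = Graph().parse("produced.ttl")   # output produced by the system under test

# Triple counts alone are a weak check; isomorphism also handles blank-node renaming.
print(len(expected), len(produced))
print(isomorphic(expected, produced))
```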
Page 10, lines 29-30: The GTFS sizes which are considered are small compared to other benchmarks. It is also discussed in the state of the art that other benchmarks pose different challenges; by executing more of them, a more holistic view of the situation could be obtained.
Page 10, line 39: The section header is "Discussion" and the section starts by saying that the results were presented in the previous section, but no results were presented there.
Page 12, line 49: the queries which perform well with triple-filtering are the ones with the simpler patterns. This needs to be further discussed and investigated.
Page 14, line 49: The paper concludes, based on the results, that the execution time is independent of the format. This cannot be concluded from the provided results. The authors did not try different levels of nesting of objects and arrays in JSON to assess their impact, nor did they try XML, which is typically a more challenging hierarchical structure than JSON, nor other data formats, e.g., HTML.
Page 15, lines 11-12: For the same reasons as in my previous comment, I do not think that the results of this evaluation provide enough evidence to claim that they can be generalized to other formats.
Overall, the results are trivial. The paper does not present results which were not already proven or expected. The paper eventually concludes that the complete materialization strategy is the most performant with respect to time, but the authors do not compare their system with state-of-the-art solutions. State-of-the-art solutions use different strategies, e.g., parallelization, which significantly improve performance compared to complete materialization approaches. Moreover, some state-of-the-art solutions rely on query rewriting, which has also been proven to be significantly more efficient on certain occasions, but the authors did not compare with such systems either. The fact that a certain strategy may be the fastest within a system does not make it the best strategy unless it also outperforms the state of the art.
Section 5: Related Work
--------------------------------
A large part of the related work is not covered by the paper. Surveys on the domain covering both materialization and virtualization systems are not mentioned at all: doi.org/10.1162/dint_a_00011 and doi.org/10.1016/j.websem.2022.100753
Relevant benchmarks are not discussed in the related work section, so it remains unclear why the authors chose this benchmark to assess their system.
The related work also does not touch upon papers on querying SPARQL endpoints with different schemas, where the queries need to be rewritten, nor does the paper compare with such systems.
Typos and grammar errors
------------------------------------
Some typos are mentioned here but the paper needs to be thoroughly checked:
Page 4, line 20: two times of
Page 8, line 16: we don’t --> we do not
Page 8, lines 48-49: which it is reasonable to keep --> which is reasonable to be kept
Page 11, line 48: 69.8% of the use cases not supported --> are not supported
Page 13, line 44: %202 and %50 --> 202% and 50%