Ontology-Based GraphQL Server Generation for Data Access and Data Integration

Tracking #: 3327-4541

Authors: 
Huanyu Li
Olaf Hartig
Rickard Armiento
Patrick Lambrix

Responsible editor: 
Elena Demidova

Submission type: 
Full Paper
Abstract: 
In a GraphQL Web API, a so-called GraphQL schema defines the types of data objects that can be queried, and so-called resolver functions are responsible for fetching the relevant data from underlying data sources. Thus, we can expect to use GraphQL not only for data access but also for data integration, if the GraphQL schema reflects the semantics of data from multiple data sources and the resolver functions can obtain data from these data sources and structure the data according to the schema. However, there does not exist a semantics-aware approach to employ GraphQL for data integration. Furthermore, there are no formal methods for defining a GraphQL API based on an ontology. In this paper, we introduce a framework for using GraphQL in which a global domain ontology informs the generation of a GraphQL server that answers requests by querying heterogeneous data sources. The core of this framework consists of an algorithm to generate a GraphQL schema based on an ontology and a generic resolver function based on semantic mappings. We provide a prototype, OBG-gen, of this framework, and we evaluate our approach over a real-world data integration scenario in the materials design domain and two synthetic benchmark scenarios (Linköping GraphQL Benchmark and GTFS-Madrid-Bench). The experimental results of our evaluation indicate that: (i) our approach is feasible to generate GraphQL servers for data access and integration over heterogeneous data sources, thus avoiding a manual construction of GraphQL servers, and (ii) our data access and integration approach is general and applicable to different domains where data is shared or queried via different ways.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Jun/2023
Suggestion:
Minor Revision
Review Comment:

This paper presents a novel approach for the automatic generation of GraphQL servers using semantic resources (an ontology and mappings between the ontology and the input sources). The paper covers an important gap in the state of the art, i.e., avoiding data silos for GraphQL services, as there is no standard guideline for them to use shared vocabularies or to implement semantics-aware data integration systems. The system also uses standard web technologies that are widely adopted by the community (e.g., mappings in RML). Finally, the approach is evaluated in three different scenarios: two well-known synthetic benchmarks (GTFS-Madrid-Bench and Linköping GraphQL Benchmark) and one real use case in the materials design domain. The proposal is compared with standard OBDA systems (i.e., SPARQL2SQL) and GraphQL systems (i.e., GraphQL2SPARQL). The paper is well-written, clear, and easy to follow, and provides the formalization and algorithms implemented for translating ontology + mappings into functional GraphQL servers. I’ll follow the journal guidelines for the rest of the review.

# originality
As I already mentioned, the paper covers an important gap to avoid data silos over GraphQL servers. The position w.r.t. the state of the art is clear: there are previous solutions such as Morph-GraphQL, but they didn’t go as far as OBG-gen. I’m missing at least a mention in the related work of virtual KG constructors over data beyond RDB (i.e., Ontario [1] and Squerall [2]). Additionally, I think the proposed solution fits perfectly with the idea proposed in [3] about translating mapping rules, but it extends that idea by including knowledge from the ontology. I would suggest adding more discussion on these two aspects.
Secondly, I was expecting some discussion w.r.t. proposals that use ontologies for generating other kinds of APIs (e.g., REST APIs), a field widely studied in the Semantic Web community that could have some effect on the proposal presented in the paper, for example [4]. There may be some similarities in the exploitation of semantic resources to provide access to data over the web.

# significance of the results
This is the weakest point of the paper at this moment. First, the research problem presented in the introduction is not clear enough and IMHO lacks a clear research contribution. My concern is that, the way the paper is presented, it could be seen more as a resource contribution than an actual research paper. I would suggest that the authors rewrite the research problem to make the research contribution of the paper clear (e.g., semantic equivalence between GraphQL and semantic assets?). I think the paper presents a very nice idea that is not correctly evaluated: if the contribution aims to exploit an ontology and mappings to automatically generate GraphQL servers, why is the evaluation testing the query execution time of the solution? IMHO there is a mismatch between the contribution and the evaluation, but I also think that it could be easily solved by improving the research problem (introduction) and the research questions (evaluation). I really liked the idea of analyzing the performance of the solution w.r.t. similar solutions, but maybe the contribution of the paper should be different (e.g., optimizations in the translation of ontology + RML mappings into GraphQL servers).
Additionally, the results show that the behavior is similar to that of SPARQL2SQL virtual KG engines, although the latter usually provide better results. Finally, the coverage of the SPARQL operators by the proposed solution is very limited (see the number of queries that can be answered in the GTFS-Madrid-Bench). I was expecting a longer and more structured discussion about performance, about the impact of the unsupported SPARQL operators on this GraphQL framework, about scalability (as the authors mention several times that data size clearly affects OBG-gen), and about the rest of the relevant points, as these could certainly have an impact on new research contributions in this field.

# quality of writing
The paper is well-written and easy to follow. However, there are two main points that IMHO could be improved. First, although I really like the use of the same example to present the contributions of the paper, references to figures/listings are sometimes difficult to follow (e.g., the introduction section mentions Listing 1, which is actually in another section of the paper). I would suggest that the authors improve the readability of the paper by following a more structured way of referencing figures, listings, etc., because it’s easy for the reader to get lost. Second, Section 3 seems to be a summary of Sections 4 and 5; I don’t know if it would be better to merge them into one or two independent sections where the contribution can be seen as integrated.

I really appreciate all examples and instances in the core sections (3-5) of the paper, but are Figure 4, Figure 5, and Figure 6 not actually listings? What is meant by the CSV-formatted and JSON-formatted data that are supported by OBG-gen?

[1] Endris, K. M., Rohde, P. D., Vidal, M. E., & Auer, S. (2019). Ontario: Federated query processing against a semantic data lake. In Database and Expert Systems Applications: 30th International Conference, DEXA 2019, Linz, Austria, August 26–29, 2019, Proceedings, Part I 30 (pp. 379-395). Springer International Publishing.
[2] Mami, M. N., Graux, D., Scerri, S., Jabeen, H., Auer, S., & Lehmann, J. (2019). Squerall: Virtual ontology-based access to heterogeneous and large data sources. In The Semantic Web–ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II 18 (pp. 229-245). Springer International Publishing.
[3] O. Corcho, F. Priyatna and D. Chaves-Fraga, Towards a new generation of ontology based data access, Semantic Web 11(1) (2020), 153–160. doi:10.3233/SW-190384
[4] Garijo, D., & Osorio, M. (2020). OBA: An ontology-based framework for creating REST APIs for knowledge graphs. In The Semantic Web–ISWC 2020: 19th International Semantic Web Conference, Athens, Greece, November 2–6, 2020, Proceedings, Part II 19 (pp. 48-64). Springer International Publishing.

Review #2
Anonymous submitted on 16/Jul/2023
Suggestion:
Major Revision
Review Comment:

This paper presents how a GraphQL server can be generated for data access via an ontology. To achieve this, the paper proposes a framework for GraphQL-based data access in which the ontology drives the generation of the GraphQL schema. The paper proposes a method to generate a GraphQL schema based on an ontology and a generic algorithm for a GraphQL server that leverages semantic mappings that use the same ontology to answer queries based on the heterogeneous data sources of the semantic mappings. In the end, the paper validates its solution based on a real-world data integration scenario in the materials design domain as well as using two benchmarks: LinGBM and GTFS-Madrid-Bench. The proposed framework is compared with other GraphQL systems (UltraGraphQL and HyperGraphQL) and other OBDA systems (Morph-RDB and Ontop).

The paper is overall well-written and presents original results. The translation method is novel and the paper shows that it fulfills its purpose. The GraphQL server architecture design and the translation algorithm are original compared to other solutions. Even though the framework does not outperform the systems it was evaluated against in most cases, the results are still relevant and could be inspiring for the community. I do not have any major concerns for the paper to get accepted, but I list below a few detailed clarifications, questions and suggestions:

Clarifications:

- In section 4.1, it is mentioned “In our current work, considering the knowledge representation language we use for the ontology (see next section), we do not need directive and union types.” Could the authors clarify what they mean?

- In section 5.4, it is mentioned that “the relevant triples maps are identified based on the root node type” but what happens if there are multiple triples maps that correspond to a certain root node type? For instance, what if there are two such triples maps and each one refers to a different logical source? Would the framework consider both? What if each of these triples maps covers a different subset of the ontology’s predicates for this type?

- In algorithm 3, line 10, the evaluator iterates over all poms of a triples map, but are all poms always needed? Could a query request only a subset of all possible predicates associated with a type? If so, do all poms need to be processed?

- In algorithm 3, different cases of object maps are covered which are all correct, but the case of template-based object maps is not considered. Do they fall under the same category as the reference-based object maps? If yes, it should be clarified, if not, it should be added.

- In algorithm 3, I am not sure if the constant-based triples map generates an IRI, which is correct according to the specification, or a literal data value (which would be correct only if it is accompanied by a declaration that the term type should be literal).

- In algorithm 3, different cases of object maps are considered, but in all cases only the default term type is considered, I think. However, it is not explained what happens if the object map contains a custom term type declaration. If the algorithm does not support term types, that should be mentioned among the limitations of the proposed solution.

- In section 6 it is mentioned that Morph-CSV expects additional annotations for tabular data. What are these additional annotations?

- In section 6, it is mentioned that OBG-gen supports RDBs, CSVs and JSONs, what about XML? Would it be difficult to also cover XML and in general to extend the framework for other data formats?

- In section 7.1 it is mentioned that Morph-GraphQL and Ontology2GraphQL were not considered because they could not be executed, but did you try to ask for help by contacting the authors or opening an issue? (I think Morph-GraphQL is deprecated but it would have been nice to at least have some results)

- In section 7.1, it is mentioned that Morph-RDB and Ontop are used, but I think that neither of these systems supports RML. Did you use R2RML instead? If so, please clarify.

- In section 7.1, I am confused by the statement that “Morph-RDB and Ontop are provided with two MySQL database instances”. I am not sure about Ontop, as it is not a native [R2]RML implementation (if R2RML is provided to Ontop, the R2RML is translated to its own mapping language), but at least Morph-RDB, if it correctly implements the R2RML specification, should be able to connect to only one RDB. However, you mention that there are two RDBs, which should not be possible unless there are two separate installations of each system. If that's the case, I do not think that performing joins across different DBs would be possible with R2RML. Did you perhaps mean two different tables in a single database? (This comment assumes that R2RML was used with Morph-RDB; RML does not restrict the number of DB connections per mapping document.)

- In section 7.4, it is mentioned that Ontop does not support other data formats, but this is not correct. I would suggest having a look at this article: Botoeva, Calvanese, D., Cogrel, B., Corman, J. L. M., & Xiao, G. (2019). Ontology-based data access - Beyond relational sources. Intelligenza Artificiale, 13(1), 21–36. https://doi.org/10.3233/IA-190023, and at https://ontop-vkg.org/guide/databases/generic.html for all possible database connections that Ontop supports, among them data integration frameworks that enable creating virtual tables for other data formats. It would be interesting to see how Ontop behaves for heterogeneous data and how it compares with OBG-gen if heterogeneous data is considered for Ontop.

Suggestions:

- While I see the technical merit of the paper, it is harder to imagine use cases where such an architecture may be desired. I would suggest that the authors include a few concrete use cases where, given an ontology and some raw data, a GraphQL server is a desired solution for accessing this data. For instance, in the real-world materials example mentioned, what are the possible usage scenarios?

- What are the available benchmarks for OBDA and why were these benchmarks selected? I would suggest that the authors include a paragraph in related work where they summarize the available OBDA benchmarks and explain why they chose these particular benchmarks.

- What are the limitations of the proposed framework? I would suggest that the authors add a section where the limitations of the proposed solution are clearly discussed. For instance, the paper mentions “the Φ function can be extended in the future for mapping any datatype besides the above four types from a TBox into a custom scalar type in GraphQL.” or “ordering and paging are not considered currently by OBG-gen”. I would think that these are limitations of the proposed framework in its current version that could be summarized together with the remaining limitations in a dedicated section.

- What are the potential improvements of the proposed framework? While there are a few cases in which the proposed framework outperforms other systems that achieve similar results, in most cases the proposed framework is in the same order of magnitude but still does not perform as well as other systems that rely on more mature technologies. I would suggest that the authors discuss potential improvements that could make the framework more competitive.

- I would suggest that the authors clarify how data integration is promoted in their framework, as this was mentioned as one of the contributions/motivations. In the end, the evaluation section does not show how the integration capabilities of OBG-gen compare to those of other systems.

Minor suggestions:

- What the arrows of Figure 1 indicate is explained in the text but not in the caption of the figure. I’d suggest that the authors include it in the caption of the figure too.

- Why are Listings 1 and 2 presented as listings while Figures 3, 4 and 5 are presented as figures, when they all show a data extract? I would think that they are all listings.

- I think that the following two papers might be useful for this paper to find references that clarify a few concepts that are considered in this paper:

* I think this paper “Afnan Alhazmi, Tom Blount, and George Konstantinidis. 2022. ForBackBench: a benchmark for chasing vs. query-rewriting. Proc. VLDB Endow. 15, 8 (April 2022), 1519–1532. https://doi.org/10.14778/3529337.3529338” gives a good overview of materialization and virtualization approaches.

* I think this paper “Dylan Van Assche, Thomas Delva, Gerald Haesendonck, Pieter Heyvaert, Ben De Meester, and Anastasia Dimou. 2023. Declarative RDF graph generation from heterogeneous (semi-)structured data: A systematic literature review. Web Semant. 75, C (Jan 2023). https://doi.org/10.1016/j.websem.2022.100753” provides a good overview of virtualization systems and mapping languages.

- Some of the observations in the evaluation section related to Morph-RDB and Ontop are trivial, as they repeat observations made in past papers.

Typos/Grammar/Syntax:

- page 2, line 9: “However, a semantics-aware approach to employing GraphQL for data integration does not exist. The approaches in [2] and [3] introduce how to use GraphQL for data federation. However, they are not semantics-aware.” “However” is used twice in a row.

- page 23, line 6: “The filter expressions for Q6 and Q12 are more simple”

- page 25, line 40: “but is more sensitive to the change of datasets increase”

- page 26, line 49: “While for Morph-RDB and Ontop, based on semantic mappings, a SPARQL query is translated to a single SQL query. For queries with filtering conditions, all the three engines (OBG-gen-rdb, Morph-RDB and Ontop) can take the advantages of rewriting filter conditions into SQL queries so that the increases of QETs as data size increases are not obvious”

Review #3
Anonymous submitted on 28/Aug/2023
Suggestion:
Minor Revision
Review Comment:

This paper introduces a framework which informs the generation of a GraphQL server that answers requests by querying heterogeneous data sources.
The paper contains original contributions and can potentially make an important contribution to the aggregation of heterogeneous data sources in various domains. The approach is sound in general and is thoroughly described with sufficient detail and running examples. Experimental evaluation has been carried out on three different scenarios/datasets, demonstrating the good performance of the proposed system. The paper is well written.

Some minor comments:
- Please consider adding the element “data sources” to Fig. 8; the several arrows relevant to data sources point in different directions without a start or end point, which makes the figure itself difficult to understand.
- It would be good to have a discussion of the advantages of the proposed framework in Section 6.
- Fig. 11: it is unclear how the boxes are coded with different colors. Also, please consider organizing the figure in a more structured way.
- P23L7: “Query features of queries” - extra word?
- Figs. 14-16 contain extensive experimental results that are, however, not discussed or interpreted. It would be good to have some more explicit discussion.
- Maybe add some assessment of how much effort using the framework would take for different users, and a discussion of how user experience is considered in the design process.