Creating Restful APIs over SPARQL endpoints with RAMOSE

Tracking #: 2543-3757

Authors: 
Marilena Daquino
Ivan Heibi
Silvio Peroni
David Shotton

Responsible editor: 
Armin Haller

Submission type: 
Tool/System Report
Abstract: 
Semantic Web technologies are widely used for storing RDF data and making them available on the Web through SPARQL endpoints, queryable using the SPARQL query language. While the use of SPARQL endpoints is strongly supported by Semantic Web experts, it hinders broader use of these data by common Web users, engineers and developers unfamiliar with Semantic Web technologies, who normally rely on Web RESTful APIs for querying Web-available data and creating applications with them. To solve this problem, we have developed RAMOSE, a generic tool developed in Python to create REST APIs over SPARQL endpoints, through the creation of textual configuration files which enable the querying of SPARQL endpoints via simple Web RESTful API calls that return either JSON or CSV-formatted data, thus hiding all the intrinsic complexities of SPARQL from common Web users. We provide evidence for the use of RAMOSE to provide REST API access to RDF data within OpenCitations triplestores, and we show the benefits of RAMOSE in terms of the number of queries made by external users to such RDF data compared with the direct access via the SPARQL endpoint. Our findings prove the importance for suppliers of RDF data of having an alternative API access service, which enables its use by users with no (or little) experience in Semantic Web technologies and the SPARQL query language. Because RAMOSE is generic and can be used with any SPARQL endpoint, it represents an easy technical solution for service providers who wish to create an API service to access Linked Data stored as RDF in a conventional triplestore.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jonathan Yu submitted on 27/Aug/2020
Suggestion:
Minor Revision
Review Comment:

Criteria:
(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).

The paper presents a new software package called RAMOSE for enabling the development of REST APIs as a façade over underlying SPARQL APIs using configuration files. The research question in scope is what is a generic mechanism for enabling web developers and scholars to query RDF data available in triplestores exposed via SPARQL without having to write SPARQL (primarily via REST APIs), and secondly, how can Semantic Web data providers deploy REST APIs that expose RDF data efficiently and easily.

The design of RAMOSE is sound and addresses the challenges and pain points from both users of RDF data - provides a REST API interface that is familiar to most web developers, rather than SPARQL API, and Semantic Web data providers - provides an easy way to develop and deploy REST APIs. The approach is sound and allows for customisation (pre- and post-processing functions) and extensibility (via addons) by those wanting to deploy REST APIs. JSON and CSV are the main output formats, which are sufficient for the web developer community. The approach taken in this work implements a well-known design pattern (the "Façade" pattern), and is used in a number of web applications in developing REST APIs. The novel aspect of this work is the ability to define these in a configuration file without having to write code, and API level features from the RAMOSE application out of the box (like filtering, defining pre- and post-processing, auto-generated API documentation).
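
To make the façade idea concrete, here is a minimal Python sketch of a configuration-driven mapping from a REST-style path to a SPARQL query. It is not RAMOSE's actual implementation: the `CONFIG` dictionary stands in for a configuration file, and the route pattern, the `[[name]]` placeholder syntax as used here, and the `build_query` function are all assumptions made for illustration.

```python
import re

# Hypothetical configuration: each REST route pattern maps to a SPARQL template.
# A real deployment would load this from a configuration file instead.
CONFIG = {
    r"/citations/(?P<doi>.+)": (
        "SELECT ?citing WHERE { "
        "?citing <http://purl.org/spar/cito/cites> <https://doi.org/[[doi]]> }"
    ),
}


def build_query(path: str) -> str:
    """Return the SPARQL query configured for a REST path.

    Raises LookupError when no configured route matches the path.
    """
    for pattern, template in CONFIG.items():
        match = re.fullmatch(pattern, path)
        if match:
            query = template
            # Substitute every named URL parameter into the template.
            for name, value in match.groupdict().items():
                query = query.replace(f"[[{name}]]", value)
            return query
    raise LookupError(f"no operation configured for {path}")
```

Adding an operation then means adding an entry to the configuration rather than writing new handler code, which is the property the review highlights. A production version would also need to validate or escape parameter values before placing them in a query.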

The authors present a case study using the OpenCitations scenarios and comparing quarterly access statistics via server logs showing the number of SPARQL queries versus the number of REST API queries issued. The authors argue through this case study that there are benefits to using RAMOSE implemented REST APIs vs SPARQL, which is validated for OpenCitations. They demonstrate the increase in usage of OpenCitations data generally, and greater use of the REST API queries rather than via SPARQL API.

Suggested revisions:

1.1. Several prior tools are listed in the Related work section, e.g. BASIL and grlc.io. While the related work is listed, I suggest comparing RAMOSE with the listed alternatives, perhaps in a table that summarises the differences. This would provide the paper with an evaluation of RAMOSE and help readers understand RAMOSE in context with the other tools. I'd also suggest including a comparison to another framework called pyldapi, which does a similar job but requires more development effort to achieve similar results to RAMOSE (in exchange for greater control over the final API).

1.2. In terms of impact, while an interesting case study involving a large and interesting RDF dataset (OpenCitations) and evidence was provided on the usefulness of REST APIs developed with RAMOSE for OpenCitations, other than deployment at Workshops on Open Citations, it appears that RAMOSE is only deployed for OpenCitations. On the other hand, the authors demonstrate that RAMOSE has helped the uptake of the OpenCitations data via other software and data services - VOSviewer, Citation Gecko, DBLP, etc., which demonstrates outcomes as a result of the development of RAMOSE. The other factor is that RAMOSE is relatively new (since 2019) so it hasn't had time to gain uptake. I suggest that authors look at providing evidence of other groups using RAMOSE in order to demonstrate the impact it has had in terms of uptake.

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

The paper is well written and the figures, tables, and code listings used are formatted nicely and appropriate. The authors have described the RAMOSE tool well and have highlighted its design goals/principles, and features in a logical manner that flows.

Minor revisions suggested:

2.1. Figure 1 and Figure 2 - check the resolution of these images. Suggest higher resolution for these.

2.2. Section 3.2, first para: "An hash-format document…" --> "A hash-format document"

2.3. Section 3.2, first para: The authors introduce the hash-format file based on Markdown, but at this point of the paper, it isn't clear what the Markdown content acting as a value is for (this significance is apparent later in the paper - for automated documentation generation in Section 3.4.1). Suggest that the authors signpost the significance of the Markdown feature here in this section as the configuration document is actually a key part of RAMOSE.

Summary:

The authors have presented a novel software application, called RAMOSE, that enables the creation of REST APIs over SPARQL endpoints as a façade using a configuration-over-code approach. The design of RAMOSE is innovative in that it allows customisation, extensibility and automatic generation of documentation, features that will certainly aid deployers of REST APIs. The features included with RAMOSE for allowing users to query, filter and view documentation are useful additions. The importance of tools like RAMOSE comes at a time when a number of RDF datasets are being published and yet there are barriers to entry for web developers to these datasets. RAMOSE certainly has broad appeal to the Semantic Web community for those who have RDF data and are faced with the challenge of engaging web developers with learning and using SPARQL APIs. I recommend the paper for acceptance as it is in scope and is a valuable resource for the Semantic Web community, with some minor revisions as highlighted above.

Review #2
By Sergio Rodriguez Mendez submitted on 02/Sep/2020
Suggestion:
Minor Revision
Review Comment:

* Summary: the article describes RAMOSE (the "RESTful API Manager Over SPARQL Endpoints"), an open source Python software artifact that allows users to create Web RESTful APIs over any SPARQL endpoint by editing a configuration file. It also automatically generates HTML-based documentation and a Web server for testing/monitoring purposes.

* Overall Evaluation (ranging from 0-100):
[Criterion 1]
+ Quality: 95
+ Importance/Relevance: 85
+ Impact: 85
[Criterion 2]
+ Clarity, illustration, and readability: 90
[Criterion 3]
+ Stability: 100
+ Usefulness: 90
[Perspective]
+ Impression score: 75 | some design improvements can be made

* Comments:
- The tool could have a better code design (there's room to improve): modularity, parametrization.
- Architecture diagram: UML classes? Main function dependency?

* Major:
Improve/clarify figure 1:
"N" represents the number of SPARQL endpoints (and configuration documents). However, it's also used as the number of operations per endpoint/document.

In [Figure 2 / "apply filters & refinements"]: why is the "sort" functionality not included as part of the SPARQL query?
3.3.2. Filtering rows: why is this functionality not included as part of the SPARQL query using the FILTER function?
3.3.4. Wouldn't it be easier just to look at the "Accept" HTTP header to choose the output format?
3.3.5. "each row of the final JSON table": conceptually, JSON doesn't have rows and tables. Please, rephrase using correct concepts.
3.3.5. the example shown in listings 6 and 7 implies that the splitting operation only looks for the first occurrence of "". Is this how it works? How to generalize for any number of occurrences?
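
The suggestion in point 3.3.4 (choosing the output format from the "Accept" HTTP header) can be sketched as follows. This is a simplified illustration, not RAMOSE code: `pick_format` is a hypothetical helper, it assumes JSON and CSV are the only supported outputs with JSON as the default, and it handles `q` weights and `*/*` but ignores partial wildcards such as `text/*`.

```python
def pick_format(accept_header: str,
                supported=("application/json", "text/csv")) -> str:
    """Pick the supported media type with the highest q-value.

    Falls back to the first supported type when nothing matches.
    """
    best_type, best_q = supported[0], -1.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0  # q defaults to 1.0 when absent
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed weight as lowest priority
        if media_type == "*/*":
            media_type = supported[0]  # "anything" resolves to the default
        if media_type in supported and q > best_q:
            best_type, best_q = media_type, q
    return best_type
```

With this approach a client sending `Accept: text/csv` would receive CSV without any format parameter in the URL, while clients that send no usable header still get the default.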

* Minor corrections:
pag. 02: "Markdown-like format" should be a link?
pag. 03: ExCITE and Venice Scholar Project links.
pag. 04: "The RAMOSE application file...:" <1>, <2>, ... *, and* "setting up..." # include ", and" before the last aspect
pag. 07 / sec. 3.3.5: apply correct formatting to ""
pag. 09 / sec. 3.4.3: the first line shouldn't be "RAMOSE main Python class"?
pag. 09 / sec. 4: "Opencitations" change to "OpenCitations"
replacements: "Restful" to "RESTful"; "web" to "Web"

* Others:
@https://github.com/opencitations/ramose / Configuration / Requirements
The link https://github.com/opencitations/ramose/requirements.txt is broken: "Not found!"

Review #3
By Pierre-Antoine Champin submitted on 04/Sep/2020
Suggestion:
Minor Revision
Review Comment:

This paper presents RAMOSE, a tool for easily publishing REST APIs on top of SPARQL endpoints. The rationale is that REST APIs are easier to use by the average Web developer. Therefore RAMOSE improves the reach and the usability of the underlying RDF data.

The paper is clear, well written, and overall pleasant to read. The tool is available for testing, both as a deployed instance at OpenCitations.net, and as an open-source project on GitHub.

Section 3 is a detailed account of the features of the tool. Except for a few things (see questions & suggestions below), it is clear and easy to follow. I sometimes had the feeling that it was even too technical for a research paper -- is it really useful to describe every possible directive of the configuration file, or every possible option of the CLI tool? A slightly less granular presentation would have been sufficient, I think.

Section 4 presents an analysis of usage logs of OpenCitations.net, showing the influence of the introduction of the REST API. While this data is clearly interesting, I think their accurate interpretation would require more information.
Indeed, comparing the number of calls to the REST API with the number of calls to the SPARQL endpoint is biased, because SPARQL is more expressive than the proposed REST operations; hence, some SPARQL queries can provide, in one single call, information that requires several calls to the REST API. This could account for the higher number of calls to the API. So I do not think that these figures prove that the API has increased the *usage* of OpenCitations.net services. Instead of (or in addition to) the raw number of calls, I believe the number of unique users (or unique IP addresses) would be a better indicator.
If this information is not available, at least the authors should acknowledge this bias in their interpretation of the data.
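
The difference between raw call counts and unique clients can be illustrated with a short Python sketch. The `(client_ip, service)` tuple layout is a hypothetical pre-parsed form of a server log, and the sample entries are toy values chosen for illustration, not figures from the paper.

```python
from collections import defaultdict


def summarize(log_entries):
    """Count raw calls and distinct client addresses per service.

    log_entries: iterable of (client_ip, service) tuples (assumed format).
    """
    calls = defaultdict(int)
    clients = defaultdict(set)
    for ip, service in log_entries:
        calls[service] += 1
        clients[service].add(ip)
    return {
        service: {"calls": calls[service],
                  "unique_clients": len(clients[service])}
        for service in calls
    }


# Toy data: one client needs three API calls for what another client
# obtains with a single SPARQL query.
entries = [
    ("192.0.2.1", "sparql"),
    ("192.0.2.2", "api"),
    ("192.0.2.2", "api"),
    ("192.0.2.2", "api"),
]
summary = summarize(entries)
```

In this toy case the API shows three times as many calls as the SPARQL endpoint, yet both services have exactly one client, which is the kind of bias described above.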

Finally, I do not consider that section 4 demonstrates the value of RAMOSE itself; instead it demonstrates (modulo my remarks above) that *there is a need* for REST APIs, and hence for tools such as RAMOSE. But we have no way to know if the usage data would have been different if the authors had, for example, built their API from scratch. I still find Section 4 (and the paper in general) interesting and valuable, but the authors should be careful not to set wrong expectations. In particular, the first sentence of Section 4 should be rephrased: replace "RAMOSE" in this sentence with "a REST API".

Actually, while I think this tool is very interesting, and might even use it in future projects for rapid prototyping of REST APIs, I have some concerns about the *availability* of the exposed API. Indeed, SPARQL endpoints are known to have low availability in general (https://labs.mondeca.com/sparqlEndpointsStatus/index.html). Building a REST API on top of a SPARQL endpoint adds even more overhead. Furthermore, I am not convinced of the relevance of offering post-processing options in the API itself -- this might make users' lives a little easier, but at the cost of more centralized processing, further reducing availability. I think that a study on the performance implications of RAMOSE, compared to other tools or ad-hoc development, should be mentioned as a perspective to this work.

Questions / suggestions:

* in Table 2, the description of the #preprocess directive is not entirely clear. From what I gathered by reading the documentation on GitHub (but I am not entirely sure), 'lower(dois)' takes as input the URL parameter 'dois', *and replaces it with the output of the function*, so that '--> encode(doi)' accesses the *transformed* version of 'doi'. This should be made clearer (unless you decide to make the presentation less granular, as suggested above).

* I find the 'exclude' URL parameter oddly named: the meaning of e.g. "exclude=cited" is really not intuitive. I suggest renaming this parameter to 'require'.

* exposing the API using the OpenAPI specification (http://spec.openapis.org/oas/v3.0.3) would further improve its usability by developers. Furthermore, there are several tools for producing API documentation automatically based on an OpenAPI (https://github.com/swagger-api/swagger-ui, https://github.com/Redocly/redoc), so you would not have to implement/maintain your own.

* in Section 4, if you are able to provide the numbers of unique users or IP addresses, it would also be interesting to see how many *new* users/addresses called the REST API compared to the SPARQL endpoint. This would help quantify how much the *reach* of the data (not just its usability) was increased.
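
The reading of the #preprocess directive given in the first question above (each function replaces the URL parameter with its own output, so later steps see the transformed value) can be sketched in Python. This only illustrates that interpretation, which the review itself marks as uncertain; `run_preprocess`, the simplified parsing of the directive string, and the bodies of `lower` and `encode` are assumptions, not RAMOSE's code.

```python
from urllib.parse import quote


def lower(value: str) -> str:
    """Lower-case the parameter value."""
    return value.lower()


def encode(value: str) -> str:
    """Percent-encode the value, including '/' characters."""
    return quote(value, safe="")


def run_preprocess(directive, params, functions):
    """Apply 'f(param) --> g(param)' steps left to right.

    Each step overwrites the named parameter with the function's output,
    so the next step receives the already transformed value.
    """
    params = dict(params)  # leave the caller's mapping untouched
    for step in directive.split("-->"):
        name, _, arg = step.strip().partition("(")
        arg = arg.rstrip(")").strip()
        params[arg] = functions[name.strip()](params[arg])
    return params


result = run_preprocess(
    "lower(doi) --> encode(doi)",
    {"doi": "10.1000/ABC"},
    {"lower": lower, "encode": encode},
)
```

Under this interpretation `encode` never sees the original upper-case value, which is the behaviour the review asks the authors to state explicitly.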

Review #4
By Victor Charpenay submitted on 30/Sep/2020
Suggestion:
Minor Revision
Review Comment:

The paper introduces a tool to provide domain-specific Web APIs over SPARQL endpoints in a declarative manner. There are enough examples to understand the capabilities of the tool and the figures on OpenCitations are a good indicator of its relevance.

Besides that, the tool in itself has a rather small code base (~1500 lines of Python code, including the config format parser), which suggests it did not represent a major challenge to implement.

Comments on each review criterion:

(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).
-----------------------------------------------------------------------------------------------------------

The motivation around REST is not entirely convincing: ref [1] is a blog post that doesn't directly support the claim (citing Fielding's thesis, chapter 3 or 5, is enough); BNB seems only to have a SPARQL endpoint (at least from the provided URI); the URI to data.gov returns 404 but it seems that APIs either don't use RDF or have no REST interface (CKAN, for instance, documents its API as an RPC-style interface); as far as I know, Wikidata is exposed via the MediaWiki API which is far from being a REST interface (and doesn't claim to be).
In the end, the question addressed by RAMOSE is less that of providing a read-write interface to RDF datasets (there has been the Graph Store Protocol spec, and now the Linked Data Platform, for that) than to provide a domain-specific view on such datasets (such that each resource bundles pieces of information that are likely to be relevant to specific users). What the paper demonstrates, to me, is that the SPARQL mappings provided to OC users suit their need. The motivation of the paper may insist more on that aspect.

As far as I understand, the configuration format used by RAMOSE is custom. (At least, it is not known by Rouge or Pygments, which have lexers for hundreds of languages.) Calling it a Markdown format is ambiguous, because Markdown has several dialects but also because dialects disagree precisely on using the '#' character (among others). The syntax is somewhat misleading to me: some key/value pairs belong together (depending on the value for 'type') but it seems there is no explicit separation between groups of key/value pairs. By the way, the authors use the expression "conceptual sections" here, which was not immediately clear to me. Why not use JSON, YAML or a more widespread format?

The main change I would like to see in the paper is the following: in section 5, the authors mention that RAMOSE mainly distinguishes itself from grlc (the closest alternative) by the fact it also provides support for filters and custom pre/post-processing functions. Yet, the paper doesn't demonstrate the importance of these two features in section 4. Are the API definitions for the OC corpus making use of these features? Are there many pre/post-processing functions defined for these APIs? In addition, if server logs include individual queries, it would be good to visualize the importance of filtering and pre/post-processing in the logs.

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
----------------------------------------------------------------------------------------------------------------------------------------------------------

Section 2 could be removed without losing information on RAMOSE. The only important point (which is not clearly made in the section) would be to state that the OC corpus hasn't undergone significant architectural changes since 2018 (besides those mentioned in section 4).

If Listing 2 were shown earlier in the paper, reading would be easier. In my opinion, listing 1 is superfluous (a single example is enough to grasp the idea).