PAPyA: a Library for Performance Analysis of SQL-based RDF Processing Systems

Tracking #: 3243-4457

Mohamed Ragab
Adam Satria Adidarma
Riccardo Tommasini

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Prescriptive Performance Analysis (PPA) has been shown to be more useful than traditional descriptive and diagnostic analyses for making sense of Big Data (BD) frameworks’ performance. In practice, when processing large (RDF) graphs on top of relational BD systems, several design decisions emerge that cannot be decided automatically, e.g., the choice of the schema, the partitioning technique, and the storage format. PPA, and in particular ranking functions, helps enable actionable insights on performance data, leading practitioners to an easier choice of the best way to deploy BD frameworks, especially for graph processing. However, the amount of experimental work required to implement PPA is still huge. In this paper, we present PAPyA, a library for implementing PPA that (1) prepares RDF graph data for a processing pipeline over relational BD systems, (2) enables automatic ranking of the performance in a user-defined solution space of experimental dimensions, and (3) allows flexible user-defined extensions in terms of systems to test and ranking methods. We showcase PAPyA on a set of experiments based on the SparkSQL framework. PAPyA simplifies the performance analytics of BD systems for processing large (RDF) graphs. We provide PAPyA as a public open-source library.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 20/Dec/2022
Major Revision
Review Comment:

The paper tackles the problem of prescriptive performance analysis and presents a library named PAPyA that enables this type of analysis over knowledge graphs. PAPyA implements a whole pipeline for preparing RDF knowledge graph data for processing over relational Big Data systems, from which prescriptive analysis can be performed; its goal is to reduce manual work in the phases of graph processing preparation and data loading. The performance of PAPyA is assessed using several testbeds with large RDF graphs generated using Watdiv and SP2Bench. The tool is publicly available on GitHub.

Positive Points (PPs)
PP1) The paper describes PAPyA, a Python library that supports users in evaluating SPARQL query processing on top of Big data settings. Several parameters can be configured to analyze the dimensions that affect reproducibility.
PP2) A showcase demonstrating PAPyA functionality and main features.
PP3) Many options to visualize the outcomes of an experimental study.

Negative Points (NPs)
NP1) It is not clear how PAPyA can be extended to meet the requirements in Section 3.1.
NP2) Nothing is mentioned about the number of users who have utilized PAPyA and consider that it actually reduces the work during query processing assessments.
NP3) It is not clear how different configurations of query rewriting, optimization, and evaluation can be included in PAPyA.

Detailed Comments
The development of PAPyA rests on the assumption that the performance of query processing over RDF graphs is sensitive to various parameters, and PAPyA aims to enhance reproducibility and facilitate finding the best combination of relevant parameters (e.g., schema, partitioning technique, and storage format) during the application of the Bench-Ranking methodology (previously published by the authors).

It is important to make clear in the introduction that this paper extends the work presented in [5] and to explicitly state the novel contributions. Figures and Tables that are also part of the paper [5] should be cited (e.g., Figure 1 and Tables 1 and 2).

Although PAPyA is agnostic of the KPIs, it would be easy to include new metrics. For example, dief@t and dief@k are very informative metrics that enable quantifying the diefficiency, i.e., the continuous performance, of an engine.
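To illustrate how lightweight such an extension could be: the snippet below is a minimal, hypothetical sketch of the dief@t metric (area under an engine's answer-trace curve up to time t, with linear interpolation between answer arrivals, following its published definition); the function name and input format are assumptions, not part of PAPyA's API.

```python
def dief_at_t(answer_timestamps, t):
    """Hypothetical sketch of dief@t: the area under the answer-trace
    curve (cumulative answers vs. time) up to time t. Higher is
    better: the engine produces more answers, earlier.

    answer_timestamps: times (e.g., seconds) at which each answer
    was produced by the engine under test.
    """
    ts = sorted(x for x in answer_timestamps if x <= t)
    if not ts:
        return 0.0
    area = 0.0
    # Trapezoidal rule over the trace points (t_i, i): linear
    # interpolation between consecutive answer arrivals.
    for i in range(1, len(ts)):
        area += (ts[i] - ts[i - 1]) * ((i) + (i + 1)) / 2.0
    # Flat segment from the last answer until time t.
    area += (t - ts[-1]) * len(ts)
    return area
```

A KPI of this shape (a scalar per query execution) could then be ranked by PAPyA like any other performance dimension.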

It is not clear how the query optimization techniques can be configured in an experimental setting.

It is unclear how the Executor's specification of the SQL queries is executed. Are they pushed down into the database management system, or is there a local optimization and processing? This will be another parameter that could impact the outcome of the experiments.

In the same line, it is unclear how the SPARQL queries are translated into SQL. The transformation process considerably impacts the outcome of the experiments. Are the users able to select various strategies for transforming SPARQL into SQL? Please, check the paper by Karim et al. 2021 to see how the representation may considerably impact the outcome in terms of execution time.
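To make the reviewer's point concrete: the same SPARQL triple pattern compiles to very different SQL depending on the relational schema chosen for the RDF data, which directly affects execution time. The sketch below is purely illustrative (the function, schema names, and table layouts are assumptions, not PAPyA's translation logic).

```python
def tp_to_sql(pred, schema):
    """Illustrative only: translate the triple pattern (?s <pred> ?o)
    to SQL under two hypothetical relational layouts for RDF data."""
    if schema == "single_statement_table":
        # One wide triples table (s, p, o): every pattern is a
        # selection on the predicate column.
        return f"SELECT s AS subj, o AS obj FROM triples WHERE p = '{pred}'"
    if schema == "property_table":
        # One column per predicate: the pattern becomes a projection
        # plus a NULL filter, avoiding the predicate selection.
        return (f"SELECT s AS subj, {pred} AS obj FROM prop_table "
                f"WHERE {pred} IS NOT NULL")
    raise ValueError(f"unknown schema: {schema}")
```

Multi-pattern queries amplify the difference (self-joins on the triples table vs. column lookups in a property table), which is why the translation strategy deserves to be an explicit experimental parameter.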

The authors claim that PAPyA is able to reduce the effort needed to analyze the performance of Big Data systems when processing large RDF graphs. However, how has this statement been validated? How large are the savings? How easy is this framework for the users? Do they actually find PAPyA easy to use, and does it reduce their work? A user study would support this statement.

Evaluation and Recommendations
The paper describes a library that has the potential to be very useful for the community. However, there are issues that reduce the value of this current version (see negative points). The recommendation is for major revision addressing the comments previously presented.

Farah Karim et al. Compact representations for efficient storage of semantic sensor data. J. Intell. Inf. Syst. 57(2): 203-228 (2021)

Review #2
Anonymous submitted on 12/Jan/2023
Review Comment:

The authors proposed PAPyA, a useful and extensible framework for performance analysis. Compared to traditional descriptive and diagnostic analyses, PAPyA has proven valuable for practitioners to choose the best way of deploying BD frameworks.
From my view, this work makes a contribution to the knowledge of the community.

Review #3
Anonymous submitted on 30/Sep/2023
Minor Revision
Review Comment:

This paper introduces a library called PAPyA which allows the processing of RDF graphs over relational Big Data systems. It allows the user to perform SPARQL queries over a big data framework. It further details the functionalities of the library, the challenges, the requirements, and the solutions, as well as how the library can be used in practice and how the data analytics are visualized.

The paper is very clearly written and the objectives of the library are very clearly described. Experiments on existing benchmark datasets were performed.

A discussion on how the functionalities, as well as the outcomes of the tool, can be connected to the objectives described in the introduction section would strengthen the paper.

Moreover, was the tool somehow evaluated based on its use by the experts or in some real scenario?

Is it possible to add some extensions to this library such as support for query optimization?

The library is available as open source from GitHub and the documentation is very clear and easy to follow.

The related work could be mentioned at the beginning of the paper to situate the novelty w.r.t. existing libraries.