Generating Public Transport Data for the Web based on Population Distributions

Tracking #: 1797-3010

Ruben Taelman
Pieter Colpaert
Ruben Verborgh
Erik Mannens

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Applying Linked Data technologies to geospatial and temporal data introduces many new challenges, such as Web-scale storage, management, and the transmission of potentially large amounts of data. Several benchmarks have been introduced to evaluate the efficiency of systems that aim to solve such problems. Unfortunately, the synthetic data many of these benchmarks work with have only limited realism, raising questions about the generalizability of benchmark results to real-world scenarios. On the other hand, real-world datasets cannot be configured as freely, and often cover only certain aspects. In order to benchmark geospatial and temporal RDF data management systems with sufficient external validity and depth, we designed PoDiGG, a highly configurable generation algorithm for synthetic datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PoDiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. Thereby, PoDiGG provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.
Minor Revision

Solicited Reviews:
Review #1
By Wei Wang submitted on 22/Jan/2018
Minor Revision
Review Comment:

Most of my comments raised in the previous review have been addressed. Detailed explanations have been provided regarding those which were imprecise or confusing. Errors have been fixed. I only have two minor concerns.
- in Introduction, "The cause of this correlation is obvious, considering transport animals...". animals should be "networks"?
- in related work, under the "network-based approaches", "one-dimensional distribution function that maps each node to a certain chance", what is meant by certain chance? It sounds like a random function; but needs clarification.

Review #2
By Riccardo Tommasini submitted on 07/May/2018
Minor Revision
Review Comment:

The paper investigates how to generate spatio-temporal transport data w.r.t. population-distribution data. Data generation is an important task for fostering reproducible benchmarking and empirical research. The work focuses on generating public transport data encoded in RDF. The authors' choice of using RDF and Linked Data best practices follows the intuition that transport data usually refer to shared entities.

The paper's main contribution is an algorithm for public transport data generation and its implementation. The authors built on the state of the art of public transport planning, designing an approach that creates a geospatial region, places stops, edges, and routes, and finally schedules the trips.
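For readers unfamiliar with the paper, the pipeline summarized above (region, stops, edges, routes, trips) can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration; all function names, parameters, and heuristics are my own assumptions for exposition, not the authors' actual implementation.

```python
import math
import random

def generate_region(size, seed=42):
    """Create a size x size grid of random population densities."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def place_stops(region, n_stops, seed=42):
    """Sample stop locations with probability proportional to population."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(len(region)) for y in range(len(region[0]))]
    weights = [region[x][y] for x, y in cells]
    return rng.choices(cells, weights=weights, k=n_stops)

def connect_stops(stops):
    """Greedily link each stop to its nearest not-yet-visited neighbour,
    yielding a simple route as a list of edges."""
    edges, remaining, current = [], stops[1:], stops[0]
    while remaining:
        nxt = min(remaining, key=lambda s: math.dist(current, s))
        edges.append((current, nxt))
        remaining.remove(nxt)
        current = nxt
    return edges

def schedule_trips(route, start_time=0, headway=600, speed=1.0):
    """Assign departure times along a route with a fixed headway (seconds),
    producing a few consecutive trips as (stop, time) sequences."""
    trips = []
    for i in range(3):
        t = start_time + i * headway
        trip = [(route[0][0], t)]
        for a, b in route:
            t += math.dist(a, b) / speed  # travel time proportional to distance
            trip.append((b, t))
        trips.append(trip)
    return trips

region = generate_region(20)
stops = place_stops(region, 10)
route = connect_stops(stops)
trips = schedule_trips(route)
```

The point of the sketch is only to make the stage ordering concrete: each later stage consumes the previous stage's output, which is why the architecture of the generator matters as much as any single stage's algorithm.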

The authors claim the state of the art lacked a realistic dataset generator. Therefore, they employ several techniques to assess that the generated data resemble realistic scenarios.

The paper's writing meets the SWJ standards. However, there are a few passages that require further clarification.

The work presents a significant engineering effort, yet it does not highlight all the scientific value it creates. The architectural structure of the generator is as important as the algorithm: it might inspire intuitions about the approach's scalability and clarify the design. Moreover, a requirements analysis is surprisingly missing, which would also have helped to drive the evaluation. As far as I understood, there was one, which was removed after the previous round of review. The authors should consider re-adding it, maybe in a different form.

To this extent, both Jim Gray and Karl Huppler provide good sets of principles to support the design of domain-specific benchmarks. Moreover, it is essential that the authors clarify which tasks users of the generator are expected to test, e.g., query answering.

My most significant concerns regard the work motivation as well as the evaluation.

Regarding the former, as pointed out by another reviewer (reported in the author letter), merely asserting the lack of such a data generator in the state of the art is not sufficient motivation. A benchmark becomes obsolete when it is no longer able to distinguish between the approaches that adopt it, i.e., all the solutions look good. Given that a benchmark consists of data, one or more tasks, and a set of KPIs for comparison, we can upgrade a benchmark by tuning any of them.

Is it true that existing spatio-temporal benchmarks and data generators are obsolete? Moreover, which tasks are PODIGG users going to test?

Regarding the latter, the use of Duan et al.'s coherence metric to assess whether a given dataset is realistic is not convincing. Indeed, Duan et al. define the coherence metric to highlight the structural differences between synthetic RDF datasets and real ones. High structuredness makes an evaluation of RDF stores poor because it makes the results less relevant in practice.

Unfortunately, it is easier to identify a characteristic that makes a dataset a bad candidate for benchmarking. On the other hand, claiming the opposite requires a more complex study. What the authors did by going more in-depth in the comparison is a step in the right direction.

Nevertheless, they did not fully identify which characteristics of the real datasets make them relevant samples to study. This again raises the problem of better positioning the work in the state of the art, which requires 1) identifying which tasks to solve over the generated data, and 2) surveying existing solutions to inquire whether or not they are able to reach the level of observability that PODIGG enables.



- the designed algorithm follows best practices
- the implemented tool is highly configurable
- the intuition about dataset "distance" goes in the right direction

- Motivation and Comparison must be improved
- Tasks are essential elements of a benchmark (e.g., queries)
- Evaluation is not convincing because of 1) the use of the coherence metric, and
2) the lack of a term of comparison, i.e., what one can benchmark using PODIGG that she/he could not benchmark before.

Review #3
By Carlo Allocca submitted on 16/May/2018
Review Comment:

The paper presents an approach to generate a synthetic geospatial and temporal dataset. The approach is based on a mimicking algorithm and is framed in the context of the public transport domain. The underlying hypothesis of the proposed method is the high correlation between a public transport network and the population distribution within the same area. Based on this, the authors claim that the proposed method has the advantage of producing datasets with realistic characteristics, outperforming the existing ones, and providing a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.

Although the presentation and the content of the document are quite clear, the Motivation, Originality, Methodology, and Results are not well linked together. The paper raises a number of remarks that should be considered and clarified:

1) The title. It does not describe the intent of the whole paper well. An example such as "Generating a Combined Geospatial and Temporal Dataset for Benchmarking RDF Data Management Systems" would improve the main message of the paper.

2) The Abstract. Most of it discusses benchmarking; only the last three lines describe, in a very vague way, what the paper is about. Please revise the structure of the Abstract.

3) The Introduction. The reader would expect to easily understand (i) what the main question(s) addressed by the research is, (ii) why it is relevant and interesting, and (iii) how original the proposed approach is. Of course, all these should be covered briefly. Although some of these aspects are described, they are not linked together to guide the reader. For example, the authors state: "There is a need for benchmarks that evaluate systems handling datasets with geospatial and temporal characteristics, but it is important that the datasets they use are realistic." Here, the need is mentioned but not specified (what is the need?), and what does "realistic" mean in this context?

4) The Related Work. One would expect to easily understand what the work adds to the subject area compared with other published material. The section is more descriptive than comparative. Moreover, expressing the "need" by merely referring to [2] and the Hobbit project does not help the paper to be self-contained. I would suggest expanding this concept.

5) The Public Transit Background. The authors do not need to spend more than one page (two columns) describing the background. It would instead be highly recommended to expand the Research Question section, which spends just three lines describing the key RQ.

6) The Method. The positions of Figure 3, Figure 7, and Table 3 are not very clear. Overall, it is not clear why the proposed approach should be relevant for the subject of research. Why could one not choose another approach? As this work is in the context of benchmarking, what quality, functionality, and types of performance are required to evaluate RDF data management systems w.r.t. geospatial and temporal data? The Evaluation: there is a comparison with 3 gold standards. It seems it would not be very difficult to obtain real data with the same specifications; every big city could expose such data and provide a big dataset with the same characteristics.

7) The Discussion. The references [13] and [20] are out of scope. The authors should add examples of publications that use ML or DL to solve a similar problem. Otherwise, the way the two topics are discussed is not valid only for this research topic but holds in general (tautology).

All of the above need to be properly addressed; then the paper will improve significantly and be ready for publication in this important journal.