Generating Public Transport Data for the Web based on Population Distributions

Tracking #: 1580-2792

Ruben Taelman
Pieter Colpaert
Ruben Verborgh
Erik Mannens

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Applying Linked Data technologies to geospatial and temporal data introduces many new challenges, such as Web-scale storage, management, and the transmission of potentially large amounts of data. Several benchmarks have been introduced to evaluate the efficiency of systems that aim to solve such problems. Unfortunately, the synthetic data many of these benchmarks work with have only limited realism, raising questions about the generalizability of benchmark results to real-world scenarios. On the other hand, real-world datasets cannot be configured as freely, and often cover only certain aspects. In order to benchmark geospatial and temporal RDF data management systems with sufficient external validity and depth, we designed PoDiGG, a highly configurable generation algorithm for synthetic datasets with realistic geospatial and temporal characteristics comparable to those of their real-world variants. The algorithm is inspired by real-world public transit network design and scheduling methodologies. This article discusses the design and implementation of PoDiGG and validates the properties of its generated datasets. Our findings show that the generator achieves a sufficient level of realism, based on the existing coherence metric and new metrics we introduce specifically for the public transport domain. Thereby, PoDiGG provides a flexible foundation for benchmarking RDF data management systems with geospatial and temporal data.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 03/Oct/2017
Major Revision
Review Comment:

This paper presents an approach and a system called PoDiGG for generating synthetic public transport networks. The generator produces a synthetic transport network, which includes stops, edges (connections among stops), routes (combinations of edges) and trips (timetables). The generated data are transformed to RDF, thus producing datasets that can be used as Linked Data benchmarks. The main goal is to produce synthetic data that are realistic, i.e., mimic as closely as possible real-world transport networks. The main hypothesis, upon which the generator is based, is that there exists a significant correlation between transport networks and population distribution. Hence, the algorithm starts with a grid where cells are annotated with population density, and generates stops based on that. Then, the rest of the components of the network are generated based on assumptions and heuristics drawn from transport network design methodologies (e.g., producing a connected network, maximizing area coverage, maximum route length, etc.). The generator is experimentally evaluated using the actual transport networks of the Belgian railway, the Belgian buses and the Dutch railway. Randomly generated networks are used as baseline for comparison.

The paper addresses an interesting and challenging problem. A synthetic data generator for public transport networks can be very useful not only as a Linked Data benchmark but in several other scenarios as well, e.g., for testing and evaluating route planning algorithms. Indeed, the system is designed in a modular way so that the core generator produces a GTFS dataset and then transformation to RDF is done via an extension. This is a good design choice that makes the generator more easily usable by a broader community. Moreover, the generator is provided as open source software, which again greatly facilitates reusability and extensibility. Regarding the approach itself, using population density as a driving factor for generating the transport network seems an intuitive and reasonable assumption.

However, the paper also has some important shortcomings. Especially, my main concern is about the experimental evaluation of the approach. Using a randomly generated network for comparison is a very naive baseline. Comparing against a random method makes sense in relatively simple problems, e.g., in binary classification, where even a random algorithm has reasonably high chance (50%) of correctly guessing the outcome. However, a transport network is highly complex, thus randomly generating one will naturally fail miserably in mimicking a real-world network. I imagine that any method that is even slightly based on some reasonable assumptions for the structure of a transit network would significantly outperform a randomly generated one. Therefore, this comparison is hardly informative in my opinion. Instead, a less naive baseline should be used, so that it is possible to assess the accuracy of the generator more convincingly.

Moreover, although it seems reasonable that population density is an important factor for the design of the network, other factors can be significant too. One particular type of data that intuitively seems highly relevant is Points of Interest. Usually, bus and metro stations are located not only in areas with high population density but also in places where there is a high concentration of POIs (e.g., shops, universities, government buildings, tourist attractions). In fact, in certain areas, such as commercial or industrial zones, population density may be low but they may still contain a large number of stops and routes. Such information is not mentioned and is not taken into account in the paper. It seems however that it would be intuitive (and perhaps relatively easy) to extend the described approach to incorporate also POI data (e.g., from OpenStreetMap) and use POI density (or more elaborate features based on certain POI types) alongside population density. This point is also related to the previous comment, because then the authors could compare and evaluate three methods: (a) using only population density, (b) using only POI data, (c) combining both. This could potentially improve the accuracy of the generator, and in any case it would provide a much more plausible alternative method for comparison.

Other comments:

- Section 5.1: Requirements. This part seems a bit out of place and probably out of scope too. In my opinion, the inclusion of this detailed list of requirements here is a bit unexpected at this point and loses focus for the reader. Up to that point, the text is quite coherent and focused around a clear and concrete problem statement: how to produce synthetic transport networks that are *realistic*. This section suddenly changes the focus to a system perspective, introducing a list of requirements that (although reasonable and desirable) are orthogonal to the discussion and objectives so far, e.g., whether the generator offers visualizations, whether it is easy to use, whether it is open source, whether it runs in all major operating systems, etc. In my view, this part could be removed, and could be instead substituted by a short paragraph simply stating the characteristics of the implemented system.

- Functional Requirement 2: Mimicking querysets. It is not clear if, where and how this requirement is addressed or discussed.

- I do not see how the coherence metric (Section 6.1) is relevant here. As explained in the paper, this is a measure of structuredness that refers to the degree to which the instances of a type have values set for all attributes applicable to that type. This is a measure that makes sense for datasets with entities adhering to a schema where attributes vary or are optional, or where real-world data have a large portion of missing values. In this case, i.e., GTFS data, the schema is quite concise and concrete, and it is expected that all information is present (as indicated also by the values shown in Table 3, which are around 98-99%). Even if there are some missing values in the data, it is not clear whether the expected behavior of a synthetic generator would be to introduce missing values in the data too. Thus, I think the authors should clarify this point a bit more, explaining whether and how this metric is relevant and applicable here.

- Performance issues and evaluation. During the experimental evaluation, the authors mention that there are also performance issues when generating and evaluating large networks. They also point out the complexity of the operations performed while comparing the generated network with the actual one. Understandably, this is a valid point and worth mentioning, but it raises again a question of focus. If the authors consider efficiency and scalability to be within the scope of the paper, this point should be already mentioned much earlier, in the beginning of the paper, alongside the main objective of providing realistic output. In that case, the authors should also put more effort on showing how these performance issues are addressed in order to achieve better efficiency and scalability. Otherwise, if performance is not the focus here, it might be better to restructure a bit the text moving all such mentions to a separate subsection, in which to point out the problem, show some indicative measures and perhaps state that this is a potential aspect to be addressed in future work.

Review #2
By Wei Wang submitted on 04/Nov/2017
Major Revision
Review Comment:

The paper is very well written and straightforward to understand. The main contributions lie in the design of the dataset generation algorithm following the public transit planning methodology (as well as the population distribution) and a set of evaluation metrics for comparison with other (real) datasets. The generated datasets are of potential value to the research communities to benchmark different storage, query and search techniques.

Having said that, I would point out that it is not easy for me to find very notable novelties from the paper, although there are some interesting ideas in it. This is my major (and only) concern, and the reason that I would not recommend an accept immediately. The work in my opinion represents a significant amount of engineering effort and has achieved reasonably good results with respect to the evaluation metrics. Some parts of the paper are written in a style very much like a project deliverable (details can be found below). The authors are suggested to revise them. Below are some of the other comments that might be useful to the authors to further improve the presentation of the paper.

- in introduction, the authors claim that the main contribution of the paper is a mimicking algorithm for generating realistic public transport data. However, in the 1st paragraph, they also say that “different information architectures with varying trade-offs exist” and then quickly discuss the architectural choice. How is the architectural choice influencing/related to the focus of the paper?
- in introduction, “Our second main contribution is an implementation of this algorithm…”. This could be combined with the 1st contribution as they are both about the algorithm.
- in introduction, “semantic data models. [9,4]”, -> “semantic data models [9,4].”
- in section 2.1 Spatiotemporal Dataset Generation, “…over which objects can move at a certain speed based.”, remove “based”.
- in section 2.1, “…are represented as two-dimensional areas that exist for a period in time over the network, which apply a decreasing factor on the maximum speed of the objects in that area.”. This sentence does not read well and needs to be rephrased. How can one represent weather conditions and event as a two diminutional areas?
- in section 2.1, for the three methods to select the starting node, it would be better to present them in bullet points or a table. It would also be better to explicitly mention what the two/one dimensional distribution functions are.
- in section 2.1, “…generate new instances based on a set of dependency rules (called “rules” [20]).”, what is an instance here? It would be better to briefly mention what the instance represents in [20]. “(called “rules” [20])” can be removed. It would be better to present these dependencies in bullet points or a table.
- section 2.2 also discusses methods for synthetic datasets generation but with RDF, it can be combined with section 2.1.
- in section 2.2, “…since this level of structuredness will have an impact on how certain data is stored in RDF data management systems”. I am not able to understand what is meant by this sentence here.
- I feel that section 2.3 is a specification or requirement for public transit design. It does’t seem appropriate to be under the “related work”. Also the section can be shorten in my opinion. It would be better to present those items for objectives, metrics and so on, in either bullet points or tables.
- section 2.4 on transit feed format talks about the specification and knowledge representation in the domain. It seems not appropriate under related work. Section 2.4 and section 2.3 are both about the background of the public transit domain. Separating them from the literature review might be a better idea.
- section 3, “The main objective of a mimicking algorithm is its ability to create realistic data.”, remove “its ability”.
- section 3, “by first comparing the level of structuredness of real-world datasets compared to their synthetic variants”. This sentence does not read well and needs to be rephrased.
- section 3, what are meant by “macroscopic coherence metric and domain-specific microscopic metrics”?
- section 4, Connection is an important entity in the design. It is worth explaining a bit more, e.g., how it is instantiated in the dataset? How to define the two ends of a connection?
- section 4.2, page 7, Eucilidian distance: is it calculated using the geographical coordinates or the indices of the two dimensional area for the region (matrix)?
- section 4.3, under “Short-distance”, “…maintains its center point that represents the average location of all stops…”, I would suggest to use centroid in this case, especially you have explicitly named the clustering algorithm. “…make sure that nearby clusters are merged before more distant clusters…”. This seems unnecessary as the standard agglomerative clustering will do that. What you need to is to choose an appropriate cutoff value on the dendrogram.
- section 4.3, under “Cleanup”, “…that the amount of stops…”, “…significant amount of loose stops…”, the number of stops seems better here.
- Figures 3, 4, and 5, add some extra space between all the sub-figures.
- I can somehow understand Algorithm 1, but there are things to be clarified. Line 8, why do you use multiplication of the difference between coordinates and Radius? Line 10, what does the index i mean? What is “I”? Line 12, it does not show how the random station s’ is calculated.
- Equations 2, 3, and 4, the symbols should be explained in text.
- section 4.5, under “Delay”, it is worth explaining how the delay factors as in the transport disruption ontology affects the trip, for example, how the trip time is updated in the presence of different delay types.
- section 5.1, first, these requirements are best to be enumerated in a project report, while not in a research paper; second, if they are to be kept in this paper, it would be better to make them short and to list them in a table.
- section 5.2, compossable -> compossible. “Node module on NPM and as a Docker image on Docker Hub”, it would be better to use one sentence or two to explain what they are. I think many people, including me, don’t really know.
- Figure 6, “Each route has a different color, and darker route colors indicating more frequent trips over them…”, I am afraid that one cannot see the “darker” one from different colors.
- Table 2, would it be better to provide the table as a footnote with a link? Similar to the requirement specification in section 5.1, it is of less interest to the readers of the paper, who are primarily looking for novel ideas on design and implementation.
- section 6.1, under “Metric”, “…the coherence metric [14] measures the structuredness of a dataset.” -> “the coherence metric [14] measuring the structuredness of a dataset is used.”
- section 6.1, under “Results”, “That is because of … and the fact that they originate from GTFS datasets that have the characteristics of relational databases.”. Is there a correlation between high structuredness and the fact that the data is from a relational database?
- section 6.1, under “Results”, “the same amount of stops, routes and connections”, -> same number of
- section 6.2, “…and the other way around given a distance function…”, what is meant by “the other way around” here? It is obvious that the function in (6) is symmetric.
- section 6.2, under “Edges Distance”, “…weighed by the length of the edges”, -> “weighed by the difference of length of the edges”. Why there is a “+1” in equation (9)?
- section 6.2, “the realism of stops and routes is lower, but still sufficiently high to consider them as realistic”. But from Table 4 the data for routes is not sufficiently high. Also, it doesn’t seem reasonable to average the values across stops, edges, routes and connections. They are better to be discussed separately.
- section 6.3, under “Results”, “…but now the increase for routes and connections is higher than for the connections parameter.”, do you mean higher than the stops parameter?

Review #3
Anonymous submitted on 21/Nov/2017
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper presents an algorithm for creating more realistic benchmark datasets based on population distributions.
According to the authors, previous benchmark datasets weren't generating realistic enough datasets.
The paper is well structured, easy to read, its contributions are clearly stated, and all of the results are published online.

However, the motivation that the existing benchmarks are not sufficient seems to be rather thin. Is there evidence that system evaluations failed in the past because the synthetic datasets weren't realistic enough?

"We observed a significant correlation between transport networks and the population distribution.."
Isn't it rather one of the causes they are built? Transport networks are built to move either people, goods or animals.

Which makes the research question seem to be rather obvious to answer.
“Can population distribution data be used to generate realistic synthetic public transport networks and scheduling?”

The figures 7-9 could be bigger.

Smaller things:
p.1: There is a use for .. -> There is a need for .. (?)
p.11: compossable -> composable