Publishing planned, live and historical public transport data on the Web with the Linked Connections framework

Tracking #: 2629-3843

Julian Rojas
Harm Delva
Pieter Colpaert
Ruben Verborgh

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Exposing transport data on the Web for consumption by others poses several challenges for data publishers. In addition to planned schedules, access to live schedule updates (e.g. delays or cancellations) and historical data is fundamental to enable reliable applications and to support machine learning use cases. However publishing such dynamic data further increases the computational burden of data publishers, resulting in often unavailable historical data and live schedule updates for most public transport networks. In this paper we apply and extend the current Linked Connections approach for static data to also support cost-efficient live and historical public transport data publishing on the Web. Our contributions include (i) a reference specification and system architecture to support cost-efficient publishing of dynamic public transport schedules and historical queries; (ii) an empirical evaluation of the impact that API design aspects such as data fragmentation size, have on query evaluation performance for the route planning use case; (iii) an analysis of potential correlations of query performance with particular public transport network characteristics such as size, average degree, density, clustering coefficient and average connection duration. Results confirm that fragmentation size indeed influences route planning query performance and converge on an optimal fragment size per network, in function of its size, density and connection duration. Our approach proves to be feasible for publishing live and historical public transport data and supporting efficient route planning use cases. Yet, for bigger networks further optimizations are needed to be useful in practice. Careful design of data fragmentation strategies constitute an important factor for cost-efficient, scalable and usable publishing on the Web. Additional dataset fragmentation strategies (e.g. geospatial) may be studied for designing more scalable and performant Web API s that adapt to particular use cases, not only limited to the public transport domain.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 31/Jan/2021
Major Revision
Review Comment:

The paper illustrates an effort to publish public transport data as linked frangments and an evaluation of a sort of "quickest path" route planning query with a set of data coming from disparate sources around the world. Besides being an interesting subject, the paper in my opinion has a limited validity and the claims by the authors are not always supported. It is a valuable extension of the previous work by the same research group, but it is not up the promises of a "cost-efficient" solution to any public transport use case that the introduction and conclusions try to suggest.

The main limitations I see in the current submission are as follows:
- there is no comparison with a non-semantic web baseline, failing to convince the reader that the solution is really worth the effort with respect to a more traditional architecture
- the evaluation seems only on the scheduled transport data and with a random selection of queries, so it gives no idea of the representativeness of the tested cases
- the "normalization" operated on the experimental results doesn't show the actual outcome of the evaluation, leaving the impression of "artificially adjusted" numbers
- regarding the live and historical transport data, the authors only provide their claim that the proposed architecture "works": they simply present *one* possible solution
- the addressed use case is only related to a traditional route planning scenario (and not any generic elaboration/query on public transport data), and it is also limited to the earliest arrival time search; a dedicated API would perfectly do and would also be much easier on the client side, so the authors fail to convince the reader of the need for such a linked fragments approach for the specific case
- the discussion section includes some unsupported claims on the advantages of the presented solution

I suggest a major revision of the paper with the following goals:
- clarifying the actual scope and limiting the claims to those that can be indeed demonstrated
- providing a comparison with a non-semantic web baseline
- (maybe also) providing some evaluation on the live/historical data and not only on the scheduled data

My review on the usual dimension for research contributions is as follows:
- originality: the content is indeed original, even if an incremental improvement with respect to the previour work of the same authors
- significance of the results: the evaluation is interesting but of limited validity, because of the complete lack of a comparison baseline based on a traditional approach; some of the results appear to be straightforward and not specifically surprising; the applicability of the presented solution in a real setting is not completely convincing (as the same authors partially admit)
- quality of writing: the paper is in general readable, with a few unclear sentences and a bunch of typos; the structure is reasonable, even if I'd suggest the authors to reorganize it a bit, by anticipating the information about the datasets, their metrics and the metric values before the evaluation section (rather than partly in the evaluation and partly in the result sections)

My detailed comments and suggestions are offered hereafter:
- introduction: the authors start with general considerations about public transport data and services, but then they focus only on route planning: while this is of course ok, I think that they should make this focus and scope clearer since the beginning (and probably also in the paper title). Indeed the authors criticize the use of Web APIs (page 2) because they "limit data accessibility" and they advocate for solutions that overcome such limitations "an API that calculates only the fastest route in a PT network may not be useful when trying to find routes that are wheelchair-friendly or for different purposes than route planning", but then they address a very specific route planning case only, failing to show the superiority of their solution in this respect
- related work, 2.1: the authors make an interesting overview of standards and vocabularies in the transport domain, but then they mainly use Linked GTFS, thus all the following claims on "interoperability" are based on the use of a single model (even if it is a reasonable choice, yet not covering all potential needs for public transport data representation)
- related work, 2.2: the authors "criticize" the existing dumps and APIs in use by the transport operators, because of restricted access and server-side costs; indeed, the operators may *want* to limit access to their data because of business choices; moreover, moving the burden to the client side doesn't seem an advantage from the client point of view
- related work, 2.4: the authors make a strong claim on "route planning is the most prominent use case over PT data" which may be true from the traveller's point of view, but not necessarily from the operator's one; moreover, they choose a specific algorithm (CSA) and a specific kind of query (earliest arrival time): this does not clarify why the proposed approach should be better that a dedicated API for this specific scenario
- related work, 2.5: "unfortunately, historical PT data is not easy to come by": this may be true if you limit the scope to open data, but the transport operators do have and maintain historical information! The fact that they are not openly available doesn't mean that they don't exist and the operators have understendable business reasons to keep that data private
- linked connections: "we showed that indeed LC achieves a better cost-efficiency by consuming considerably less computational resources on the server side": and what about the client side? The authors always seem to ignore the burden on the client side and, as a consequence, on the server-client system in its entirety
- linked connections, 3.1: the term "vehicle" is used but not introduced; however, since it is never reused in the rest of the paper, I'd suggest to abstract the explanation avoiding to include vehicles
- linked connections, figure 4: I'm not sure I got the explanation: are the connections simply in temporal order or does the structure take into account the network topology? or does each document refer to a single specific connection? The textual explanation also does not introduce dt_k and references the figure only at the end of the subsection
- linked connections, 3.4: "access to this data could support analytical studies..." but this is not route planning! What kind of queries do you expect on historical data? "historical data queries are not as performance-critical as live data queries for route planning purposes...": this is true *only if* you want to have a route planning query! What about a query like: give me all trips that were delayed of more than 10 minutes at least 50% of the times during the last 2 months?
- evaluation: as said in the summary at the beginning, the entire evaluation is LC vs LC; there is no comparison with a custom Web API approach: since in the end the auhtors address only *one* query type using *one* algorithm, it would be fair to compare the proposed architecture with a traditional solution, to show that it is not only an academic exercise
- evaluation, intro: in H1 the authors introduce the term "performance" but don't specify what it refers to; in the rest of the paper it is clear that they only refer to query response time, but they never address completeness or correctness of results (which are other interesting parts of "performance"). Moreover, H2 seems to be expressed in a reverse form: "it is possible to find PT networs whose topological characteristics improve route planning query performance when published over LC interfaces" I would have expected the contrary, i.e. that LC interfaces may improve response time in case of specific PT topology characteristics... In this subsection, the authors introduce the data, but the details only come much later: I would suggest a dedicated "data" section, followed by the "evaluation protocol", then "results" and "discussion"
- evaluation, 4.2: the authors introduce walking transfers but it is not clear if those are also modeled as connections. As said, performance is only query response time: do the authors always know the correct/complete solution for each EAT query?
- evaluation, table 2: are those numbers computed only on planned connections? how are they computed (e.g. multiple executions of each query)? I don't understand what the sparkline represents, because it does not look like a frequency distribution: what is on the x axis? time or query repetition? how long is the average duration of the query? is this computed over the entire fragmentation size space?
- evaluation 4.3 and results 5.1: this is what I'd move earlier as part of a "data" section
- results, figure 5: it is completely obscure to me what is represented in the pictures and it is also unclear why it is important; the visualization choice with the black background is very questionable (especially in printed or visualized in black and white), but I also don't find this picture specifically relevant
- results, 5.2: "we use a per-connection result to remove the influence of route query length" but isn't it also influenced by the query "complexity"? Since the queries were randomly genrated, we cannot understand their "difficulty" (e.g. long path but less alternative options vs. short path with a lot of alternatives). The "normalization" seems a way to hide the actual response time
- results, figure 7: why did the author choose the 90th percentile?!? the (normalized) response time depends on the query complexity!
- results, 5.3: the authors could simply provide the correlation between the response time and the various metrics (with the respective statistical significance) instead of trying to infer it from the plots in figure 8; moreover, I'd suggest the authors to provide several numerical outcomes (e.g. correlation, covariance, R^2, etc.) and to display more meaningful plots, for example with linear (or other) interpolation, confidence intervals and similar, to visually display the "strenght" of the relation between variables
- discussion, 6.1: "our approach performs such integration in an efficient way on the API server-side...": how can you say it's efficient? what about client-side? "...freeing data reuser applications from expensive data reconciliation tasks": the authors always test the *same* specific route planning use case, which could be implemented by a single API and they didn't compare to it! "live data updates are never recorded and get lost after being briefly published through traditional APIs": the authors can't say that! They can only say it is not left as OPEN historical data. There are a million ways to publish (or store) historical data and the authors offer only *one* way to do it! "More importantly, it also defines a query interface for this data": depending on the way to publish historical data, there can be other query interfaces!
- discussion, 6.2: if I got it correctly, all your evaluation was done on scheduled data only: what about live and historical? This subsection is the only place where the authors honestly admit that "for several networks this approach seems impractical for real scenarios". Here, they also introduce the discussion on caching: the absence of caching in your evaluation shuld have been mentioned before. Anyway, since there is no comparison to an alternative architecture, it is hard to tell whether the proposed solution (even with a cache) is practical or viable for real usage scenarios
- discussion, 6.3: a sentence may be missing on page 20, second column, line 39. Extra processing due to walking distance computation: the authors never explained how they considered or computed walking distance (which could be modeled as a connection itself)
- conclusions: " a cost-efficient data publishing": unsupported claim. "...facilitates data reuse for client applications": unsupported claim. "...establishes a framework for data interoperability": this is completely independent from the use of linked fragments and out of scope w.r.t. the presented evaluation; interoperability in this case emerges from the reuse of a reference set of vocabularies. Moreover, the authors did not explore further semantic possibilities, by attaching to connections additional "sematics": the authors used only connection time, but they could have used length (shortest path) or traffic condition (less congested path) as in [1], or "beauty" or "happiness" of the route as in [2], and they could have tried to demonstrate the potential of the approach to be applied to several aspects of the data, even keeping the investigation on route planning only. "...careful design of API data structure can affect and improve the performance...": isn't this obvious? "...the approach performs well enough to be used in practical scenarios...": unsupported claim, indeed in section 6.2 the authors admitted the exact opposite

[1] I. Celino, E. Della Valle, D. Dell'Aglio, F. Steinke, R. Grothmann, V. Tresp: "Semantic Traffic-Aware Routing for the City of Milano using the LarKC Platform", IEEE Internet Computing, DOI: 10.1109/MIC.2011.107, 2011
[2] D. Quercia, R. Schifanella, L. M. Aiello: "The shortest path to happiness: recommending beautiful, quiet, and happy routes in the city", Proceedings of the 25th ACM conference on Hypertext and social media, 2014

Review #2
Anonymous submitted on 17/Mar/2021
Minor Revision
Review Comment:

In this paper the authors apply a fragmentation to public transport datasets, in order to allow a processing of these fragments on the client side. The paper provides a specification of public transport connections as RDF (based on the GTFS standard), it evaluates the performance of different fragmentation sizes using a route planning algorithm. This study is performed over 22 different real world transport networks, using 100 randomly generated queries per network.

I enjoyed reading the paper and definitely recommend publishing it in the SWJ. However, I have some remarks regarding novelty and evaluations, see below.

(1) originality

In several parts of the paper you refer to prior work. I would appreciate a short cover letter (or as part of the paper if you prefer), listing the novel contributions of this paper compared to previous ones. Ideally, make clear of what is new, what is reused/refined, what comes from prior works, etc., to get a better understanding of the building blocks of this work.

(2) significance of the results

Evaluation of historical data: The evaluation part investigates the best fragmentation ratio in terms of response time. This is thoroughly done, and gives a good understanding of the factors that play a role.
However, one contribution of the paper is serving historical data, which is not really discussed in the rest of the paper. Historical queries are not covered in the evaluation (with the argument that this task is not performance-critical); I still would include an evaluation (even if it shows the increase of response times).

Evaluation of network traffic: The authors argue that the query performance depends on the size of the fragment size and the topology of the network. What I do not fully understand, is the impact of network traffic: The CSA algorithm performs a scan through the connections and downloads additional documents if required? In a real world setting the total download has the biggest impact? I would have expected a measurement of the transferred data.

Relation to big companies: The paper does not compare to the big route planning applications. I would have expected some discussion of approaches (if available) / scalability (theirs, yours) / some distinctions.

The clustering coefficient is missing in figure 8 for n7 (Auckland). The table and graphs could relate better e.g. in terms of consistent identifier or colors.

(3) quality of writing

To me, some parts were a bit hard to follow and could be better structured. I had to jump between the sections several times. In the introduction, a clear motivation for having historical data is missing. Also, I was a bit surprised to find the research questions only in section 3.5.

Minor remarks:

p.2: “client applications can be autonomously traverse them to evaluate route planning queries”
p.2: “[APIs] limit data accessibility by imposing request limitations due to high maintenance and scalability costs” -> *might* limit? Also, in some cases you have access to both, the dump and running a local version of the API. E.g., think of OSM nominatim.
p.6: *an* arrival stop
p.11: “comes from considering that are too big fragments will increase” -> -are
p.13: “how connected is each vertex in the network”
p.13: “Is defined as the ratio” -> +It
p.19: “We aimed on understanding what factors that drive better or worse performance.”
p. 20: “However we do observe apparent inverse relations for the cases of density D and average connection duration ACD with respect to query response times.models independent we see a pattern”
p.21: “From a general perspective, this work provides evidence” -> incomplete sentence
p.21: “We hope an approach such Linked Connections” -> *as*

This is an incomplete list. In general, the paper is well written, however, there were quite some errors and I strongly recommend proofreading to get rid of strange wordings, missing pronouns, etc.

Review #3
Anonymous submitted on 28/Mar/2021
Minor Revision
Review Comment:

The paper introduces an approach to serve historical and live transportation schedule data. The approach is based on the Linked Data-based approach Linked Connections, which moves a (large) part of the heavyweight route planning to the client application. An evaluation, based on real network and schedule data, gives insights on the systems behaviour when confronted with earliest-time-arrival route planning queries.

The novel contributions of the paper are twofold:
(i) The extension of the existing Linked Connections approach by schedule updates both for current updates as well as historical schedule updates.
(ii) An empirical evaluation of the systems behaviour (query response times) for route planning queries together with an analysis of possibly influencing factors, i.e., different (transportation network) graph metrics.

Whereas the extension to schedule updates is a more technical contribution (server-centric, systems with these features are in production for many years already) the second extension to historical schedule updates is novel as a feature for public facing transportation APIs. While public transportation service companies have to keep data about delays for different reasons (reporting, possible reimbursements), this data is ususally not available to the public at the granularity of individual connections.

Potential adoption of transportation companies is unclear, there would be (at least) two drawbacks: the companies would not get data on the queries routes anymore and they would have to openly publish data on delayed and cancelled trips/stops.

The paper structure offers no surprises and the text is mostly well written.

Main issues, ordered roughly by importance:

The main issue of the paper is the missing or not sufficiently communicated connection of the first and the second contribution: While the first (conceptual) contribution is about the extension with *dynamic* schedule updates, the second (empirical evaluation) contribution evaluates route planning query performance based on *static* data. The evaluation thus does *not* support the conceptual contribution.

For the evaluation the sentence "machines were set ina local network to avoid network latencies" is problematic since network latency is one of the two independent variables of the evaluation (the other one being the fragment size). It is thus very important for the validity of the results, to assume realistic latencies for internet connections. For the results this could mean, that optimal fragment sizes are slightly larger than currently reported.

Hypothesis H2 is not a good research hypothesis, it is not concrete enough.It would be very hard to reject: one would have to prove that no such topological can exist. Furthermore, topological characteristics might be *used* to improve API performance, but they can not "improve" performance by themselves.

The evaluation does not attempt to compare the approach to similar transportation route planning systems.

In Section 3 it could be unclear what exactly the contribution is, and what exists already from earlier publications.

Section 5.3: instead of relying on "visual correlation" the computation of correlation coefficients (e.g., Pearson correlation) would make results clearer.

Minor issues:
* the meaning of the term "PT data" is sometimes unclear. Often it means schedule data, sometimes it means schedule, topology and more.
* Sec 2, line 12: models -> data models
* Figure 2: meaning of the solid blocks inside the pages as well as their links (even between pages) not explained,
* Figure 2 and a few times in the text: "page" and "document" are used inconsistently
* Table 1: gtfs:headsign is missing (occurs in Listing 1)
* end of Sec 3.2: unfinished paragraph starting "time-based content"
* Figure 5: what are the points? Active stops/connections?
* Sec 7: Last sentence of first paragraph is unfinished
* formatting of abbreviations is inconsistent, sometimes upper case sometimes small caps
* several typos which will be caught by a spell checker