Publishing and archiving planned and live public transport events with the Linked Connections framework

Tracking #: 1949-3162

Julian Rojas
David Chaves-Fraga
Pieter Colpaert
Oscar Corcho
Ruben Verborgh

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
Using Linked Data-based approaches, companies and institutions are seeking ways to automate the adoption of Open Datasets. In the transport domain, data about planned events, live updates and historical data have to coexist to provide reliable data to route planning assistants. Linked Connections (LC) introduces a preliminary specification that allows cost-efficient publishing of the raw public transport data in linked information resources. This paper gives an overview of Linked Connections so far and supports claims with existing and novel experiments. Furthermore, (i) an extension of the current Linked Connections specification providing methods and vocabulary to deal with live data is provided; (ii) a Linked Connections Live server is developed that is able to process GTFS-RT feeds providing consistent identifiers; and (iii) an efficient management of historical data taking into account the size of each fragment exposed on the Web is described. We discover that the size of the fragments has a relevant impact on the performance of query evaluation. Based on our experiments conducted in 2018, an ideal Linked Connections fragment -- for the use case of route planning with a client developed for this work -- weighs about 50kb. This research scratches the surface of a Web ecosystem for route planning. In future work, we envision finding optimal fragmentation strategies for larger public transit networks for automated federated route planning.

Solicited Reviews:
Review #1
By Patrick Westphal submitted on 26/Sep/2018
Major Revision
Review Comment:

In their submission 'Publishing and archiving planned and live public transport events with the Linked Connections framework' the authors summarize the current state of their route planning solution based on Linked Connections and the Connection Scan Algorithm. The authors further performed an evaluation to show the improvements over previous versions of their route planning solution, and to determine the impact of different fragment sizes on its query performance.
The authors' motivation for the need of such techniques to establish integrated and optimized access to public transport data seems sound. Moreover, having an in-depth investigation about performance optimization for Linked Data-backed, query-intensive algorithms is, in my opinion, a relevant research topic fitting the scope of the Semantic Web Journal.
One main issue I see regarding the overall presentation of the work is that it is not self-contained. As I am neither experienced in working with Linked Data Fragments nor in the details of route planning algorithms, I would have appreciated it if more of the terms and abbreviations used throughout the article had been introduced properly. Overall, after having read through the paper I still only have a vague idea of how things work, since I would have to consult earlier papers to get the details. Furthermore, it seems to me that the submission is lacking proofreading. Besides grammar errors, many explanations and arguments are hard to follow. To give an example, one of the claimed main contributions of the paper was to split the overall data into equally sized fragments, as compared to fragments covering a fixed-size time period (which may result in unequally sized fragments). The main improvement reported in the article is that response times are more predictable; however, it is not discussed in which way this influences the overall user experience etc. Furthermore, figures usually are not explained and are too small to be readable on a printed copy. Another note concerns section 5 (Results), which could basically be compressed to one table which would then be discussed later. (This fill-the-blank exercise for 'Figure XX portrays the response time distribution for each tested fragmentation in a set up without a cache. For this case the best performance was obtained with a fragmentation of YYKB and a median response time of ZZms.' and so on just scatters the numbers across many pages and hampers a comparison.) Besides this, I would also like to have a more detailed presentation of the numbers. All a reader gets to know is the average of the response times for each fragment size and that "Each query is executed at least 2 times". The exact number of executions and the number of queries would be interesting, though.
A further issue I see concerns the originality of the submission's main contributions. As claimed, the major contribution was the integration of historic and live traffic data. However, this was already reported in a previous paper of the authors and IMO no clear statement was made how the current submission differs from that. As a further contribution the authors list the route planning algorithm, but as I understood, the algorithm they apply (Connection Scan Algorithm) was published by an unrelated group in 2013. In this regard it is not clear to me what the concrete contribution would be.
The authors provided the source code of their route planning solution, as well as the datasets used in the evaluation. (However, for future comparison, providing a Git commit hash or tag would be good since most of the software seems to be under ongoing development.) In terms of the outcome of the evaluation, i.e. that
1) caching improves the query performance
2) there is an optimal fragment size for Connection Scan Algorithm-based, Linked Connection-backed route planning ("Therefore we can affirm that a 50KB fragment size, with an active cache, provides the optimal performance for answering route planning queries on top of the LC framework.")
3) evaluated optimal fragment sizes provide a better performance than an arbitrarily chosen fixed time period-based fragmentation
I see severe problems. The insight of 1) is fine but IMO also too obvious to be discussed in that length and detail.
With 2) I cannot agree. My intuition would be that the fragment size providing an optimal query performance depends on many things like, e.g., (a) the network/graph structure of the underlying transportation network, (b) the frequency with which a connection is served by e.g. a certain tram line, (c) the client cache size. The authors acknowledge (a) for the uncached setting (there coined 'geographical restrictions') but did not touch on any other things that could influence the query performance. Moreover, if there really was an optimal fragment size, does it really come without any restrictions? Could one really take the fragment size that worked well in two cities and generalize that this would also work best in all other cities/counties/countries without looking at (a), (b), (c), ...? Even if this was the case, IMO it would be essential to back this by discussing issues like data granularity, query selectivity, cache sizes etc.
Outcome 3) as it is stated here, IMO does not add any further value unless discussed in more detail. I did not get what the actual problem is if fixed time period-based fragmentation is applied and how this relates to fixed size fragmentation in terms of the (average) query performance.
Another issue I see is that the evaluation does not give a good idea of how this improved solution compares to other route planning/public transport data services, or why such a comparison would be difficult.

Overall I consider the problems mentioned as major and cannot accept the submission in its current state for publication.

Review #2
By Adrian M.P. Brasoveanu submitted on 12/Oct/2018
Major Revision
Review Comment:

Overall the paper is well-written, albeit with lots of typos (several on each page, but I only collected some of them). The problem they focused on is really interesting.

The Introduction contains both the actual introduction and the problem statement, which might make it confusing for those who are not already familiar with your previous Linked Connections work. It would read much better if the problem statement were developed in a separate short section. This is particularly important as this paper has around 20 pages; therefore, if someone invests 30 minutes to an hour to read it, he/she should end up having a good understanding of the topic from the ground up.

It is not immediately clear to me why historical data is important for the transport domain. I understand that archiving is a good use case, but speed is not necessarily important there. Also there doesn't seem to be any mention of recording incidents that took place during past rides. I would assume present and future timetables are more important for most people. Historical data would only be important for strategy planners (e.g., City Hall representatives, administration). Without a clear description of the archival use case, I can almost think that storing past connections in some sort of timetable format should be enough. Please include a section that includes the various use cases. Please also add some explanation on who would benefit from each use case.

Whenever a new term is used, it should be explained. For example, the following sentence is not clear at all because the origin-destination API concept is not fully described (a search in the paper also reveals that this concept was used only once in the whole paper):

"The authors found that, at the expense of a
higher bandwidth consumption, more queries can be
answered using LC than the origin-destination API." (page 2, lines 15-17)

The various descriptions of algorithms in the Related Work section are really good. In some cases, however, it is not clear why certain algorithms should be used. For example, I would rather believe most people would want the fastest route that has the lowest number of connections in most cases. Besides this, it would be great if all these approaches towards modelling connections were presented in a table which distinguishes their main characteristics (e.g., live or historical data, attributes, etc.).

Section 3 does a really good job explaining what the Linked Connections Framework is, while also adding some material about the newly developed extensions.
At one point in section 3.1 it is explained that some results are paged. Would it be possible to add some numbers so that casual readers understand this comment?
I think the user interface for the iRail can be improved a little bit by combining the displayed information with that obtained when clicking on new locations/destinations.

Sections 4 to 6 - Evaluation - are somewhat problematic. The main reason is the following: speed requirements and usage characteristics will be different according to the use case. I would assume live data use cases need to use a framework that simulates how LC behaves under stress conditions (e.g., at a minimum a few thousand connections, some failures, etc.), whereas different archival use cases might also have different needs (e.g., recent archives might be important for the police or other third parties if some incident occurred on one of the routes; old archives, e.g., older than a year, are mostly important for historical purposes, as long as they can be correlated with other data sources somehow). There are also no details on scalability. One might want to understand, for example, if heavy usage has any impact on fragmentation. If there was no intention to simulate real-world usage, this needs to be clearly explained. Most of the descriptions hint at testing "if there is an optimal fragment size for transport network datasets that maximize the performance of answering route planning queries".

The last section concludes the paper and includes future work.

Overall, the paper is well-written and the contribution is clear. The paper however needs more work on several fronts, as already explained:
- problem statement
- use cases
- evaluation.

Also various smaller details need to be improved upon.

Font in figure 10 needs to be increased a little bit, for example.

The architecture of the system is missing (a picture with it could really help a lot).

Also there is a good amount of typos (only showing some of them):

"adding that features both on server" (page 2, line 19)
"Tripscore, a Linked Data client that consume several Linked Connections" (page 2, lines 27-28)
" Next we describe how are the entities URIs build based on these templates" (page 6, lines 18-19)
"First, the server create the Linked Connections from the information provided by a GTFS dataset" (page 7, lines 28-30)

The paper is original and builds on years of work with Linked Data in Belgium. The contribution is generally well-explained, even though the use cases mentioned in the article need to be guessed from its various sections. The quality of writing can be seriously improved. I would recommend the authors to provide a Major Revision.

Review #3
By Ilaria Tiddi submitted on 05/Nov/2018
Major Revision
Review Comment:

The paper presents an overview and current status of the Linked Connections (LC) framework, where semantic technologies are employed to allow the inexpensive publication of public transport data gathered from heterogeneous, possibly linked, data sources. The three components of the framework are presented: (i) the vocabulary designed to deal with rapidly changing ("live") data; (ii) the server developed to process data feeds compliant with transport standards (GTFS-RT); and (iii) a size-based file fragmentation strategy for data management and exchange. The resulting framework allows the authors to face challenges such as guaranteeing URI stability and validity over time and the management of past ("historical") data for query optimisation. The authors show how the new version of LC is employed in the context of planning bus routes in Belgium and Spain, and evaluate it against its previous version in terms of query execution performance. Results show that fragmenting and caching fragments improve the execution of route planning queries.

Overall, the work is interesting from an industrial perspective, and it is clear that it is the result of research conducted over several years, hence definitely fit for a journal venue. I believe that the problem of dealing with historical & live data (let’s call them dynamic data) and rapidly managing them is quite relevant, not only in transport, but also in areas dealing with streaming sensors such as smart cities and the Internet of Things.

Regarding the evaluation criteria specifically, the paper presents some novel ideas with respect to (their) previous work, and results are somewhat significant (even though it is only compared against the authors’ own baseline, and some of the conclusions seem quite trivial). Quality is overall good, with a number of things to be changed (see last part). In general, I have some comments and concerns, which make me say the paper might need some major modifications before acceptance.

# Introduction
This is clear enough, I believe; the problem, challenges and overall approach are well stated. It might be good, though, to give a bit more description of Figure 1 (lines 29-30), to put the reader in context. Also, I am not sure if that sentence is in any way related to the previous sentence. If so, it should be better clarified.

# Related work
Not convinced at all about this section. This seems more of a "background&motivation" section, where the authors mostly report on their own previous work and what brought them to implement new features for their framework. This actual section can and should stay (as background&motivation), but I recommend the authors to work on creating a new section of "related work", where they actually locate their work in the literature. More specifically:
1) I was surprised to see no mention of approaches from the transport domain for managing route planning: yet, this is quite a big area (transportation networks/transport planning/intelligent transportation systems). What have people in these areas done? Which algorithms do they use, and how do they manage data? Why is the Linked Connections framework supposed to be better?
2) While I do see the fundamental differences, another quite relevant area should be RDF stream processing/querying, which I am sure the authors are aware of. There, many stream processing systems and frameworks have been looking at the integration of historical and real-time data too, so they should be mentioned and the main differences with LC have to be highlighted.
3) Anything from the IoT and Wireless Sensor networks areas for the management of dynamic data?
4) Also check on how semantic technologies have been used in smart cities, to support management of heterogeneous volatile data
These are of course pointers, but more things could be included.

On page 5, lines 1-15: The conclusions are very weird: "we think that it is the moment to" … based on what? I would use "based on the evidence presented […], we propose …". And following: what constitutes reliable transport data?

# Section 3 (the framework)
The description of the framework is ok in subsections 3.1 & 3.2, but the quality drops noticeably in 3.3. This last section should be better revised, as the reader gets lost. The sentence on lines 15-19 (second column) is not clear, nor grammatically correct. The sentence on page 8, col 1, lines 21-31 is also too long. Other minors below.

W.r.t. the use case of section 3.4, has the framework not been used to implement any mobile/web client application relying on it? If yes, it might be useful to present them, for completeness of the work&paper, but also to show the soundness of the framework. Was/Is LC used by NMBS at all?

# Evaluation
As I said, the work was compared only against the previous version of the framework, but the authors devise a large number of experimental settings showing results convincing enough. I would not call them "scenarios" but rather "experimental settings"; in the end you use the same system with parameter variations.

In general the whole section is quite lengthy and could be simplified by putting all the exp. settings in a table, or by merging figures on similar topics (e.g. Fig 11&12, Fig 13&14, 15&16 etc.), as it is also hard to interpret results if they come one after the other. By the way, perhaps the discussion of results could be held alongside the results? This might help readability.

On page 9, col 2, what exactly is the role of the list on lines 33-40? It sort of comes out of the blue; perhaps an introductory sentence could be useful.
On page 10, line 35, could you point to the GTFS datasets and GTFS-RT feeds available as open data?

# Conclusions and future work
This is also okay. I would nevertheless invite the authors to think about and discuss which other applications the LC framework could have outside the transport domain.

# Style & Minors
- Captions in Tables&Figures should have a full stop at the end (there are very few as such). Also, captions for Tables go above the table, captions for Figures go below.
- page 1, line 35 : the manage of these >> the management of these
- on the top (found in a couple of sections) >> on top
- page 6, lines 11-15 should be unindented (and add a full stop after "Figure 4" on line 14)
- page 7, lines 24-28 should be unindented
- page 7, lines 39-43, replace with first column sentence: After defining how to create Linked Connections feeds which take into account live data, our third contribution is how to provide reliable access to historical and live transport data in a cost-efficient way.
- page 7 line 27 : both, the historical and the live connections >> both historical and live connections
- page 9, line 15, a fragment Fk-1 whose >> a fragment Fk-1, whose
- page 9, line 42 : the algorithm process >> the algorithm processes
- page 9, line 50: figure 1 >> Figure 1 (also make sure you use Fig or Figure consistently throughout the paper)
- page 10, line 32 >> table 2 >> Table 2 (see comment above)
- page 18, line 43 : by relaying >> by relying. and "a set of common semantics" >> common standards and vocabularies