Review Comment:
This paper is about publishing Public Transport data in a cost-efficient and flexible manner using Semantic Web technologies (the Linked Connections approach). It studies the performance tradeoffs for route planning queries when data is published this way, and compares the approach with a "traditional" implementation that relies on a "fat" server instead of pushing some computation to intelligent clients.
DISCLAIMER: This is the first time I review this paper, which I see has been submitted previously. In line with what the editors of this journal have asked of me in similar situations, I read the paper with fresh eyes, without evaluating or assessing changes from previous versions.
(1) originality: The approach is original; it differs sufficiently from previous work by the same authors.
(2) significance of the results: Not groundbreaking, but in my opinion just sufficient. Indeed, the paper concedes that in its current form the approach is still far from practical for larger networks. In addition, the inherent bandwidth overhead of Linked Connections means that mobile apps, which I think are the most common client applications, would be slower and more expensive. That is relatively bad news for the SemWeb community, but still a valid scientific result given the methodology followed.
(3) quality of writing: Can be improved; further comments below.
(4) Resources: A GitHub repository is provided. The repo also includes an external link to an institutional repository, which I assume complies with the requirements for research data deposit.
Detailed comments:
Abstract:
You say "fragmentation size influences route planning query performance and converges on an optimal fragment size per network, in function of its size, density and connection". From the results shown in section 6.1, I cannot see where that function is. I was expecting you to derive something like F(S, D, C) -> Fragment Size.
Introduction:
The introduction motivates the work well with respect to "Open Data", but jumps to "Public Transport Data" without explaining why the client-server cost tradeoffs are important for that domain. What is wrong with current PT open data publishing?
The contribution "Shows how Semantic Web technologies can be applied not only to describe domain specific data, but also interfaces that enable applications to consume it, whose principles could be re-used towards more generic, domain-independent and autonomous data applications" is quite fuzzy. I'm not certain what is meant by "generic application" or "autonomous application", nor how what is presented here contributes to them. What is presented here is for the PT domain (as stated in the immediately preceding sentence), so I don't see how your contribution creates "domain-independent" applications.
Related Work:
In section 2.2 it is mentioned that the approach ultimately lowers the cost for data publishers. Please provide references for this: has it been quantified anywhere? I believe previous work by some of the authors has shown the load balancing part, but not the cost for publishers. I also wonder about the following: in the context of PT, the publisher is usually the transport agency or provider, which has a mandate (or a business interest) to develop a client application too. How does the cost balance work there?
The same remark applies to section 3, where you mention the tradeoff of "increased implementation complexity on the client". You mention a mitigation strategy at that point, but this should be expanded in the discussion section.
In terms of contributions, I can see what a "general architecture" is, but the adjective "integrated" does not add anything.
"An study of the factors that influence route planning query performance": it should be specified that this concerns route planning queries under the data publication conditions imposed by your approach.
Section 3:
The AVL tree is nice, but it is part of the implementation, so I don't understand why it is considered part of the "LC architecture". If your architecture is "general", then the Live Data Manager is a component that does something, and the AVL tree is just your implementation of it.
Section 4:
Minor: "set heterogeneous" -> set of heterogeneous
The word "observed" does not read well to me; "measured" seems more appropriate.
To me, this section should be about the datasets and metrics used for the experimentation. You include here the choice of modeling as TVG, which I believe is part of your approach, and specifically of your reference implementation of the general architecture.
There is no explanation of why the 22 PT networks were chosen. Were they simply the ones available? You mention that they are "representative in terms of modes of transport and geographical coverage". Did you consider a larger set and then discard some? Did you choose them to have a variety of sizes/degrees/densities? (This does not seem to be the case.) What timeframes were considered and why? In the caption of Table 2 you mention the "number of active stops during their busiest day": what does "busiest" mean and how was it established?
Section 5:
The formulation of the hypotheses is not consistent with the questions. RQ1 is formulated as "What is...", but H1 is "There is...". For RQ2, the question is "What is", but the hypothesis is not concrete enough: what is the "specific set of topological characteristics" that you hypothesise? It seems both questions need to be rewritten as "is there" questions. Another thing that bothers me about writing these as hypotheses is that, if you formulate them as statistically testable, you need an actual statistical test, which you only have for RQ2.
It is unclear what the assumption "PT route planning queries will be normally evaluated within the span of one day" means. After reading other parts of the paper, it seems to mean that queries are made for travel on the same day.
I'm confused about the relationship between the paragraph on "Smallest fragment possible", where you talk about the "number of connections allowed per document" and state that "with this lower bound we were able to fragment the rest of the collection in fragments containing similar number of connections and hence a similar size", and the "fragmentation sets", where you talk about "connections per fragment". If you use that lower bound as a guide for the size of the fragments, how does it make sense to then vary the fragments in fixed sizes? It seems to me that the last sentence of "Smallest fragment possible" may be poorly written.
In Table 3, the term "query length" appears without having been used before. What does it mean?
Overall, this section has a lower writing quality than the others, and would benefit from additional proof-reading.
Discussion and conclusion:
MINOR: neglible -> negligible
Can you elaborate on why Spain-Renfe has the optimal point?
With respect to historical data, on p6 you mention that the main issue is that this data is not currently being published. Let's assume that a publisher is willing to publish it (instead of hiding it for business reasons): what is the advantage of using your approach over a data dump? You mention Machine Learning algorithms as beneficiaries; wouldn't those require a full dump? A statistical analysis would need the same. If I got it right, your experimentation on historical data uses the same queries as the Live setting, but are queries for historical use cases the same as for Live use cases? It is not clear to me, and I would say they aren't.
You mention that the optimal fragmentation size is "related to the average scanned connections of the query set". There is no mention of E(SCQ) in section 6.1 or Figure 6, just some references to Table 2 for values of K and D (which are quite burdensome for a reader to go and check). What is the support for this statement? I think you need some visualisation of this in section 6.1.
You put a lot of stress on the "cost-efficiency" of your approach for publishers (presumably small agencies on a budget). You even mention on p27 that "...more expensive servers will be needed with OpenTripPlanner than with LC Server", but I'm missing at least an estimate of how much more (in money), based perhaps on current average cloud or web server costs.
Following up on my remark on statistical tests in section 5, I don't think you can write "accept hypothesis" for RQ1 and RQ3 in the conclusion.