Review Comment:
Summary
-------
This article studies which parts of the Wikidata and Bio2RDF knowledge graphs (KGs) are queried by the queries in their respective query logs. The main questions the research pursues are how the usage of a KG can be quantified and whether frequency distribution and ranking analyses reveal usage patterns across different (robotic vs. organic) query types. I agree that both Wikidata and Bio2RDF are highly interesting for such a study, since Wikidata covers a very broad range of information, whereas Bio2RDF is more specialized. Furthermore, the authors seem to have many queries from each of these data sets available.
Assessment
----------
I think that this is a useful study that deserves to be published, but it still needs some work. In particular, the paper needs to clarify or define several of the objects it studies (such as schema elements or schema type patterns), since it is difficult to interpret the paper's results otherwise. Furthermore, certain analytical tests need to be better motivated (see detailed comments). I believe that interleaving the description of the tests with the test results would motivate more clearly why certain analytical tests are carried out later. As a bonus, such a presentation may render the discussion of the analytical results less dry.
The provided artifacts are well organized, hosted on GitHub, and appear to be complete.
Overall comments and questions to the authors
---------------------------------------------
- A premise of the paper is that other work often overlooked the actual entities that are queried in KGs. I agree that research has not emphasized this aspect much yet, so the present study is well justified, but I would like to point out that papers [A,B], which the submission does not mention, treated this aspect. The study [A] is fairly broad, covers this aspect in Section 4, and provides an interactive diagram at https://podmr.github.io/darql/property-sunburst/. The study [B], which is more recent, lists the Wikidata IRIs that are used for transitive navigation.
- I found the phrasing "Separated organic and robotic queries to explore [...]" on p2 misleading, since it makes the reader believe that this separation is a contribution of the present paper. The separation for Wikidata was done by the authors of [C,D] and, as far as I can see, the present paper simply applies the method of [D] to Bio2RDF. The authors could consider removing this item from the list of main contributions and instead noting below it that all studies are carried out separately on robotic and organic query logs, where the Wikidata logs already come with this division and the same algorithm was applied to Bio2RDF to obtain it for those logs as well.
- Per data set, how many queries did this study use? I find such information very relevant for getting an idea of coverage. This discussion would fit early in Section 3, under "SPARQL Query Logs".
- This may be personal taste, but I would consider integrating Sections 3.5-3.6 into Section 4. The reasons are that (1) while reading Sections 3.5-3.6 I was often wondering *why* a certain test is useful and (2) Section 4 (especially p14) is very dry. So I wonder if seeing the results of some of the studies would help me understand why certain other metrics make sense to investigate. Also, I feel that this interleaving would lead to a paper that is better at telling a story, which is what I prefer.
Detailed comments and questions to the authors
----------------------------------------------
- p2:5 I don't understand what you mean by "type, predicate, and individual." I understand "predicate", as in (s,p,o) triples, but I don't understand what you mean by "type" or "individual".
- p2:26 "[...] schema coverage computation by refining the query log triple pattern extraction [...]" I don't understand what you mean by this. Triple pattern extraction from queries seems trivial to me. Why is it so important that it merits a place in your list of main contributions? Where is this refined triple pattern extraction described?
- p3:19 "...over 200 million SPARQL queries..." By now, this is closer to 500 million SPARQL queries.
- p3:38 by ensuring that
- p4:10 Since LSQ has query logs from many more data sets, I would explain here why Bio2RDF was chosen. I suspect that the reasons are that (1) it gives you a data set with a relatively specific topic, which contrasts with Wikidata, and (2) you have many Bio2RDF queries available.
- p4:15 Can you provide a justification why you only considered intervals 1 and 7 for the Dresden query logs?
- p4:Listing 1 The information provided in this listing is not very crucial, is it?
- p5:1-10 Could you please define "schema element" and what you are actually extracting here?
- p5:1-10 Could you provide examples of extracted info on Wikidata for Query 1 and Query 2? (Incidentally, why write Query_1 and Query_2 instead of Query 1 and Query 2, as you do in the caption of Listing 2? I would keep the readers' mental model and the naming scheme in the implementation separated.)
- p5:33 Could you specify whether you do whitespace normalization? It's perhaps not crucial, but gives readers a better idea what you consider to be duplicates. "We extract unique queries based on their text [...]" is vague.
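To illustrate what I mean, here is a minimal sketch of the kind of text normalization I would like to see spelled out (this is my guess at the intended notion of "duplicate", not the paper's method):

    # Hypothetical: deduplicate queries by text after collapsing all
    # runs of whitespace; the two queries below then count as one.
    queries = [
        "SELECT ?s WHERE { ?s ?p ?o }",
        "SELECT ?s  WHERE {\n  ?s ?p ?o\n}",
    ]
    unique = {" ".join(q.split()) for q in queries}
    print(len(unique))  # 1 with normalization, 2 without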
- p6:Fig 2 The variable standardization you describe here is the same as in the Wikidata logs. This should be mentioned. Also, I'm not sure if this requires such an elaborate figure.
- p6:Section 3.5 To understand what's happening here, I need an explanation of what is a "schema element". Please be precise.
- p7:1 Please explain what the Shapiro-Wilk test measures. I hope that it is possible to provide an explanation for those readers who don't have a background in statistics and who don't know how to interpret "the normality of the differences in the frequencies of common schema elements". For instance: which important properties of this test led you to choose it?
- p7:9 I don't understand your explanation of x_i. The provided explanation
x_i = diff = Normalized_schema... - Normalized_schema...
seems to use variables that you use internally in your code but whose semantics I don't know. Also, what is "diff" supposed to tell me, apart from the fact that you are applying a minus operator, which I can already infer from the expression?
Can you make the description of the a_i (they are "constants based on the sample size") more concrete?
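For reference, and assuming that the paper intends the standard Shapiro-Wilk statistic (which I cannot verify from the text), the usual definition is

    W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

where the x_{(i)} are the sample values in sorted order, \bar{x} is the sample mean, and the a_i are tabulated coefficients derived from the expected order statistics of a standard normal sample. Stating explicitly that x_i is the i-th (normalized) frequency difference, and pointing to a definition of the a_i, would resolve my confusion.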
- p7:12 "[...] differences are not normally distributed." This confuses me. Doesn't a normal distribution assume an ordering on the domain? (Height of humans has an ordering on "height", and we measure the number of people that have this height.) What's the ordering here?
Since I don't understand the normal distribution point, I would also appreciate more explanation as to why we need the Wilcoxon signed-rank test here.
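To make my expectation concrete, here is how I assume the two tests are chained (the scipy calls and all frequency values are my own invention, not taken from the paper):

    from scipy.stats import shapiro, wilcoxon

    # Invented example: normalized frequencies of the same six schema
    # elements in two log intervals, i.e., paired observations.
    freq_interval_1 = [0.31, 0.12, 0.08, 0.25, 0.05, 0.19]
    freq_interval_7 = [0.28, 0.15, 0.07, 0.30, 0.04, 0.16]
    diffs = [a - b for a, b in zip(freq_interval_1, freq_interval_7)]

    # Shapiro-Wilk: are the paired differences plausibly normal?
    w_stat, p_normal = shapiro(diffs)

    # If not, fall back to the non-parametric Wilcoxon signed-rank
    # test, which only uses ranks of the paired differences.
    stat, p_value = wilcoxon(freq_interval_1, freq_interval_7)
    print(w_stat, p_normal, stat, p_value)

An explicit passage along these lines, using the paper's actual variables, would answer both this question and my question about p7:1.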
- p7:25 What is the "rank of a schema element" ?
- p7:25 "[...] across these two time points [...]" It is unclear to me which time points are intended. We are talking about two intervals here, so we have four time points.
- p7:26 Please don't start a sentence with a mathematical symbol
- p7:27 n is the number -> $n$ is the number
- p7:27 Any other intuition you can provide me about Spearman's rank correlation (and why we do it) is welcome.
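As a concrete illustration of the kind of intuition I have in mind (the ranks below are invented): the coefficient only compares the orderings of the schema elements in the two intervals, so a value near +1 means the popularity ranking is stable over time, near 0 that it is unrelated, and near -1 that it is reversed.

    from scipy.stats import spearmanr

    # Invented example: ranks of five schema elements in two intervals.
    # Without ties, rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
    ranks_interval_1 = [1, 2, 3, 4, 5]
    ranks_interval_7 = [2, 1, 3, 5, 4]
    rho, p_value = spearmanr(ranks_interval_1, ranks_interval_7)
    print(rho, p_value)  # rho = 0.8: the ranking is largely stable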
- p7:33-45 Your envisioned way of dealing with frequent schema elements bears some resemblance to what's done in [A], although [A] did not consider time intervals.
- p8:8 What are "schema type patterns"? Please define.
- p8:9 log_2013 and log_2019 are mentioned, but it's not clear at all to me what these refer to. I would like to ask the authors to give the (slices of) query logs meaningful names for readers (not necessarily the internal names used in the code).
- p8:26-39 I found the description of the data sets confusing. For one thing, "It contains a total of [...] queries, of which 3,212 are server logs, and do not contain SPARQL queries." is a very confusing sentence, which seems to contradict itself. Further, "Wikidata organic log2017" on p8:35 (again, not nicely formatted) has different characteristics from the log with the same name on p8:36.
- p8:26-39 This paper focuses on the study of time intervals in query logs. It is quite unclear to me why the partitioning into time intervals is done the way it is. To me as a reader, the partitioning seems fairly arbitrary. Why not partition the query logs into time intervals of equal length?
- p9:Table 1 Please improve the naming convention. Judging by the names, "Wikidata All organic log2017" seems to subsume "Wikidata organic log2017". Does it? I find the overview in this table rather unclear.
- p10:Table 3 Same remark as for Table 1.
- p10-18: Throughout this entire section, I would have appreciated more discussion about what the results of these tests teach us in less technical terms.
References
----------
[A] Angela Bonifati, Wim Martens, Thomas Timm: Navigating the Maze of Wikidata Query Logs. WWW 2019: 127-138
[B] Janik Hammerer, Wim Martens: A Compendium of Regular Expression Shapes in SPARQL Queries. GRADES/NDA 2025: 4:1-4:10
[C] Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt: Getting the Most Out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph. ISWC (2) 2018: 376-394
[D] Adrian Bielefeldt, Julius Gonsior, Markus Krötzsch: Practical Linked Data Access via SPARQL: The Case of Wikidata. LDOW@WWW 2018