Review Comment:
Written by key project members, the paper provides a detailed report of the evolution of DBpedia, with a focus on new features and services, and updates since the last two reports (published 2008 & 9). The report concludes with examples of use and potential avenues for further application. It also presents a comparison with other data extraction projects - highlighting similarity in aim, if not always in features provided or supported, and showing where they and DBpedia complement each other. The roles played and the contributions made by the wider DBpedia community are also described.
The paper is written more like a white paper than the typical system report. It is therefore also much longer - some of the descriptions could be condensed a bit. Importantly, it does not report explicit evaluation; however, utility may be inferred from usage, both by third parties and by project members.
As the authors themselves note, the relevance of the DBpedia community project to the Semantic Web, NLP and IR communities, among others, is evident in reuse, and beyond the references in this report. The paper should serve as a useful reference for continued use and further (community) development of the resource.
One thing makes the report a bit difficult to follow - several references are made to points in DBpedia history - either using (numerical) publication references or version numbers. While the authors of these papers and the creators of DBpedia can easily map these to points in time, the average reader cannot - so ends up wasting time looking these up. It would be useful to provide a lookup early in the paper mapping version numbers to release dates, and also, where references ask the reader to judge based on time, explicitly provide these (as year or qtr/year as appropriate).
Similarly, the expression "last/past years" is used several times wrt the evolution of DBpedia - this is simply too vague - at worst it should specify last few or several - the reader should not need to keep checking to see what time span this probably covers.
DETAILED REVIEW
S1 - intro - "This system report is a comprehensive update and extension of previous project descriptions in [1] and [5]. The main novelties compared to these articles are: …" - "novelties" as used here is incorrect. I'd suggest "advances" or, simply, "new work".
What exactly is "access" here? - "…in Section 6, we provide statistics on the access of DBpedia"
S2.1 - why is N-Triples particularly given as the only example of output format?
A lookup table for the language codes would be useful - not all are easily guessed at, e.g., eu, tr, ar, hr; even Greek is not easily guessed as el - otherwise the reader needs to do extra work to confirm these.
S2.4 - "Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page." - that the user clicks on a button is irrelevant - what matters is that functionality is available for triggering the validation.
S2.5 - "Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]." - confusing - what does "less and redundant" mean? From the context I suspect "and" should be deleted?
S2.7 - "One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework." - what is the (value of the) order of magnitude? - otherwise cannot gauge the value of extra work done. Also, why Scala? not saying it's not a good choice, just that drawing attention to it begs the question.
"They are important for specific extractors as well, for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. " - how is SKOS relevant here? - it is without doubt not the category hierarchy data set.
S3.1 - "It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage." - actually quite difficult to locate this - the bar in question should be highlighted with some annotation.
S3.2 - What were the criteria for selecting the 20 language versions? Ditto - the 10 in Table 3?
S6.1 "To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed." - redundant - both halves of the sentence say the same thing.
S6.2 - reads more as an advert for Virtuoso than a description of how it is used as a store for DBpedia.
It would be more useful to say what the status code 509 represents than to give a link to the definition of status codes in wikipedia. Also, if anything at all the pointer should be to the formal definitions at w3.org, and not wikipedia.
S6.4 "Figure 13 shows the percentages ... As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013." - actually, no, cannot see this - the authors may know the mapping from DBpedia version to release year, but the average reader will not. There are several other instances where such oblique references are made.
S7 - a lot of the examples of use are self-citations, I would suggest that reuse by the DBpedia team be noted - and maybe presented separately? - to demonstrate reusability beyond its creators/maintainers - this is what really strengthens the case for the value of DBpedia.
7.4 - is confusing - is this a positive example of use or one highlighting that applications are not always well implemented?
8.2 "Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49." - who are the authors being referred to - the creators of the thesaurus in the previous sentence? In which case - who are they? - both references are to a URL, not a publication.
Citing [2] as related work - with the precursor of DBpedia is a bit unusual - it's not really related work, but more a previous incarnation of the system (or part of it).
The paper concludes by saying "Despite recent advances …, there is still a huge potential for employing Linked Data background knowledge in various NLP …" - this is contradictory - I suspect what is meant is something like "Recent advances show huge potential for ..."
The Sindice query in the appendix is so short as to be better placed within the paper.
FIGURES & TABLES
Fig. 1 is placed two pages after it's referenced. Yet there's more than enough room to place it on the same page. Further, this would make it easier to interpret the corresponding text.
Fig. 2 - text in greek mapping box a bit confusing - why are some properties in greek and some in english?
Fig. 4 - x-axis should specify release date - especially as the curve will look a bit different if they're not evenly spaced out.
Fig. 5 - would suggest (faint) background lines from y-axis - difficult to map values to bars.
Fig 6 - release versions need to be shown with the dates
b - the colours for en & pt, and el & fr (tr is only slightly darker) cannot be distinguished, even in colour, let alone greyscale, which is the default for printing to read.
Figures 11 and 12 are related to each other (also cross-referenced together in the text) and should be placed next to each other, not on different pages. Otherwise difficult to compare them. Also, while I acknowledge the caption states the charts are for SPARQL access, it would be useful to state this in the table headers.
Convention places table captions ABOVE tables.
Table 1 - would be useful to indicate which of the four extraction types (in 2.2) each extractor is classified as.
Table 7 is referenced before 6 - should therefore come before it.
Table 15 - line count for XML config files is not very meaningful without a description of content
CITATIONS & REFERENCES
The intro forward references the DBpedia ontology - it would be useful to provide an exact cross-reference.
S7. In a lot of the "external" examples a web link is given to the project or organisation using it. Where a citable publication exists this should be used instead, or in addition to the URL, e.g., the BBC tag disambiguation example should reference a publication such as [1]; Watson also has a few articles that cite DBpedia (among others) as a data/reference source.
Verify that capitalisation is correct in all references, e.g., PowerAqua, not Poweraqua, in 28; GraphD, not graphd, in 32; RDF, not rdf, in 41; DBpedia, not dbpedia, in 45.
LANGUAGE & PRESENTATION
The Latin abbreviation cf. is used incorrectly in most cases - it means "compare (with)" - mostly used in place of "refer to" or "see".
There is some overuse of commas, making reading a bit difficult - a comma should be placed only where there's a natural pause in reading. E.g., 8.1.3 "One of the projects, which pursues similar goals to DBpedia is YAGO44 [39]. " - commas should be used only if it was written "One of the projects, YAGO, pursues similar goals to DBpedia [39]. "
Mostly minor grammatical errors easily caught with an auto check and proof-read. I'd recommend the latter by a single author - the paper reads quite uniformly, for one with such a long author list. There are however a few areas with differences in writing style and correctness in language use.
S7.1.2 - "Due to its large schema and data size as well as its topic adversity, " - should this be "topic DIversity"?
weird formatting, bottom of p.3 & 5
[1] Georgi Kobilarov, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael Smethurst, Christian Bizer, Robert Lee: Media Meets Semantic Web - How the BBC Uses DBpedia and Linked Data to Make Connections. ESWC 2009:723-737