Review Comment:
This paper discusses Linked Data representing various forms of accommodation in Amsterdam, Tuscany and Spain, containing 19,973 instances. The dataset has been sourced from booking.com and Google+ Local using scrapers. The paper gives an overview of related (Linked) datasets, the existing vocabularies used, the mechanisms by which the raw data are extracted and integrated, and what the final RDF looks like. Links are provided as entry points to the dataset. Licensing and maintenance/updates to the dataset are briefly discussed. Finally, the authors outline potential use-cases for the dataset; primarily, as its name suggests, the OpeNER dataset was originally intended to provide Named Entity Recognition for entities referring to accommodation in the given locales.
Based on the review criteria for Linked Dataset Descriptions [1], I think that the description of the dataset is mostly adequate and the quality of the dataset is adequate (aside from a lack of links). My main concern is about the legality of publishing screen-scraped data as Linked Data: this completely undermines the usefulness of the dataset.
(3) Clarity and completeness of the descriptions.
The paper does a reasonable job of describing the dataset. There are a few minor issues here and there with formatting and with English (an incomplete list of minor comments is given below), but otherwise the description is quite concise and clear.
(1) Quality of the dataset.
The paper describes the dataset and provides links to the dataset. The example URI for a hotel is incorrect, and should presumably be:
http://wafi.iit.cnr.it/opener/resource/acco-1
This successfully returns a D2R-style HTML rendering of the RDF data, or Turtle if requested. From a quick look, the data seem fine. I dislike that the referenced Hontology is returned in OWL XML presentation syntax, since most applications will not have a parser for this (we are talking about consumption by RDF tools, so an RDF representation of the ontology would make more sense than an OWL syntax supported by a handful of OWL tools). Lower-case property names are also more conventional in Linked Data. I also think it should be clarified in the paper that Hontology was created by the authors.
http://wafi.iit.cnr.it/angelica/Hontology.owl#
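For reference, the "Turtle if requested" check above amounts to ordinary HTTP content negotiation. The following is only a minimal sketch of such a request, using the hotel URI given above; the helper name build_turtle_request is made up for illustration, and the assumption that the server keys on an "Accept: text/turtle" header is mine, not something stated in the paper. The request is built but deliberately not sent.

```python
from urllib.request import Request

# Hotel URI as given in this review.
RESOURCE_URI = "http://wafi.iit.cnr.it/opener/resource/acco-1"


def build_turtle_request(uri: str) -> Request:
    """Build (but do not send) a GET request that asks for Turtle.

    Assumption: the server selects the serialisation from the Accept header.
    """
    return Request(uri, headers={"Accept": "text/turtle"})


request = build_turtle_request(RESOURCE_URI)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))

# To actually fetch the data (network access required):
#   from urllib.request import urlopen
#   body = urlopen(request).read().decode("utf-8")
```

Sending the same request with a different Accept value (for example an RDF/XML media type) would be a simple way to check which serialisations the ontology URL offers.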
I am also quite concerned that the level of interlinkage is quite low: about 543 links to DBpedia for a dataset of 19,973 instances. (Also, the SPARQL endpoint and the dataset itself seem a bit unstable: I encountered various temporary errors while browsing through.)
Personally, I would also recommend against VCard wherever possible. The vocabulary is not tailored for RDF; it over-uses literals and is not particularly intuitive. Use alternatives wherever possible. In particular, for latitude and longitude, wgs84 is much more commonly used:
http://www.w3.org/2003/01/geo/wgs84_pos#
Also, don't use gr:name. There are about twenty name/title/label properties in Linked Data and only three of them are needed. Please just use rdfs:label (and skos:prefLabel or skos:altLabel if there are aliases).
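To make the two vocabulary suggestions above concrete, here is a rough sketch of what a description using rdfs:label and the wgs84 geo:lat/geo:long properties could look like, emitted as Turtle from plain Python. The function name, the label "Example Hotel" and the coordinates are hypothetical values chosen for illustration; they are not taken from the dataset.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def accommodation_turtle(uri: str, label: str, lat: float, lon: float) -> str:
    """Serialise one accommodation as Turtle using rdfs:label and wgs84."""
    # Escape backslashes and double quotes so the label is a valid Turtle string.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"@prefix geo: <{GEO}> .",
        f"@prefix rdfs: <{RDFS}> .",
        "",
        f"<{uri}>",
        f'    rdfs:label "{escaped}" ;',
        f"    geo:lat {lat} ;",
        f"    geo:long {lon} .",
    ]
    return "\n".join(lines)


# Hypothetical label and coordinates, for illustration only.
turtle = accommodation_turtle(
    "http://wafi.iit.cnr.it/opener/resource/acco-1",
    "Example Hotel",
    52.37,
    4.89,
)
print(turtle)
```

Aliases, where they exist, could be added in the same way with skos:prefLabel or skos:altLabel.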
The lack of links is perhaps the biggest issue in terms of evaluating OpeNER as a Linked Dataset. The other issues are admittedly minor and the data seem to be formatted quite well as RDF.
(2) Usefulness (or potential usefulness) of the dataset.
My main concerns lie in this point. Aside from the fact that the coverage of the data is somewhat localised (and thus applications are limited to those locales), and that the highlighted NER application is very specific, licensing is a major issue since the dataset is screen-scraped from two commercial sites that expressly forbid such extraction. The authors acknowledge such issues in Section 4.5 (discussion should probably be earlier as it's an obvious concern), but they leave the situation ambiguous whereas the situation seems rather clear-cut. For example, for booking.com, aside from the copyright notice on all pages, here's the relevant quote from the T&C's [2]:
"""
Our services are made available for personal and non-commercial use only. Therefore, you are not allowed to re-sell, deep-link, use, copy, monitor (e.g. spider, scrape), display, download or reproduce any content or information, software, products or services available on our website for any commercial or competitive activity or purpose.
"""
And these are the relevant Google T&C's [3]:
"""
2. Restrictions on use. Unless you have received prior written authorisation from Google (or, as applicable, from the provider of particular Content), you must not: (a) copy, translate, modify or make derivative works of the Content or any part thereof; (b) redistribute, sub-license, rent, publish, sell, assign, lease, market, transfer or otherwise make the Products or Content available to third parties; (c) reverse engineer, decompile or otherwise attempt to extract the source code of the Service or any part thereof, unless this is expressly permitted or required by applicable law; (d) use the Products in a manner that gives you or any other person access to mass downloads or bulk feeds of any Content, including but not limited to numerical latitude or longitude coordinates, imagery and visible map data; (e) delete, obscure or in any manner alter any warning or link that appears in the Products or the Content; (f) use the Service or Content with any products, systems or applications for or in connection with (i) real-time navigation or route guidance, including but not limited to turn-by-turn route guidance that is synchronised to the position of a user's sensor-enabled device; (ii) any systems or functions for automatic or autonomous control of vehicle behaviour; (g) use the Products to create a database of places or other local listings information.
"""
When the authors used the service to screen-scrape the site, they obviously broke the T&C's. They simply should not have the data. I thus don't see how the authors can "get a specific licence in order to expose all the information, including accomodations description and other sensitive data." It would probably be okay to use the data for personal use or for offline research purposes, but replicating the data online is obviously a different matter. Hence, since the legality and/or availability of the dataset are fundamentally compromised, I think the usefulness of the dataset is completely undermined.
MINOR COMMENTS:
* Throughout: use "," or a thin space as a thousand separator, not "."
* "Booking.com is [an] online booking"
* Fix line spacing in right hand column of first page.
* "100 textsearch" -> "100 text searches"
* "Note that not all the categories are defined in both the ontologies" Rephrase
* "using both [of] the"
* "Turtle code[ ]is"
* "rectangles [are] literal values"
* Fix formatting of links in Section 4.4.
* "to search for <> accommodation providing"
* Fix formatting of references.
[1] http://www.semantic-web-journal.net/reviewers
[2] http://www.booking.com/general.en-gb.html?dcid=1&sid=6ad949028f7510383b8...
[3] http://www.google.com/intl/en_uk/help/terms_maps.html
Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-call-2nd-s...