Geospatial Dataset Curation through a Location-based Game

Tracking #: 452-1629

Authors: 
Irene Celino

Responsible editor: 
Jens Lehmann

Submission type: 
Dataset Description
Abstract: 
The Urbanopoly dataset contains the results of a data curation campaign on available geospatial open datasets like OpenStreetMap. The curation effort is conducted through a location-based Game with a Purpose inspired by the Monopoly board game. The paper describes the dataset: we illustrate the genesis and life-cycle of Urbanopoly data; we explain the modelling choices by introducing the provenance-based Human Computation ontology and by giving examples of the dataset content; we describe the dataset publication on the Web as Linked Data and the cross-links to the curated datasets; finally, we indicate the possible uses of the dataset as well as its envisioned re-use.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Marta Sabou submitted on 15/May/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes the Urbanopoly dataset. The dataset contains the results of a data cleaning activity of geospatial datasets by means of a game with a purpose. As such, this paper is quite different from the typical papers submitted to this call, and its suitability for this special issue might be questionable. At the same time, however, this paper brings novel insights, especially in terms of how to publish data obtained through human computation processes (as detailed in Sections 3.2-3.3). Given that the number of efforts in our community to use human computation methods (games, paid crowdsourcing) for data gathering is increasing, I expect that this paper will be of value to many working in this novel area. Therefore, I suggest accepting it.

Quality of the dataset: Good
The dataset is published using careful modelling choices and is available for both human and machine inspection. The human-readable interface provides a good insight into the data, although the ontology schema elements have not been published with Pubby. Therefore, clicking on these elements simply leads to a page describing the corresponding ontology (hc or uo) rather than retrieving all data items that are instances of these types. This issue should be addressed for the next version.

Usefulness of the dataset: Medium
I would expect that the major use of this dataset lies in the extension and correction of the open geospatial resources that provide the game input data (e.g., process 4 in Figure 1). However, this process is never properly explained, beyond stating in 4.1 that the obtained data is published as Linked Data on the Web. Is there more to this process? Can the authors clarify how exactly the open geospatial resources are "corrected" (as stated in Section 2)? The rest of the envisioned usage scenarios for this dataset are limited to the improvement of the game itself and potentially as comparison data for other similar research work; therefore the usefulness of the dataset is not particularly high.

Clarity and completeness of the descriptions: Good.
Overall, the paper is well structured and clearly written, addressing most of the aspects indicated in the call. Some aspects of the work that should be discussed in more detail or clarified are:
• What is the size of the dataset (in triples)?
• How many of the initial input venues have been verified through the game?
• How many players provided contributions to this dataset?
• Are there any specific versioning mechanisms employed in conjunction with the periodical update of the dataset? (mentioned at the end of 4.1)

Review #2
By Victor de Boer submitted on 18/May/2013
Suggestion:
Major Revision
Review Comment:

• Name, URL, versioning, licensing, availability
• Topic coverage, source for the data
• Purpose of the Linked Dataset, e.g. demonstrated by relevant queries or inferences over it
• Applications using the dataset and other metrics of use
• Creation, maintenance and update mechanisms as well as policies to ensure sustainability and stability
• Quality, quantity and purpose of links to other datasets
• Domain modeling and use of established vocabularies
• Examples and critical discussion of typical knowledge modeling patterns used
• Known shortcomings of the dataset

This paper describes the Urbanopoly dataset, which consists of the results of the GWAP Urbanopoly. The dataset models the result of the GWAP, including provenance. I think the paper is well written and understandable; the Urbanopoly game is in general well described and seems to be a good way of enriching geographical data. The data model is, in my view, relatively simple, but it fits the purpose quite well. The human computation model is a very nice schema, and especially the mapping to PROV is useful. Example queries show how the data can be queried to answer questions about the GWAP experiment outcomes.

- One question I have about the human computation ontology concerns its reusability. How reusable is this model? Can the authors provide use cases or examples of such reuse outside of the Urbanopoly game?

However, with respect to the specific call for this special issue, I have some reservations about the paper. The special issue is concerned with concise descriptions of useful and reusable datasets. The dataset as it is described is essentially the result of a crowdsourcing experiment (who added what information), rather than the triples that enrich the geographical data (venue-feature-value). I can see how the latter would be useful and reusable, but for the human computation dataset as it is currently available, I fail to see the re-use, and the authors do not provide convincing examples other than that it could be reused to evaluate different aggregation algorithms. The fact that the reusable triples (venue-feature-value) are now "hidden" as reified triples in hc:ConsolidatedInformation instances makes the dataset harder to reuse from a Linked Data perspective. If I want an application to reuse your information to plot it on a map, I now need to be aware of the details of the human computation model, rather than just asking for all triples with subject venueX, which I think would be more reusable.
In short, as it is presented now, I am not convinced that this paper describes a dataset that is useful outside of its context. The useful information (the geo-enrichments) is hidden in the data and is not presented as the main outcome in this paper. I think this can be solved in two ways: either the authors make more explicit how the human computation dataset as it stands qualifies as a useful and reusable dataset, or the authors describe the geographical data (venue-feature-value) in more detail (how many features/properties are used, where these properties come from, how many venues are enriched, what the average number of enrichments per venue is, etc.).
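The reuse barrier described above can be illustrated with a minimal, self-contained sketch. All names here (uo:venueX, uo:hasCuisine, uo:info1) are hypothetical placeholders, not the actual Urbanopoly vocabulary; the point is only the contrast between a client that must understand the reification pattern and one that can do a plain subject lookup.

```python
# Hypothetical example (placeholder names, not the real Urbanopoly vocabulary):
# triples are modelled as (subject, predicate, object) tuples.

# Reified form: the venue-feature-value statement is wrapped inside a
# hc:ConsolidatedInformation node, as the review describes.
reified = [
    ("uo:info1", "rdf:type", "hc:ConsolidatedInformation"),
    ("uo:info1", "rdf:subject", "uo:venueX"),
    ("uo:info1", "rdf:predicate", "uo:hasCuisine"),
    ("uo:info1", "rdf:object", "Italian"),
]

# Direct form: the same fact as one plain triple.
direct = [("uo:venueX", "uo:hasCuisine", "Italian")]

def values_via_reification(triples, venue):
    """Collect feature/value pairs for a venue hidden behind reification:
    the client must know rdf:subject / rdf:predicate / rdf:object."""
    infos = {s for s, p, o in triples if p == "rdf:subject" and o == venue}
    result = {}
    for info in infos:
        pred = next(o for s, p, o in triples if s == info and p == "rdf:predicate")
        val = next(o for s, p, o in triples if s == info and p == "rdf:object")
        result[pred] = val
    return result

def values_direct(triples, venue):
    """With direct triples, a plain subject lookup suffices."""
    return {p: o for s, p, o in triples if s == venue}

print(values_via_reification(reified, "uo:venueX"))  # {'uo:hasCuisine': 'Italian'}
print(values_direct(direct, "uo:venueX"))            # {'uo:hasCuisine': 'Italian'}
```

Both calls recover the same fact, but only the second would let a mapping application stay ignorant of the human computation model.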

- One issue that remains regardless of this choice is the lack of statistics about the dataset. The paper fails to provide clear statistics on a number of its defining features (# of venues, players, properties, relation instances, average number of features per venue, the distribution of information over venue or feature types, etc.). These would give the reader a much clearer view of the usefulness of the data.

- There seems to be a lack of links to other datasets, other than the reuse of the source data (OpenStreetMap and LinkedGeoData). Are all rdf:object values unmapped literals, or could these literals be used to link to other data sources (e.g., DBpedia for restaurant types)?

Some other issues:

p2. Specific minigames result in specific information. Please elaborate on how many (and which) minigames were used and which triples they produce.

p2. "The evaluation on the data curation results [6] of Urbanopoly is very good in terms of both precision/accuracy – around 92%" -> I understand that these results are reported in more detail in [6]; however, it would be good to also summarize here how they were obtained, to give a clear picture of the quality of the resulting dataset. Is this an evaluation of a sample? Who evaluated it? Is the quality evenly distributed over the properties, or is some data more reliable than other data?

Review #3
By Willem Robert van Hage submitted on 24/Jun/2013
Suggestion:
Major Revision
Review Comment:

This paper describes how a new ontology for describing the output of games with a purpose, Human Computation, is used in the Urbanopoly game. The paper shows how it is aligned with the PROV model and how the resulting building dataset can be used.
I think there is clear value in this paper as an exemplar case for people trying to accomplish something related in the field of provenance modeling or crowdsourcing.
The paper does not move beyond the level of a very well-written technical report, because it lacks reflection on, and comparison with, alternative approaches.
There are various ways in which the described modeling task could have been approached. If the authors added a discussion of the issues they faced while making the Human Computation model, the advantages and disadvantages of the various solutions, and the motivation for their choices given the constraints of their specific application (Urbanopoly), then this paper would become a good academic paper.
If you see this paper as participatory empirical research, then it is relevant to know what the properties are of the dataset yielded by the users playing the game, given the modeling choices that were made by the authors. If the paper were extended with descriptive statistics of the dataset, combined with conclusions on how the modeling choices influenced these, then the paper would become an interesting empirical research paper.
