Matching and Visualizing Thematic Linked Data: An Approach Based on Geographic Reference Data

Tracking #: 935-2146

Authors: 
Abdelfettah Feliachi
Nathalie Abadie
Fayçal Hamdi

Responsible editor: 
Guest Editors Ontology and Linked Data Matching

Submission type: 
Full Paper
Abstract: 
Many resources published on the Web of Data are described by either direct or indirect spatial references. These spatial references can be used beneficially for data matching or cartographic visualization purposes. Indeed, they may be used as instance matching criteria: two resources that are very close in space may represent the same thing, or at least they may have some semantic relationship. However, heterogeneities between spatial references may make their use as instance matching criteria unreliable or even impossible. In this article, we propose to reduce the data matching difficulties caused by the heterogeneity of spatial references by means of a background reference geodataset. We also propose to take advantage of the links created between thematic resources and geographic resources to design better maps for data visualization at different scales.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 08/Feb/2015
Suggestion:
Major Revision
Review Comment:

The article proposes an approach that makes use of a geographic reference dataset for matching (data linking) and visualizing thematic data described by heterogeneous spatial references.

The article first describes related work in instance matching and thematic mapping, then describes the proposed approach and defines it in set-theoretic notation. The approach is illustrated with two open thematic datasets of historical monuments in the city of Paris (France): the first, from DBpedia, has direct spatial references in the form of coordinates (latitude, longitude), while the second, Merimee, has indirect spatial references in the form of addresses (literal values). The geographic reference data consist of a set of polygons representing individual buildings in Paris (BD_PARCELLAIRE) and a set of structured addresses, each georeferenced to a geometric point (BD_ADRESSE). All the datasets were converted into RDF and stored in local triple stores.
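For concreteness, the two kinds of spatial references could be read from the RDF along these lines (a minimal sketch using Python's rdflib; the Merimee address property is a placeholder, since the paper's exact vocabulary is not reproduced here):

    # Sketch: extracting direct vs. indirect spatial references.
    # Assumes rdflib; ex:address is a placeholder property.
    from rdflib import Graph

    g = Graph()
    g.parse("monuments.ttl", format="turtle")  # local RDF dump

    # Direct spatial references: coordinate literals (DBpedia-style).
    direct = g.query("""
        SELECT ?m ?lat ?long WHERE {
            ?m <http://www.w3.org/2003/01/geo/wgs84_pos#lat>  ?lat ;
               <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
        }""")

    # Indirect spatial references: address strings (Merimee-style),
    # which still need geocoding against a dataset such as BD_ADRESSE.
    indirect = g.query("""
        SELECT ?m ?addr WHERE { ?m <http://example.org/address> ?addr . }""")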

The article presents an interesting case study of georeferencing and geovisualization in the Web of Data. In my opinion, the strength of the contribution lies in the fact that this case study was done in the Web of Data (RDF), which brings a different set of challenges than a similar case study with a desktop GIS or online mapping tools; there are also unique (promising) opportunities for the future. These challenges and opportunities should be elaborated in the discussion. At the same time, it should be acknowledged that the georeferencing and geovisualization techniques in the case study are not novel, but draw on existing best practices in cartography, geoinformatics and geographic information science.

Detailed comments:
The title refers to ‘geographic reference data’, while the article in some places refers to ‘background reference geodataset’. Terminology should be used consistently, e.g. use ‘background geographic reference dataset’ only.

Geocoding is an important part of the approach presented in the article, but literature on this topic is not present in the Related Works section. This should be added. See for example, Goldberg et al. (2007) From text to geographic coordinates: the current state of geocoding, URISA Journal.
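To make the omission concrete: geocoding here means resolving an address literal to coordinates via a reference address dataset. A minimal sketch of the idea (the reference table, the normalization, and the coordinates are hypothetical stand-ins for a BD_ADRESSE lookup):

    # Sketch: naive geocoding against a reference address table.
    # The table and its coordinates are hypothetical.
    import difflib

    REFERENCE = {  # normalized address -> (lat, long)
        "293 avenue daumesnil 75012 paris": (48.8346, 2.4095),
        "62 rue de lille 75007 paris": (48.8600, 2.3266),
    }

    def normalize(addr):
        return " ".join(addr.lower().replace(",", " ").split())

    def geocode(addr, cutoff=0.8):
        """Exact lookup first, then a fuzzy match to absorb typos."""
        key = normalize(addr)
        if key in REFERENCE:
            return REFERENCE[key]
        close = difflib.get_close_matches(key, list(REFERENCE), n=1, cutoff=cutoff)
        return REFERENCE[close[0]] if close else None

    print(geocode("293, Avenue Daumesnil, 75012 Paris"))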

According to section 3, instance matching is usually based on measures that compare spatial references of the same type. This needs to be better qualified, as there are other ways of doing instance matching (some discussed in section 2), such as comparing property/attribute values or comparing instance descriptions, which may involve multiple attributes.

The proposed approach references DBpedia locations (coordinates) to buildings in the BD_PARCELLAIRE dataset (polygons) based on shortest distance. Is there a reason why a ‘within building’ or ‘within a buffer around the building’ predicate was not used? The same question arises for matching BD_ADRESSE to BD_PARCELLAIRE. Since these are two official reference datasets, one would expect their quality to be such that there is a known relationship between a building and an address, e.g. the address is within a building or at a specific distance/location from the building. Using the shortest distance for matching needs to be explained/justified.
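To illustrate the alternatives I have in mind (a sketch using shapely; the geometries are made up):

    # Sketch: three candidate predicates for referencing a point to a
    # building polygon. Geometries are made up for illustration.
    from shapely.geometry import Point, Polygon

    buildings = {
        "bldg_A": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
        "bldg_B": Polygon([(20, 0), (30, 0), (30, 10), (20, 10)]),
    }
    monument = Point(11, 5)  # a coordinate just outside bldg_A

    # 1. Shortest distance (what the paper uses): always yields a
    #    match, even when the point is far from every building.
    nearest = min(buildings, key=lambda b: monument.distance(buildings[b]))

    # 2. 'Within building': matches only if the point falls inside.
    within = [b for b, poly in buildings.items() if poly.contains(monument)]

    # 3. 'Within buffer': tolerates small positional errors while
    #    still rejecting distant points.
    buffered = [b for b, poly in buildings.items()
                if poly.buffer(2.0).contains(monument)]

    print(nearest, within, buffered)  # bldg_A [] ['bldg_A']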

Section 5 describes visualization of the results to illustrate the usability of the proposed approach. A large part of this section explains current knowledge, e.g. descriptions of grouping and amalgamation, and the algorithm for feature amalgamation. While this article presents an interesting new case of amalgamation in the Web of Data, amalgamation itself is not new. The focus should shift to the challenges and opportunities of doing amalgamation in the Web of Data, rather than an explanation of amalgamation itself. Also, can anything be said about the performance? Was the visualization produced in an acceptable time period?
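For readers outside cartography, the amalgamation in question amounts to grouping nearby features and replacing each group by a single outline, e.g. a convex hull built from the clustered points. A minimal sketch of that idea (the clustering is deliberately naive; the threshold and coordinates are made up):

    # Sketch: point amalgamation for small-scale display - group
    # nearby points, then replace each group by its convex hull.
    from shapely.geometry import MultiPoint, Point

    points = [Point(0, 0), Point(1, 0), Point(0, 1),
              Point(50, 50), Point(51, 50)]
    THRESHOLD = 5.0

    clusters = []  # naive single-link clustering, fine for a sketch
    for p in points:
        for cluster in clusters:
            if any(p.distance(q) <= THRESHOLD for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])

    # One amalgamated shape per cluster (degenerates to a line or a
    # point for clusters with fewer than three members).
    hulls = [MultiPoint([(p.x, p.y) for p in c]).convex_hull
             for c in clusters]
    print([h.geom_type for h in hulls])  # ['Polygon', 'LineString']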

A major shortcoming of the article is that there is no discussion of the results. For example, would this approach be generally applicable to all kinds of datasets and all kinds of spatial references? The approach is based on the assumption that points (very specific) can be generalized to polygons (larger area). This works for locations and buildings, but what about other datasets? Or if a polygon dataset is not available?

The results of the proposed approach depend on the quality of the datasets that are used. For example, if more of the addresses in the Merimee dataset had been incomplete or invalid, the matching would have been poorer. The same applies to the coordinates in the DBpedia dataset. This needs to be acknowledged when the results are discussed. It should be recommended that future work test the approach against larger datasets of varying quality, to evaluate the time performance as well as the quality of the results.

The approach is tailored to specific datasets. In other countries/regions, for example, datasets of building polygons exist that carry the building address as an attribute. In such cases the approach could be simplified:
Merimee → BD_PARCELLAIRE
DBPedia → BD_PARCELLAIRE
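In pseudo-terms, the whole pipeline then collapses to an attribute join plus a point-in-polygon test, along these lines (a sketch; field names and data are hypothetical):

    # Sketch: simplified matching when building polygons carry an
    # address attribute. Field names and data are hypothetical.
    from shapely.geometry import Point, Polygon

    buildings = [  # reference polygons with address attributes
        {"id": "b1", "address": "293 avenue daumesnil",
         "geom": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])},
    ]

    def match_merimee(addr):
        """Merimee -> buildings: a direct join on the address."""
        return [b["id"] for b in buildings if b["address"] == addr.lower()]

    def match_dbpedia(pt):
        """DBpedia -> buildings: a plain point-in-polygon test."""
        return [b["id"] for b in buildings if b["geom"].contains(pt)]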

Such assumptions and limitations of the approach (pointed out above) should be thoroughly discussed in a separate discussion section. The discussion of the results should also emphasize the contribution of the article to data matching and geovisualization in the Web of Data specifically.

Language
The language in the article is generally acceptable; however, I found it difficult to follow the flow in some parts of the article. For example, the references to different matching tasks and sub-tasks in section 4.4 are difficult to follow. Consider adding an overview of the tasks or assigning unique names to them to make the text easier to follow (e.g. by adding matching task numbers or unique names to Figure 3). This should also be applied to other parts of the article to improve its readability.

There are some typos and grammatical errors that need to be corrected.

Review #2
By Werner Kuhn submitted on 29/Mar/2015
Suggestion:
Major Revision
Review Comment:

This is a quite well-written paper on an important (though not novel) idea, namely to use spatial references for entity matching. This idea is applied to the domain of linked data and implemented for two data sets about historic monuments in Paris. I believe the paper should eventually be published, but it currently suffers from some significant weaknesses that need to be dealt with.

The main weakness is the artificial setup of the specific problem to be solved. If I wanted to match monument data from the (specialized) Merimee database with French DBpedia data, I would not limit the monuments in DBpedia to those classified as Monument_historique_de_Paris and as Monument_parisien. Given the arbitrariness of DBpedia labels, this is likely to result in a haphazard collection and exclusion of monuments. In fact, the example in Figure 7 (the Immigration Museum in Merimee and its building, the Palais de la Porte Doree in DBpedia) shows this mismatch resulting from the way the problem is posed. The French Wikipedia has entries for both: the building http://fr.wikipedia.org/wiki/Palais_de_la_Porte_Dor%C3%A9e and the Immigration Museum http://fr.wikipedia.org/wiki/Mus%C3%A9e_de_l%E2%80%99histoire_de_l%E2%80.... Matching either of these with the Merimee entry does not require the complicated procedure proposed in the paper.

It would have been much more interesting to see an actual problem of data matching between Merimee and a data set that one would really want to match with it. In the absence of such a problem (or at least of a clear use case for the problem set up here and for the selection of DBpedia data), I am not sure we learn much from the paper. Along the same lines, the general problem statement in Figure 1 should be made specific, showing an actual match problem with real data. As a vexing detail, this figure lists coordinates with a nanometer (!) precision...

I also failed to understand what alternative matching procedure was used to evaluate the proposed method. Was it string-based name matching? Are we sure, based on the paper, that an adequately selected second data set would not fare *better* in name matching than this one does in the laborious spatial matching? Is the massive overhead of linking to two geographic reference data sets worth the gain? And why are there *two* such reference data sets? I understand the need for the geometry and address information, but do we know how well the two match up? What if they do not? Shouldn't the matching first be done for the reference data sets themselves?
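For instance, the label-based baseline the comparison presumably needs to beat can be as simple as the following (a sketch; the paper does not say which measure, if any, was used):

    # Sketch: a naive label-matching baseline using normalized edit
    # similarity. Shown only as the kind of baseline the evaluation
    # should be compared against; the paper does not specify one.
    from difflib import SequenceMatcher

    def label_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    pairs = [
        ("Palais de la Porte Doree", "Palais de la Porte Dorée"),
        ("Palais de la Porte Doree",
         "Musee de l'histoire de l'immigration"),
    ]
    for a, b in pairs:
        print(f"{label_similarity(a, b):.2f}", a, "/", b)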

The second and third weaknesses of the paper have more to do with what it presents than with the work done. What is the purpose of the formalization in section 3, apart from showing impressive sub- and super-scripted strings of symbols? It is fairly straightforward to understand (though not easy to read at first), but has it been used in any way for designing, implementing or testing the system? How exactly? If not, I suggest it be replaced by clear natural language statements. Section 5 on visualization should be reduced to the purpose of illustrating matches or mismatches. Presenting visualization as a goal of the work in its own right has nothing to do with linked data matching and overloads the paper (already in the title).

There are a number of English grammar and style problems throughout the paper. Many of them are misplaced plural forms and articles (too numerous to mention). Also, some formulations are bloated or unclear:
- "background reference data" should simply be "reference data"
- "used beneficially" is simply "used"
- "database instance" should be "database record" or "database element" - a database instance is something else!
- why always talk about "thematic" data? Why not just data? Geographic data are also thematic; in fact all data are thematic
- what does this sentence mean? "Each convex hull is created from the venues location points of a cluster."
- "geonames" is not a word: use place names or toponyms
- "spatial location coordinates" are simply "coordinates"
- a "GoogleMaps base map" is simply a Google map
- "completely distinct" is simply "distinct"
- "a spatial plugin that can be used to compute" is simply "a spatial plugin to compute"
- there is no such thing as "tends to prove": it proves or does not, or it provides (some) evidence etc

A common misunderstanding is that "data described by points instead of polylines or polygons" are "less accurate". They have less detail, but can be more or less or equally accurate (beginning of section 4).

The following sentence makes me worried about lots of nonsense being added to the linked data cloud: "To simplify the process, Anchor links, spatial relation links and links between thematic resources are all of type owl:sameAs." Is the Eiffel Tower now stated to be a point? Is the Immigration Museum now the sameAs its address?
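To spell out the concern: owl:sameAs asserts that two URIs denote the very same resource, so using it to link a monument to its geometry or its address is semantically wrong. A dedicated predicate avoids the problem, e.g. GeoSPARQL's geo:hasGeometry (a sketch; the monument URI is illustrative):

    # Sketch: attaching a geometry without owl:sameAs, using the
    # GeoSPARQL vocabulary instead. The monument URI is illustrative.
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
    @prefix geo: <http://www.opengis.net/ont/geosparql#> .
    @prefix ex:  <http://example.org/> .

    ex:PalaisDeLaPorteDoree geo:hasGeometry ex:PalaisDeLaPorteDoree_geom .
    ex:PalaisDeLaPorteDoree_geom
        geo:asWKT "POINT(2.4095 48.8346)"^^geo:wktLiteral .
    """, format="turtle")

    print(len(g))  # 2 triples, and no identity claims were made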

The way the evaluation is presented is not clear, neither methodologically (what exactly was the proposed method compared with?) nor in the presentation of the results (which should be shown as a table with clear explanations of the values). Also, the discussion at the end (one short paragraph) is too short.

The footnotes should be checked for specificity. For example, footnote 3 should point to the actual OS tool, not just OS, and explaining shapefiles as "a geospatial vector data format" in footnote 23 is not very informative. The formatting of the reference list at the end needs to be cleaned up.

Review #3
By Thorsten Reitz submitted on 06/May/2015
Suggestion:
Major Revision
Review Comment:

The paper addresses two interesting issues: matching Linked Open Data resources by using their spatial properties and discussing the visualization of such geographic linked data over multiple scales.

I like that the authors explicitly take into account applications and use cases, and the overview of related work in matching and cartographic visualisation is fairly complete. I did think of a paper I co-authored a few years back that might make a useful read in this context:

http://link.springer.com/chapter/10.1007/978-3-642-00318-9_9

However, there are several improvements I'd suggest to round off the contribution in the matching part:
- Explain reference data better: What makes reference data suitable? Homogeneity? Known quality? Other properties?
- Determine which properties of the reference data lead to a good result: Is it because the reference data were polygons, whereas the two matched datasets were points? Would the same approach work with other topologies, e.g. point - polygon - polyline? (See the sketch after this list.)
- For the data matching approach, I'd highly recommend improving the formatting of the logic statements and adding a figure to explain them. Link Fig. 1 more clearly to the steps in section 3.2, for example, and use a separate figure to explain the problem. It would be best to use a real-world dataset for that problem figure.
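On the topology question (see the second bullet above): distance-based referencing is largely geometry-type agnostic, so in principle the same step should work for polyline reference data too. A minimal sketch with shapely (the geometries are made up):

    # Sketch: the nearest-feature step works unchanged for point,
    # polygon, or polyline reference geometries. Data are made up.
    from shapely.geometry import LineString, Point, Polygon

    monument = Point(5, 5)
    candidates = {
        "polygon":  Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        "polyline": LineString([(0, 10), (10, 10)]),
        "point":    Point(6, 6),
    }
    for name, geom in candidates.items():
        print(name, round(monument.distance(geom), 2))
    # polygon 4.24 / polyline 5.0 / point 1.41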

On the cartographic visualization part, I see little contribution beyond the body of work that can be found on generalization methods and on the visualization of Linked Data. I would recommend determining which unique possibilities LOD provides for visualizing data - i.e. use links to determine what needs to be shown at which scale, think of how to visualize the links, think of realistic and thematic mapping...