Raising Semantics-Awareness in Geospatial Metadata Management

Tracking #: 1570-2782

Cristiano Fugazza
Monica Pepe
Alessandro Oggioni
Paolo Tagliolato
Paola Carrara

Responsible editor: Werner Kuhn

Submission type: Full Paper

Geospatial metadata are often encoded in formats that either are not aimed at efficient retrieval of resources or are plainly outdated. In particular, the quantum leap represented by the Linked Open Data (LOD) movement has not so far induced a consistent, interlinked baseline in the geospatial domain. Datasets, scientific literature related to them, and ultimately the researchers behind these products are only loosely connected; the corresponding metadata are intelligible only to humans and duplicated on different systems, seldom consistently. Our methodology for metadata management envisages i) editing via customizable web-based forms, ii) encoding of records in any XML metadata schema, iii) translation into RDF (involving semantic lift of metadata), and finally iv) back-translation into the original XML format with added semantics-aware features. Phase iii) relates metadata to RDF data structures that represent keywords, toponyms, researchers, institutes, and virtually any data structure in the LOD Cloud. Our framework fosters delegated metadata management, as the entities referred to in metadata are independent, decentralized data structures with their own life cycle. Our approach, demonstrated in the context of INSPIRE metadata (the ISO 19115/19119 profile eliciting integration of European geospatial resources), is also applicable to a broad range of metadata standards, including non-geospatial ones.

Solicited Reviews:
Review #1
By Sven Schade submitted on 09/Jun/2017
Minor Revision
Review Comment:

The authors address the long-standing and important challenge of handling geospatial metadata with an approach that could equally be applied to other domains. With this clearly written and illustrative paper, they make a valuable contribution that is certainly worth publishing. However, in its current form, the document remains at a technical level and close to practical implementation solutions. I would therefore recommend changing the paper type from “Full Paper” to “Report on tools and systems”, also in order to address the appropriate readership.

If this would be acceptable, I would have only very few more detailed comments:
- The abstract should be formatted as a single paragraph.
- In the first paragraph of the introduction it should be mentioned that the INSPIRE Directive was published in 2007 in order to indicate the time frame.
- The fifth paragraph of the introduction might start with “The purpose of our work is twofold” instead of “Our purpose is twofold”.
- In the same paragraph, an explanation is needed of why SWE practices had to be followed in this particular case.
- In section 2, purists have long said that only the title and description of metadata should be free text. I thus wonder a bit about the opening statement of this section. Although examples are given later on, it might be appropriate to relax the statement slightly.
- Section 2.1 lists many practical issues, which is great! The example introduced in 2.2 is very illustrative. I wish other authors would follow a similar approach.
- In section 3, there is a blank line to be removed just before the paragraph starting with “Listing 1 shows a fragment…”.

Review #2
By Sara Lafia submitted on 14/Jun/2017
Major Revision
Review Comment:

This paper contributes a workflow for semantics-aware metadata that integrates data into a single point of access and offers on-the-fly conversion between triples hosted in a triplestore and XML templates. This allows for dynamic and flexible metadata management, which is helpful as researchers increasingly rely on “third parties to shape their identities on the web” (p. 10); using URIs for disambiguation also helps.

The main contribution of this work is automatic metadata management and update, which is enabled by adoption of RDF as “the native metadata storage format” (p.5). While many metadata editors currently exist, none offer particularly flexible schema translation. The architecture solution that the authors present allows for dynamic generation of metadata from RDF triples, with updates made via linked entities described with URIs.

Overall, the paper was legible and logical. The work presented was not particularly novel, but was written in language that was concise and clear. The writing has some minor English language issues and uses acronyms abundantly, but generally, the content is well-cited and explained. The authors clearly document their workflow and do a good job of anticipating challenges to their work. In general, this paper is a solid technical document, but does not present original research results.

The hypothetical example given and developed throughout the paper is not geospatial in nature. In fact, the scenario (updating the contact information for a data custodian who changes his work agency) is rather trivial. While the title of the paper is “Raising Semantics-Awareness in Geospatial Metadata Management”, this paper hardly mentions any unique challenges posed by the management of geospatial metadata. The authors cite the use of the GeoDCAT extension and then omit any mention of its adoption in their solution. The hypothetical example shows neither applications nor benefits to the cited RITMARE project.

The work applies existing extensions to build a custom workflow. While this is specifically useful for the case of RITMARE, the authors do not show how extensible it is and how beneficial it could be for other such cases. This work also claims to contribute to harmonization of metadata from free text descriptions, but does not show how this is accomplished. The authors also claim that entities in metadata are linked to URIs from controlled vocabularies. Aside from the case study where the researcher’s FOAF profile is linked to his custodianship, this is not demonstrated.

The obvious benefits of using a triplestore in conjunction with metadata management are demonstrated, rendering metadata as dynamic "living documents" (p.3) and allowing for decentralized metadata management. The stated benefit of RDF as a native storage format is ease of maintenance due to the dynamic nature of the metadata, as XML is created from RDF triples on-the-fly when looked up. Other benefits, such as query expansion, inference, subsumption, disambiguation, however, were not addressed and are possibly not considered.

The update capability for RDF triples also raises larger questions about metadata provenance, which the authors also do not address. Is there a changelog for insertions to the triplestore? What checks for quality assurance are performed by metadata managers?

This paper was technically sound, but was not presented as a research paper with novel contributions. The RDF capabilities showcased were very basic and the results were not particularly significant. While certainly helpful for addressing the particular hypothetical use case, I am not convinced that the authors’ solution would have far-reaching practical implications for current metadata management practices. Perhaps rewriting this paper once the development of the project RITMARE geoportal is complete would yield more interesting results.

Review #3
By Simon Cox submitted on 18/Jun/2017
Minor Revision
Review Comment:

The paper describes a system that updates practice in metadata encoding and management for a significant community that has a strong history of formal metadata use (the geospatial community). It leverages developments in identifier management, linked data and semantics in order to address a nagging issue in metadata maintenance – that descriptions of key elements can go stale over the likely lifetime of a dataset. The key element of the proposed solution (delegating the maintenance of people-descriptions to external services) has always been theoretically possible in the GML-based metadata environment under consideration. But with the emergence of a well-governed system for descriptions of people (ORCID), the approach is now feasible at scale.

The authors describe an original and effective RDF/SPARQL based tool-chain to actually implement it. In particular, they address the practical need to support XML-oriented legacy systems which have very widespread deployment in practice. While the paper only describes the application of this pattern to one aspect of the metadata record (the ‘responsible party’ in a metadata record) the principle that metadata records should be ‘normalized’ (at web scale), by substituting references to externally managed resources in place of information that is currently repeated inline in many records, is clearly applicable to other aspects of the metadata.

There is a general principle at stake here: a record-centric approach to metadata has been a somewhat unfortunate outcome of the XML platform used to implement the object-oriented geospatial metadata models. The tool of XML Schema Validation has driven this, though as Schade and Cox [2010] showed, the underlying GML-based XML platform fully supports an RDF-style graph model. But a key principle here is that the metadata-record for a specific data item or dataset should be effectively just a sub-graph from a larger pool of metadata describing many resources (Cox 2015 also discussed this a little further). Embedding the SPARQL to extract the person description into the XML template is cute, but accepts the questionable convention that metadata records are fully denormalized, whereas it might be a better principle to propose leaving the cross-references as links (using GML’s xlink mechanics).

So my main criticism of the paper is that the authors have not brought out the general pattern strongly enough in the discussion.

Another concern that is only alluded to in the paper is that in many cases the ‘responsible party’ for some roles (e.g. custodian) is not a person, but is actually a functional position attached to an organization, though a model for this is not provided in the ISO metadata standard (the W3C Organization Ontology provides org:Post, which is close). Yet another likely problem is that organizational stability will in some cases be over a shorter timescale than the usefulness of a dataset. Another important issue is that many of the parties (people) with some responsibility for geospatial data are not researchers and will not have ORCIDs.

These do not undermine the linking principle, but are likely to lead to suboptimal outcomes in practice.

Finally, I would like to see the discussion of the tooling choices extended just a little further. With the development of ShEx and SHACL, the standard RDF-based toolset is probably fully competent to mirror the XML pipeline, with significant benefits in addition to the metadata-integrity concern that is the focus of this paper. In future we might see triple-stores replace the document-stores that are currently underneath most geospatial metadata systems, and a legacy ‘ISO metadata record’ could then probably be assembled by a SPARQL/XSLT tool chain.

I recommend acceptance. The paper is highly original. While the subject matter is not at the core of 'semantics', since there is no mention of entailments or reasoning, it is a pragmatic application of linked data principles using semantic-web technologies to address a real problem for a significant web semantics community.

Schade & Cox, 2010 - http://publications.jrc.ec.europa.eu/repository/handle/JRC57141

Cox et al. 2015 - https://doi.org/10.4225/08/584af38d53fd7 https://www.researchgate.net/publication/306014248_Some_problems_with_st...