A Systemic Approach for Effective Semantic Access to Cultural Content

Paper Title: 
A Systemic Approach for Effective Semantic Access to Cultural Content
Authors: 
Ilianna Kollia, Vassilis Tzouvaras, Nasos Drosopoulos and George Stamou
Abstract: 
A large on-going activity for digitization, dissemination and preservation of cultural heritage is taking place in Europe, United States and the world, which involves all types of cultural institutions, i.e., galleries, libraries, museums, archives and all types of cultural content. The development of Europeana, as a single point of access to European Cultural Heritage, has probably been the most important result of the activities in the field till now. Semantic interoperability, linked open data, user involvement and user generated content are key issues in these developments. This paper presents a system that provides content providers and users the ability to map, in an effective way, their own metadata schemas to common domain standards and the Europeana (ESE, EDM) data models. The system is currently largely used by many European research projects and the Europeana. Based on these mappings, semantic query answering techniques are proposed as a means for effective access to digital cultural heritage, providing users with content enrichment, linking of data based on their involvement and facilitating content search and retrieval. An experimental study is presented, involving content from national content aggregators, as well as thematic content aggregators and the Europeana, which illustrates the proposed system capabilities.
Submission type: 
Full Paper
Decision/Status: 
Accept
Reviews: 

Manuscript revision (now accepted) after an accept with minor revisions. Previous reviews below.

Review 1 by anonymous reviewer

The paper is really improved.

Here are some remarks for the new version:

Related work:
Although this paper explores semantic access, the related work mostly deals with metadata schemas.
Still, there are tools that use semantic information, like Powerset, CatScan, Shallow Semantic Query,
etc.

Scalability:
Even with limited data, the proposed approach needs a lot of execution time.
Are there any plans or thoughts on how to make this more scalable, for Europeana sized collections?

Evaluation:
The evaluation deals mostly with performance / response time, and not with the evaluation of the output results:
A recall / precision measure should be estimated, by examining the results, to express how well this approach works.
Or the results should be compared with results from another approach.
Some related tools (see above) often use pre-computed semantic information, and this makes them run faster, and more scalable.
But how can we compare the results of the alternative approaches?

Review 2 by anonymous reviewer

I have now checked the paper and responses to reviewers, and my assessment is that the paper has improved much, especially the evaluation part that was my main concern. So from my side there is no objection to proceed with accepting the paper.

Meta review by anonymous reviewer

I was asked to provide a meta review of this paper. For this I read the paper, all the reviewer comments, and the document from the authors where they explain how they have addressed the reviewer comments.

To me it seems that the most crucial issue that reviewers addressed was related to the novelty of the query rewriting and, further, to the lack of evaluation of the approach. Although the authors provided some sort of evaluation in the new version, it looks like an artificial one without real involvement of any human test subjects or, e.g., precision-recall analysis. Comparing running times of the queries is perhaps not the kind of evaluation that reviewers asked for. Given this, I would vote for reject and resubmit after a major revision.

However, I had a look at the other papers of this special issue and it seems to me that many of them also lack any proper evaluation. For example, the paper by Mäkelä and Hyvönen does not have any. So if the editors have accepted papers that do not include any evaluation, then they might consider including the one by Kollia et al. as well. But this is something I leave to the editors to decide.

This is a revised submission after a "reject and encourage resubmission." The reviews below are for the original submission.

Review 1 by Sarantos Kapidakis

The paper briefly presents a tool that helps the user define metadata mappings for harvesting in Europeana, and proposes a semantic framework for its use.
The tool presentation is short, and it is not clear how it handles n:m field mappings, obligatory Europeana elements and input validation.
The description of the proposed framework is lengthy although not implemented, but in general explains Europeana's current plans and efforts, and gives some examples for the implementation of some parts of it. The query evaluation examples use known optimizations, and the other examples cover only small simple cases, with a small dataset and classification hierarchies.
A useful proposed approach for Europeana should be able to handle larger collections: from one library alone we could get millions of records, thematically mapped to DDC or LC. Many of the descriptions serve more like a wish-list, as it is not clear how they can be implemented in the real environment.
The ideas behind the paper and proposal are good, but we need to see them implemented, to see how they work on real data and even to measure precision and recall (or other metrics) on its usage.

Review 2 by Werner Kuhn

"This paper presents a system..." is a phrase that raises concerns in an abstract: can we learn something from such a presentation and is the system innovative? I am afraid I cannot answer either of these questions positively. While it is obviously very important to solve semantic interoperability problems in the context of cultural heritage information, the paper fails to identify specific unsolved problems and does not make it clear what its contributions are in terms of semantic web methods (rather than possibly useful tools for cultural heritage communities).

If a system is the main focus of a paper, one needs to show at least what parts of it are completed and what results they have produced. The paper also fails on this account and actually raises more questions about the state of implementation than it answers. There is no evaluation at all, just a rather superficial section describing an "experimental study" that is largely a thought experiment on how the system might operate and be useful in the future. No data, no evaluation, no user studies.

The only contributions to methodology could be the query expansion and optimization techniques, but these are to the best of my knowledge rather standard and not innovative.

The paper also has some language problems, but these are irrelevant given the above remarks. The images in the paper are nice, but really have nothing at all to do with its contents, only with the content of the information treated.

Review 3 by Rainer Simon

The paper presents a software system for mapping institutional metadata schemas to a unifying schema - the Europeana Data Model - in a user-guided process. The paper furthermore discusses how query answering can help in the process of metadata enrichment.

Overall, the paper is well written. The mapping tool is likely a valuable achievement w.r.t. establishing semantic interoperability in the cultural heritage field. Furthermore, it appears the tool is already being used on a large scale, which is impressive.

My one point of criticism, however, is that the paper slightly lacks a clear focus and "storyline", which makes it difficult to read. On the one hand, the paper describes the tool (which would certainly deserve a dedicated paper in its own right), but only relatively briefly. On the other hand, the paper talks about the "metadata enrichment by query answering" approach. It is suggested that this is somehow part of the overall mapping workflow, but I fail to see how exactly this is integrated, or whether it is integrated in the tool at all. An additional screenshot might help.

Likewise, the idea of resource linking done by (amateur) users as part of creating "stories" for personal use is fascinating. But it is not clear how this relates to the rest of the paper. Sure, semantic interoperability, search and query answering are all, apparently, "technological underpinnings" for this. But considering the paper's initial focus on the metadata mapping software, this seems like a large step from one topic to the other...

Also, the paper lacks a dedicated related work section. This might be useful in helping the reader to better put things into perspective.

Review 4 by anonymous reviewer

The paper is describing a metadata mapping tool plus some cases of semantic search applied to the domain of cultural content. The paper is dealing with semantics and it nicely fits both the journal scope and the contents of the call.

In general terms, the paper is reporting on some pressing practical issues related to metadata integration, and then focusing on the possibilities of ontology-based search exploiting formal semantics. In consequence the topics of the paper are highly relevant to the field. However, I have major concerns with the contributions provided and their degree of maturity and innovativeness. In what follows I detail my major concerns:

(1) The first contribution is contained in the first part of Section 2. It first describes the existing ESE/EDM for Europeana. Then it reports on the integration system, succinctly describing technology (XML/XSL), then visualization (Figure 1), and then workflow issues (Fig. 2 and 3). A review of the innovative issues is the following:
* The technical issues and visualization are mainstream and provide no real innovation over existing metadata editors and schema mapping tools, or the authors have failed to highlight which features of the tool cannot be found in other tools. The authors claim user friendliness of the tool, but this is not evaluated or analyzed from a scientific viewpoint in the paper.
* Workflow issues are interesting as some kind of "best practice" but they do not constitute a substantial progress over the state of the practice. Also, they are not subject to further analysis or evaluation later so they can be considered mostly background information on the usage of the solution.

(2) The main research content comes in the last paragraphs of Section 2 and Sections 3, 4 and 5. Section 2 raises interesting problems associated with the performance of querying using formal ontologies and OWL. Section 4 describes the query answering implementation reused (from Oxford) and deals with some high level aspects in 4.1 that in my opinion can be removed as they are not directly relevant to the query problems analyzed. Then, Section 5 provides some example queries and responses and some brief discussion. The problem with this contribution is that the status is still preliminary, and there is no real experiment, but a report on some examples. More work is needed, with substantial representative sample queries and detailed measurements, to come up with credible conclusions. Also, the authors are not clearly analyzing whether performance is dependent on the implementation and content base structure or only on the ways queries are resolved. Both aspects need to be considered when assessing running time for queries.

It seems that contributions (1) and (2) are disconnected, as the mapping tool is for standard XML metadata and not highly expressive OWL.

In conclusion, the paper is addressing relevant formal query issues but:
- It mixes this with a presentation of a mapping tool that can be removed or integrated in the introduction.
- Does not report substantial empirical data that constitutes a real advance. The experimental method is limited to some example queries and comments on anecdotal evidence on query results on a particular content database.

My overall suggestion is reorienting the paper to the query problems and rewriting it to expose that as the main contribution, and doing additional work in the experimental methodology and extensive analysis of empirical results for the query problems posed. I encourage the authors to do so as there is very limited literature in these topics.

Some additional comments:
- Fig. 4 and discussion on linked data can be removed as it is not relevant to the main research issues.
- An explanation of the ontologies used in the queries is required and the detail on how LIDO metadata in the database is translated to that richer OWL representation, to understand if it is 100% automatic or requires enrichment.
- The relation of Europeana EDM to the OWL based querying system must be clarified, as the implementation of the experiments is done with other existing systems that are not operating on the representations used by Europeana.

Comments

REPLY TO THE COMMENTS OF REVIEWERS
for the paper
"A Systemic Approach for Effective Semantic Access to Cultural Content"

We wish to thank all reviewers for their constructive comments, which have greatly assisted us in improving our manuscript. Our reply and the action taken with respect to each comment are described below.

-----------------------------------------------------------------------------------------
Reply to the 1st Reviewer

COMMENT 1.1. The tool presentation is short, and it is not clear how it handles n:m field mappings, obligatory Europeana elements and input validation.

RESPONSE: The presentation of the metadata aggregation system and specifically of the mapping editor has been reorganized, in sections 3.1 and 3.2 respectively. Available mapping language features are listed in the last paragraph of section 3.2, while the previous paragraph mentions that the mapping editor is schema aware and applies all constraints imposed by the target XSD, including mandatory items. Finally, ingestion procedures and subsequent input and transformation validation are mentioned in section 3.1.

COMMENT 1.2. The query evaluation examples use known optimizations, and the other examples cover only small simple cases, with a small dataset and classification hierarchies.

RESPONSE: We revised Section 5 (the system evaluation), providing in Section 5.2 results with thematic terminologies and large data sets to which these terminologies apply. With the aid of these results, which are based on the description of the proposed methods and system implementation provided in the revised Section 4 (the description of the semantic query answering module), we believe that it is now clear that the query evaluation covers complex cases of practical use.

COMMENT 1.3. The ideas behind the paper and proposal are good, but we need to see them implemented, to see how they work on real data and even to measure precision and recall (or other metrics) on its usage.

RESPONSE: Sections 3 and 4 (restructured) of the revised paper provide the details of the algorithms implemented for the ingestion, schema mapping, semantic enrichment and semantic query answering modules. Moreover, the revised Section 5 (the implementation and system evaluation section) describes additional system implementation and integration issues, and presents a systematic evaluation of the system with large data sets and terminologies, within the framework of Europeana.

-----------------------------------------------------------------------------------------
Reply to the 2nd Reviewer

COMMENT 2.1. While it is obviously very important to solve semantic interoperability problems in the context of cultural heritage information, the paper fails to identify specific unsolved problems and does not make it clear what its contributions are in terms of semantic web methods (rather than possibly useful tools for cultural heritage communities).

RESPONSE: In order to clarify the motivation, we added in the revised version of Section 1 (Introduction) two paragraphs (paragraphs 4 and 5, page 2) identifying the unsolved problems that motivate our work. Moreover, in the beginning of Section 2 (paragraph 5, page 3) of the revised paper, we clarify the paper contribution in terms of Semantic Web technologies, presenting the system architecture.

COMMENT 2.2. If a system is the main focus of a paper, one needs to show at least what parts of it are completed and what results they have produced. The paper also fails on this account and actually raises more questions about the state of implementation than it answers. There is no evaluation at all, just a rather superficial section describing an "experimental study" that is largely a thought experiment on how the system might operate and be useful in the future. No data, no evaluation, no user studies.

RESPONSE: We added a new section (Section 2 in the revised paper), presenting the system architecture and clarifying in this way the modules that we have implemented and integrated in our system. In Sections 3 and 4 of the revised paper, we follow the above system architecture and describe the modules from a technical point of view. In particular, we reorganised the old Section 2 (Section 3 in the revised paper; the section that describes the semantic mapping and enrichment modules), split it into three subsections, and provide the details of the metadata aggregation and semantic enrichment modules, putting emphasis on implementation issues. Moreover, in the revision of the old Sections 3 and 4 (Section 4 in the revised paper; the section that describes the semantic query answering module) we added formal descriptions of the proposed algorithm (last paragraph of page 7, pages 8 and 9) and its components (subsections 4.1 and 4.2). Finally, we revised Section 5 (the implementation and system evaluation section), describing several implementation and integration issues. Concerning the evaluation, we now present in Section 5 a systematic evaluation of the system with large data sets, terminologies and user studies involving users from several organisations, within the framework of Europeana.

COMMENT 2.3. The only contributions to methodology could be the query expansion and optimization techniques, but these are to the best of my knowledge rather standard and not innovative.

RESPONSE: As mentioned in our reply to COMMENT 2.2, the content of Section 4 of the revised paper (the section that describes the semantic query answering module) is new and clarifies the contribution of the paper, which is summarised in Algorithm 1 (see the Figure on page 9). To our knowledge, the hybrid use of both the 'query rewriting' and the 'reduction to entailment' methods for query answering is novel, and the integrated system that we have implemented showed promising characteristics and performance in our experiments. Nevertheless, we do not present our paper as a work that contributes to the area of conjunctive query answering over description logic terminologies, as that is not (according to our understanding) the scope of this special issue. Moreover, with the revised Section 3, in our opinion it is now clear that there is also a contribution in the area of semantic mapping, since the methodology that we presented for the construction of the semantic repository is not standard. Consequently, following the revisions of the methodology descriptions, we believe that our claim that the paper constitutes a novel systemic approach (with methodological novelties for both schema mapping and semantic query answering) for semantic access to cultural content is not unjustified.

COMMENT 2.4. The paper also has some language problems, but these are irrelevant given the above remarks.

RESPONSE: We did our best to revise the language used throughout the paper.

COMMENT 2.5. The images in the paper are nice, but really have nothing at all to do with its contents, only with the content of the information treated.

RESPONSE: We added specific text describing all the Figures of the paper. Moreover, we changed the system architecture Figure (Fig. 1 in the revised version), so that it can be more easily described and understood by the reader.

-----------------------------------------------------------------------------------------
Reply to the 3rd Reviewer

COMMENT 3.1. My one point of criticism, however, is that the paper slightly lacks a clear focus and "storyline", which makes it difficult to read. On the one hand, the paper describes the tool (which would certainly deserve a dedicated paper in its own right), but only relatively briefly. On the other hand, the paper talks about the "metadata enrichment by query answering" approach. It is suggested that this is somehow part of the overall mapping workflow, but I fail to see how exactly this is integrated, or whether it is integrated in the tool at all. An additional screenshot might help.

RESPONSE: We added Section 2 in the revised paper to clarify the presentation: it provides a clear motivation and a description of the contribution, and summarises the structure of the proposed system in one Figure (Fig. 1, page 4). We think that this section also clarifies the connection of the different modules in the overall system.

COMMENT 3.2. Likewise, the idea of resource linking done by (amateur) users as part of creating "stories" for personal use is fascinating. But it is not clear how this relates to the rest of the paper. Sure, semantic interoperability, search and query answering are all, apparently, "technological underpinnings" for this. But considering the paper's initial focus on the metadata mapping software, this seems like a large step from one topic to the other...

RESPONSE: We excluded the description of the general linked open data framework (section 4.1 in the previous version), focusing on the descriptions of the terminological knowledge and its use in semantic query answering and semantic enrichment.

COMMENT 3.3. Also, the paper lacks a dedicated related work section. This might be useful in helping the reader to better put things into perspective.

RESPONSE: We added a related work section (Section 6 in the revised paper).

-----------------------------------------------------------------------------------------
Reply to the 4th Reviewer

COMMENT 4.1. The technical issues and visualization are mainstream and provide no real innovation over existing metadata editors and schema mapping tools, or the authors have failed to highlight which features of the tool cannot be found in other tools. The authors claim user friendliness of the tool, but this is not evaluated or analyzed from a scientific viewpoint in the paper. Workflow issues are interesting as some kind of "best practice" but they do not constitute a substantial progress over the state of the practice. Also, they are not subject to further analysis or evaluation later so they can be considered mostly background information on the usage of the solution.

RESPONSE: In Section 3 of the revised paper, we tried to explain why the proposed system has novel characteristics and several advantages over other similar metadata editors and schema mapping tools. Its extensive use within the framework of Europeana, described in Section 5.1, is evidence for this claim. Moreover, as far as user friendliness is concerned, we present in Section 5.1 a study based on user feedback during the last years of usage of the system for metadata aggregation.

COMMENT 4.2. ... Section 4.1. that in my opinion can be removed as they are not directly relevant to the query problems analyzed.

RESPONSE: We have removed the content of this Section.

COMMENT 4.3. Section 5 provides some example queries and responses and some brief discussion. The problem with this contribution is that the status is still preliminary, and there is no real experiment, but a report on some examples. More work is needed, with substantial representative sample queries and detailed measurements, to come up with credible conclusions. Also, the authors are not clearly analyzing whether performance is dependent on the implementation and content base structure or only on the ways queries are resolved. Both aspects need to be considered when assessing running time for queries.

RESPONSE: We totally revised Section 5. The new evaluation section for the semantic query answering module (Section 5.2) now provides a systematic evaluation of the system with large data sets and thematic terminologies, within the framework of Europeana. Concerning the performance of the system, we do not claim that it does not depend on the specific implementation. Actually, our goal is not to show that our system performs better than other similar ones; it is rather to show that we can effectively apply semantic technologies (and more specifically semantic query answering over expressive terminologies and data sets) in the area of cultural content access.

COMMENT 4.4. It seems that contributions (1) and (2) are disconnected, as the mapping tool is for standard XML metadata and not highly expressive OWL.

RESPONSE: By adding a new section (Section 2 in the revised paper), presenting the system architecture, and revising Sections 3 and 4, we tried to clarify the way the modules are connected within the overall system.

COMMENT 4.5. In conclusion, the paper is addressing relevant formal query issues but: - It mixes with a presentation of a mapping tool that can be removed or integrated in the introduction.

RESPONSE: The metadata aggregation and schema mapping module is a part of the proposed system, necessary for creating the semantic repository and enrichment modules, and thus it is difficult to avoid its short description (removing it would also conflict with the comments of the 3rd Reviewer). Nevertheless, we tried to shorten the description of technical details as much as we could and, in the revised paper, to put the emphasis on the use of semantic technologies towards the advancement of the metadata aggregation and mapping strategy.

COMMENT 4.6. In conclusion, the paper is addressing relevant formal query issues but: - Does not report substantial empirical data that constitutes a real advance. The experimental method is limited to some example queries and comments on anecdotal evidence on query results on a particular content database.

RESPONSE: Please see our reply to COMMENT 4.3.

COMMENT 4.7. My overall suggestion is reorienting the paper to the query problems and rewriting it to expose that as the main contribution, and doing additional work in the experimental methodology and extensive analysis of empirical results for the query problems posed. I encourage the authors to do so as there is very limited literature in these topics.

RESPONSE: We followed this suggestion, presenting a formal description of the semantic query answering strategy in Section 4 of the revised paper and extended results in Section 5. Moreover, in Section 3.3, we describe the construction of the semantic repository that is crucial for query answering, as it is a semantic representation of the cultural content in terms of the ontological knowledge. It is important to note that a sophisticated metadata aggregation and mapping procedure is a strong prerequisite for achieving this goal.

COMMENT 4.8. Fig. 4 and discussion on linked data can be removed as it is not relevant to the main research issues.

RESPONSE: We have removed the discussion on linked data.

COMMENT 4.9. An explanation of the ontologies used in the queries is required and the detail on how LIDO metadata in the database is translated to that richer OWL representation, to understand if it is 100% automatic or requires enrichment.

RESPONSE: Sections 3.2-3.3 and 5.2 explain the ontologies used and the translation to OWL representation.

COMMENT 4.10. The relation of Europeana EDM to the OWL based querying system must be clarified, as the implementation of the experiments is done with other existing systems that are not operating on the representations used by Europeana.

RESPONSE: Sections 2 and 3 present the operation of all modules of the proposed system, referring to all representations used in it.