RÉPENER’s Linked Dataset

Tracking #: 451-1628

Authors: 
Alvaro Sicilia
German Nemirovski
Marco Massetti
Leandro Madrazo

Responsible editor: 
Oscar Corcho

Submission type: 
Dataset Description
Abstract: 
The dataset presented in this paper constitutes one of the outcomes of RÉPENER- a research project, co-funded by the Spanish RDI plan. It contains integrated information of the Spanish territory, regarding energy certification, building monitoring, and geographical data. The integration has been carried out by means of semantic technologies. The following of the Linked Data principles helps to guarantee standard methods of accessing the data as well as to connect data to the existing dataset on the Web of Data. The dataset is a Knowledge base for end-users. It has a clear objective of providing information that stakeholders need for improving energy efficiency of buildings, influencing, thus, their respective decision-making areas.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Michael Lutz submitted on 08/May/2013
Suggestion:
Minor Revision
Review Comment:

The paper describes a linked data set developed in the RÉPENER project based on three existing input data sets. The data set is well described, even though the language can be improved in places (see some suggestions below) and should be checked by a native English speaker. Although the authors present a number of possible uses and the services developed on top of the data set, the actual use remains unclear. It would be good to include actual usage figures or metrics if available.

I also feel that the potential of using a linked data approach has not been fully tapped. Much of the described work could also have been achieved using traditional (relational) database technology and data integration (ETL) methods. In particular, the re-use of existing vocabularies and the links to other external data sets could be improved.

Detailed comments:

- Section 1: "This requires having access to energy information at the different stages of the building life-cycle –from design, to construction, and operation– and not in separated sources." -- I agree that the information on the different stages is necessary. But why is it a problem to have that information in different sources? Please explain.

- Section 2: Why did you only include 202 of the 1800+ energy certifications of ICAEN? Even if the 202 chosen entries contain the most detailed information, the other entries may also be useful in some cases. Please explain.

- Section 3 / Figure 1: The distinction between properties and concepts is not clear in the figure. The legend states that black arrows represent object properties, but the arrows are not named. It is unclear if the boxes represent concepts/classes or also properties. Please explicitly make this clearer in the figure and textual description.

- Section 3.1 (data transformation): "Finally, the values of the use of building (repener:mainBuildingUtilisation) (...) have been converted to the classification provided by the DATAMINE project [5], an international domain reference. In this way, third-parties, from other countries, are able to understand the data." -- I assume this means that the DATAMINE classification contains multi-lingual labels, correct? Is this the only multi-lingual code list/classification used in the data set? Are rdfs:label values provided in different languages? If not, what language is used for textual properties?

- Section 3.1 / Figure 2: Have you considered the benefits/drawbacks of your solution to load all data sets into the same triple store (rather than establishing a separate triple store for each data set)? What are the implications if the data sets evolve? (e.g. does the ETL process have to be run again every time there is a change in one of the source data sets? how will the central triple store know about any changes in the source data sets?) On a related note, what does the notion of "data set" refer to in the paper? After the ETL, all the content of the triple store could be considered a data set (the RÉPENER Linked Dataset), i.e. all links between resources coming from different source data sets are now internal (to the newly created merged data set). This could be discussed in the paper.

- Section 3.2 (data linking): "For instance, a climate zone resource such as C2 (see http://...) connects both sources through repener:hasCity and repener:hasBuilding properties." - This is not very clear. Why are climate zones used with hasCity and hasBuilding properties? A figure may help explain how the link works.

- Section 4: It seems to me that all the described services could also be implemented based on a conventional database - they do not illustrate any additional benefit of using a linked data approach. Please illustrate how the links (in particular to external data sets) are beneficial for the presented services.

- Section 4.1: "It can be explored also graphically, in a heat map implemented on top of Google Maps." --> what does the heat map show - just the density of where there are buildings in the dataset? Or are the heat maps also related to the energy efficiency of the represented buildings?

- Section 4.2: How is the temporal aspect (which is needed for the "before-after-renovation" comparison) handled in the ontology?

- Section 5: "While Reegle and OpenEI platforms offer energy-related data at a country level policies, regulations, energy production or renewable resource RÉPENER's dataset collects data for specific buildings" -- This suggests that additional external links to the data provided by projects/platforms like Reegle or OpenEI (e.g. on policies, regulations, energy production or renewable resources) could be added to the data set.

Some suggestions to improve the language:

Title (and text)
"RÉPENER’s Linked Dataset" sounds strange to me (probably because it suggests RÉPENER to be a person rather than a project). Maybe consider using "the RÉPENER Linked Dataset" in the title and text instead.

Abstract:
- "The following of the Linked Data principles" --> "Following Linked Data principles"
- "The dataset is a Knowledge base for end-users" --> "The dataset is a knowledge base for end-users"

Section 1
- "the improvement of the energy-efficient of new and existing buildings" --> "the improvement of the energy-efficiency of new and existing buildings" or "improving the energy-efficiency of new and existing buildings"
- "Designing and building more efficient buildings become necessary to have a better knowledge of the relationship between design and performance and between the design objectives and the actual performance of the building." -- this sentence does not make sense. Do you mean "In order to [be able to] design and build more [energy-]efficient buildings, it is necessary to have a better knowledge of the relationship between design and performance and between the design objectives and the actual performance of the building."
- "can be found in Madrazo [1]" --> "can be found in [1]"

Section 2
- "simulations results" --> "simulation results"
- "Besides, the ICAEN owns more than 1800 energy certifications, 202 have been included in the dataset because of its simulation details." --> "The ICAEN owns more than 1800 energy certifications, of which 202 have been included in the dataset because of their simulation details"
- "It was thought to use GeoLinked dataset (.es), in the first place" --> "We initially considered using the GeoLinkedData.es dataset" (also change "GeoLinked dataset" to "GeoLinkedData.es dataset" later in the text, e.g. in section 3.2)
- "which stores the populated places of the Spanish territory including geographical data for each record such as population, areas, elevation, or Universal Transverse Mercator (UTM) coordinate." --> "which stores geographical data on the populated places of the Spanish territory including their population, area, elevation and geometry (specified in Universal Transverse Mercator (UTM) coordinates)."

Section 3
- "is provided by Nemirovskij [4]" --> "is provided in [4]"
- "including the links to external datasets. Data transformation" --> "Data transformation" should probably be a (2nd level?) heading
- "through an ETL (Extract, Transform and Load), a process which" --> "through an ETL (Extract, Transform and Load) process, which"
- "Paradox is an obsolete database" -- what do you mean by "obsolete"?
- "In addition, the data extracted from Paradox files have been aggregated from hourly to monthly values since its usage is foreseen in a kind analysis which does not require low level of data aggregation." --> what does "its usage" refer to (the Paradox database)? What do you mean by "a kind analysis"? What do you mean by "low level of data aggregation" (highly disaggregated / highly detailed)?

Section 4
- "to contribute with the improvement of the buildings' energy efficiency" --> "to contribute to the improvement of the buildings' energy efficiency"
- "users inform about the" / "users tell about" --> "users specify the"
- "It can be explored" --> "The results can be explored"

Review #2
By Raúl García-Castro submitted on 09/May/2013
Suggestion:
Major Revision
Review Comment:

The paper describes the REPENER linked dataset, which includes information about energy certification, building monitoring and geographical data.

The paper describes the process followed for generating the dataset and how it is being used and fits perfectly in the special issue. However, it still requires some improvements.

I have some comments about the ontology:
.- Is there some link where the ontology can be downloaded as an independent artifact? If not, it is quite difficult to analyse it through the Linked Data browser or the SPARQL endpoint.
.- Related to this, is the REPENER (http://arcdev.housing.salle.url.edu/repener/lod/ontology/) ontology identical to the SEMANCO (http://www.semanco-project.eu/2012/5/SEMANCO.owl) one?
It seems so, but the links to both URLs are broken so I cannot check.
.- owl:real (used as, e.g., range of cO2emissionsValue) does not exist. The ontology must be corrected.
.- Correct "/lod//ontology/" in section 3.1. Is "/page/" missing?
Also if I try to access the repener:ClimateZone class mentioned in section 3.2 (http://arcdev.housing.salle.url.edu/repener/lod/page/ontology/ClimateZone) it gives me an error.

In section 3, some things about the dataset creation through ETLs should be clarified:
.- Is the dataset created through one ETL or through three ETLs? If there are three, are they somewhat related or are they independent processes?
.- How are ETLs executed? Are they executed periodically or has it been a one-time activity?
.- Is the dataset creation fully automated or does it involve some manual actions?

Figure 1 can be improved in different ways:
.- When talking about links, it is not clear whether links are between ontologies or between datasets. If they are between ontologies they can be object properties or owl:equivalentClass; if they are between datasets they could be owl:sameAs or other properties. Please clarify this.
.- The previous comment is also related to whether the figure depicts the ontology (as stated in the caption), the datasets created, or the original data sources. It would be interesting to show which part of the ontologies is covered by each data source.
.- Another thing that would clarify things is whether the concepts present in the ontology (the top level ones) are exhaustive or not. For example, are only 3 classes used for sumo, one from aemet and another from geoes? It would be nice if the figure included all the top level classes (if it does not do so right now).
.- The legend should include differentiated arrows for subclass properties.

When linking data from the different data sources, sometimes the same URI was generated from an entity in different data sources and in other cases different URIs were generated and linked with owl:sameAs.
If it is possible to identify that two entities are equivalent in order to create the link with owl:sameAs, it should be possible to identify that they are equivalent and assign them the same URI. Or is it the case that owl:sameAs was only used for external entities that already had their own URIs? Please, clarify this.

The topic of the evaluation of the dataset is not covered. For example, regarding the independent generation of URIs when creating the data from different data sources: was there any problem of the same entity being created with different URIs in two different data sources (e.g., "../Girona" and "../Gerona")? How have the authors detected/corrected this?
Another example: when linking data, the authors mention that data from different datasets complement each other. Did the authors find any contradictions when merging data?
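
To illustrate the kind of check meant here, the following is a minimal sketch of how near-duplicate URIs such as "../Girona" and "../Gerona" could be flagged for manual review. It is not the authors' method: the base URI, the function names and the 0.8 similarity threshold are assumptions chosen for the example, and string similarity is only one of several possible heuristics.

```python
from difflib import SequenceMatcher
from itertools import combinations


def local_name(uri: str) -> str:
    """Return the last path segment of a URI, lower-cased."""
    return uri.rstrip("/").rsplit("/", 1)[-1].lower()


def near_duplicates(uris, threshold=0.8):
    """Return (uri_a, uri_b, score) for distinct URIs with similar local names."""
    pairs = []
    for a, b in combinations(sorted(set(uris)), 2):
        score = SequenceMatcher(None, local_name(a), local_name(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical URIs minted independently from two data sources.
BASE = "http://example.org/lod/city/"
uris = [BASE + "Girona", BASE + "Gerona", BASE + "Barcelona"]

for a, b, score in near_duplicates(uris):
    print(f"possible duplicate: {a} <-> {b} (similarity {score})")
```

Flagged pairs would still need a human decision: either the two URIs are merged into one, or they are kept and connected with owl:sameAs.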

During the data linking process, were all the links between entities in different datasets identified (BuildingLocation-City, City-WeatherStation, City-Municipio)? E.g., were weather stations found for every city? For this case, section 4.5 already clarifies that not every city has a weather station. Since this is something that happens in the real world, the authors must discuss how this lack of information affects the use of the dataset.
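
A link-coverage figure of the kind requested could be computed along these lines. This is a sketch under stated assumptions: the city and station identifiers are invented, and the mapping stands in for whatever the closestStation links in the dataset actually contain.

```python
def link_coverage(cities, closest_station):
    """Return (fraction of cities with a station link, sorted cities without one)."""
    missing = sorted(c for c in cities if c not in closest_station)
    covered = len(cities) - len(missing)
    fraction = covered / len(cities) if cities else 0.0
    return fraction, missing


# Hypothetical data: four cities, three of them linked to a weather station.
cities = {"city/A", "city/B", "city/C", "city/D"}
closest_station = {
    "city/A": "station/1",
    "city/B": "station/1",
    "city/C": "station/2",
}

fraction, missing = link_coverage(cities, closest_station)
print(f"{fraction:.0%} of cities linked; without a station: {missing}")
```

Reporting both the fraction and the list of unlinked cities would let readers judge how the gaps affect services that depend on weather data.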

It should be clarified whether the services in the dataset exploitation section can be accessed in other ways besides the portal (e.g., APIs). If they have been implemented and have user interfaces, the authors could put links to them.

In the conclusion the authors mention that by linking to other datasets they have complementary information. This is the main benefit of having linked data; the authors should highlight in the use cases (or in another one) the benefits of having this complementary information and how it can be used.

Finally, the paper does not have any information about the licensing of the three data sources reused and of the generated dataset. This should be included in the paper.

Other things:

.- In section 3, in the paragraph starting with "REPENER's ontology uses the upper ontology", there are two consecutive sentences that start with "The ontology" but in each case the ontology they are referring to is a different ontology.
.- In figure 2, it is not clear whether there is one or three ETLs. Besides, in reality the ETL is using as input a MySQL database instead of the original source in two cases. The figure should reflect those things.
.- Section 3.1 mentions that when minting URIs pluralized class names are used, but the example in that same paragraph does not follow that ("/city/" or "/climatezone/" later).
.- Section 3.2 talks about outgoing links. Clarify whether by the time of writing the paper the dataset has any incoming link.
.- In table 1, the number of closestStation links is only relevant if the number of cities is known, if not it is not possible to assess whether the links are enough or not.
.- End of section 4.4. The authors mention that in the following weeks the website will be activated. Change to the current status.
.- A similar comment can be made regarding the submission of the dataset to the Data Hub.
.- Format the query in section 4.5 so that it does not break words between lines ("?primaryen-ergy").
.- Revise the writing, there are some mistakes.
.- The maximum length for the paper is 6 pages and the paper has 8.

Review #3
By Danh Le Phuoc submitted on 10/Jun/2013
Suggestion:
Minor Revision
Review Comment:

This article presents a Linked Dataset about the Spanish territory, covering energy certification, building monitoring, and geographical data. The dataset aims to provide information that stakeholders need for improving the energy efficiency of buildings, thus influencing their respective decision-making areas. The dataset is interesting and the paper is well-written. The following minor suggestions and concerns need to be addressed before publishing.

One aspect that the journal's call asks dataset descriptions to detail is the creation, maintenance and update mechanisms. The article should therefore give more details about how updates are handled in the ETL process.
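
As an illustration of one possible update mechanism (not something the paper describes), an ETL run could fingerprint each source and reload only those whose content has changed since the previous run. The source names, row layout and function names below are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Hash an iterable of row tuples; sorting makes the result order-independent."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def sources_to_reload(current, previous):
    """Return names of sources whose fingerprint differs from the stored one."""
    return sorted(
        name for name, rows in current.items()
        if fingerprint(rows) != previous.get(name)
    )


# Hypothetical snapshot taken at the previous ETL run.
old = {
    "certificates": [("b1", 120.5), ("b2", 98.0)],
    "monitoring": [("b1", "2012-01", 310.0)],
}
stored = {name: fingerprint(rows) for name, rows in old.items()}

# Hypothetical current state: one new certificate, monitoring unchanged.
new = {
    "certificates": old["certificates"] + [("b3", 75.2)],
    "monitoring": list(old["monitoring"]),
}

print(sources_to_reload(new, stored))
```

Whether such incremental reloading is worthwhile depends on how often the sources change, which is exactly the information the paper should provide.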

Since the dataset is quite small, the paper would be more convincing if it gave more details about the coverage of the dataset, such as the area, the types of building, the percentage of buildings covered, etc. How large could the dataset become in the future? Pointing out some more interesting external links could make the dataset richer and bigger.

As there is more space available in the article, adding more example queries and screenshots of the results for each sub-section of section 4 would make the paper easier to read and more intuitive for readers.

