Energy Efficiency Measures as Linked Open Data

Tracking #: 862-2072

Authors: 
Eva Blomqvist
Patrik Thollander
Robin Keskisärkkä
Svetlana Paramonova

Responsible editor: 
Pascal Hitzler

Submission type: 
Dataset Description
Abstract: 
This paper describes an open linked dataset containing data on energy efficiency improvements, i.e., recommendations and measures taken based on energy audits, from both Sweden and the US, i.e., from the Swedish Energy Agency and the US Department of Energy's Industrial Assessment Centers (IAC), respectively. The overall goal of our project is threefold; (i) to facilitate better energy audits through allowing auditors and the organizations themselves to be inspired by information on measures taken earlier, in similar organisational settings, (ii) to allow researchers and policy-makers to search, compare, and assess Swedish energy audit data, and data from the US, in an integrated fashion, and (iii) to facilitate easier building of third-party applications on top of energy audit data by publishing it as Linked Open Data on the Web. The dataset is currently available through both a SPARQL endpoint, a Snorql interface, and a demonstration search interface tailored for human end-users. The data is being updated based on an ongoing manual quality control effort, and future work includes the use of the dataset to perform studies on the effects of using past energy audit data as inspiration for future recommendations for Swedish industry, as well as continuously publishing updates and extensions to the dataset itself.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Cameron Maclean submitted on 23/Nov/2014
Suggestion:
Major Revision
Review Comment:

The paper describes a manually curated linked open dataset of energy efficiency measures and recommendations based on energy audits from Sweden and the US. The authors aim to make data that was previously disjoint and could only be compiled manually available as integrated RDF, supported by SPARQL, Snorql and web-based search interfaces, all in order to support policy research, application development, and future energy audits. Currently, however, the data is geared towards use within a Swedish context, and is available only in a mixture of English and Swedish.

Overall, the purpose and method of creation, and the description of vocabularies used is sufficient to enable exploration and use of the linked data. All data and endpoints were functional and available at the time of review.

The paper could be improved, however, by clearer descriptions of the quality, maintenance and use (future or actual) of the data.

(1) The need to harmonize industry classifications led to the omission of some IAC data. This fact is made explicit in the paper, but it would be beneficial to elaborate further: is there any way to measure, identify, or otherwise characterize what fraction of the IAC data is lost or not represented in the integrated dataset, so that users can better interpret and (re)use the linked data?

(2) There is no license specified for the data or the vocabulary in machine readable form. The paper states that the data is licensed under CC-BY 4.0 - it would be good to make this explicit in the data itself, perhaps using CC REL http://creativecommons.org/ns. Presumably, the CC-BY 4.0 license is compatible with all the underlying original datasets.
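
As a minimal sketch of what such a machine-readable license statement could look like, the snippet below emits one N-Triples statement per resource using the CC REL `license` property. The resource URIs are placeholders chosen for illustration and are not the dataset's actual identifiers; whether CC REL or another property (for example a Dublin Core term) is preferable is left to the authors.

```python
# CC REL namespace and the CC-BY 4.0 license URI.
CC = "http://creativecommons.org/ns#"
CC_BY_4 = "http://creativecommons.org/licenses/by/4.0/"


def license_triple(resource_uri: str) -> str:
    """Return one N-Triples statement: <resource> cc:license <CC-BY 4.0> ."""
    return f"<{resource_uri}> <{CC}license> <{CC_BY_4}> ."


# Hypothetical URIs (assumptions), standing in for the dataset and its ontology.
for uri in (
    "http://example.org/energy-audits/dataset",
    "http://example.org/energy-audits/ontology",
):
    print(license_triple(uri))
```

Adding one such statement to the dataset description and one to the ontology header would let clients discover the license without reading the paper.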

(3) It is not clear how this paper differs from the content alluded to in reference [1], which is not yet published. Does the current publication represent a legacy dataset that has already been superseded, with additional data added and quality issues addressed in the 'forthcoming' publication? If so, one might question the utility of the current data being described and made available. Because much of the discussion on quality and prospective use of the data is deferred to an unpublished future article, it is difficult to evaluate the current situation. For example, the authors should specify which version of the data the SPARQL, Snorql, and demo search interfaces use - is it always the most current version? How and where does any additional data become incorporated, and how are any data releases or changes managed and publicized to users other than via a change in the URI? If there are known shortcomings or ongoing improvements to the current dataset (as the authors indicate in section 3.5), the type and nature of these should be made explicit in the current article in order to benefit users, and not merely alluded to.

(4) The section on dataset usage discusses how the uniform categorization of the Swedish data was beneficial (although one reference is unavailable/unpublished); however, no mention is made of the utility of the US data or the usefulness of the external links. Are there any cases where the US data or the links to geographic and SCB information have been beneficial? Please include them if so. If not, more concrete examples of how such additional information could be used in the future to answer questions that are currently difficult would be informative.

In general, the paper could benefit from being rewritten with this additional information so that references to future (as yet unpublished and inaccessible) articles are not required to supply the context or justification for data maintenance and quality and use cases.

Review #2
By Amrapali Zaveri submitted on 30/Apr/2015
Suggestion:
Minor Revision
Review Comment:

The article “Energy Efficiency Measures as Linked Open Data” describes a linked dataset of energy efficiency improvements from both Sweden and the US.

However, a longer version of the paper is already published (ref [1] in the paper). I see considerable overlap between the two versions with obviously more details in the longer one. In fact some of the details should be included in this paper. But, I leave it up to the editor to decide.

(1) Quality and stability of the dataset - evidence must be provided.
The purpose of creation is well motivated and interesting. Also, there is sufficient reuse of established vocabularies. With regard to the interlinks to external datasets, however, more details are required: how the interlinking was performed, statistics on the links, and how complete the links are. Perhaps interlinks to the energy-related linked datasets mentioned in the paper could also be added. The transformation is performed manually, which raises questions about its cost and time feasibility as well as its accuracy. Also, how frequently is the original data updated, and how soon afterwards is it transformed and the triple store updated?

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.
There is reported third-party usage, with useful and interesting queries demonstrated. The query for use case 2 in Table 1 needs to be fixed - the GROUP BY clause is missing. I would also merge sections 1.2 and 4 about dataset usage and move Table 1 below.
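
To illustrate the point about the missing clause, here is a minimal sketch of an aggregate query in which every non-aggregated variable in the SELECT list also appears under GROUP BY. It is not the query from Table 1: the `ex:` prefix, `ex:Measure` and `ex:category` are hypothetical placeholders, and the sample rows are made up. The small Python function shows the same grouping on in-memory data.

```python
from collections import Counter

# Hypothetical vocabulary (assumption): ex:Measure and ex:category are
# placeholders, not terms from the dataset's ontology.
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?category (COUNT(?measure) AS ?n)
WHERE {
  ?measure a ex:Measure ;
           ex:category ?category .
}
GROUP BY ?category
ORDER BY DESC(?n)
"""


def count_by_category(rows):
    """Plain-Python equivalent of the GROUP BY / COUNT in QUERY."""
    return Counter(category for _measure, category in rows)


# Made-up (measure, category) pairs, for illustration only.
rows = [("m1", "lighting"), ("m2", "ventilation"), ("m3", "lighting")]
print(count_by_category(rows).most_common())
```

Without the GROUP BY line, a SPARQL 1.1 processor should reject the query, because ?category is projected alongside an aggregate without being grouped.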

(3) Clarity and completeness of the descriptions.
The paper is well written, however I encountered some formal errors listed here:
Abstract
- “open linked dataset” - “linked open dataset”
- “through” - “by”
- “both a” - “both as a”
Introduction
- “could be” - “could include”
- “isolation” - “insulation”
Purpose of the Linked Dataset
- “Some tools exist already today .. “ - provide references
- “audits through allowing” - “audits by allowing”
- “an organizations” - “and organisation” (z - s)
- “More in detail” - Please rephrase
- “(c)... through reusing” - “(c) … by reusing”
- “ran” - “run” (Also in the caption in Table 1)
Source of the Data and Topic Coverage
- “codified” - “coded”
Vocabulary Selection and Creation
- “Reegle” - provide reference
Links to Other Datasets
- “URI:s” - “URIs”

Review #3
By Charles Vardeman II submitted on 13/May/2015
Suggestion:
Accept
Review Comment:

The authors present a clear description of several existing energy audit data sets collected by organizations in the United States and Sweden that have been transformed to an RDF data representation and published as Linked Open Data. As part of the process, the "schema" for these data sets were aligned, normalized and conceptually lifted utilizing ontology patterns to enable a level of interoperability across data sets. The ontology developed for this purpose was also published as Linked Open Data and is using 5-star principles of Linked Vocabularies. These datasets use established W3C recommendations (RDF, OWL, SKOS, FOAF) as well as aligning to other established vocabularies where appropriate. They have addressed the country dependent nature of the data sets by using geospatial vocabularies as well as an explicit procedure for normalizing location. Ontologies developed in this publication utilize OWL relations and established ontology patterns. However, the semantics specified are relatively shallow but sufficient to conceptualize the data. The authors have indicated these limitations and indicated that some will be addressed in future work. The authors provide example data queries that demonstrate the semantics are sufficient to allow the data sets to be consumed for the provided use cases.

Original sources for the energy auditing data and links to the original data sets have been provided, as well as links to detailed descriptions of the data sets. The protocol for transformation and normalization of the original data sets to RDF was explicitly stated in the article, as was the protocol to ensure data quality through the transformation process. Data sets have been versioned by encoding the version in the URI of a version-specific endpoint. The linked data sets have been provided through multiple interfaces, including raw RDF, a SPARQL endpoint, and graphical interfaces, under an explicit Creative Commons attribution license. It was verified at the time of this review that the data endpoints specified in the paper exist. However, the license information does not appear to be linked to in a machine readable form in the ontology. It might be useful to provide a license triple in both the data sets and the ontology, as suggested, for example, by the Health Care and Life Science (HCLS) Linked Data Guide http://www.w3.org/2001/sw/hcls/notes/hcls-rdf-guide/#Q10. The data set was also submitted to a national data archive for long-term preservation, and the authors explicitly commit to maintaining long-term access to the data. This step provides confidence that the data sets will be available beyond the lifetime of this project. The authors also explicitly addressed dataset adoption by domain collaborators and have involved some of the original data set creators in evaluating overall utility. This is an important step in creating community buy-in to the utility of a linked open data approach.

Data quality concerns were explicitly discussed in the paper, as well as the steps taken by the authors to mitigate potential quality issues. They have included provenance information using the W3C PROV vocabulary, which is an important step towards understanding the quality of individuals within the data set. One potential issue, as stated by the authors, is that the international data sets may use different units of measure; the authors have made these units explicit in their RDF data representation. It may also be useful to link to other recommended unit ontologies such as QUDT. For example, the QUDT definition for kilowatthour, http://www.qudt.org/qudt/owl/1.0.0/unit/Instances.html#Kilowatthour, could be used to make it clear that this is a unit of energy, with an explicit link to unit:Year365Day to specify the time period. While outside the scope of the current work, the authors may want to consider modeling the methodology (sampling), workflows and procedures used in the original data sets, to provide a method of discovering inconsistencies and irregularities between data values that represent the same conceptual measure but differ in collection methodology. Also, the authors may want to consider using concepts from the Data Cube Vocabulary to capture sampling information and dimensions in future versions of the data set. Such an approach may facilitate deeper analysis, particularly if the linked data approach sees wide adoption and analysis of larger data collections becomes desirable.
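
The unit-linking suggestion above could be sketched roughly as follows. The snippet emits N-Triples for one quantity value, with a numeric value, a QUDT unit, and a time period. The namespace URIs and the `numericValue` and `unit` property names are assumptions that should be checked against the QUDT release actually used; `ex:period` and the value URI are hypothetical names introduced only for this illustration, and the amount is made up.

```python
# Assumed QUDT namespaces; verify against the QUDT version in use.
QUDT = "http://qudt.org/schema/qudt#"
UNIT = "http://qudt.org/vocab/unit#"
# Hypothetical namespace for a property that states the reporting period.
EX = "http://example.org/vocab#"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def quantity_triples(value_uri, amount, unit_name, period_name):
    """Return N-Triples lines describing one quantity value."""
    return [
        f'<{value_uri}> <{QUDT}numericValue> "{amount}"^^<{XSD_DOUBLE}> .',
        f"<{value_uri}> <{QUDT}unit> <{UNIT}{unit_name}> .",
        f"<{value_uri}> <{EX}period> <{UNIT}{period_name}> .",
    ]


# Made-up saving of 12000 kWh per year, attached to a placeholder URI.
for line in quantity_triples(
    "http://example.org/measure/1/saving", 12000.0, "Kilowatthour", "Year365Day"
):
    print(line)
```

Linking each value to shared unit resources in this way would let a client tell energy amounts from other quantities, and annual figures from other periods, without parsing free-text labels.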