Linked Web APIs Dataset: Web APIs meet Linked Data

Tracking #: 1144-2356

Milan Dojchinovski
Tomas Vitvar

Responsible editor: 
Rinke Hoekstra

Submission type: 
Dataset Description
Web APIs have enjoyed a significant increase in popularity and usage in the last decade. They have become the core technology for exposing functionalities and data. Nevertheless, due to the lack of semantic Web API descriptions, their discovery, sharing, integration, and the assessment of their quality and consumption are limited. In this paper, we present the Linked Web APIs dataset, an RDF dataset with semantic descriptions of Web APIs. It provides semantic descriptions for 11,339 Web APIs, 7,415 mashups and 7,717 developer profiles, which makes it the largest available dataset from the Web APIs domain. It captures the provenance, temporal, technical, functional, and non-functional aspects. We describe the Linked Web APIs Ontology, a minimal model which builds on top of several well-known ontologies. The dataset has been interlinked and published according to the Linked Data principles. We describe several possible usage scenarios for the dataset and show its potential.

Major Revision

Solicited Reviews:
Review #1
By Enrico Daga submitted on 12/Nov/2015
Major Revision
Review Comment:

The article describes the Linked Web APIs dataset, an RDF dataset including semantic descriptions of Web APIs. The reference data source of these descriptions is ProgrammableWeb, a platform for publishing information about Web APIs, mashups that reuse these APIs, and associated web developers. The descriptions are based on the Linked Web APIs Ontology, a minimal model which reuses a number of well-known ontologies. The descriptions of the items reflect the data structure of ProgrammableWeb and include a description of mashups in the form of provenance (relying on PROV-O), functional and non-functional properties of the APIs (usage limits, fees, etc.), protocols and formats, and temporal information.
An interesting set of use cases is provided, including Web API recommendation and temporal analysis of the usage of the data.

No details are provided about the life-cycle of this information: what was the method of creation and maintenance?
Another missing aspect is the license of the dataset. It is known that the data offered by ProgrammableWeb have usage restrictions. However, this aspect is not mentioned in the article, and the authors should at least discuss how these restrictions will affect their users. I invite the authors to explicitly refer to the five star model and discuss how this relates to the Linked Web APIs dataset.
Except for the references to the (rather interesting) use cases that the authors also implemented in a number of papers, no evidence is given of other usages.
However, this can be accepted given that the dataset is relatively young.

An important aspect, in my opinion, is the quality of the data. It would be interesting for the article to discuss whether the quality of ProgrammableWeb as a resource has been assessed. The fact that it is the largest repository of APIs is important, true, but it is not the only important thing.
Related to that, the authors write several times that this dataset is “the first and the largest”, which sounds a bit silly. If it is the first, then it cannot be the largest, but this is obviously a minor thing.

In summary:
(1) Quality and stability of the dataset - evidence must be provided.
Indeed, a discussion about the quality of the data should be added to the article to be accepted.

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses.
This is demonstrated sufficiently; the fact that there are no third-party users yet can be accepted, in my opinion.
A license statement should also be added, along with a discussion of the dataset in the context of the five star open data initiative.

(3) Clarity and completeness of the descriptions.
The article is very clear in that respect.

Review #2
By Christoph Lange submitted on 24/Nov/2015
Minor Revision
Review Comment:

This paper presents a dataset about Web APIs, which has been generated from the ProgrammableWeb directory website by screen-scraping, and furthermore interlinked with a few existing linked datasets. The paper …

* clearly motivates the need for such a dataset,
* explains the data source reasonably well,
* explains the ontology, which has been designed for this purpose, very well,
* explains the URI naming scheme and some statistics about the dataset,
* covers the interlinking, and
* presents as many as five (5) use cases, whose practical relevance is pointed out clearly.

It is thus a feature-complete dataset paper reflecting solid work and should therefore be accepted – with minor revisions.

The most frequent type of mistakes (spelling and grammar) will be easy to fix, probably with help from a native speaker. The slight lack of detail in some places will also be easy to improve, by elaboration on the following aspects:

* section 2 "data source": please comment on your screen-scraping approach. How sustainably will it ensure further updates of your dataset? In other words, how frequently does the structure of the HTML source change?

* section 3 "ontology":
  * regarding provenance, in addition to who created an API or mashup, isn't it also relevant who created the ProgrammableWeb entry for a dataset? Is such information available from ProgrammableWeb?
  * for some properties, whose ranges are not self-evident, a few comments on their range would be appreciated. E.g., how exactly do you represent usage fees? I suppose that in many cases this information will have quite a complex structure.

* section 4 "dataset":
  * you explain how you created your own dataset, with all information harvested from ProgrammableWeb in a _central_ place. However, now that your ontology is available, it will enable Web API and mashup maintainers to make their mashups self-explaining, by publishing _decentral_ RDF records at the same domain from which the API/mashup is available. Could you provide some recommendations on how they should do this?
  * while you do explain your inspiration for the URI format with appended type information such as "..._api" (roughly following the naming scheme of Wikipedia's disambiguation pages), I could imagine that URIs containing the type information as a path component (e.g. ".../api/...") would also be appropriate. Could you discuss this potential alternative?

* section 5 "interlinking": Why did you not use rule-based interlinking tools, such as Silk or LIMES? Wouldn't this have made the job easier?

* section 6 "use cases":
  * what exactly do you mean by "personalised recommendation of Web APIs"? Even though this is covered in your other publications, please provide a little bit of explanation on the setting in which you provide such recommendations, and on who is the target audience of these recommendations.
  * I wonder whether the queries that use prov:generatedAtTime make sense. If ProgrammableWeb does not record the history of versions of an API/mashup, then this probably effectively has the semantics of "last updated on ". Also, your ontology does not cover version histories.

* section 7 "future work": are you planning to consider other standards as well, such as WSDL or WADL?

Please find attached an annotated PDF with detailed comments.

Review #3
By Tobias Kuhn submitted on 27/Nov/2015
Review Comment:

This paper introduces an ontology and a dataset about Web APIs represented as
Linked Data. The data is obtained from ProgrammableWeb and transformed into
RDF. A number of use cases are presented that demonstrate how the dataset can be
used to provide recommendations for API developers and to perform temporal data
analysis.

In general, the paper is well written and easy to understand, and the presented
ontology and dataset seem to be well structured and properly implemented.
However, I think that the paper fails to provide convincing arguments with
respect to the dataset's relevance, usefulness, and uptake. The presented use
cases deal only with data analysis (for recommendations and for insights into
the history of Web APIs), whereas I would see the main impact of formal
representations of Web APIs in applications such as automated API composition,
discovery, and orchestration. In this context, it would be interesting to learn
about the connections to existing technologies such as SADI and existing
proposals on semantic Web Services. Specifically, I believe the paper only
achieves number (3) of the three criteria for such kinds of papers, as defined
by the editors:

> (1) Quality and stability of the dataset - evidence must be provided.

The ontology and the dataset *seem* to be of good quality and to be stable, but
no evidence is provided for that.

> (2) Usefulness of the dataset, which should be shown by corresponding
> third-party uses - evidence must be provided.

The dataset might be useful for interesting tasks such as automated composition
and discovery (see above), but the paper doesn't provide convincing use cases,
doesn't mention third-party uses, and provides no evidence in this respect.

> (3) Clarity and completeness of the descriptions.

The description of dataset and ontology is clear and complete (except for the
missing information according to points (1) and (2)).

I therefore believe that the paper should not be accepted, because it does not
meet the defined criteria.

As a last remark, the paper would benefit from spell-checking and proofreading,
as there are many small mistakes, some of which are listed below:

"They have became" > "They have become"

"utlize" > "utilize"

"mashups developers" > "mashup developers"