A Linked Data Wrapper for CrunchBase

Tracking #: 1443-2655

Michael Färber
Carsten Menne
Andreas Harth

Responsible editor: 
Jens Lehmann

Submission type: 
Dataset Description
CrunchBase is a database about startups and technology companies. The data can be searched, browsed, and edited via a website, but is also accessible via an entity-centric HTTP API in JSON format. We present a wrapper around the API that provides the data as Linked Data. The wrapper provides schema-level links to schema.org, Friend-of-a-Friend and Vocabulary-of-a-Friend, and entity-level links to DBpedia for organization entities. Further, we describe how to harvest the RDF data to obtain a local copy of the data for further processing and querying that goes beyond the query facilities of the CrunchBase API. Our Linked Data API for CrunchBase and a previous version of it have already been used in two cases, whereas our crawled CrunchBase RDF data set has been used once for data integration and once for information extraction on text. CrunchBase has also been used twice for exploratory data analysis.

Major Revision

Solicited Reviews:
Review #1
By Konrad Höffner submitted on 26/Aug/2016
Review Comment:

# Review of "A Linked Data Wrapper for CrunchBase"

**Summary: The dataset has a high significance and proven third-party uses but the paper itself has major issues so that I see it as borderline between reject and major revisions.**

## Quality and stability of the dataset

The quality of the interlinks to DBpedia is ensured using manual evaluation of random samples.
The quality of the data itself is presented through the conversion of an established base dataset and the usage of a reasonable methodology.

URL: API URL given
version date and number: version date not given but the paper distinguishes between a first and a second version
Licensing: given for the source data, but the exact license for the RDF dataset is missing (stated as “non-commercial purposes”)
Availability: complete RDF ntriples dump, JSON-LD API, ntriples API for single resources. SPARQL endpoint not given. Ontology with VoID description available as well.

## Usefulness of the dataset

The paper extensively discusses benefits and gives promising use cases for queries and integrations, such as for job search. Even better, the dataset is already used for financial data analysis in a peer-reviewed publication.

## Clarity and completeness of the descriptions.

### Innovation
> Nowack already provided an RDF wrapper for the CrunchBase API called Semantic CrunchBase in 2008, the service is no longer available.

A [blog entry](http://bnode.org/blog/2008/07/29/semantic-web-by-example-semantic-crunch...) for Semantic CrunchBase states “The initial RDF dataset is not using any known vocabs such as FOAF (or FOAFCorp). (We can INSERT mapping triples later, though.)”.

This is a major issue. According to the above statements, the existing approach could be extended with “mapping triples” to integrate vocabularies. A thorough explanation of why the existing approach was not extended, and a completely new approach was taken instead, is essential.

### Unproven Statements
Many claims made are vague and/or not accompanied by citations. Examples:

> “Is used by millions of users”
Provide exact numbers over some time period and provide a source.

> In contrast, many professional Crunch Base users may want to formulate more elaborate queries.

“may want to” is speculation; at least reformulate, or better yet cite a source.

> Having up-to-date answers to such questions can result in better market insights.

Instead of “can”, cite a source that shows it does.

### Factual Errors

#### 5 Star Ranking
>Originally, the Crunchbase vocabulary would be a 1-star vocabulary according to Tim Berners-Lee’s star rating [...] Our CrunchBase RDF data set is a 5-star data set, as we provide our data set in RDF and link entity URIs (organizations) to DBpedia and our vocabulary URIs to other vocabularies.

I think you are confusing the 5-star deployment scheme for Open Data with the 5 stars of Linked Data Vocabulary Use, which the SWJ asks authors to discuss.

The statement that the original data has 1 star is incorrect: In the Open Data rating it would get at least 3 because it is “available in a non-proprietary open format”, i.e. JSON. I don’t have an API key but according to the CrunchBase docs, it seems like the REST interface uses “URIs to denote things, so that people can point at your stuff”, so it would get 4 stars here. If you mean the Linked Data Vocabulary Use stars, it wouldn’t even get 0 stars because it is not Linked Data.

The resulting RDF data, on the other hand, would get 4 stars because it contains links to DBpedia, but not 5 because there are no backlinks from other knowledge bases, as far as described. If there are backlinks, e.g. from Lee et al., state that clearly; the dataset would then have 5 stars.

#### JSON to RDF
> “five out of all 38 papers mention JSON as input or output data format, but only the description of the Facebook RDF Wrapper [8] describes a conversion of JSON to RDF”

It should always be an input and not an output format, as RDF should always be the output format of a method to produce an RDF dataset. Also, the claim is false, as “LinkedSpending: OpenSpending becomes Linked Open Data” is a Semantic Web Journal paper that transforms JSON to RDF.
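
To make concrete what a JSON-to-RDF conversion of this kind involves, here is a minimal sketch that turns a flat JSON object into N-Triples. The namespaces, the `permalink` key, the property names and the sample record are all assumptions made up for illustration; they are not taken from the CrunchBase wrapper, the Facebook RDF Wrapper or LinkedSpending.

```python
import json

# Hypothetical namespaces, chosen only for illustration.
BASE = "http://example.org/crunchbase/"
VOCAB = "http://example.org/vocab#"


def escape(value):
    """Escape a string so it can be used inside an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def json_to_ntriples(document):
    """Turn a flat JSON object with a 'permalink' key into N-Triples lines."""
    record = json.loads(document)
    subject = f"<{BASE}{record['permalink']}#id>"
    triples = []
    for key, value in sorted(record.items()):
        # The permalink only identifies the subject; null values carry no data.
        if key == "permalink" or value is None:
            continue
        triples.append(f'{subject} <{VOCAB}{key}> "{escape(str(value))}" .')
    return triples


# Made-up sample record (JSON is the input, RDF is the output).
sample = '{"permalink": "example-org", "name": "Example Org", "city": "Karlsruhe"}'
for line in json_to_ntriples(sample):
    print(line)
```

A real wrapper would additionally need typed literals, nested objects and links between entities, which this sketch deliberately leaves out.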

### Formal Criteria and Writing
I did not find any error with grammar, formatting and spelling.
The writing is sloppy at times, though, with phrases such as “the topic of CrunchBase is a bit special”.
Many sentences are unnecessarily verbose; fixing this could help to achieve the 10 page limit, which is exceeded by half a page right now. For example, consider the following passages:

> CrunchBase was founded in 2007 by Mike Arrington, the founder of the TechCrunch weblog, to track data about startups covered in posts. Nowadays, CrunchBase is used by millions of users to track the fast-changing world of startups.

Reads like an advertisement. Not necessary to know who the founder is. Compress to one sentence.

> According to the authors, the reported work has been well-accepted at several public events and conferences such as the 26th XBRL conference.

Unnecessary. The reference already tells the reader that it is a peer-reviewed publication.

> “For this information, we queried our CrunchBase RDF data which we retrieved via our Linked Data API. See Section 3 for more information”

Unnecessary. This is clear.

## Other questions and comments

>“Should we invest in startup X?”

If users edit CrunchBase themselves, how is abuse prevented, such as misrepresentation of one's own company?

>Such a query, formulated in natural language, might be: “Which companies existing at most 5 years have been acquired for more than 1 bn USD?”

You could thus add Semantic Question Answering to future work.
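
A question answering system would have to map such a question to a structured filter. The following minimal sketch shows the target form, assuming that “existing at most 5 years” refers to the company's age at the time of acquisition. The records and field names are made up for illustration and are not CrunchBase data.

```python
from datetime import date

# Made-up records; the field names are assumptions, not the CrunchBase schema.
companies = [
    {"name": "Alpha", "founded": date(2011, 3, 1),
     "acquired": date(2014, 6, 1), "price_usd": 2_000_000_000},
    {"name": "Beta", "founded": date(2005, 1, 1),
     "acquired": date(2015, 1, 1), "price_usd": 3_000_000_000},
    {"name": "Gamma", "founded": date(2012, 1, 1),
     "acquired": date(2015, 1, 1), "price_usd": 500_000_000},
    {"name": "Delta", "founded": date(2013, 1, 1),
     "acquired": None, "price_usd": None},
]


def young_billion_acquisitions(records, max_age_years=5,
                               min_price_usd=1_000_000_000):
    """Names of companies acquired for more than min_price_usd while at
    most max_age_years old (age measured at the acquisition date)."""
    result = []
    for record in records:
        if record["acquired"] is None:
            continue  # never acquired
        age_years = (record["acquired"] - record["founded"]).days / 365.25
        if age_years <= max_age_years and record["price_usd"] > min_price_usd:
            result.append(record["name"])
    return result


print(young_billion_acquisitions(companies))
```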

> we have implemented a Linked Data API as wrapper around the publicly available CrunchBase REST API;
> the official CrunchBase API is only accessible with an API key

Please clarify, is it freely available or do you need an API key potentially with costs?
If it is the latter, are there legal problems with openly publishing it as RDF? Judging from the website, it seems like organizations and people are freely available while product information costs money. You state that it is available for non-commercial purposes; does that include products, etc., that are only available from the API with a key?

>This confidence value is encoded in binary format [...]

I would suggest “is encoded as a bit array” or “bitset”.

Review #2
By Marta Sabou submitted on 12/Sep/2016
Minor Revision
Review Comment:

This paper describes a novel Linked Data dataset obtained from the CrunchBase online platform which contains information about (primarily US-based) startups and technology companies. The paper describes both a wrapper for obtaining Linked Data from CrunchBase as well as an example dataset that was obtained by the authors and shared with the community. This overcomes the fact that access to the entire CrunchBase dataset is license protected and requires an access key which needs to be provided to the wrapper. Although the paper is classified as a “Dataset Description paper”, its strong focus on the wrapper itself gives it a flavour of a “Tools” paper as well. In my review, I however only judge it from the criteria relevant for the “Dataset Description” category.

(1) Quality and stability of the dataset – Medium.

Based on the description given by the paper, the produced data has a high quality – it uses a suitable ontology, reuses existing vocabularies, has a well-designed URI structure, includes links to other datasets. Nevertheless, I rated this criterion only as “Medium” because the RDF data set is only available as a zip file. This data should also be provided through a SPARQL endpoint and a Linked Data interface which provides access to dereferenceable URIs. The lack of such access to the data prompted me to rate the paper with “Minor revisions” as opposed to “Accept”.

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses.

The paper provides convincing examples of how the dataset and the wrapper were used in various application use cases, also by third parties. Additionally, I think that this dataset nicely diversifies the currently available Linked Data datasets.

(3) Clarity and completeness of the descriptions.

The paper is well-written and easy to read. It is very clear how the dataset was designed, created and used. Additionally, the authors have shared much of this work with the community, including the list of DBpedia mappings, their ontology and the wrapper code, which is very positive. Suggestions for small improvements include:
• Please specify the exact number of links to schema.org (on p6)
• Since mappings to people and products mentioned in DBpedia were not included in the dataset, I would suggest removing sections 2.4.2 and 2.4.3 - also to reduce the paper to the expected 10-page limit.

Review #3
Anonymous submitted on 17/Sep/2016
Minor Revision
Review Comment:

The paper describes the Linked Data version of the CrunchBase database, with technical details on how the extraction and mapping process works. The paper is well-written. Several required aspects of the data set are clearly presented. The experience of building the dataset is reasonably shared and discussed. Some reported usages via example applications and papers are also provided. However, I have the following concerns.

The paper does not clearly mention how to find the SPARQL endpoint of the dataset. I had to find a SPARQL endpoint on the web page http://km.aifb.kit.edu/services/crunchbase/ by looking into the HTML code. I tried some queries and they worked out quite nicely; I think such details should be clearly mentioned.

I’m a bit confused by Table 3, which maps some concepts to Schema.org, especially “schema:Organization”. When I retrieved a first piece of data from http://km.aifb.kit.edu/services/crunchbase/api/people/mark-zuckerberg#id, the snippet I got used the vocabulary defined by the paper (prefix cb:), but there was no sign of schema:Organization.

Some other aspects, listed below, also need to be discussed in more detail:

1. Metrics and statistics on external and internal connectivity: there is no dedicated paragraph or section discussing this.
2. There is no clear discussion of the expressivity and complexity of the modelling languages used.
3. Section 2.3 describes the choice of vocabularies used to model the CrunchBase API data, with a diagram in Figure 3. However, the critical points and shortcomings of the design pattern should be discussed further.
4. “Five star rating” is not discussed in the paper.

- The phrase “whether there is a corresponding entity in DBpedia” appears twice in the same place.

- I spotted a broken link on the web page http://km.aifb.kit.edu/services/crunchbase/: in the line “Check out our page with some example queries on the Crunchbase dump”, the link under “our page” leads to the wrong website.