COLINDA - Conference Linked Data

Tracking #: 455-1632

Selver Softic

Responsible editor: 
Jens Lehmann

Submission type: 
Dataset Description
We introduce a new LOD (Linked Open Data) Cloud member COLINDA (COnference LInked DAta) which exposes information about scientific events like conferences and workshops for the period from 2007 up to 2011. COLINDA also includes time and venue information of the scientific events, which is interlinked to the GeoNames Linked Dataset. The main sources of COLINDA are WikiCfP and Eventseer. COLINDA holds information about conferences from all over the world and contains information about 6000 scientific events generating around 140000 triples. More than 25000 new conferences are to come. This paper provides an introduction to the conference linked dataset, and demonstrates its applicability for adoption of web 2.0 into science, also known as Research 2.0 or Science 2.0.

Solicited Reviews:
Review #1
By Aba-Sah Dadzie submitted on 29/May/2013
Major Revision
Review Comment:

COLINDA is a new addition to the LOD cloud, containing geo-referenced data about scientific conferences and workshops from 2007-11, based on calls for papers sourced from WikiCfP and Eventseer.
The authors describe the data sources and the conversion process, and conclude with an example of usage. A key limitation identified is incomplete or missing geolocation information for a fair part of the dataset.

I have one key concern - the population of the dataset - the authors note that publication approval from Eventseer is still pending - is there a fallback option if this is not obtained? Also - they note "parts of WikiCfP data are still in processing stage" - where and what is the bottleneck in processing, and is it on WikiCfP or COLINDA?

The COLINDA website provides a map that can be used only to browse conference acronyms and location (city, country). Linking each point to the complete entry in RDF would be useful. Even using the acronyms on the map I was unable to retrieve information for any conference (I tried a few before giving up) from the forms and endpoint on the website, or using the URIs as indicated in section 2.5, other than the example provided on the website. However, trying the data dump from datahub works - I noticed the URIs are different (paper, COLINDA website and datahub).

Røst et al. have a paper under review, available on the SWJ website, for the first call for linked dataset descriptions: 'Eventseer: "Calls for Papers" as Linked Data'.
While I acknowledge this is not yet published, the current version of the submission and the data are publicly visible and very relevant to this work. What does COLINDA provide over other similar existing services, and especially the Eventseer linked dataset (seeing as it makes use of this data)?
Some information IS given later in the paper, about connections to Linked Science and Research 2.0 - if these are key they should be mentioned in the introduction. A brief summary of the requirements of these two fields which COLINDA satisfies would be useful.

************ SWJ Linked Dataset description requirements ************

* Name, URL, version date and number, licensing, availability, etc.
Licensing information for the source data is provided - however, the license under which COLINDA itself will be provided is not stated, nor is the version date or number.

* Topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.
The authors state that the initial intention of COLINDA was to provide "tag based identification system for scientific events" - did this change? If so, what to? If not, consider "primary" (assuming this was what was meant) rather than "initial". Also, what exactly does "tag based identification" mean? Is this related to the (implied definition of a) hash tag in S2.2 - which is incorrect, as they do not have a '#' prefix - the examples given are terms or concepts, or acronyms. Further, the corresponding example for DBLP is a path. I would reword this to say e.g., acronym/term/concept - which is easily extended to a hashtag. OR provide an unambiguous definition of "tag" here.

The information extraction & data processing carried out is described, based on the ontology model followed.
The dataset is currently incomplete - by the authors' own estimation, approximately 1/6 of their target dataset has been processed and is currently available. However, which sections are complete/available is not clear - (in S3.6) - "As it can be seen in table 3 currently most conferences date from 2008 and 2009 since those dumps from WIkiCfP has been imported completely yet." - "yet" implies there is still data to be imported for those years - however the beginning of the sentence implies that all but these two are yet to be imported.
Also, are there plans to update COLINDA? - the end date reported is 2011 - however this submission is mid 2013. Especially considering the use case mentions Twitter as a source of affinity data, 2011 is woefully out of date.

S3.3 - are the processes to retrieve a single instance via REST and the full dataset via the SPARQL endpoint independent? This seems to be the case, if so isn't it inefficient? Why is the data not simply stored in the RDF store only? Also, the data processing involves a number of steps, first a conversion to CSV, before dumping into a MySQL database, then the conversion to RDF on demand - why so many intermediate steps?
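To make the question about intermediate steps concrete: a direct conversion from a CSV dump to RDF can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The column names (acronym, title, year), the base URI and the year property are hypothetical and are not taken from the paper or from the COLINDA implementation.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI and year property, for illustration only.
BASE = "http://example.org/conference/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX_YEAR = "http://example.org/schema#year"


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def csv_to_ntriples(text):
    """Turn CSV rows (assumed columns: acronym, title, year) into N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        # Percent-encode the parts that end up inside the subject URI.
        subject = f"<{BASE}{quote(row['acronym'])}{quote(row['year'])}>"
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(row["title"])}" .')
        lines.append(f'{subject} <{EX_YEAR}> "{escape_literal(row["year"])}" .')
    return lines


sample = "acronym,title,year\nXYZ,Example Workshop on Linked Data,2010\n"
for line in csv_to_ntriples(sample):
    print(line)
```

Output of this kind could be loaded into an RDF store in one step, which is why the need for the MySQL stage and on-demand conversion deserves an explanation.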

* Metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth.
Statistics provided only at a high level - in independent tables listing conference counts by year and country. See point above on statistics, and below on internal connectivity. Ontologies reused are mentioned, with a brief discussion of relevance/applicability. However, information on coverage of concepts other than year and country/location, and relations in COLINDA's model is missing.

* Examples and critical discussion of typical knowledge modeling patterns used.
* Known shortcomings of the dataset.
The model used is described; there is however no comparison with previous work or recommended practice. Further, the discussion is in some parts difficult to follow - key being the extraction of location information.
In S3.4 - owl:sameAs and swrc:location are two different things - why do the authors see them as equivalent? From Fig.1 - swrc:location points from the RDFType for the conference to a location - which makes sense - owl:sameAs here would not make sense.
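The distinction can be made concrete with a small Python sketch that treats triples as plain tuples. The conference and place URIs are hypothetical, and the SWRC namespace is assumed; only the semantics of owl:sameAs (both URIs denote one and the same resource) is standard.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SWRC_LOCATION = "http://swrc.ontoware.org/ontology#location"  # namespace assumed

conference = "http://example.org/conference/XYZ2010"  # hypothetical URI
place = "http://example.org/place/some-city"          # hypothetical URI


def identity_classes(triples):
    """Group resources that owl:sameAs declares to be one and the same."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        find(s)
        find(o)
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)  # merge the two identities

    classes = {}
    for node in parent:
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


# swrc:location relates two distinct resources: an event and where it is held.
print(len(identity_classes([(conference, SWRC_LOCATION, place)])))
# owl:sameAs would instead assert that the event *is* the place.
print(len(identity_classes([(conference, OWL_SAME_AS, place)])))
```

With the location property the conference and the place stay two separate individuals. With owl:sameAs they collapse into one, so every statement about the city would also hold for the conference.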

The discussion about dissemination of locations is confusing. What is the "top 5 conference count"? And what is the proportion of this to the complete dataset? The point made about it being " ...only 1/6 part of the whole locations contained" is not very meaningful without this. Fig.2 on the other hand shows the majority of instances to contain location information.

* Usage - the potential to use this dataset to provide data for mashups is proposed.
The authors give one example of use - the Researcher Affinity Browser - however it isn't obvious what contribution COLINDA makes to the browser. Also, is 4.1 the authors' extension to this browser, or simply an explanation of how it works? If the latter it would be more useful if it was used to illustrate clearly where it works with COLINDA.

************ Other points

Fig 1 is a bit confusing - how is rdf:type a node, and a central one at that? Should id not be followed rather by the conference ID?

WikiCfP and Eventseer are NOT web "pages" but sites.

"Such pages can be considered as scientific event announcement pages editable by the users with archiving character. " - this sentence is confusing - too many things are being said in one sentence. Assuming "users" are people who submit new events, what do they edit? Can they continue to edit post submission? How is "archiving character" relevant here - who/what is archiving - the users or the sites?

"Eventseer contains according the latest infromation4 information about around 21000 events ..." - would suggest changing the first "information" to data or report.

"Scientific events from both pages date from 2002 up to now ..." - implies it extends to the time of reading, rather than the point when the paper was written. Although even the latter would be incorrect - the paper states data ranges from 2007-2011.

"Listing 1 shows a simple entry from an WikiCfP data dump that was used to create instances from COLINDA ..." - do you mean FOR rather than FROM?

"Mesh Ups" - used several times - should this be "mashups"?

It would be useful to show the results of running the query in listing 3.

"... possible appliance case ..." - appliance incorrect, should be " ...possible APPLICATION" or "possible (use) case"

Fig. 3 is too small for the reader to follow the text description of use - it is only legible at very high magnification on-screen.
"... special affinity criteria ..." - does this mean a way of measuring similarity between researchers? - affinity used in this way is unusual, and simply makes the reader's job harder. Even simply showing examples of the affinities in the snapshot would be helpful - this part of the snapshot is hidden.
How is the video of the browser relevant to COLINDA?

Citations & References

Wrt Berners-Lee's "5-star" - should really cite the article itself, and add the link to the diagram as more detail.

Presentation, spelling & grammar

The paper is difficult to follow, mainly because of the presentation. An auto spelling and grammar check IS needed, but more importantly, a proofread. The issue is not that the authors may not have English as a primary language, but more that the paper appears not to have been read through to pick up basic errors and ensure readability.
I would also suggest the authors read the criteria for this paper type and ensure they've answered each of the key requirements - I don't doubt that this has been done, but rather that the information is not provided so the reader easily understands it. Papers from the first call should also give pointers to this.

Review #2
By Axel Polleres submitted on 25/Jun/2013
Review Comment:

This paper describes the COLINDA dataset, generated from WikiCFP and Eventseer and interlinking it with geonames.

The reasons for my negative assessment for this special issue are as follows:

1) the paper/dataset does not yet demonstrate real applications (except a demo application apparently by the authors themselves).

2) the dataset could do much more in terms of interlinkage: e.g. DBLP is mentioned as related, but no attempt is made to link to it. Likewise, is not linked.

3) the sustainability strategy for keeping the dataset alive is not clearly defined.

Summarizing, as this special issue should focus on sustainable linked datasets with clear applicability and adoption, this work seems too premature to be published in a journal as of yet.

Review #3
By Tomi Kauppinen submitted on 26/Jun/2013
Review Comment:

The paper describes a dataset about conferences as Linked Data. This sounds very promising, and indeed very usable if done well. Unfortunately the result is not at an acceptable level because of the following issues.

I tried the sample SPARQL query shown in Listing 3 in the but the result was "No result bindings specified. in ARC2_SPARQLPlusParser". The paper also promises that "Further executable samples of SPARQL queries can be found at [endpoint]" but there are none. In section 3.5 the URI design is explained. From that one expects that e.g. would be resolved. But it is not, result is 404. Am I missing something?

Also, how well does the GeoNames mapping work? I checked the COLINDA web site, especially the front page visualization "Conferences in Europe since 2006 contained in COLINDA". There are clearly some locations outside Europe, e.g. in the Mexico area. Clicking them, however, shows that those conferences are in fact in Spain. So my concern is to what extent the mapping works, and what its precision is. Can you open this up?
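One coarse way to quantify such mapping errors is a plausibility check that compares the country stated for an event with a bounding box for that country. The Python sketch below only illustrates the idea. The bounding box is a rough, hand-picked assumption, and the records are hypothetical rather than taken from COLINDA.

```python
# Rough, hand-picked bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# The values are approximate assumptions, wide enough to include the Canary Islands.
COUNTRY_BOXES = {"Spain": (27.0, 44.0, -18.5, 4.5)}


def implausible_locations(records):
    """Return acronyms of records whose coordinates fall outside their stated country."""
    flagged = []
    for rec in records:
        box = COUNTRY_BOXES.get(rec["country"])
        if box is None:
            continue  # no reference box, so nothing can be said
        min_lat, max_lat, min_lon, max_lon = box
        inside = min_lat <= rec["lat"] <= max_lat and min_lon <= rec["lon"] <= max_lon
        if not inside:
            flagged.append(rec["acronym"])
    return flagged


# Hypothetical records: one geocoded near Madrid, one near Mexico City.
records = [
    {"acronym": "CONF-A", "country": "Spain", "lat": 40.42, "lon": -3.70},
    {"acronym": "CONF-B", "country": "Spain", "lat": 19.43, "lon": -99.13},
]

flagged = implausible_locations(records)
print(flagged)
print(f"plausible share: {1 - len(flagged) / len(records):.2f}")
```

A check like this only catches cross-country mismatches, so the share it reports is an upper bound on the real precision of the mapping.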

Finally, for me it was unclear how the "Researcher Affinity Browser" is connected with COLINDA. Clearly it could benefit from linked conference data, but is it now using COLINDA at all? If not, then it is out of the focus of this paper. Could you be more specific about the connection?

Based on the above issues about the data availability and its description I cannot recommend accepting the paper for publication.

Minor comments:

- "." missing at the end of the abstract.
- Check the sentence "Data published currently in COLINDA is extracted from the dumps of WikiCfP2 and data has been extracted via JSON interface from Eventseer."
- Correct sentence "This process is more strict by Eventseer then by WikiCfP"
- Correct sentence "Eventseer contains according the latest infromation information ...."
- Correct "...microblog data Mesh Ups..." (should be mashups)
- "seem to be very intuitive base" --> "seem to be a very intuitive base"
- Figure 2 is not very informative. It would be better to simply state in the text the percentage of conferences that had the location tag.