Publishing Bibliographic Data on the Semantic Web using BibBase

Paper Title: 
Publishing Bibliographic Data on the Semantic Web using BibBase
Authors: 
Reynold S. Xin, Oktie Hassanzadeh, Christian Fritz, Shirin Sohrabi and Renée J. Miller
Abstract: 
We present BibBase, a system for publishing and managing bibliographic data available in BibTeX files. BibBase uses a powerful yet light-weight approach to transform BibTeX files into rich Linked Data as well as custom HTML code and an RSS feed that can readily be integrated within a user’s website, while the data can instantly be queried online on the system’s SPARQL endpoint. In this paper, we present an overview of several features of our system. We outline several challenges involved in on-the-fly transformation of highly heterogeneous BibTeX files into high-quality Linked Data, and present our solution to these challenges.
Submission type: 
Tool/System Report
Responsible editor: 
Guest Editors
Decision/Status: 
Accept
Reviews: 

This is a revised submission following a "reject and resubmit", which has now been accepted. The reviews below are for the resubmission, followed by those of the original submission.

Reviews for the resubmission:

Solicited review by Kai Eckert:

I still think that BibBase needs much more conceptual work and especially a clearer message to the user about its goals. However, this is an interesting approach, and hopefully the publication not only motivates you but also brings you additional feedback that helps you to optimize BibBase. All in all, I would accept the paper now.

Solicited review by Antoine Isaac:

I am happy with the modifications the authors made to the first submission.

There is still no evaluation report, and the wiki page of the project still presents old material. But it is good to see that the authors are still working and supporting their system, and willing to advertise it to community pages like http://www.w3.org/wiki/ConverterToRdf

Reviews for the original submission:

Solicited review by Kai Eckert:

The authors present BibBase, a web application that fetches bibliographic entries (Bibtex) from external URLs, transforms the data into linked data and provides several views on the data: RDF, Bibtex and HTML pages. The latter are intended to be used for integration into own webpages. A SPARQL endpoint is provided. Fetching from Mendeley is announced, but did not work at the time I tested the service.

I would like to start this review in an unusual way: BibBase works as described, the description is solid, the topic relevant, so from a formal point of view, I would accept the publication. It is appropriate for a tools description and certainly not too long.

I rather have some concerns with (the presentation of) BibBase than with the article to be reviewed, which I would strongly recommend to improve until the final publication of the article:

1. The welcome page is not very welcoming and does not explain much. Besides the fact that Mendeley did not work, it is unnecessary to make potential users create a URL instead of just letting them insert the URL of their Bibtex file in a form field.

2. At first I missed a possibility to upload a Bibtex file. I understand now that the explicit strength of BibBase is to work with external sources and as such, probably the missing feature to host a bibtex file is intended. Nevertheless, as the files are cached on your server anyway, I would provide this possibility, probably it would not hurt.

3. A nice addon would be to assist Bibsonomy users to select Bibsonomy Bibtex exports. Also not a big deal, but would welcome Bibsonomy users nicely (like it seems to be intended for Mendeley users).

4. I don't quite understand the effect of the login. Is it really only that you value votes and corrections higher? That is confusing. When I log in, I expect that I can at least do something with "my" data, but I could not find a difference: "my" data is just cached Bibtex files, like everyone else's. Here I see a huge potential to help logged-in users curate their own data, e.g. by selecting the desired versions of the various Bibtex representations, maybe even merging them further, and getting back an enriched Bibtex file. If this is not desired (e.g. to keep the service simple and strictly focussed on the external sources), then I would try to communicate this in a clear way and at least let logged-in users select their own sources and provide direct hints on how to enrich these sources.

6. The Linked Data HTML representation should present the automatically derived links that are available in RDF (and described in your paper), as these might be interesting for just-browsing visitors.

7. A major concern: Please change the data license to a real open data license, at least without commercial restriction, better and more practicable would be a public domain (CC0 or PDDL) waiver. At least the users should have the choice to select an open data license (maybe directly in the Bibtex file?).

I would like to read even more about the technical details that you implemented behind the scenes, especially for automatic linking and duplicate detection. This is not necessary for a tools presentation, but properly described would make a nice full paper. To summarize, BibBase is a nice and clean approach to publish and link bibliographic data, but currently does not yet use its full potential, especially with respect to the ease of use, which the authors claim to be their main priority. I hope we will see further improvements soon. As a first measure, I would concentrate on the presentation and documentation of the whole project and make it as intuitive, appealing and usable as possible for new users.

Solicited review by Jan Brase:

The paper describes an interesting system to publish bibliographic information as linked data. The paper is clearly written and includes a good overview of the current state of the art. The whole system is very ambitious, and it remains to be proven whether it can fulfill its expectations. This concerns, for example, duplicate detection of author names, which is still a very open problem for most bibliographic data systems. It is also not clear how the stability of the links in BibBase can be guaranteed.

Generally the paper is acceptable. It could be discussed, whether a revised version of the paper at a later stage with more actual user experience and proof of concept should be considered.

Solicited review by Antoine Isaac:

The paper is well written, the Bibbase functionality described can be very useful to researchers, and the system seems to be working.

One of the first issues, however, is the situation of Bibbase wrt. existing work.
At the technical level, it would be useful to acknowledge the many Bibtex to RDF works done before, and explain what the differences are with Bibtex. To name a few:
- http://www.l3s.de/~siberski/bibtex2rdf/
- http://simile.mit.edu/repository/RDFizers/bibtex2rdf/
- http://www.cs.vu.nl/~mcaklein/bib2rdf/

Also, it is good to relate to established ontologies like BIBO. Further, the connection seems not to be implemented in a proper way. For instance I don't see that hasTitle is mapped to anything at http://zeitkunst.org/bibtex/0.1/bibtex.owl .
In the same line it could be interesting to see whether connections can be established with the more recent CiTO http://purl.org/net/cito/ from the SPAR ontologies.

More worrying is the lack of reference to the more-and-more visible VIVO project. VIVO focuses on researchers, but of course their publications are a key aspect of it, and VIVO serves linked data for it...

Another source of frustration is the part on duplicate detection and matching (e.g., to DBpedia). These are interesting, in fact they could be a key contribution of the paper. Unfortunately there is not much technical description (or links) to how they have been implemented. And no evaluation of the precision of the approaches.

Also, before the paper is accepted it should be re-written to really show what is done and what is not. There is not much point accepting a paper if more functionalities are coming in the next months, and are described in the main body of the paper (as opposed to the usual "future work"). For example it seems that users are NOW able to vote for identified duplicates. But I have not checked other "will be able"-like sentences. And no item has been published on the "news" part of wiki.bibbase.org...

Finally, a crucial aspect left untouched is the lack of visibility on longer-term plans from the project team. The proposed architecture seems quite centralized, with contributors sending their data to bibbase.org. What kind of commitment is there for this central server? Is the code openly available, for others to create their own Bibbase instances? The terms of use, which mention that retro-engineering of the code is prohibited, do not seem to be a good sign.
I also find it worrying that the resulting RDF is licensed under Creative Commons Attribution-Noncommercial-Share Alike 2.5. This means that Bibbase claims rights over data that was contributed by users. This may strongly deter the field from starting to use Bibbase...

Minor point:
- I'm not sure I understand the difference between "offline" and "online" in section 4.


Comments


We would like to thank the reviewers for their thoughtful comments and reviews. Below we have included the reviewer comments (in their entirety) and the actions we have taken to address each comment.


Sincerely,
Oktie Hassanzadeh
(On behalf of my co-authors: Christian Fritz, Renée J. Miller, Shirin Sohrabi, and Reynold S. Xin)

  • Solicited review by Kai Eckert:
    The authors present BibBase, a web application that fetches bibliographic entries (Bibtex) from external URLs, transforms the data into linked data and provides several views on the data: RDF, Bibtex and HTML pages. The latter are intended to be used for integration into own webpages. A SPARQL endpoint is provided. Fetching from Mendeley is announced, but did not work at the time I tested the service.
    I would like to start this review in an unusual way: BibBase works as described, the description is solid, the topic relevant, so from a formal point of view, I would accept the publication. It is appropriate for a tools description and certainly not too long.
    I rather have some concerns with (the presentation of) BibBase than with the article to be reviewed, which I would strongly recommend to improve until the final publication of the article:


    We truly appreciate the type of review you have provided. We would like to encourage the readers to also provide us with feedback on all the aspects of the system. We will do our best to fit all the requests for features in our to-do list and assign them to our current or future team members.


    1. The welcome page is not very welcoming and does not explain much. Besides the fact that Mendeley did not work, it is unnecessary to make potential users create a URL instead of just letting them insert the URL of their Bibtex file in a form field.


    This is one of our main concerns as well. In particular, we are working on a new data browse interface at http://data.bibbase.org/ that will replace the current list of items with a single Google-style keyword search box that supports entity lookup and fuzzy keyword search. We are currently experimenting with a number of options, including using Bimaple’s technology http://bimaple.com/, but we will release the new interface only once it is fully tested on our test server, as we would like to make sure that the system stays as light-weight as possible, so that it can support a much larger number of users and BibTeX files while maintaining all our features (please also see our response to reviewer #3 (Antoine Isaac) below on this issue).


    2. At first I missed a possibility to upload a Bibtex file. I understand now that the explicit strength of BibBase is to work with external sources and as such, probably the missing feature to host a bibtex file is intended. Nevertheless, as the files are cached on your server anyway, I would provide this possibility, probably it would not hurt.


    One of our main goals in BibBase is to provide a high-quality data source and provide provenance information (currently in the form of IPs and host URLs) so that the users can find out the source of each entry (and possibly provide feedback and report errors). Requiring the BibTeX files to be hosted elsewhere has helped us achieve this goal so far. With the new OpenID-based login, we will be able to provide the option of uploading BibTeX files. However, since it is currently possible to use other file-hosting services to upload BibTeX files, we prefer to not act as a BibTeX hosting service (such as BibSonomy, please see below).


    3. A nice addon would be to assist Bibsonomy users to select Bibsonomy Bibtex exports. Also not a big deal, but would welcome Bibsonomy users nicely (like it seems to be intended for Mendeley users).


    It is currently possible to use the BibTeX output of BibSonomy on BibBase. For example, on http://data.bibbase.org one can enter a BibSonomy user’s URL such as http://www.bibsonomy.org/bib/user/oktie which will result in the following BibBase page: http://www.bibbase.org/cgi-bin/pyBibBase/pyBibBase.cgi?bib=data.bibbase.... and provenance entity: http://data.bibbase.org/provenance/httpwwwbibsonomyorgbibuseroktie/


    4. I don't quite understand the effect of the login. Is it really only that you value votes and corrections higher? That is confusing. When I log in, I expect that I can at least do something with "my" data, but I could not find a difference: "my" data is just cached Bibtex files, like everyone else's. Here I see a huge potential to help logged-in users curate their own data, e.g. by selecting the desired versions of the various Bibtex representations, maybe even merging them further, and getting back an enriched Bibtex file. If this is not desired (e.g. to keep the service simple and strictly focussed on the external sources), then I would try to communicate this in a clear way and at least let logged-in users select their own sources and provide direct hints on how to enrich these sources.


    We certainly agree. There is indeed a huge potential to enhance the quality of data using logged-in users. We are currently working on an extension of the feedback mechanism that allows acting as an administrator for your own BibTeX files, which requires having your OpenID login on your BibTeX file as a comment. But we have not released the feature yet since this is a very sensitive one: one user can try to modify other users’ entries, and for example merge author1 with author2 without their permission. We need to keep a history of changes (to be able to revert unwanted changes), and also allow efficient communication with our users, which is what we are working on right now. Currently, we only gather feedback from regular users, but we have a number of “administrators” that have the ability to modify the data. We will be happy to make you an administrator if you send us a note!


    6. The Linked Data HTML representation should present the automatically derived links that are available in RDF (and described in your paper), as these might be interesting for just-browsing visitors.


    Currently the "See Also" section at the top of the HTML pages contains a list of such links. For example, see http://data.bibbase.org/author/oktie-hassanzadeh/ which has three links, to:
    http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hassanzadeh:...,
    http://dblp.l3s.de/d2r/resource/authors/Oktie_Hassanzadeh, and
    http://data.semanticweb.org/person/oktie-hassanzadeh


    7. A major concern: Please change the data license to a real open data license, at least without commercial restriction, better and more practicable would be a public domain (CC0 or PDDL) waiver. At least the users should have the choice to select an open data license (maybe directly in the Bibtex file?).


    We just removed the non-commercial restriction from the license. We want BibBase data to be open and are willing to consider other licenses if the current license restricts any applications of the data.


    I would like to read even more about the technical details that you implemented behind the scenes, especially for automatic linking and duplicate detection. This is not necessary for a tools presentation, but properly described would make a nice full paper. To summarize, BibBase is a nice and clean approach to publish and link bibliographic data, but currently does not yet use its full potential, especially with respect to the ease of use, which the authors claim to be their main priority. I hope we will see further improvements soon. As a first measure, I would concentrate on the presentation and documentation of the whole project and make it as intuitive, appealing and usable as possible for new users.


    We agree, and we are working on better documentation and additional features that could make BibBase appealing for a larger set of users. We have also slightly extended Sections 3 and 4 with more technical details (please see below).

  • Solicited review by Jan Brase:
    The paper describes an interesting system to publish bibliographic information as linked data. The paper is clearly written and includes a good overview of the current state of the art. The whole system is very ambitious, and it remains to be proven whether it can fulfill its expectations. This concerns, for example, duplicate detection of author names, which is still a very open problem for most bibliographic data systems. It is also not clear how the stability of the links in BibBase can be guaranteed.
    Generally the paper is acceptable. It could be discussed, whether a revised version of the paper at a later stage with more actual user experience and proof of concept should be considered.


    We also hope to report the results of the applications built on top of our data, and actual user experience especially with our newly implemented feedback mechanism in the near future. In particular, we would like to use the feedback provided by our users to evaluate duplicate detection and link discovery algorithms in a future technical paper.

  • Solicited review by Antoine Isaac:
    The paper is well written, the Bibbase functionality described can be very useful to researchers, and the system seems to be working.
    One of the first issues, however, is the situation of Bibbase wrt. existing work.
    At the technical level, it would be useful to acknowledge the many Bibtex to RDF works done before, and explain what the differences are with Bibtex. To name a few:
    - http://www.l3s.de/~siberski/bibtex2rdf/
    - http://simile.mit.edu/repository/RDFizers/bibtex2rdf/
    - http://www.cs.vu.nl/~mcaklein/bib2rdf/
    Also, it is good to relate to established ontologies like BIBO. Further, the connection seems not to be implemented in a proper way. For instance I don't see that hasTitle is mapped to anything at http://zeitkunst.org/bibtex/0.1/bibtex.owl .
    In the same line it could be interesting to see whether connections can be established with the more recent CiTO http://purl.org/net/cito/ from the SPAR ontologies.
    More worrying is the lack of reference to the more-and-more visible VIVO project. VIVO focuses on researchers, but of course their publications are a key aspect of it, and VIVO serves linked data for it...


    Our ontology is available at http://data.bibbase.org/ontology/. At the bottom of the page, there is a table that shows the relationship between this ontology and BIBO and SWRC.


    Regarding BibTeX to RDF converters, our original submission had only a link to a blog article by Ivan Herman that contained a list of such conversion tools (http://ivan-herman.name/2007/01/13/bibtex-in-rdf/). In our revision, in an attempt to avoid links that may become outdated or broken in the future, we created a permanent URL for the list of such tools and replaced the citation with that URL (http://purl.org/bibbase/other-bibtex2rdf-tools). The URL currently points to the W3C list of converters. We will keep this list and URL up-to-date.


    We also added a reference to the very interesting VIVO project.


    We see our work in BibBase as complementary to existing systems such as VIVO (and similar projects such as EPrints http://www.eprints.org/, with RDF provided by the RKBExplorer project http://eprints.rkbexplorer.com/), since it is a much more light-weight approach that does not require setting up an individual server and maintaining a software system. Maintaining such systems may be a big burden for many institutions, which may not be aware of the benefits of Linked Data and Semantic Web solutions, and also for smaller research groups, which may not have the resources required to maintain such systems. Some of the top users of BibBase currently are such smaller research groups, for example:
    http://www.isi.edu/integration/publications.php
    http://www.cs.toronto.edu/kr/publications/
    http://npl.mcgill.ca/Publications.php


    Another source of frustration is the part on duplicate detection and matching (e.g., to DBpedia). These are interesting, in fact they could be a key contribution of the paper. Unfortunately there is not much technical description (or links) to how they have been implemented. And no evaluation of the precision of the approaches.


    We have slightly extended Sections 3 and 4 in our revision with a description of the particular methods we have used, and references to our work on duplicate detection and link discovery (which includes an evaluation of the approaches). We agree that our description does not cover all the aspects of these processes, and that it could be made more complete and become a main contribution of the article. However, we omitted the details since the call for papers in the "Reports on tools and systems" category asks for “short papers” with a “brief and pointed” description of the capabilities of the system. If needed, we can further extend these sections with a more complete overview of our algorithms and techniques.


    Also, before the paper is accepted it should be re-written to really show what is done and what is not. There is not much point accepting a paper if more functionalities are coming in the next months, and are described in the main body of the paper (as opposed to the usual "future work"). For example it seems that users are NOW able to vote for identified duplicates. But I have not checked other "will be able"-like sentences. And no item has been published on the "news" part of wiki.bibbase.org...


    We agree, and have revised the paper accordingly. We had a set of implemented features (including the feedback mechanism) that were not made publicly available yet while preparing the initial submission, and therefore we used the term “will be” but outside the usual “future work” section.


    Finally, a crucial aspect left untouched is the lack of visibility on longer-term plans from the project team. The proposed architecture seems quite centralized, with contributors sending their data to bibbase.org. What kind of commitment is there for this central server? Is the code openly available, for others to create their own Bibbase instances? The terms of use, which mention that retro-engineering of the code is prohibited, do not seem to be a good sign.


    One of our main goals in the development of BibBase has been keeping the system light-weight and easy to maintain. We will not release any feature unless we make sure that we can fully support it in the long run, and on typical “commodity” servers (or hosts).


    On the other hand, our group is very well funded, and is actively working on a number of Linked Data projects (a list is available at http://dblab.cs.toronto.edu/project/linkdiscovery/). Our Linked Data sources use the same servers that the Department of Computer Science at the University of Toronto uses, so as long as our department web site http://www.cs.toronto.edu/ is up, our servers will be up and running.


    Another indication of our commitment to this project is that the HTML portion of BibBase.org has been running since 2008. We currently have more than 2,000 unique visitors per month, and have had a fairly robust and continuous growth in terms of traffic and users. We added Figure 5 and extended Section 8 in our revision with our current usage statistics. BibBase is already being used in 79 different http domains (which translates into many more users). In case you would like to verify this and get live statistics, you can use the following Unix command:

    wget -O - http://www.bibbase.org/cgi-bin/analog.cgi | cut -d '>' -f 4- | cut -d '<' -f 1 | grep http | cut -d '/' -f -3 | sort | uniq | wc -l

    (the number returned by this command at the end is the number of unique http domains).
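
    The last stages of the pipeline above do the actual counting: they keep the lines that contain a URL, truncate each URL to its scheme and host, and count the distinct values. A minimal sketch of just that part, run on made-up input, may make this easier to follow. The function name count_domains and the sample URLs are hypothetical and are not BibBase statistics; the format of the analog.cgi output itself is not assumed here.

```shell
# Count distinct "scheme://host" prefixes among the URLs read from stdin.
# These are the same final stages as in the command above.
count_domains() {
  grep http | cut -d '/' -f -3 | sort | uniq | wc -l
}

# Hypothetical sample input: three URLs hosted on two distinct domains.
sample_count=$(printf '%s\n' \
  'http://example.org/pubs/index.html' \
  'http://example.org/people/' \
  'http://example.net/lab/publications.php' | count_domains)

echo "$sample_count"
```

    Splitting on '/' and keeping the first three fields turns http://example.org/pubs/index.html into http://example.org, so two pages on the same host are counted once.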


    I also find it worrying that the resulting RDF is licensed under Creative Commons Attribution-Noncommercial-Share Alike 2.5. This means that Bibbase claims rights over data that was contributed by users. This may strongly deter the field from starting to use Bibbase...


    We have removed the non-commercial restriction as mentioned above, although our license only applies to our extensions of the users’ files (such as the links we discover) and not to the data contributed by the users. Currently, our RDF data publication framework does not allow a separate copyright notice for each file, but we will address this issue as we work on an alternative RDF server and publication framework.


    Minor point:
    - I'm not sure I understand the difference between "offline" and "online" in section 4.


    We have updated Sections 3 and 4 to clarify.