An architecture and methodologies for federated, privacy-enabled personalisation on the Web of Data

Benjamin Heitmann, Conor Hayes
Users of the Social Web have come to expect personalised services for recommendations and prioritisation. At the same time, the awareness of users for privacy issues has increased. Yet, privacy and personalisation have conflicting objectives. Users need to make their profile data available in order to benefit from personalisation. Service providers on the other hand usually require access to the maximum available amount of data from the user. These developments require new methods and architectures for personalisation which takes federated sources, structured data and privacy into account. In this article we propose an architecture for federated, privacy-enabled eco-systems based on the WebID standard and the FOAF and Web Access Control vocabularies. It enables the creation of a universal “private by default” ecosystem which enables interoperability of user profile data while protecting the privacy of the user. In addition we describe two methodologies for providing personalisation on top of the proposed architecture and the Web of Data. First we describe and evaluate a methodology for using federated, structured data for multi-source recommendations. Then we describe a methodology for exploiting data from different topic domains for cross-domain recommendations. Combined, these two methodologies enable personalisation beyond the context of a single service, by taking user profile data into account from all sources of the users social graph as well as his interest graph.
Submission type: 
Full Paper
Responsible editor: 
Major Revision

Solicited review by Eelco Herder:

This is a very well-written article that discusses many issues regarding current personalization ecosystems (centralized, closed). A framework that makes use of established Semantic Web standards is proposed to enable user-controlled user profiles and cross-application recommendations. The practical use of this framework is clearly illustrated with a use case on personal health records.

The detailed descriptions on how the architecture works in practice and its qualitative evaluation make the paper a good textbook example on how user-controlled cross-application modeling and recommendation can and should work. For example, the observation 'self hosting of a user's profile can impact the protection of his identities negatively [...]' is a valid one - and one that I haven't found in the papers on privacy by Kobsa et al.

However, I found the quantitative evaluation using Smart Radio not very convincing. The study setup is good, the results are credible, but the results show only that multi-source recommendation may be beneficial in certain scenarios. This has been shown in earlier work (as partially discussed in 6.2.2) and the study does not really address or evaluate the practicality of the architecture itself.

I found it a bit disappointing that the main solution for user control and privacy protection involves the provision of a user interface in which users allow or deny applications access to their data. Is this solution really scalable? For example, I already have a hard time figuring out which applications are connected to my Facebook profile - in your application this problem would be even worse. Did you consider other privacy-preserving techniques, such as the collection of aggregated, anonymized data, or data perturbation and obfuscation?

Even though I do like the setup of the framework, I wonder how realistic it is to expect that major sites such as Facebook, Google and Twitter will eventually adopt such an approach? It is a known fact that these companies have their reasons for sticking to their closed, user lock-in setup. Which enabling factors are still to be discovered or evaluated in your planned future work? Some more discussion on this issue would increase the scope and impact of the paper.

Solicited review by Federica Cena:

The paper proposes an architecture for the interoperability of user profiles in the social semantic web, i.e. using well-known semantic web languages (RDF, OWL), vocabularies (FOAF, SIOC) and standards for user identification and data sharing (WebID, WAC). The goal of this architecture is to provide cross-domain, multi-source and privacy-enabled personalisation.

- Relevance of the paper with respect to the special issue. I think the paper is very relevant to the scope of the special issue, since it provides a possible architecture that puts together many of the existing standards in the social semantic web in order to improve recommendation.

However, I have many concerns about the described approach.

First, the privacy-enabled personalisation claimed in the title and in the abstract is addressed only superficially.

Many of the problems in the paper are addressed in a superficial manner, without the details that are necessary to understand how the specific problem can be solved. This is particularly true in the following parts:
- How can the authors identify that two users in two different social networks are the same user? (section 4.1). The issue is very tricky and needs to be dealt with in more detail. Which algorithm do they implement? Why don't they use the OpenID mechanism for user identification?
- How does the data integration take place? (section 4.2.1) The presented approach is very naïve and works only in simple cases. How does the approach deal with more complex cases?

Finally, the evaluation is not sufficient for a journal paper. The qualitative evaluation is more like a discussion than a real evaluation of the approach. The quantitative evaluation only covers one aspect of the paper, that is the improvement of recommendation, but it does not evaluate the privacy aspect of the framework. The evaluation is also poorly described: more details are needed.

- Significance of the described research. The paper presents an architecture but does not present the problems that such an architecture could be used to solve. It is very important to start by highlighting the problems and then present the architecture as a solution that addresses them.

- Presentation and organization. From a presentation point of view, the work is clearly presented only in some parts, while elsewhere it puts heterogeneous material together in a messy way. For example,
the background and the related work overlap in some parts. Make a clear distinction between what is background, which needs to be at the beginning, and what is related work, which needs to be at the end, after the reader has read the entire paper.
At the same time, section 3.2 is, in my opinion, more related to the background than to the description of the architecture. This may be caused by the fact that the background and the requirements of the architecture are not clearly separated.
Several parts are unnecessary. For example, Fig. 1, Fig. 3 and Fig. 4 are redundant with respect to the text.

In some parts, there is a messy list of concepts at different levels of abstraction: for example, on page 5, section 2.3, the second paragraph dealing with the web of data and semantic web is very confusing.

Moreover, the paper suffers from too many repeated sentences, which makes reading the paper very tedious. To give just one example, the sentence "in addition we describe two methodologies for providing personalisation on top of the proposed architecture ..." appears on pages 2, 3, 11 and 22.

The use case is very useful and interesting, and I suggest presenting it at the beginning of the paper as a motivating scenario, then in the description of the architecture as an example of an implementation of the proposed methodology, and then exploiting it in the evaluation section. I wonder why the authors decided to present a different example as the use case for evaluation. I'd be interested to see an evaluation of this one.

- References. The references related to privacy-enabled personalisation and case-based reasoning are adequate, but some areas related to the topics of the paper are completely missing. For example, nothing is said about the "user identification issue" (see for example, the work of Carmagnola and Cena, 2009, and Dolog, 2004):
- Carmagnola, F. and Cena, F.
User Identification in Cross-System Personalisation 
Information Science 2009, pp 16-32
- P. Dolog, Identifying relevant fragments of learner profile on the semantic web, in: Workshop on Applications of Semantic Web Technologies for Elearning, in International Semantic Web Conference, Hiroshima, Japan, 2004, pp. 37–42.

The user model interoperability literature is also not covered (see for example Carmagnola, Cena and Gena, 2011, and Aroyo et al., 2006).
- Carmagnola, F., Cena, F., Gena, C.
User Model Interoperability: A survey, UMAI 2011
- L. Aroyo, P. Dolog, G. Houben, M. Kravcik, A. Ambjorn Naeve, M. Nilsson, F. Wild, Interoperability in personalized adaptive learning, Educational Technology and Society 9 (2) (2006)

Also the data integration problem is not addressed in the related work.
Other similar architectures for cross-domain and multi-source recommendation (see Berkovsky et al., 2008) are not reported. Abel et al., 2010 should also be cited.

- Berkovsky, S., Kuflik, T., Ricci, F.: Mediation of user models for enhanced personalization in recommender systems. User Model. User-Adap. Inter. 18(3), 245–286 (2008)

- Abel, F., Henze, N., Herder, E., Krause, D.: Interweaving public user profiles on the web. In: Bra, P.D., Kobsa, A., Chin, D.N. (eds.) User Modeling, Adaptation, and Personalization: 18th International Conference, UMAP 2010. Lecture Notes in Computer Science, vol. 6075, pp. 16–27, ISBN 978-3-642-13469-2. Springer, New York (2010)

Comparison with related work is very important to judge the novelty of the approach.

- General comments and specific suggestions.
The paper could be improved by a concise discussion of the research hypotheses and goals at the beginning of the paper, with the answers then presented in the evaluation sections.

The paper could also be improved by adding definitions of the main concepts used in the paper (i.e. background data, web of data, interest graphs, social graphs, multi-source recommendation, cross-domain recommendation) at the beginning of the paper. Many of the definitions come too late (section 2.3). The choice of some words should also be motivated: for example, why do the authors use the term "ecosystem"? Is there a particular reason? Please explain.

A discussion of the limitations of the approach is necessary and needs to be added to the paper.

I do not agree with the sentence "User profiles are not portable between systems, connecting to users from a different system is not possible and the user can not evade changes to the terms of service".
This sentence should be toned down, since there are several examples that contradict it (see for example, Abel 2010).

Many choices should be motivated, such as the decision to use only SIOC and FOAF and not other standards. Is FOAF enough? How do you represent the level of interest?

Section 4.1. The fact that a user contributes content does not imply an interest in the topic, but more probably an "expertise" in it. Please take this aspect into consideration.

Minor remarks.
- Page 1 ref 14 does not deal with social web
- Page 4 what is "tempo of music"?
- Page 4 explain better the difference between content-based recommendation and knowledge-based recommendation

Solicited review by Juri Luca De Coi:

I cannot find any better summary of the paper's content than (a slightly lengthened version of) its title: indeed the paper presents an architecture for portable user profiles and two methodologies for federated, privacy-enabled personalisation on the Web of Data.

Overall, I think that the paper is worth publishing. It is true that an evaluation of the second methodology would make it more complete; however, I do not think that this shortcoming by itself should motivate rejection.
A more serious concern is related to the availability of implementations of the proposed solutions: are the implementations and an environment for reproducing the evaluation results publicly available? This could heavily impact the acceptance of the proposed solutions.
My last big remark is related to the presentation: the authors tend to copy and paste text across the paper. I do not have anything against such a strategy if the text fits all the contexts it has been put in, or if repetitions effectively improve readability by recalling concepts introduced far away (typically by reproducing parts of the introduction in the conclusion). While I can tolerate the application of such a strategy when it just adds unnecessary redundancy (as is often the case in this paper), I definitely do have something against its overuse in the conclusion, abstract and (above all) introduction, which appear to be mainly a patchwork of excerpts taken from elsewhere. For this reason, I would like the authors to rewrite these sections in a more focused and essential way, without minding if the paper gets shorter than 23 pages.

Content-related issues

Abstract & Introduction, "Yet, privacy and personalisation [...] data from the user.": The first sentence states that privacy and personalisation have conflicting objectives and the second one specifies the requirements for personalisation. I expected the following sentence to contrast the previous one by specifying the requirements for privacy, but it indeed specifies further requirements for personalisation
Pg. 2, col. 1, "user lock-in and social networking data silos": Explain what "user lock-in" and "social networking data silos" are. If by "user lock-in" you mean "User profiles are not portable between systems [...]", state it explicitly (e.g., 'by "user lock-in" we mean the fact that user profiles are not portable between systems [...]')
Pg. 3, col. 2, "Current recommendation algorithms can be grouped in 4 classes": It could be a good idea to summarize the classes in a table showing: (i) background data; (ii) input data; and (iii) how the recommendation algorithms use them
Last line of pg. 3, "the ratings between users and items": Do you mean "users' ratings of items"?
§2.2, "which is characterised by the high effort of knowledge engineering": What do you mean?
Pg. 7, col. 2, "which are informed by the emergence": Do you mean "which are required by the emergence"?
Pg. 8, col. 1, "enabling and protecting the anonymity related principles": Do you mean "enforcing and granting the anonymity-related principles"?
Pg. 8, col. 1, "affected by enabling and protecting the privacy principles": Do you mean "related to enforcing and granting the privacy principles"?
Pg. 8, col. 1, "the UI has the task to community": Do you mean "the UI has the task to communicate"?
Pg. 9, col. 1, "we describe the required communication pattern of the participants, followed by a qualitative evaluation": Actually, the "qualitative evaluation" will be presented in §5.1
Pg. 9, col. 2, "changing and maintaining the ACLs from the WAC metadata": What do you mean?
Pg. 9, col. 2, "This allows the storage to determine": Replace with "This allows the storage service to determine"
Pg. 10, col. 2, "they can be accessible via the same WebID [...] [1]": Replace with "they can be accessible via the same WebID [...], by using e.g., the approach described in [1]" (you never talked about WebID before)
Pg. 11, col. 2, "personalisation on user profiles": Do you mean "personalisation on top of user profiles"?
Last paragraph of §4.1: These scenarios do not really show the potential of your approach (for instance, why should people sharing musical preferences be willing to share travel preferences as well?). Think of better ones
Pg. 12, col. 2, "which share a MySpace users musical preferences": Do you mean "which share musical preferences as Myspace users"?
Pg. 13, col. 1, "extend the architecture of the recommender system with two components": Which ones? The "data interface" and the "integration service"?
Pg. 13, col. 1, "only requires writing one new rule": This seems to imply that the integration service is implicitly assumed to be rule-based. If this is the case, you have to clearly state it in advance
Pg. 14, col. 1, "to at least one musician for which the recommender system already has background data": Why is this constraint needed?
Pg. 14, col. 1, "multiple new user columns": I do not see how new columns can be generated. Do you mean that the new row will contain multiple 1's for different users?
Pg. 14, col. 2, "as cases which describe the experiential knowledge of users": What does it mean? And what is "experiential knowledge"?
Pg. 18, col. 1, "as explained in the previous section": Actually, you explained it in §4.2
Do you mean the same thing by "privacy-enabled" and "privacy-enhanced"? If this is the case, pick up one term and stick to it. Otherwise, explain the difference
§6: Explain in which respect(s) your solution is better than the existing ones. In particular, w.r.t. §6.1.2, why is your solution better than OAuth if both of them require a number of connections (cf. Fig. 8)?
Pg. 20, col. 2, "LOD cloud": What is it?

Minor issues

Throughout the paper
* Do not mix up "users", "user's" and "users'" (e.g., last line of the abstract)
* Add footnotes with the URLs of Yelp, Amazon, LiveJournal, GoodRelations, Best Buy, DBTune,
* Replace "a patients'" by "a patient's"
* Citations should follow these patterns:
  * For one author: "Smith [1]"
  * For two authors: "Smith et Smith [1]"
  * For more authors: "Smith et al. [1]" (not "Smith et. al [1]")
* Some figures are not referred to from within the text (e.g., Fig. 10): either refer to them or remove them
* Do not put a comma between the subject and the predicate of a sentence (e.g., pg. 11, col. 2, "Providing personalisation [...], requires")
* Hyphens often improve readability (add hyphens in expressions like "object centred", "knowledge intensive", "human readable", "graph based", "domain independent", "privacy enhancing", "anonimity related", "profile sharing", "self hosted", "privacy preserving", "work related", "user specified", "case based", "object oriented", "privacy enabled", "self hosting", "user generated", "cryptography enhanced", "OpenID enabled"), especially in case of noun adjuncts ( ) following the pattern (add hyphens in expressions like "high profile ", "health care ", "third party ", "large scale ", "collaborative filtering ", "data acquisition ", "general knowledge ", "open world ", "real world ")
* Keep consistency:
  * either always "Web" or always "web"
  * either always "web site" or always "website"
  * either always "light-weight" or always "lightweight"
  * either always "Linked Data" or always "linked data"
  * either always "schemas" or always "schemata"
  * either always "he" or always "she" or always "s/he"
  * either always "him" or always "her" or always "him/her"
  * either always "his" or always "her" or always "his/her"
  * Myspace (not MySpace or myspace)
  * Wikipedia (not wikipedia)
  * GoodRelations (not Good Relations)
  * Best Buy (not BestBuy)
  * (not
Abstract & Introduction, "new methods and architectures for personalisation": Replace with "new methods and architectures for a kind of personalisation"
Pg. 2, col. 1, "by the user themselves": Replace with "by the users themselves"
Pg. 2, col. 2, "background data which is provided": Replace with "background data which are provided"
Pg. 3, col. 2, "making their profile available": Replace with "making their profiles available"
Pg. 4, col. 1, "user preferences to the features": Replace with "user preferences against the features"
Pg. 4, col. 1, "which can be mapped to the knowledge": Replace with "which can be mapped against the knowledge"
Pg. 4, col. 1, "In addition, knowledge-based recommendation additionally": Replace with "In addition, knowledge-based recommendation"
Last line of §2.1: Replace with "hybrid one."
Pg. 5, col. 2, "a community is connected to each other [...] music from an artist": Replace with "a community is connected not only via direct links from person to person, but also via people's links to e.g. music by an artist"
§2.4, "Facebook reverting their policy": Replace with "Facebook reverting its policy"
§2.3, "This complimented with the description": You do not mean that
Pg. 6, col. 1, "Twitter": Remove the URL, you already added it before
Pg. 6, col. 2, "data and content[13]": Replace with "data and content [13]"
Pg. 7, col. 1, "plays in important role": Replace with "plays an important role"
Pg. 7, col. 1, "the difference for the architecture": Replace with "the difference in the architecture"
Pg. 7, col. 2, "third part services": Replace with "third-party services"
Fig. 4, "many schema": Replace with "many schemas" or "many schemata" (cf. above)
Pg. 8, col. 1, "new infrastructure software from the user": Replace with "new infrastructure software by the user"
Pg. 8, col. 1, "into three areas:": Replace with "into three areas."
Pg. 8, col. 1, "Control over the user's data": Replace with "Control over the user data" (for consistency with §5.1)
Pg. 8, col. 1, "data which is collected of him": Replace with "data which is collected about him"
Pg. 8, col. 2, "they might contain data from millions of users": Replace with "they might contain data about millions of users"
Pg. 8, col. 2, "while maintaining their privacy at the same time": Replace with "while maintaining their privacy"
Pg. 9, col. 1, "This allows integrating": Replace with "This allows integration"
Pg. 9, col. 2, "profile storage service or data consuming services": Replace with "profile storage or data consuming services"
Fig. 6: What is "Chow"?
Pg. 10, col. 1, "However, while being central hub sites": Replace with "However, being central hub sites"
Pg. 10, col. 1, "provide personalised information recommendations": Replace with "provide personalised recommendations"
Pg. 10, col. 2, "clinical trials are matched to": Replace with "clinical trials are well-suited for"
First line of pg. 10: Replace "say" with "says"
Pg. 11, col. 2, "we are going to use two sources:": Replace with "we are going to use two sources: MySpace and DBpedia."
Last line of pg. 11: Replace "which has a special" with "has a special"
Pg. 11, col. 1, "Kyle Butler": Replace with "KyleButler"
Pg. 11, col. 1, "FOAF profile as RDF": Replace with "FOAF profile" (redundant: FOAF is RDF-based)
Pg. 11, col. 1, "we can follow it from MySpace": Replace with "we can follow the user they represent from Myspace"
Pg. 12, col. 1, 'these user identities, we could recommend "Dexter Morgans" topics to "KyleButler"': Replace with 'user identities, we could recommend "Dexter Morgan"'s topics to "TheTeacher"'
Pg. 12, col. 2, "would recommending": Replace with "would be to recommend"
Pg. 12, col. 2, "a statistical significant connection": Replace with "a statistically significant connection"
Pg. 13, col. 2, "which connections him": Replace with "which connects him"
Pg. 13, col. 2, "a wikipedia editors homepage": Replace with "a wikipedia editor's homepage"
Pg. 13, col. 2, "we can add data": Replace with "we can add the recommender system's background knowledge"
Pg. 14, col. 1, "which are connected to more then one musician": Replace with "who are connected to more than one musician"
Pg. 15, col. 1, "'problemßĂŹ": Correct
Pg. 15, col. 1, "matching case for the target problem": Replace with "matching case to the target problem"
Pg. 15, col. 1, "curation": Replace with "maintenance"
Pg. 15, col. 1, "there general knowledge": Replace with "there are general-knowledge"
Pg. 15, col. 2, "user generated content from": Replace with "content generated by"
Pg. 15, col. 2, "data about all of their stores": Replace with "data about all of its stores"
Pg. 15, col. 2, "this knowledge can be modelled in XML": Replace with "this knowledge can be represented in XML" (XML is just a representation format)
Pg. 16, col. 1, "RDF semantics follow": Replace with "RDF semantics follows"
Pg. 16, col. 2, "which the use. Consider as an example, that": Replace with "which they use. Consider as an example that,"
Pg. 16, col. 2, "": Replace with "DBTune" in normal font (for consistency with other brands)
Pg. 18, col. 1, "Linked Data can enable providing": Replace with "Linked Data can provide"
Pg. 18, col. 2, "is the set of recommendations": Replace with "is the set of artist recommendations"
Pg. 18, col. 2, "For a second recommendation": Replace with "For a system's recommendation"
Pg. 19, col. 1, "the existing approaches for enhancing personalisation to enable user privacy fall into these categories:": Replace with "the existing approaches for privacy-enhanced personalisation fall into these categories."
Pg. 19, col. 2, "randomised errors which cancelled out": Replace with "randomised errors which cancel out"
Pg. 20, col. 1, "However this thight integration": Replace with "However this tight integration"
Pg. 20, col. 1, "industry wide adoption": Replace with "wide industrial adoption"
Pg. 20, col. 1, "easy extensions to this vocabulary": Replace with "easy extensions of this vocabulary"
Pg. 20, col. 2, "[41] describe": Replace with "[41] describes"
Pg. 20, col. 2, "[44] describes their": Replace with "[44] describes"
Pg. 21, col. 2, "As we discuss in the next section": Point to the right section
Pg. 21, col. 2, "[21] propose": Replace with "[21] proposes"
Pg. 21, col. 2, "We argue that [...] integration": Replace with "We argue that, [...], integration"
Pg. 22, col. 1, "The problem is simply not one [...] to create innovative new models, but of": Replace with "The problem is not simply the one [...] to create new innovative models, but the one of"
Last line: Replace with "web browsers as user clients."
[20]: Capitalize "cbr" and "xml"