Semantics-Aware Shilling Attacks Against Collaborative Recommender Systems via Knowledge Graphs

Tracking #: 2735-3949

Authors: 
Vito Walter Anelli
Yashar Deldjoo
Tommaso Di Noia
Eugenio Di Sciascio
Felice Antonio Merra

Responsible editor: 
Guest Editors ESWC 2020

Submission type: 
Full Paper
Abstract: 
Several domains have widely benefited from the adoption of knowledge graphs (KGs). For recommender systems (RSs), the adoption of KGs has resulted in accurate, personalized recommendations of items/products according to users' preferences. Among the different recommendation techniques, collaborative filtering (CF) is one of the most promising approaches to building RSs. Its success is due to the effective exploitation of similarities/correlations encoded in user interaction patterns. Nonetheless, its strength is also its weakness: a malicious agent can add fake user profiles to the platform, altering the genuine similarity values and the corresponding recommendation lists. While the research community has extensively studied KGs for solving various recommendation problems, insufficient attention has been paid to the possibility of exploiting KGs to compromise the quality of recommendations. KGs provide a rich source of information for item representation and recommendation that can dramatically increase attackers' knowledge of the victim recommendation platform. To this end, this article introduces a new attack strategy, named semantics-aware shilling attack (SAShA), that leverages semantic features extracted from a knowledge graph. SAShA provides semantics-aware variants of three state-of-the-art attack strategies: Random, Average, and BandWagon. These improved attacks can exploit graph relatedness measures, i.e., Katz and Exclusivity-based relatedness, computed over 1-hop and 2-hop graph explorations. We performed an extensive experimental evaluation with four state-of-the-art recommendation models and two well-known recommendation datasets to investigate the effectiveness of SAShA. Since the semantics of relations plays a crucial role in KGs, we have also analyzed the impact of relation semantics by grouping relations into various classes. Experimental results indicate that embracing KGs benefits attackers' capability to attack recommendation systems.
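For readers unfamiliar with the relatedness measures named above, the following is a minimal sketch of a Katz-style computation on a toy, undirected, unlabeled graph. It is an illustration of the general idea only, not the formulation or implementation used in the paper, which works on weighted relations and feature-based item descriptions:

```python
from collections import defaultdict

def katz_relatedness(adj, a, b, beta=0.5, max_hops=2):
    """Toy Katz-style relatedness: sum over walk lengths l of
    beta**l * (number of walks of length l from a to b)."""
    counts = {a: 1}  # counts[node] = #walks of the current length from `a` to node
    score = 0.0
    for hop in range(1, max_hops + 1):
        nxt = defaultdict(int)
        for node, c in counts.items():
            for nb in adj.get(node, ()):
                nxt[nb] += c
        score += (beta ** hop) * nxt.get(b, 0)
        counts = nxt
    return score

# Tiny toy KG: two movies linked through two shared entities.
edges = [("m1", "SciFi"), ("m2", "SciFi"), ("m1", "director_X"), ("m2", "director_X")]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(katz_relatedness(adj, "m1", "m2"))  # two 2-hop walks -> 0.5**2 * 2 = 0.5
```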
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 31/May/2021
Suggestion:
Major Revision
Review Comment:

This paper extends a new shilling attack, named semantics-aware shilling attacks (SAShA), which leverages semantic features extracted from publicly available knowledge graphs. While the idea originally comes from the same authors, this paper adopts and applies it to a broader range of recommender models and attack strategies. In addition, more metrics are considered to cover different aspects of the similarity between resources.
The main noticeable strengths of this work are as follows: (i) the quality of the research, which is original, well presented, and contains contributions that advance the state of the art in integrating semantics into shilling attacks and in involving a deep neural recommendation model; (ii) the solid study of the state of the art, which has been performed systematically and documented in an organised and well-structured manner; (iii) the motivation of the research, which is discussed in good detail and leads to a precisely stated research idea; (iv) the comprehensive experimental evaluation conducted to investigate whether SAShA is more effective than baseline attacks against collaborative filtering models, taking into account the impact of various semantic features. Experimental results on two real-world datasets show the usefulness of the proposed strategy.

All these strong points have resulted in a nicely documented first four sections, which need only a few typo and language corrections. Some examples of these minor issues can be found in the "Minor corrections" subsection of this review.

Despite all the above-mentioned positive points, as noted in the paper itself, "given the extent of experiments carried out in the experimental section, it could be hard to decipher this information at first glance". Therefore, more precise analysis, consideration, and discussion are needed.
Although the insights obtained from the experimental results are interesting, there are some inconsistencies and exceptions in the analysis of these results, especially in Section 5, which make the discussions and their corresponding conclusions fragile and confusing. This section therefore needs major amendment and revision, in my opinion. I reviewed it closely and highlight its contentious parts below.

Highlighted points for major revision/correction:

•“the results obtained on the Yahoo!Movies dataset (Table 5) are more indicative of attacks’ effectiveness independently of the attack strategy, the number of injected profiles, and recommender models”. (Section 5.1 - Page 13 - column 1 - Lines 45-50)

Comparing the results of the experiments in the two tables, this conclusion sounds overly generalised and fragile. It is not clear to the reader how the authors arrived at this intuition, nor which parts of Table 5 differentiate the attacks' effectiveness from those in Table 4.

•"Furthermore, Table 4 also confirmed the semantics aware strategy's efficacy over the baseline, either for the average and random attacks. For instance, the semantic strategies outperformed all the <…> and <…> baseline attacks independently of the recommender model and the size of attacks" (Section 5.1 - Page 13 - column 2 - Lines 7-13)

There are several exceptions in Table 4 showing that some semantic strategies could not outperform the baselines. For example, <…> with all similarity measures could not outperform the baseline, while <…> and <…> can beat it. These exceptions disaffirm the above-mentioned claim.

•"However, it is worth mentioning that, differently from the results on Yahoo!Movies, on LibraryThing, the baseline attack's effectiveness did not improve" (Section 5.1 - Page 13 - column 2 - Lines 13-16)
Again, some exceptions show that <…> improved the baseline attack's effectiveness, such as <…>. Also, in <…>, there are some cases, such as <…> for the <…> recommendation models across all attack granularities, that did not improve over the baseline. These exceptions again disaffirm the claim mentioned above.

•“We can observe that the adoption of graph-based relatedness generally leads to an attack efficacy improvement over the baseline, which adopts cosine similarity metric.” (Section 5.2 - Page 13- column 2- Lines 49 -52)

Again, this is a very general claim, and it is not clear what "graph-based relatedness" means in the above sentence. Does it refer to the "semantic features" or to the "relatedness-based measures"?

•“The general observation here is that in majority of the experimental cases, the adoption of relatedness-based semantic information leads to improvement of the attacks’ effectiveness” ( Section 5.2 - Page 14- column 1- Lines 36 -39)

What does "relatedness-based semantic information" mean? Does it mean "relatedness-based measures"? There is also inconsistency in using identical or similar phrases in different places without proper definition, which undermines the paper's self-containedness.

•“We may observe the same behavior for the Yahoo!Movies dataset in Table 5, in which the HR for <1H, User-kNN, Random, Categorical, Katz> is 10% better than the baseline, i.e., 0.3725 vs. 0.3512” (Section 5.2 - Page 14- column 1- Lines 39 -43)

The baseline for <…> is 0.3624; therefore, 10% is not the correct improvement over the baseline. Should this be revised? Or does it mean Katz vs. Cosine?
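For concreteness, the relative improvement can be recomputed from the quoted HR values; neither candidate baseline yields 10% (plain arithmetic, using only the numbers cited above):

```python
hr_katz = 0.3725             # <1H, User-kNN, Random, Categorical, Katz>
hr_quoted_baseline = 0.3512  # baseline value quoted in the paper's text
hr_table_baseline = 0.3624   # baseline value read from the table

print((hr_katz - hr_quoted_baseline) / hr_quoted_baseline)  # ~0.061 -> ~6.1%, not 10%
print((hr_katz - hr_table_baseline) / hr_table_baseline)    # ~0.028 -> ~2.8%, not 10%
```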

•“Beyond random attacks, we can observe some general trends also for informed attacks. In detail, Table 4 (LibraryThing), we note that categorical information improves both User-kNN and Item-kNN.” (Section 5.2 - Page 14- column 1- Lines 43-47)

Again, not in all cases. The exceptions are <…>, <…>, <…>, and <…>.
It should also be mentioned that this trend holds only for Average attacks, not for BandWagon attacks.

•“It is worth noticing that the same consideration does not hold for latent factor-based models. MF and NeuMF suit better cosine vector similarity.” (Section 5.2 - Page 14- column 1- Lines 47 -49)

What does "same consideration" mean here? If you mean <…> and <…>, the claim is not true, because <…> and <…> outperformed all the others.

•Section 5.2 - Page 14- column 2- Lines 33 -51

It is not apparent which insight about the BandWagon attack is discussed in this paragraph; invoking the unrelated popularity measure as justification therefore looks rather dangling.

•“All the experimental datasets and all the recommendation models clearly show this effect.” (Section 5.2 - Page 15- column 1- Lines 33 -34)

It is not clear which effect is being discussed.

•Section 5.2 - Page 15- column 1- Lines 35 -46

The definition of the hypothesis, the relevance of the cited examples as evidence for it, and the result concluded from those examples are unclear and do not make sense.

•“We start focusing on Categorical knowledge. The experiments on LibraryThing show that Exclusivity is probably the relatedness that best suits this information type”. (Section 5.2 - Page 15- column 2- Lines 40 -43)

The result does not seem solid and precise for LibraryThing either; there are several exceptions across the different recommendation models and attack types.

•Section 5.2 - Page 15- column 2- Lines 49 -51

All possible cases of <…> and <…> show that this conclusion is not true:
"In detail, we found that with low-knowledge attacks, the best relatedness is Exclusivity for LibraryThing and Katz for Yahoo!Movies. With informed attacks, the best relatedness metric is the cosine similarity. However, for the sake of electing a similarity that better suits Factual information, we can note that Exclusivity generally leads to better results with LibraryThing."

•“Regarding Yahoo!Movies, the first and foremost consideration we can draw is that graph-based relatedness measures seem to have no positive impact when exploiting a double-hop exploration” (Section 5.2 - Page 16- column 2- Lines 32 -36)

What do "graph-based relatedness measures" refer to here? Moreover, double-hop exploration on Factual information does seem to have a positive impact with the Exclusivity measure for Average and BandWagon attacks, and there are other such cases as well.

•“Indeed, in most cases, we can observe a minimal variation for the double hop performance” (Section 5.2 - Page 16- Column 2- Lines 40 -42)

How did you delimit "minimal variation"? It would also be interesting to know why NeuMF on LibraryThing shows considerably larger positive and negative variations.

•“Beyond graph-based relatedness, we observe that cosine vector similarity almost always shows an improvement when considering second-hop features (particularly with Ontological and Factual information)” (Section 5.2 - Page 16- Column 2- Lines 50 -51)

What does "almost always" mean? It is not a precise insight. All possible cases of <…> show that this conclusion is not valid.

•Section 5.2 - Page 17- Column 1- Lines 21 -38

Although these intuitions for answering RQ4 sound theoretically reasonable, no solid evidence from the experimental results is provided to support these outcomes.

All the above points show a lack of consistency and clarity in the discussion and conclusions drawn from the experimental results.
While the authors specified a declarative format for identifying any attack combination (Section 5.1 - Page 13 - Column 1 - Lines 31-35), they rarely use this format in the rest of the paper. I believe, and show in my comments, that using this format would bring more clarity to the items discussed, and I highly recommend exploiting it in the revision of the paper.

Suggestions for minor corrections:

I came across some minor typos and syntactic errors during this review, which are listed below:

•“For this purpose, we compute semantic similarities/relatedness between the items in the catalog e the target item using KG-based features (cf. Section 3.1)” (Section 3.3 - Page 9- Column 2- Lines 18 -21)

•“The baseline attack leverages the mean and variance of the ratings, which is then used to sample each filer item’s rating from a normal distribution built using these values.” (Section 3.3 - Page 9- Column 2- Lines 28 -31)

•“However, similarly to the previous two semantic attack extensions,” (Section 3.3 - Page 9- Column 2- Lines 42 -43)

•“we describe the the experimental evaluation and provide details necessary to reproduce the experiments” (Section 4 - Page 10- Column 1- Lines 3 -5)

•“Following the evaluation procedure used in Mobasher et al. [4, 88],” (Section 4.4 - Page 12- Column 2- Lines 38 -39)

•“All the results are computed for top-10 recommendation??”. (Section 5 - Page 13- Column 1- Lines 21 -22)

•“In this section, ??? devote ourselves to provide a more in-depth discussion about the impact of several factors involved in the design space of the proposed semantics-aware shilling attacks against CF models.” (Section 5.2 - Page 13- Column 2- Lines 30 -33)

Although I firmly believe this study would benefit the research community, and researchers could employ its findings and apply their own expertise to improve it along the suggested future research directions, I cannot convince myself to accept the current version before the suggested major corrections, especially in Sections 5 and 6, are applied.

I very much look forward to reading the revised version of this paper soon.

Best of luck.

Review #2
Anonymous submitted on 02/Jun/2021
Suggestion:
Major Revision
Review Comment:

The paper introduces a new attack strategy that leverages semantic features extracted from a knowledge graph in order to strengthen the efficacy of attacks against standard CF models. The work is interesting, but there are some questions that should be addressed.

1. The references do not appear to be cited in sequential order.

2. The introduction of Section 3.1 and the attack model in Section 3.2 belong to related work and should be moved to Section 2.

3. The specific technical details of how the set of filler items is generated based on similarity in the knowledge graph are not clear (see the sketch after this list for one possible reading).

4. Is there one fixed set of filler items similar to the target item, or is there a large pool of filler items to choose from?

5. "In detail, we dropped off all the features with more than 99.74% of missing values and distinct values" — this sentence is repeated twice. Moreover, why filter so aggressively?

6. The feature preprocessing of the knowledge graph plays a key role in this experiment, but the paper does not present the technical details and principles of this step in sufficient detail.

7. What does "noisy factual features" mean?

8. The filler rate has a great influence on the attack's effectiveness, yet it is not examined in the experiments.

9. The experiments do not compare against more advanced attack models, such as AOP, power item, and power user.
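Regarding points 3 and 4 above, one plausible reading is that the fillers are simply the k catalog items most related to the target; the sketch below illustrates that reading (all names are hypothetical, and this is not the authors' code):

```python
def select_fillers(target, catalog, relatedness, k):
    """Pick the k catalog items most related to `target` as filler items.
    `relatedness(i, j)` can be any item-item score, e.g. a KG-based measure."""
    candidates = [item for item in catalog if item != target]
    candidates.sort(key=lambda item: relatedness(target, item), reverse=True)
    return candidates[:k]

# Hypothetical usage with some KG-based item-item relatedness function:
# fillers = select_fillers(target_item, all_items, kg_relatedness, k=30)
```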

Review #3
Anonymous submitted on 30/Jan/2022
Suggestion:
Minor Revision
Review Comment:

The authors exploit background knowledge in the form of a knowledge graph to build semantic versions of existing families of shilling attacks on recommenders. The idea is novel and rather interesting. Each resource is represented as a set of paths; one wonders whether such feature engineering is still necessary in the age of GNNs. The authors compare several measures for comparing resources based on their 1-hop or 2-hop descriptions. Their three semantic attacks are versions of classical shilling attacks that exploit the fact that a knowledge graph serves as known background knowledge for the recommendation task. The authors evaluate the attacks on two datasets using four well-known recommendation approaches. Content-wise, the paper is overall quite close to being ready for publication. Still, I do have a few remarks.
1) The authors use a t-test for significance. Are the requirements for a t-test really met? That does not seem to be guaranteed.
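One standard way to vet that requirement, assuming paired per-user metric scores (a sketch using SciPy; the paper does not describe the test setup in enough detail to know whether this matches it):

```python
from scipy import stats

def compare_paired_scores(scores_a, scores_b, alpha=0.05):
    """Use the paired t-test only if the paired differences look normal
    (Shapiro-Wilk); otherwise fall back to the nonparametric
    Wilcoxon signed-rank test."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    _, p_normal = stats.shapiro(diffs)
    if p_normal > alpha:
        return stats.ttest_rel(scores_a, scores_b)
    return stats.wilcoxon(scores_a, scores_b)
```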
The authors conclude that using background knowledge is profitable for the attacking system. The authors should clarify a few statements:
2) Results are more indicative of attacks' effectiveness: I am puzzled by that conclusion. Why does a smaller standard deviation in the results make them more indicative? Could it simply be a stochastic effect? Or is it perhaps a sign of robustness (you point to sparsity)? Is there a way to verify that these results indicate effectiveness rather than some other phenomenon?
3) Bandwagon attack: To confirm your hypothesis, please fill the profile with an increasing proportion of common items and check the performance of the attacks.
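A sketch of how that check could be set up; the profile construction below, including the rating scheme, is an assumption for illustration, not the paper's exact BandWagon definition:

```python
import random

def bandwagon_profile(target, popular, catalog, size, popular_ratio, r_max=5):
    """One fake profile: a `popular_ratio` fraction of the profile consists
    of popular ("common") items rated r_max, the rest are random fillers
    with random ratings; the target item always receives r_max."""
    n_pop = int(popular_ratio * size)
    selected = random.sample(popular, n_pop)
    pool = [i for i in catalog if i != target and i not in selected]
    fillers = random.sample(pool, size - n_pop)
    profile = {item: r_max for item in selected}
    profile.update({item: random.randint(1, r_max) for item in fillers})
    profile[target] = r_max
    return profile

# Sweep the proportion of common items and re-run the attack at each ratio:
# for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
#     fake_profiles = [bandwagon_profile(t, popular_items, catalog, 50, ratio)
#                      for _ in range(n_attack_profiles)]
```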
4) The manuscript still contains quite a few typos (e.g., "the the") and inconsistencies (e.g., the tenses in Section 4.4, among others), as well as missing words ("In this section, devote"). Please reread your manuscript thoroughly.