Sentiment Lexicon Adaptation with Context and Semantics for the Social Web

Tracking #: 1329-2541

Hassan Saif
Miriam Fernandez
Leon Kastler
Harith Alani

Responsible editor: 
Guest Editors Social Semantics 2016

Submission type: 
Full Paper
Sentiment analysis over social streams offers governments and organisations a fast and effective way to monitor the public's feelings towards policies, brands, businesses, etc. General-purpose sentiment lexicons have been used to compute sentiment from social streams, since they are simple and effective. They calculate the overall sentiment of a text from a general collection of words with predetermined sentiment orientations and strengths. However, a word's sentiment often varies with the context in which it appears, and new words may be encountered that are not covered by the lexicon, particularly in social media environments where content emerges and changes rapidly and constantly. In this paper, we propose a lexicon adaptation approach that uses contextual as well as semantic information extracted from DBpedia to update the words' weighted sentiment orientations and to add new words to the lexicon. We evaluate our approach on three different Twitter datasets, and show that enriching the lexicon with contextual and semantic information improves sentiment computation by 3.7% in average accuracy and by 3% in average F1 measure.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 20/Mar/2016
Review Comment:

The paper describes a method for adapting a sentiment lexicon to other domains. The task is specifically considered for processing social media data.

(1) originality,
There is not yet much work on that topic available. There is some work not yet considered by the authors that should be discussed.

(2) significance of the results:
In social media, topics change quite fast, reacting to current events in society, politics, etc. Such an adaptation method is therefore very important for sentiment analysis. Previous work showed that it is necessary to consider the domain, since meanings can differ. The approach leads to slight improvements in the accuracy of sentiment classification. Even though the improvements are slight, the method and results could be a good starting point for further research and could stimulate the research community.

(3) quality of writing.
The paper is very well written.

- Did you check the lexicon updates manually, at least for a random sample? What did you experience?
- Did you perform an error analysis for the sentiment classification? This would be helpful to learn about limitations or possible future improvements.
- The authors mention that they are using AlchemyAPI: why? What was the reason to choose that tool? Did you recognize any limitations that might impact the results of your approach? Since you plan to test other tools, there must be some limitations.
- How many different semantic types does AlchemyAPI extract? The list in Table 4 looks quite fine-grained in terms of types.
- Adding the explanations of the abbreviations LU, LE, etc. to the table heading in Table 6 could be helpful.
- Might the rule-based classification be a problem?

Review #2
Anonymous submitted on 04/Apr/2016
Major Revision
Review Comment:

The authors propose automatic methods to refine and extend sentiment lexica
based on contextual information and on conceptual and semantic association
data. While there is a considerable body of work on using contextual
information to extend sentiment lexica, the application of semantic resources
is often limited to restricted synonymy datasets (WordNet). This article
aims at the precise research gap of evaluating how conceptual semantic data
from DBpedia can be used to improve lexica for sentiment analysis, in the
context of the application of SentiStrength to estimate the polarity of
tweets.

The article explains a lot in detail the methods applied to extend the lexica,
but the statistical methods used to evaluate such improvement are insufficient
to conclude that there is a significant improvement, in particular since it is
relatively small. The description of statistical tests is basically
summarized in the sentence "Statistical significance is done using Wilcoxon
signed-rank test", and little to no detail is ever given about which
distributions are compared, what are the null hypotheses, or what is the
estimate of difference between medians. To conclude a statistically
significant improvement, the authors need to show a test that compares the
Precision, Recall, and F1 distributions resulting from the original method
versus their improvement. Furthermore, since the design includes three models
(context-based, conceptually-enriched, and semantically-adjusted relations)
and three applications (LU, LE, LUE), for each dataset the authors are testing
9 closely related hypotheses, and thus the p-values of the statistical tests
need to be corrected for multiple hypothesis testing.

Since the article aims at the extension with DBpedia data, to assess if that
data does or not increase performance, the same statistical tests should be
performed between the three models proposed by the authors. Unless the
performance of models is compared in a pairwise way, it is impossible to
conclude if the second and third model do really constitute an improvement.

Besides this major concern, I have other smaller comments:

- The article heavily builds on the lexicon of SentiStrength, and very briefly
argues about this choice without further support ("to the best of our
knowledge, it is currently one of the best performing lexicons"). The authors
need to refer to some benchmark study that shows why the chosen method can be
considered state of the art.

- While explained in good detail, there is little argumentation (end of p. 4)
about why the authors need the machinery of the SentiCircle approach. What do
they obtain computing polar coordinates that cannot be done in a simple
averaging with TF-IDF? Why is it desirable to have a dimensional
representation in which extremely negative and extremely positive terms are
very close to each other (area of large r and theta close to 180 degrees)?
What is the framing of the method with respect to simpler, previous
approaches?

- The rules presented in Table 1 are similar to those used to fine tune
SentiStrength in previous research, the authors should explain in detail the
resources that support the assumptions behind their decision to use those
rules in particular.

- The terminology used to refer to lexica and methods is inconsistent. In
some cases the name of the method is used (SentiWordNet) and in others the
name of one of the authors of a paper using the method (Thelwall-Lexicon).
I suggest that the authors use a consistent terminology, for example
referring to SentiStrength, or citing all the methods by the initials of
their authors.

- The use of pie charts in Figure 7 is totally unnecessary and the strange
vertical scale makes it very difficult to judge the actual values. A typical
bar chart or a table would better serve the purpose of that figure.

- The 14 positive and negative paradigm words (p. 9) need to be reported, and
the criterion for their choice has to be defended in the text.

- Two positive aspects help the reader to have a clearer idea of the results.
First, including the SO-PMI benchmark. SentiStrength, as fully unsupervised,
would not use global properties of the tweets in the evaluation dataset, but
the SO-PMI adds a naive approach that, to some extent, improves marginally the
original performance. Second, the descriptive insights on the amount of
changes explained in section 6.5, and on the role of balance in section 6.6.
This kind of post hoc analysis helps to understand better the conditions and
limitations of the results.

To sum up, the article needs important improvements before its main
conclusions can be considered as supported by the results. In particular, the
statistical tests need to be reported in much more detail to conclude that the
models are a significant improvement, and to which extent the semantic
relation and concept data extends the contextual adaptation.

Review #3
Anonymous submitted on 06/Apr/2016
Minor Revision
Review Comment:

The approach described in this paper tackles a problem inherent in sentiment lexicons. A sentiment lexicon stores static sentiment values, which limits its applicability when the context changes. The authors provide two different solutions for this problem: the (i) contextual and (ii) conceptual enrichment of a sentiment lexicon. Contextual enrichment is based on co-occurrence probabilities, while conceptual enrichment leverages semantic knowledge bases such as DBpedia. Both solutions either update the lexicon, expand it, or do both, with varying effects on the quality of the resulting lexicon. The authors focus their experiments on social media, more precisely on Twitter corpora known in the literature. The experiments validate the efficacy of their approach and show that standard lexicons benefit from such an enrichment procedure.

Approaches to adapt sentiment lexicons to context are not new. In particular, one of their statements is simply wrong: "In addition, very little attention has been giving to the use of semantic information as a resource to perform such adaptations.". A big part of Erik Cambria's work focuses on semantic enrichment and has resulted in the well-known resource SenticNet. Furthermore, I personally find their theoretical assumption quite bold. Their assumption is based on the hypothesis that the sentiment values of sentiment terms in a lexicon might be wrong. However, they calculate an updated sentiment value (i.e. based on the context) using the sentiment values of other sentiment terms from the same sentiment lexicon. This results in a circular contradiction: how can one obtain accurate new values using, as a basis, values that have actually been considered faulty? However, their evaluation results seem to negate my doubts. Other than that, the experiments have been planned meticulously and provide the means for comparison with standard corpora. Furthermore, their approach to context awareness is innovative and sounds interesting. Thus I suggest accepting the paper.

However, there is a need for minor revisions, as I discovered many mistakes, ranging from typos to grammatical errors. Also, the choice of highlighting significance of < 0.001 with "non-italic" was poor, since these values are hard to spot and I first believed there were none.