Aligning Tweets with Events: Automation via Semantics

Paper Title: 
Aligning Tweets with Events: Automation via Semantics
Authors: 
Matthew Rowe, Milan Stankovic
Abstract: 
Microblogging platforms, such as Twitter, now provide web users with an on-demand service to share and consume fragments of information. Such fragments often refer to real-world events (e.g., shows, conferences) and often refer to a particular event component (such as a particular talk), providing a bridge between the real and virtual worlds. The utility of tweets allows companies and organizations to quickly gauge feedback about their services, and provides event organizers with information describing how participants feel about their event. However, the scale of the Web, and the sheer number of Tweets which are published on an hourly basis, makes manually identifying event tweets difficult. In this paper we present an automated approach to align tweets with the events which they refer to. We aim to provide alignments on the sub-event level of granularity. We test two different machine learning-based techniques: proximity-based clustering and classification using Naive Bayes. We evaluate the performance of our approach using a dataset of tweets collected from the Extended Semantic Web Conference 2010. The best F0.2 scores obtained in our experiments for proximity-based clustering and Naive Bayes were 0.544 and 0.728 respectively.
Submission type: 
Full Paper
Responsible editor: 
Guest Editors
Decision/Status: 
Accept
Reviews: 

This is a revised manuscript which has been accepted for publication. This followed an accept pending minor revisions. Below are the reviews for the revision, followed by the reviews for the original submission.

Solicited review by Eraldo Fernandes:

The authors satisfactorily addressed all aspects that I pointed out in my last review. Nevertheless, the paper still needs careful proofreading to remove some typos, such as the following.

"Fashions shows" -> "Fashion shows"
"which can be then be exposed" -> "which can then be exposed"

Solicited review by David Laniado:

I think the manuscript is ready for publishing.

Solicited review by anonymous reviewer:

This is a much improved version of the paper. Most of the comments have been addressed satisfactorily - in a couple of cases, the authors have instead provided a rebuttal, but it appears there is some misunderstanding about the reviewers' comments. However, these are fairly minor details so I'm prepared to let them go. For example, the suggestion to evaluate the performance of Zemanta was made in order to determine not how good Zemanta is per se (which I agree is somewhat beside the point), but to determine what the effect of errors or missing concepts might be on the overall performance. The point is that unless you have a clear idea of how good the performance of a third party system is, you cannot determine whether it's the best tool for the job, and how much impact on the final result this system might have. I apologise for not making this clear in my original review. Perhaps a sentence or two could just be added to the document that this would be something worth looking at in future.

Just a few minor comments remain:

One typo: in the second sentence of the Introduction, "fashions shows" should be "fashion shows".

On page 2, I suggest adding a comma between "to" and "in" in the sentence beginning "Motivated by the need to align tweets...." (otherwise it sounds as if the events are in the paper).

Figures 8 and 7 appear out of order in the paper - it would be better to arrange them so that Figure 7 appears before Figure 8.

In the bibliography, reference 8 has the wrong author names: the authors should be D. Maynard, W. Peters, Y. Li.

Reviews for the original submission:

Solicited review by Eraldo Fernandes:

The authors propose a method to extract sub-event mentions within tweets about a given major event. The major event is specified as a dereferenceable URI and a twitter hash tag. The URI is used to automatically obtain the list of sub-events by a simple heuristic. The hash tag is used to search the tweets concerning the major event.

The main contribution of this work consists in the methods to identify the mentions of sub-events within the set of tweets related to the given major event. The paper presents an evaluation of two methods on a dataset concerning one major event (Extended Semantic Web Conference 2010).

Pros:

The paper is well written, organized and presented. It also deals with a relevant task and uses a good methodology.

Cons:

The evaluation is restricted to one event. Since the classifiers are event specific, it would be important to evaluate the method on other events.

Comments:

It should be clearer along the text (mainly in the abstract and introduction) that the aim of this work is to detect *sub*-events within a given and delimited major event. During the first sections this is not clear and the reader tends to believe that the method identifies general and arbitrary events.

Why do you say in the abstract that the achieved performance is "optimum"? I have no clue what evaluation metric is this optimum result related to.

The F-beta values in Table 2 are inconsistent with the corresponding precision and recall values.

In sec. 5, there are many references to Tab. 6 that should be to Tab. 2.

The number of classes (number of sub-events) can be huge in some cases, which makes the proposed classification task very hard. Furthermore, the number of examples per class is very limited. This is likely the cause of the poor performance with discriminative methods (SVM, for instance). I think both presented approaches are closely related to TF-IDF weighting for information retrieval. This could be discussed in the text.

Solicited review by David Laniado:

The paper addresses the problem of aligning tweets with the events they refer to. More precisely, given a collection of tweets related to an event, the authors propose an approach to correctly assign each tweet to the corresponding sub-event; the scenario analyzed is that of a conference with different talks going on.
First, the preprocessing of tweets is described: tweets are represented in a structured format by means of standard ontologies for social Web data and enriched through the Zemanta key extraction API.
The central contribution of the paper is in the proposal and assessment of different machine learning algorithms in order to perform the alignment. Three features are extracted to describe subevents as bags of words, given their official URIs, and leveraging the Web of Linked Data and the Zemanta API. Tweets are also represented as bags of words.
The first algorithm is based on a modified version of k-means, with centroids corresponding to events to be matched, and two distance metrics. The second technique proposed is based on a Naive Bayes classifier, built from the frequency distributions of terms observed in the event descriptions.
The techniques are evaluated on a dataset of tweets about the ESWC conference; in the comparison with a sample of manually labeled data the Naive Bayes classifier outperforms the proximity clustering algorithm, achieving over 70% performance both in terms of precision and recall.
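As a rough illustration of the second technique summarised above, the following sketch builds a multinomial Naive Bayes classifier from term frequencies in event descriptions and assigns a tweet to the most likely event. It is not the authors' implementation: the function names `train` and `classify`, the uniform class priors, the add-one smoothing and the toy event descriptions are all assumptions made for the example.

```python
import math
from collections import Counter


def train(event_terms):
    """Build per-event term counts from event descriptions.

    event_terms maps an event label to the list of terms describing it.
    """
    counts = {label: Counter(terms) for label, terms in event_terms.items()}
    vocabulary = {t for c in counts.values() for t in c}
    return counts, vocabulary


def classify(tweet_terms, counts, vocabulary):
    """Return the event with the highest log-likelihood for the tweet.

    Assumptions: uniform class priors and add-one (Laplace) smoothing.
    Terms outside the vocabulary are ignored; ties go to the first event.
    """
    best_label, best_score = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values()) + len(vocabulary)
        score = sum(
            math.log((c[t] + 1) / total) for t in tweet_terms if t in vocabulary
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical event descriptions, for illustration only.
events = {
    "talk-a": ["linked", "data", "query", "sparql", "query"],
    "talk-b": ["ontology", "matching", "alignment", "ontology"],
}
counts, vocabulary = train(events)
print(classify(["sparql", "query", "demo"], counts, vocabulary))
```

Because each event contributes only one short description, the per-class counts are small, which is why some form of smoothing is needed for terms that an event description does not contain.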

The manuscript provides an interesting experiment that can be relevant for the Semantic Web community, as different machine learning algorithms are applied, tuned and evaluated in the context of Linked Data, to classify microblogging posts.
The paper is well written and structured, and highly readable. The topic is appropriately introduced and motivated; some relevant use cases are also discussed. The algorithms are described in a rigorous way, and the information provided is sufficient to allow for replicability. The publication of a gold standard, based on a sample of manually labeled tweets, offers possibilities for further experiments and comparison of different techniques.

The task of enriching tweets with concepts from DBPedia is delegated to the Zemanta API. More details and discussion about this choice and this step of the process should be provided, to answer questions such as:
* How does Zemanta deal with the specific context of Twitter, where texts are usually much shorter than in blogs, and more prone to typos, misspellings and abbreviations?
* Which additional information is or could be exploited while processing the tweets, in order to improve the performances?
* How are hashtags leveraged?
Given the brevity of tweets, the ability of capturing the topics from those few characters can be a key point for achieving the correct alignment; for this reason I think the possibility of other more ad hoc solutions should be considered and discussed in the paper.
In the field of social tagging, and also of microblogging, other works have been proposed which make use of vector space models to process and disambiguate tags or short texts from social Web data. Literature on the study of emergent semantics from folksonomies could be taken into account and compared with the proposed approach, as the setting resonates with the processing of tweets.

Another point which could be considered and mentioned in the paper, given the importance of time in Twitter, is the possible exploitation of temporal information, i.e. the tweets' publication timestamps and the dates associated to events.

Finally, one potential weakness of this paper lies in the specificity of the explored scenario. The assumption of having a corpus of tweets related to an event seems reasonable, also thanks to the wide usage of hashtags. On the other hand, one could argue that the work risks being self-referential, as Semantic Web conferences are a very specific context, in which the availability of data in semantic format is straightforward. I suggest explicitly discussing this issue. Given the generality of the approach, as future work I would encourage the authors to test it in a different context. In this way the results could be more easily generalized to deal with "non-Semantic Web experts".

Minor issues:
- page 4: each of which are defined
- page 5: tthe features
- page 5: equation (1) could be made more readable, adding a symbol between p', o' and , and explaining the meaning of G
- page 7: "the class where distance, or proximity, is minimized" (the use of term "proximity" is misleading: if distance is minimized, proximity is maximized)

Solicited review by anonymous reviewer:

This paper describes interesting and relatively novel work in identifying relevant events from tweets and aligning them using an ML-based approach. It does strike me a little bit as a solution looking for a problem rather than vice versa, but the work is interesting nevertheless. In the case of conferences, the usefulness of this alignment is a bit clearer, but I'm not entirely convinced how widespread the problem really is. In many cases, for example, just identifying key concepts in the tweets would be sufficient, without the alignment to LOD. In the introduction, the authors mention that user profiling could be enhanced through such techniques: however, I think simple concept identification in the tweets could deal with this problem sufficiently. Further evidence to show the extent of the problem (outside the SW conference field) would be useful, though not essential.

In general, the methodology seems reasonable, but it would have been interesting to see how the ML approach compares with a standard NLP approach where ontology-based information extraction is performed on the tweets in order to link key concepts to the information in the ontologies. There are plenty of techniques for this kind of work, which should at least be mentioned in the related work. It would also be useful to evaluate separately the different stages of the methodology: for example, how well does the Zemanta-based concept enrichment work? A few words about how this approach resolves the ambiguity issues mentioned, would be useful here (at the end of Section 3).

In section 4.1, I am not quite clear what you mean by "the abstract form of the task"? This section is also a little confusing: while you do refer to a running example, more precise details of the example would be useful here, e.g. showing exactly the set of triples extracted, what you mean by "the surrounding contextual information", examples of the DBPedia concepts extracted, and so on. A little bit more detail in this section, which forms the meat of the work, would not go amiss.

In the evaluation section, I was a little surprised to see that the combination of F1+F2 leads to such an improvement on F1 alone, given that F2 results are so low. Do you have any explanation for this?
The evaluation results look promising, but it would be nice to have seen a slightly more comprehensive evaluation given the lack of existing similar systems to compare with.

Comments

REVIEWER 1

Cons: The evaluation is restricted to one event. Since the classifiers are event specific, it would be important to evaluate the method on other events.

Response: Our future work will explore the application of our approach to additional events, testing its performance within a different subject domain. We have included an explanation of this at the end of §8 (Conclusions) where we describe our future work.

-------------

It should be clearer along the text (mainly in the abstract and introduction) that the aim of this work is to detect *sub*-events within a given and delimited major event. During the first sections this is not clear and the reader tends to believe that the method identifies general and arbitrary events.

Response: We have introduced multiple changes in the abstract and introduction in order to:
(a) emphasise the notion of composite events that consist of several smaller events (such as conferences) and make it clear that they are our focus in this paper, and
(b) make it clear that our contribution is in aligning the tweets with the particular sub-events within a larger main event.

-------------

Why do you say in the abstract that the achieved performance is "optimum"? I have no clue what evaluation metric is this optimum result related to.

Response: In the phrase “The results from our evaluation yield optimum F0.2 scores of 0.544 and 0.728 for proximity-based clustering and Naive Bayes respectively.” we are not referring to a global optimum, but to the respective best values obtained in our experiments while using the two approaches. We have however changed this statement to “The best F0.2 scores obtained in our experiments for proximity-based clustering and Naive Bayes were 0.544 and 0.728 respectively.” in order to make this clearer to the reader.

-------------

The F-beta values in Table 2 are inconsistent with the corresponding precision and recall values.

Response: The F-beta values refer to average values of micro-evaluations, and therefore it is expected that if one takes the given aggregated values of P and R, the resulting F value is not the same. We have included an explanation of this in §5.3. This makes our application of statistical significance testing clearer also, and provides the reader with more details for replication of experiments.
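The point can be illustrated with a minimal sketch, assuming the reported F-beta is an average of scores computed per evaluation unit. The precision and recall pairs below are invented for illustration and are not taken from the paper; the helper name `f_beta` is likewise ours.

```python
def f_beta(precision, recall, beta=0.2):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical per-unit (precision, recall) pairs, for illustration only.
per_unit = [(0.9, 0.1), (0.3, 0.8)]

# Average of the individually computed F-beta scores.
mean_of_f = sum(f_beta(p, r) for p, r in per_unit) / len(per_unit)

# F-beta computed from the averaged precision and averaged recall.
mean_p = sum(p for p, _ in per_unit) / len(per_unit)
mean_r = sum(r for _, r in per_unit) / len(per_unit)
f_of_means = f_beta(mean_p, mean_r)

print(round(mean_of_f, 3), round(f_of_means, 3))
```

The two printed values differ because F-beta is not linear in precision and recall, so averaging before or after applying the formula gives different results.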

-------------

In sec. 5, there are many references to Tab. 6 that should be to Tab. 2.

Response: This has been rectified in the manuscript.

-------------

The number of classes (number of sub-events) can be huge in some cases, which makes the proposed classification task very hard. Furthermore, the number of examples per class is very limited. This is likely the cause of the poor performance with discriminative methods (SVM, for instance). I think both presented approaches are closely related to TF-IDF weighting for information retrieval. This could be discussed in the text.

Response: This is a valid point. We have addressed this in §8 (Conclusions) by including a discussion of the multi-class problem and how this impacts the application of SVMs in such a setting. We have also discussed using TF-IDF feature weights in our future work, and the similarity of such a heuristic with our current methods.
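For readers unfamiliar with the heuristic, the sketch below shows one common TF-IDF variant (raw term frequency multiplied by log(N / document frequency)) applied to sub-event descriptions. It is an assumption-laden illustration rather than the weighting the authors plan to use: the function name `tf_idf` and the toy descriptions are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for each document.

    documents maps a label to a list of terms. Uses raw term frequency
    and idf = log(N / df); other variants exist.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per document
    weights = {}
    for label, terms in documents.items():
        tf = Counter(terms)
        weights[label] = {
            t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf
        }
    return weights


# Hypothetical sub-event descriptions, for illustration only.
docs = {
    "talk-a": ["semantic", "web", "sparql", "query"],
    "talk-b": ["semantic", "web", "ontology", "matching"],
}
weights = tf_idf(docs)
print(weights["talk-a"])
```

Terms shared by every sub-event of the same conference (here "semantic" and "web") receive zero weight, while terms specific to one sub-event keep a positive weight, which is the discriminating behaviour the reviewer alludes to.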

-------------

REVIEWER 2

The task of enriching tweets with concepts from DBPedia is delegated to the Zemanta API. More details and discussion about this choice and this step of the process should be provided, to answer questions such as:
* How does Zemanta deal with the specific context of Twitter, where texts are usually much shorter than in blogs, and more prone to typos, misspellings and abbreviations?

Response: Zemanta is capable of working with short text, according to information we obtained from Zemanta executives. In our experience, it has also proved to work well with abbreviations, by relying on redirect links in DBPedia. We have included an additional phrase in §3.2 to explain this better. We have also added an example of the DBPedia URIs that are returned using this method in Fig. 8 for clarity.

-------------

* Which additional information is or could be exploited while processing the tweets, in order to improve the performances?

Response: We added the following phrase in the conclusions to address this question: “We are also exploring the use of additional features to improve alignment accuracy, such as the social network of a user of Twitter and how that information can be related to co-citation and authorship networks on the Web of Linked Data”

-------------

* How are hashtags leveraged?

Response: Our method uses unigrams found within both event descriptions and tweets as the features from which the bag-of-words model is composed. Therefore, when building the bag-of-words model for each tweet, we include any terms that are not filtered out by the stop-word list. For events - derived from resource descriptions on the Web of Data - we are provided with acronyms of the events (e.g. the Linking User Profiles workshop has the acronym: lupas2010). Using such features, our approach is able to detect overlaps between tweet terms and event terms and label the tweets accordingly. To provide a clearer description of this process, and of how the feature vectors are composed including hashtags, we have added an explanation to §4.2 (feature vector composition).
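A minimal sketch of this kind of overlap detection follows. It assumes that a leading '#' is stripped from hashtags so that they can match event acronyms; the stop-word list, the tokenisation rule, the function name `bag_of_words` and the example texts are all illustrative assumptions, not details taken from the paper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list used in the paper is not given here.
STOP_WORDS = {"the", "a", "an", "at", "is", "of", "on", "in", "and", "to"}


def bag_of_words(text):
    """Lower-case the text, split it into unigrams, and drop stop words.

    A leading '#' is stripped so that a hashtag such as '#lupas2010'
    yields the same term as the acronym 'lupas2010' (an assumption).
    """
    tokens = re.findall(r"#?\w+", text.lower())
    terms = (t.lstrip("#") for t in tokens)
    return Counter(t for t in terms if t not in STOP_WORDS)


# Hypothetical event description and tweet, for illustration only.
event = bag_of_words("Linking User Profiles workshop lupas2010")
tweet = bag_of_words("Great talk on profile linking at #lupas2010 #eswc2010")

shared = sorted(set(event) & set(tweet))
print(shared)
```

In this toy case the hashtag and the event acronym reduce to the same unigram, so the tweet and the event share a highly specific term in addition to any ordinary words they have in common.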

-------------

Given the brevity of tweets, the ability of capturing the topics from those few characters can be a key point for achieving the correct alignment; for this reason I think the possibility of other more ad hoc solutions should be considered and discussed in the paper.

In the field of social tagging, and also of microblogging, other works have been proposed which make use of vector space models to process and disambiguate tags or short texts from social Web data. Literature on the study of emergent semantics from folksonomies could be taken into account and compared with the proposed approach, as the setting resonates with the processing of tweets.
Another point which could be considered and mentioned in the paper, given the importance of time in Twitter, is the possible exploitation of temporal information, i.e. the tweets' publication timestamps and the dates associated to events.

Response: We agree that a useful additional feature to include in our alignment approach would be time. However, obtaining such information from which our labelling function can be induced is not possible at present, given its unavailability on the Semantic Dog Food service. We have added a comment regarding this matter and how it could be included into future work in the final paragraph of §8 (Conclusions).

-------------

Finally, one potential weakness of this paper lies in the specificity of the explored scenario. The assumption of having a corpus of tweets related to an event seems reasonable, also thanks to the wide usage of hashtags. On the other hand, one could argue that the work risks being self-referential, as Semantic Web conferences are a very specific context, in which the availability of data in semantic format is straightforward. I suggest explicitly discussing this issue. Given the generality of the approach, as future work I would encourage the authors to test it in a different context. In this way the results could be more easily generalized to deal with "non-Semantic Web experts".

Response: We agree that at present the availability of labelled data is restricted to the Semantic Web community and its conferences. However, given the recent increase in the size of the Web of Linked Data and the plethora of topics and subject areas that it describes, we anticipate the production of Linked Data for many other conferences external to the Semantic Web community. We have added a discussion of this issue, and of how we plan to explore the utility of our approach over datasets from differing domains, in paragraph 3 of §8 (Conclusions).

-------------

Minor issues: - page 4: each of which are defined - page 5: tthe features - page 5: equation (1) could be made more readable, adding a symbol between p', o' and , and explaining the meaning of G - page 7: "the class where distance, or proximity, is minimized" (the use of term "proximity" is misleading: if distance is minimized, proximity is maximized)

Response: We have fixed the cited errors. For Equation (1) on page 5 we have made the font size larger and explained the meaning of G, but are unsure what the reviewer means by ‘adding a symbol’. We have also amended the error on page 7 so that distance is minimised.

-------------

REVIEWER 3

This paper describes interesting and relatively novel work in identifying relevant events from tweets and aligning them using an ML-based approach. It does strike me a little bit as a solution looking for a problem rather than vice versa, but the work is interesting nevertheless. In the case of conferences, the usefulness of this alignment is a bit clearer, but I'm not entirely convinced how widespread the problem really is. In many cases, for example, just identifying key concepts in the tweets would be sufficient, without the alignment to LOD. In the introduction, the authors mention that user profiling could be enhanced through such techniques: however, I think simple concept identification in the tweets could deal with this problem sufficiently. Further evidence to show the extent of the problem (outside the SW conference field) would be useful, though not essential.

Response: Our motivation for aligning tweets with events was to provide explicit links between content citing an event and the event itself. We agree with the reviewer that the utility of such an approach is more evident when one considers conferences - where the presenters may wish to know exactly what was being said about their work and by whom. As noted in our previous responses above, we plan to apply our approach to conference data from non-Semantic Web conference fields, to explore whether the performance of our approach differs at all and which features play a greater role.

However, we do not agree that a concept identification approach would be sufficient. If the reviewer means the detection of which concepts are referred to in tweets, then we have demonstrated the futility of this method in the third feature set: using the Zemanta API. The condensed form to which tweets are restricted limits concept identification. We observed this empirically in our experiments, given that only 213 tweets of the 1082 in our dataset were identified as citing DBPedia concepts. This does leave room for exploration though, possibly by trying additional concept identification methods and third party services, and exploring how they contribute to alignment accuracy. Such an investigation could also provide a useful comparative study between the different services.

-------------

In general, the methodology seems reasonable, but it would have been interesting to see how the ML approach compares with a standard NLP approach where ontology-based information extraction is performed on the tweets in order to link key concepts to the information in the ontologies. There are plenty of techniques for this kind of work, which should at least be mentioned in the related work.

Response: We have added several pieces of related work to our approach from the domain of Ontology-based Information Extraction - these can be found in the final paragraph of §7 (Related Work). We have described the similarities between such approaches and our work, and have commented on the suitability of the evaluation measures used to assess such work given the hierarchical structure of concept relations.

-------------

It would also be useful to evaluate separately the different stages of the methodology: for example, how well does the Zemanta-based concept enrichment work? A few words about how this approach resolves the ambiguity issues mentioned, would be useful here (at the end of Section 3).

Response: As mentioned above, a useful further study would be to compare the performance of third party concept identification services, and in particular assess the performance of Zemanta. However within this paper our thesis was that semantics provide a useful means through which tweets can be aligned with events. It was not our intention to evaluate the accuracy of a third party service within this paper, merely to use one as a means through which we were able to obtain concepts for tweets and events.

-------------

In section 4.1, I am not quite clear what you mean by "the abstract form of the task"? This section is also a little confusing: while you do refer to a running example, more precise details of the example would be useful here, e.g. showing exactly the set of triples extracted, what you mean by "the surrounding contextual information", examples of the DBPedia concepts extracted, and so on. A little bit more detail in this section, which forms the meat of the work, would not go amiss.

Response: We agree with the reviewer that ‘abstract form of the task’ was too fuzzy, so we have amended this and stated that we explore differing methods to induce the labelling function in the paper. We have used Fig. 5 as an example of what is associated with a given event URI - in this case a paper - and have added two additional figures, Fig. 6 and Fig. 7, as illustrative examples of what is returned using F1 and F2 respectively, showing the exact triples.

We have removed the statement of ‘contextual information’ and have replaced this with an explanation of using information that is one step away from the URI in the Linked Data graph space. To provide an example of the DBPedia concepts returned when querying Zemanta we have also added an additional figure (Fig. 8) using the running example from Fig. 5.

-------------

In the evaluation section, I was a little surprised to see that the combination of F1+F2 leads to such an improvement on F1 alone, given that F2 results are so low. Do you have any explanation for this? The evaluation results look promising, but it would be nice to have seen a slightly more comprehensive evaluation given the lack of existing similar systems to compare with.

Response: The improvement in results when utilising more features is largely due to the inclusion of resource leaves when inducing the labelling function. As our results show, the use of F1 provides information describing the event - such as the paper title, abstract and key terms that will also be found within tweets. When combining this information with 1-step away information - i.e. the names of the authors - we are able to boost our performance over the use of solely F1, as we provide additional terms that can be matched against tweets. The poor performance yielded when using F2 is due to the relatively low information content found within the bag-of-words model for that feature set. Taking the example that we have now added to the paper in Fig. 7 we see that only the author’s name would be used, while the use of F1 provides the abstract, title and keywords associated with the paper. This explains the poor performance levels that are produced when using F2 combined with F3, given that the dimensionality of the feature vectors is relatively low when compared to the use of F1 alone - in the former case the feature vector would be indexed to include the terms from the name of the author, and the concepts returned from Zemanta for the event, while the latter would include a larger index - given the various terms provided in the abstract and title of the event.

As mentioned above, we plan to re-run our methods over a dataset containing tweets from alternative conferences and test the alignment performance. However, one of the time-consuming aspects of evaluating such work is the construction of a gold standard against which assessments can be made. As we document in the paper in §5.2, several iterations were needed to reach a sufficient level of agreement between raters and produce a suitable gold standard for assessing our methods.