Understanding Large Persons’ Networks through Semantic Categorization

Tracking #: 1318-2530

Alessio Palmero Aprosio
Sara Tonelli
Stefano Menini
Giovanni Moretti

Responsible editor: 
Andreas Hotho

Submission type: 
Full Paper
In this work, we describe a methodology to interpret large persons’ networks extracted from text by classifying cliques using the DBpedia ontology. The approach and the challenges faced when building networks based on persons’ co-occurrence are discussed in detail, especially the problem of mention normalisation and coreference resolution. The classification methodology that first starts from single nodes and then generalises to cliques is effective in terms of performance and is able to deal also with nodes that are not linked to Wikipedia. The gold standard manually developed for evaluation shows that groups of co-occurring entities share in most of the cases a category that can be automatically assigned. The outcome of this work may be of interest in a Big Data scenario to enhance the visualisation of large networks and to provide an additional semantic layer on top of cliques, so as to ease the comprehension of the network from a distance.

Solicited Reviews:
Review #1
Anonymous submitted on 20/Feb/2016
Review Comment:

This paper discusses methods to understand large social network data. The broad goal is to develop tools that can help humanities scholars understand large network data. Developing this system is part of an ongoing project (ALCIDE), of which this paper is a part.

The system itself consists of the following parts, applied to a given plain-text input:
(1) Detect named entities in the data, using a variety of existing NLP tools.
(2) Use filters to match the entities to Wikipedia pages.
(3) Build a social network of person-to-person links, based on co-occurrence. Normalize based on the frequency of mentions.
(4) Find cliques in the data up to a certain size. DBpedia classes are used to identify cliques consisting of people specifically.
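
Steps (3) and (4) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_count` threshold (a crude stand-in for the frequency-based normalisation, whose exact form is not described here), the `max_size` filter, and the basic Bron-Kerbosch enumeration are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(sentences, min_count=1):
    """Link two persons when they co-occur in at least min_count sentences.

    `sentences` is a list of lists of person names (hypothetical input,
    assumed to be already normalised to one name per person).
    """
    pair_counts = Counter()
    for persons in sentences:
        for a, b in combinations(sorted(set(persons)), 2):
            pair_counts[(a, b)] += 1
    graph = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph, max_size=None):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    if max_size is not None:
        # Assumed reading of "up to a certain size": drop larger cliques.
        cliques = [c for c in cliques if len(c) <= max_size]
    return sorted(cliques)


# Toy input with placeholder names, not data from the paper.
toy_sentences = [["A", "B", "C"], ["A", "B"], ["B", "D"]]
toy_graph = build_cooccurrence_graph(toy_sentences)
print(maximal_cliques(toy_graph))
```

Raising `min_count` keeps only pairs that co-occur repeatedly, which is one simple way to suppress incidental links before searching for cliques.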

The presentation of the paper is a little long-winded -- the basic setup above took 5 pages to describe, but (other than a few minor technical details) there's really not much there beyond this basic setup.

This is mainly a combination of known and standard techniques. For this reason there is not much to evaluate in terms of technical contribution; the system is worthwhile only to the extent that this combination of parts would be useful to a user.

The experiments, however, are limited to measuring the accuracy of the system in correctly identifying entities. Given that the contribution of the system is not really accurate entity classification (nor are state-of-the-art techniques for entity classification considered), this is not a very compelling way to evaluate the contribution. Although the authors do make a link to their system available, without some kind of user study it is hard to evaluate the system properly.

> The paper aims to visualize free-text data from the humanities by finding network structure between the individuals being discussed
> The technical contribution is largely a combination of known parts, and the presentation is a little too long-winded
> The method is evaluated in terms of its ability to correctly classify named entities, though this does not seem the most relevant way to evaluate such a system compared to a user study

Review #2
By Frank Puppe submitted on 22/Feb/2016
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include
(1) originality: high
(2) significance of the results: high
(3) quality of writing: good

The paper is well written and describes an innovative approach to infer the category of a person based on the automatically derived category of a clique of persons it belongs to. I like this paper and recommend its acceptance. Nevertheless, the evaluation in its current form is not convincing. I am unsure whether this is a "minor revision" or a "major revision", because I do not know whether additional evaluations would show good or bad results.

A minor concern is that, when I read the abstract and introduction, I did not really understand this contribution, because they were too general. It only became clear in the evaluation and conclusion:

"Finally, we presented and evaluated a strategy to assign a category to the nodes in a clique and then,
by generalisation, to the whole clique. The approach yields good results, especially at clique level, and is able to classify also entities that are not present in Wikipedia. … To our knowledge, this hypothesis was never proved before, and the clique classification task based on DBpedia ontology is an original contribution of this work."

Therefore, the main contribution should be stated more clearly in abstract and introduction.

My main concern is the scope of the evaluation:

1) The authors used only 50 manually labeled cliques, 6 of which belong to the most general category "person". It would be helpful to state the categories of the other 44 cliques, as well as the categories of the additional nodes labeled on the basis of their clique category. My concern is that, if most of the additional non-linked entities that receive a category through clique membership belong to the category "politician", assigning this class to non-linked entities is not very informative, because most persons in this domain are probably politicians. A simple baseline would be to compute precision and recall when all non-linked entities are assigned the category "politician" (and likewise for all non-linked entities excluding the highly ambiguous ones).
2) The evaluation should not only be based on the selection of 50 cliques, but also on the selection of x randomly selected groups of consecutive sentences, because already recognized cliques might imply a bias.
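
The constant-label baseline suggested in point 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `skip` parameter (standing in for the highly ambiguous entities), and the definitions of precision (correct over assigned) and recall (correct over all gold entities) are hypothetical choices and may differ from the measures used in the paper. The toy labels are invented and are not results.

```python
def constant_baseline(gold, label="Politician", skip=frozenset()):
    """Score a baseline that assigns `label` to every non-skipped entity.

    gold: dict mapping entity name -> gold category (hypothetical format).
    skip: entities left unclassified, e.g. highly ambiguous ones.
    Returns (precision, recall) under the assumed definitions above.
    """
    assigned = [entity for entity in gold if entity not in skip]
    correct = sum(1 for entity in assigned if gold[entity] == label)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy labels with placeholder entity names.
toy_gold = {
    "e1": "Politician",
    "e2": "Politician",
    "e3": "Journalist",
    "e4": "Politician",
}
print(constant_baseline(toy_gold))
print(constant_baseline(toy_gold, skip={"e3"}))
```

If such a baseline scored close to the clique-based assignment on the non-linked entities, the added value of the clique category would be hard to argue.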

Review #3
Anonymous submitted on 14/Mar/2016
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper offers an approach to aid in the analysis of large corpora by identifying cliques of people and classifying the people and the cliques with respect to an ontology based on DBpedia.

Although the title and several places in the text emphasize the interest in large corpora, the scale of the experiments is quite small by current standards. The entire corpus is less than 2 million words and comes basically from two sources (Kennedy and Nixon). The small size and limited number of sources make cross-document coreference a much smaller problem.

We also note that the cliques produced are quite small. The gold standard has 204 people in 50 cliques, an average of about 4 people per clique. The paper does not explain why cliques are the best choice, rather than clusters with a high (but not maximal) degree of connectivity.
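
One way to make the alternative concrete is an edge-density test: a clique has density 1.0, while a group that misses a few links could still be accepted above a chosen threshold. The sketch below is an assumption about how such a relaxed criterion might look, not something taken from the paper; the function name, the adjacency-dict format, and the toy graph are hypothetical.

```python
from itertools import combinations


def edge_density(graph, nodes):
    """Fraction of possible links among `nodes` that are present in `graph`.

    graph: dict mapping node -> set of neighbours (undirected, hypothetical).
    """
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return 0.0
    present = sum(
        1 for a, b in combinations(nodes, 2) if b in graph.get(a, set())
    )
    return present / possible


# Toy graph: four placeholder nodes with every link except C-D.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B"},
}
print(edge_density(toy_graph, ["A", "B", "C"]))       # a clique
print(edge_density(toy_graph, ["A", "B", "C", "D"]))  # one link missing
```

With a threshold of, say, 0.8, the four-node group would be kept as a single cluster, whereas clique detection would split it into two overlapping triangles.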

Limiting the sources is likely to make the person classification easier as well. In particular, many of the people mentioned in the documents are politicians or government employees, leading one to wonder how well a simple baseline would do.

Finally, the title needs some repair. As written it has a clear second reading, involving large people. "Large Networks of People" would be better, or perhaps "Large" should be dropped entirely.

Overall assessment: For the most part the system is built from existing packages, and so the value of the system lies primarily in its integration, and the added functionality of semantic classification. The integration may provide some guidance to developers of similar systems, but is relatively straightforward. So the paper must be judged on the value of the new functionality. How helpful was the added functionality (when compared to a baseline system just providing links between individuals)? We are told that semantic classification was requested by a focus group. Once classification was added, what was user reaction? Was clique classification at all helpful? Answering such questions is not easy, but would greatly increase the value of the paper.