Entity set expansion in opinion documents

  • Authors:
  • Lei Zhang;Bing Liu

  • Affiliations:
  • University of Illinois at Chicago, Chicago, USA;University of Illinois at Chicago, Chicago, USA

  • Venue:
  • Proceedings of the 22nd ACM conference on Hypertext and hypermedia
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Opinion mining has been an active research area in recent years. The task is to extract opinions expressed on entities and their attributes. For example, the sentence, "I love the picture quality of Sony cameras," expresses a positive opinion on the picture quality attribute of Sony cameras. Sony is the entity. This paper focuses on mining entities (e.g., Sony). This is an important problem because without knowing the entity, the extracted opinion is of little use. The problem is similar to the classic named entity recognition problem. However, there is a major difference. In a typical opinion mining application, the user wants to find opinions on some competing entities, e.g., competing or relevant products. However, he/she often can only provide a few names as there are too many of them. The system has to find the rest from a corpus. This implies that the discovered entities must be of the same type/class. This is the set expansion problem. Classic methods for solving the problem are based on distributional similarity. However, we found this method is inaccurate. We then employ a learning-based method called Bayesian Sets. However, directly applying Bayesian Sets produces poor results. We then propose a more sophisticated way to use Bayesian Sets. This method, however, causes two major problems: entity ranking and feature sparseness. For entity ranking, we propose a re-ranking method to solve the problem. For feature sparseness, we propose two methods to re-weight features and to determine the quality of features. These methods help improve the mining results substantially. Additionally, like any learning algorithm, Bayesian Sets requires the user to engineer a set of features. We design some generic features based on part-of-speech tags of words for learning, which thus does not need to engineer features for each specific domain. Experimental results using 10 real-life datasets from diverse domains demonstrated the effectiveness of the proposed technique.