Entity categorization over large document collections

Authors:
Venkatesh Ganti;Arnd C. König;Rares Vernica
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;University of California, Irvine, Irvine, CA, USA
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

Extracting entities (such as people and movies) from documents and identifying the categories they belong to (such as painter or writer) enable structured querying and data analysis over unstructured document collections. This paper focuses on the problem of categorizing extracted entities. Most prior approaches developed for this task analyzed only the local document context in which an entity occurs. We significantly improve the accuracy of entity categorization by (i) considering an entity's context across multiple documents containing it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.
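
The two signals named in the abstract, context aggregated across documents and co-occurrence with lists of related entities, can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify. It is a naive in-memory illustration that assumes single-token, lowercased entity names and whitespace tokenization. The function names `aggregate_context` and `list_cooccurrence`, the `window` parameter, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_context(documents, entities, window=2):
    """Sum bag-of-words context counts per entity over ALL documents.

    Assumption: entities are single lowercased tokens; the context is the
    `window` tokens on each side of a mention.
    """
    contexts = defaultdict(Counter)
    for text in documents:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok in entities:
                lo = max(0, i - window)
                neighbors = tokens[lo:i] + tokens[i + 1:i + 1 + window]
                contexts[tok].update(neighbors)
    return contexts


def list_cooccurrence(documents, entities, related_list):
    """Count, per entity, the documents in which it appears together with
    at least one OTHER member of a list of related entities."""
    related = {r.lower() for r in related_list}
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        hits = tokens & related
        for e in entities & tokens:
            if hits - {e}:
                counts[e] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy collection, for illustration only.
    docs = [
        "picasso painted guernica in 1937",
        "the museum shows works by picasso and matisse",
        "spielberg directed the film jaws",
    ]
    entities = {"picasso", "spielberg"}
    painters = ["matisse", "monet"]

    print(aggregate_context(docs, entities)["picasso"])
    print(list_cooccurrence(docs, entities, painters))
```

Both functions scan every document and hold all counts in memory. At the scale the abstract describes (many documents, very large lists), that naive approach is where the computational challenges (a) and (b) would arise.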