Web-scale named entity recognition

Authors:
Casey Whitelaw; Alex Kehlenbeck; Nemanja Petrovic; Lyle Ungar
Affiliations:
Google, New York, NY, USA; Google, New York, NY, USA; Google, New York, NY, USA; University of Pennsylvania, Philadelphia, PA, USA
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Abstract

Automatic recognition of named entities such as people, places, organizations, books, and movies across the entire web presents challenges of both scale and scope. Data for training general named entity recognizers is difficult to come by, and efficient machine learning methods are required once we have found hundreds of millions of labeled observations. We present an implemented system that addresses these issues. It includes a method for automatically generating training data and a multi-class online classification training method that learns to recognize not only high-level categories such as place and person, but also more fine-grained categories such as soccer players, birds, and universities. The resulting system gives precision and recall comparable to that obtained for more limited entity types in much more structured domains, such as company recognition in newswire, even though web documents often lack consistent capitalization and grammatical sentence construction.
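
The abstract does not say which online learner the system uses, so the following is only an illustrative sketch of what multi-class online training can look like. It assumes a passive-aggressive (PA-I) update over sparse features, swapped in here as one well-known online algorithm. The class name, feature names, labels, and toy examples are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class MulticlassPA:
    """Illustrative multiclass passive-aggressive (PA-I) learner.

    Assumption: this is a generic online learner, not the paper's system.
    Features are sparse dicts mapping feature name -> value.
    """

    def __init__(self, labels, C=1.0):
        self.labels = list(labels)
        self.C = C  # aggressiveness cap on the step size
        self.weights = {y: defaultdict(float) for y in self.labels}

    def score(self, feats, label):
        w = self.weights[label]
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    def predict(self, feats):
        return max(self.labels, key=lambda y: self.score(feats, y))

    def update(self, feats, gold):
        """Process one labeled observation; return the hinge loss before the update."""
        # Highest-scoring incorrect label.
        rival = max((y for y in self.labels if y != gold),
                    key=lambda y: self.score(feats, y))
        margin = self.score(feats, gold) - self.score(feats, rival)
        loss = max(0.0, 1.0 - margin)
        sq_norm = sum(v * v for v in feats.values())
        if loss == 0.0 or sq_norm == 0.0:
            return loss  # passive: margin already satisfied (or empty input)
        # The gold and rival weight vectors both move, hence the factor 2.
        tau = min(self.C, loss / (2.0 * sq_norm))
        for f, v in feats.items():
            self.weights[gold][f] += tau * v
            self.weights[rival][f] -= tau * v
        return loss


def train(model, examples, epochs=5):
    """Stream over the examples one at a time for a fixed number of passes."""
    for _ in range(epochs):
        for feats, gold in examples:
            model.update(feats, gold)
    return model


# Hypothetical toy observations: (sparse features, label).
examples = [
    ({"capitalized": 1.0, "context=born": 1.0}, "person"),
    ({"capitalized": 1.0, "context=located": 1.0}, "place"),
    ({"lowercase": 1.0, "context=flies": 1.0}, "bird"),
]
model = train(MulticlassPA(["person", "place", "bird"]), examples)
print([model.predict(feats) for feats, _ in examples])
```

Because each update touches only the features present in one observation and only two label vectors, the cost per example is independent of the total number of categories' weights, which is the property that makes online learners attractive when the training set is very large.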