A scalable machine-learning approach for semi-structured named entity recognition

Authors:
Utku Irmak;Reiner Kraft
Affiliations:
Yahoo! Inc, Santa Clara, CA, USA;Yahoo! Inc, Sunnyvale, CA, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

Named entity recognition studies the problem of locating and classifying parts of free text into a set of predefined categories. Although extensive research has focused on the detection of person, location and organization entities, there are many other entities of interest, including phone numbers, dates, times and currencies (to name a few examples). We refer to these types of entities as "semi-structured named entities", since they usually follow certain syntactic formats according to some conventions, although their structure is typically not well-defined. Regular expression solutions require a significant amount of manual effort, and supervised machine learning approaches rely on large sets of labeled training data. Therefore, these approaches do not scale when we need to support many semi-structured entity types in many languages and regions. In this paper, we study this problem and propose a novel three-level bootstrapping framework for the detection of semi-structured entities. We describe the proposed techniques for phone, date and time entities, and perform extensive evaluations on English, German, Polish, Swedish and Turkish documents. Despite the minimal input from the user, our approach achieves, on average, 95% precision and 84% recall for phone entities, and 94% precision and 81% recall for date and time entities. We also discuss implementation details and report run-time performance results, which show significant improvements over regular-expression-based solutions.
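
The scaling problem that the abstract attributes to hand-written patterns can be illustrated with a minimal sketch. This is not the paper's method or its baseline: the pattern `US_PHONE`, the helper `find_phones`, and the sample strings are assumptions made up for illustration. The sketch shows a pattern written for one regional phone convention and how it misses numbers written under other conventions, so each new region or entity type would need another manually crafted rule.

```python
import re

# Hypothetical hand-written pattern covering only one convention:
# three digits (optionally parenthesized), a separator, three digits,
# a separator, four digits. Not taken from the paper.
US_PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def find_phones(text: str) -> list[str]:
    """Return every substring of `text` matched by the single pattern."""
    return [m.group(0) for m in US_PHONE.finditer(text)]


if __name__ == "__main__":
    samples = [
        "Call (408) 555-0100 or 408-555-0199.",  # matches the convention
        "Berlin: 030 1234567",                   # different grouping
        "Istanbul: 0212 555 01 00",              # different grouping
    ]
    for s in samples:
        print(s, "->", find_phones(s))
```

Covering the second and third samples would require additional region-specific patterns, which is the manual effort the abstract refers to.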