Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Authors:
William W. Cohen; Sunita Sarawagi
Affiliations:
Carnegie Mellon University, Pittsburgh, PA; IIT Bombay, Mumbai, India
Venue:
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2004


Abstract

We consider the problem of improving named entity recognition (NER) systems by using external dictionaries. More specifically, we consider the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance NER systems operate by sequentially classifying words according to whether or not they participate in an entity name, whereas the most useful similarity measures score entire candidate names. To correct this mismatch, we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words rather than single words. In addition to providing a natural way of coupling high-performance NER methods with high-performance similarity functions, this formalism allows the direct use of other useful entity-level features and gives a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.
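
To make the segment-level idea concrete, here is a minimal sketch of semi-Markov decoding in Python. It is not the authors' implementation. It illustrates how scoring whole segments lets a dictionary similarity feature apply to an entire candidate name. Everything specific here is an assumption made for illustration: the function names, the hand-set threshold that stands in for learned weights, the use of token-set Jaccard similarity as the similarity measure, and the restriction of non-entity ("O") segments to a single word.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
Scorer = Callable[[List[str], int, int, str, Optional[str]], float]


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard similarity (an illustrative stand-in measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def make_scorer(dictionary: List[str], threshold: float = 0.6) -> Scorer:
    """Build a toy segment scorer. The threshold is a hand-set, hypothetical
    substitute for weights that a real system would learn from data."""
    entries = [entry.lower().split() for entry in dictionary]

    def score(tokens, start, end, label, prev_label):
        if label == "O":
            return 0.0
        candidate = [t.lower() for t in tokens[start:end]]
        sim = max((jaccard(candidate, e) for e in entries), default=0.0)
        # Entity-level feature: similarity of the WHOLE segment to the dictionary.
        return sim - threshold

    return score


def semi_markov_viterbi(tokens: List[str], labels: List[str],
                        max_len: int, score: Scorer) -> List[Segment]:
    """Return the highest-scoring segmentation of tokens into labeled segments."""
    n = len(tokens)
    neg_inf = float("-inf")
    # best[i][y]: best score over segmentations of tokens[:i] ending in label y.
    best = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            # Assumption: non-entity segments always have length 1.
            longest = 1 if y == "O" else min(max_len, i)
            for d in range(1, longest + 1):
                j = i - d
                for prev in ([None] if j == 0 else labels):
                    base = 0.0 if j == 0 else best[j][prev]
                    if base == neg_inf:
                        continue
                    total = base + score(tokens, j, i, y, prev)
                    if total > best[i][y]:
                        best[i][y] = total
                        back[i][y] = (j, prev)
    if n == 0:
        return []
    # Trace back from the best final label.
    y = max(labels, key=lambda lab: best[n][lab])
    i, segments = n, []
    while i > 0:
        j, prev = back[i][y]
        segments.append((j, i, y))
        i, y = j, prev
    segments.reverse()
    return segments


def extract_entities(tokens: List[str], dictionary: List[str],
                     max_len: int = 4) -> List[str]:
    """Join the tokens of every segment labeled ENT into a string."""
    segments = semi_markov_viterbi(tokens, ["O", "ENT"], max_len,
                                   make_scorer(dictionary))
    return [" ".join(tokens[s:e]) for s, e, lab in segments if lab == "ENT"]


if __name__ == "__main__":
    tokens = ["talk", "by", "William", "W.", "Cohen", "at", "CMU"]
    print(extract_entities(tokens, ["William Cohen", "Sunita Sarawagi"]))
```

With these toy settings, the call in the `__main__` block extracts the single span `William W. Cohen`. No individual word in that span is similar enough to a dictionary entry to be accepted on its own, but the three-word segment is, which is the mismatch between word-level classification and name-level similarity that the abstract describes.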