Effective named entity recognition for idiosyncratic web collections

Authors:
Roman Prokofyev;Gianluca Demartini;Philippe Cudré-Mauroux
Affiliations:
University of Fribourg, Fribourg, Switzerland;University of Fribourg, Fribourg, Switzerland;University of Fribourg, Fribourg, Switzerland
Venue:
Proceedings of the 23rd international conference on World wide web
Year:
2014

Abstract

Named Entity Recognition (NER) plays an important role in a variety of online information management tasks, including text categorization, document clustering, and faceted search. While recent NER systems can achieve near-human performance on certain document types, such as news articles, they remain highly domain-specific and thus cannot effectively identify entities such as original technical concepts in scientific documents. In this work, we propose novel approaches for NER on distinctive document collections (such as scientific articles) based on n-gram inspection and classification. We design and evaluate several entity recognition features, ranging from well-known part-of-speech tags to n-gram co-location statistics and decision trees, to classify candidates. In addition, we show how external knowledge bases (either domain-specific like DBLP or generic like DBpedia) can be leveraged to improve the effectiveness of NER for idiosyncratic collections. We evaluate our system on two test collections created from a set of Computer Science and Physics papers and compare it against state-of-the-art supervised methods. Experimental results show that a careful combination of the features we propose yields up to 85% NER accuracy over scientific collections and substantially outperforms state-of-the-art approaches such as those based on maximum entropy.
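
To make the n-gram inspection idea more concrete, here is a minimal Python sketch of how candidate n-grams might be enumerated and given a few simple features before classification. It is not the authors' implementation. The function names, the PMI-style co-location score (computed for bigrams only), and the toy `knowledge_base` set standing in for a resource such as DBLP or DBpedia are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram (as a tuple) with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def candidate_features(tokens, knowledge_base, max_n=3):
    """Map each candidate n-gram to a small feature dictionary.

    `knowledge_base` is a hypothetical set of lowercase entity labels,
    used here as a stand-in for an external knowledge base lookup.
    """
    counts = Counter(ngrams(tokens, max_n))
    total = len(tokens)
    features = {}
    for gram, freq in counts.items():
        # Assumed co-location statistic: pointwise mutual information,
        # estimated from in-document counts and only for bigrams.
        pmi = 0.0
        if len(gram) == 2:
            p_xy = freq / max(total - 1, 1)
            p_x = counts[(gram[0],)] / total
            p_y = counts[(gram[1],)] / total
            pmi = math.log(p_xy / (p_x * p_y))
        features[gram] = {
            "length": len(gram),
            "frequency": freq,
            "in_kb": " ".join(gram).lower() in knowledge_base,
            "pmi": pmi,
        }
    return features
```

Feature dictionaries of this kind could then be passed to any supervised classifier, such as a decision tree, to decide which candidates are entities. Part-of-speech features are omitted here because they would require an external tagger.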