High-precision phrase-based document classification on a modern scale

Authors:
Ron Bekkerman;Matan Gavish
Affiliations:
LinkedIn Corporation, Mountain View, CA, USA;Stanford University, Palo Alto, CA, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Citing 11
Cited 5

Elements of information theory

Elements of information theory
A training algorithm for optimal margin classifiers

COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
An evaluation of phrasal and clustered representations on a text categorization task

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Information extraction as a basis for high-precision text classification

ACM Transactions on Information Systems (TOIS)
Lazy learning

Lazy learning
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Text Categorization Based on Regularized Linear Classification Methods

Information Retrieval
Distributional word clusters vs. words for text categorization

The Journal of Machine Learning Research
A web-based kernel function for measuring the similarity of short text snippets

Proceedings of the 15th international conference on World Wide Web
Learning from labeled features using generalized expectation criteria

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Learning and generalization with the information bottleneck

Theoretical Computer Science

Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Unsupervised multi-label text classification using a world knowledge ontology

PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Semantic Labelling for Document Feature Patterns Using Ontological Subjects

WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Scaling big data mining infrastructure: the twitter experience

ACM SIGKDD Explorations Newsletter
Mapping semantic knowledge for unsupervised text categorisation

ADC '13 Proceedings of the Twenty-Fourth Australasian Database Conference - Volume 137

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a document classification system that employs lazy learning from labeled phrases, and argue that the system can be highly effective whenever the following property holds: most of information on document labels is captured in phrases. We call this property near sufficiency. Our research contribution is twofold: (a) we quantify the near sufficiency property using the Information Bottleneck principle and show that it is easy to check on a given dataset; (b) we reveal that in all practical cases---from small-scale to very large-scale---manual labeling of phrases is feasible: the natural language constrains the number of common phrases composed of a vocabulary to grow linearly with the size of the vocabulary. Both these contributions provide firm foundation to applicability of the phrase-based classification (PBC) framework to a variety of large-scale tasks. We deployed the PBC system on the task of job title classification, as a part of LinkedIn's data standardization effort. The system significantly outperforms its predecessor both in terms of precision and coverage. It is currently being used in LinkedIn's ad targeting product, and more applications are being developed. We argue that PBC excels in high explainability of the classification results, as well as in low development and low maintenance costs. We benchmark PBC against existing high-precision document classification algorithms and conclude that it is most useful in multilabel classification.