Extremely fast text feature extraction for classification and indexing

Authors:
George Forman; Evan Kirshenbaum
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA, USA (both authors)
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Abstract

Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large-scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
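
The folding of the four steps into one pass can be illustrated with a minimal sketch. This is not the authors' implementation: the multiplier 31, the 32-bit mask, the Latin-1 input assumption, and the names CODE_TABLE, hash_features, and hash_word are all hypothetical choices made here for illustration. The idea shown is that a single precomputed table can convert each input byte to a character code, lowercase it, and flag word boundaries at once, while an integer hash is accumulated on the fly so that no word string is ever built.

```python
from typing import Iterator, List

# Illustrative parameters only; the paper's actual hash function is not given here.
HASH_MULTIPLIER = 31
HASH_MASK = 0xFFFFFFFF  # keep every hash within 32 bits


def _build_code_table() -> List[int]:
    """Map each Latin-1 byte to its lowercase code point, or 0 for a non-word byte."""
    table = [0] * 256
    for byte in range(256):
        ch = chr(byte)  # assumption: Latin-1, so byte value == code point
        if ch.isalnum():
            lowered = ch.lower()
            # Fall back to the original code if lowercasing leaves Latin-1.
            if len(lowered) == 1 and ord(lowered) < 256:
                table[byte] = ord(lowered)
            else:
                table[byte] = byte
    return table


CODE_TABLE = _build_code_table()


def hash_features(data: bytes) -> Iterator[int]:
    """Yield one integer hash per word in a single pass over the raw bytes."""
    h = 0
    in_word = False
    for byte in data:
        code = CODE_TABLE[byte]  # decoding, lowercasing, boundary test in one lookup
        if code:
            h = (h * HASH_MULTIPLIER + code) & HASH_MASK
            in_word = True
        elif in_word:
            yield h  # a non-word byte ends the current word
            h = 0
            in_word = False
    if in_word:
        yield h  # flush the final word if the input ends mid-word


def hash_word(word: str) -> int:
    """Two-step reference: lowercase the string first, then hash it."""
    h = 0
    for byte in word.lower().encode("latin-1"):
        h = (h * HASH_MULTIPLIER + byte) & HASH_MASK
    return h


if __name__ == "__main__":
    print(list(hash_features(b"Hello, WORLD! hello")))
```

In this sketch the first and third features are identical because "Hello" and "hello" fold to the same lowercase codes, and the resulting integers could be used directly as feature identifiers in place of word strings.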