Most research on speeding up text mining involves algorithmic improvements to induction algorithms, yet for many large-scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features yield classifiers with statistical performance equivalent to that of classifiers built using string word features, while requiring far less computation and less memory.
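To make the idea of folding these steps together concrete, here is a minimal Python sketch of a single pass that lowercases, detects word boundaries, and accumulates an integer hash per word without ever building word strings. It is an illustration under stated assumptions, not the paper's implementation: the names `TABLE`, `hash_features`, and `hash_word`, the multiplier 31, and the 32-bit mask are all hypothetical choices, and the input is assumed to be an already decoded Python `str` rather than raw bytes, so the Unicode conversion step is left to the runtime.

```python
MASK = 0xFFFFFFFF  # assumption: keep hashes within 32 bits

# Lookup table for code points below 256: the lowercase code point of a
# word character, or 0 for any character that ends a word.
TABLE = [0] * 256
for _c in range(256):
    _ch = chr(_c)
    if _ch.isalnum():
        TABLE[_c] = ord(_ch.lower())


def hash_features(text):
    """Return one integer hash per word, computed in a single pass."""
    features = []
    h = 0
    in_word = False
    for ch in text:
        c = ord(ch)
        if c < 256:
            code = TABLE[c]  # lowercasing and boundary test in one lookup
        else:
            # Simplification: no case folding above code point 255.
            code = c if ch.isalnum() else 0
        if code:
            # Assumed hash: a simple multiplicative rolling hash.
            h = (h * 31 + code) & MASK
            in_word = True
        elif in_word:
            features.append(h)  # word boundary: emit the finished hash
            h = 0
            in_word = False
    if in_word:
        features.append(h)  # emit the last word if the text ends inside one
    return features


def hash_word(word):
    """Reference hash of an already lowercased word, for comparison only."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) & MASK
    return h
```

Calling `hash_features("Hello, WORLD hello")` returns three integers, the first and third of which are equal, because both occurrences of the word fold to the same lowercase form before hashing. The single table lookup per character is what lets lowercasing and boundary detection share one step in this sketch.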