This paper introduces a technique for acquiring documents that belong to the same category as a given short text. Treating the given text as a training document, the system marks the most similar document, or all sufficiently similar documents, in the document domain (or the entire Web). It then adds the marked documents to the training set, relearns from the enlarged set, and repeats the process until no more documents are marked. Requiring the similarity to increase monotonically as the system learns enables it to 1) detect the point at which no documents remain to be marked and 2) determine the threshold value that the classifier uses. In addition, when normalization is restricted to dividing term weights by a p-norm of the weights, the linear classifier whose training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity using English and German documents randomly selected from the Web.
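The iterative marking loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact similarity formula or the rule for choosing the threshold, so the function names (`tokenize`, `similarity`, `acquire`), the whitespace tokenizer, the scoring rule (sum of training-document counts over the terms of a binary document vector, divided by that vector's p-norm), and the fixed `threshold` argument are all assumptions made for illustration. Under these assumptions the term weights only ever grow, so the score of any fixed document never decreases as training documents are added.

```python
from collections import Counter


def tokenize(text):
    """Binary indexing (assumed): a document is the set of its lowercased terms."""
    return frozenset(text.lower().split())


def similarity(weights, doc_terms, p=2.0):
    """Assumed linear score of a binary document vector, divided by its p-norm."""
    if not doc_terms:
        return 0.0
    norm = len(doc_terms) ** (1.0 / p)  # p-norm of a vector of |doc_terms| ones
    return sum(weights[t] for t in doc_terms) / norm


def acquire(seed_text, pool, threshold, p=2.0):
    """Grow the training set from one seed text until no new document is marked.

    `threshold` is a hypothetical fixed input; the paper derives its own value.
    Returns the indices of marked pool documents in the order they were marked.
    """
    docs = {i: tokenize(text) for i, text in enumerate(pool)}
    # weights[t] = number of training documents that contain term t
    weights = Counter(tokenize(seed_text))
    marked = []
    while True:
        newly = [i for i, terms in docs.items()
                 if i not in marked and similarity(weights, terms, p) >= threshold]
        if not newly:
            return marked  # stopping condition: nothing left to mark
        for i in newly:
            weights.update(docs[i])  # add the marked document to the training set
            marked.append(i)


if __name__ == "__main__":
    pool = [
        "python code example",
        "code example tutorial",
        "cooking recipe pasta",
        "example tutorial video",
    ]
    print(acquire("python code", pool, threshold=1.0))
```

In this toy pool the second and fourth documents fall below the threshold in the first round and are marked only after earlier marked documents have raised the weights of their shared terms, while the unrelated third document is never marked.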