Monotone Increasing Binary Similarity and Its Application to Automatic Document-Acquisition of a Category

Authors:
Izumi Suzuki;Yoshiki Mikami;Ario Ohsato
Venue:
IEICE - Transactions on Information and Systems
Year:
2008

Abstract

A technique is introduced that acquires documents belonging to the same category as a given short text. Treating the given text as a training document, the system marks the most similar document, or all sufficiently similar documents, in the document domain (or the entire Web). The system then adds the marked documents to the training set and retrains on it, and this process is repeated until no more documents are marked. Requiring the similarity to be monotone increasing as the system learns enables it to 1) detect the point at which no more documents remain to be marked and 2) decide the threshold value that the classifier uses. In addition, under the condition that normalization is limited to dividing term weights by a p-norm of the weights, the linear classifier in which training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity using English and German documents randomly selected from the Web.
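
The acquisition loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `binary_similarity` and `acquire`, the use of raw term counts as weights, and the toy documents are all assumptions made for illustration. The training set is indexed in a binary manner (a term is either present or absent), and each candidate document's term weights are divided by their p-norm. Under these assumptions, the similarity of any document can only stay the same or grow as documents are added to the training set, which is what allows the loop to stop as soon as a pass marks nothing new.

```python
from collections import Counter


def binary_similarity(train_terms, doc_tokens, p=2.0):
    """Similarity of a document to a binary-indexed training set.

    Assumption: document term weights are raw counts divided by their
    p-norm; the training set contributes 1 for every term it contains.
    """
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    norm = sum(c ** p for c in counts.values()) ** (1.0 / p)
    # Only terms already present in the training set contribute.
    return sum(c for t, c in counts.items() if t in train_terms) / norm


def acquire(seed_tokens, domain, threshold, p=2.0):
    """Repeatedly mark sufficiently similar documents and learn from them.

    Returns the sorted indices of the marked documents in `domain`.
    """
    train_terms = set(seed_tokens)  # binary index of the training set
    marked = set()
    while True:
        newly = [
            i for i, doc in enumerate(domain)
            if i not in marked
            and binary_similarity(train_terms, doc, p) >= threshold
        ]
        if not newly:
            # Similarities never decrease, so an empty pass is final.
            return sorted(marked)
        for i in newly:
            marked.add(i)
            train_terms.update(domain[i])  # learn the marked document


# Toy example: the second document only qualifies after the first
# one has been added to the training set.
seed = ["apple", "fruit"]
domain = [
    ["apple", "fruit", "juice"],
    ["juice", "fruit", "orange"],
    ["car", "engine", "wheel"],
]
result = acquire(seed, domain, threshold=1.0)
```

In this toy run the first document is marked in the first pass, its term `juice` enters the training set, and the second document then crosses the threshold in the second pass; the third document never does, so the loop ends.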