Unsupervised Spam Detection by Document Complexity Estimation
DS '08 Proceedings of the 11th International Conference on Discovery Science
We propose an unsupervised method for detecting spam documents in a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness of substrings within the documents (i.e., how different they are from the others). A document is then classified as spam if it contains a substring that belongs to an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments conducted on data collected from Japanese web forums show that the method successfully discovers spam.
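The classification rule described in the abstract can be illustrated with a minimal sketch. The abstract does not define the equivalence relation or the three alienness measures, so everything below is an assumption made for illustration: substrings are grouped by their set of end positions (the classic end-position equivalence behind suffix automata, used here only as a stand-in), and the score is a toy one (length of the longest member times its number of occurrences), not one of the paper's measures. The names `endpos_classes`, `alienness`, and `flag_spam` are hypothetical.

```python
from collections import defaultdict


def endpos_classes(docs, min_len=3, max_len=30):
    """Group substrings by their set of (document, end position) occurrences.

    Assumption: end-position equivalence stands in for the paper's
    equivalence relation, which the abstract does not specify.
    Brute force over all substrings; fine only for tiny inputs.
    """
    occurrences = defaultdict(set)
    for d, text in enumerate(docs):
        for start in range(len(text)):
            last_end = min(len(text), start + max_len)
            for end in range(start + min_len, last_end + 1):
                occurrences[text[start:end]].add((d, end))
    classes = defaultdict(list)
    for sub, ends in occurrences.items():
        classes[frozenset(ends)].append(sub)
    return classes


def alienness(ends, members):
    """Toy score (hypothetical): long strings that repeat look 'alien'."""
    if len(ends) < 2:
        return 0  # a substring seen once carries no repetition signal
    return max(len(m) for m in members) * len(ends)


def flag_spam(docs, threshold, min_len=3, max_len=30):
    """Return indices of documents containing a high-alienness substring."""
    flagged = set()
    for ends, members in endpos_classes(docs, min_len, max_len).items():
        if alienness(ends, members) >= threshold:
            flagged.update(d for d, _ in ends)
    return sorted(flagged)


docs = [
    "buy cheap pills now at example",
    "the weather was nice today",
    "hello buy cheap pills now ok",
    "see you at lunch tomorrow",
]
print(flag_spam(docs, threshold=30))
```

No labels or language-specific tokenization are used, which mirrors the unsupervised and language-independent claims. The brute-force enumeration is quadratic per document, so it does not reflect the scalability claim, which would need an index structure such as a suffix array.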