In this paper we study the problem of collecting training samples for building enterprise taxonomies. We develop a computer-aided tool named InfoAnalyzer, which helps an enterprise prepare the large set of samples needed for machine learning in text categorization. In our system, the enterprise category tree is initially defined by a set of keywords; the Google search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm based on document length normalization is applied to enlarge the training corpus on the basis of these seed stories. Furthermore, we design a method to check the consistency of the training corpus. Experiments show that the resulting training corpus is good enough for statistical classification methods and meets human requirements as well.
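The corpus-enlargement step can be illustrated with a minimal sketch. It is not InfoAnalyzer's implementation, which the abstract does not detail. It assumes whitespace tokenization, log-scaled term frequencies, a pivoted length normalizer with the pivot set to the average document length, and a dot-product score against the centroid of the seed stories. The function names, the `slope` default, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def pivoted_vector(tokens, pivot, slope=0.2):
    # Log-scaled term frequency divided by a pivoted length normalizer:
    # (1 - slope) * pivot + slope * document_length.
    counts = Counter(tokens)
    norm = (1.0 - slope) * pivot + slope * len(tokens)
    return {term: (1.0 + math.log(tf)) / norm for term, tf in counts.items()}


def centroid(vectors):
    # Average the seed vectors term by term to form a topic profile.
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def expand_corpus(seeds, candidates, threshold, slope=0.2):
    # Return the candidates whose score against the seed centroid
    # reaches the (hypothetical) acceptance threshold.
    seed_tokens = [tokenize(s) for s in seeds]
    cand_tokens = [tokenize(c) for c in candidates]
    all_tokens = seed_tokens + cand_tokens
    pivot = sum(len(t) for t in all_tokens) / len(all_tokens)

    topic = centroid([pivoted_vector(t, pivot, slope) for t in seed_tokens])

    accepted = []
    for text, tokens in zip(candidates, cand_tokens):
        vec = pivoted_vector(tokens, pivot, slope)
        score = sum(w * topic.get(term, 0.0) for term, w in vec.items())
        if score >= threshold:
            accepted.append(text)
    return accepted


seeds = ["database query index storage", "database transaction storage engine"]
candidates = ["storage engine for database index", "football match result tonight"]
accepted = expand_corpus(seeds, candidates, threshold=0.1)
```

The score is a plain dot product rather than a cosine because cosine similarity rescales both vectors to unit length, which would cancel the effect of the pivoted normalizer.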