A software system for topic extraction and automatic document classification is presented. Given a set of documents, the system automatically extracts the topics they mention and helps the user select the optimal number of topics. The user-validated topics are then used to build a model for multi-label document classification. Topic extraction uses an optimized implementation of the Latent Dirichlet Allocation model, while multi-label classification uses a specialized version of the Multi-Net Naive Bayes model. The system is evaluated on 10,056 documents retrieved from the Web through a set of queries built from the Italian Google Directory. This dataset is used for topic extraction, while an independent dataset of 1,012 human-labeled elements is used to evaluate the Multi-Net Naive Bayes model. The results are satisfactory, with precision consistently higher than recall for the labels associated with the four most frequent topics.
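To make the multi-label setting concrete, here is a minimal sketch of one common way to turn Naive Bayes into a multi-label classifier: one binary multinomial Naive Bayes model per label, each deciding independently whether its label applies. This is an illustration under stated assumptions, not the specialized Multi-Net Naive Bayes model the abstract refers to. The function names (`train_per_label_nb`, `predict_labels`), the Laplace smoothing constant `alpha`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def train_per_label_nb(docs, label_sets):
    """Collect, for every label, token counts and document counts for the
    documents that carry the label (True) and those that do not (False)."""
    vocab = {tok for doc in docs for tok in doc}
    all_labels = {lab for labs in label_sets for lab in labs}
    model = {}
    for lab in all_labels:
        token_counts = {True: Counter(), False: Counter()}
        doc_counts = {True: 0, False: 0}
        for doc, labs in zip(docs, label_sets):
            side = lab in labs
            token_counts[side].update(doc)
            doc_counts[side] += 1
        model[lab] = (token_counts, doc_counts)
    return model, vocab

def predict_labels(model, vocab, doc, alpha=1.0):
    """Return every label whose 'present' score beats its 'absent' score."""
    predicted = set()
    for lab, (token_counts, doc_counts) in model.items():
        total_docs = doc_counts[True] + doc_counts[False]
        scores = {}
        for side in (True, False):
            total_tokens = sum(token_counts[side].values())
            # Smoothed log prior for this side of the binary decision.
            score = math.log((doc_counts[side] + alpha) / (total_docs + 2 * alpha))
            for tok in doc:
                if tok in vocab:  # tokens unseen in training are ignored
                    score += math.log(
                        (token_counts[side][tok] + alpha)
                        / (total_tokens + alpha * len(vocab))
                    )
            scores[side] = score
        if scores[True] > scores[False]:
            predicted.add(lab)
    return predicted

# Hypothetical toy corpus: token lists with one or more labels each.
docs = [
    ["goal", "team", "match"],
    ["team", "match", "goal", "goal"],
    ["vote", "party", "election"],
    ["election", "vote", "vote"],
    ["team", "vote", "match", "election"],
]
label_sets = [{"sport"}, {"sport"}, {"politics"}, {"politics"}, {"sport", "politics"}]

model, vocab = train_per_label_nb(docs, label_sets)
print(predict_labels(model, vocab, ["goal", "match"]))
print(predict_labels(model, vocab, ["team", "vote"]))
```

Because each label is decided separately, a document can receive zero, one, or several labels, which is what distinguishes multi-label from single-label classification. Per-label precision and recall, as reported in the abstract, can then be computed by comparing the predicted and human-assigned label sets one label at a time.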