An introduction to ROC analysis
Pattern Recognition Letters - Special issue: ROC analysis in pattern recognition
Web page classification: Features and algorithms
ACM Computing Surveys (CSUR)
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
An event-centric model for multilingual document similarity
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Hi-index | 0.00 |
An emerging need in information retrieval is to identify a set of documents conforming to an abstract description. This task presents two major challenges to existing methods of document retrieval and classification. First, similarity based on overall content is less effective because there may be great variance in both content and subject of documents produced for similar functions, e.g. a presidential speech or a government ministry white paper. Second, the function of the document can be defined based on user interests or the specific data set through a set of existing examples, which cannot be described with standard categories. Additionally, the increasing volume and complexity of document collections demands new scalable computational solutions. We conducted a case study using web-archived data from the Latin American Government Documents Archive (LAGDA) to illustrate these problems and challenges. We propose a new hybrid approach based on Naïve Bayes inference that uses mixed n-gram models obtained from a training set to classify documents in the corpus. The approach has been developed to exploit parallel processing for large scale data set. The preliminary work shows promising results with improved accuracy for this type of retrieval problem.