Automatic text processing: the transformation, analysis, and retrieval of information by computer
Automatic text processing: the transformation, analysis, and retrieval of information by computer
Information extraction as a basis for high-precision text classification
ACM Transactions on Information Systems (TOIS)
Training algorithms for linear text classifiers
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
An ontology-based expert system for database design
Data & Knowledge Engineering - Special issue on ER '97
Record-boundary discovery in Web documents
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Automatic Document Classification
Journal of the ACM (JACM)
Conceptual-model-based data extraction from multiple-record Web pages
Data & Knowledge Engineering
Document Categorization and Query Generation on the World Wide WebUsing WebACE
Artificial Intelligence Review - Special issue on data mining on the Internet
Information Retrieval: Uncertainty and Logics: Advanced Models for the Representation and Retrieval of Information
Modern Information Retrieval
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
Hierarchically Classifying Documents Using Very Few Words
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Recognizing Ontology-Applicable Multiple-Record Web Documents
ER '01 Proceedings of the 20th International Conference on Conceptual Modeling: Conceptual Modeling
Categorisation of web documents using extraction ontologies
International Journal of Metadata, Semantics and Ontologies
Hi-index | 0.00 |
To retrieve Web documents of interest, most of the Web users rely on Web search engines. All existing search engines provide query facility for users to search for the desired documents using search-engine keywords. However, when a search engine retrieves a long list of Web documents, the user might need to browse through each retrieved document in order to determine which document is of interest. We observe that there are two kinds of problems involved in the retrieval of Web documents: (1) an inappropriate selection of keywords specified by the user; and (2) poor precision in the retrieved Web documents. In solving these problems, we propose an automatic binary-categorization method that is applicable for recognizing multiple-record Web documents of interest, which appear often in advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze and cull Web documents to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares the vector with the predefined ontology vector of the same multi-dimensional space using VSM, which we consider the magnitudes of the vectors, as well as the angle between them. Our experimental results show that we have achieved an average of 90% recall and 97% precision in recognizing Web documents belonged to the same category (i.e., domain of interest). Thus our categorization discards very few documents it should have kept and keeps very few it should have discarded.