An Ontology-Based Binary-Categorization Approach for Recognizing Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Authors:
Quan Wang;Yiu-Kai Ng
Affiliations:
Computer Science Department, Brigham Young University, Provo, Utah 84602, USA. qw@email.byu.edu;Computer Science Department, Brigham Young University, Provo, Utah 84602, USA. ng@cs.byu.edu
Venue:
Information Retrieval
Year:
2003



Abstract

The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model that uses logistic regression to recognize multiple-record Web documents against an application ontology, which is a simple conceptual-modeling approach. We observe that many Web documents contain a sequence of chunks of textual information, each of which constitutes a “record.” Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. In contrast to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependencies among index terms. In addition, we use logistic regression instead of linear regression analysis because the relevance probabilities of training documents are discrete. On one test set of car-ad Web documents and another of obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.
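The pipeline described in the abstract (fit a logistic curve to training documents, then read off a relevance probability for each test document) can be illustrated with a minimal sketch. It is not the authors' implementation. The three-feature vector (one aggregate term-frequency score, a density value, a grouping value), the batch gradient-descent fitting, the 0.5 decision threshold, and every name below are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, features):
    """Estimated relevance probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent on log-loss.

    Gradient descent is an assumed stand-in for whatever fitting
    procedure the paper actually uses.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(weights, xi) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def classify(weights, X, threshold=0.5):
    """Binary categorization: 1 = relevant, 0 = not relevant (assumed cutoff)."""
    return [1 if predict_proba(weights, x) >= threshold else 0 for x in X]


def evaluate(y_true, y_pred):
    """Return (recall, precision, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return recall, precision, accuracy


# Hypothetical training documents:
# [term-frequency score, density value, grouping value], scaled to [0, 1].
X_TRAIN = [
    [0.90, 0.80, 0.90], [0.80, 0.70, 0.80],
    [0.70, 0.90, 0.70], [0.85, 0.75, 0.95],
    [0.10, 0.20, 0.10], [0.20, 0.10, 0.30],
    [0.30, 0.20, 0.20], [0.15, 0.30, 0.10],
]
Y_TRAIN = [1, 1, 1, 1, 0, 0, 0, 0]  # discrete relevance labels

# Hypothetical test documents with known labels.
X_TEST = [[0.80, 0.80, 0.80], [0.20, 0.20, 0.20],
          [0.75, 0.85, 0.90], [0.10, 0.10, 0.20]]
Y_TEST = [1, 0, 1, 0]

WEIGHTS = fit_logistic(X_TRAIN, Y_TRAIN)
PREDICTIONS = classify(WEIGHTS, X_TEST)
RECALL, PRECISION, ACCURACY = evaluate(Y_TEST, PREDICTIONS)
```

Because the training labels are discrete (relevant or not), the logistic curve keeps every estimate inside the interval (0, 1), which a straight-line fit would not guarantee. The toy data here are made up and cleanly separable, so the resulting scores say nothing about the figures reported in the abstract.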