A machine learning approach to web page filtering using content and structure analysis

Authors:
Michael Chau; Hsinchun Chen
Affiliations:
School of Business, The University of Hong Kong, Pokfulam, Hong Kong; Department of Management Information Systems, The University of Arizona, Tucson, Arizona 85721, USA
Venue:
Decision Support Systems
Year:
2008

Abstract

As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research on the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods: a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approach can be applied in topic-specific search engine development and other Web applications such as Web content management.
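
The idea of representing a page by content-based and link-based features and passing that vector to a learning algorithm can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Page` record, the four features (lexicon hits in the title, body, incoming anchor text, and neighbor titles), and the toy lexicon are hypothetical and are not the feature set used in the paper. A simple perceptron stands in for the neural network and support vector machine that the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical page record; the field names are assumptions."""
    title: str
    body: str
    anchor_texts: list = field(default_factory=list)  # text of links pointing to this page
    neighbors: list = field(default_factory=list)     # Page objects linked to or from this page


def lexicon_hits(text, lexicon):
    """Count whitespace-separated tokens of `text` that appear in the lexicon."""
    return sum(1 for tok in text.lower().split() if tok in lexicon)


def content_features(page, lexicon):
    # Content-based (illustrative): domain terms in the title and in the body.
    return [lexicon_hits(page.title, lexicon), lexicon_hits(page.body, lexicon)]


def link_features(page, lexicon):
    # Link-based (illustrative): domain terms in incoming anchor text
    # and in the titles of neighboring pages.
    anchor = sum(lexicon_hits(a, lexicon) for a in page.anchor_texts)
    neigh = sum(lexicon_hits(n.title, lexicon) for n in page.neighbors)
    return [anchor, neigh]


def feature_vector(page, lexicon):
    """Concatenate content-based and link-based features into one input vector."""
    return content_features(page, lexicon) + link_features(page, lexicon)


def train_perceptron(vectors, labels, epochs=20, lr=0.1):
    """Stand-in linear classifier; labels are +1 (relevant) or -1 (irrelevant)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge the boundary toward x
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(model, x):
    """Return +1 (keep the page) or -1 (filter it out)."""
    weights, bias = model
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1
```

In this sketch the classifier sees only the combined vector, so swapping the perceptron for another learner would leave the feature-extraction functions unchanged.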