Using URLs and table layout for Web classification tasks

Authors:
L. K. Shih; D. R. Karger
Affiliations:
Massachusetts Institute of Technology, MA; Massachusetts Institute of Technology, MA
Venue:
Proceedings of the 13th international conference on World Wide Web
Year:
2004

Abstract

We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be much improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of those links on a referring page. These features are unusual: rather than being scalar measurements like word counts, they are tree-structured, describing the position of the item in a tree. We develop a model and algorithm for machine learning using such tree-structured features. We apply our methods in automated tools for recognizing and blocking Web advertisements and for recommending "interesting" news stories to a reader. Experiments show that our algorithms are both faster and more accurate than those based on the text content of Web documents.
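
To make the idea of a tree-structured feature concrete, the sketch below treats a URL as a root-to-leaf path of tokens and learns from labeled examples by counting labels at every prefix of that path. This is an illustration only and is not the model or algorithm developed in the paper: the tokenization in `url_to_path`, the class `PrefixTreeClassifier`, the back-off rule, and the Laplace smoothing are all assumptions chosen for brevity, and the example URLs are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_to_path(url):
    """Turn a URL into a root-to-leaf list of tokens (hypothetical tokenization)."""
    parts = urlsplit(url)
    # Reverse the host labels so the most general one (e.g. "com") is nearest the root.
    host = list(reversed(parts.hostname.split("."))) if parts.hostname else []
    segments = [s for s in parts.path.split("/") if s]
    return host + segments


class PrefixTreeClassifier:
    """Counts labels at every prefix of the token path. NOT the paper's model."""

    def __init__(self):
        # Maps a prefix tuple to [negative_count, positive_count].
        self.counts = defaultdict(lambda: [0, 0])

    def add(self, url, label):
        """Record one labeled URL at every node along its path, root included."""
        path = url_to_path(url)
        for depth in range(len(path) + 1):
            self.counts[tuple(path[:depth])][int(label)] += 1

    def score(self, url):
        """Estimate P(positive) from the deepest prefix seen in training."""
        path = url_to_path(url)
        for depth in range(len(path), -1, -1):
            prefix = tuple(path[:depth])
            if prefix in self.counts:
                neg, pos = self.counts[prefix]
                # Laplace smoothing keeps estimates away from 0 and 1.
                return (pos + 1) / (neg + pos + 2)
        return 0.5  # No training data at all.


clf = PrefixTreeClassifier()
clf.add("http://ads.example.com/banners/a.gif", True)    # labeled as an ad
clf.add("http://ads.example.com/banners/b.gif", True)    # labeled as an ad
clf.add("http://www.example.com/news/story1.html", False)

ad_score = clf.score("http://ads.example.com/banners/c.gif")
news_score = clf.score("http://www.example.com/news/story2.html")
```

Under these assumptions, an unseen URL inherits the statistics of the nearest ancestor it shares with the training data, so a new image under `ads.example.com/banners/` scores as a likely ad even though that exact URL was never labeled. The same path-of-tokens view could in principle be applied to a link's position in a page's nested table layout.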