Web page classification: Features and algorithms

Authors:
Xiaoguang Qi; Brian D. Davison
Affiliations:
Lehigh University, Bethlehem, PA (both authors)
Venue:
ACM Computing Surveys (CSUR)
Year:
2009




Abstract

Classification of Web page content is essential to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification compared with traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.