A Study of Approaches to Hypertext Categorization

Authors:
Yiming Yang;Seán Slattery;Rayid Ghani
Affiliations:
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. yiming.yang@cs.cmu.edu;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. sean.slattery@cs.cmu.edu;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Accenture Technology Labs-Research, Northbrook, IL 60062, USA. rayid.ghani@cs.cmu.edu
Venue:
Journal of Intelligent Information Systems
Year:
2002

Abstract

Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related Web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our results show that the identification of hypertext regularities in the data and the selection of appropriate representations for hypertext in particular domains are crucial, but seldom obvious, in real-world problems. We find that adding the words in the linked neighborhood to the page having those links (both inlinks and outlinks) was helpful for all our classifiers on one dataset, but more harmful than helpful for two out of the three classifiers on the remaining datasets. We also observed that extracting meta data from related Web sites was extremely useful for improving classification accuracy in some of those domains. Finally, the relative performance of the classifiers being tested provided insights into their strengths and limitations for solving classification problems involving diverse and often noisy Web pages.
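
As a rough illustration of the "linked neighborhood" representation mentioned in the abstract, the following minimal Python sketch merges a page's own words with the words of the pages it links to (outlinks) and the pages that link to it (inlinks). It is not the authors' implementation: the function names `tokenize` and `neighborhood_bag`, the whitespace tokenizer, and the dictionary-based link graph are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def neighborhood_bag(page_id, texts, outlinks):
    """Bag of words for a page, augmented with the words of its neighbors.

    texts:    dict mapping page id -> page text (assumed input format)
    outlinks: dict mapping page id -> list of page ids it links to
    """
    # Pages this page links to (outlinks).
    neighbors = set(outlinks.get(page_id, ()))
    # Pages that link to this page (inlinks), derived from the same graph.
    for source, targets in outlinks.items():
        if page_id in targets:
            neighbors.add(source)
    neighbors.discard(page_id)  # ignore self-links

    bag = Counter(tokenize(texts[page_id]))
    for neighbor in neighbors:
        if neighbor in texts:  # skip links to pages we have no text for
            bag.update(tokenize(texts[neighbor]))
    return bag


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    texts = {
        "a": "course syllabus",
        "b": "faculty homepage",
        "c": "course project",
    }
    outlinks = {"a": ["b"], "c": ["a"]}
    print(neighborhood_bag("a", texts, outlinks))
```

The resulting counts could be fed to any bag-of-words classifier. As the abstract notes, whether this augmentation helps depends on the dataset, so it would normally be compared against the page-only representation.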