Using web structure for classifying and describing web pages

Authors:
Eric J. Glover; Kostas Tsioutsiouliklis; Steve Lawrence; David M. Pennock; Gary W. Flake
Affiliations:
NEC Research Institute, Princeton, NJ; NEC Research Institute, Princeton, NJ and Princeton University, Princeton, NJ; NEC Research Institute, Princeton, NJ; NEC Research Institute, Princeton, NJ; NEC Research Institute, Princeton, NJ
Venue:
Proceedings of the 11th international conference on World Wide Web
Year:
2002

Abstract

The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
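
The ranking of words by expected entropy loss mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the formula, so the code below assumes the standard information-gain definition: the entropy of the class prior minus the expected entropy after observing whether a term is present. The function names, the representation of each page as a set of terms, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a class with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Prior class entropy minus expected entropy given term presence/absence.

    pos_with / neg_with: positive / negative documents containing the term.
    pos_total / neg_total: sizes of the positive and negative sets.
    """
    total = pos_total + neg_total
    with_f = pos_with + neg_with
    without_f = total - with_f
    prior = entropy(pos_total / total)
    h_with = entropy(pos_with / with_f) if with_f else 0.0
    h_without = entropy((pos_total - pos_with) / without_f) if without_f else 0.0
    posterior = (with_f / total) * h_with + (without_f / total) * h_without
    return prior - posterior


def rank_features(pos_docs, neg_docs):
    """Rank every term by expected entropy loss, highest first.

    pos_docs / neg_docs: lists of term sets (assumed representation),
    e.g. the words found near the citation in each citing document.
    """
    vocab = set().union(*pos_docs, *neg_docs)
    scored = []
    for term in vocab:
        pos_with = sum(term in doc for doc in pos_docs)
        neg_with = sum(term in doc for doc in neg_docs)
        score = expected_entropy_loss(pos_with, len(pos_docs), neg_with, len(neg_docs))
        scored.append((term, score))
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy, made-up data: two pages in a cluster and two pages outside it.
positive = [{"golf", "club"}, {"golf", "course"}]
negative = [{"club", "news"}, {"news", "sport"}]
ranked = rank_features(positive, negative)
```

In this toy data "golf" and "news" both receive the maximum score, because entropy loss rewards any term that separates the two sets, including terms that signal the negative set. A naming step would presumably keep only terms that are more frequent in the positive set; that filter is an assumption here, not something the abstract specifies.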