Using the structure of HTML documents to improve retrieval

Authors:
Michal Cutler; Yungming Shih; Weiyi Meng
Affiliation (all authors):
Department of Computer Science, State University of New York at Binghamton, Binghamton, NY
Venue:
USITS '97: Proceedings of the USENIX Symposium on Internet Technologies and Systems
Year:
1997

Abstract

The World Wide Web (WWW) is a gigantic information resource that grows daily. As more and more data are added to the WWW, it is becoming increasingly difficult to locate useful information in this environment effectively. In this paper, we propose a method that uses the structure and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. Our study assigns the occurrences of terms in a document collection to six classes according to the tags in which a particular term appears (such as Title, H1-H6, and Anchor). Based on this assignment, we extend the weighting schemes of traditional information retrieval by assigning different importance factors to terms in different classes. The rationale is that terms appearing in different places in a document may differ in how well they identify the document. For this research we built a Web-based search tool, Webor, created a testbed, and conducted extensive experiments to determine an optimal combination of class importance factors. Our study indicates that this technique can substantially improve retrieval effectiveness.
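
The class-based weighting idea in the abstract can be illustrated with a minimal sketch. It is not the Webor implementation. The abstract names only Title, H1-H6, and Anchor as examples of the six classes, so the four classes used here, the tag-to-class mapping, and the importance factors are hypothetical placeholders, not the values determined in the paper's experiments. For simplicity, anchor text is credited to the document that contains the link. The sketch counts each term's occurrences per class and replaces the plain term frequency with a sum weighted by class importance.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import re

# Hypothetical tag-to-class mapping; the abstract names only Title, H1-H6, Anchor.
TAG_CLASS = {"title": "title", "a": "anchor",
             **{f"h{i}": "header" for i in range(1, 7)}}

# Hypothetical class importance factors (illustrative, not the paper's values).
CLASS_IMPORTANCE = {"plain": 1.0, "header": 2.0, "title": 3.0, "anchor": 2.0}


class ClassTermCounter(HTMLParser):
    """Counts term occurrences per class, using the innermost classified tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                     # currently open classified tags
        self.counts = defaultdict(Counter)  # term -> Counter(class -> count)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_CLASS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        cls = TAG_CLASS[self.stack[-1]] if self.stack else "plain"
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[term][cls] += 1


def weighted_term_frequencies(html):
    """Returns term -> sum over classes of (importance factor * class count)."""
    parser = ClassTermCounter()
    parser.feed(html)
    parser.close()
    return {
        term: sum(CLASS_IMPORTANCE[cls] * n for cls, n in per_class.items())
        for term, per_class in parser.counts.items()
    }
```

Calling weighted_term_frequencies on a page's HTML source yields one weight per term. In a full retrieval system this value would stand in for the raw term frequency inside a conventional weighting scheme such as tf-idf, and the importance factors would be tuned against a test collection rather than fixed by hand.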