Extraction of web texts using content-density distribution

Authors:
Saori Kitahara;Koya Tamura;Kenji Hatano
Affiliations:
Graduate School of Culture and Information Science, Doshisha University, Kyoto, Japan;UX Department, Mixi Inc., Tokyo, Japan;Faculty of Culture and Information Science, Doshisha University, Kyoto, Japan
Venue:
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Year:
2011



Abstract

We propose a method that captures the content of each Web page and extracts the part of the page related to the query keywords, with the aim of producing more effective snippets for Web search engines. We regard the content as the set of words in the text of a Web page, and we generate a content-density distribution from both the position and the influence of each word. In our experiments, the proposed method made the content of Web pages easier to recognize than conventional snippet-based methods did.
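
The abstract does not give the formula for the content-density distribution, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that each word's "influence" is a numeric weight supplied by the caller and that this weight is spread over neighbouring positions with a Hann window. The window choice, the function names (`hann_window`, `density_distribution`, `best_passage`), and the `half_width` parameter are all assumptions made for illustration.

```python
import math


def hann_window(offset: int, half_width: int) -> float:
    """Hann window value at `offset` from the centre; 0 outside the window."""
    if abs(offset) > half_width:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * offset / half_width))


def density_distribution(tokens, weights, half_width=3):
    """Return one density value per token position.

    `weights` maps a word to its (assumed) influence. Each occurrence of a
    weighted word adds weight * window(offset) to the nearby positions.
    """
    if half_width < 1:
        raise ValueError("half_width must be at least 1")
    density = [0.0] * len(tokens)
    for pos, token in enumerate(tokens):
        weight = weights.get(token, 0.0)
        if weight == 0.0:
            continue
        start = max(0, pos - half_width)
        stop = min(len(tokens), pos + half_width + 1)
        for i in range(start, stop):
            density[i] += weight * hann_window(i - pos, half_width)
    return density


def best_passage(density, length):
    """Start index of the contiguous span of `length` with the largest sum."""
    if not density:
        raise ValueError("density must not be empty")
    length = max(1, min(length, len(density)))
    current = sum(density[:length])
    best_sum, best_start = current, 0
    # Slide the span one position at a time, updating the running sum.
    for start in range(1, len(density) - length + 1):
        current += density[start + length - 1] - density[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start


# Hypothetical input: one query keyword with an arbitrary weight of 2.0.
tokens = "a b c query d e f".split()
density = density_distribution(tokens, {"query": 2.0}, half_width=3)
start = best_passage(density, length=3)
snippet = " ".join(tokens[start:start + 3])
```

Under these assumptions, the density peaks at the position of the query keyword and falls off smoothly on both sides, so the highest-scoring span is the one centred on that keyword. A snippet extractor along these lines would return the span with the largest summed density.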