Extracting Precise Link Context Using NLP Parsing Technique

Authors:
Qingyang Xu; Wanli Zuo
Affiliations:
Jilin University, China; Jilin University, China
Venue:
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2004

Abstract

Link context has been exploited extensively since the advent of the World Wide Web, but approaches to extracting precise link context have not been fully explored: many state-of-the-art extraction methods rely on simplistic heuristics and require ad hoc parameters. In this paper, we propose a novel two-step extraction model that aims to systematically derive link context of a quality as high as that of anchor text. In the macroscopic analysis step, a systematic analysis of the web page structure locates the content-cohesive text region and any potentially relevant header or header-like tags. In the microscopic extraction step, an English parser extracts the relevant sentence fragments from the text region, and the nearest heading text is included when needed. Preliminary experimental results support the effectiveness of our approach.
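
The two-step idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify their page analysis or parser, so the code below makes several simplifying assumptions. The macroscopic step is approximated by taking the innermost block element (p, li, td, div) that contains the link, together with the most recent heading seen before it; pages are assumed to have flat, non-nested blocks. The microscopic step replaces the English parser with a naive regular-expression sentence splitter and keeps the sentence containing the anchor text. The names LinkContextParser and extract_link_context, and the example URL, are hypothetical.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "td", "div"}            # assumed "text region" tags
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


class LinkContextParser(HTMLParser):
    """Collects the block, sentence, and heading around one target link."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.last_heading = ""      # most recent heading text seen so far
        self.heading_buf = None     # text pieces while inside a heading
        self.block_buf = None       # text pieces while inside a block
        self.anchor_buf = None      # text pieces while inside the target link
        self.anchor_text = ""
        self.block_has_target = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.heading_buf = []
        elif tag in BLOCK_TAGS:
            self.block_buf = []
            self.block_has_target = False
        elif tag == "a" and dict(attrs).get("href") == self.target_href:
            self.anchor_buf = []
            self.block_has_target = True

    def handle_data(self, data):
        for buf in (self.heading_buf, self.block_buf, self.anchor_buf):
            if buf is not None:
                buf.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.heading_buf is not None:
            self.last_heading = normalize("".join(self.heading_buf))
            self.heading_buf = None
        elif tag == "a" and self.anchor_buf is not None:
            self.anchor_text = normalize("".join(self.anchor_buf))
            self.anchor_buf = None
        elif tag in BLOCK_TAGS and self.block_buf is not None:
            if self.block_has_target and self.result is None:
                block = normalize("".join(self.block_buf))
                # Stand-in for the parser: naive split on ., ! or ?
                sentences = re.split(r"(?<=[.!?])\s+", block)
                sentence = next(
                    (s for s in sentences if self.anchor_text in s), block
                )
                self.result = {
                    "anchor": self.anchor_text,
                    "sentence": sentence,
                    "block": block,
                    "heading": self.last_heading,
                }
            self.block_buf = None


def extract_link_context(html, target_href):
    """Return a dict of context for the first link to target_href, or None."""
    parser = LinkContextParser(target_href)
    parser.feed(html)
    parser.close()
    return parser.result


if __name__ == "__main__":
    page = (
        "<h2>Focused crawlers</h2>"
        "<p>Search engines index the whole Web. A focused crawler such as "
        '<a href="http://example.org/crawler">TopicBot</a> fetches only '
        "relevant pages. It saves bandwidth.</p>"
    )
    print(extract_link_context(page, "http://example.org/crawler"))
```

A real system along the lines the abstract describes would presumably replace the regular-expression splitter with a syntactic parser, so that only the clause or phrase governing the link is kept rather than the whole sentence.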