Narrative text classification for automatic key phrase extraction in web document corpora

Authors:
Yongzheng Zhang;Nur Zincir-Heywood;Evangelos Milios
Affiliations:
Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada
Venue:
Proceedings of the 7th annual ACM international workshop on Web information and data management
Year:
2005

Abstract

Automatic key phrase extraction is a useful tool in many text-related applications such as clustering and summarization. State-of-the-art methods are designed to extract key phrases from traditional text such as technical papers. Applying these methods to Web documents, which often contain diverse and heterogeneous content, is both of particular interest and a challenge in the information age. In this work, we investigate the significance of narrative text classification for automatic key phrase extraction in Web document corpora. We benchmark three methods, TFIDF, KEA, and Keyterm, each used to extract key phrases from all the plain text and from only the narrative text of Web pages. ANOVA tests are used to analyze the ranking data collected in a user study, using the quantitative measures of acceptable percentage and quality value. The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all the plain text of Web pages. This demonstrates that narrative text classification is indispensable for effective key phrase extraction in Web document corpora.
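
To make the TFIDF baseline named in the abstract concrete, the following is a minimal sketch of TF-IDF term ranking. It is not the authors' implementation: the function name `tfidf_rank`, the whitespace tokenization, the restriction to single words, and the toy corpus are all assumptions made for illustration. The abstract does not specify how phrases are formed or weighted.

```python
import math
from collections import Counter


def tfidf_rank(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF against the whole corpus.

    Hypothetical helper for illustration only: single-word terms,
    lowercase whitespace tokenization, no stoplist or stemming.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Term frequency within the target document, normalized by its length.
    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy corpus (assumed): the first entry stands in for narrative text,
# the others for navigation-like text found on Web pages.
docs = [
    "web page summarization web",
    "page navigation menu",
    "menu links page",
]
print(tfidf_rank(docs, 0, top_k=2))
```

In this sketch, a term that occurs in every document, such as "page", receives an inverse document frequency of zero and falls to the bottom of the ranking. Running the same function once on all plain text and once on narrative text only would mirror the comparison the abstract describes, under the simplifications stated above.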