The impact of document structure on keyphrase extraction

Authors:
Katja Hofmann;Manos Tsagkias;Edgar Meij;Maarten de Rijke
Affiliations:
University of Amsterdam, Amsterdam, Netherlands;University of Amsterdam, Amsterdam, Netherlands;University of Amsterdam, Amsterdam, Netherlands;University of Amsterdam, Amsterdam, Netherlands
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is a time-consuming process, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Document structure may contain useful information about which parts or phrases of a document are important, but has rarely been considered as a source of information for keyphrase extraction. We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new, large corpus that consists of full-text journal articles, where the rich collection and document structure available at the publishing stage is explicitly annotated. We explore features based on the XML tags contained in the documents, and based on generic section types derived using position and cue words in section titles. For XML tags we find sections, abstract, and title to perform best, but many smaller elements may be beneficial in combination with other features. Of the generic section types, the discussion section is found to be most useful for keyphrase extraction.
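
As an illustration of the generic section types mentioned above, the following is a minimal sketch of how section titles might be mapped to generic types using cue words and position, and how a candidate phrase's occurrences could then be counted per type. The cue-word lists, the positional fallback, and the names `CUE_WORDS`, `section_type`, and `phrase_features` are assumptions made for this example; the abstract does not specify the cue words or the feature definitions actually used.

```python
import re
from collections import Counter

# Hypothetical cue words; the abstract does not list the ones actually used.
CUE_WORDS = {
    "introduction": ("introduction", "background"),
    "methods": ("method", "materials", "approach"),
    "results": ("result", "experiment", "evaluation"),
    "discussion": ("discussion", "conclusion"),
}


def section_type(title, index, total):
    """Map a section title to a generic type using cue words, then position."""
    lowered = title.lower()
    for generic, cues in CUE_WORDS.items():
        if any(cue in lowered for cue in cues):
            return generic
    # Positional fallback (an assumption): first section opens, last one closes.
    if index == 0:
        return "introduction"
    if index == total - 1:
        return "discussion"
    return "other"


def phrase_features(phrase, sections):
    """Count occurrences of `phrase` per generic section type.

    `sections` is a list of (title, text) pairs in document order.
    """
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    counts = Counter()
    for i, (title, text) in enumerate(sections):
        hits = len(pattern.findall(text.lower()))
        if hits:
            counts[section_type(title, i, len(sections))] += hits
    return dict(counts)
```

The resulting per-type counts could serve as additional features alongside position and frequency in a keyphrase extractor; how such features are weighted or combined is left open here.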