AUTOMATIC MACHINE LEARNING OF KEYPHRASE EXTRACTION FROM SHORT HTML DOCUMENTS WRITTEN IN HEBREW

Authors:
Yaakov HaCohen-Kerner; Ittay Stern; David Korkus; Erick Fredj
Affiliations:
Department of Computer Science, Jerusalem College of Technology (Machon Lev), Jerusalem, Israel (all authors)
Venue:
Cybernetics and Systems
Year:
2007

Abstract

Keyphrases extracted from documents can save valuable time in tasks such as filtering, summarization, and categorization. A few keyphrase extraction systems are available for documents written in English. In this paper, we propose a model called LEH_KEY (Learning to Extract Hebrew KEYphrases) that, for the first time, learns to extract keyphrases from documents written in Hebrew. We introduce a relatively large number (15) of baseline extraction methods, whereas related systems combine only a small number (two or three) of baseline extraction methods. We have investigated various combinations of these baseline methods and tested various machine learning methods. The best results were achieved by a combination of six baseline methods using J48 (an improved variant of C4.5). Our results were found to be at least equal in quality to those achieved by extraction systems for documents written in English, which are regarded as state of the art.
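
The idea of combining several baseline extraction methods can be illustrated with a minimal sketch. The abstract does not list the 15 baseline methods, so the three scores below (term frequency, position of first occurrence, and presence in the title) are assumptions chosen only for illustration, and the function names `tokenize`, `candidate_phrases`, and `baseline_features` are hypothetical. Each candidate phrase receives one score per baseline; a learner such as a decision tree could then be trained on such vectors, with human-assigned keyphrases as labels.

```python
import re
from collections import Counter


def tokenize(text):
    # \w is Unicode-aware for Python 3 str patterns, so Hebrew letters match too.
    return re.findall(r"\w+", text.lower())


def candidate_phrases(tokens, max_len=2):
    """Yield (start_index, phrase) for every n-gram of up to max_len tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield i, tuple(tokens[i:i + n])


def baseline_features(title, body, max_len=2):
    """Map each candidate phrase in the body to a dict of baseline scores.

    The three baselines are illustrative assumptions, not the paper's methods.
    """
    body_tokens = tokenize(body)
    title_grams = {p for _, p in candidate_phrases(tokenize(title), max_len)}

    counts = Counter()
    first_pos = {}
    for i, phrase in candidate_phrases(body_tokens, max_len):
        counts[phrase] += 1
        first_pos.setdefault(phrase, i)  # keep only the earliest position

    total = max(len(body_tokens), 1)
    features = {}
    for phrase, freq in counts.items():
        features[phrase] = {
            # Baseline 1: how often the phrase occurs, relative to body length.
            "term_frequency": freq / total,
            # Baseline 2: earlier first occurrence gives a score closer to 1.
            "first_occurrence": 1.0 - first_pos[phrase] / total,
            # Baseline 3: whether the phrase also appears in the title.
            "in_title": 1.0 if phrase in title_grams else 0.0,
        }
    return features
```

Calling `baseline_features(title, body)` on a document returns one feature vector per candidate phrase. In a learning setup of the kind the abstract describes, these vectors would be the input to the classifier, which decides which combination of baseline scores best predicts a keyphrase.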