HUMB: Automatic key term extraction from scientific articles in GROBID

Authors:
Patrice Lopez; Laurent Romary
Affiliations:
INRIA, Berlin, Germany; INRIA & HUB-IDSL, Berlin, Germany
Venue:
SemEval '10 Proceedings of the 5th International Workshop on Semantic Evaluation
Year:
2010

Abstract

Task 5 of SemEval 2010 was an opportunity to experiment with the key term extraction module of GROBID, a system for extracting and generating bibliographical information from technical and scientific documents. The tool first uses GROBID's facilities for analyzing the structure of scientific articles, which yields a first set of structural features. A second set of features captures content properties based on phraseness, informativeness and keywordness measures. Two knowledge bases, GRISP and Wikipedia, are then exploited to produce a final set of lexical/semantic features. Bagged decision trees appeared to be the most efficient machine learning algorithm for generating a ranked list of key term candidates. Finally, a post-ranking step was applied, based on co-usage statistics of keywords in HAL, a large Open Access publication repository.
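
As a rough illustration of the ranking step described in the abstract, the sketch below bags very small decision trees and ranks key term candidates by the share of trees that vote "key term". It is not the authors' implementation. Everything in it is an assumption made for illustration: the three feature values per candidate (one toy score standing in for each of the structural, content and lexical/semantic feature sets), the training data, the use of depth-1 trees (stumps) instead of full decision trees, and all function names.

```python
import random

def train_stump(rows, labels):
    """Fit a depth-1 decision tree: the (feature, threshold, polarity) with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            for polarity in (1, 0):
                # polarity is the label predicted when the feature value >= threshold
                errors = sum(
                    (polarity if r[f] >= threshold else 1 - polarity) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, polarity)
    return best[1:]

def stump_predict(stump, row):
    f, threshold, polarity = stump
    return polarity if row[f] >= threshold else 1 - polarity

def train_bagged(rows, labels, n_trees=25, seed=0):
    """Bagging: train each tree on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        # Redraw degenerate resamples that contain only one class.
        while len({labels[i] for i in idx}) < 2:
            idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble

def score(ensemble, row):
    """Fraction of trees voting 'key term' (label 1)."""
    return sum(stump_predict(s, row) for s in ensemble) / len(ensemble)

def rank_candidates(ensemble, candidates):
    """Return (term, score) pairs sorted by decreasing ensemble score."""
    scored = [(term, score(ensemble, feats)) for term, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy training data (hypothetical): [structural, content, lexical/semantic] scores.
TRAIN_ROWS = [
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.9], [0.7, 0.7, 0.8], [0.9, 0.6, 0.9],  # key terms
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.3, 0.2, 0.2], [0.1, 0.3, 0.2],  # non-key terms
]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical candidates extracted from one article.
CANDIDATES = {
    "key term extraction": [0.95, 0.9, 0.9],
    "decision trees": [0.75, 0.6, 0.8],
    "the results": [0.15, 0.2, 0.1],
}

ensemble = train_bagged(TRAIN_ROWS, TRAIN_LABELS)
ranked = rank_candidates(ensemble, CANDIDATES)
for term, s in ranked:
    print(f"{s:.2f}  {term}")
```

Averaging votes over bootstrap resamples gives each candidate a graded score rather than a hard yes/no label, which is what makes the output usable as a ranked list. A post-ranking step such as the one mentioned in the abstract would then reorder this list using additional evidence.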