Complex Terminology Extraction Model from Unstructured Web Text Based Linguistic and Statistical Knowledge

Authors:
Fethi Fkih; Mohamed Nazih Omri
Affiliations:
MARS Research Unit, Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia; MARS Research Unit, Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia
Venue:
International Journal of Information Retrieval Research
Year:
2012

Abstract

Textual data remain the most interesting source of information on the web. The authors focus on one specific kind of information, namely "complex terms": semantic units composed of several lexical units that describe the content of a text in a relevant and exhaustive way. In this paper, they present COTEM, a new model for complex terminology extraction that integrates linguistic and statistical knowledge. The paper makes three main contributions. First, the authors show that a linear Conditional Random Field (CRF) can be used to extract complex terminology from a specialized text corpus. Second, they show that a CRF can model linguistic knowledge by incorporating grammatical observations into its features. Finally, they present the gains in terminology extraction quality obtained by integrating statistical knowledge.
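
As a rough illustration of the idea summarized in the abstract, the sketch below builds per-token feature dictionaries of the kind a linear-chain CRF toolkit typically consumes, mixing grammatical observations (part-of-speech tags) with one statistical signal. It is not the COTEM feature set, which the abstract does not specify. The use of pointwise mutual information over adjacent words, the B/I/O labeling scheme, and every name here (`pmi_table`, `token_features`, `spans_from_bio`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Pointwise mutual information (log base 2) for adjacent word pairs.

    `sentences` is a list of sentences, each a list of (word, pos) pairs.
    PMI is an assumed stand-in for the paper's "statistical knowledge".
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = [w.lower() for w, _ in sent]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def token_features(sent, i, pmi):
    """Features for token i: linguistic (POS context) plus statistical (PMI)."""
    word, pos = sent[i]
    feats = {"word.lower": word.lower(), "pos": pos}
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        feats["prev.pos"] = prev_pos
        # Association strength between the previous word and this one;
        # unseen pairs default to 0.0 (an arbitrary illustrative choice).
        feats["pmi.prev"] = pmi.get((prev_word.lower(), word.lower()), 0.0)
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        feats["next.pos"] = sent[i + 1][1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def spans_from_bio(tokens, labels):
    """Collect multi-word terms from B/I/O labels (the sequence model's output)."""
    terms, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "I" and current:
            current.append(tok)
            continue
        # Any other label closes the open span; keep it only if multi-word.
        if len(current) > 1:
            terms.append(" ".join(current))
        current = [tok] if lab == "B" else []
    if len(current) > 1:
        terms.append(" ".join(current))
    return terms
```

In such a setup, the feature dictionaries for each sentence would be passed to a CRF trainer together with gold B/I/O labels, and `spans_from_bio` would turn the predicted labels back into candidate complex terms. Single-word spans are discarded because the abstract defines complex terms as units made of several lexical units.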