LoLo: a system based on terminology for multilingual extraction

Authors: Yousif Almas; Khurshid Ahmad
Affiliations: University of Surrey, Guildford, Surrey, UK; Trinity College, Dublin, Ireland
Venue: IEBeyondDoc '06 Proceedings of the Workshop on Information Extraction Beyond The Document
Year: 2006

Abstract

An unsupervised learning method, based on corpus linguistics and special-language terminology, is described that can extract time-varying information from text streams. The method is shown to be 'language-independent' in that it yields sets of regular expressions that can be used to extract the information in typologically distinct languages such as English and Arabic. The method uses the distribution of N-grams to automatically extract 'meaning bearing' patterns of usage from a training corpus. The analysis of an English news wire corpus (1,720,142 tokens) and an Arabic news wire corpus (1,720,154 tokens) shows encouraging results.
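
The general idea of turning N-gram distributions into extraction patterns can be illustrated with a minimal sketch. This is not the LoLo implementation, which the abstract does not specify; the function names, the seed term, the whitespace tokenizer, and the frequency threshold are all assumptions made for illustration. The sketch counts the N-grams that contain a given term, keeps the frequent ones, and compiles them into a single regular expression.

```python
import re
from collections import Counter


def ngrams_containing(tokens, term, n):
    """Yield every n-gram (as a tuple) of `tokens` that contains `term`."""
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if term in gram:
            yield gram


def frequent_patterns(sentences, term, n=3, min_count=2):
    """Return the n-grams around `term` seen at least `min_count` times.

    Whitespace tokenization is an illustrative simplification; a real
    system would need language-aware tokenization, especially for Arabic.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(ngrams_containing(tokens, term, n))
    return [gram for gram, count in counts.most_common() if count >= min_count]


def to_regex(grams):
    """Compile the n-grams into one alternation, allowing flexible spacing."""
    if not grams:
        raise ValueError("no patterns to compile")
    alternatives = [r"\s+".join(re.escape(tok) for tok in gram) for gram in grams]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


# Toy corpus, invented for this example only.
corpus = [
    "shares rose 5 percent on Monday",
    "the index rose 5 percent in early trade",
    "oil prices fell 2 percent overnight",
    "the dollar rose 5 percent against the yen",
]

patterns = frequent_patterns(corpus, term="percent")
extractor = to_regex(patterns)
match = extractor.search("Gold rose 5 percent today")
print(patterns, match.group(0) if match else None)
```

Because the sketch relies only on token counts and Unicode-aware regular expressions, the same code runs unchanged on text in other scripts, although the quality of the patterns would depend heavily on how the text is tokenized.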