A survey of methods to ease the development of highly multilingual text mining applications

Authors:
Ralf Steinberger
Affiliations:
European Commission, Joint Research Centre (JRC), Ispra, Italy 21027
Venue:
Language Resources and Evaluation
Year:
2012

Abstract

Multilingual text processing is useful because the information content found in different languages is complementary, with regard to both facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights from various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines (above all, extreme simplicity) can be very restrictive, we believe we have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in between twenty and fifty languages. We also touch upon the kind of language resources that would make it easier for everyone to develop highly multilingual text mining applications. We argue that, to achieve this, the most needed resources would be freely available, simple, parallel and uniform multilingual dictionaries, corpora and software tools.