A scalable and distributed NLP architecture for web document annotation

Authors:
Julien Deriviere;Thierry Hamon;Adeline Nazarenko
Affiliations:
LIPN – UMR CNRS 7030, Villetaneuse, France;LIPN – UMR CNRS 7030, Villetaneuse, France;LIPN – UMR CNRS 7030, Villetaneuse, France
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

In the context of the ALVIS project, which aims to integrate linguistic information into topic-specific search engines, we develop an NLP architecture to linguistically annotate large collections of web documents. This context forces us to address the scalability of Natural Language Processing. The platform can be viewed as a framework that uses existing NLP tools. We focus on the efficiency of the platform by distributing linguistic processing across several machines. We carry out an experiment on 55,329 web documents focusing on biology. This 79-million-word collection of web documents was processed in 3 days on 16 computers.
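
To put the reported figures in perspective, the following is a rough back-of-the-envelope sketch of the implied throughput. It assumes that "3 days" means three full 24-hour days of wall-clock time and that the load was spread evenly over the 16 computers; neither assumption is stated in the abstract, and the function and key names are illustrative only.

```python
# Rough throughput estimate from the figures quoted in the abstract.
# Assumptions (not stated in the source): 3 full 24-hour days of
# wall-clock time, and an even split of work across the 16 machines.

SECONDS_PER_DAY = 24 * 60 * 60


def throughput(words, documents, days, machines):
    """Return average per-document and per-machine processing rates."""
    machine_seconds = days * machines * SECONDS_PER_DAY
    return {
        "avg_words_per_document": words / documents,
        "documents_per_machine_per_day": documents / (days * machines),
        "words_per_machine_per_second": words / machine_seconds,
    }


stats = throughput(words=79_000_000, documents=55_329, days=3, machines=16)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, each machine handles on the order of a thousand documents per day, or roughly twenty words per second, which gives a sense of how costly full linguistic annotation is per document.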