A system for adaptive information extraction from highly informal text

Authors:
Laura Alonso i Alemany;Rafael Carrascosa
Affiliations:
NLP Group, FaMAF-UNC, Córdoba, Argentina;NLP Group, FaMAF-UNC, Córdoba, Argentina
Venue:
NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
Year:
2011

Abstract

We present a first version of ado, a system for Adaptive Data Organization, that is, information extraction from highly informal text such as short text messages, classified ads, and tweets. It is built on a modular architecture that transparently integrates off-the-shelf NLP tools, general string-processing and machine learning procedures, and processes tailored to a domain. The system is called adaptive because it implements a semi-supervised approach: knowledge resources are initially built by hand and are then updated automatically with data from the corpus. This allows ado to adapt to rapidly changing user-generated language. To estimate the impact of future developments, we have carried out a preliminary evaluation of the system on a small corpus of Spanish classified advertisements from the real estate domain. This evaluation shows that tokenization and chunking can be handled well by simple techniques, but normalization, morphosyntactic tagging, and semantic tagging require either more complex techniques or a larger training corpus.
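
The abstract does not say how ado updates its resources, so the following is only a rough sketch of one way a hand-built lexicon could be extended from corpus data while normalizing noisy tokens. Everything in it is an assumption made for illustration: the class name AdaptiveLexicon, the parameters max_distance and min_count, the use of plain unit-cost Levenshtein distance, and the Spanish sample words are all hypothetical and are not taken from the system.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class AdaptiveLexicon:
    """Hypothetical lexicon: seeded by hand, extended from observed tokens."""

    def __init__(self, seed, max_distance=1, min_count=3):
        self.known = set(seed)            # hand-built starting vocabulary
        self.unseen = Counter()           # counts of tokens with no close match
        self.max_distance = max_distance  # assumed normalization threshold
        self.min_count = min_count        # assumed promotion threshold

    def normalize(self, token: str) -> str:
        token = token.lower()
        if token in self.known:
            return token
        # Closest known form; ties are broken alphabetically.
        best = min(self.known,
                   key=lambda w: (edit_distance(token, w), w),
                   default=None)
        if best is not None and edit_distance(token, best) <= self.max_distance:
            return best
        # No close match: count the token and promote it once it is frequent.
        self.unseen[token] += 1
        if self.unseen[token] >= self.min_count:
            self.known.add(token)
        return token


lexicon = AdaptiveLexicon({"dormitorio", "cochera"}, max_distance=2, min_count=2)
first = lexicon.normalize("dormitrio")    # close to a seed word
lexicon.normalize("pileta")               # unknown, seen once
lexicon.normalize("pileta")               # seen twice, promoted to the lexicon
second = lexicon.normalize("pilta")       # now resolved against the new entry
```

In this sketch the promotion rule is purely frequency based; a real semi-supervised system would likely need a stronger filter, since frequent misspellings would otherwise be promoted as well.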