Distributed parse mining

  • Authors:
  • Scott A. Waterman

  • Affiliations:
  • Microsoft Live Search/Powerset, San Francisco

  • Venue:
  • SETQA-NLP '09 Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

We describe the design and implementation of a system for data exploration over dependency parses and derived semantic representations in a large-scale NLP-based search system at powerset.com. Because of the distributed nature of the document repository and the processing infrastructure, and also the complex representations of the corpus data, standard text analysis tools such as grep or awk or language modeling toolkits are not applicable. This paper explores the challenges of extracting statistical information and of building language models in such a distributed NLP environment, and introduces a corpus analysis system, Oceanography, that simplifies the writing of analysis code and transparently takes advantage of existing distributed processing infrastructure.