Distributed parse mining

Authors:
Scott A. Waterman
Affiliations:
Microsoft Live Search/Powerset, San Francisco
Venue:
SETQA-NLP '09 Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing
Year:
2009

Abstract

We describe the design and implementation of a system for data exploration over dependency parses and derived semantic representations in a large-scale NLP-based search system at powerset.com. Because the document repository and the processing infrastructure are distributed, and because the corpus data uses complex representations, standard text analysis tools such as grep, awk, or language modeling toolkits are not applicable. This paper explores the challenges of extracting statistical information and building language models in such a distributed NLP environment, and introduces Oceanography, a corpus analysis system that simplifies the writing of analysis code and transparently takes advantage of the existing distributed processing infrastructure.
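
As a rough illustration of the kind of analysis the abstract has in mind, the sketch below counts (head, relation) pairs over dependency parses that are split across shards, using a map step per shard and a reduce step to merge partial counts. It is not Oceanography's API, which the abstract does not describe. The triple-based parse format, the sample data, and the names `SHARDS`, `map_shard`, `reduce_counts`, and `mine` are all assumptions made for this example, and the shards are processed in a single process rather than on a cluster.

```python
from collections import Counter
from functools import reduce

# Hypothetical parse format (an assumption, not Powerset's representation):
# each sentence is a list of (head, relation, dependent) triples,
# and each shard is a list of sentences.
SHARDS = [
    [
        [("eat", "nsubj", "dog"), ("eat", "dobj", "bone")],
    ],
    [
        [("eat", "nsubj", "cat"), ("eat", "dobj", "fish")],
        [("sleep", "nsubj", "dog")],
    ],
]


def map_shard(sentences):
    """Map step: count (head, relation) pairs within a single shard."""
    counts = Counter()
    for triples in sentences:
        for head, relation, _dependent in triples:
            counts[(head, relation)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def mine(shards):
    """Run the map step on every shard, then fold the partial results."""
    return reduce(reduce_counts, map(map_shard, shards), Counter())


totals = mine(SHARDS)
for (head, relation), count in sorted(totals.items()):
    print(head, relation, count)
```

Because the reduce step is associative and commutative, the per-shard counts can be merged in any order, which is what allows this style of analysis to be spread over distributed infrastructure.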