Constructing a large scale text corpus based on the grid and trustworthiness

Authors:
Peifeng Li;Qiaoming Zhu;Peide Qian;Geoffrey C. Fox
Affiliations:
School of Computer Science & Technology, Soochow University, Suzhou, China and Community Grids Lab, Indiana University, Bloomington, IN;School of Computer Science & Technology, Soochow University, Suzhou, China;School of Computer Science & Technology, Soochow University, Suzhou, China;Community Grids Lab, Indiana University, Bloomington, IN
Venue:
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
Year:
2007



Abstract

Constructing a large-scale corpus is a hard task. This paper presents a novel, trustworthiness-based approach that automatically builds a large-scale text corpus at low cost and in a short time. It addresses two main problems: how to build a large-scale text corpus from the Web automatically, and how to correct mistakes in that corpus. Because the Grid provides infrastructure for processing large-scale data, the approach first uses the Grid to collect and process language materials from the Web. It then identifies untrustworthy language materials in the corpus according to their trustworthiness and has users check them manually. Once checking is complete, the approach computes the trustworthiness of each checked result and selects the results with the highest trustworthiness as the correct ones.
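The correction stage described in the abstract can be illustrated with a minimal sketch. The abstract does not say how trustworthiness is computed, so everything here is an assumption made for illustration: the fixed threshold, the per-checker reliability weights, the summed-weight scoring, and the names `flag_untrustworthy` and `select_correction` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def flag_untrustworthy(corpus, threshold=0.6):
    """Return the ids of items whose trust score is below the threshold.

    corpus maps an item id to (annotation, trust_score).
    The threshold value is an assumption, not a figure from the paper.
    """
    return [item_id for item_id, (_, trust) in corpus.items() if trust < threshold]


def select_correction(checks):
    """Pick the checked result with the highest aggregated trust.

    checks is a list of (result, checker_reliability) pairs.
    Summing reliabilities per result is an assumed scoring rule.
    """
    scores = defaultdict(float)
    for result, reliability in checks:
        scores[result] += reliability
    return max(scores, key=scores.get)


# Hypothetical data: two annotated items, one with a low trust score.
corpus = {"s1": ("noun", 0.9), "s2": ("verb", 0.3)}
to_check = flag_untrustworthy(corpus)

# Hypothetical manual checks of item "s2" by three users.
checks = [("noun", 0.9), ("verb", 0.5), ("verb", 0.3)]
best = select_correction(checks)
```

Under this assumed scoring rule, a single highly reliable checker can outweigh two less reliable ones. Whether the paper's method behaves this way cannot be determined from the abstract.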