Constructing a large scale text corpus based on the grid and trustworthiness

  • Authors:
  • Peifeng Li; Qiaoming Zhu; Peide Qian; Geoffrey C. Fox

  • Affiliations:
  • School of Computer Science & Technology, Soochow University, Suzhou, China and Community Grids Lab, Indiana University, Bloomington, IN; School of Computer Science & Technology, Soochow University, Suzhou, China; School of Computer Science & Technology, Soochow University, Suzhou, China; Community Grids Lab, Indiana University, Bloomington, IN

  • Venue:
  • TSD'07: Proceedings of the 10th International Conference on Text, Speech and Dialogue
  • Year:
  • 2007

Abstract

Constructing a large-scale corpus is a hard task. A novel trustworthiness-based approach is designed to build a large-scale text corpus automatically, at low cost and with a short building period. It mainly solves two problems: how to build a large-scale text corpus from the Web automatically, and how to correct mistakes in the corpus. Because the Grid provides the infrastructure for processing large-scale data, our approach first uses the Grid to collect and process language materials from the Web. It then picks out untrustworthy language materials in the corpus according to their trustworthiness and has users check them manually. After the check finishes, our approach computes the trustworthiness of each checked result and selects the results with the highest trustworthiness as the correct ones.
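
The correction stage described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how trustworthiness is computed, so everything here is an assumption made for illustration: the function names, the 0.5 threshold, and the scoring rule, which treats the trustworthiness of a checked result as the share of total checker weight that supports it. It should not be read as the authors' actual method.

```python
from collections import defaultdict


def select_untrustworthy(scores, threshold=0.5):
    """Return the ids of corpus items whose trustworthiness is below threshold.

    `scores` maps an item id to a trustworthiness value in [0, 1].
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return [item_id for item_id, score in scores.items() if score < threshold]


def resolve_checked_results(checks):
    """Pick the checked result with the highest trustworthiness.

    `checks` is a list of (checker_weight, proposed_result) pairs for one
    item. Assumed scoring rule: a result's trustworthiness is the fraction
    of the total checker weight that voted for it.
    """
    if not checks:
        raise ValueError("at least one checked result is required")
    totals = defaultdict(float)
    for weight, result in checks:
        totals[result] += weight
    total_weight = sum(totals.values())
    if total_weight <= 0:
        raise ValueError("checker weights must sum to a positive value")
    best = max(totals, key=totals.get)
    return best, totals[best] / total_weight


# Hypothetical data: three corpus items and three manual checks of one item.
scores = {"s1": 0.9, "s2": 0.3, "s3": 0.45}
to_check = select_untrustworthy(scores)
best, trust = resolve_checked_results([(0.8, "noun"), (0.4, "verb"), (0.5, "noun")])
```

In this sketch, items `s2` and `s3` are routed to manual checking, and the result `"noun"` is selected because it carries most of the checker weight. A weighted vote is only one plausible way to score checked results; the full paper should be consulted for the actual definition.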