Analysis of source identified text corpora: exploring the statistics of the reused text and authorship

Authors:
Akiko Aizawa
Affiliations:
National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
Venue:
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Year:
2003

Abstract

This paper aims at providing a view of text recycled, within a short time, by the authors themselves. We first present a simple and general method for extracting reused term sequences, and then analyze several author-identified text collections to compare the statistical quantities. The ratio of recycling is also measured for each collection. Finally, related research topics are introduced together with some discussion of future research directions.
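
The abstract does not spell out the extraction method, so the following is only a minimal sketch of one common way to find reused term sequences and measure a recycling ratio, not the paper's actual procedure. It assumes whitespace tokenization, fixed-length word n-grams (n = 3), and a ratio defined as the fraction of a document's tokens covered by an n-gram that also occurs in another document. The function names and the sample texts are hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Yield (start_index, n-gram tuple) for every window of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield i, tuple(tokens[i:i + n])


def reused_sequences(docs, n=3):
    """Return the n-grams that occur in two or more documents.

    docs maps a document id to its text; the result maps each shared
    n-gram to the set of document ids that contain it.
    """
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        for _, gram in ngrams(text.lower().split(), n):
            seen[gram].add(doc_id)
    return {gram: ids for gram, ids in seen.items() if len(ids) > 1}


def recycling_ratio(text, reused, n=3):
    """Fraction of tokens in text covered by at least one reused n-gram."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    covered = [False] * len(tokens)
    for i, gram in ngrams(tokens, n):
        if gram in reused:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)


# Hypothetical texts standing in for two documents by the same author.
docs = {
    "a": "we present a simple method for text reuse",
    "b": "here we present a simple method again",
}
reused = reused_sequences(docs)
ratio_a = recycling_ratio(docs["a"], reused)
```

With these sample texts, three trigrams are shared and five of the eight tokens in document "a" fall inside a shared trigram. A fixed n keeps the sketch short; a real analysis would likely need variable-length sequences and per-author grouping of the collection.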