Removing manually generated boilerplate from electronic texts: experiments with Project Gutenberg e-books

Authors:
Owen Kaser;Daniel Lemire
Affiliations:
University of New Brunswick;Université du Québec à Montréal
Venue:
CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research
Year:
2007

Abstract

Collaborative work on unstructured or semi-structured documents, such as literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users or over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg™ corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
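
To make the idea of "frequent occurrences" concrete, here is a minimal sketch of one way such a statistical filter could work: count how many documents each line appears in, and treat lines shared by a large fraction of the corpus as template text. This is an illustration only and is not taken from the paper, whose actual algorithm is not described in this record. The function names, the `min_fraction` threshold, and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def frequent_lines(documents, min_fraction=0.5):
    """Return lines that occur in at least min_fraction of the documents.

    Hypothetical helper: the threshold and line-level granularity are
    assumptions for illustration, not the paper's method.
    """
    doc_frequency = Counter()
    for text in documents:
        # Count each distinct non-empty line once per document.
        distinct = {line.strip() for line in text.splitlines() if line.strip()}
        doc_frequency.update(distinct)
    threshold = min_fraction * len(documents)
    return {line for line, count in doc_frequency.items() if count >= threshold}


def strip_boilerplate(text, boilerplate):
    """Drop lines flagged as boilerplate, keeping the rest in order."""
    kept = [line for line in text.splitlines() if line.strip() not in boilerplate]
    return "\n".join(kept)


# Toy corpus with made-up preamble and epilogue lines.
docs = [
    "Produced by volunteers\nCall me Ishmael.\nEnd of this eBook",
    "Produced by volunteers\nIt was the best of times.\nEnd of this eBook",
    "Prepared by a volunteer\nAll happy families are alike.\nEnd of this eBook",
]

boilerplate = frequent_lines(docs, min_fraction=0.6)
cleaned = [strip_boilerplate(doc, boilerplate) for doc in docs]
```

In this toy corpus the third document keeps its manually typed variant of the preamble, because that line occurs only once. This mirrors the inconsistency the abstract describes: exact-match frequency counting catches copied and pasted text but can miss hand-typed variations.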