Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Authors:
Daisuke Ikeda; Yasuhiro Yamada; Sachio Hirokawa
Venue:
DS '01 Proceedings of the 4th International Conference on Discovery Science
Year:
2001

Abstract

We propose a preprocessing method for Web mining that, given semi-structured documents sharing the same structure and style, separates the useless parts of each document from the non-useless parts without any prior knowledge of the documents. It rests on a simple idea: an n-gram is useless if it appears frequently. To choose an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count, which is the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams that make up the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm is language-independent, robust to noise, and applicable to multiple formats.
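
The core idea of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation; it rests on several assumptions that the abstract does not confirm: n-grams are taken over characters, the frequency of an n-gram is the number of documents that contain it, an n-gram counts as frequent when that number is at least a, and a character is useless when at least one frequent n-gram covers it. All function names are hypothetical.

```python
from collections import Counter


def ngram_doc_counts(docs, n):
    """Count, for each character n-gram, how many documents contain it.

    Assumption: 'frequency' means document frequency.
    """
    counts = Counter()
    for doc in docs:
        # A set ensures each n-gram is counted once per document.
        counts.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return counts


def useless_mask(doc, n, frequent):
    """Mark every character covered by at least one frequent n-gram."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if doc[i:i + n] in frequent:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless parts."""
    return sum(1 for prev, cur in zip(mask, mask[1:]) if prev != cur)


def split_document(docs, index, n, a):
    """Return the non-useless text of docs[index] and its alternation count."""
    counts = ngram_doc_counts(docs, n)
    # Assumption: an n-gram is frequent if at least `a` documents contain it.
    frequent = {gram for gram, c in counts.items() if c >= a}
    doc = docs[index]
    mask = useless_mask(doc, n, frequent)
    kept = "".join(ch for ch, useless in zip(doc, mask) if not useless)
    return kept, alternation_count(mask)


docs = ["<h1>Rain</h1>", "<h1>Snow</h1>", "<h1>Wind</h1>"]
print(split_document(docs, 0, n=4, a=3))
```

In this toy input the shared markup is marked useless and only the headline text remains. The sketch computes the alternation count for one given pair (n, a); the abstract does not state how the count is used to choose the pair, so no selection rule is shown here.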