With more and more natural-language text stored in databases, handling the corresponding query predicates becomes increasingly important. Optimizing queries with such predicates involves (sub)string selectivity estimation, i.e., estimating the selectivity of query terms from small summary statistics before query execution. Count Suffix Trees (CSTs) are commonly used for this purpose. While CSTs yield good estimates, they are expensive to build and require a large amount of memory to store. To fit into the data dictionary of a database system, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CSTs over natural-language texts and the cost of building them. The core idea is to exploit features of the natural-language data, i.e., to retain only those suffixes that are meaningful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix- and prefix-stripping procedure that conflates more terms than conventional stemming techniques, and (c) the deployment of state-of-the-art trigram techniques together with a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies such as foreign words), which would otherwise cause disproportionate growth of the CST.

Our evaluation with large English text corpora shows that our new mechanisms, in combination, decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if the storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.
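To make the underlying mechanism concrete, the following is a minimal sketch of selectivity estimation with a count-based suffix structure. It uses a plain hash-backed suffix *trie* rather than a compressed suffix tree, and it illustrates only the two conventional pruning knobs the abstract mentions (tree depth and suffix frequency), not the linguistic filtering techniques the paper proposes; all class and parameter names here are hypothetical.

```python
from collections import defaultdict

class CountSuffixTrie:
    """Simplified count suffix structure for substring selectivity
    estimation: every suffix of the indexed text is inserted, truncated
    to a maximum depth, and each substring node carries an occurrence
    count used as the estimate."""

    def __init__(self, max_depth=4):
        self.max_depth = max_depth
        self.counts = defaultdict(int)  # substring -> occurrence count
        self.total = 0                  # number of indexed suffix positions

    def add_text(self, text):
        # Insert every suffix, truncated to max_depth (depth pruning).
        for i in range(len(text)):
            for j in range(i + 1, min(i + 1 + self.max_depth, len(text) + 1)):
                self.counts[text[i:j]] += 1
        self.total += len(text)

    def prune(self, min_count=2):
        # Frequency pruning: drop rare substrings to shrink the summary,
        # trading accuracy on infrequent terms for space.
        kept = {s: c for s, c in self.counts.items() if c >= min_count}
        self.counts = defaultdict(int, kept)

    def selectivity(self, pattern):
        # Estimated fraction of suffix positions that start with `pattern`.
        return self.counts.get(pattern, 0) / max(self.total, 1)

cst = CountSuffixTrie()
cst.add_text("database systems handle text data")
est = cst.selectivity("data")  # "data" occurs twice among 33 positions
```

Pruning makes the estimator biased toward underestimation for substrings whose counts were dropped; the filtering techniques proposed in the paper instead try to avoid ever materializing linguistically useless suffixes, so the space budget is spent on suffixes that queries are likely to probe.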