Real-time data pre-processing technique for efficient feature extraction in large scale datasets

Authors:
Ying Liu;Lucian V. Lita;R. Stefan Niculescu;Kun Bai;Prasenjit Mitra;C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA, USA;Siemens Medical Solutions, Malvern, PA, USA;Siemens Medical Solutions, Malvern, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

References (13):
A fast string searching algorithm. Communications of the ACM.
Efficient string matching: an aid to bibliographic search. Communications of the ACM.
IEPAD: information extraction based on pattern discovery. Proceedings of the 10th international conference on World Wide Web.
A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming.
Matching web site structure and content. Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters.
Snort - Lightweight Intrusion Detection for Networks. LISA '99 Proceedings of the 13th USENIX conference on System administration.
Detecting spam web pages through content analysis. Proceedings of the 15th international conference on World Wide Web.
Finding advertising keywords on web pages. Proceedings of the 15th international conference on World Wide Web.
SecuBat: a web vulnerability scanner. Proceedings of the 15th international conference on World Wide Web.
On-line Approximate String Matching in Natural Language. Fundamenta Informaticae.
A new suffix tree similarity measure for document clustering. Proceedings of the 16th international conference on World Wide Web.
Mining contiguous sequential patterns from web logs. Proceedings of the 16th international conference on World Wide Web.
Discovering interesting usage patterns in text collections: integrating text mining with visualization. Proceedings of the sixteenth ACM conference on information and knowledge management.

Cited by (1):
Orientation distance-based discriminative feature extraction for multi-class classification. CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management.


Abstract

Domain-specific data sources continue to grow rapidly, creating a sustained need for fast processing in time-sensitive applications such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery such as creating evidence-based medicine. Because parallel multi-string matching is at the core of most data mining tasks in these applications, faster on-line matching over static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this need, which traditional information extraction and retrieval techniques do not handle efficiently, we propose a Block Suffix Shifting-based approach that improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the different block structures of domain-specific data for off-line and on-line parallel matching. Experiments on several real-world datasets show that our approach yields significant performance improvements.
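
For readers unfamiliar with multi-string matching, the following is a minimal Python sketch of Aho-Corasick, one of the baseline algorithms named in the abstract. It is not the Block Suffix Shifting approach, whose details are not given in this record. The function names (build_automaton, find_all) and the sample keywords are illustrative assumptions only.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (Aho-Corasick baseline, not BSS)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pattern in patterns:
        node = 0
        for ch in pattern:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pattern)

    # Breadth-first pass: shallower nodes are finalized before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


if __name__ == "__main__":
    # Hypothetical keyword set; prints [(1, 'she'), (2, 'he'), (2, 'hers')]
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The sketch scans the text once regardless of how many keywords are loaded, which is the property that makes this family of algorithms a natural baseline for on-line matching over streaming data.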