Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
Fast hashing of variable-length text strings
Communications of the ACM
A vector space model for automatic indexing
Communications of the ACM
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
Time and area efficient pattern matching on FPGAs
FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Efficient packet classification for network intrusion detection using FPGA
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Introduction to Information Retrieval
Introduction to Information Retrieval
Proceedings of the conference on Design, automation and test in Europe
Heterogeneous Multi-core Design for Information Retrieval Efficiency on the Vector Space Model
FSKD '08 Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 05
Advanced Data Mining Techniques
Advanced Data Mining Techniques
Exploring GPU architectures to accelerate semantic comparison for intention-based search
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Detection of HTTP-GET attack with clustering and information theoretic measurements
FPS'12 Proceedings of the 5th international conference on Foundations and Practice of Security
Hi-index | 0.00 |
This paper describes our approach to adapting a text document similarity classifier based on the Term Frequency Inverse Document Frequency (TFIDF) metric to two massively multi-core hardware platforms. The TFIDF classifier is used to detect web attacks in HTTP data. In our parallel hardware approaches, we design streaming, real time classifiers by simplifying the sequential algorithm and manipulating the classifier's model to allow decision information to be represented compactly. Parallel implementations on the Tilera 64-core System on Chip and the Xilinx Virtex 5-LX FPGA are presented. For the Tilera, we employ a reduced state machine to recognize dictionary terms without requiring explicit tokenization, and achieve throughput of 37 MB/s at a slightly reduced accuracy. For the FPGA, we have developed a set of software tools to help automate the process of converting training data to synthesizable hardware and to provide a means of trading off between accuracy and resource utilization. The Xilinx Virtex 5-LX implementation requires 0.2% of the memory used by the original algorithm. At 166 MB/s (80X the software) the hardware implementation is able to achieve Gigabit network throughput at the same accuracy as the original algorithm.