Using randomized projection techniques to aid in detecting high-dimensional malicious applications
Proceedings of the 49th Annual Southeast Regional Conference
This paper describes a research effort to improve the use of cosine similarity, an information retrieval technique, for detecting unknown rogue software, known rogue software, and variants of known rogue software by applying the feature extraction technique of randomized projection. Document similarity techniques, such as cosine similarity, have been used with great success in several document retrieval applications. Following a standard information retrieval methodology, software in machine-readable format can be regarded as the documents of a corpus. These "documents" may or may not have known rogue functionality. The query is software, again in machine-readable format, that contains a certain type of rogue software. This methodology makes it possible to search the corpus with a query and retrieve or identify potentially rogue software, as well as other instances of the same type of vulnerability. Retrieval is based on the similarity of the query to each document in the corpus. To overcome the 'curse of dimensionality' that can occur with this type of information retrieval technique, randomized projections are used to create a low-order embedding of the high-dimensional data. For our experiment, we obtain Microsoft Windows applications, infect a subset of them with several common Trojans, and apply our dimensionality reduction and prediction methodology. Preliminary results show promise in both prediction speed and space efficiency when randomized projections are applied to cosine similarity, compared with using cosine similarity alone.
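The retrieval pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the byte 2-gram features, the random sign (+1/-1) projection matrix, the target dimension `k = 256`, and all function names and toy byte strings are assumptions introduced here for illustration only. Each binary is turned into an n-gram count vector, projected into `k` dimensions, and compared with the query by cosine similarity in the reduced space.

```python
import hashlib
import math
import random
from collections import Counter


def ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a binary 'document' (assumed features)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def feature_signs(feature: bytes, k: int) -> list:
    """Row of a random +/-1 projection matrix for one n-gram.

    The row is regenerated from a hash of the n-gram, so the full
    (vocabulary x k) matrix never has to be stored, and the same n-gram
    always maps to the same row.
    """
    seed = int.from_bytes(hashlib.sha256(feature).digest()[:8], "big")
    rng = random.Random(seed)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(k)]


def project(counts: Counter, k: int = 256) -> list:
    """Project a sparse n-gram count vector into k dimensions."""
    out = [0.0] * k
    for feature, count in counts.items():
        for j, sign in enumerate(feature_signs(feature, k)):
            out[j] += count * sign
    scale = 1.0 / math.sqrt(k)  # keeps expected squared norms unchanged
    return [value * scale for value in out]


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data (hypothetical): a clean "application", a payload, and an infected copy.
payload = bytes(range(40, 72)) * 4
clean = bytes((i * 7) % 256 for i in range(400))
infected = clean[:200] + payload + clean[200:]

query_vec = project(ngram_counts(payload))
sim_clean = cosine(query_vec, project(ngram_counts(clean)))
sim_infected = cosine(query_vec, project(ngram_counts(infected)))
print(f"clean: {sim_clean:.3f}  infected: {sim_infected:.3f}")
```

In this toy setup the infected copy should score noticeably higher against the payload query than the clean copy does, while every document is stored as `k` numbers rather than one count per distinct n-gram. A larger `k` reduces the distortion introduced by the projection at the cost of more space and time per comparison.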