Fast Bit Gather, Bit Scatter and Bit Permutation Instructions for Commodity Microprocessors

Authors:
Yedidya Hilewitz; Ruby B. Lee
Affiliations:
Princeton Architecture Laboratory for Multimedia and Security (PALMS), Department of Electrical Engineering, Princeton University, Princeton, USA 08544;Princeton Architecture Laboratory for Multimedia and Security (PALMS), Department of Electrical Engineering, Princeton University, Princeton, USA 08544
Venue:
Journal of Signal Processing Systems
Year:
2008


Abstract

Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequences of instructions needed to emulate these complicated bit operations. Because these operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit permutation instructions (including group, butterfly and inverse butterfly). We show that all of these instructions can be implemented efficiently using both the fast butterfly and inverse butterfly network datapaths. Specifically, we show that parallel deposit can be mapped onto a butterfly circuit and parallel extract can be mapped onto an inverse butterfly circuit. We define static, dynamic and loop-invariant versions of the instructions, with the static versions using a much simpler functional unit. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate, dynamically, the control signals for the butterfly and inverse butterfly datapaths. The simplest functional unit we propose is smaller and faster than an ALU. We also show that these instructions yield significant speedups over a basic RISC architecture for a variety of application kernels drawn from application domains including bioinformatics, steganography, coding, compression and random number generation.
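
To make the semantics of bit gather (parallel extract) and bit scatter (parallel deposit) concrete, the following is a minimal bit-at-a-time software sketch in Python. It is not the hardware design proposed in the paper and does not model the butterfly or inverse butterfly datapaths. The function names `pext` and `pdep` and the mask-based interface are assumptions chosen for illustration. The loop over mask bits also shows the kind of long instruction sequence that a word-oriented processor needs without direct support.

```python
def pext(x: int, mask: int) -> int:
    """Bit gather: pack the bits of x selected by mask into the low-order bits."""
    result = 0
    out_pos = 0                      # next free position in the packed result
    while mask:
        low = mask & -mask           # isolate the lowest set bit of the mask
        if x & low:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1             # clear the lowest set bit of the mask
    return result


def pdep(x: int, mask: int) -> int:
    """Bit scatter: spread the low-order bits of x to the positions set in mask."""
    result = 0
    in_pos = 0                       # next low-order bit of x to place
    while mask:
        low = mask & -mask           # isolate the lowest set bit of the mask
        if (x >> in_pos) & 1:
            result |= low
        in_pos += 1
        mask &= mask - 1             # clear the lowest set bit of the mask
    return result


if __name__ == "__main__":
    x, mask = 0b10110110, 0b01010101
    gathered = pext(x, mask)
    print(bin(gathered))             # bits 0, 2, 4, 6 of x, packed to the right
    print(bin(pdep(gathered, mask))) # scattered back to the masked positions
```

In this sketch the two operations are inverses on the masked bits: scattering a gathered value with the same mask reproduces `x & mask`.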