About Me
Policy Analysis


Serpent S Boxes as Boolean Functions

The fastest implementations of Serpent depend on finding efficient boolean function decompositions for the eight Serpent S boxes and their inverses.  One way of looking for such expressions is to perform a recursive search of all combinations of primitive boolean terms up to some limit.  Here is the program I use to do this - it works simply by looking for the minimum number of terms required to express the S box in question.

Sam Simpson has been running this program as a background task on a cluster of high performance servers.  After a search involving around 1000 machine hours we have found the S box decompositions given here.  

Although these are the S boxes we have found with the minimum number of terms, it is not always true that these expansions give the best results.  On the Pentium Pro and Pentium II  dependency chains, out of order execution and cache line relationships between temporary variables means that the results can vary considerably as small changes are made in the way the S box functions are actually calculated.

The functions given above are those directly output by the analyser.  On any particular machine it will be desirable to experiment with the order of terms (where the is quite a lot of flexibility) and with the reuse of the temporary variables used during function evaluation (the raw analyser output does not reuse temporaries).

Serpent Performance

Pentium II/III without MMX

Dag Arne Osvik has recently used a new search technique to find S box boolean functions tuned for the Pentium architecture.  I have coded Serpent in assembler using these functions and obtained the following performance results (using my AES2 testing environment).   The testing was undertaken on a 600MHz PIII system but the speed figures are given for the 200MHz reference platform.

 Key Length   Operation   Static Key Schedule (C Interface)    Encapsulated Key Schedule (C++ Interface) 
    cycles    speed (mbits/second)    cycles   speed (mbits/second)
128 key set 1292   1361  
encrypt 759 33.7 771 33.2
decrypt 775 33.0 769 33.2
192 key set 1294   1360  
encrypt 761 33.6 768 33.3
decrypt 772 33.1 765 33.4
256 key set 1273   1350  
encrypt 769 33.2 764 33.5
decrypt 767 33.3 770 33.2

Pentium II/III with MMX

Serpent maps well onto the Pentium MMX registers since the bit-slice technique it uses allows two blocks to be processed in parallel by placing the corresponding 32-bit words of each block into the upper and lower 32-bit words of each MMX register respectively. This technique is illustrated by this assembler implementation using the NASM assembler with a Microsoft Visual C/C++ test harness.  This implementation uses a pre-computed key schedule and is tuned for bulk encryption with the following results (cycles for two 128-bit blocks with the resulting speed based on the 200MHz reference platform):

 Key Length   Operation   Assembler with a C Interface 
    cycles    speed (mbits/second)
128 key set 1484  
encrypt 1138 44.9
decrypt 1112 46.0
192 key set 1492  
encrypt 1134 45.1
decrypt 1111 46.0
256 key set 1468  
encrypt 1134 45.1
decrypt 1122 45.6

There is a penalty compared with non-MMX code because the MMX instruction set lacks a rotate instruction but this is more than compensated by the ability to process two blocks at a time.  The use of S boxes that do not require table lookups is also a considerable benefit since the whole algorithm can be run in the MMX registers with only the key schedule requiring memory accesses.

Back to Brian Gladman's Home Page