Serpent S Boxes

The fastest implementations of Serpent depend on finding efficient boolean function decompositions for the eight Serpent S boxes and their inverses. One way of looking for such expressions is to perform a recursive search of all combinations of primitive boolean terms up to some limit. Here is the program I use to do this - it works simply by looking for the minimum number of terms required to express the S box in question.

Sam Simpson has been running this program as a background task on a cluster of high performance servers. After a search involving around 1000 machine hours we have found the S box decompositions given here.

Although these are the S boxes we have found with the minimum number of terms, it is not always true that these expansions give the best results. On the Pentium Pro and Pentium II dependency chains, out of order execution and cache line relationships between temporary variables means that the results can vary considerably as small changes are made in the way the S box functions are actually calculated.

The functions given above are those directly output by the analyser. On any particular machine it will be desirable to experiment with the order of terms (where the is quite a lot of flexibility) and with the reuse of the temporary variables used during function evaluation (the raw analyser output does not reuse temporaries).

Serpent Performance

Pentium II/III without MMX

Dag Arne Osvik has recently used a new search technique to find S box boolean functions tuned for the Pentium architecture. I have coded Serpent in assembler using these functions and obtained the following performance results (using my AES2 testing environment). The testing was undertaken on a 600MHz PIII system but the speed figures are given for the 200MHz reference platform.

Key Length	Operation	Static Key Schedule (C Interface)		Encapsulated Key Schedule (C++ Interface)
Key Length	Operation	cycles	speed (mbits/second)	cycles	speed (mbits/second)
128	key set	1292		1361
	encrypt	759	33.7	771	33.2
	decrypt	775	33.0	769	33.2
192	key set	1294		1360
	encrypt	761	33.6	768	33.3
	decrypt	772	33.1	765	33.4
256	key set	1273		1350
	encrypt	769	33.2	764	33.5
	decrypt	767	33.3	770	33.2

Pentium II/III with MMX

Serpent maps well onto the Pentium MMX registers since the bit-slice technique it uses allows two blocks to be processed in parallel by placing the corresponding 32-bit words of each block into the upper and lower 32-bit words of each MMX register respectively. This technique is illustrated by this assembler implementation using the NASM assembler with a Microsoft Visual C/C++ test harness. This implementation uses a pre-computed key schedule and is tuned for bulk encryption with the following results (cycles for two 128-bit blocks with the resulting speed based on the 200MHz reference platform):

Key Length	Operation	Assembler with a C Interface
Key Length	Operation	cycles	speed (mbits/second)
128	key set	1484
	encrypt	1138	44.9
	decrypt	1112	46.0
192	key set	1492
	encrypt	1134	45.1
	decrypt	1111	46.0
256	key set	1468
	encrypt	1134	45.1
	decrypt	1122	45.6

There is a penalty compared with non-MMX code because the MMX instruction set lacks a rotate instruction but this is more than compensated by the ability to process two blocks at a time. The use of S boxes that do not require table lookups is also a considerable benefit since the whole algorithm can be run in the MMX registers with only the key schedule requiring memory accesses.

Serpent

Serpent S Boxes as Boolean Functions

Serpent Performance

Pentium II/III without MMX

Pentium II/III with MMX