AES Algorithm Efficiency
This page gives the results I have obtained in implementing AES candidate algorithms from an efficiency perspective.
The figures given in the table below are clock cycles for key set-up and clock cycles per block for encryption and decryption. The rows giving speed figures provide encryption and decryption speeds (and their average) for 128-bit keys on the 200 MHz Pentium Pro reference platform, without any I/O byte/word/block inversion costs. The key set-up timings cover the computation of both the encryption and decryption key schedules. For Rijndael and CRYPTON the cost of the key schedule for decryption is much higher than that for encryption, so two figures are given for these algorithms (the encryption and decryption key schedule cycle counts respectively). Where they differ from the main figures, the encryption and decryption speeds for 128-bit keys with full input/output byte order conversion are given below them (note, however, that these are not fair for comparison purposes since many of the AES algorithms chose arbitrary byte order conventions).
Here (as a result of many requests) are the source code files used to obtain these results:
For comparison, DES in C (des.h, des.c and key.c), optimised for the Pentium Pro, requires 456 cycles for each 64-bit data block, corresponding to a speed of 28 Mbits/second on this platform running at 200 MHz (faster than many AES design teams appear to believe).
1. The order of the algorithms in the table is based on mean encryption/decryption speed for 128-bit keys.
2. Different designers have chosen a different number of rounds for their algorithms and this clearly has a major impact on performance. This is illustrated by Serpent in particular, which has a much higher diffusion count than the other algorithms.
3. Both DFC and HPC make use of 64-bit arithmetic. Since this will be the norm by the time the AES algorithm is selected, it makes little sense to judge the speed of these algorithms on 32-bit architectures.
4. I suspect that there is quite a bit that can still be done for E2 and SAFER+ since I have not spent a great deal of time on these. I am grateful for suggestions from Kazumaro Aoki of NTT Japan for speeding up my E2 implementation.
5. For TWOFISH I have pushed encryption/decryption speed without concern for the impact on the cost of the key schedule or memory use (I employ 8K+ bytes of table space). There are many optimisation alternatives for this algorithm and major reductions in key schedule and memory cost can be achieved if high volume encryption/decryption speed is not the primary concern. My thanks to Niels Ferguson for suggesting some useful speed ups for my original implementation.
6. Where an algorithm meets a superset of the AES requirements I have only implemented the AES compliant elements. Hence Rijndael is only implemented for a block size of 128 bits and for HPC only HPC(medium) has been coded.
7. For Crypton I have implemented both the version submitted for AES and the revised version 1 (issued on December 21st 1998).
8. The DEAL timings are based on the DES source code in which the IP/FP permutations have been implemented in DEAL and not in DES (as described in the DEAL specification).
9. I have implemented Loki97, Frog and MAGENTA for completeness although it appears that they have all been broken.
Security versus Performance Issues (HEALTH WARNINGS)
The work presented here is primarily about performance but it is not sensible to consider this in complete isolation from security since there are inevitably trade-offs between these attributes in the real world. One obvious example is the number of rounds employed by each algorithm and another is the quality of each round in achieving the 'mixing' that the cipher is intended to provide. The lower rows in the table are a first attempt to consider the ways in which security and performance are being traded off by each of the AES algorithms.
Diffusion. The lower rows dealing with diffusion should not be seen as a direct measure of algorithm security. They simply give an indication of how good the algorithms are at spreading out the influence of input bits on output bits. Obviously, diffusion counts say nothing about the 'quality' of the diffusion achieved and this alone means that this sort of analysis needs to be treated with extreme caution.
Minimum Number of Secure Rounds. Eli Biham has suggested that AES algorithm performance should be measured by timing the 'minimum number of secure rounds' for each candidate - that is the estimated number of rounds needed to make a brute force key search the most efficient form of attack. This is a controversial suggestion that some AES participants do not accept even in principle (for example, see the paper by Bruce Schneier and the Twofish team). In my view, however, we have to look at this in two steps - it seems to me that the principle has merit but that there is (and may continue to be) a problem in obtaining impartial and widely accepted values for the minimum number of secure rounds for each of the AES algorithms.
At the moment the table uses figures equivalent to those suggested by Eli Biham. However, as an AES algorithm contributor, Eli will not be seen by all as an impartial observer and it would hence be preferable to have a wider community basis for setting the values to be used in such a process. For the present, therefore, it is important to treat these figures (and any conclusions derived from them) with caution.
Implementation Assurance. It seems certain that a significant number of the AES algorithms will be judged to be secure at an algorithm level. In practice, however, most failures in cryptographic systems derive not from weaknesses in the algorithms used but rather from the exploitation of subtle flaws in the way the algorithms are implemented, or through the exploitation of interactions between algorithm implementations and the environments in which they operate. This means that it is vital to judge implementations not simply from the perspective of the performance that they provide but also from the perspective of 'implementation assurance': can we be confident that the implementation does what the algorithm requires, no more and no less, and can we also be confident that it does not interact with the wider environment in ways that can be exploited to undermine the security that the algorithm itself is otherwise capable of providing?
To answer these questions we need much more hard evidence derived from the implementation of the AES candidates in environments that are representative of practical use. In particular we have very little evidence at the moment of how the AES candidates behave in low end environments of the kind that may apply in much of electronic commerce and consumer applications.
We also need to understand whether the different AES algorithms are architecturally different in their ability to support implementation assurance techniques. None of the existing specifications are precise enough to support formal design and analysis methods and it even seems possible that at least some of the algorithms on offer are sufficiently complex to make these techniques difficult or even impossible to apply. This may also be true of other approaches for implementation assurance. This is an area where much more work is needed before any final algorithm choices are made.
Performance Measurement Approach
All the implementations provided and discussed here have been coded from scratch using the AES specification documents without prior reference to the source code provided by the design teams. However, comparisons with the reference code have sometimes been necessary because the specification and reference code for some algorithms are not consistent.
I have tried to keep a fairly consistent style across the source code to aid fair comparison. All the routines have been biased towards encryption and decryption speed without regard to key set-up or to memory use in tables. I have also used VTune, the Intel performance tuning software, to look for optimised code sequences. This was kindly supplied by Intel and I much appreciate their generous support. The code has been optimised for the Pentium Pro/II and it is hence likely to be relatively poor on the Pentium. Although I have pushed for speed, I have not used some techniques where the resulting source code has become especially obscure.
All the algorithms and results reported are based on implementation using Microsoft Visual C++ version 6. I have generally coded in standard C but I have used non-standard features where this is reasonable and would be done in practice. For example, I have used the intrinsic circular rotate operations that Microsoft C provides. Some of the algorithms are endian neutral; others are definitively either big-endian or little-endian. Most of the difficulties I have had with implementation relate to confusion about endianness. I am running on a little-endian machine (a Pentium II) and I have checked out the code in this context only. My inputs and outputs to algorithms are in 32-bit words, not bytes, and this impacts on the byte numbering conventions that are used. The algorithm source code provided below contains code to invert byte order on input and output to match the standard test values, but timings do not include these sections of code.
Please bear in mind that there may be mistakes in the source code - please let me know if you find any. If you use any of it, be especially careful about endianness since I have not designed to eliminate such problems.
I am happy for the source code given here to be used provided that any conditions set by the algorithm designers are complied with. My only additional requirement if you do use it is that you acknowledge its origin. At the request of NTT, E2 source code is available for academic study purposes only.
Obtaining accurate and repeatable timings has proved more difficult than I expected. After many experiments I have settled on the use of the Pentium Pro/II cycle counter as the basis for timing. For each of the three routines in question (key set-up, encrypt and decrypt) I time two code sequences containing respectively single and double calls to the routines in question (with random key and data inputs). After running the routines to eliminate initialisation effects I then time these two sequences 100 times each and take the difference between their minimum timings as the cycle count for the routine in question. My thanks to Doug Whiting for a useful exchange of views on timing issues.
Fair Algorithm Comparison - Timing and Byte Order
There is scope for confusion in the definition of the AES interface on whether any required changes to byte order are internal or external to this interface. The programming interface is byte oriented and suggests that byte order changes are not needed. But the test vectors are 'numeric' and this suggests that such changes are needed as a part of the AES algorithms. I cover this in more detail here.
A consequence of this uncertainty is that some design teams have implemented byte order inversion for input and output blocks whereas others have not. Since byte order change involves processing costs this can impact on performance comparisons.
I am running on a little-endian machine (a Pentium II) and this means that algorithms which involve byte, word or block order inversion ('big-endian' algorithms) are at a disadvantage if their speed is compared with that of 'little-endian' algorithms in my environment. Equally, if I were using a big-endian machine, it would be the 'little-endian' AES algorithms that were put at a disadvantage. In order to obtain fair comparisons I have therefore chosen to base my timings on those that can be achieved when an algorithm is run on a machine that matches its own endianness. In practice this means that I do not include the processing cost of byte, word or block order inversions on input and output in my timing measurements. For some algorithms there are more deeply buried endian issues but, fortunately, these turn out to be small in comparison with the costs involved in input/output byte order changes. I hence believe that the timings given here are a fair assessment of what the AES algorithms can achieve when run in environments that match their own design endianness.
Since other comparisons, for example, that compiled by Helger Lipmaa, include results for both big and little endian processors it is important that they use full timing data including endian conversion costs (however care has to be taken to ensure that results on big and little endian machines are given equal weight). The figures I have provided for Helger are hence full timings that include I/O conversion costs.
I have attempted to measure how effective the algorithms are at mixing up bits by measuring their 'diffusion'. I do this as follows. After setting up random key and data input blocks I run the encryption round function for the algorithm in question once or twice. I then change just one bit in the data input block and count how many bits in the output block change. I do this for each bit in the input block and I repeat this for a large number of random inputs to obtain average values. I then calculate for each input bit how many output bits change on average; I also calculate for each output bit how many input bits change its value (these two outputs are very similar on average but they are sometimes very different for particular bits). Since some algorithms operate using 'half-rounds' I use the best of the 1 and 2 round diffusion values to calculate an overall diffusion count for each algorithm.
A number of people have contributed ideas for improvements in the source code and the analysis presented here. I gratefully acknowledge the contributions made by the following people: