The possible options for encryption/decryption:
- AES CBC
CBC mode has a built-in limitation: in the encryption direction (decryption doesn't have this problem) the 16-byte blocks need to be processed one after another, which limits parallelism. When you need to process many messages/streams in parallel, you can get enough parallelism to saturate the AESNI units, but that depends on the traffic model.
- AES CTR
CTR mode and modes based on CTR (like GCM and CCM), on the other hand, can process all blocks in parallel. This means that with long enough messages, a single flow of data can saturate the AESNI units. 768 bytes is long enough.
AES is either slow or insecure without special hardware support. Fortunately, since ~all x86 chips since 2010 have AESNI, AES is fast on x86: on the order of a few gigabytes (not gigabits) per second per core, when AES blocks can be processed in parallel (i.e. CTR and CTR-based modes, not CBC).
ChaCha20 exists to be fast on chips that don't have hardware AES, like phones and tablets. On x86 (for example, on the other side of a connection from a phone) it is fast enough, maybe 1.6GB/sec. On phones and tablets, something like 100MB/sec, depending on the micro-architecture and the power/cooling budget.
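The CBC vs CTR difference described above comes down to data dependencies, which can be sketched roughly as follows. This is only an illustration: `toy_block_fn` is a hypothetical stand-in built from a keyed hash, not AES and not a usable cipher, and the 8-byte nonce plus 8-byte counter layout is an assumption made for the example.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def toy_block_fn(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for one AES block encryption (NOT a real cipher)."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    """CBC encryption: block i cannot start until ciphertext block i-1 is known."""
    out = []
    prev = iv
    for plain in blocks:
        prev = toy_block_fn(key, xor(plain, prev))  # serial dependency on prev
        out.append(prev)
    return out


def ctr_encrypt(key: bytes, nonce: bytes, blocks: list) -> list:
    """CTR: each keystream block depends only on (nonce, counter)."""
    def one_block(i: int, plain: bytes) -> bytes:
        counter_block = nonce + i.to_bytes(8, "big")  # assumed 8+8 byte layout
        return xor(plain, toy_block_fn(key, counter_block))

    # The calls below are independent of each other, so hardware (or several
    # threads) could compute them in any order or all at once.
    return [one_block(i, p) for i, p in enumerate(blocks)]
```

In `cbc_encrypt` the loop carries `prev` from one iteration to the next, so a single stream is stuck at the latency of one block operation at a time. In `ctr_encrypt` nothing is carried between blocks, which is what lets a single long message keep the pipelined AES units busy.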
The possible options for MAC:
- HMAC SHA
HMAC is a bit more complicated than the underlying hash function, but for longer messages it is just a bit slower than the raw hash function. SHA-256 is slow, on the order of 400MB/sec. With AVX when processing multiple streams, or with Intel SHA Extensions, it can be OK, up to a few gigabytes per second per core (e.g. see this). The SHA instructions are new, not common. Support is bad. Intel's IPSec library supports them though.
- CBCMAC (AES-XCBC-MAC-96 or CCM)
CBCMAC uses the same AES execution units to process the same amount of data, so it at best halves your throughput compared to just encryption/decryption. It uses CBC internally, so it's slower than CTR. The advantage is that a single accelerator circuit can be used for both encryption and MAC, saving silicon area/cost, and also lowering complexity (chance of bugs).
- GMAC (for GCM)
GCM uses GMAC for MAC (and CTR for encryption). GMAC is slow without hardware support in the form of CLMUL instructions. Fortunately ~all x86 chips since 2010 have that.
Poly1305 was invented to be fast on chips that don't have special hardware like SHA extensions or CLMUL, like phones and tablets. It's fast enough there (maybe 100MB/sec) and on x86 (maybe 2GB/sec).
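To see why HMAC is only a bit slower than the raw hash on longer messages, it helps to look at the construction itself. The sketch below builds HMAC-SHA256 by hand from `hashlib.sha256`; the function name `hmac_sha256_manual` is made up for this example, and in real code you would just use the standard `hmac` module.

```python
import hashlib
import hmac


def hmac_sha256_manual(key: bytes, msg: bytes) -> bytes:
    """HMAC-SHA256 written out by hand, for illustration only."""
    block_size = 64  # SHA-256 processes input in 64-byte blocks
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")    # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    # Inner hash: one extra 64-byte block in front of the message.
    inner = hashlib.sha256(ipad + msg).digest()
    # Outer hash: a short, fixed-size input (64 + 32 bytes).
    return hashlib.sha256(opad + inner).digest()


# Sanity check against the standard library implementation.
assert hmac.compare_digest(
    hmac_sha256_manual(b"key", b"message"),
    hmac.new(b"key", b"message", hashlib.sha256).digest(),
)
```

The extra work is a fixed amount per message (the padded key block and the short outer hash), so it matters for tiny messages and fades away as the message gets longer.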
In theory, the encryption and MAC modes can be mixed and matched, but in practice they are used in only a few combinations.
Now, how to combine the performance numbers for the encryption part with the performance numbers for the MAC?
If they cannot be overlapped, you need to use
(X*Y)/(X+Y), where X and Y are the performance of the two individual parts in the same units (e.g. MB/sec). So, for example, 520MB/sec encryption and 670MB/sec MAC leads to 293MB/sec combined throughput.
But in some cases they can be overlapped, and then you might get the performance of the slower of the two, with the other being "free". This is the case with GCM, which can run at the same speed as CTR, with GMAC being effectively free. This is why it is popular. TLS and SSH switched to GCM (with ChaPoly for mobile). WireGuard is ChaPoly-only, for simplicity.
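The two cases above can be put into a small calculation. This is a simplified model, not a measurement: the non-overlapped case assumes each byte is first encrypted and then MACed, so the time per byte is 1/X + 1/Y, and the overlapped case assumes perfect overlap. The function names are just for this example.

```python
def serial_throughput(x: float, y: float) -> float:
    """Combined rate when encryption and MAC run one after the other.

    Time per byte adds up (1/x + 1/y), so the rate is
    1 / (1/x + 1/y) = (x * y) / (x + y).
    """
    return (x * y) / (x + y)


def overlapped_throughput(x: float, y: float) -> float:
    """Combined rate when both parts fully overlap: the slower one wins."""
    return min(x, y)


# The example from the text: 520MB/sec encryption, 670MB/sec MAC.
print(round(serial_throughput(520, 670)))   # 293
print(overlapped_throughput(520, 670))      # 520
```

Real implementations usually land somewhere between the two numbers, depending on how well the encryption and MAC instructions interleave.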
The above is generalities. I would suggest you always run your own benchmarks of the exact implementation on the exact hardware you intend to use. Message sizes play a role, processing multiple message streams in parallel plays a role, and the ability to overlap encryption and MAC plays a role.
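As a starting point, a minimal benchmark might look something like the sketch below. It only measures HMAC-SHA256 as exposed by Python's standard `hmac`/`hashlib` modules (the standard library has no AES), so it reflects that particular implementation plus interpreter overhead, not what a tuned C or assembly library would achieve. The function name, the message sizes and the 8MB default are arbitrary choices for the example.

```python
import hashlib
import hmac
import time


def hmac_sha256_throughput(msg_size: int, total_bytes: int = 8 * 1024 * 1024) -> float:
    """Measure HMAC-SHA256 throughput in MB/sec for one message size."""
    key = b"\x00" * 32
    msg = b"\x00" * msg_size
    iterations = max(1, total_bytes // msg_size)
    start = time.perf_counter()
    for _ in range(iterations):
        hmac.new(key, msg, hashlib.sha256).digest()
    elapsed = time.perf_counter() - start
    return (iterations * msg_size) / elapsed / 1e6


if __name__ == "__main__":
    # Small messages are dominated by per-message overhead, large ones by
    # the raw hash speed, so try the sizes your traffic actually has.
    for size in (64, 768, 16384):
        print(f"{size:>6} bytes: {hmac_sha256_throughput(size):8.1f} MB/sec")
```

Running it with a few different sizes shows how much the per-message overhead matters. The same pattern (fixed total bytes, varying message size, one timing loop per size) carries over to whatever cipher and MAC implementation you actually plan to deploy.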