UPSTREAM: crypto: arm/chacha20 - faster 8-bit rotations and other optimizations
author Eric Biggers <ebiggers@google.com>
Sat, 1 Sep 2018 07:17:07 +0000 (00:17 -0700)
committer Eric Biggers <ebiggers@google.com>
Wed, 5 Dec 2018 20:30:44 +0000 (12:30 -0800)
commit cead285da698a148cb4995c193d06ffdfc09e8cc
tree e23f9bd0c28a6b524f08cb9abcd23b8bc581560c
parent e5577b7a50f1c7c002a05082d2a650adf6baa8bf

Optimize ChaCha20 NEON performance by:

- Implementing the 8-bit rotations using the 'vtbl.8' instruction.
- Streamlining the part that adds the original state and XORs the data.
- Making some other small tweaks.
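
The 'vtbl.8' item works because rotating a 32-bit word by 8 bits only moves
whole bytes, so the rotation can be expressed as a byte permutation driven by
an index table. The scalar C sketch below illustrates that equivalence only.
It is not the NEON code from this patch, and the names rotl32_8_shift,
rotl32_8_table and ROT8_IDX are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: rotate a 32-bit word left by 8 bits with shifts and an OR. */
uint32_t rotl32_8_shift(uint32_t x)
{
	return (x << 8) | (x >> 24);
}

/*
 * Hypothetical index table for one 32-bit word viewed as little-endian
 * bytes: output byte i is copied from input byte ROT8_IDX[i].
 */
static const uint8_t ROT8_IDX[4] = { 3, 0, 1, 2 };

/* Same rotation, done as a table-driven byte permutation. */
uint32_t rotl32_8_table(uint32_t x)
{
	uint8_t in[4], out[4];
	int i;

	/* Split the word into bytes, least significant byte first. */
	for (i = 0; i < 4; i++)
		in[i] = (uint8_t)(x >> (8 * i));

	/* Permute the bytes according to the index table. */
	for (i = 0; i < 4; i++)
		out[i] = in[ROT8_IDX[i]];

	/* Reassemble the permuted bytes into a word. */
	return (uint32_t)out[0] | ((uint32_t)out[1] << 8) |
	       ((uint32_t)out[2] << 16) | ((uint32_t)out[3] << 24);
}

/* Self-check: both methods must agree on a few sample words. */
void check_rot8_table(void)
{
	static const uint32_t samples[] = {
		0x00000000u, 0x01234567u, 0xdeadbeefu, 0xffffffffu
	};
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		assert(rotl32_8_table(samples[i]) == rotl32_8_shift(samples[i]));
}
```

A table lookup does the whole rotation in one step, where the shift-based
form needs a shift plus a second combining operation. That is presumably
where the gain comes from on CPUs with a fast table lookup.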

On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
about 12.08 cycles per byte to about 11.37 -- a 5.9% improvement.

There is a tradeoff involved with the 'vtbl.8' rotation method since
there is at least one CPU (Cortex-A53) where it's not fastest.  But it
seems to be a better default; see the added comment.  Overall, this
patch reduces Cortex-A53 performance by less than 0.5%.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit a1b22a5f45fe884147a99e7c381bcc48d9b2acef)
Bug: 112008522
Test: As series, see Ic61c13b53facfd2173065be715a7ee5f3af8760b
Change-Id: Id7d26e6079cee50111f6d9616459547c60e6cb3e
Signed-off-by: Eric Biggers <ebiggers@google.com>
arch/arm/crypto/chacha20-neon-core.S