[ARM] Improve csum_fold, cleanup csum_tcpudp_magic()
csum_fold() doesn't need two assembly instructions to perform its task;
it can simply add the high and low halves together by adding the sum to
itself rotated by 16 bits, and the carry into the upper 16 bits happens
automatically.
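For illustration only, here is a minimal portable C sketch of the
rotate-and-add fold described above, next to the classical two-step
fold. The names ror16(), fold_rot() and fold_ref() are hypothetical and
are not the kernel's; the actual change is done in ARM inline assembly.

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit value right by 16 bits,
 * which swaps its high and low halves. */
static inline uint32_t ror16(uint32_t x)
{
    return (x >> 16) | (x << 16);
}

/* Sketch of the single-add fold: adding sum to its rotated self puts
 * (high + low) in both halves; the carry out of the lower half lands
 * in the upper half, giving the end-around carry for free. */
static inline uint16_t fold_rot(uint32_t sum)
{
    uint32_t t = sum + ror16(sum);      /* wraps modulo 2^32 */
    return (uint16_t)(~t >> 16);        /* complement, keep upper half */
}

/* Reference: the usual two-step fold, for comparison. */
static inline uint16_t fold_ref(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16); /* may carry into bit 16 */
    sum = (sum & 0xffff) + (sum >> 16); /* absorb that carry */
    return (uint16_t)~sum;
}
```

Both functions should agree for every 32-bit input; the result of the
single add is read from the upper 16 bits rather than the lower ones.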
Since csum_tcpudp_magic() is just csum_tcpudp_nofold() followed by
csum_fold(), implement it using those two functions. ip_fast_csum()
also ends with its own fold, so use the real csum_fold() there too.
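As a rough sketch of that composition, assuming a simplified model that
ignores byte order and the kernel's real types: pseudo_nofold(),
fold16(), pseudo_magic() and add32_carry() below are hypothetical
stand-ins, not the kernel functions or their signatures.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit add with end-around carry. */
static inline uint32_t add32_carry(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s + (s < a);                 /* add back the carry-out */
}

/* Stand-in for the "nofold" step: accumulate the pseudo-header
 * fields into an unfolded 32-bit one's-complement sum. */
static inline uint32_t pseudo_nofold(uint32_t saddr, uint32_t daddr,
                                     uint16_t len, uint8_t proto,
                                     uint32_t sum)
{
    sum = add32_carry(sum, saddr);
    sum = add32_carry(sum, daddr);
    sum = add32_carry(sum, (uint32_t)proto + len);
    return sum;
}

/* Stand-in for the fold: add the halves via a 16-bit rotate and
 * return the complemented upper half. */
static inline uint16_t fold16(uint32_t sum)
{
    uint32_t t = sum + ((sum >> 16) | (sum << 16));
    return (uint16_t)(~t >> 16);
}

/* The "magic" variant is then just the two steps chained. */
static inline uint16_t pseudo_magic(uint32_t saddr, uint32_t daddr,
                                    uint16_t len, uint8_t proto,
                                    uint32_t sum)
{
    return fold16(pseudo_nofold(saddr, daddr, len, proto, sum));
}
```

Keeping the fold in one place means any improvement to it is picked up
by every caller instead of being duplicated.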
Boot tested on Versatile.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>