ARM: Optimize multi-CPU tlb flushing a little more
The compiler does not conditionalize the assembly instructions for
the tlb operations, which leads to sub-optimal code being generated
when building a kernel for multiple CPUs.
We can tweak things fairly simply as the code fragment below shows:
    17f8:   e3120001    tst     r2, #1  ; 0x1
    ...
    1800:   0a000000    beq     1808 <handle_pte_fault+0x194>
    1804:   ee061f10    mcr     15, 0, r1, cr6, cr0, {0}
    1808:   e3120004    tst     r2, #4  ; 0x4
    180c:   0a000000    beq     1814 <handle_pte_fault+0x1a0>
    1810:   ee081f36    mcr     15, 0, r1, cr8, cr6, {1}
becomes:
    17f0:   e3120001    tst     r2, #1  ; 0x1
    17f4:   1e063f10    mcrne   15, 0, r3, cr6, cr0, {0}
    17f8:   e3120004    tst     r2, #4  ; 0x4
    17fc:   1e083f36    mcrne   15, 0, r3, cr8, cr6, {1}
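The difference between the two listings is visible in the encodings themselves: bits 31:28 of an ARM (A32) instruction word hold its condition code, so the mcr words starting with 0xe always execute while the mcrne words starting with 0x1 execute only when the preceding tst left Z clear. A minimal sketch for decoding that field follows; the helper names arm_cond_name() and arm_is_conditional() are made up for illustration and are not part of this patch or the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.
 * Return the condition mnemonic held in bits 31:28 of an A32
 * instruction word.  0xe is "always"; 0xf is not a normal
 * condition, so it is reported separately.
 */
static const char *arm_cond_name(uint32_t insn)
{
	static const char *const names[15] = {
		"eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
		"hi", "ls", "ge", "lt", "gt", "le", "al"
	};
	uint32_t cond = insn >> 28;

	return cond < 15 ? names[cond] : "uncond";
}

/* Non-zero if the instruction executes only when a flag test passes. */
static int arm_is_conditional(uint32_t insn)
{
	return (insn >> 28) < 0xe;
}
```

Fed the words from the listings, arm_cond_name(0xee061f10) gives "al" for the original mcr and arm_cond_name(0x1e063f10) gives "ne" for its replacement, which is why the beq at 1800 is no longer needed.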
Overall, for Realview with V6 and V7 CPUs configured:
   text    data     bss     dec     hex filename
4153998  207340 5371036 9732374  948116 ../build/realview/vmlinux.before
4153366  207332 5371036 9731734  947e96 ../build/realview/vmlinux.after
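For reference, the saving can be worked out from the size figures above. The sketch below only redoes that arithmetic; struct size_row and the helper names are hypothetical, and the numbers are copied from the two vmlinux rows.

```c
/* Hypothetical container for one row of size(1) output. */
struct size_row {
	unsigned long text;
	unsigned long data;
	unsigned long bss;
};

/* Figures copied from the vmlinux.before and vmlinux.after rows. */
static const struct size_row size_before = { 4153998UL, 207340UL, 5371036UL };
static const struct size_row size_after  = { 4153366UL, 207332UL, 5371036UL };

/* The "dec" column is simply text + data + bss. */
static unsigned long size_total(struct size_row r)
{
	return r.text + r.data + r.bss;
}

/* Bytes saved overall between two builds. */
static unsigned long size_saved(struct size_row before, struct size_row after)
{
	return size_total(before) - size_total(after);
}
```

With these inputs size_saved(size_before, size_after) is 640 bytes: 632 bytes of text plus 8 bytes of data, with bss unchanged.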
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>