Compiler Tweaks

Started by lawrence, February 02, 2013, 04:01:57 am

February 02, 2013, 04:01:57 am Last Edit: February 02, 2013, 04:06:44 am by lawrence
I'm a believer in getting the basics going first, *then* moving on to tweaking for chip specifics.
I'm at that point now with my code, so I thought I'd give a few notes on some A10-specific optimizations that can be made while compiling.

The A10 includes support for the following:

  • ARM Cortex A8 architecture

  • Thumb-2 (smaller instruction encodings, so smaller compiled size)

  • Hard floating point - NEON, and VFPv3

A more detailed explanation of each of these features is here on the TI site  (as the AllWinner site is rather empty when it comes to info related to this) -

Hard (float) choices
So, how do we tell our compiler to optimize for some of these things?

Well, the first options we have are choosing between hard floating point and soft floating point.
Our CPU has hardware floating-point instructions, which are faster than software ones, and can be used.
Seems like a simple choice - hardware is faster than software implementations, so let's use that.

Bzzzzt.  Wrong.

The caveat is that these choices are not compatible, due to the ABI interfaces used in Linux - you use one or the other, but not both at the same time.

So, you either choose hard float or soft float, and stick to that throughout your kernel / user-space apps.

Luckily we have hard float compilers readily available, our kernel is hard float compatible, and most of the kernels produced by others seem to be using hard float ABIs, so it's an easy choice.

Our next choice is choosing *which* hard float set to use!  We have 2 sets of optimizations to choose from:
NEON and VFPv3.   Lucky or what!  I remember when you had to buy a math co-processor, and it cost crazy money.  Ahh, progress :)

So, how do we tell our compiler to use hard float for floating point stuff?

For VFP -

-mfloat-abi=hard -mfpu=vfpv3

For NEON -

-mfloat-abi=hard -mfpu=neon

Which do I choose?
Well, probably NEON, as it's a superset of VFP, but it depends on the math (you do)  ::)

More info on that here -

So, hard float using NEON looks like it's a no-brainer, but what about the other stuff?

Our next choice is quite easy - our CPU is a Cortex-A8, so we choose that as our compiler optimization for cpu.

If we include our previous optimizations, that gives us -

-mfloat-abi=hard -mfpu=neon -mcpu=cortex-a8 -mtune=cortex-a8

But wait, there's more!

We can tweak more.   "Safe(ish)" things to choose are ones like -O3 and -funroll-loops.

So, let's do that, and see what our final evil call looks like -

-mfloat-abi=hard  -mfpu=neon  -mcpu=cortex-a8 -mtune=cortex-a8 -O3 -funroll-loops

Great, now how do we integrate that with our compiles?

export CFLAGS='-mfloat-abi=hard  -mfpu=neon  -mcpu=cortex-a8 -mtune=cortex-a8 -O3 -funroll-loops'

Then ./configure / make / make install as usual.

If you want to be even scarier, there is... more.

export CFLAGS='-mfloat-abi=hard  -mfpu=neon  -mcpu=cortex-a8 -mtune=cortex-a8 -O3 -funroll-loops -ftree-vectorize -fassociative-math -funsafe-math-optimizations -Os'

We're getting into Gentoo linux territory here though (mild joke).

What do those new bits do?

-fassociative-math
Needed to enable auto-vectorization on ARM. Part of -funsafe-math-optimizations / -ffast-math / -Ofast.

-funsafe-math-optimizations
Needed to enable auto-vectorization for NEON (because it's not fully IEEE 754 compatible). Part of -ffast-math / -Ofast.

-Os
Optimizes for size - NAND/SD/caches should be the bottleneck.

-ftree-vectorize
Activates auto-vectorization, but should be kicked out - gives between zero and negligible performance gains with NEON (or overall... broken part of GCC or other compilers). Part of -O3 / -Ofast.

Do note that they are also known as the good old Segfault flags, as turning on compiler optimization leads to strange things (bugs...).  As I have troubleshot my way from issues all the way back to compiler bugs and gone grr, I often don't go that far unless I really need to.  YMMV though...

So, to recap:

(Below is a sliding scale of safeness vs Speed, in order)

Safe +-
-mfloat-abi=hard  -mfpu=neon  -mcpu=cortex-a8 -mtune=cortex-a8  -O3

Less Safe
-mfloat-abi=hard  -mfpu=neon  -mcpu=cortex-a8 -mtune=cortex-a8  -O3  -funroll-loops

May even work, but I wouldn't build a kernel with it  8)
-mfloat-abi=hard  -mfpu=neon  -mcpu=cortex-a8 -mtune=cortex-a8 -O3 -funroll-loops -ftree-vectorize -fassociative-math -funsafe-math-optimizations -Os

To use -
export CFLAGS=' <your choice from above> '


More references -

Speed runs for different options -

Testing bits you can use for .. testing


found more:

export CFLAGS="-mthumb -march=armv7-a"

& kernel config option
Kernel Features --->
  (*) Compile the kernel in Thumb-2 mode [CONFIG_THUMB2_KERNEL=y]


The linux-sunxi sources don't support Thumb-2. A kernel built with this doesn't boot.


Using NEON as the FP core without unsafe-math optimizations enabled is pretty pointless, as the GCC core will discount most operations due to the lack of IEEE compatibility.

NEVER compile your kernel with all the speedy options enabled. Programs, however, will do fine with them and might even get a decent speedup.

-fassociative-math with -funsafe-math-optimizations is pointless as the second enables the first by default.

-Os in combination with any other optimizations (-Os implies -O2 and some more) is not recommended, as it negates the things you want to do with -O3 and the rest. My recommendation is to avoid any size optimizations.
Unrolling loops with size optimizations is like using water to get more fire.

see for more fun.