ARM floating-point performance comparison across the product line: A9, A12, A15, A17, A53, etc.
I've been researching the relative performance of ARM parts, and have found it difficult to come up with comparisons, especially for the new A12 and A17 and the 64-bit A53 and A57.
What I've come up with so far, based on Linpack scores of different phones and tablets, is this:
- A9 = 32 MFLOPS/Core/GHz
- A15 = 100 MFLOPS/Core/GHz
- Apple A7 = 350 MFLOPS/Core/GHz (!)
So a couple of questions arise:
- Where do the A12, A17, A53 and A57 fit into this spectrum?
- Why is Apple's A7 floating point so good? Is it purely due to the 64-bit internals, or something else?
Those Linpack results are via Java. Below are my Linpack results via C and Java. Also see the following for lots more MFLOPS speeds.
Systems with a full set of results:

System  ARM     MHz    Android  Linpackv5  Linpackv7  LinpackSP  NEONLinpack  LinpackJava
                                   MFLOPS     MFLOPS     MFLOPS       MFLOPS       MFLOPS
T1      926EJ    800   2.2           5.63       5.67       9.61          N/A         2.33
T2      v7-A9    800   2.3.4        10.56     101.39     129.05       255.77        33.36
T4      v7-A9   1500a  4.0.3        16.86     155.52     204.61       382.46        56.89
T7      v7-A9   1300a  4.1.2        17.08     151.05     201.30       376.00        56.44
T11     v7-A15  2000b  4.2.2        28.82     459.17     803.04      1334.90       143.06
P11     v7-A9   1400   4.0.4        19.89     184.44     235.54       454.21        56.99

Systems with only one or two results (apart from the @G marker, the test each value belongs to is not identified here):

System  ARM     MHz    Android  MFLOPS
P4      v7-A8    800   2.3.5    80.18  28.34 @G
P5      v7-A9   1500   4.0.3    171.39  50.87 @G
T6      v7-A9   1600   4.0.3    196.47
T9      926EJ    800   2.2      5.66
T12     v7-A9   1600   4.1.2    147.07
T14     v7-A9   1500   4.0.4    180.95
P10     QU-S4   1500   4.0.3    254.90

Measured MHz: a = 1200, b = 1700
System: T = Tablet, P = Phone, E = Emulator
@G = GreeneComputing, QU = Qualcomm CPU
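To compare cores that run at different clock speeds, the results above can be divided by the measured clock. The following is a minimal sketch using three of the complete rows and the measured MHz footnotes (a = 1200, b = 1700); the function name `mflops_per_ghz` is only illustrative and is not part of any of the benchmarks.

```python
# Normalise Linpack results to MFLOPS per GHz using the measured clock.
# Values are copied from the results table; rows marked a/b use the
# measured MHz (a=1200, b=1700) rather than the nominal figure.

def mflops_per_ghz(mflops, mhz):
    """Return MFLOPS scaled to a 1 GHz clock."""
    return mflops / (mhz / 1000.0)

# (system, core, measured MHz, LinpackSP, NEONLinpack, LinpackJava)
rows = [
    ("T2",  "v7-A9",   800, 129.05,  255.77,  33.36),
    ("T4",  "v7-A9",  1200, 204.61,  382.46,  56.89),
    ("T11", "v7-A15", 1700, 803.04, 1334.90, 143.06),
]

for system, core, mhz, sp, neon, java in rows:
    print(f"{system:4} {core:7} "
          f"SP {mflops_per_ghz(sp, mhz):7.1f}  "
          f"NEON {mflops_per_ghz(neon, mhz):7.1f}  "
          f"Java {mflops_per_ghz(java, mhz):6.1f}  MFLOPS/GHz")
```

On these figures the Java result works out to roughly 42 MFLOPS/GHz for the 800 MHz A9 and roughly 84 MFLOPS/GHz for the A15, while the compiled single-precision and NEON results are several times higher on the same hardware.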
I'm not sure where you got those numbers, but they're not at all reflective of what the hardware is capable of. Cortex-A9, for example, can execute a double-precision flop every cycle; that's 1 GFLOPS/core/GHz. With a good implementation of LAPACK and BLAS, you can realize 700+ MFLOPS/core/GHz on Linpack. That's roughly 20x faster than the number you're reporting. Cortex-A15 and Apple A7 are much more powerful still.
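The arithmetic behind that comparison can be written out as a short sketch. It only uses the figures quoted in this thread (one flop per cycle, 700 MFLOPS/core/GHz tuned, 32 MFLOPS/core/GHz reported); the variable names are illustrative.

```python
# Peak double-precision rate implied by "one flop per cycle", and how the
# figures quoted in this thread compare with it.

FLOPS_PER_CYCLE = 1             # Cortex-A9 figure stated in the answer
CYCLES_PER_SEC_AT_1GHZ = 1e9

# Peak per core per GHz, expressed in MFLOPS.
peak_mflops = FLOPS_PER_CYCLE * CYCLES_PER_SEC_AT_1GHZ / 1e6

tuned = 700.0      # MFLOPS/core/GHz with tuned LAPACK/BLAS (from the answer)
reported = 32.0    # MFLOPS/core/GHz reported in the question

print(f"peak     : {peak_mflops:.0f} MFLOPS/core/GHz")
print(f"tuned    : {tuned / peak_mflops:.0%} of peak")
print(f"reported : {reported / peak_mflops:.1%} of peak")
print(f"tuned / reported: {tuned / reported:.1f}x")
```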
I expect that you're running some random "linpack benchmark" rather than an architecturally-tuned implementation. You're not measuring the FP capabilities of the hardware; you're measuring a pale shadow of that filtered through the quality of some random developer's code and the quality of the JIT and runtime or compiler that is used to compile/execute the program.
It is impossible to understand those comparative benchmark charts unless the CPU model and the actual working GHz are identified. There are also unidentified variables such as cache sizes and memory speed. For example, the Galaxy S4 can have Samsung Exynos (ARM Cortex cores) or Qualcomm Snapdragon CPUs, different numbers of cores and different GHz specs (that they might not run at). Then the source of the benchmarks should be identified. Even if they were compiled from the same C code, you cannot always trust the compilation provided or tuned by a particular supplier.
A case in point is the Linpack benchmark, which I know very well. My original C version for PCs was accepted by Jack Dongarra, author of the original benchmark, and is available at Netlib (and from my site). There are two basic versions (plus others for parallel processors). Except for an MP version, all my initial Linpack benchmarks used the fixed method of Linpack 1, with a matrix of order 100. The second version solves a system of equations of order 1000, with no restriction on the method or its implementation. This can demonstrate the advantage of particular architectures or clever mathematicians and can normally reflect MP performance.
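As a rough illustration of how a Linpack-style MFLOPS figure is derived, the sketch below solves a dense system of order N by Gaussian elimination with partial pivoting and divides the benchmark's nominal operation count, 2N³/3 + 2N², by the elapsed time. It is not the C benchmark described above: the function names (`solve`, `linpack_flops`, `run`) and the random test matrix are assumptions made for this example. Being pure Python, it mostly measures the interpreter rather than the floating-point hardware, which is exactly the kind of distortion discussed in this thread.

```python
import random
import time

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    The matrix a and vector b are overwritten during elimination.
    """
    n = len(b)
    for k in range(n - 1):
        # Pick the row with the largest pivot in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            m = row_i[k] / row_k[k]
            for j in range(k, n):
                row_i[j] -= m * row_k[j]
            b[i] -= m * b[k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x

def linpack_flops(n):
    """Nominal operation count used by the Linpack benchmark."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def run(n=100, seed=1):
    """Time one solve of order n; return (MFLOPS, max error in x)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    # Right-hand side chosen so that the exact solution is all ones.
    b = [sum(row) for row in a]
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    mflops = linpack_flops(n) / elapsed / 1e6
    max_err = max(abs(xi - 1.0) for xi in x)
    return mflops, max_err

if __name__ == "__main__":
    mflops, max_err = run(100)
    print(f"N=100: {mflops:.2f} MFLOPS, max error {max_err:.2e}")
```

The same operation count applied to a tuned BLAS-based solver, a plain C loop and a Java or Python loop gives very different MFLOPS on identical hardware, so the number is only comparable between runs of the same code and the same N.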
The favourite version of Linpack for Android/Java is from GreeneComputing. According to the numeric results, this is double precision, probably with N=500. I am suspicious about the MP version. It produces different numeric results (Norm Res) from the single-CPU version, both can indicate “Precision inconsistent results”, and it crashes my tablet with a Cortex-A15 CPU.
I have single-CPU and multithreaded Linpack versions, all using the Linpack 1 fixed method, which can now run from N=100 upwards. These can be slower as N increases because the data comes from RAM instead of cache. The MP versions are pathetically slow but at least produce the same Norm Res value.
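The cache effect follows from how quickly the matrix grows with N. A minimal sketch of the storage needed for the N x N matrix alone, assuming 8-byte double-precision elements (the helper name `matrix_bytes` is illustrative):

```python
def matrix_bytes(n, bytes_per_element=8):
    """Storage for an n x n matrix (8 bytes per element = double precision)."""
    return n * n * bytes_per_element

for n in (100, 500, 1000):
    kib = matrix_bytes(n) / 1024
    print(f"N={n:5d}: {matrix_bytes(n):9d} bytes ({kib:8.1f} KiB)")
```

The matrix needs 80,000 bytes at N=100 but 8,000,000 bytes at N=1000. Whether a given size still fits in cache depends on the particular chip, which is one more reason the cache sizes need to be known when comparing results.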
The Apple version has N as an input variable. One set of results I found indicated 118 MFLOPS at N=100, 737 MFLOPS at N=500 and 918 MFLOPS at N=1000. The N=500 figure is probably the one quoted, but it is not appropriate for comparing with my results.
One reason why the A7 is faster than earlier Apple cores is that double-precision NEON instructions are provided. My NEON result is for single precision.