Optimizing mathematics on arrays of floats in Ada 95 with GNAT
Consider the bellow code. This code is supposed to be processing data at a fixed rate, in one second batches, It is part of an overal system and can't take up too much time.
When running over 100 lots of 1 seconds worth of data the program takes 35 seconds (or 35%), executing this function in a loop. The test loop is timed specifically with Ada.RealTime. The data is pregenerated so the majority of the execution time is definatetly in this loop.
How do I improce the code to get the processing time down to a minimum?
The code will be running on an Intel Pentium-M which is a P3 with SSE2.
package FF is new Ada.Numerics.Generic_Elementary_Functions(Float); N : constant Integer := 820; type A is array(1 .. N) of Float; type A3 is array(1 .. 3) of A; procedure F(state : in out A3; result : out A3; l : in A; r : in A) is s : Float; t : Float; begin for i in 1 .. N loop t := l(i) + r(i); t := t / 2.0; state(1)(i) := t; state(2)(i) := t * 0.25 + state(2)(i) * 0.75; state(3)(i) := t * 1.0 /64.0 + state(2)(i) * 63.0 /64.0; for r in 1 .. 3 loop s := state(r)(i); t := FF."**"(s, 6.0) + 14.0; if t > MAX then t := MAX; elsif t < MIN then t := MIN; end if; result(r)(i) := FF.Log(t, 2.0); end loop; end loop; end;
psuedocode for testing
create two arrays of 80 random A3 arrays, called ls and rs; init the state and result A3 array record the realtime time now, called last for i in 1 .. 100 loop for j in 1 .. 80 loop F(state, result, ls(j), rs(j)); end loop; end loop; record the realtime time now, called curr output the duration between curr and last
I'll back up Marc C here (he generally knows his stuff). I've used gprof with Gnat before. It can be tough to setup, but it works like a champ. If you like, you can use it to get a % of your runtime used by every line of code above.
I could make some suggestions (like precalculating 63.0/64.0) but a good optimizer should already be doing most of them. You need to figure out what it is not doing in the particularly CPU-consuming lines, and speed up that.
Looking over the code, I'm guessing the profiler will show you that the exponentiation and log operations are chewing up the lion's share of the time. If you could find a way to precalcuate some of that stuff that might help. That's getting ahead of myself though. Profile!
First let me try to correct my answer:
That should be FF."****"(s, 6.0), and s ** 6, (not FF."*"(s, 6.0) and s * 6) in my answer. [That's odd .. the editor is still trying to remove *'s from my text.]
I just checked the source code Marc C pointed to .. by gad, it does do s ** 6!
I'll add only that I expect that some improvement would come from doing s**6 yourself, using s2 := s*s, and s_to_the_6 := s2 * s2 * s2; -- Jonathan
It might be faster to replace the t := FF."**"(s, 6.0) + 14.0; with t := s ** 6 + 14.0; The floating point exponentiation is probably done with log's and exp's. -- Jonathan
The following code improvements got the run-time down to 8 seconds.
Then, adding the command line options -O3 and -mtune=pentium-m and -msse2 got that runtime down to 0.8 seconds.
My suspicions are:
- Log(x, y) is implemented as Log(x) / Log(y) which is expencsive if y is constant. (7sec)
- Power(x, y) is implemented as a loop; and conditionals and jumps are expensive. (20sec)
procedure "**" may be ...
r := x; for i in 2 .. y loop; r := r * x; end loop; return r;
package FF is new Ada.Numerics.Generic_Elementary_Functions(Float); N : constant Integer := 820; type A is array(1 .. N) of Float; type A3 is array(1 .. 3) of A; procedure F(state : in out A3; result : out A3; l : in A; r : in A) is -- Keep the Log of 2 so it is not recalculated ONE_ON_LOG_TWO : constant Float := 1 / FF.Log(2.0); M1 : constant Float := 1.0 / 64.0; M2 : constant Float := 63.0 / 64.0; s : Float; t : Float; begin for i in 1 .. N loop t := l(i) + r(i); -- Multiply Not Divide t := t * 0.5; state(1)(i) := t; state(2)(i) := t * 0.25 + state(2)(i) * 0.75; state(3)(i) := t * M1 + state(2)(i) * M2; for r in 1 .. 3 loop s := state(r)(i); -- Since we know the power hared code the multiply. t := s * s * s * s * s * s + 14.0; if t > MAX then t := MAX; elsif t < MIN then t := MIN; end if; -- Don't use Log(x,y) in a loop when y is constant. ' result(r)(i) := FF.Log(t) * ONE_ON_LOG_TWO; end loop; end loop; end;
You might want to look into making
type A3 is array(1 .. N, 1 .. 3) of Float;
That way, each operation in the outer loop will access neighbouring memory locations and you should get much better support from the cache:
state(i)(1) := t; state(i)(2) := t * 0.25 + state(i)(2) * 0.75; state(i)(3) := t * M1 + state(i)(2) * M2;
Using a rename as
cur_state : array (1..3) of Float renames state(i);
cur_state := (t, t * 0.25 + cur_state(2) * 0.75, t * M1 + cur_state(2) * M2)
might give the compiler some more hints for optimisations.