# Math optimization in C#

I've been profiling an application all day long and, having optimized a couple bits of code, I'm left with this on my todo list. It's the activation function for a neural network, which gets called over a 100 million times. According to dotTrace, it amounts to about 60% of the overall function time.

How would you optimize this?

public static float Sigmoid(double value) { return (float) (1.0 / (1.0 + Math.Pow(Math.E, -value))); }

## Answers

Try:

public static float Sigmoid(double value) { return 1.0f / (1.0f + (float) Math.Exp(-value)); }

**EDIT:** I did a quick benchmark. On my machine, the above code is about 43% faster than your method, and this mathematically-equivalent code is the teeniest bit faster (46% faster than the original):

public static float Sigmoid(double value) { float k = Math.Exp(value); return k / (1.0f + k); }

**EDIT 2:** I'm not sure how much overhead C# functions have, but if you #include <math.h> in your source code, you should be able to use this, which uses a float-exp function. It might be a little faster.

public static float Sigmoid(double value) { float k = expf((float) value); return k / (1.0f + k); }

Also if you're doing millions of calls, the function-calling overhead might be a problem. Try making an inline function and see if that's any help.

If it's for an activation function, does it matter terribly much if the calculation of e^x is completely accurate?

For example, if you use the approximation (1+x/256)^256, on my Pentium testing in Java (I'm assuming C# essentially compiles to the same processor instructions) this is about 7-8 times faster than e^x (Math.exp()), and is accurate to 2 decimal places up to about x of +/-1.5, and within the correct order of magnitude across the range you stated. (Obviously, to raise to the 256, you actually square the number 8 times -- don't use Math.Pow for this!) In Java:

double eapprox = (1d + x / 256d); eapprox *= eapprox; eapprox *= eapprox; eapprox *= eapprox; eapprox *= eapprox; eapprox *= eapprox; eapprox *= eapprox; eapprox *= eapprox; eapprox *= eapprox;

Keep doubling or halving 256 (and adding/removing a multiplication) depending on how accurate you want the approximation to be. Even with n=4, it still gives about 1.5 decimal places of accuracy for values of x beween -0.5 and 0.5 (and appears a good 15 times faster than Math.exp()).

P.S. I forgot to mention -- you should obviously not *really* divide by 256: multiply by a constant 1/256. Java's JIT compiler makes this optimisation automatically (at least, Hotspot does), and I was assuming that C# must do too.

Have a look at this post. it has an approximation for e^x written in Java, this should be the C# code for it (untested):

public static double Exp(double val) { long tmp = (long) (1512775 * val + 1072632447); return BitConverter.Int64BitsToDouble(tmp << 32); }

In my benchmarks this is more than **5 times faster than Math.exp()** (in Java). The approximation is based on the paper "A Fast, Compact Approximation of the Exponential Function" which was developed exactly to be used in neural nets. It is basically the same as a lookup table of 2048 entries and linear approximation between the entries, but all this with IEEE floating point tricks.

**EDIT:** According to Special Sauce this is ~3.25x faster than the CLR implementation. Thanks!

- Remember, that
**any changes in this activation function come at cost of different behavior**. This even includes switching to float (and thus lowering the precision) or using activation substitutes. Only experimenting with your use case will show the right way. - In addition to the simple code optimizations, I would also recommend to consider
**parallelization of the computations**(i.e.: to leverage multiple cores of your machine or even machines at the Windows Azure Clouds) and improving the training algorithms.

**UPDATE:** Post on lookup tables for ANN activation functions

**UPDATE2:** I removed the point on LUTs since I've confused these with the complete hashing. Thanks go to Henrik Gustafsson for putting me back on the track. So the memory is not an issue, although the search space still gets messed up with local extrema a bit.

At 100 million calls, i'd start to wonder if profiler overhead isn't skewing your results. Replace the calculation with a no-op and see if it is *still* reported to consume 60% of the execution time...

Or better yet, create some test data and use a stopwatch timer to profile a million or so calls.

If you're able to interop with C++, you could consider storing all the values in an array and loop over them using SSE like this:

void sigmoid_sse(float *a_Values, float *a_Output, size_t a_Size){ __m128* l_Output = (__m128*)a_Output; __m128* l_Start = (__m128*)a_Values; __m128* l_End = (__m128*)(a_Values + a_Size); const __m128 l_One = _mm_set_ps1(1.f); const __m128 l_Half = _mm_set_ps1(1.f / 2.f); const __m128 l_OneOver6 = _mm_set_ps1(1.f / 6.f); const __m128 l_OneOver24 = _mm_set_ps1(1.f / 24.f); const __m128 l_OneOver120 = _mm_set_ps1(1.f / 120.f); const __m128 l_OneOver720 = _mm_set_ps1(1.f / 720.f); const __m128 l_MinOne = _mm_set_ps1(-1.f); for(__m128 *i = l_Start; i < l_End; i++){ // 1.0 / (1.0 + Math.Pow(Math.E, -value)) // 1.0 / (1.0 + Math.Exp(-value)) // value = *i so we need -value __m128 value = _mm_mul_ps(l_MinOne, *i); // exp expressed as inifite series 1 + x + (x ^ 2 / 2!) + (x ^ 3 / 3!) ... __m128 x = value; // result in l_Exp __m128 l_Exp = l_One; // = 1 l_Exp = _mm_add_ps(l_Exp, x); // += x x = _mm_mul_ps(x, x); // = x ^ 2 l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_Half, x)); // += (x ^ 2 * (1 / 2)) x = _mm_mul_ps(value, x); // = x ^ 3 l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver6, x)); // += (x ^ 3 * (1 / 6)) x = _mm_mul_ps(value, x); // = x ^ 4 l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver24, x)); // += (x ^ 4 * (1 / 24)) #ifdef MORE_ACCURATE x = _mm_mul_ps(value, x); // = x ^ 5 l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver120, x)); // += (x ^ 5 * (1 / 120)) x = _mm_mul_ps(value, x); // = x ^ 6 l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver720, x)); // += (x ^ 6 * (1 / 720)) #endif // we've calculated exp of -i // now we only need to do the '1.0 / (1.0 + ...' part *l_Output++ = _mm_rcp_ps(_mm_add_ps(l_One, l_Exp)); } }

However, remember that the arrays you'll be using should be allocated using _aligned_malloc(some_size * sizeof(float), 16) because SSE requires memory aligned to a boundary.

Using SSE, I can calculate the result for all 100 million elements in around half a second. However, allocating that much memory at a time will cost you nearly two-third of a gigabyte so I'd suggest processing more but smaller arrays at a time. You might even want to consider using a double buffering approach with 100K elements or more.

Also, if the number of elements starts to grow considerably you might want to choose to process these things on the GPU (just create a 1D float4 texture and run a very trivial fragment shader).

FWIW, here's my C# benchmarks for the answers already posted. (Empty is a function that just returns 0, to measure the function call overhead)

Empty Function: 79ms 0 Original: 1576ms 0.7202294 Simplified: (soprano) 681ms 0.7202294 Approximate: (Neil) 441ms 0.7198783 Bit Manip: (martinus) 836ms 0.72318 Taylor: (Rex Logan) 261ms 0.7202305 Lookup: (Henrik) 182ms 0.7204863

public static object[] Time(Func<double, float> f) { var testvalue = 0.9456; var sw = new Stopwatch(); sw.Start(); for (int i = 0; i < 1e7; i++) f(testvalue); return new object[] { sw.ElapsedMilliseconds, f(testvalue) }; } public static void Main(string[] args) { Console.WriteLine("Empty: {0,10}ms {1}", Time(Empty)); Console.WriteLine("Original: {0,10}ms {1}", Time(Original)); Console.WriteLine("Simplified: {0,10}ms {1}", Time(Simplified)); Console.WriteLine("Approximate: {0,10}ms {1}", Time(ExpApproximation)); Console.WriteLine("Bit Manip: {0,10}ms {1}", Time(BitBashing)); Console.WriteLine("Taylor: {0,10}ms {1}", Time(TaylorExpansion)); Console.WriteLine("Lookup: {0,10}ms {1}", Time(LUT)); }

Off the top of my head, this paper explains a way of approximating the exponential by abusing floating point, (click the link in the top right for PDF) but I don't know if it'll be of much use to you in .NET.

Also, another point: for the purpose of training large networks quickly, the logistic sigmoid you're using is pretty terrible. See section 4.4 of Efficient Backprop by LeCun et al and use something zero-centered (actually, read that whole paper, it's immensely useful).

**F# Has Better Performance than C# in .NET math algorithms.** So rewriting neural network in F# might improve the overall performance.

If we re-implement LUT benchmarking snippet (I've been using slightly tweaked version) in F#, then the resulting code:

- executes sigmoid1 benchmark in
**588.8ms instead of 3899,2ms** - executes sigmoid2 (LUT) benchmark in
**156.6ms instead of 411.4 ms**

More details could be found in the blog post. Here's the F# snippet JIC:

#light let Scale = 320.0f; let Resolution = 2047; let Min = -single(Resolution)/Scale; let Max = single(Resolution)/Scale; let range step a b = let count = int((b-a)/step); seq { for i in 0 .. count -> single(i)*step + a }; let lut = [| for x in 0 .. Resolution -> single(1.0/(1.0 + exp(-double(x)/double(Scale)))) |] let sigmoid1 value = 1.0f/(1.0f + exp(-value)); let sigmoid2 v = if (v <= Min) then 0.0f; elif (v>= Max) then 1.0f; else let f = v * Scale; if (v>0.0f) then lut.[int (f + 0.5f)] else 1.0f - lut.[int(0.5f - f)]; let getError f = let test = range 0.00001f -10.0f 10.0f; let errors = seq { for v in test -> abs(sigmoid1(single(v)) - f(single(v))) } Seq.max errors; open System.Diagnostics; let test f = let sw = Stopwatch.StartNew(); let mutable m = 0.0f; let result = for t in 1 .. 10 do for x in 1 .. 1000000 do m <- f(single(x)/100000.0f-5.0f); sw.Elapsed.TotalMilliseconds; printf "Max deviation is %f\n" (getError sigmoid2) printf "10^7 iterations using sigmoid1: %f ms\n" (test sigmoid1) printf "10^7 iterations using sigmoid2: %f ms\n" (test sigmoid2) let c = System.Console.ReadKey(true);

And the output (Release compilation against F# 1.9.6.2 CTP with no debugger):

Max deviation is 0.001664 10^7 iterations using sigmoid1: 588.843700 ms 10^7 iterations using sigmoid2: 156.626700 ms

**UPDATE:** updated benchmarking to use 10^7 iterations to make results comparable with C

**UPDATE2:** here are the performance results of the C implementation from the same machine to compare with:

Max deviation is 0.001664 10^7 iterations using sigmoid1: 628 ms 10^7 iterations using sigmoid2: 157 ms

**Note:** This is a follow-up to this post.

**Edit:** Update to calculate the same thing as this and this, taking some inspiration from this.

Now look what you made me do! You made me install Mono!

$ gmcs -optimize test.cs && mono test.exe Max deviation is 0.001663983 10^7 iterations using Sigmoid1() took 1646.613 ms 10^7 iterations using Sigmoid2() took 237.352 ms

C is hardly worth the effort anymore, the world is moving forward :)

So, just over a factor 10 6 faster. Someone with a windows box gets to investigate the memory usage and performance using MS-stuff :)

Using LUTs for activation functions is not so uncommon, especielly when implemented in hardware. There are many well proven variants of the concept out there if you are willing to include those types of tables. However, as have already been pointed out, aliasing might turn out to be a problem, but there are ways around that too. Some further reading:

- NEURObjects by Giorgio Valentini (there's also a paper on this)
- Neural networks with digital LUT activation functions
- Boosting neural network feature extraction by reduced accuracy activation functions
- A New Learning Algorithm for Neural Networks with Integer Weights and Quantized Non-linear Activation Functions
- The effects of quantization on high order function neural networks

Some gotchas with this:

- The error goes up when you reach outside the table (but converges to 0 at the extremes); for x approx +-7.0. This is due to the chosen scaling factor. Larger values of SCALE give higher errors in the middle range, but smaller at the edges.
- This is generally a very stupid test, and I don't know C#, It's just a plain conversion of my C-code :)
- Rinat Abdullin is very much correct that aliasing and precision loss might cause problems, but since I have not seen the variables for that I can only advice you to try this. In fact, I agree with everything he says except for the issue of lookup tables.

Pardon the copy-paste coding...

using System; using System.Diagnostics; class LUTTest { private const float SCALE = 320.0f; private const int RESOLUTION = 2047; private const float MIN = -RESOLUTION / SCALE; private const float MAX = RESOLUTION / SCALE; private static readonly float[] lut = InitLUT(); private static float[] InitLUT() { var lut = new float[RESOLUTION + 1]; for (int i = 0; i < RESOLUTION + 1; i++) { lut[i] = (float)(1.0 / (1.0 + Math.Exp(-i / SCALE))); } return lut; } public static float Sigmoid1(double value) { return (float) (1.0 / (1.0 + Math.Exp(-value))); } public static float Sigmoid2(float value) { if (value <= MIN) return 0.0f; if (value >= MAX) return 1.0f; if (value >= 0) return lut[(int)(value * SCALE + 0.5f)]; return 1.0f - lut[(int)(-value * SCALE + 0.5f)]; } public static float error(float v0, float v1) { return Math.Abs(v1 - v0); } public static float TestError() { float emax = 0.0f; for (float x = -10.0f; x < 10.0f; x+= 0.00001f) { float v0 = Sigmoid1(x); float v1 = Sigmoid2(x); float e = error(v0, v1); if (e > emax) emax = e; } return emax; } public static double TestPerformancePlain() { Stopwatch sw = new Stopwatch(); sw.Start(); for (int i = 0; i < 10; i++) { for (float x = -5.0f; x < 5.0f; x+= 0.00001f) { Sigmoid1(x); } } sw.Stop(); return sw.Elapsed.TotalMilliseconds; } public static double TestPerformanceLUT() { Stopwatch sw = new Stopwatch(); sw.Start(); for (int i = 0; i < 10; i++) { for (float x = -5.0f; x < 5.0f; x+= 0.00001f) { Sigmoid2(x); } } sw.Stop(); return sw.Elapsed.TotalMilliseconds; } static void Main() { Console.WriteLine("Max deviation is {0}", TestError()); Console.WriteLine("10^7 iterations using Sigmoid1() took {0} ms", TestPerformancePlain()); Console.WriteLine("10^7 iterations using Sigmoid2() took {0} ms", TestPerformanceLUT()); } }

First thought: How about some stats on the values variable?

- Are the values of "value" typically small -10 <= value <= 10?

If not, you can probably get a boost by testing for out of bounds values

if(value < -10) return 0; if(value > 10) return 1;

- Are the values repeated often?

If so, you can probably get some benefit from Memoization (probably not, but it doesn't hurt to check....)

if(sigmoidCache.containsKey(value)) return sigmoidCache.get(value);

If neither of these can be applied, then as some others have suggested, maybe you can get away with lowering the accuracy of your sigmoid...

Soprano had some nice optimizations your call:

public static float Sigmoid(double value) { float k = Math.Exp(value); return k / (1.0f + k); }

If you try a lookup table and find it uses too much memory you could always looking at the value of your parameter for each successive calls and employing some caching technique.

For example try caching the last value and result. If the next call has the same value as the previous one, you don't need to calculate it as you'd have cached the last result. If the current call was the same as the previous call even 1 out of a 100 times, you could potentially save yourself 1 million calculations.

Or, you may find that within 10 successive calls, the value parameter is on average the same 2 times, so you could then try caching the last 10 values/answers.

Idea: Perhaps you can make a (large) lookup table with the values pre-calculated?

This is slightly off topic, but just out of curiosity, I did the same implementation as the one in C, C# and F# in Java. I'll just leave this here in case someone else is curious.

Result:

$ javac LUTTest.java && java LUTTest Max deviation is 0.001664 10^7 iterations using sigmoid1() took 1398 ms 10^7 iterations using sigmoid2() took 177 ms

I suppose the improvement over C# in my case is due to Java being better optimized than Mono for OS X. On a similar MS .NET-implementation (vs. Java 6 if someone wants to post comparative numbers) I suppose the results would be different.

Code:

public class LUTTest { private static final float SCALE = 320.0f; private static final int RESOLUTION = 2047; private static final float MIN = -RESOLUTION / SCALE; private static final float MAX = RESOLUTION / SCALE; private static final float[] lut = initLUT(); private static float[] initLUT() { float[] lut = new float[RESOLUTION + 1]; for (int i = 0; i < RESOLUTION + 1; i++) { lut[i] = (float)(1.0 / (1.0 + Math.exp(-i / SCALE))); } return lut; } public static float sigmoid1(double value) { return (float) (1.0 / (1.0 + Math.exp(-value))); } public static float sigmoid2(float value) { if (value <= MIN) return 0.0f; if (value >= MAX) return 1.0f; if (value >= 0) return lut[(int)(value * SCALE + 0.5f)]; return 1.0f - lut[(int)(-value * SCALE + 0.5f)]; } public static float error(float v0, float v1) { return Math.abs(v1 - v0); } public static float testError() { float emax = 0.0f; for (float x = -10.0f; x < 10.0f; x+= 0.00001f) { float v0 = sigmoid1(x); float v1 = sigmoid2(x); float e = error(v0, v1); if (e > emax) emax = e; } return emax; } public static long sigmoid1Perf() { float y = 0.0f; long t0 = System.currentTimeMillis(); for (int i = 0; i < 10; i++) { for (float x = -5.0f; x < 5.0f; x+= 0.00001f) { y = sigmoid1(x); } } long t1 = System.currentTimeMillis(); System.out.printf("",y); return t1 - t0; } public static long sigmoid2Perf() { float y = 0.0f; long t0 = System.currentTimeMillis(); for (int i = 0; i < 10; i++) { for (float x = -5.0f; x < 5.0f; x+= 0.00001f) { y = sigmoid2(x); } } long t1 = System.currentTimeMillis(); System.out.printf("",y); return t1 - t0; } public static void main(String[] args) { System.out.printf("Max deviation is %f\n", testError()); System.out.printf("10^7 iterations using sigmoid1() took %d ms\n", sigmoid1Perf()); System.out.printf("10^7 iterations using sigmoid2() took %d ms\n", sigmoid2Perf()); } }

I realize that it has been a year since this question popped up, but I ran across it because of the discussion of F# and C performance relative to C#. I played with some of the samples from other responders and discovered that delegates appear to execute faster than a regular method invocation but there is no apparent peformance advantage to F# over C#.

- C: 166ms
- C# (delegate): 275ms
- C# (method): 431ms
- C# (method, float counter): 2,656ms
- F#: 404ms

The C# with a float counter was a straight port of the C code. It is much faster to use an int in the for loop.

You might also consider experimenting with alternative activation functions which are cheaper to evaluate. For example:

f(x) = (3x - x**3)/2

(which could be factored as

f(x) = x*(3 - x*x)/2

for one less multiplication). This function has odd symmetry, and its derivative is trivial. Using it for a neural network requires normalizing the sum-of-inputs by dividing by the total number of inputs (limiting the domain to [-1..1], which is also range).

A mild variation on Soprano's theme:

public static float Sigmoid(double value) { float v = value; float k = Math.Exp(v); return k / (1.0f + k); }

Since you're only after a single precision result, why make the Math.Exp function calculate a double? Any exponent calculator that uses an iterative summation (see the expansion of the ex) will take longer for more precision, each and every time. And double is twice the work of single! This way, you convert to single first, *then* do your exponential.

But the expf function should be faster still. I don't see the need for soprano's (float) cast in passing to expf though, unless C# doesn't do implicit float-double conversion.

Otherwise, just use a *real* language, like FORTRAN...

There are a lot of good answers here. I would suggest running it through this technique, just to make sure

- You're not calling it any more times than you need to. (Sometimes functions get called more than necessary, just because they are so easy to call.)
- You're not calling it repeatedly with the same arguments (where you could use memoization)

BTW the function you have is the inverse logit function, or the inverse of the log-odds-ratio function log(f/(1-f)).

(Updated with performance measurements)(Updated again with real results :)

I think a lookup table solution would get you very far when it comes to performance, at a negligible memory and precision cost.

The following snippet is an example implementation in C (I don't speak c# fluently enough to dry-code it). It runs and performs well enough, but I'm sure there's a bug in it :)

#include <math.h> #include <stdio.h> #include <time.h> #define SCALE 320.0f #define RESOLUTION 2047 #define MIN -RESOLUTION / SCALE #define MAX RESOLUTION / SCALE static float sigmoid_lut[RESOLUTION + 1]; void init_sigmoid_lut(void) { int i; for (i = 0; i < RESOLUTION + 1; i++) { sigmoid_lut[i] = (1.0 / (1.0 + exp(-i / SCALE))); } } static float sigmoid1(const float value) { return (1.0f / (1.0f + expf(-value))); } static float sigmoid2(const float value) { if (value <= MIN) return 0.0f; if (value >= MAX) return 1.0f; if (value >= 0) return sigmoid_lut[(int)(value * SCALE + 0.5f)]; return 1.0f-sigmoid_lut[(int)(-value * SCALE + 0.5f)]; } float test_error() { float x; float emax = 0.0; for (x = -10.0f; x < 10.0f; x+=0.00001f) { float v0 = sigmoid1(x); float v1 = sigmoid2(x); float error = fabsf(v1 - v0); if (error > emax) { emax = error; } } return emax; } int sigmoid1_perf() { clock_t t0, t1; int i; float x, y = 0.0f; t0 = clock(); for (i = 0; i < 10; i++) { for (x = -5.0f; x <= 5.0f; x+=0.00001f) { y = sigmoid1(x); } } t1 = clock(); printf("", y); /* To avoid sigmoidX() calls being optimized away */ return (t1 - t0) / (CLOCKS_PER_SEC / 1000); } int sigmoid2_perf() { clock_t t0, t1; int i; float x, y = 0.0f; t0 = clock(); for (i = 0; i < 10; i++) { for (x = -5.0f; x <= 5.0f; x+=0.00001f) { y = sigmoid2(x); } } t1 = clock(); printf("", y); /* To avoid sigmoidX() calls being optimized away */ return (t1 - t0) / (CLOCKS_PER_SEC / 1000); } int main(void) { init_sigmoid_lut(); printf("Max deviation is %0.6f\n", test_error()); printf("10^7 iterations using sigmoid1: %d ms\n", sigmoid1_perf()); printf("10^7 iterations using sigmoid2: %d ms\n", sigmoid2_perf()); return 0; }

Previous results were due to the optimizer doing its job and optimized away the calculations. Making it actually execute the code yields slightly different and much more interesting results (on my way slow MB Air):

$ gcc -O2 test.c -o test && ./test Max deviation is 0.001664 10^7 iterations using sigmoid1: 571 ms 10^7 iterations using sigmoid2: 113 ms

TODO:

There are things to improve and ways to remove weaknesses; how to do is is left as an exercise to the reader :)

- Tune the range of the function to avoid the jump where the table starts and ends.
- Add a slight noise function to hide the aliasing artifacts.
- As Rex said, interpolation could get you quite a bit further precision-wise while being rather cheap performance-wise.

There are a much faster functions that do very similar things:

x / (1 + abs(x)) – fast replacement for TAHN

And similarly:

x / (2 + 2 * abs(x)) + 0.5 - fast replacement for SIGMOID

Compare plots with actual sigmoid

Doing a Google search, I found an alternative implementation of the Sigmoid function.

public double Sigmoid(double x) { return 2 / (1 + Math.Exp(-2 * x)) - 1; }

Is that correct for your needs? Is it faster?

http://dynamicnotions.blogspot.com/2008/09/sigmoid-function-in-c.html

1) Do you call this from only one place? If so, you may gain a small amount of performance by moving the code out of that function and just putting it right where you would normally have called the Sigmoid function. I don't like this idea in terms of code readability and organization but when you need to get every last performance gain, this might help because I think function calls require a push/pop of registers on the stack, which could be avoided if your code was all inline.

2) I have no idea if this might help but try making your function parameter a ref parameter. See if it's faster. I would have suggested making it const (which would have been an optimization if this were in c++) but c# doesn't support const parameters.

If you need a giant speed boost, you could probably look into parallelizing the function using the (ge)force. IOW, use DirectX to control the graphics card into doing it for you. I have no idea how to do this, but I've seen people use graphics cards for all kinds of calculations.

I've seen that a lot of people around here are trying to use approximation to make Sigmoid faster. However, it is important to know that Sigmoid can also be expressed using tanh, not only exp. Calculating Sigmoid this way is around 5 times faster than with exponential, and by using this method you are not approximating anything, thus the original behaviour of Sigmoid is kept as-is.

public static double Sigmoid(double value) { return 0.5d + 0.5d * Math.Tanh(value/2); }

Of course, parellization would be the next step to performance improvement, but as far as the raw calculation is concerned, using Math.Tanh is faster than Math.Exp.