The exp() function is very common in a variety of problems, including machine learning. I maintain the 3B microscopy software, and that makes heavy use of the forward algorithm. Within this algorithm one has to sum probability densities to do a marginalisation. Since these tend to get very large or very small, they are stored as logs. To sum them up, one does:

\[ \log \sum_i e^{z_i} = Z + \log \sum_i e^{z_i - Z} \]

Algebraically, Z is arbitrary and has no effect. Numerically it's convenient to set \(Z = \max_i z_i\): then the largest exponent is always 0 and the rest are smaller.
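As a concrete illustration of that identity, here's a minimal log-sum-exp sketch in C++ (the function name and setup are mine, not code from 3B):

#include <algorithm>
#include <cmath>
#include <vector>

// Sum probabilities stored as logs. Subtracting Z = max(z) before
// exponentiating keeps the largest term at exp(0) = 1, so nothing
// overflows, and the tiny terms harmlessly underflow towards 0.
double log_sum_exp(const std::vector<double>& z)
{
    double Z = *std::max_element(z.begin(), z.end());
    double sum = 0;
    for (double zi : z)
        sum += std::exp(zi - Z);   // every argument is <= 0
    return Z + std::log(sum);
}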
Currently the code uses the standard library exp(). This is a beautifully engineered function which gives full precision over the complete range of inputs and deals gracefully with values out of range and special values like Inf and NaN. It’s definitely the only good choice for general purpose use, but it’s also slow and is the bottleneck in my code.
For 3B, I don't need those properties. Firstly, I never have NaN in the code. Secondly, the input numbers are always zero or smaller. Thirdly, I'm always adding exp() of them to 1, which means the result only needs good accuracy relative to 1, so exp(-large number) can always be approximated as zero. Finally, it's a Monte Carlo algorithm working on noisy 9 or 10 bit data at most, so having 48 bits of accuracy isn't really necessary.
So, first a test rig (all these examples are on Ubuntu 12.04 using the default gcc 4.8.2-19ubuntu1 on a 1.73GHz Core i7 Q820). The compile options are “-O3 -ffast-math -std=c++11 -Wall -Wextra -Werror”.
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstdint>
using namespace std;
using namespace std::chrono;

//Marsaglia's excellent xorshift PRNG
uint32_t x = 1, y = 0, z = 0, w = 0;
uint32_t xorshift128(void)
{
    uint32_t t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}

//Random numbers between 0 and 10 (negated to 0 to -10 at the point of use)
float rngf()
{
    return xorshift128() * (10.f / 4294967296.f);
}

double dummy=0;

//Benchmarking function, which runs the given function a
//bunch of times. It also optionally subtracts off a cost.
template<class C> double time_func(const C& f, const string& str, double ns)
{
    int N = 1000000;
    int M = 100;
    double ns_sum=0, ns_sum_square=0;

    high_resolution_clock clock;
    for(int j=0; j < M; j++)
    {
        auto t1 = clock.now();
        //We have to use the results of the benchmark
        //because the optimizer is smart enough to remove
        //the test function otherwise!
        for(int i=0; i < N; i++)
            dummy += f(-rngf());
        auto t2 = clock.now();

        double time_ns = duration_cast<nanoseconds>(t2 - t1).count() * 1.0 / N - ns;
        ns_sum += time_ns;
        ns_sum_square += time_ns*time_ns;
    }

    double mean = ns_sum / M;
    double std = sqrt(ns_sum_square/M - mean*mean);
    cout << mean << " +- " << std << " ns per " << str << " number ";
    return mean;
}

//Compare a function to std::exp
template<class C> void test_func(const C& f, const string& str, double ns)
{
    time_func(f, str, ns);

    int N = 10000000;
    float err=0;
    for(int i=0; i < N; i++)
    {
        float a = -rngf();
        err = max(err, abs(f(a) - expf(a)));
    }
    cout << " max err = " << err << endl;
}

//
// Exp functions go here.
//

int main()
{
    //Benchmark the random number generator
    double rng = time_func([](float f){return f;}, "RNG", 0);
    cout << endl;

    //Benchmark and test the functions.
    test_func(expf, "expf", rng);

    //Use the dummy variable to stop it being optimized out.
    cout << dummy << endl;
}
The test rig does mostly what you'd expect. I've used Xorshift rather than std::anything partly out of pre-C++11 force of habit and partly because it's faster than any std::random engine, more than good enough, and that makes the tests run more quickly. I'm impatient. The main thing to note is that I'm only exp()ing numbers between 0 and -10, since that's about the range where the accuracy matters.
The results are:
5.5559 +- 0.972546 ns per RNG number
25.655 +- 1.64518 ns per expf number  max err = 0
That's quite a lot of clock cycles per exp(): at 1.73GHz, 25.7ns is about 44 cycles. No wonder it's slow!
Getting started: a Taylor series
The most obvious first try is to use the Taylor series:

\[ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \]

There is a caveat though: the Taylor series only gives good results when you're “close” to 0. However, we can use the result

\[ e^{a+b} = e^a e^b \]

to make a lookup table. Specifically, we use

\[ e^x = e^{\lceil x \rceil}\, e^{x - \lceil x \rceil}, \]

so for example \(e^{-2.3} = e^{-2}\, e^{-0.3}\). We can precompute the table of exponentiated integers, and the Taylor series only ever sees the fractional part, which lies in \((-1, 0]\).
The code to do the Taylor series is written as a template function, so we can control the number of terms at compile time. Conveniently the easiest way to write it ends up using Horner’s method which minimizes the number of multiplications.
constexpr int factorial (int n)
{
    return n > 0 ? n * factorial( n - 1 ) : 1;
}

template<int N> float taylor_exp(float f)
{
    double accumulate=0;
    for(int n=N; n > 0; n--)
        accumulate = (accumulate + 1.0f/factorial(n)) * f;
    return accumulate + 1;
}

const float exps_0_15[16]=
{
    exp(- 0.f), exp(- 1.f), exp(- 2.f), exp(- 3.f),
    exp(- 4.f), exp(- 5.f), exp(- 6.f), exp(- 7.f),
    exp(- 8.f), exp(- 9.f), exp(-10.f), exp(-11.f),
    exp(-12.f), exp(-13.f), exp(-14.f), exp(-15.f),
};

float exp_taylor_1(float n)
{
    //Input in the range 0 to -large
    //only needs to be accurate relative to 1

    //exp(a+b) = exp(a) exp(b)
    //flt_epsilon = 1e-7
    //and exp(-16) = 1.1254e-07
    //so any exps this small when added to 1 will pretty much vanish.
    if( n <= -16)
        return 0;

    int fn = -ceil(n);
    n += fn;
    //n is now between -1 and 0

    return exps_0_15[fn] * taylor_exp<5>(n);
}
That’s not too bad, and the results are:
48.4559 +- 2.23374 ns per exp_taylor_1 number  max err = 0.00121278
25.8883 +- 1.71585 ns per expf number  max err = 0
5.70293 +- 0.85916 ns per RNG number
😦
So far we've successfully pessimised exp() and have worse accuracy. It turns out that on a modern compiler and CPU, the Horner scheme is awful and the naive way is nearly as good as it's possible to get (Horner's method is one long dependent chain of multiply-adds, whereas the naive sum gives the CPU independent work it can pipeline). So, let's try that:
template<int N> float int_pow(float x)
{
    return int_pow<N-1>(x) * x;
}

template<> float int_pow<1>(float x)
{
    return x;
}

template<int N> float taylor_exp_silly(float x)
{
    return int_pow<N>(x)*(1.0f/factorial(N)) + taylor_exp_silly<N-1>(x);
}

template<> float taylor_exp_silly<0>(float)
{
    return 1;
}

float exp_taylor_1_silly(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n);
    n += fn;

    return exps_0_15[fn] * taylor_exp_silly<5>(n);
}
And the results are:
49.4578 +- 1.955 ns per exp_taylor_1 number  max err = 0.00121278
26.3264 +- 1.51038 ns per expf number  max err = 0
20.2760 +- 1.40977 ns per exp_taylor_1_silly number  max err = 0.00121278
5.77306 +- 0.616445 ns per RNG number
And after only 2 attempts we've beaten the built-in exp function! In truth there were some worse ones I also tried. But this is a good start. Without a huge amount of work, we've got a faster exp with reasonable accuracy.
If you fiddle with the code a bit, it becomes pretty clear that the polynomial is the slow bit. Since the compiler already does a pretty good job of optimization, the only way to speed it up significantly is to get an equally good approximation with fewer terms.
Better approximating functions
There are many ways of approximating functions using elementary operations: minimax approximations or using a truncated Chebyshev series can improve the worst case accuracy for a given order by spreading out the errors over the range of the inputs. They’re a bit of a pain in practice, so I’m not going to try them this time.
Rational functions also give a good bang for the buck, and ready-made rational approximations are given in Padé tables. The nice thing about those is that you need only about half as many multiplications as the order of the approximation, at the penalty of a single division. It's also a very simple equation. Let's try it:
float exp_rational_1(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n);
    n += fn;
    n /= 2;

    return exps_0_15[fn] * (1 + n + n*n * (1.f/3.f)) / (1 - n + n*n*(1.f/3.f));
}
And the results are:
49.3604 +- 2.34379 ns per exp_taylor_1 number  max err = 0.00121278
26.4619 +- 1.43851 ns per expf number  max err = 0
20.3236 +- 1.50521 ns per exp_taylor_1_silly number  max err = 0.00121281
16.5280 +- 1.42434 ns per exp_rational_1 number  max err = 0.000541627
5.67616 +- 0.899543 ns per RNG number
Well, that’s great! We’ve got quite considerably faster and our function is not too complex.
Shorter polynomials using larger tables
The other way of making the polynomial faster is to use a larger lookup table so that fewer terms are needed. First we need to make a large lookup table. I think this can be done in C++11, and I’ll make it the subject of a future post. For now, I’ll use awk, and save the results as “exp_256.h”:
BEGIN{
    print "const double exps_0_16_256[257]=";
    print "{";
    for(i=0; i <= 256; i++)
        print "exp(" 0- i * 16 / 256 "),";
    print "};";
}
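As an aside, and only as a sketch of an alternative (this isn't what the timings below use): the same table can be built at program start-up in plain C++, and the same idea covers the larger exps_0_64_1024 and exps_0_16_4096 tables used further down by changing the size and spacing.

#include <cmath>

// Build the equivalent of the awk-generated exps_0_16_256 at start-up:
// 257 samples of exp(-x) for x from 0 to 16 in steps of 1/16.
struct ExpTable
{
    double t[257];
    ExpTable()
    {
        for (int i = 0; i <= 256; i++)
            t[i] = std::exp(-i * 16.0 / 256);
    }
};
static const ExpTable exp_table; // use exp_table.t[fn] in place of exps_0_16_256[fn]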
And the corresponding C++ code for both Taylor series and rational functions:
float exp_rational_2(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n*16);
    n += fn * 1.f/16;
    n *= 0.5f;

    return exps_0_16_256[fn] * (1 + n) / (1 - n);
}

float exp_taylor_2(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n*16);
    n += fn * 1.f/16;

    return exps_0_16_256[fn] * taylor_exp_silly<2>(n);
}
And the results are:
52.0466 +- 2.66404 ns per exp_taylor_1 number  max err = 0.00121278
30.2947 +- 2.28403 ns per exp_rational_2 number  max err = 1.91927e-05
27.9873 +- 1.747 ns per expf number  max err = 0
20.6494 +- 1.79951 ns per exp_taylor_1_silly number  max err = 0.00121281
17.4536 +- 1.74277 ns per exp_taylor_2 number  max err = 4.00543e-05
17.4435 +- 1.76992 ns per exp_rational_1 number  max err = 0.000541627
5.71171 +- 0.832463 ns per RNG number
So far the winner is the small Taylor series with the large lookup table. It's essentially the same speed as the rational function but about 3 bits (or if you prefer, a factor of 10) more accurate. For some reason, the rational function with the large table (exp_rational_2) is awful.
The thing is, the lookup table is now getting quite big. We can essentially shrink it exponentially by using a series of lookup tables, where the exponential factor is the table size: two size-16 tables (integers and sixteenths) cover the same range as a single table of 16² = 256 entries. Size-16 tables seem reasonable.
float exps_0_1[17]=
{
    exp(- 0 / 16.f), exp(- 1 / 16.f), exp(- 2 / 16.f), exp(- 3 / 16.f),
    exp(- 4 / 16.f), exp(- 5 / 16.f), exp(- 6 / 16.f), exp(- 7 / 16.f),
    exp(- 8 / 16.f), exp(- 9 / 16.f), exp(-10 / 16.f), exp(-11 / 16.f),
    exp(-12 / 16.f), exp(-13 / 16.f), exp(-14 / 16.f), exp(-15 / 16.f),
    exp(-1.f)
};

float taylor_exp_double_lookup(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n);
    n += fn;
    //n is now between -1 and 0

    int fn2 = -ceil(n*16);
    n += fn2 * 1.f/16;
    //n is now between -1/16 and 0

    return exps_0_15[fn] * exps_0_1[fn2] * taylor_exp_silly<2>(n);
}
Well, that’s much prettier than a large lookup table. So how does it do?
46.1554 +- 2.17428 ns per exp_taylor_1 number  max err = 0.00121278
27.8598 +- 1.78498 ns per exp_rational_2 number  max err = 1.91927e-05
24.2195 +- 1.68243 ns per expf number  max err = 0
24.0807 +- 2.35904 ns per taylor_exp_double number  max err = 4.01139e-05
19.0864 +- 1.93974 ns per exp_taylor_1_silly number  max err = 0.00121281
15.4396 +- 1.25958 ns per exp_rational_1 number  max err = 0.000541657
15.2347 +- 1.37960 ns per exp_taylor_2 number  max err = 4.00543e-05
5.98766 +- 0.90143 ns per RNG number
awww 😦
Well, that's no good. It turns out that the 256-element lookup table (about 2KB of doubles) fits into the L1 cache many times over, so shrinking it won't achieve anything. I guess that serves me right for ploughing ahead without thinking. Interestingly, at this point moving to lower degree polynomials doesn't help. That's presumably because the CPU can run the table lookup fetch in parallel with the polynomial evaluation.
Even larger tables
Well, the table lookup strategy looks worthwhile, as does reducing the number of operations in the polynomial. Why not try even larger lookup tables! Here exps_0_64_1024 is a table generated in the same way as before, but with a spacing of 1/64:
float exp_taylor_3(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n*64);
    n += fn * 1.f/64;

    return exps_0_64_1024[fn] * taylor_exp_silly<1>(n);
}
46.4604 +- 2.86348 ns per exp_taylor_1 number  max err = 0.00121278
27.5389 +- 1.63889 ns per exp_rational_2 number  max err = 1.91927e-05
24.4224 +- 1.60809 ns per expf number  max err = 0
21.1663 +- 1.48631 ns per taylor_exp_double number  max err = 0.00191307
18.7157 +- 1.49931 ns per exp_taylor_1_silly number  max err = 0.00121281
15.7689 +- 1.25929 ns per exp_taylor_2 number  max err = 0.00191307
15.6992 +- 1.10481 ns per exp_taylor_3 number  max err = 0.000121415
15.3896 +- 1.30146 ns per exp_rational_1 number  max err = 0.000541657
5.40330 +- 0.62257 ns per RNG number
Well, as expected, because the larger lookup table still fits in cache, it's just as fast as before, but we get more accuracy by virtue of having a larger table. But just how far can we push it? If we go far enough, then we can drop the order of the Taylor series down by 1…
float exp_taylor_4(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n*256);
    n += fn * 1.f/256;

    return exps_0_16_4096[fn] * taylor_exp_silly<1>(n);
}

float exp_taylor_4a(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n*256);
    n += fn * 1.f/256;

    return exps_0_16_4096[fn] * taylor_exp_silly<0>(n);
}
47.8244 +- 2.64700 ns per exp_taylor_1 number  max err = 0.00121278
28.6373 +- 1.62028 ns per exp_rational_2 number  max err = 1.91927e-05
24.6311 +- 1.35607 ns per expf number  max err = 0
24.1567 +- 1.74667 ns per taylor_exp_double number  max err = 4.01139e-05
19.7450 +- 1.59549 ns per exp_taylor_1_silly number  max err = 0.00121281
16.6356 +- 1.39842 ns per exp_taylor_4 number  max err = 7.62939e-06
16.5854 +- 1.72065 ns per exp_taylor_3 number  max err = 2.02656e-06
16.1388 +- 1.54983 ns per exp_rational_1 number  max err = 0.000541657
15.8857 +- 1.58612 ns per exp_taylor_2 number  max err = 4.00543e-05
6.15813 +- 0.87407 ns per exp_taylor_4a number  max err = 0.00389856
5.33836 +- 0.51697 ns per RNG number
Well, as expected, it’s just as fast and we get more accuracy for free. Based on the logic that the lookup and polynomial can run in parallel, I also bumped up the order for exp_taylor_3 this time. And that also doesn’t affect the speed, but is even more accurate than with the large lookup table. Of course we can play leapfrog with those numbers all day, but they’re more than accurate enough now, so we may as well pick the one with the smaller cache footprint.
There's no slowdown, in fact, all the way up to polynomials of O(3), and at that point we've actually hit the limits of the accuracy of float, so in this context we can get more or less perfect results with a speedup of about 1.5×. Bumping up to O(4) makes things much slower, so I think at that point we exceed the parallelism within the CPU.
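For reference, the O(3) version I'm talking about would look something like this (a sketch along the lines of exp_taylor_4; the name exp_taylor_4_cubic is mine):

// Same 1/256-spaced table as exp_taylor_4, but a cubic polynomial
// for the fractional part.
float exp_taylor_4_cubic(float n)
{
    if( n <= -16)
        return 0;

    int fn = -ceil(n*256);
    n += fn * 1.f/256;

    return exps_0_16_4096[fn] * taylor_exp_silly<3>(n);
}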
But what about 4a? Well, if you don't care about accuracy all that much, you can drop down to an O(0) approximation (i.e. \(e^x \approx 1\) over each table cell). It's fast and the accuracy is pretty terrible, but maybe acceptable. I guess that dropping the polynomial allows the CPU's parallelism to be used elsewhere.
Blatantly stealing code off the internet
The easiest way of writing a faster exp() is to steal someone else's. These work by reinterpreting a float as an int, figuring out what the int should be based on the float it represents, then doing some tricky algebra and bit manipulations to get a function out quickly. I tried the updated fastexp and fasterexp functions from Paul Mineiro's fastapprox (fastonebigheader.h):
//From here:
#include "fastonebigheader.h"

#define cast_uint32_t static_cast<uint32_t>

static inline float fastpow2 (float p)
{
    float offset = (p < 0) ? 1.0f : 0.0f;
    float clipp = (p < -126) ? -126.0f : p;
    int w = clipp;
    float z = clipp - w + offset;
    union { uint32_t i; float f; } v = { cast_uint32_t ( (1 << 23) * (clipp + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z) ) };

    return v.f;
}

static inline float fastexp (float p)
{
    return fastpow2 (1.442695040f * p);
}

static inline float fasterpow2 (float p)
{
    float clipp = (p < -126) ? -126.0f : p;
    union { uint32_t i; float f; } v = { cast_uint32_t ( (1 << 23) * (clipp + 126.94269504f) ) };
    return v.f;
}

static inline float fasterexp (float p)
{
    return fasterpow2 (1.442695040f * p);
}
48.1445 +- 4.01450 ns per exp_taylor_1 number  max err = 0.00121278
27.4882 +- 1.55120 ns per exp_rational_2 number  max err = 1.91927e-05
27.3199 +- 2.67695 ns per expf number  max err = 0
24.092 +- 2.38582 ns per taylor_exp_double number  max err = 4.01139e-05
21.0399 +- 1.97700 ns per exp_taylor_1_silly number  max err = 0.00121281
20.4665 +- 1.35137 ns per fastexp number  max err = 4.13656e-05
17.1155 +- 1.79687 ns per exp_taylor_2 number  max err = 4.00543e-05
16.1131 +- 1.68323 ns per exp_taylor_3 number  max err = 1.84774e-06
15.8211 +- 1.68255 ns per exp_rational_1 number  max err = 0.000541657
15.0374 +- 1.74190 ns per exp_taylor_4 number  max err = 1.84774e-06
9.19174 +- 1.13393 ns per fasterexp number  max err = 0.0286525
5.91207 +- 1.28270 ns per exp_taylor_4a number  max err = 0.00389856
5.77552 +- 0.838646 ns per RNG number
Well, in terms of raw speed, exp_taylor_4a is the clear winner, at the penalty of having poor accuracy, though not the worst. It's kind of obnoxious that a large, dumb lookup table beats deep cunning with bit fiddling in both speed and accuracy, but there you go.
The final stages: fiddling with compiler flags and stuff
Well the last stage is fiddling with flags. We don’t always have the luxury of -ffast-math. So how much worse is it without that?
40.4608 +- 3.21249 ns per exp_taylor_1 number  max err = 0.00121278
25.8433 +- 2.77110 ns per expf number  max err = 0
25.1485 +- 2.34913 ns per exp_rational_2 number  max err = 1.91927e-05
22.3017 +- 2.79495 ns per taylor_exp_double number  max err = 4.01139e-05
19.2444 +- 1.15690 ns per fastexp number  max err = 4.33326e-05
18.3797 +- 1.39186 ns per exp_taylor_3 number  max err = 1.84774e-06
18.3455 +- 1.30289 ns per exp_taylor_4 number  max err = 1.84774e-06
17.3415 +- 1.95789 ns per exp_taylor_2 number  max err = 4.00543e-05
17.1502 +- 1.76520 ns per exp_taylor_1_silly number  max err = 0.00121281
8.46855 +- 1.05033 ns per exp_rational_1 number  max err = 0.000541657
5.72898 +- 0.93221 ns per RNG number
3.23796 +- 0.90463 ns per fasterexp number  max err = 0.0286525
2.24955 +- 0.58446 ns per exp_taylor_4a number  max err = 0.00389856
Huh. Didn’t expect things to get faster.
What about the random number generator? Surely that can't make any difference, right? Because I measured the speed of that and subtracted it off, right?
#include <random>

//...

//I love C++11!
mt19937 rng;
uniform_real_distribution<float> unif(0, 10);

float rngf()
{
    return unif(rng);
}
41.4247 +- 2.72788 ns per exp_taylor_1 number  max err = 0.00121278
27.8752 +- 4.46173 ns per expf number  max err = 0
15.9981 +- 1.56283 ns per exp_taylor_1_silly number  max err = 0.00121281
15.9583 +- 1.23581 ns per exp_rational_2 number  max err = 1.91927e-05
14.9642 +- 1.26806 ns per fastexp number  max err = 4.33326e-05
14.9292 +- 2.12149 ns per exp_taylor_4 number  max err = 1.84774e-06
13.4167 +- 1.51129 ns per exp_taylor_3 number  max err = 1.84774e-06
12.7261 +- 1.38275 ns per taylor_exp_double number  max err = 4.01139e-05
12.3376 +- 1.94665 ns per exp_taylor_2 number  max err = 4.00543e-05
12.0379 +- 1.33376 ns per RNG number
10.5822 +- 1.22292 ns per exp_rational_1 number  max err = 0.000541627
4.59183 +- 1.51845 ns per exp_taylor_4a number  max err = 0.00389439
2.96676 +- 1.49019 ns per fasterexp number  max err = 0.0286524
What???? How did fastexp suddenly become so fast? Why did exp_taylor_4a halve in speed?
I think this demonstrates that we've reached the limits of microbenchmarking here. At this point there's going to be a large factor which depends very much on the code surrounding the exp() call, because how heavily that code loads the CPU's execution units will affect the measured speed.
Conclusions
1. You can make a special purpose exp() much faster
2. I hate benchmarking.
The complete code is here: