Writing a faster exp() and by the way, I hate benchmarking

exp() functions are very common in a variety of problems, including machine learning. I maintain the 3B microscopy software, which makes heavy use of the forward algorithm. Within this algorithm one has to sum probability densities to do a marginalisation. Since these tend to get very large or very small, they are stored as logs. To sum them up, one computes:

Z + \log \sum_i e^{n_i - Z}

Algebraically, Z is arbitrary and has no effect. Numerically it's convenient to set Z = \max_i n_i, so that the largest exponent is always 0 and the rest are smaller.
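
To make that concrete, here's a minimal sketch of the trick as a standalone helper (my own illustration, not the actual 3B code):

#include <algorithm>
#include <cmath>
#include <vector>

//Shift by the maximum so the largest exponent is exactly 0.
double log_sum_exp(const std::vector<double>& n)
{
	double Z = *std::max_element(n.begin(), n.end());

	double sum = 0;
	for(double n_i: n)
		sum += std::exp(n_i - Z); //largest term is exp(0) = 1

	return Z + std::log(sum);
}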

Currently the code uses the standard library exp(). This is a beautifully engineered function which gives full precision over the complete range of inputs and deals gracefully with values out of range and special values like Inf and NaN. It’s definitely the only good choice for general purpose use, but it’s also slow and is the bottleneck in my code.

For 3B, I don't need those properties. Firstly, I never have NaN in the code. Secondly, the inputs are always zero or smaller. Thirdly, I'm always adding the exp() of them to 1, which means the result only needs good accuracy relative to 1, so exp(-large number) can always be approximated as zero. Finally, it's a Monte Carlo algorithm working on noisy 9 or 10 bit data at most, so having 48 bits of accuracy isn't really necessary.

So, first a test rig (all these examples are on Ubuntu 12.04 using the default gcc 4.8.2-19ubuntu1 on a 1.73GHz Core i7 Q820). The compile options are “-O3 -ffast-math -std=c++11 -Wall -Wextra -Werror”.

#include <iostream>
#include <chrono>
#include <cmath>
#include <cstdint>
using namespace std;
using namespace std::chrono;

//Marsaglia's excellent xorshift PRNG
uint32_t x = 1 , y = 0, z  = 0, w = 0;
uint32_t xorshift128(void) {
    uint32_t t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}

//Random numbers between 0 and 10 (negated at the call site,
//so the test inputs lie between 0 and -10)
float rngf()
{
	return xorshift128() * (10.f / 4294967296.f);
}

double dummy=0;

//Benchmarking function, which runs the given function a
//bunch of times. It also optionally subtracts off a cost.
template<class C>
double time_func(const C& f, const string& str, double ns)
{
	int N = 1000000;
	int M = 100;

	double ns_sum=0, ns_sum_square=0;

	high_resolution_clock clock;

	for(int j=0; j < M; j++)
	{
		auto t1 = clock.now();

		//We have to use the results of the benchmark
		//because the optimizer is smart enough to remove
		//the test function otherwise!
		for(int i=0; i < N; i++)
			dummy += f(-rngf());

		auto t2 = clock.now();

		double time_ns = duration_cast<nanoseconds>(t2 - t1).count() *1.0 / N - ns;

		ns_sum += time_ns;
		ns_sum_square += time_ns*time_ns;
	}

	double mean = ns_sum / M;
	double std = sqrt(ns_sum_square/M -mean*mean);

	cout << mean << " +- " << std << "  ns per " << str << " number ";

	return mean;
}

//Compare a function to std::exp
template<class C>
void test_func(const C& f, const string& str, double ns)
{
	time_func(f, str, ns);

	int N = 10000000;
	float err=0;
	for(int i=0; i < N; i++)
	{
		float a = -rngf();
		err = max(err, abs(f(a) - expf(a)));
	}

	cout << " max err = " << err << endl;
}

//
// Exp functions go here.
//

int main()
{
	//Benchmark the random number generator
	double rng = time_func([](float f){return f;}, "RNG", 0);
	cout << endl;

	//Benchmark and test the functions.
	test_func(expf, "expf", rng);

	//Use the dummy variable to stop it being optimized out.
	cout << dummy << endl;
}

The test rig does mostly what you’d expect. I’ve used xorshift rather than std::anything, partly out of pre-C++11 force of habit and partly because it’s faster than anything in std::random, more than good enough, and makes the tests run more quickly. I’m impatient. The main thing to note is that I’m only exp()ing numbers between 0 and -10, since that’s about the range where the accuracy matters.

The results are:

5.5559 +- 0.972546  ns per RNG number
25.655 +- 1.64518  ns per expf number  max err = 0

That’s quite a lot of clock cycles per exp. No wonder it’s slow!

Getting started: a Taylor series

The most obvious first try is to use the Taylor series e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!}. There is a caveat though: the Taylor series only gives good results when you’re “close” to 0. However, we can use the result e^{a+b} = e^a e^b to make a lookup table. Specifically, we use e^a = e^{\lfloor a \rfloor + a - \lfloor a \rfloor} = e^{\lfloor a \rfloor} e^{a - \lfloor a \rfloor}. We can precompute the table of exponentiated integers.
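
For example (my own illustration; note the code below uses ceil rather than floor, so the remainder stays in (-1, 0]): e^{-2.7} = e^{-2} e^{-0.7}, so we look up e^{-2} in the table and only need the series for e^{-0.7}.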

The code to do the Taylor series is written as a template function, so we can control the number of terms at compile time. Conveniently, the easiest way to write it ends up using Horner’s method, which minimizes the number of multiplications.

constexpr int factorial (int n)
{
    return n > 0 ? n * factorial( n - 1 ) : 1;
}

template<int N>
float taylor_exp(float f)
{
	double accumulate=0;
	for(int n=N; n > 0; n--)
		accumulate = (accumulate + 1.0f/factorial(n)) * f;

	return accumulate + 1;
}

const float exps_0_15[16]=
{
	exp(- 0.f),
	exp(- 1.f),
	exp(- 2.f),
	exp(- 3.f),
	exp(- 4.f),
	exp(- 5.f),
	exp(- 6.f),
	exp(- 7.f),
	exp(- 8.f),
	exp(- 9.f),
	exp(-10.f),
	exp(-11.f),
	exp(-12.f),
	exp(-13.f),
	exp(-14.f),
	exp(-15.f),
};

float exp_taylor_1(float n)
{
	//Input in the range  0 to -large
	//only needs to be accurate relative to 1

	//exp(a+b) = exp(a) exp(b)

	//flt_epsilon = 1e-7
	//and exp(-16) = 1.1254e-07
	//so any exps this small when added to 1 will pretty much vanish.
	if( n <= -16)
		return 0;

	int fn = -ceil(n);
	n += fn;

	//n is now between -1 and 0

	return exps_0_15[fn] * taylor_exp<5>(n);
}

That’s not too bad, and the results are:

48.4559 +- 2.23374  ns per exp_taylor_1 number  max err = 0.00121278
25.8883 +- 1.71585  ns per expf number  max err = 0
5.70293 +- 0.85916  ns per RNG number

😦

So far we’ve successfully pessimised exp() and have worse accuracy. It turns out that on a modern compiler and CPU, the Horner scheme is awful and the naive way is nearly as good as it’s possible to get. So, let’s try that:


template<int N> float int_pow(float x)
{
	return int_pow<N-1>(x) * x;
}

template<> float int_pow<1>(float x)
{
	return x;
}

template<int N> float taylor_exp_silly(float x)
{
	return int_pow<N>(x)*(1.0f/factorial(N)) + taylor_exp_silly<N-1>(x);
}

template<> float taylor_exp_silly<0>(float)
{
	return 1;
}

float exp_taylor_1_silly(float n)
{
	if( n <= -16)
		return 0;

	int fn = -ceil(n);
	n += fn;

	return exps_0_15[fn] * taylor_exp_silly<5>(n);
}

And the results are:

49.4578 +- 1.955  ns per exp_taylor_1 number  max err = 0.00121278
26.3264 +- 1.51038  ns per expf number  max err = 0
20.2760 +- 1.40977  ns per exp_taylor_1_silly number  max err = 0.00121278
5.77306 +- 0.616445  ns per RNG number

And after only 2 attempts we’ve beaten the built-in exp() function! In truth there were some worse ones I also tried, but this is a good start. Without a huge amount of work, we’ve got a faster exp() with reasonable accuracy.

If you fiddle with the code a bit, it becomes pretty clear that the polynomial is the slow bit. Since the compiler already does a pretty good job of optimization, the only way to speed it up significantly is to get an equally good approximation with fewer terms.

Better approximating functions

There are many ways of approximating functions using elementary operations: minimax approximations or truncated Chebyshev series can improve the worst case accuracy for a given order by spreading the errors over the range of the inputs. They’re a bit of a pain in practice, so I’m not going to try them this time.

Rational functions also give a good bang for the buck; rational approximations of many functions are tabulated as Padé approximants. The nice thing about these is that they need only about half as many multiplications as the order, at the cost of a single division. It’s also a very simple expression.
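
To spell out where the code below comes from (my own working, not copied from a table): the (2,2) Padé approximant of the exponential is e^x \approx \frac{1 + x/2 + x^2/12}{1 - x/2 + x^2/12}, and substituting x = 2n gives \frac{1 + n + n^2/3}{1 - n + n^2/3}, which is why the function halves its argument and uses coefficients of 1/3. Let’s try it: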

float exp_rational_1(float n)
{
	if( n <= -16)
		return 0;

	int fn = -ceil(n);
	n += fn;
	n /=2;

	return exps_0_15[fn] * (1 + n + n*n * (1.f/3.f) ) / (1-n + n*n*(1.f/3.f));	

}

And the results are:

49.3604 +- 2.34379  ns per exp_taylor_1 number  max err = 0.00121278
26.4619 +- 1.43851  ns per expf number  max err = 0
20.3236 +- 1.50521  ns per exp_taylor_1_silly number  max err = 0.00121281
16.5280 +- 1.42434  ns per exp_rational_1 number  max err = 0.000541627
5.67616 +- 0.899543  ns per RNG number

Well, that’s great! We’ve got quite considerably faster and our function is not too complex.

Shorter polynomials using larger tables

The other way of making the polynomial faster is to use a larger lookup table so that fewer terms are needed. First we need to make a large lookup table. I think this can be done in C++11, and I’ll make it the subject of a future post. For now, I’ll use awk, and save the results as “exp_256.h”:

BEGIN{
	print "const double exps_0_16_256[257]=";
	print "{";

	for(i=0; i <= 256; i++)
		print "exp("  0- i * 16 / 256  "),";

	print "};";
}
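
As an aside, if you’d rather skip the code-generation step entirely, the same table can be filled at program start-up. A minimal sketch (assuming the same table name and size as the generated header, and non-const since it’s filled at run time):

#include <cmath>

double exps_0_16_256[257];

namespace {
	//Fill the table once, before main() runs.
	struct FillExpTable
	{
		FillExpTable()
		{
			for(int i=0; i <= 256; i++)
				exps_0_16_256[i] = std::exp(-i * 16.0 / 256.0);
		}
	} fill_exp_table;
}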

And the corresponding C++ code for both Taylor series and rational functions:


float exp_rational_2(float n)
{
	if( n <= -16)
		return 0;

	int fn = -ceil(n*16);
	n += fn *1.f/16;
	n*=0.5f;

	return exps_0_16_256[fn] * (1 + n) / (1-n);

}

float exp_taylor_2(float n)
{
	if( n <= -16)
		return 0;

	int fn = -ceil(n*16);
	n += fn *1.f/16;

	return exps_0_16_256[fn] * taylor_exp_silly<2>(n);
}

And the results are:

52.0466 +- 2.66404  ns per exp_taylor_1 number  max err = 0.00121278
30.2947 +- 2.28403  ns per exp_rational_2 number  max err = 1.91927e-05
27.9873 +- 1.747    ns per expf number  max err = 0
20.6494 +- 1.79951  ns per exp_taylor_1_silly number  max err = 0.00121281
17.4536 +- 1.74277  ns per exp_taylor_2 number  max err = 4.00543e-05
17.4435 +- 1.76992  ns per exp_rational_1 number  max err = 0.000541627
5.71171 +- 0.832463 ns per RNG number

So far the winner is the small Taylor series with the large lookup table. It’s essentially the same speed as the rational function but about 3 bits (or, if you prefer, a factor of 10) more accurate. For some reason, the rational function with the large table (exp_rational_2) is awful.

The thing is, the lookup table is now getting quite big. We can shrink it essentially exponentially by using a chain of lookup tables: two tables of size 16 cover the same range and resolution as a single table of size 256. Size 16 tables seem reasonable.

float exps_0_1[17]=
{
	exp(- 0 / 16.f),
	exp(- 1 / 16.f),
	exp(- 2 / 16.f),
	exp(- 3 / 16.f),
	exp(- 4 / 16.f),
	exp(- 5 / 16.f),
	exp(- 6 / 16.f),
	exp(- 7 / 16.f),
	exp(- 8 / 16.f),
	exp(- 9 / 16.f),
	exp(-10 / 16.f),
	exp(-11 / 16.f),
	exp(-12 / 16.f),
	exp(-13 / 16.f),
	exp(-14 / 16.f),
	exp(-15 / 16.f),
	exp(-1.f)
};


float taylor_exp_double_lookup(float n)
{
	if( n <= -16)
		return 0;
	
	int fn = -ceil(n);
	n += fn;

	//n is now between -1 and 0

	int fn2 = -ceil(n*16);
	n += fn2 * 1.f/16;
	
	//n is now between -1/16 and 0

	return exps_0_15[fn] * exps_0_1[fn2] * taylor_exp_silly<2>(n);	

}

Well, that’s much prettier than a large lookup table. So how does it do?

46.1554 +- 2.17428  ns per exp_taylor_1 number  max err = 0.00121278
27.8598 +- 1.78498  ns per exp_rational_2 number  max err = 1.91927e-05
24.2195 +- 1.68243  ns per expf number  max err = 0
24.0807 +- 2.35904  ns per taylor_exp_double number  max err = 4.01139e-05
19.0864 +- 1.93974  ns per exp_taylor_1_silly number  max err = 0.00121281
15.4396 +- 1.25958  ns per exp_rational_1 number  max err = 0.000541657
15.2347 +- 1.37960  ns per exp_taylor_2 number  max err = 4.00543e-05
5.98766 +- 0.90143  ns per RNG number 

awww 😦

Well, that’s no good. It turns out that the 256-element lookup table fits into the L1 cache many times over, so shrinking it won’t achieve anything. I guess that serves me right for ploughing ahead without thinking. Interestingly, at this point moving to lower degree polynomials doesn’t help. That’s presumably because the CPU can run the table lookup in parallel with the polynomial evaluation.

Even larger tables

Well, the table lookup strategy looks worthwhile, as does reducing the number of operations in the polynomial. Why not try even larger lookup tables!

float exp_taylor_3(float n)
{
	if( n <= -16)
		return 0;
	
	int fn = -ceil(n*64);
	n += fn *1.f/64;

	return exps_0_64_1024[fn] * taylor_exp_silly<1>(n);
}

And the results are:

46.4604 +- 2.86348  ns per exp_taylor_1 number  max err = 0.00121278
27.5389 +- 1.63889  ns per exp_rational_2 number  max err = 1.91927e-05
24.4224 +- 1.60809  ns per expf number  max err = 0
21.1663 +- 1.48631  ns per taylor_exp_double number  max err = 0.00191307
18.7157 +- 1.49931  ns per exp_taylor_1_silly number  max err = 0.00121281
15.7689 +- 1.25929  ns per exp_taylor_2 number  max err = 0.00191307
15.6992 +- 1.10481  ns per exp_taylor_3 number  max err = 0.000121415
15.3896 +- 1.30146  ns per exp_rational_1 number  max err = 0.000541657
5.40330 +- 0.62257  ns per RNG number 

Well, as expected, because the larger lookup table still fits in cache, it’s just as fast as before, but we get more accuracy by virtue of having a larger table. But just how far can we push it? If we go far enough, we can drop the order of the Taylor series down by 1…

float exp_taylor_4(float n)
{
	if( n <= -16)
		return 0;
	
	int fn = -ceil(n*256);
	n += fn *1.f/256;

	return exps_0_16_4096[fn] * taylor_exp_silly<1>(n);
}


float exp_taylor_4a(float n)
{
	if( n <= -16)
		return 0;
	
	int fn = -ceil(n*256);
	n += fn *1.f/256;

	return exps_0_16_4096[fn] * taylor_exp_silly<0>(n);
}

And the results are:

47.8244 +- 2.64700  ns per exp_taylor_1 number  max err = 0.00121278
28.6373 +- 1.62028  ns per exp_rational_2 number  max err = 1.91927e-05
24.6311 +- 1.35607  ns per expf number  max err = 0
24.1567 +- 1.74667  ns per taylor_exp_double number  max err = 4.01139e-05
19.7450 +- 1.59549  ns per exp_taylor_1_silly number  max err = 0.00121281
16.6356 +- 1.39842  ns per exp_taylor_4 number  max err = 7.62939e-06
16.5854 +- 1.72065  ns per exp_taylor_3 number  max err = 2.02656e-06
16.1388 +- 1.54983  ns per exp_rational_1 number  max err = 0.000541657
15.8857 +- 1.58612  ns per exp_taylor_2 number  max err = 4.00543e-05
6.15813 +- 0.87407  ns per exp_taylor_4a number  max err = 0.00389856
5.33836 +- 0.51697  ns per RNG number 

Well, as expected, it’s just as fast and we get more accuracy for free. Based on the logic that the lookup and polynomial can run in parallel, I also bumped up the order for exp_taylor_3 this time. And that also doesn’t affect the speed, but is even more accurate than with the large lookup table. Of course we can play leapfrog with those numbers all day, but they’re more than accurate enough now, so we may as well pick the one with the smaller cache footprint.

There’s no slowdown, in fact, all the way up to polynomials of O(3), and at that point we’ve actually hit the limits of the accuracy of float, so in this context we can get more or less perfect results with a speedup of about 1.5 over the standard exp(). Bumping up to O(4) makes things much slower, so I think at that point we exceed the parallelism within the CPU.
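
For reference, here’s a sketch of what that O(3) version looks like (my reconstruction; it isn’t in the original listing, and the choice of the 1024-entry table is an assumption):

//Sketch: the O(3) variant described above. My reconstruction, not from
//the original listing; the table choice is an assumption.
float exp_taylor_3_order3(float n)
{
	if( n <= -16)
		return 0;

	int fn = -ceil(n*64);
	n += fn *1.f/64;

	return exps_0_64_1024[fn] * taylor_exp_silly<3>(n);
}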

But what about 4a? Well, if you don’t care about accuracy all that much, you can drop down to an O(0) approximation (i.e. e^x \approx 1). It’s fast and the accuracy is pretty terrible, but maybe acceptable. I guess that dropping the polynomial allows the CPU’s parallelism to be used elsewhere.

Blatantly stealing code off the internet

The easiest way of writing a faster exp() is to steal someone else’s. These work by reinterpreting a float as an int, figuring out what the int should be based on the float it represents, and then doing some tricky algebra and bit manipulation to get a function out quickly. I tried the updated fastexp and fasterexp functions from fastonebigheader.h (the fastapprox library):


//From here: #include "fastonebigheader.h"

#define cast_uint32_t static_cast<uint32_t>
static inline float
fastpow2 (float p)
{
  float offset = (p < 0) ? 1.0f : 0.0f;
  float clipp = (p < -126) ? -126.0f : p;
  int w = clipp;
  float z = clipp - w + offset;
  union { uint32_t i; float f; } v = { cast_uint32_t ( (1 << 23) * (clipp + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z) ) };

  return v.f;
}

static inline float
fastexp (float p)
{
  return fastpow2 (1.442695040f * p);
}

static inline float
fasterpow2 (float p)
{
  float clipp = (p < -126) ? -126.0f : p;
  union { uint32_t i; float f; } v = { cast_uint32_t ( (1 << 23) * (clipp + 126.94269504f) ) };
  return v.f;
}

static inline float
fasterexp (float p)
{
  return fasterpow2 (1.442695040f * p);
}

48.1445 +- 4.01450  ns per exp_taylor_1 number  max err = 0.00121278
27.4882 +- 1.55120  ns per exp_rational_2 number  max err = 1.91927e-05
27.3199 +- 2.67695  ns per expf number  max err = 0
24.092  +- 2.38582  ns per taylor_exp_double number  max err = 4.01139e-05
21.0399 +- 1.97700  ns per exp_taylor_1_silly number  max err = 0.00121281
20.4665 +- 1.35137  ns per fastexp number  max err = 4.13656e-05
17.1155 +- 1.79687  ns per exp_taylor_2 number  max err = 4.00543e-05
16.1131 +- 1.68323  ns per exp_taylor_3 number  max err = 1.84774e-06
15.8211 +- 1.68255  ns per exp_rational_1 number  max err = 0.000541657
15.0374 +- 1.74190  ns per exp_taylor_4 number  max err = 1.84774e-06
9.19174 +- 1.13393  ns per fasterexp number  max err = 0.0286525
5.91207 +- 1.28270  ns per exp_taylor_4a number  max err = 0.00389856
5.77552 +- 0.838646  ns per RNG number 

Well, in terms of raw speed, exp_taylor_4a is the clear winner, at the penalty of having poor accuracy, though not the worst. It’s kind of obnoxious that a large, dumb lookup table beats deep cunning with bit fiddling in both speed and accuracy, but there you go.

The final stages: fiddling with compiler flags and stuff

Well, the last stage is fiddling with flags. We don’t always have the luxury of -ffast-math. So how much worse is it without that?

40.4608 +- 3.21249  ns per exp_taylor_1 number  max err = 0.00121278
25.8433 +- 2.77110  ns per expf number  max err = 0
25.1485 +- 2.34913  ns per exp_rational_2 number  max err = 1.91927e-05
22.3017 +- 2.79495  ns per taylor_exp_double number  max err = 4.01139e-05
19.2444 +- 1.15690  ns per fastexp number  max err = 4.33326e-05
18.3797 +- 1.39186  ns per exp_taylor_3 number  max err = 1.84774e-06
18.3455 +- 1.30289  ns per exp_taylor_4 number  max err = 1.84774e-06
17.3415 +- 1.95789  ns per exp_taylor_2 number  max err = 4.00543e-05
17.1502 +- 1.76520  ns per exp_taylor_1_silly number  max err = 0.00121281
8.46855 +- 1.05033  ns per exp_rational_1 number  max err = 0.000541657
5.72898 +- 0.93221  ns per RNG number 
3.23796 +- 0.90463  ns per fasterexp number  max err = 0.0286525
2.24955 +- 0.58446  ns per exp_taylor_4a number  max err = 0.00389856

Huh. Didn’t expect things to get faster.

What about the random number generator? Surely that can’t make any difference, right? Because I measured its speed and subtracted it off, right?

#include <random>

//...

//I love C++11!
mt19937 rng;
uniform_real_distribution<float> unif(0, 10);

float rngf()
{
	return unif(rng);
}

41.4247 +- 2.72788  ns per exp_taylor_1 number  max err = 0.00121278
27.8752 +- 4.46173  ns per expf number  max err = 0
15.9981 +- 1.56283  ns per exp_taylor_1_silly number  max err = 0.00121281
15.9583 +- 1.23581  ns per exp_rational_2 number  max err = 1.91927e-05
14.9642 +- 1.26806  ns per fastexp number  max err = 4.33326e-05
14.9292 +- 2.12149  ns per exp_taylor_4 number  max err = 1.84774e-06
13.4167 +- 1.51129  ns per exp_taylor_3 number  max err = 1.84774e-06
12.7261 +- 1.38275  ns per taylor_exp_double number  max err = 4.01139e-05
12.3376 +- 1.94665  ns per exp_taylor_2 number  max err = 4.00543e-05
12.0379 +- 1.33376  ns per RNG number 
10.5822 +- 1.22292  ns per exp_rational_1 number  max err = 0.000541627
4.59183 +- 1.51845  ns per exp_taylor_4a number  max err = 0.00389439
2.96676 +- 1.49019  ns per fasterexp number  max err = 0.0286524

What???? How did fastexp suddenly become so fast? Why did exp_taylor_4a halve in speed?

I think this demonstrates that we’ve reached the limits of microbenchmarking here. At this point there’s a large factor which depends very much on the code surrounding the exp(), because how much that surrounding code ties up the CPU’s execution units affects the measured speed.

Conclusions

1. You can make a special purpose exp() much faster
2. I hate benchmarking.

The complete code is here:

https://github.com/edrosten/fasterexp
