### Introduction

In the source of SGen, Mono’s new garbage collector currently in development, there’s a little linear search function for a small, fixed-size array, with the comment “do a binary search or lookup table later”. One of our Google Summer of Code students took this to heart and implemented a binary search that was unfortunately not only incorrect but also slower than the linear search. This is not too surprising, because implementing a binary search is anything but trivial, despite the seeming simplicity of the algorithm. Also, common wisdom says that for small array sizes, linear search beats binary search.

I took this as the starting point for a little project, to find out about the respective virtues of linear and binary search on modern CPUs.

To limit the scope of this project, I concerned myself only with searching in small to medium sized sorted arrays of integers, producing the index of the first element that is equal to or greater than the search key. My results will not necessarily apply to other applications.

Although I will briefly explore the use of SIMD instructions to speed up linear search, this is not the primary purpose of this project, so I haven't benchmarked those implementations as extensively; I will, however, present a few results.

### Implementation

#### Linear search

This is the basic linear search:

```c
static int
linear (const int *arr, int n, int key)
{
    int i = 0;
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    return i;
}
```

Every loop iteration here does two conditional branches: One to check whether we’re already at the element we’re looking for and the other to check whether we’re at the end of the array. Fortunately, both are easy to predict: By predicting that we’re not at the end of the array and that we haven’t found the element yet, we get exactly one branch misprediction per search, which is hard to improve upon.

We can, however, reduce the total number of branches. First, by unrolling the loop, like this:

```c
static int
linear_4 (const int *arr, int n, int key)
{
    int i = 0;
    while (i + 3 < n) {
        if (arr [i + 0] >= key) return i + 0;
        if (arr [i + 1] >= key) return i + 1;
        if (arr [i + 2] >= key) return i + 2;
        if (arr [i + 3] >= key) return i + 3;
        i += 4;
    }
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    return i;
}
```

This reduces the number of conditional branches per array element from 2 down to 1.25, depending on `n`. For `n < 4` this unrolling will not speed up the search at all, of course. One can strike a compromise between performance for small arrays and the benefit of packing as many iterations as possible into one unrolling by unrolling the loop multiple times, with the longer unrollings first.
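Such a multi-stage unrolling might look like the following sketch (the 8-then-4 staging and the function name are my illustration, not necessarily the exact variant benchmarked later):

```c
/* Sketch of a multi-stage unrolled linear search: an 8-wide pass,
   then a 4-wide pass, then an element-by-element finish. */
static int
linear_8_then_4 (const int *arr, int n, int key)
{
    int i = 0;
    while (i + 7 < n) {
        if (arr [i + 0] >= key) return i + 0;
        if (arr [i + 1] >= key) return i + 1;
        if (arr [i + 2] >= key) return i + 2;
        if (arr [i + 3] >= key) return i + 3;
        if (arr [i + 4] >= key) return i + 4;
        if (arr [i + 5] >= key) return i + 5;
        if (arr [i + 6] >= key) return i + 6;
        if (arr [i + 7] >= key) return i + 7;
        i += 8;
    }
    while (i + 3 < n) {
        if (arr [i + 0] >= key) return i + 0;
        if (arr [i + 1] >= key) return i + 1;
        if (arr [i + 2] >= key) return i + 2;
        if (arr [i + 3] >= key) return i + 3;
        i += 4;
    }
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    return i;
}
```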

There is another trick to get rid of the end-check completely, namely putting a sentinel value after the last valid array element. In our case this sentinel should be the largest possible integer (i.e., `INT_MAX`). It will always cause the key-check to succeed and the loop to finish after the last element. The basic search function for an array with a sentinel looks like this:

```c
static int
linear_sentinel (const int *arr, int key)
{
    int i = 0;
    for (;;) {
        if (arr [i] >= key)
            return i;
        ++i;
    }
}
```

Now we’re down to one conditional and one unconditional branch per iteration. By unrolling we can reduce the unconditional branch to a fraction. Note that with a sentinel we don’t need an end-check even with unrolling, so searches through very small arrays also benefit from unrolling.
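Setting up the sentinel is the caller's job; here is a minimal sketch of how such an array could be prepared (the allocation scheme is my assumption, not taken from SGen):

```c
#include <limits.h>
#include <stdlib.h>

/* Allocate room for n elements plus one sentinel slot, copy the
   sorted data in, and append INT_MAX so that the key-check always
   succeeds at index n at the latest. */
static int *
make_sentinel_array (const int *src, int n)
{
    int *arr = malloc ((n + 1) * sizeof (int));
    if (!arr)
        return NULL;
    for (int i = 0; i < n; ++i)
        arr [i] = src [i];
    arr [n] = INT_MAX;   /* the sentinel */
    return arr;
}
```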

#### Linear search with SSE2

Using a SIMD instruction set like SSE2 we can further reduce the number of instructions per array element. First, we can load 4 array elements in one go into a 128-bit register. Then, if we have prepared another register with 4 copies of `key`, we can do 4 comparisons in parallel (with `PCMPGTD`). We can then get the results of those comparisons into a general purpose register (with `PMOVMSKB`), which we can compare to the result we would expect if none of the 4 array elements matched our key yet, in which case we increment our counter and repeat the loop. If the result indicates that one of the elements does match, we can use the `BSF` instruction, which counts the number of trailing zero bits in a register, to determine the index of the matching element without needing another loop. Using GCC's SSE2 builtins, the basic loop looks like this:

```c
/* GCC vector types used by the SSE2 builtins. */
typedef int v4si __attribute__ ((vector_size (16)));
typedef char v16qi __attribute__ ((vector_size (16)));

static int
linear_sentinel_sse2 (const int *arr, int n, int key)
{
    v4si *in_data = (v4si*)arr;
    v4si key4 = { key, key, key, key };
    int i = 0;
    int res;    /* declared outside the loop so it's still in scope after the break */
    for (;;) {
        v4si tmp = __builtin_ia32_pcmpgtd128 (key4, in_data [i]);
        res = __builtin_ia32_pmovmskb128 ((v16qi)tmp);
        if (res != 0xffff)
            break;
        ++i;
    }
    return i * 4 + __builtin_ctz (~res) / 4;
}
```

This loop can of course be unrolled, as well. Note that after the sentinel we require that there be at least 3 more dummy array elements so that the load doesn’t result in a segmentation fault. If the array is properly aligned to 16 bytes that will always be the case.
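To illustrate that layout requirement, here is one way such an array could be prepared; the rounding scheme is my assumption, and any scheme that keeps aligned 16-byte loads inside the allocation would do:

```c
#include <limits.h>
#include <stdlib.h>

/* Sketch: allocate a 16-byte aligned copy of the array, rounded up
   to a multiple of 4 ints, with INT_MAX in the sentinel slot and in
   every dummy slot after it.  With aligned 16-byte loads the last
   load then never reads past the allocation. */
static int *
make_sse2_array (const int *src, int n)
{
    int padded = (n + 4) & ~3;    /* n ints + sentinel, rounded up to 4 */
    void *mem;
    if (posix_memalign (&mem, 16, padded * sizeof (int)) != 0)
        return NULL;
    int *arr = mem;
    for (int i = 0; i < n; ++i)
        arr [i] = src [i];
    for (int i = n; i < padded; ++i)
        arr [i] = INT_MAX;        /* sentinel and dummies */
    return arr;
}
```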

We are now at one conditional branch per 4 array elements, but we can go even lower. Using the `PACKSSDW` and `PACKSSWB` instructions we can pack the results of 16 key comparisons into one vector, which we can then move into a general purpose register like above and branch on. We'll require even more dummy memory after the sentinel. Here's what it looks like:

```c
typedef short v8hi __attribute__ ((vector_size (16)));

static int
linear_sentinel_sse2_nobranch (const int *arr, int n, int key)
{
    v4si *in_data = (v4si*)arr;
    v4si key4 = { key, key, key, key };
    int i = 0;
    int res;    /* declared outside the loop so it's still in scope after the break */
    for (;;) {
        v4si cmp0 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 0]);
        v4si cmp1 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 1]);
        v4si cmp2 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 2]);
        v4si cmp3 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 3]);
        v8hi pack01 = __builtin_ia32_packssdw128 (cmp0, cmp1);
        v8hi pack23 = __builtin_ia32_packssdw128 (cmp2, cmp3);
        v16qi pack0123 = __builtin_ia32_packsswb128 (pack01, pack23);
        res = __builtin_ia32_pmovmskb128 (pack0123);
        if (res != 0xffff)
            break;
        i += 4;
    }
    return i * 4 + __builtin_ctz (~res);
}
```

This loop can also be unrolled, but I have refrained from doing so.

#### Binary search

Here’s the basic binary search:

```c
static int
binary (const int *arr, int n, int key)
{
    int min = 0, max = n;
    while (min < max) {
        int middle = (min + max) >> 1;
        if (key > arr [middle])
            min = middle + 1;
        else
            max = middle;
    }
    return min;
}
```

This will work for `n < INT_MAX/2`. For larger `n`, `middle` must be calculated as `min + ((max - min) >> 1)`, which costs one more subtraction per iteration.
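For completeness, the overflow-safe variant then reads:

```c
/* Overflow-safe binary search: works for any n, at the cost of one
   extra subtraction per iteration when computing the midpoint. */
static int
binary_safe (const int *arr, int n, int key)
{
    int min = 0, max = n;
    while (min < max) {
        int middle = min + ((max - min) >> 1);   /* cannot overflow */
        if (key > arr [middle])
            min = middle + 1;
        else
            max = middle;
    }
    return min;
}
```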

Each iteration here also does two conditional branches, one of which is essentially unpredictable: there's no way for the CPU to know in which of the two halves of the search space the element we're looking for lies, which means we get on average one branch misprediction per two iterations.

There is a way around this unpredictable branch, however: it can be turned into two conditional move instructions. I haven't found a way to tell GCC to do this, so I had to resort to inline assembly (for x86 and AMD64):

```c
static int
binary_cmov (const int *arr, int n, int key)
{
    int min = 0, max = n;
    while (min < max) {
        int middle = (min + max) >> 1;
        asm ("cmpl %3, %2\n\t"
             "cmovg %4, %0\n\t"
             "cmovle %5, %1"
             : "+r" (min), "+r" (max)
             : "r" (key), "g" (arr [middle]), "g" (middle + 1), "g" (middle));
    }
    return min;
}
```

Even though this version has to calculate potentially unneeded values it is still significantly faster, as we shall see.

Given that the linear search is indeed faster than the binary one for small `n`, we can also try to switch to a linear search once the distance between `min` and `max` falls below some favorable threshold, to be determined by benchmarking results.
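A sketch of such a hybrid might look like this (the threshold of 8 is just a placeholder; the whole point is to tune it against benchmark results):

```c
/* Binary search down to a small range, then finish linearly.
   The threshold of 8 is a placeholder, to be tuned per CPU. */
static int
binary_then_linear (const int *arr, int n, int key)
{
    int min = 0, max = n;
    while (max - min > 8) {
        int middle = (min + max) >> 1;
        if (key > arr [middle])
            min = middle + 1;
        else
            max = middle;
    }
    while (min < max && arr [min] < key)
        ++min;
    return min;
}
```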

The binary search can also be unrolled, in which case the loop invariant needs to maintain a minimum distance between `min` and `max`, depending on how often we unroll the loop. If we work under the assumption that we know the array size statically, we can even go so far as to unroll the binary search completely and thereby save the termination check. This can also be combined with a finishing linear search, of course. I have only tested full loop unrolling, with and without a linear finish.

### Observations

#### Methodology

I have written a benchmark program that times a large number of different combinations of the optimizations described above with various array sizes, ranging from 1 to 65536. For each algorithm and array size we first do a few calibration runs to figure out how many random searches are needed so that the program runs for at least two seconds. That number of searches is then repeated 10 times, the two fastest and the two slowest times are thrown away and the rest is averaged. Error bars are so tiny that they can be ignored, for all practical purposes.

The random numbers are generated before the test runs, and we generate at most 100000 of them whereas we typically do tens of millions of searches, so the random number generation overhead is negligible.

For very small array sizes the benchmark overhead might still be at least as large as the time spent in the search, however, so the results for array sizes below at least 8 should be taken with a grain of salt.
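The averaging described above amounts to a trimmed mean; as a small sketch of just that step (not the actual benchmark harness):

```c
#include <stdlib.h>

static int
cmp_double (const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Trimmed mean as described: sort the 10 run times, drop the two
   fastest and the two slowest, and average the remaining six. */
static double
trimmed_mean (double times [10])
{
    qsort (times, 10, sizeof (double), cmp_double);
    double sum = 0.0;
    for (int i = 2; i < 8; ++i)
        sum += times [i];
    return sum / 6.0;
}
```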

#### Trends

Since there are too many combinations to present them all in detail I’ll first talk about which searches perform best and then we’ll look into those in more depth.

As far as linear search is concerned, it pays off to unroll and it pays off to use sentinels. The best non-sentinel linear search I tested unrolls the loop first 8 times and then 4 times. This search in turn is beaten handily across the board by a 32 times unrolled linear search with a sentinel. I have not tested longer unrollings than 32 for the sentinel search. On AMD CPUs, unrolling 32 times is a bit slower than unrolling 16 times, though, while for the Intel CPUs they’re at about the same speed.

The SSE2 searches outperform the “regular” linear searches by quite a margin for large, but not for smaller array sizes. I have tested them on a Core 2 and on a Core i7 machine. On the former they are worse than the best regular linear searches for small array sizes, on the latter they yield about the same performance.

With binary search, conditional move gives a big speedup, although on some CPUs (Pentium 4/D) plain binary search is quite a bit faster below array sizes of 256. Unrolling yields some improvements (around 10%) on a few Intel CPUs that I tested with, as well as on the two AMD CPUs (Athlon and Phenom II), and isn’t significantly slower on any, so it might be worth the hassle if you know the array size statically. Switching to linear search resulted in worse performance for array sizes above 256, no matter which binary search and which linear search I tried it on.

#### Results

Here, then, as a concrete example, is a plot of the run-times of the most interesting algorithms on an Intel Core i7 running at 2.8 GHz for small array sizes:

On the horizontal axis we have the array size and on the vertical axis we have the runtime for one search in seconds. As you can see the curves for the two binary searches with conditional move – plain and unrolled – are almost identical. The unrolled linear search beats plain linear search by more than a factor of two, and the unrolled linear search with sentinel is faster still. For array sizes larger than around 80, the unrolled SSE2 linear search (with sentinel) beats the best regular linear search, but below they are at about the same level. At around an array size of 100 we can see the fastest binary search starting to win against the fastest linear search. The fastest regular linear search can hold on against plain binary search until about an array size of 200. The SSE2 linear search is beaten by plain binary search only at an array size of over 400, which is not on the diagram. The 16-element per conditional branch SSE2 linear search (not shown in the diagram), which outperforms even the “normal” unrolled SSE2 search for larger array sizes, can hold on until about 550.

We can also calculate that at the break even point of 100 each search took on average almost exactly 50 cycles. For the linear search this means that each iteration took one cycle (because if you’re searching for random elements in an array of size 100 you have to visit an average of 50 elements), whereas each binary search iteration took about 6 cycles on average.

On a Core 2 (at 1.86 GHz) things look a bit different:

The SSE2 optimized linear search doesn’t perform nearly as well as the unrolled linear sentinel search below 100, and the break even point with the fastest binary search is at around 64. We can also see that the unrolled binary search performs a bit better than the one without unrolling.

The following table summarizes the most important results for all the CPUs on which I’ve run the benchmarks. It does not include data for the SSE2 optimized linear searches because I have only run those on two of the CPUs in the table.

CPU | Clock | Best | Best linear | Best linear short | B/E | Cycles @ B/E |
---|---|---|---|---|---|---|
Core i7 | 2.8 GHz | binary_cmov | linear_sentinel_32 | linear_sentinel_32 | 97 | 52 |
Core 2 | 1.86 GHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 64 | 116 |
Pentium 4 | 3.4 GHz | binary_cmov_unrolled | linear_sentinel_8 | linear_sentinel_32 | 88 | 113 |
Celeron | 2.66 GHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 88 | 120 |
Atom | 1.6 GHz | binary_cmov_unrolled_linear_sentinel | linear_sentinel_32 | linear_sentinel_32 | 88 | 102 |
AMD Phenom II | 3.2 GHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 29 | 36 |
AMD Athlon | 1.5 GHz | binary_cmov_unrolled | linear_sentinel_16 | linear_sentinel_16 | 26 | 53 |
Core | 2.16 GHz | binary_cmov_unrolled | linear_84 | linear_sentinel_32 | 54 | 67 |
Geode | 430 MHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 25 | 100 |

The column “Best” gives the asymptotically fastest search. “Best linear” is the asymptotically fastest linear search and “Best linear short” is the fastest linear search for small array sizes. “B/E” (for “break even”) is the array size for which the fastest binary search overtakes the fastest linear search. The last column, “Cycles @ B/E” gives the average number of cycles per search at the break even point.

As a last interesting tidbit, here’s a table that gives the performance of the fastest linear searches for an array size of 1000, in number of processed array elements per clock cycle:

CPU | Regular | SSE2 |
---|---|---|
Core i7 | 1 | 2.6 |
Core 2 | 0.4 | 0.6 |

I am very impressed by the Nehalem core.

### Conclusion

To put everything I’ve learned from this into a few words: If you need to search through a sorted array of integers and performance is really, really important, use linear search if your array is below around 64 elements in size, binary search if it’s above. If you use linear search, it’s best to use a sentinel and to unroll the loop (16 times recommended). If you use binary search, make sure your code uses conditional moves inside the loop instead of a conditional branch. If you’re very, very serious about performance and know the array size statically, completely unroll the binary search.

### Resources

If you want to play around with all this yourself, compare performance in detail or send me benchmarks for CPUs I haven’t tested myself, visit http://github.com/schani/linbin. I’d be especially interested in non-x86 CPUs. If you can improve on my implementations, I’d be very keen to hear about it.

### Credits

Thanks to Stefan Kral for lots of useful advice and for the first version of the SSE2 linear search, and to Herbert Poetzl for doing lots of benchmark runs.

### Comments

Awesome post! I’ve been wanting to do a similar test for a long time but never got around to it.

Thanks!

Did you really have to write the code?!? Come on, that is what big-O notation is for. The main problem people have is truly understanding what big-O means. What I mean is the following: we know for a fact that a linear search is O(N) while a binary search is O(log N); of course, if we do not understand our algorithm (in this case binary search) we will think that O(log N) will always outperform O(N), but that is not true.

If you carefully think about the definition of big-O notation you will find that in most cases we are worried about N being close to a large number. This is a very important fact, because we are ignoring the exact and precise performance of our implementation. If we were to talk about the performance of the binary search we would have to say that it is O(log N) + K, where K represents the constant overhead that our algorithm has. Getting back to big numbers: if we say that K is constant, it means that as N goes to infinity K’s contribution becomes negligible, since a small number divided by a large number is nearly 0. At this point we can agree that if our problem involves large numbers, we ought to ignore K, since it adds no value. Nevertheless, there are cases where O(N) does outperform O(log N) because we are working with a small number of items. The linear search algorithm has (nearly) no overhead, meaning that K is 0, while in the binary search we do have K; because the number of items is small, we cannot ignore K! That is, O(N) is better than O(log N) + K because K is large enough when dealing with a small list!!

As you can see, if the algorithm is correctly understood as well as the environment in which we use it, there is actually no need to write test cases etc… you could have reached the conclusion using pen and paper ;).

Nevertheless, I think this is a good post for other people, but please explain somewhere that the theory is right! People should stop using big-O until they really understand its meaning and the algorithm’s implementation…

You misunderstand the purpose of my little project. What you’re telling us about big-O is common knowledge and barely worth mentioning – I assume my readers know this stuff. My goal was to figure out exactly how linear vs binary search behave for small array sizes on modern architectures and how to get the most out of them. I fail to see how you could have worked that out using pen and paper.

I did understand the purpose; what would have been more “insightful” would have been a mathematical proof of the algorithm being impacted by the hardware improvements, so that you could state what the real impact would be. Let me get down to it and I’ll get back to you with it.

Microprocessor hardware is only one factor. Algorithm is another. C compiler is another. Assembler is another. Linker is another. And I probably forgot to mention a couple of other factors:)

All factors are interdependent, and most of the arcane details are neither publicly disclosed nor well-documented.

I understand one would _like_ to have a grand analytical model encompassing all these levels, but it is infeasible, given the size of the model required and the number of unknowns.

An empirical method is the only way of obtaining _any usable_ results; the idea of “getting a mathematical proof” is simply preposterous.

Small problem sizes _do_ occur in practice, and big-O is of no use there.

Determining the exact break-even points analytically is a zillion times more complex than doing empirical measurements, as analytical methods would require precise modeling of the microprocessors, compilers, assemblers, and linkers. Any volunteers?

True, but would be more interesting, at least from my point of view, that does not mean that the post is not interesting! but rather an “easy thing to do” (take no offence here, please)

Mandel,

Big-O is not accurate enough to figure out *exactly* where the line is drawn in a comparison of linear vs binary. The point of Schani’s post was to figure out at what array size it makes sense to use linear rather than binary search. We all know that linear can be faster than binary for small sizes, but what is “small”? You can’t use big-O to figure that out. You have to write the code and test it to be sure. You may have noted that Schani’s results differ between architectures, and there’s no way comparing O(N) + K vs O(log N) + K would ever have allowed Schani to arrive at the results he got from actual testing.

You might be interested in this, possibly much faster, implementation of binary search:

```
int N, A[N];

int binary_search(int val)
{
    int i, step;
    for (step = 1; step < N; step <>= 1)
        if (i | step < N && A[i | step] <= val)
            i |= step;
    return i;
}
```

Sorry, the pasting screwed something up. Trying again:

```
int N, A[N];

int binary_search(int val)
{
    int i, step;
    for (step = 1; step < N; step <>= 1)
        if (i | step < N && A[i | step] <= val)
            i |= step;
    return i;
}
```

please check your code again. it is broken in more than one way…

also why should it be faster?

I fixed, i.e. rewrote, that code below; it sounds promising, and works for arrays of power-of-2 size.

The code is broken, but it looks like a messy recollection of an actually good idea: having just one conditional move per iteration, by representing the interval as [i, i + step[ rather than as [min, max[. However, the range check completely misses the point.

I believe this algorithm can’t work, even with the commented-out test, for non-power-of-2 sizes; I would invoke it just on the main fraction of the array.

```c
// Only works when N is a power of 2!
static int binary_smart(const int a[], int n, int key) {
    int i = 0, step;
    for (step = n / 2; step > 0; step >>= 1)
        // This test is unneeded.
        // if ((i | step) < n && a[i | step] <= key)
        if (a[i | step] <= key)
            i |= step;
    return i;
}
```

I will try introducing cmov and doing a small comparison with the other binary search on my machine.

Complete correctness checking code here:

http://pastie.org/1115219

With this version (the one in my other comment) and without any tuning, I got the fastest version of them all; I checked most extensively at n = 4096, but it seems to be the fastest also down to n = 128 and n = 64.

==DETAILS==

On my system, binary, binary_cmov and binary_smart (which the compiler automatically optimizes using cmov) take 1, 0.7, and 0.5 s respectively, when launched as:

```
time ./searchtest 10000000 $ALG_NAME 4096
```

The second fastest routine, here, seems to be binary_cmov_unrolled, taking ~0.6 s.

However, adding -funroll-loops to gcc command line (gcc 4.4.1 from Ubuntu 9.10), it gets down to ~0.44s. To note that -funroll-loops doesn’t improve all routines, just some of them.

Given the amount of optimization effort spent, it would maybe be worth trying to combine the two routines, or to fix this version. This is a first attempt (also at http://pastie.org/1115514):

static int binary_smart_(const int *a, int n, int key) {

int idx;

//idx = 2^(largest i| 2^i < n):

for (idx = 1; idx < n; idx <>= 1;

if (a[idx] > key) {

return binary_smart_(a, idx, key);

} else {

//search_func_t binary_cmov_unrolled = get_binary_cmov_unrolled(n);

return binary_cmov(a + idx, n – idx, key);

}

}

However, this version takes ~0.7 s, rather than ~0.44, I believe because of the first loop. It can probably be optimized through some fancy instruction, though.

Otherwise, it’s still the best version for arrays of size 2^n.

Very interesting.

I’m not sure your conclusions are totally right, though. As the array size and total search time decrease, the proportional effect of things that are external to the search implementation (e.g.: what level of the memory hierarchy is the array data in? what about the search function code? what is your data alignment? what other code runs between each search?) will increase.

The conclusion I would draw from your results is: For a general search implementation, use binary search with cmov (which has the best asymptotic complexity and can be reasonably expected to always beat or match a naive binary search; unroll if you really want to). If performance is important and you’re dealing with short arrays, then I would argue that no general rule can be given except to profile and optimise specifically for your use case.

I’m slightly surprised that the compiler didn’t emit a cmov itself, although I probably shouldn’t be.

It should do better if you use the ternary operator:

```c
min = (key > arr[middle]) ? middle + 1 : min;
max = (key <= arr[middle]) ? middle : max;
```

For me, with gcc 4.4.3 (-O3), I get almost exactly the same cmov instructions as you wrote by hand.

So did you update the sgen code :-)

No, I didn’t :-). Yet!