### Introduction

In the source of SGen, Mono’s new garbage collector currently in development, there’s a little linear search function for a small, fixed-size array, with the comment “do a binary search or lookup table later”. One of our Google Summer of Code students took this to heart and implemented a binary search that was unfortunately not only incorrect but also slower than the linear search. This is not too surprising, because implementing a binary search is anything but trivial, despite the seeming simplicity of the algorithm. Also, common wisdom says that for small array sizes, linear search beats binary search.

I took this as the starting point for a little project, to find out about the respective virtues of linear and binary search on modern CPUs.

To limit the scope of this project, I concerned myself only with searching in small to medium sized sorted arrays of integers, producing the index of the first element that is equal to or greater than the search key. My results will not necessarily apply to other applications.

Although I will briefly explore the usage of SIMD instructions to speed up linear search this is not the primary purpose of this project, so I haven’t benchmarked those implementations as extensively, but I will present a few results.

### Implementation

#### Linear search

This is the basic linear search:

static int linear (const int *arr, int n, int key) { int i = 0; while (i < n) { if (arr [i] >= key) break; ++i; } return i; }

Every loop iteration here does two conditional branches: One to check whether we’re already at the element we’re looking for and the other to check whether we’re at the end of the array. Fortunately, both are easy to predict: By predicting that we’re not at the end of the array and that we haven’t found the element yet, we get exactly one branch misprediction per search, which is hard to improve upon.

We can, however, reduce the total number of branches. First, by unrolling the loop, like this:

static int linear_4 (const int *arr, int n, int key) { int i = 0; while (i + 3 < n) { if (arr [i + 0] >= key) return i + 0; if (arr [i + 1] >= key) return i + 1; if (arr [i + 2] >= key) return i + 2; if (arr [i + 3] >= key) return i + 3; i += 4; } while (i < n) { if (arr [i] >= key) break; ++i; } return i; }

This will reduce the number of conditional branches per iteration from 2 to up to 1.25, depending on `n`

. For `n < 4`

this unrolling will not speed up search at all, of course. One can strike a compromise between performance for small loops and the benefits of putting as many iterations into an unrolling as possible by unrolling the loop multiple times, with the longer unrollings first.

There is another trick to get completely rid of the end-check, namely putting a sentinel value after the end of the last valid array element. In our case this sentinel should be the largest possible integer (i.e., `MAX_INT`

). It will always cause the key-check to succeed and the loop to finish after the last element. The basic search function for an array with a sentinel looks like this:

static int linear_sentinel (const int *arr, int key) { int i = 0; for (;;) { if (arr [i] >= key) return i; ++i; } }

Now we’re down to one conditional and one unconditional branch per iteration. By unrolling we can reduce the unconditional branch to a fraction. Note that with a sentinel we don’t need an end-check even with unrolling, so searches through very small arrays also benefit from unrolling.

#### Linear search with SSE2

Using a SIMD instruction set like SSE2 we can further reduce the number of instructions per array element. First, we can load 4 array elements in one go into a 128 bit register. Then we can do 4 comparisons in parallel (with `PCMPGTD`

) if we have prepared another register with 4 copies of `key`

. We can then get the results of those comparisons into a general purpose register (with `PMOVMSKB`

) which we can then compare to the result we would expect if none of the 4 array elements match our key yet, in which case we increment our counter and repeat the loop. If the result indicated that one of the keys does match, we can use the `BSF`

instruction, which counts the number of trailing zero bits in a register, to determine the index of the matching element without needing another loop. Using GCC’s SSE2 builtins, the basic loop looks like this:

static int linear_sentinel_sse2 (const int *arr, int n, int key) { v4si *in_data = (v4si*)arr; v4si key4 = { key, key, key, key }; int i = 0; for (;;) { v4si tmp = __builtin_ia32_pcmpgtd128 (key4, in_data [i]); int res = __builtin_ia32_pmovmskb128 ((v16qi)tmp); if (res != 0xffff) break; ++i; } return i * 4 + __builtin_ctz (~res) / 4; }

This loop can of course be unrolled, as well. Note that after the sentinel we require that there be at least 3 more dummy array elements so that the load doesn’t result in a segmentation fault. If the array is properly aligned to 16 bytes that will always be the case.

We are now at one conditional branch per 4 array elements, but we can go even lower. Using the `PACKSSDW`

and `PACKSSWB`

instructions we can pack the results of 16 key comparisons into one vector which we can then move into a general purpose register like above and branch on it. We’ll require even more dummy memory after the sentinel. Here’s what it looks like:

static int linear_sentinel_sse2_nobranch (const int *arr, int n, int key) { v4si *in_data = (v4si*)arr; v4si key4 = { key, key, key, key }; int i = 0; for (;;) { v4si cmp0 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 0]); v4si cmp1 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 1]); v4si cmp2 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 2]); v4si cmp3 = __builtin_ia32_pcmpgtd128 (key4, in_data [i + 3]); v8hi pack01 = __builtin_ia32_packssdw128 (cmp0, cmp1); v8hi pack23 = __builtin_ia32_packssdw128 (cmp2, cmp3); v16qi pack0123 = __builtin_ia32_packsswb128 (pack01, pack23); int res = __builtin_ia32_pmovmskb128 (pack0123); if (res != 0xffff) break; i += 4; } return i * 4 + __builtin_ctz (~res); }

This loop can also be unrolled, but I have refrained from doing so.

#### Binary search

Here’s the basic binary search:

static int binary (const int *arr, int n, int key) { int min = 0, max = n; while (min < max) { int middle = (min + max) >> 1; if (key > arr [middle]) min = middle + 1; else max = middle; } return min; }

This will work for `n < MAX_INT/2`

. For larger `n`

, `middle`

must be calculated as `min + ((max - min) >> 1)`

, which is one subtraction per iteration more.

Each iteration here also does two unconditional branches, one of which is essentially unpredictable. There’s no way for the CPU to know in which of the two halves of the search space the element you’re looking for will be, which means that we’ll get on average one branch misprediction per two iterations.

There is a way around this unconditional branch, however: It can be turned into two conditional move instructions. I haven’t found a way to tell GCC to do this, so I had to do it using inline assembly (for x86 and AMD64):

static int binary_cmov (const int *arr, int n, int key) { int min = 0, max = n; while (min < max) { int middle = (min + max) >> 1; asm ("cmpl %3, %2\n\tcmovg %4, %0\n\tcmovle %5, %1" : "+r" (min), "+r" (max) : "r" (key), "g" (arr [middle]), "g" (middle + 1), "g" (middle)); } return min; }

Even though this version has to calculate potentially unneeded values it is still significantly faster, as we shall see.

Given that the linear search is indeed faster than the binary one for small `n`

, we can also try to switch to a linear search once the distance between `min`

and `max`

falls below some favorable threshold, to be determined by benchmarking results.

The binary search can also be unrolled, in which case the loop invariant needs to maintain a minimum distance between `min`

and `max`

, depending on how often we unroll the loop. If we work under the assumption that we know the size of the array statically, we can even go so far as to unroll the binary search completely and thereby save the termination check. This can also be combined with a finishing linear search, of course. I have only tested full loop unrolling, with and without linear finish.

### Observations

#### Methodology

I have written a benchmark program that times a large number of different combinations of the optimizations described above with various array sizes, ranging from 1 to 65536. For each algorithm and array size we first do a few calibration runs to figure out how many random searches are needed so that the program runs for at least two seconds. That number of searches is then repeated 10 times, the two fastest and the two slowest times are thrown away and the rest is averaged. Error bars are so tiny that they can be ignored, for all practical purposes.

The random numbers are generated before the test runs, and we generate at most 100000 of them whereas we typically do tens of millions of searches, so the random number generation overhead is negligible.

For very small array sizes the benchmark overhead might still be at least as large as the time spent in the search, however, so the results for array sizes below at least 8 should be taken with a grain of salt.

#### Trends

Since there are too many combinations to present them all in detail I’ll first talk about which searches perform best and then we’ll look into those in more depth.

As far as linear search is concerned, it pays off to unroll and it pays off to use sentinels. The best non-sentinel linear search I tested unrolls the loop first 8 times and then 4 times. This search in turn is beaten handily across the board by a 32 times unrolled linear search with a sentinel. I have not tested longer unrollings than 32 for the sentinel search. On AMD CPUs, unrolling 32 times is a bit slower than unrolling 16 times, though, while for the Intel CPUs they’re at about the same speed.

The SSE2 searches outperform the “regular” linear searches by quite a margin for large, but not for smaller array sizes. I have tested them on a Core 2 and on a Core i7 machine. On the former they are worse than the best regular linear searches for small array sizes, on the latter they yield about the same performance.

With binary search, conditional move gives a big speedup, although on some CPUs (Pentium 4/D) plain binary search is quite a bit faster below array sizes of 256. Unrolling yields some improvements (around 10%) on a few Intel CPUs that I tested with, as well as on the two AMD CPUs (Athlon and Phenom II), and isn’t significantly slower on any, so it might be worth the hassle if you know the array size statically. Switching to linear search resulted in worse performance for array sizes above 256, no matter which binary search and which linear search I tried it on.

#### Results

Here, then, as a concrete example, is a plot of the run-times of the most interesting algorithms on an Intel Core i7 running at 2.8 GHz for small array sizes:

On the horizontal axis we have the array size and on the vertical axis we have the runtime for one search in seconds. As you can see the curves for the two binary searches with conditional move – plain and unrolled – are almost identical. The unrolled linear search beats plain linear search by more than a factor of two, and the unrolled linear search with sentinel is faster still. For array sizes larger than around 80, the unrolled SSE2 linear search (with sentinel) beats the best regular linear search, but below they are at about the same level. At around an array size of 100 we can see the fastest binary search starting to win against the fastest linear search. The fastest regular linear search can hold on against plain binary search until about an array size of 200. The SSE2 linear search is beaten by plain binary search only at an array size of over 400, which is not on the diagram. The 16-element per conditional branch SSE2 linear search (not shown in the diagram), which outperforms even the “normal” unrolled SSE2 search for larger array sizes, can hold on until about 550.

We can also calculate that at the break even point of 100 each search took on average almost exactly 50 cycles. For the linear search this means that each iteration took one cycle (because if you’re searching for random elements in an array of size 100 you have to visit an average of 50 elements), whereas each binary search iteration took about 6 cycles on average.

On a Core 2 (at 1.86 GHz) things look a bit different:

The SSE2 optimized linear search doesn’t perform nearly as well as the unrolled linear sentinel search below 100, and the break even point with the fastest binary search is at around 64. We can also see that the unrolled binary search performs a bit better than the one without unrolling.

The following table summarizes the most important results for all the CPUs on which I’ve run the benchmarks. It does not include data for the SSE2 optimized linear searches because I have only run those on two of the CPUs in the table.

CPU | Clock | Best | Best linear | Best linear short | B/E | Cycles @ B/E |
---|---|---|---|---|---|---|

Core i7 | 2.8 GHz | binary_cmov | linear_sentinel_32 | linear_sentinel_32 | 97 | 52 |

Core 2 | 1.86 GHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 64 | 116 |

Pentium 4 | 3.4 GHz | binary_cmov_unrolled | linear_sentinel_8 | linear_sentinel_32 | 88 | 113 |

Celeron | 2.66 GHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 88 | 120 |

Atom | 1.6 GHz | binary_cmov_unrolled_linear_sentinel | linear_sentinel_32 | linear_sentinel_32 | 88 | 102 |

AMD Phenom II | 3.2 GHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 29 | 36 |

AMD Athlon | 1.5 GHz | binary_cmov_unrolled | linear_sentinel_16 | linear_sentinel_16 | 26 | 53 |

Core | 2.16 GHz | binary_cmov_unrolled | linear_84 | linear_sentinel_32 | 54 | 67 |

Geode | 430 MHz | binary_cmov_unrolled | linear_sentinel_32 | linear_sentinel_32 | 25 | 100 |

The column “Best” gives the asymptotically fastest search. “Best linear” is the asymptotically fastest linear search and “Best linear short” is the fastest linear search for small array sizes. “B/E” (for “break even”) is the array size for which the fastest binary search overtakes the fastest linear search. The last column, “Cycles @ B/E” gives the average number of cycles per search at the break even point.

As a last interesting tidbit, here’s a table that gives the performance of the fastest linear searches for an array size of 1000, in number of processed array elements per clock cycle:

CPU | Regular | SSE2 |
---|---|---|

Core i7 | 1 | 2.6 |

Core 2 | 0.4 | 0.6 |

I am very impressed by the Nehalem core.

### Conclusion

To put everything I’ve learned from this into a few words: If you need to search through a sorted array of integers and performance is really, really important, use linear search if your array is below around 64 elements in size, binary search if it’s above. If you use linear search, it’s best to use a sentinel and to unroll the loop (16 times recommended). If you use binary search, make sure your code uses conditional moves inside the loop instead of a conditional branch. If you’re very, very serious about performance and know the array size statically, completely unroll the binary search.

### Resources

If you want to play around with all this yourself, compare performance in detail or send me benchmarks for CPUs I haven’t tested myself, visit http://github.com/schani/linbin. I’d be especially interested in non-x86 CPUs. If you can improve on my implementations, I’d be very keen to hear about it.

### Credits

Thanks to Stefan Kral for lots of useful advice and for the first version of the SSE2 linear search, and to Herbert Poetzl for doing lots of benchmark runs.