initial guess is simply a linear interpolation from the minimum to the maximum, the first pass can be greatly optimized by realizing that all we really do is create a histogram of the distribution. Thus, in the first pass, we do not store the bucket or bucket index for the elements, and we do not need a binary search to find the bucket counter to increment (see Figure 7). The steps of the bucketsort then become:

• Creating histogram
• Refining pivots
• Counting elements per bucket
• Repositioning the elements
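Because the initial pivots are a plain linear interpolation between the minimum and maximum, the bucket of an element can be computed arithmetically rather than by binary search. The histogram step above can be sketched as the following CUDA kernel; the kernel name, the grid layout, and the use of `atomicAdd` on a global counter array are our illustrative assumptions, not code from the paper:

```cuda
// First pass: build a histogram of the distribution. With uniformly
// spaced pivots, the bucket index follows directly from a linear
// interpolation, so no binary search is needed and no per-element
// bucket index has to be stored.
#define NUM_BUCKETS 1024  // illustrative; see the discussion of divisions

__global__ void histogramKernel(const float *input, int n,
                                float minVal, float maxVal,
                                unsigned int *bucketCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Map the element linearly from [min, max] onto [0, NUM_BUCKETS).
    float t = (input[i] - minVal) / (maxVal - minVal);
    int bucket = min((int)(t * NUM_BUCKETS), NUM_BUCKETS - 1);

    // Atomically bump the counter of that bucket.
    atomicAdd(&bucketCount[bucket], 1u);
}
```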

When creating the offsets for each bucket for the final pass, we also take the opportunity to make sure that each bucket starts at a float4-aligned offset, thereby eliminating the need for an extra pass before merge-sorting the lists.
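Turning the per-bucket counts into starting offsets amounts to a prefix sum in which each bucket's extent is rounded up to a multiple of four elements, so that the next bucket begins on a float4 boundary. A minimal host-side C sketch of that computation follows; the function and variable names are ours, and on the GPU this step would be a parallel prefix sum:

```c
#include <stddef.h>

/* Turn bucket counts into starting offsets, rounding each bucket's
 * extent up to a multiple of 4 elements so that every bucket starts
 * at a float4-aligned offset. Sequential illustration of the idea. */
void computeAlignedOffsets(const unsigned int *count,
                           unsigned int *offset, size_t numBuckets)
{
    unsigned int next = 0;
    for (size_t b = 0; b < numBuckets; ++b) {
        offset[b] = next;
        /* advance by the bucket size, rounded up to a multiple of 4 */
        next += (count[b] + 3u) & ~3u;
    }
}
```

For counts {5, 2, 8, 3}, the buckets start at elements 0, 8, 12, and 20, so each sublist can later be read as float4 values without a repacking pass.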

Choice of divisions: As explained above, we will need to split the list into at least d = 2p parts, where p is the number of available processors. It is easy to realize, though, that since each thread will perform an atomic increment on one of only d addresses, that choice would lead to many stalls. However, the mergesort algorithm does not impose any upper limit on the number of sublists. In fact, increasing the number of sublists in the bucketsort decreases the amount of work that the mergesort needs to do. Meanwhile, splitting the list into too many parts would lead to longer binary searches for each bucketsort thread and more traffic between the CPU and GPU. In our tests, with 32 processors (on the GeForce 8600), splitting the list into 1024 sublists seems to be the best choice.

In the histogram pass, where no binary search is required, we could, however, use several times more buckets to improve the quality of the pivot points if the distribution is expected to be very uneven and spiky.

Random shuffling of input elements: Since the parallel bucketsort relies on atomic increments of each bucket size, nearly sorted lists can cause many serialized accesses to the same counter, which limits the parallelism and lowers the speed. To avoid a significant slowdown, we can therefore add one pass at the beginning of the algorithm, without compromising generality, which randomly shuffles the elements to be sorted. This is done in O(N) time using CUDA and the atomic exchange instruction.
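The text gives only the complexity and the primitive used, so the kernel below is our guess at such a shuffle pass: each thread scatters its element into a pseudo-randomly chosen slot of a second, sentinel-initialized buffer, claiming slots with atomicExch and linearly probing past occupied ones. The hash, the NaN sentinel (which assumes the input itself contains no NaNs), and all names are illustrative assumptions:

```cuda
#include <math.h>

// One possible O(N) shuffle pass built on atomicExch. The output
// buffer must be pre-filled with NaN (the "free slot" sentinel).
__global__ void shuffleKernel(const float *in, float *out,
                              int n, unsigned int seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Cheap per-thread hash picks a pseudo-random starting slot.
    unsigned int h = ((unsigned int)i ^ seed) * 2654435761u;
    int slot = (int)(h % (unsigned int)n);

    float v = in[i];
    for (;;) {
        // atomicExch deposits v and returns the previous occupant.
        float old = atomicExch(&out[slot], v);
        if (isnan(old)) break;   // slot was free: v is placed, done
        v = old;                 // evicted a real element: carry it on
        slot = (slot + 1) % n;   // linear probing to the next slot
    }
}
```

Because atomicExch swaps values atomically, every element is at all times either in the buffer or held by exactly one thread, so none is lost or duplicated, and the loop terminates once each of the n elements has claimed one of the n slots.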