
Since the initial guess is simply a linear interpolation from the minimum to the maximum, the first pass can be greatly optimized by realizing that all we really do is create a histogram of the distribution. Thus, in the first pass, we do not store the bucket index for the elements, and we do not need to do a binary search to find the bucket to increment (see Figure 7). The steps of the bucketsort then become:

• Creating the histogram


• Refining the pivots

• Counting the elements per bucket

• Repositioning the elements
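The first of these passes can be sketched serially in C; the function name and layout below are illustrative assumptions, not the paper's code. Because the initial pivots interpolate linearly between the minimum and maximum, an element's bucket index is a direct computation rather than a binary search, and nothing is stored per element:

```c
/* Sketch of the histogram pass (hypothetical names).  With linearly
   interpolated pivots, the bucket index of a key is computed directly,
   so no binary search is needed and no per-element bucket index is
   stored; only the per-bucket counters are updated. */
void build_histogram(const float *keys, int n,
                     float min, float max,
                     int *hist, int d)
{
    for (int i = 0; i < d; ++i)
        hist[i] = 0;
    float scale = d / (max - min);
    for (int i = 0; i < n; ++i) {
        int b = (int)((keys[i] - min) * scale);
        if (b >= d)
            b = d - 1;    /* clamp keys equal to max */
        hist[b]++;        /* an atomic increment on the GPU */
    }
}
```

On the GPU, each thread would process one or more elements and the plain increment would become an atomic increment on the counter.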

When creating the offsets for each bucket for the final pass, we also take the opportunity to make sure that each bucket starts at a float4-aligned offset, thereby eliminating the need for an extra pass before merge-sorting the lists.
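The offset computation can be sketched as a prefix sum over the bucket counts whose partial sums are rounded up to multiples of four floats; the names here are assumptions, not the paper's code:

```c
/* Sketch: turn per-bucket counts into starting offsets, rounding each
   bucket's extent up to a multiple of 4 floats so that every bucket
   (and hence every sublist) begins at a float4-aligned position. */
void compute_offsets(const int *count, int *offset, int d)
{
    int pos = 0;
    for (int i = 0; i < d; ++i) {
        offset[i] = pos;
        pos += (count[i] + 3) & ~3;   /* round up to a multiple of 4 */
    }
}
```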

Choice of divisions: As explained above, we will need to split the list into at least d = 2p parts, where p is the number of available processors. It is easy to realize, though, that since each thread will do an atomic increase on one of d addresses, such a small d would lead to a lot of stalls. However, the mergesort algorithm does not impose any upper limit on the number of sublists. In fact, increasing the number of sublists in the bucketsort decreases the amount of work that the mergesort needs to do. Meanwhile, splitting the list into too many parts increases the traffic between the CPU and GPU. In our tests, with 32 processors (on the GeForce 8600), splitting the list into 1024 sublists seems to be the best choice.

In the histogram pass, where no binary search is required, we could, however, use several times more buckets to improve the quality of the pivot points when the distribution is expected to be very uneven and spiky.
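One possible serial sketch of such a refinement, under assumed names: walk the cumulative histogram counts and emit a new pivot each time roughly n/d further elements have been consumed, so that the d sublists come out roughly equal in size:

```c
/* Sketch of pivot refinement from an m-bucket histogram (m >= d).
   Histogram bucket i covers [min + i*w, min + (i+1)*w); a pivot is
   emitted whenever the cumulative count crosses the next multiple
   of the ideal sublist size n/d.  Names are hypothetical. */
void refine_pivots(const int *hist, int m, int n,
                   float min, float max, float *pivot, int d)
{
    float w = (max - min) / m;
    int target = n / d;
    int acc = 0, next = 1;
    for (int i = 0; i < m && next < d; ++i) {
        acc += hist[i];
        while (next < d && acc >= next * target)
            pivot[next++] = min + (i + 1) * w;
    }
    pivot[0] = min;   /* the first sublist always starts at min */
}
```

With more histogram buckets than sublists (m several times d), the pivots land closer to the true quantiles of an uneven distribution.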

Random shuffling of input elements: Since the parallel bucketsort relies on atomic increments of each bucket size, nearly sorted lists can cause many serialized accesses to the same counter, which limits the parallelism and lowers the speed. To avoid a significant slow-down, we can therefore add one pass at the beginning of the algorithm, without compromising generality, which randomly shuffles the elements to be sorted. This is done in O(N) time using CUDA and the atomic exchange instruction.
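A serial O(N) analogue of that extra pass is a Fisher-Yates shuffle; in the CUDA version each swap would be realized with the atomic exchange instruction. The function below is a sketch with assumed names, not the paper's kernel:

```c
#include <stdlib.h>

/* Sketch of the optional shuffling pass: an O(N) Fisher-Yates shuffle
   of the input.  In the GPU version the swap maps to an atomicExch()
   on the target slot, so a nearly sorted input no longer hits the
   same bucket counter many times in a row. */
void shuffle_input(float *a, int n)
{
    for (int i = n - 1; i > 0; --i) {
        int j = rand() % (i + 1);
        float t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```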