
Since the initial guess is simply a linear interpolation from the minimum to the
maximum, the first pass can be greatly optimized by realizing that all we
really do is create a histogram of the distribution. Thus, in the first pass,
we do not store the bucket or bucket index for the elements, and we do not need
to do a binary search to find the bucket to increase (see Figure 7). The steps
of the bucketsort then become:


•    Creating the initial pivot points

•    Counting the elements per bucket

•    Storing the elements
When creating the offsets for each bucket for the final pass, we also take the
opportunity to make sure that each bucket starts at a float4-aligned offset,
thereby eliminating the need for an extra pass before merge-sorting the lists.
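Building those offsets can be sketched as a small host-side prefix sum (the
helper name is hypothetical): rounding each bucket's starting offset up to a
multiple of four floats keeps every bucket float4-aligned, so no extra
alignment pass is needed before the merge sort.

```cpp
// Host-side sketch (assumed names) of building output offsets from
// the bucket counts. Each offset is rounded up to a multiple of 4
// floats so that every bucket starts float4-aligned.
void computeAlignedOffsets(const unsigned int *bucketCount,
                           unsigned int *bucketOffset, int numBuckets)
{
    unsigned int offset = 0;
    for (int b = 0; b < numBuckets; ++b) {
        bucketOffset[b] = offset;
        // Advance by the bucket size, rounded up to float4 alignment.
        offset += (bucketCount[b] + 3) & ~3u;
    }
}
```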


Choice of divisions: As explained above, we will need to split the list into at
least d = 2p parts, where p is the number of available processors. It is easy
to realize, though, that since each thread will do an atomic increase on one of
only d addresses, that choice would lead to a lot of stalls. However, the
mergesort algorithm does not impose any upper limit on the number of sublists.
In fact, increasing the number of sublists in bucketsort decreases the amount
of work that mergesort needs to do. Meanwhile, splitting the list into too many
parts would lead to longer binary searches for each bucketsort thread and more
traffic between the CPU and GPU. In our tests, with 32 processors (on the
GeForce 8600), splitting the list into 1024 sublists seems to be the best
choice.


In the histogram pass, where no binary search is required, we could, however,
use several times more buckets to improve the quality of the pivot points if
the distribution is expected to be very uneven and spiky.
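Refining pivots from such an oversampled histogram can be sketched as follows
(host-side, with assumed names): with many fine bins we walk the counts and
emit a pivot whenever roughly n/d more elements have been accumulated, which
evens out the buckets even for spiky distributions.

```cpp
// Host-side sketch (hypothetical names) of pivot refinement from an
// oversampled histogram: numFineBins is several times larger than d,
// and a pivot is emitted each time another n/d elements have been
// accumulated, so each final bucket receives roughly equal work.
void refinePivots(const unsigned int *fineCount, int numFineBins,
                  float minVal, float maxVal,
                  float *pivot, int d, int n)
{
    unsigned int perBucket = n / d;
    unsigned int acc = 0;
    int next = 1;  // pivot[0] is implicitly minVal
    float binWidth = (maxVal - minVal) / numFineBins;

    for (int b = 0; b < numFineBins && next < d; ++b) {
        acc += fineCount[b];
        // Enough elements seen: place the next pivot at this bin's edge.
        if (acc >= (unsigned int)next * perBucket)
            pivot[next++] = minVal + (b + 1) * binWidth;
    }
}
```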


Shuffling of input elements: Since the parallel bucketsort relies on atomic
increments of each bucket size, nearly sorted lists can cause many serialized
accesses to the same counter, which can limit the parallelism and lower the
speed. To avoid a significant slow-down, we can therefore add one pass at the
beginning of the algorithm, without compromising generality, which randomly
shuffles the elements to be sorted. This is done in O(N) time using CUDA and
the atomic exchange instruction.
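One way to sketch such a shuffle pass with atomic exchange (assumed names; the
exact scheme in the original may differ) is an out-of-place scatter in which a
thread that displaces another element keeps exchanging until it hits an empty
slot. The sketch assumes the output array is pre-filled with a NaN sentinel and
that the input contains no NaNs:

```cuda
// Sketch of the shuffling pass (hypothetical names). Each thread
// scatters its element to a pseudo-random slot with atomicExch; if a
// value was already there, the displaced value is re-inserted at a
// new random slot until an empty (NaN) slot is found. Requires: out[]
// pre-filled with NaN on the host, input free of NaNs.
__global__ void shufflePass(const float *in, float *out,
                            int n, unsigned int seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float val = in[i];
    unsigned int h = ((unsigned int)i * 2654435761u) ^ seed;

    for (;;) {
        int target = h % n;
        // atomicExch writes our value and returns the one it displaced.
        val = atomicExch(&out[target], val);
        if (isnan(val)) break;            // slot was empty: done
        h = h * 1664525u + 1013904223u;   // rehash and carry on
    }
}
```

Since there are exactly n elements and n slots, every displaced value
eventually lands in an empty slot, and the pass stays O(N) in expectation.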