BucketSort implementation is not completely correct

I’ve tested the code in the video, but when you add larger numbers like 888 to the array and test your bucketSort class, there might be out of scope error. Because 888/3 = 296 and you don’t have any bucket at index 296 of buckets list.

1 Like

Agreed. If you check Wikipedia about the Bucket Sort algorithm it accounts for the max value stored when determining the bucket index. For k buckets with a max value of M, the bucket for the value at index i should be floor(k*value[i]/M):

So your example would be floor(3*888/888) = floor(3) = 2 (assuming 888 were the max value).

Note: this requires one iteration through the whole array at the start to get the max value.

1 Like

But I think this one is also not accurate.
If we have an array of {3, 1, 5, 7, 4, 5, 4, 888},
according to this formula “floor(k*value[i]/M)” there will be only two buckets, which makes this algorithm very slow.

So, I used the old formula which is in the video course, but I used std::map for the buckets.
Here is my code:

static void BucketSort(std::vector &v, int numberOfBuckets){
std::map<int, std::list> buckets;
for (auto item : v){
buckets[item / numberOfBuckets].push_back(item);
}

int i = 0;
for (auto &[key, bucket] : buckets){
	bucket.sort();
	for(auto item : bucket)
		v[i++] = item;
}

}

If you think my code can be considered accurate, please let me know.

Not accurate is probably the wrong description. It would use 3 buckets and nothing would go into the middle bucket. First bucket would contain everything other than 888 and the third bucket would contain just 888. This just proves that the Bucket Sort algorithm is susceptible to very bad distributions causing poor performance (pretty much all sorting algorithms have degenerate cases).

I am not certain of the iteration order guarantees on maps on whatever derivative of C you are using. You must iterate from the first index to the last in order to get the sort order correct. If you have to get the keys, then sort those, you are kinda getting into a circular problem (I must sort these keys in order to sort these items). If the order of the keys is guaranteed to be smallest to largest that could work though. Otherwise I do not think it actually does the sort correctly.

1 Like

According to this StackOverflow answer std::map should have a guaranteed iteration order through the keys from start to end. So your algorithm should work if that is true.

I should also mention that if you use simply dividing by the “number of buckets” that you are not actually guaranteeing that number of buckets at all (since you can have more than that number of buckets). Using the example numbers you provided, here are the bucket numbers you get for the inputs:

  • 3 => 3/3 = 1
  • 1 => 1/3 = 0
  • 5 => 5/3 = 1
  • 7 => 7/3 = 2
  • 4 => 4/3 = 1
  • 5, 4 (see above)
  • 888 => 888/3 = 296

Note: that would be 4 buckets (not 3): 0, 1, 2, and 296.


Since we need to identify the largest value anyways in the classic BucketSort, I am curious that a better variation is to keep track of some histogram information on orders of magnitude (maybe powers of 2 for example). Then you could use the histogram information to tell you how to create buckets that are approximately the same size.

Applying that to our example of {3, 1, 5, 7, 4, 5, 4, 888} we would create a histogram like this:

{ 
  Range[0 to 0]: 0, // no items in this range
  Range[1 to 1]: 1, // just one 1 in this range
  Range[2 to 2]: 0, // no items in this range
  Range[3 to 4]: 3, // 3, 4, and 4 are in this range
  Range[5 to 8]: 3, // 5, 7, and 5 are in this range
  Range[9 to 16]: 0, // no items in this range
  ...
  Range[513 to 1024]: 1, // 888 is in this range
}

Since we have 8 items total, we want to get 2 to 3 items in each bucket, so we should pick approximate breakpoints that would get us there. We can get exactly 4 items in the first bucket by choosing 0 to 4 as our first bucket, but since we want to get closer to 2.5, we should try to get 1.5 items from that bucket which means aiming for the middle (since there are 3 items total) which means using 0 to 3 as our first bucket. That leaves approximately 1.5 items in that bucket and we want (approximately) another 1 item from the next bucket that has items. It has 3 so we could get approximately 4.5 items if we chose the next bucket to be from 4 to 8. Trying to get only 1 of the 3 items means we want to pick about 1/3 through the range which means we should really cut it off at 6. So our second bucket becomes from 4 to 6. That leaves our third bucket covering 7 to 888.

Putting our values into those buckets, let us see how the breakdown turned out:

  • bucket[0] (range 0 to 3) = [3, 1]
  • bucket[1] (range 4 to 6) = [5, 4, 5, 4]
  • bucket[2] (range 7 to 888) = [7, 888]

While we did get one bucket larger than 3 items, that really was not such a bad breakdown (especially compared to the main one where everything ended up in the same bucket except 888).

Anyway, thanks for pointing out this issue since many students would not have discovered the issue.

@Mosh - it may actually be worth updating that particular set of videos since the algorithm is actually wrong.

1 Like

It does the sort correctly

OK, but it is still not a bucket sort since it is not using a fixed number of buckets. Dividing by k (number of buckets) is not enough to guarantee k buckets. You have to multiply by k and divide by M (max value). Otherwise you can have an arbitrary number of buckets. Cheers!