Why do hash maps in Java 8 use a binary tree instead of a linked list?
I recently came to know that in Java 8, hash maps use a binary tree instead of a linked list, and that the hash code is used as the branching factor. I understand that in the case of high collision the lookup is reduced from O(n) to O(log n) by using binary trees. My question is: what good does this really do, given that the amortized time complexity is still O(1)? Perhaps if you forced all the entries into the same bucket by giving every key the same hash code you could see a significant time difference, but no one in their right mind would do that.
A binary tree also uses more space than a singly linked list, since it stores both left and right nodes. Why increase the space complexity when there is absolutely no improvement in time complexity, except for some contrived test cases?
This is mostly a security-related change. While in a normal situation it's rarely possible to have many collisions, if hash keys arrive from an untrusted source (e.g. HTTP header names received from the client), then it's possible, and not very hard, to specially craft the input so that the resulting keys all have the same hash code. If you then perform many lookups, you may experience denial of service. It appears that there's quite a lot of code in the wild which is vulnerable to this kind of attack, thus it was decided to fix this on the Java side.
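As a minimal sketch of how easily such input can be crafted: the strings `"Aa"` and `"BB"` have the same `String` hash code (2112), so concatenating n such blocks yields 2^n distinct keys sharing one hash code. The class and helper names here are mine, for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollisionCrafting {
    // "Aa" and "BB" share the String hash code 2112, so concatenating
    // n such blocks yields 2^n distinct keys with one shared hash code.
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String key : keys) {
                next.add(key + "Aa");
                next.add(key + "BB");
            }
            keys = next;
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(12);  // 4096 keys, one hash code
        System.out.println(keys.get(0).hashCode() == keys.get(4095).hashCode());  // true

        Map<String, Integer> map = new HashMap<>();
        for (String key : keys) {
            map.put(key, key.length());
        }
        // Every key lands in the same bucket; before Java 8 each lookup
        // walked an O(n) list, while the treeified bin keeps it at O(log n).
        System.out.println(map.get(keys.get(42)));  // 24
    }
}
```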
For more information refer to JEP-180.
Your question contains some wrong premises.
- A bucket collision is not necessarily a hash collision. Two objects don't need to have the same hash code to end up in the same bucket. The bucket is an element of an array, and the hash code has to be mapped to a particular index (see the index sketch after this list). Since the array size should be reasonable with respect to the Map's size, you can't raise the array size arbitrarily to avoid bucket collisions. There's even the theoretical limitation that an array size can be at most 2³¹ while there are 2³² possible hash codes.
- Having a hash collision is not a sign of bad programming. For any type whose value space is larger than 2³², distinct objects with the same hash code are unavoidable by the pigeonhole principle. Strings are an obvious example, but even a Point bearing two int values or a plain Long key has unavoidable hash collisions (see the Long example below). So collisions may be more common than you think, and it heavily depends on the use case.
- The implementation switches to a binary tree only when the number of collisions in a bucket exceeds a certain threshold, so the higher memory costs apply only where they will pay off. There seems to be a common misunderstanding about how these trees work: since a bucket collision is not necessarily a hash collision, the binary search first compares the hash codes. Only when the hash codes are identical and the key appropriately implements Comparable is its natural order used (see the sketch after this list). The examples you might have found on the web deliberately use the same hash code for all objects to demonstrate the use of the Comparable implementation, which otherwise would not show up. What they trigger is only the last resort of the implementation.
- As Tagir pointed out, this issue may affect the security of software, as a slow fallback opens the possibility of DoS attacks. There have been several attempts in previous JRE versions to address this issue, which had more drawbacks than the memory consumption of a binary tree. For example, Java 7 attempted to randomize the mapping of hash codes to array entries, which caused an initialization overhead, as documented in this bug report. That is not the case with this new implementation.
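To make the first point concrete, here is a simplified sketch of how a hash code becomes a bucket index. It mirrors what OpenJDK 8's HashMap does (spread the high bits into the low ones, then mask by the power-of-two table length), but the method names are mine and the details are elided:

```java
public class BucketIndexSketch {
    // Fold the high 16 bits into the low 16 so small tables still see them.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // tableLength is always a power of two, so the mask acts like a modulo.
    static int bucketIndex(int hashCode, int tableLength) {
        return (tableLength - 1) & spread(hashCode);
    }

    public static void main(String[] args) {
        // Distinct hash codes, same bucket: a bucket collision
        // without a hash collision.
        System.out.println(bucketIndex(1, 16));   // 1
        System.out.println(bucketIndex(17, 16));  // 1
    }
}
```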
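For the second point, `Long.hashCode()` XORs the two 32-bit halves of the value, so distinct longs necessarily collapse onto the same int. A quick check:

```java
public class LongCollision {
    public static void main(String[] args) {
        // Long.hashCode() is (int)(value ^ (value >>> 32)), so these
        // three distinct values all hash to 0.
        System.out.println(Long.valueOf(0L).hashCode());           // 0
        System.out.println(Long.valueOf(-1L).hashCode());          // 0
        System.out.println(Long.valueOf(4294967297L).hashCode());  // 0
    }
}
```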
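And for the third point, a sketch of when the Comparable fallback kicks in. `BadHashKey` is a hypothetical key type of my own whose constant hashCode forces every instance into one bucket; once the bin is treeified, the map can still order the nodes by the keys' natural order, keeping lookups logarithmic:

```java
import java.util.HashMap;
import java.util.Map;

final class BadHashKey implements Comparable<BadHashKey> {
    final int id;

    BadHashKey(int id) { this.id = id; }

    @Override public int hashCode() { return 42; }  // worst case: every key collides

    @Override public boolean equals(Object o) {
        return o instanceof BadHashKey && ((BadHashKey) o).id == id;
    }

    // Consulted by the treeified bin only after the (identical) hash
    // codes fail to discriminate between keys.
    @Override public int compareTo(BadHashKey other) {
        return Integer.compare(id, other.id);
    }
}

public class TreeBinDemo {
    public static void main(String[] args) {
        Map<BadHashKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadHashKey(i), i);
        }
        // All 10,000 entries share one bucket, yet this lookup walks a
        // tree of depth ~log2(10000) rather than a 10,000-node list.
        System.out.println(map.get(new BadHashKey(9_999)));  // 9999
    }
}
```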