Original: http://mindprod.com/jgloss/hashcode.html  See also: http://www.ibm.com/developerworks/java/library/j-jtp05273.html  ©1996-2011 2009-06-10 Roedy Green, Canadian Mind Products
hashCode
The key thing to understand about hashCodes is that they need not be unique, just as close to unique as practically possible.
HashCodes in a Nutshell
If you want to file something away for later retrieval, it can be faster if you file it numerically rather than by a long alphabetic key. A hashCode is a way of computing a small (32-bit) digest numeric key from a long String or even an arbitrary clump of bytes. The numeric key itself is meaningless and the hashCode functions for computing them can look a bit insane. However, when you go to look for something, you can do the same digest calculation on the long alphabetic key you are looking for, and no matter how bizarre an algorithm you used, you will calculate the same hashCode, and will be able to look up numerically with it. Of course there is always the possibility two different Strings will have the same digest hashCode. However, even then, all is not lost; it greatly narrows down the search, hence speeding it up. A Hashtable goes a step further, scrunching down the hashCode even further to an even smaller number that it can use to directly index an array, usually by dividing it by some (ideally prime) number and taking the remainder.
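For instance, here is a minimal sketch of that last step, turning a 32-bit hashCode into a small array index; the key and the table size are invented for illustration:

// Sketch, not from the original article: how a hash table scrunches a
// 32-bit hashCode down to a bucket index.
public class BucketDemo {
    public static void main( String[] args ) {
        final String key = "some long alphabetic key"; // hypothetical key
        final int tableSize = 101;                     // hypothetical prime table size
        final int h = key.hashCode();                  // 32-bit digest, may be negative
        // mask off the sign bit, then take the remainder
        final int bucket = ( h & 0x7fffffff ) % tableSize;
        System.out.println( key + " -> hashCode " + h + " -> bucket " + bucket );
    }
}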
HashCode Misconceptions
Uniqueness: Some people try to use hashCodes as unique identifiers. You can’t count on hashCodes being unique. The way they are properly used, this does not cause any serious problem.

Denseness: Some people fret over trying to make hashCodes come out in some narrow band of numbers. The way they are used, they are trimmed down to size by taking the modulus relative to some prime or by discarding high order bits, so there is no point.

Negative hashCodes: You can safely treat hashCodes as unsigned or signed when computing them. You don’t have to go to any special lengths to keep them positive.
String.hashCode Implementation
In JDK 1.0+ and 1.1+ the hashCode function for long Strings worked by sampling every nth character. This pretty well guaranteed you would have many Strings hashing to the same value, thus slowing down Hashtable lookup. In Java version 1.2 the function has been improved to multiply the result so far by 31 then add the next character in sequence. This is a little slower, but is much better at avoiding collisions.
// String.hashCode

/** The value is used for character storage. */
private final char value[];

/** The offset is the first index of the storage that is used. */
private final int offset;

/** The count is the number of characters in the String. */
private final int count;

/** Cache the hash code for the string */
private int hash; // Default to 0

/**
 * Returns a hash code for this string. The hash code for a
 * <code>String</code> object is computed as
 * <blockquote><pre>
 * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
 * </pre></blockquote>
 * using <code>int</code> arithmetic, where <code>s[i]</code> is the
 * <i>i</i>th character of the string, <code>n</code> is the length of
 * the string, and <code>^</code> indicates exponentiation.
 * (The hash value of the empty string is zero.)
 *
 * @return a hash code value for this object.
 */
public int hashCode() {
    int h = hash;
    if ( h == 0 ) {
        int off = offset;
        char val[] = value;
        int len = count;

        for ( int i = 0; i < len; i++ ) {
            h = 31 * h + val[ off++ ];
        }
        hash = h;
    }
    return h;
}
Object.hashCode Implementation
The default hashCode() method uses the 32-bit internal JVM address of the Object as its hashCode. However, if the Object is moved in memory during garbage collection, the hashCode stays constant. This default hashCode is not very useful, since to look up an Object in a HashMap, you need the exact same key Object with which the key/value pair was originally filed. Normally, when you go to look up, you don’t have the original key Object itself, just some data for a key. So, unless your key is a String, you will nearly always need to implement hashCode and equals methods on your key class. Object.hashCode is a native method.
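Here is a short demonstration of the problem; the Badge class is invented for illustration:

import java.util.HashMap;
import java.util.Map;

// Sketch: two Badge objects carrying identical data are *different* keys
// under the default Object.equals/Object.hashCode.
public class IdentityDemo {
    static class Badge {              // hypothetical key class, no overrides
        final int number;
        Badge( int number ) { this.number = number; }
    }

    public static void main( String[] args ) {
        Map<Badge, String> owners = new HashMap<Badge, String>();
        owners.put( new Badge( 42 ), "Roedy" );
        // A freshly built key with the same data fails to find the entry,
        // because the default hashCode/equals compare identity, not content.
        System.out.println( owners.get( new Badge( 42 ) ) ); // prints null
    }
}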
The Gemini Twins: equals and hashCode
Equal hashCodes in general are not sufficient to ensure Object equality. However, if the hashCodes are not equal, you know the Objects can’t possibly be equal. Consider how many 50-character Strings there are (65536^50) and how many possible hashCodes there are (2^32). It should be obvious there are far more Strings than hashCodes, so the same hashCode has to be reused over and over for different Strings. The default hashCode just uses the address of the Object and the default equals method just compares addresses. If you override one of these two methods, you must override the other to match. The rules are:

It is ok, unavoidable, but not desirable, for two Objects that do not compare equal to have the same hashCode.

Two Objects that compare equal must have the same hashCode.

So if you had a Fruit Object with a flavour and a colour field, and you decided that any two Objects with the same flavour were for all intents and purposes equal, you would define your equals and hashCode methods like this:
import java.awt.Color;

/**
 * Class to describe a fruit.
 */
public class Fruit {
    /**
     * Flavour of the fruit.
     */
    private String flavour;

    /**
     * Colour of the fruit.
     */
    private Color colour;

    /**
     * Compare this Fruit object with another Fruit object.
     *
     * @param other Object, usually a Fruit, to compare with.
     *
     * @return true if the objects are of identical flavour.
     */
    public boolean equals( Object other ) {
        if ( other == null ) {
            return false;
        } else if ( other instanceof Fruit ) {
            return this.flavour.equals( ( (Fruit) other ).flavour );
        } else {
            return false;
        }
    }

    /**
     * Calculate the hashCode of this object, based only on its flavour.
     *
     * @return the hashCode for the Fruit object
     */
    public int hashCode() {
        return flavour.hashCode();
    }
}
As a rule of thumb, any time you use an Object as a key in a Map or Set (e.g. Hashtable, HashMap, HashSet, TreeMap etc.) you must redefine both equals and hashCode in such a way that both incorporate the same fields, namely all the fields of the logical key. Fields in the key Object irrelevant to lookup should not be included in either method.
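For example, with the Fruit class above (assuming a Fruit( flavour, colour ) constructor not shown earlier), a HashSet treats two fruits of the same flavour as duplicates:

import java.awt.Color;
import java.util.HashSet;
import java.util.Set;

// Sketch: because Fruit.equals/hashCode use only flavour, a red apple
// and a green apple count as duplicates in a HashSet.
public class FruitDemo {
    public static void main( String[] args ) {
        Set<Fruit> basket = new HashSet<Fruit>();
        basket.add( new Fruit( "apple", Color.RED ) );   // hypothetical constructor
        basket.add( new Fruit( "apple", Color.GREEN ) ); // same flavour: rejected as a duplicate
        System.out.println( basket.size() );             // prints 1
    }
}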
Simple HashCodes
Let us say you have three ints in your Object: field1 has a range 0..99, field2 has a range -10..+10 and field3 has a range 100..1000. You could pack them into a unique, dense hashCode like this:
// simple packed hashCode algorithm
final int low1 = 0;
final int high1 = 99;

final int low2 = -10;
final int high2 = 10;

final int low3 = 100;
final int high3 = 1000;

int hashCode = ( field1 - low1 );
hashCode *= ( high2 - low2 + 1 );
hashCode += ( field2 - low2 );
hashCode *= ( high3 - low3 + 1 );
hashCode += ( field3 - low3 );
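Wrapped up as a complete class it might look like this; a sketch only, with the ranges above folded in as constants:

// Sketch: the packing formula above as a complete hashCode method.
public class Packed {
    private final int field1; // range 0..99
    private final int field2; // range -10..+10
    private final int field3; // range 100..1000

    public Packed( int field1, int field2, int field3 ) {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }

    /**
     * Unique, dense hashCode: every in-range combination maps to a
     * distinct value in 0 .. 100*21*901-1.
     * A matching equals comparing all three fields is needed too (see above).
     */
    public int hashCode() {
        int hashCode = field1 - 0;                    // ( field1 - low1 )
        hashCode = hashCode * 21 + ( field2 + 10 );   // 21 = high2 - low2 + 1
        hashCode = hashCode * 901 + ( field3 - 100 ); // 901 = high3 - low3 + 1
        return hashCode;
    }
}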
Calculating Aggregate hashCodes with XOR
The xor ^ operator is useful in computing hashing functions. To create a hashCode based on two fields, compute the hashCodes of the two fields separately and xor them together with the ^ operator. To create a hash on all the elements of an array you could xor all the values together. The result is independent of the order. If you want the order to matter, use some digest function. Xor also has the odd property that if you have a pair of identical hashCodes xored together, it is as if they were not there. When you are expecting duplicates, you might want to use some other combining technique.

XOR has the following nice properties:

It does not depend on order of computation.

It does not “waste” bits. If you change even one bit in one of the components, the final value will change.

It is quick, a single cycle on even the most primitive computer.

It preserves uniform distribution. If the two pieces you combine are uniformly distributed, so will the combination be. In other words, it does not tend to collapse the range of the digest into a narrower band.

The nasty properties of XOR are:

It treats identical pairs of components as if they were not there.

It ignores order. A ^ B is the same as B ^ A. If order matters, you want some sort of checksum/digest such as Adler-32. XOR would not be suitable to compute the hashCode for a List, which depends on the order of the elements as part of its identity.

If the values to be combined are only 8 bits wide, there will be effectively only 8 bits in the xored result. In contrast, multiplying by a prime and adding would scramble the entire 32 bits of the result, giving a much broader range of possible values, hence a greater chance that each hashCode will be unique to a single Object. In other words, it tends to expand the range of the digest into the widest possible band (32 bits).

XOR is fairly easily degraded by patterns in the data. If you are not getting an even spread, it might pay to go for a higher overhead scrambling mechanism such as Adler-32, CRC-32 or even MD5 or SHA-1.

If you XOR a number of small quantities, the result is still a small quantity. It does not exploit the high order bits for additional variability unless you do some sort of shifting.

Here is another approach that would work better if you had two Strings in your Object. It gives a different hash code for your two Objects when:
// object1 has
o1.string1 = "apple";
o1.string2 = "orange";
// and object2 has
o2.string1 = "orange";
o2.string2 = "apple";
It works like this to combine hash codes of the fields in your object:
/**
 * hashCode that combines several strings.
 * @return a hash code value on the group of strings.
 */
public int hashCode() {
    int result = string1.hashCode();
    result = 37 * result + string2.hashCode();
    result = 37 * result + string3.hashCode();
    // etc. for all fields in the object.
    // This will only work for 5 fields before you overflow and lose input from the first field.
    // The optimal prime multiplier is near Math.exp( Math.log( Integer.MAX_VALUE ) / numberOfFields ).
    // This technique samples output from all the component Objects, but mushes it together
    // with information from the other fields.
    return result;
}
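A quick check that this multiplier scheme distinguishes the apple/orange swap from above; a throwaway test, not library code:

// Sketch: the 37-multiplier combination is order sensitive,
// so swapping the two field values changes the result.
public class OrderDemo {
    static int combine( String a, String b ) {
        int result = a.hashCode();
        result = 37 * result + b.hashCode();
        return result;
    }

    public static void main( String[] args ) {
        System.out.println( combine( "apple", "orange" ) );
        System.out.println( combine( "orange", "apple" ) ); // different, whenever the two hashCodes differ
    }
}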
Here is how to write a hashCode that combines fields with xor:
/**
 * hashCode that combines three strings using xor.
 * @return a hash code value on the entire object.
 */
public int hashCode() {
    // this will not work well if some of the strings are equal.
    int result = string1.hashCode();
    result ^= string2.hashCode();
    result ^= string3.hashCode();
    // etc. for all fields in the object
    return result;
}
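You can see the duplicate-cancelling and order-blindness weaknesses directly with a throwaway test:

// Sketch: with xor, a pair of equal fields cancels to zero and
// swapped fields collide, the two weaknesses noted above.
public class XorWeakness {
    public static void main( String[] args ) {
        int same = "apple".hashCode() ^ "apple".hashCode();
        System.out.println( same ); // prints 0: the identical pair vanished
        int ab = "apple".hashCode() ^ "orange".hashCode();
        int ba = "orange".hashCode() ^ "apple".hashCode();
        System.out.println( ab == ba ); // prints true: order ignored
    }
}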
Here is roughly how String.hashCode works:
/**
 * Simplified version of how String.hashCode works.
 * For the actual version see src.jar java/lang/String.java
 * @return a hash code value for this String.
 */
public int hashCode() {
    // inside the String is a char[] val that holds the characters.
    int hash = 0;
    final int len = val.length;
    for ( int i = 0; i < len; i++ ) {
        hash = 31 * hash + val[ i ];
    }
    return hash;
}
Here is a fast hash algorithm you can apply to bytes, shorts, chars, ints, arrays etc. I used an assembler version of it in the BBL Forth compiler.
/**
 * Java version of the assembler hashCode used in the BBL Forth compiler.
 * Because of the 1-bit rotate you don't lose data completely on long keys.
 * The rotate breaks up patterns in the data.
 * @return a hash code value for the byte[] val byte array in this object.
 */
public int hashCode() {
    // byte[] val holds the data.
    int hash = 0;
    final int len = val.length;
    for ( int i = 0; i < len; i++ ) {
        // rotate left and xor (very fast in assembler, a bit clumsy to specify
        // in Java, but a smart compiler might collapse it to two machine instructions)
        hash = ( hash << 1 ) | ( hash >>> 31 );
        hash ^= val[ i ];
    }
    return hash;
}
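Since Java 1.5 the rotate can also be written with Integer.rotateLeft, which a good JIT compiles down to a single machine rotate; an equivalent sketch:

/**
 * Sketch: the same rotate-and-xor hash using Integer.rotateLeft
 * (available since Java 1.5).
 * @param val array of bytes to hash
 * @return a hash code value for the byte array
 */
public static int hashCode( byte[] val ) {
    int hash = 0;
    for ( int i = 0; i < val.length; i++ ) {
        hash = Integer.rotateLeft( hash, 1 ) ^ val[ i ];
    }
    return hash;
}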
Here is a straightforward xor hash of all the bytes. The disadvantage is that the result has only 8 bits of variability.
/**
 * A vanilla XOR hashCode of all the bytes in an array. This gives only a signed 8-bit result.
 * @param val array of bytes to hash
 * @return a hash code value for the byte[] val byte array.
 */
public static int hashCode( byte[] val ) {
    int hash = 0;
    final int len = val.length;
    for ( int i = 0; i < len; i++ ) {
        hash ^= val[ i ];
    }
    return hash;
}
Consider using a CRC-32 or an Adler-32 digest for your hashCode when you can reduce the key part of your Object to a long string of bytes. This gives a nice even spread over the range of possible integers.
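Both digests live in java.util.zip. Here is a sketch using CRC32 (Adler32 is a drop-in replacement):

import java.util.zip.CRC32;

// Sketch: using java.util.zip.CRC32 as a hashCode over the key bytes.
public class CrcHash {
    /**
     * @param keyBytes the key part of the object, flattened to bytes
     * @return a well-spread 32-bit hash of the bytes
     */
    public static int hashCode( byte[] keyBytes ) {
        final CRC32 crc = new CRC32();
        crc.update( keyBytes );
        // getValue returns the 32-bit CRC in a long; the cast keeps all 32 bits.
        return (int) crc.getValue();
    }
}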
Tips
Don’t sweat writing a perfect hashCode. Test your code to see if lookup is the bottleneck before fussing too much. It is easy to concoct a variety of hashCode methods and test them far more quickly than you can mathematically analyse the tradeoffs.

If you write a Collection that uses hashCode, stress test it with a hashCode method that always returns 0.

If you have an expensive hashCode calculation, consider caching the result.

The important thing about a hashCode for HashMap is that it produces lots of variability in the low order bits. As a simple check, do a histogram of the frequency of the low order byte. Ideally you should be getting all 256 possible values roughly equally.

One simple way of adding variability to your hashCode function is to xor the high order bits onto the lower order ones, as sketched below.
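Here is a sketch combining those last two tips: fold the high bits down, then histogram the low byte of a made-up stream of keys:

// Sketch: fold the high-order bits onto the low-order ones, then
// histogram the low byte to eyeball the spread.
public class SpreadCheck {
    static int spread( int h ) {
        return h ^ ( h >>> 16 ); // xor the high half onto the low half
    }

    public static void main( String[] args ) {
        int[] histogram = new int[ 256 ];
        for ( int i = 0; i < 100000; i++ ) {
            String key = "key" + i; // hypothetical key stream
            histogram[ spread( key.hashCode() ) & 0xff ]++;
        }
        // Ideally all 256 counts come out roughly equal (about 390 each here).
        for ( int b = 0; b < 256; b++ ) {
            System.out.println( b + " : " + histogram[ b ] );
        }
    }
}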
When Do You Need A Custom equals and hashCode?
The hashCode method only gets invoked when you use the Object as the key to a Hashtable. It is not used when the Object is merely a Hashtable value. Most of the time your Hashtable keys are simple Strings, so you rarely need to write custom equals and hashCode methods. When you use a HashSet to help you check for duplicate Objects, then you likely will need a custom equals and hashCode method. There your Objects act as both key and value. If you know the key values in advance, it is possible to construct a hashCode function that has no collisions.
The One Key Catch
You can define only one hashCode/equals pair for your HashSet Objects. That limits you to one type of HashSet lookup for your Objects. There is no equivalent to Comparator for HashSets. You can look up your Objects by only one key, though that key might contain several fields. You can’t have several HashSets each accessing the same Objects by different keys. You can of course have several HashSets each accessing a different subset of the same group of Objects using the same key.

In contrast, with HashMap you have more freedom. Your Objects don’t have to implement a useful hashCode/equals, but any keys you use do. Since you can define different hashCode/equals for different types of key, you can have multiple HashMaps on the same group of Objects looking up by different keys.

Arrays are Objects and use the lame Object.hashCode. To get a proper hashCode that is based on the values in the array, you need Arrays.hashCode.

The packing method of creating hashCodes will not work well with HashMap, because HashMap discards the high order bits of the hashCode.
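To illustrate the multiple-HashMap point: the same Fruit objects (again assuming a Fruit( flavour, colour ) constructor not shown earlier) can be indexed by flavour in one map and by colour in another, because the keys, not the Fruits, supply hashCode and equals:

import java.awt.Color;
import java.util.HashMap;
import java.util.Map;

// Sketch: with HashMap the *keys* supply hashCode/equals, so the same
// Fruit objects can be indexed two different ways at once.
public class TwoIndexes {
    public static void main( String[] args ) {
        Fruit lemon = new Fruit( "sour", Color.YELLOW ); // hypothetical constructor
        Map<String, Fruit> byFlavour = new HashMap<String, Fruit>();
        Map<Color, Fruit> byColour = new HashMap<Color, Fruit>();
        byFlavour.put( "sour", lemon );      // String supplies hashCode/equals
        byColour.put( Color.YELLOW, lemon ); // Color supplies hashCode/equals
        System.out.println( byFlavour.get( "sour" ) == byColour.get( Color.YELLOW ) ); // true
    }
}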