In cryptography, a hash or message digest is a fixed-size string calculated from an input text of any size up to some large limit. Hashing a string is a form of One-way encryption and is used by software to store passwords. Hash functions have additional properties. For example, a collision-resistant hash provides a practically unique fingerprint of its input. It is much easier (but because of the collision resistance just as secure) to digitally sign this short fingerprint than the possibly very long input. Thus, hash digests may be appended to files before sending so the receiver can verify that received files or strings have not been changed during transmission.
Collision-resistant hashes are used to facilitate various kinds of authentication service, such as source authentication, or data integrity protection. One-way hashes are commonly used to protect passwords. A system usually does not store its users' passwords, but their hashes (usually salted). The password entered by the user is then hashed and compared with the stored hash to authenticate the user. Message authentication codes, a form of keyed hashes, are often used as the authentication component in hybrid cryptosystems.
All of the following techniques are widely used.
An unkeyed hash can provide error-checking. The sender calculates a hash and stores or transmits it with the document. The receiver calculates a new hash from the document he receives, or the reader calculates one for the document he pulls from the archive. Compare his new hash with the one the sender calculated; if they match then it is overwhelmingly likely that the document has been transmitted or read without error. This is used in many software distributions, to avoid problems with corrupt downloads. It handles noisy lines or "bit rot" in an archive, but an unkeyed hash is useless against an adversary who intentionally changes the data. The enemy simply calculates a new hash for his changed version and stores or transmits that instead of the original hash.
To resist an adversary takes a keyed hash, a Hashed message authentication code or HMAC. Sender and receiver share a secret key; the sender hashes using both the key and the document data, and the receiver verifies using both. Lacking the key, the enemy cannot alter the document undetected. This technique is used in many security systems, such as IPsec, to ensure data integrity.
Hashes are also an essential component of digital signature algorithms. A signature is essentially a hash encrypted with the signer's private key in some public key encryption system. To verify a signature, decrypt it with the signer's public key and check that the decrypted hash matches one calculated from the received document. This technique and digital certificates which rely on digital signatures are extremely widely used.
Hashes are also commonly used as a mixing operation in random number generators.
The main design requirements for a hash are that it be difficult for an enemy to:
- find two inputs that hash to the same result (collision resistance)
- given a hash, find an input that gives that result (pre-image resistance)
- given an input, find another input that hashes to the same result (second pre-image resistance)
An ideal hash resists all of these.
A birthday attack can be used to find collisions for any hash at cost roughly 2hashsize/2. For this reason when a hash is used with a block cipher, it is general practice to make the hash size twice the key length of the cipher. For example, SHA-256 is used with AES-128. The strength of the hash against a birthday attack then roughly matches the strength of the cipher against a brute force attack.
A widely used design technique is to have multiple rounds of processing for each input block. The designer aims for a good avalanche effect with a small change in one round leading to very large changes a few rounds later. See the block cipher article for a discussion of these concepts.
Another technique is Merkle-Damgard strengthening, discovered independently by Ralph Merkle and Ivan Damgård. The technique is to append a representation of the input length to the input before hashing; this prevents collisions with inputs of other lengths. Given this, it can be proven that if the compression function of the hash is collision resistant, then the overall hash will be. One source is Merkle's Stanford PhD thesis Secrecy, authentication, and public key systems.
Over the past few years, a number of attacks have been developed against various hashes that process input in blocks such that the hash state is updated for each block and state after the last block becomes the final output. They work by finding collisions for a single block operation, then chaining these together to get an overall attack. A defense is the wide trail strategy in which the hash has an internal state much larger than the hash output so internal collisions are harder to find. Typically, internal state is about twice output size.
MD4 and descendants
Message Digest algorithm number 4 was from Ron Rivest. It is no longer used, replaced by its descendants. A specification is in RFC 1320.
The main innovation in MD4 was the use of bitwise nonlinear operations to mix words of data. Most computer instruction sets provide bitwise logical operations on words, and many programming languages do as well. For example, for the C expression a = b&c each bit of the output word a is the logical AND of the corresponding bits of the inputs b and c. On a 32-bit machine, that code does 32 Boolean operations in a single instruction. Rivest combined these operations to calculate nonlinear functions of three inputs. Quoting the RFC:
We first define three auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word.
F(X,Y,Z) = XY v not(X) Z G(X,Y,Z) = XY v XZ v YZ H(X,Y,Z) = X xor Y xor Z
... F acts as a conditional: if X then Y else Z. ... G acts as a majority function: if at least two of X, Y, Z are on, then G has a "1" bit in that bit position ... H is the bit-wise XOR or parity" function
These functions are used repeatedly to mix various sets of words.
MD5 was Rivests's version of an enhanced MD4. Like MD4, it gives a 128-bit hash. RFC 1321 gives a specification and RFC 1820 a performance analysis.
Quoting RFC 1321:
"The following are the differences between MD4 and MD5:
1. A fourth round has been added.
2. Each step now has a unique additive constant.
3. The function g in round 2 was changed from (XY v XZ v YZ) to (XZ v Y not(Z)) to make g less symmetric.
4. Each step now adds in the result of the previous step. This promotes a faster "avalanche effect".
5. The order in which input words are accessed in rounds 2 and 3 is changed, to make these patterns less like each other.
6. The shift amounts in each round have been approximately optimized, to yield a faster "avalanche effect." The shifts in different rounds are distinct."
There are several published attacks against MD5, but they do not affect all applications. MD5 is still widely used, though it is gradually being replaced.
There are a whole family of SHA hashes, all designed by NSA. The original SHA, now sometimes referred to as SHA-0, was essentially an improved MD4, with two major changes. It increased the hash size from 128 to 160 bits, using five 32-bit words of internal state instead of four; this raises the strength against a birthday attack from 264 to 280. Also, there is an expansion step which spreads the state out to 80 words. One word is then mixed back in at each round of the hash. SHA-0 was not much used, quickly replaced by SHA-1.
SHA-1 is a slightly modified SHA, also giving a 160-bit hash. It adds a one-bit rotation in each round. The NSA have never explained why they felt this change was necessary; presumably it protects against some attack which they do not wish to reveal. A specification is in RFC 3174. The US government standard is FIPS 180-1.
SHA-1 is in very wide use. For example, it is used in protocols such as PGP and IPsec and in random number generators such as Intel's hardware generator and the software random device in Linux.
There have been a number of attacks on SHA-1, though these are not applicable against all applications of the algorithm. If SHA-1 behaved as an ideal hash, the cost of finding a collision would be 280 computations; a birthday attack would be the best available. However in 2004, a team led by Wang Xiaoyun at Shandong University demonstrated collision attacks on MD5 and SHA-0 (without the rotation). Later papers from the same group showed an attack on SHA-1 with time complexity 269, then an improved attack at 263. In 2009, another group came up with an attack  requiring only 252 SHA computations.
SHA-2 is a family of hashes standardized by the US National Institute of Standards and Technology, NIST. The standard is FIPS 180-2 (pdf). The design is based on SHA.
There are four new hashes in the standard (SHA-1 is retained as well), named by their hash size: SHA-224, SHA-256, SHA-384 and SHA-512. Because of the birthday attack, when a hash is used with a block cipher, the hash size should be twice the key length of the cipher, SHA-256, 384 and 512 are intended to be used with AES-128, 192 and 256 respectively. SHA-224 is for use with Triple DES which has only 112-bit strength.
In internal structure, the four SHA-2 hashes are identical except the 384-bit and 512-bit versions use 64-bit variables while the 256-bit and 224-bit versions use 32-bit variables. SHA-384 is identical to SHA-512 except it starts with different constants and truncates the output to 384 bits. SHA-224 has the same relation to SHA-256.
As of early 2009, no attacks are known against the SHA-2 group of algorithms. However attacks have been found against MD4, MD5 and SHA-1. Since SHA-2 uses the same design ideas there is some cause for worry that eventually SHA-2 might fall. Playing it safe, NIST are therefore now working on an Advanced Hash Standard, also known as SHA-3, which could replace SHA-2 if that should become necessary.
This is a European design. The name is from Réseaux IP Européens (RIPE) and Message Digest. The best-known variant is RIPEMD-160, which can be used as a drop-in replacement for SHA-1. Other variants give 128, 256 or 320-bit hashes. RIPE-MD has a home page.
Other 20th century hashes
Tiger is a 192-bit hash designed by Eli Biham and Ross Anderson. It was designed to take advantage of 64-bit processors, using a lot of 64-bit operations, while still giving acceptable performance on other machines. It has a home page.
Whirlpool is a 512-bit hash from Vincent Rijmen (one of the designers of AES) and Paulo Barreto. Is uses a block cipher, designed along the same lines as AES but with 512-bit blocks and a 512-bit key, as the compression function.
Along with SHA-1 and some variants of SHA-2 and RIPE-MD, Whirlpool is included in the ISO/IEC hash standard.
I has a home page.
The Advanced Hash Standard
In 2005, the US National Institute of Standards and Technology (NIST) began the process of defining a new hash standard, SHA-3 or the Advanced Hash Standard or just AHS. There is a NIST page with details and links.
The overall process and methodology are similar to what they did for the AES competition, choosing a new cipher standard which became the Advanced Encryption Standard. Starting in 2005, they sponsored two public workshops to discuss the state of the hashing art, then issued a draft requirements document and invited public comment. After revising the requirements, they issued a call for submissions in November 2007. The deadline on that was October 31, 2008.
There were 64 submissions including 51 which NIST considered "complete and proper". NIST has posted the design documents and code for these 51 first round candidates. There is also an independent site with a list and links to design documents, the SHA-3 Zoo. It has 56 submitted hashes including all candidates.
Efforts toward cryptanalysis are intense; attacks, or at least observations that may lead to an attack, have been found for many of the hashes. As of March 2009, the Zoo lists 10 of the 51 candidates as "conceded broken"; the designers have conceded that (at least the submitted versions of) their hashes are broken.
The Skein team have some analysis of "engineering aspects" of the competitors, mainly speed and memory usage. Sean O'Neill has tables comparing source code and object code sizes.
The first AHS conference was in February 2009 . In July 2009, the field was narrowed to fourteen second round candidates — BLAKE, Blue Midnight Wish, CubeHash. ECHO, Fugue. Grøstl, Hamsi. JH, Keccak, Luffa, Shabal. SHAvite-3, SIMD. and Skein. There was a second conference and another narrowing, to five finalists — BLAKE, Grøstl, JH, Keccak, and Skein. The third conference was held in March 2012 . NIST have asked for final comments before June 1. Actual selection of a winner is expected shortly after that.
We list a few of the most interesting submissions below. See the NIST site or the SHA-3 Zoo for others.
From Bruce Schneier and others. It uses a new block cipher called Threefish to do the mixing. There is information both on Schneier's site and a Skein site.
From a team led by Ron Rivest. There is a web site at MIT. This was withdrawn from the competition  during the first round.
From Dan Bernstein, 
From a Danish team that includes Lars Knudsen. This is a wide-trail hash using some operations borrowed from AES. It has a home page.
From a team that includes Joan Daemen, one of the AES designers. It has a home page.
In September 2012, NIST announced Keccak as the winner.
From a team that includes Bart Preneel. 
From a team that includes Richard Schroeppel whose Hasty Pudding was in some ways the most interesting entry in the AES competition.
A French entry. 
From an Israeli team: Eli Biham, one of the inventors of differential cryptanalysis, and Orr Dunkelman.