Hash (cryptography)
In cryptography a hash or message digest is a fixed-size digest which can be calculated from an input text of any size up to some large limit. While cryptographic principles are used, these functions are used in manners quite different than two-way, or even one-way full-text cryptographically protected communications. The primary applications of hashes and message digests are as means of error detection, source authentication, or data integrity protection.
Applications
Hashes provide various kinds of authentication service, not the secrecy that other cryptographic primitives (block ciphers, stream ciphers and public key techniques) provide. They are often used as the authentication component in hybrid cryptosystems, along with other components for other services.
All of the following techniques are widely used.
An unkeyed hash can provide error-checking. The sender calculates a hash and stores or transmits it with the document. The receiver calculates a new hash from the document he receives, or the reader calculates one for the document he pulls from the archive. Compare his new hash with the one the sender calculated; if they match then it is overwhelmingly likely that the document has been transmitted or read without error. This is used in many software distributions, to avoid problems with corrupt downloads. It handles noisy lines or "bit rot" in an archive, but an unkeyed hash is useless against an adversary who intentionally changes the data. The enemy simply calculates a new hash for his changed version and stores or transmits that instead of the original hash.
To resist an adversary takes a keyed hash, a Hashed message authentication code or HMAC. Sender and receiver share a secret key; the sender hashes using both the key and the document data, and the receiver verifies using both. Lacking the key, the enemy cannot alter the document undetected. This technique is used in many security systems, such as IPsec, to ensure data integrity.
Hashes are also an essential component of digital signature algorithms. A signature is essentially a hash encrypted with the signer's private key in some public key encryption system. To verify a signature, decrypt it with the signer's public key and check that the decrypted hash matches one calculated from the received document. This technique and digital certificates which rely on digital signatures are extremely widely used.
Hashes are also commonly used as a mixing operation in random number generators.
Design considerations
The main design requirements for a hash are that it be difficult for an enemy to:
- find two inputs that hash to the same result (collision resistance)
- given a hash, find an input that gives that result (pre-image resistance)
- given an input, find another input that hashes to the same result (second pre-image resistance)
An ideal hash resists all of these.
A birthday attack can be used to find collisions for any hash at cost roughly 2hashsize/2. For this reason when a hash is used with a block cipher, it is general practice to make the hash size twice the key length of the cipher. For example, SHA-256 is used with AES-128. The strength of the hash against a birthday attack then roughly matches the strength of the cipher against brute force.
A widely used design technique is to have multiple rounds of processing for each input block. The designer aims for a good avalanche effect with a small change in one round leading to very large changes a few rounds later. See the block cipher article for a discussion of these concepts.
Another technique is Merkle-Damgard strengthening. The technique is to append a representation of the input length to the input before hashing; this prevents collisions with inputs of other lengths. Given this, Merkle and Damgard prove that if the compression function of the hash is collision resistant, then the overall hash will be. One source is Merkle's Stanford PhD thesis Secrecy, authentication, and public key systems.
MD4 and descendants
MD4
Message Digest algorithm number 4 was from Ron Rivest. It is no longer used, replaced by its descendants. A specification is in RFC 1320.
The main innovation in MD4 was the use of bitwise non-linear operations to mix words of data. Most computer instruction sets provide bitwise logical operations on words, and many programming languages do as well. For example, for the C expression a = b&c each bit of the output word a is the logical AND of the corresponding bits of the inputs b and c. On a 32-bit machine, that code does 32 Boolean operations in a single instruction. Rivest combined these operations to calculate non-linear functions of three inputs. Quoting the RFC:
We first define three auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word.
F(X,Y,Z) = XY v not(X) Z G(X,Y,Z) = XY v XZ v YZ H(X,Y,Z) = X xor Y xor Z
... F acts as a conditional: if X then Y else Z. ... G acts as a majority function: if at least two of X, Y, Z are on, then G has a "1" bit in that bit position ... H is the bit-wise XOR or parity" function
These functions are used repeatedly to mix various sets of words.
MD5
MD5 was Rivests's version of an enhanced MD4. Like MD4, it gives a 128-bit hash. RFC 1321 gives a specification and RFC 1820 a performance analysis.
Quoting RFC 1321:
"The following are the differences between MD4 and MD5:
1. A fourth round has been added.
2. Each step now has a unique additive constant.
3. The function g in round 2 was changed from (XY v XZ v YZ) to (XZ v Y not(Z)) to make g less symmetric.
4. Each step now adds in the result of the previous step. This promotes a faster "avalanche effect".
5. The order in which input words are accessed in rounds 2 and 3 is changed, to make these patterns less like each other.
6. The shift amounts in each round have been approximately optimized, to yield a faster "avalanche effect." The shifts in different rounds are distinct."
SHA
There are a whole family of SHA hashes, all designed by NSA. The original SHA was essentially an improved MD4, with two major changes. It increased the hash size from 128 to 160 bits, using five 32-bit words of internal state instead of four. Also, there is an expansion step which spreads the state out to 80 words. One word is then mixed back in at each round of the hash. This was not much used, quickly replaced by SHA-1.
SHA-1
SHA-1 is a slightly modified SHA, also giving a 160-bit hash. It adds a one-bit rotation in each round. The NSA have never explained why they felt this change was necessary; presumably it protects against some attack which they do not wish to reveal.
A specification is in RFC 3174. The US government standard is FIPS 180-1.
SHA-1 is in very wide use. For example, it is used in protocols such as PGP and IPsec and in random number generators such as Intel's hardware generator and the software random device in Linux.
SHA-2
SHA-2 is a family of hashes standardized by the US National Institute for Standards and Technology, NIST. The standard is FIPS 180-2 (pdf). The design is based on SHA.
There are four new hashes in the standard (SHA-1 is retained as well), named by their hash size: SHA-224, SHA-256, SHA-384 and SHA-512. Because of the birthday attack, when a hash is used with a block cipher, the hash size should be twice the key length of the cipher, SHA-256, 384 and 512 are intended to be used with AES-128, 192 and 256 respectively. SHA-224 is for use with Triple DES which has only 112-bit strength.
In internal structure, the four SHA-2 hashes are identical except the 384-bit and 512-bit versions use 64-bit variables while the 256-bit and 224-bit versions use 32-bit variables. SHA-384 is identical to SHA-512 except it starts with different constants and truncates the output to 384 bits. SHA-224 has the same relation to SHA-256.
As of late 2008, no attacks are known against the SHA-2 group of algorithms, but attacks have been found against MD4, MD5 and SHA-1, so there is some cause for worry that eventually SHA-2 might fall. Playing it safe, NIST are therefore now working on an Advanced Hash Standard, also known as SHA-3, which could replace SHA-2 if that should become necessary.
RIPE-MD
This is a European design. The best-known variant is RIPEMD-160, which can be used as a drop-in replacement for SHA-1. Other variants give 128, 256 or 320-bit hashes. RIPE-MD has a home page.
Other 20th century hashes
Tiger
Tiger is a 192-bit hash designed by Eli Biham and Ross Anderson. It was designed to take advantage of 64-bit processors, using a lot of 64-bit operations, while still giving acceptable performance on other machines. It has a home page.
Whirlpool
Whirlpool is a 512-bit hash from Vincent Rijmen (one of the designers of AES) and Paulo Barreto. Is uses a block cipher, designed along the same lines as AES but with 512-bit blocks and a 512-bit key, as the compression function.
Along with SHA-1 and some variants of SHA-2 and RIPE-MD, Whirlpool is included in the ISO/IEC hash standard.
I has a home page.
The Advanced Hash Standard
In 2005, the US National Institute of Standards and Technology (NIST) began the process of defining a new hash standard, SHA-3 or the Advanced Hash Standard or just AHS. There is a NIST page with details and links.
The overall process and methodology are similar to what they did for the AES contest, choosing a new cipher standard which became the Advanced Encryption Standard. Starting in 2005, they sponsored two public workshops contest to discuss the state of the hashing art, then issued a draft requirements document and invited public comment. After revising the requirements, they issued a call for submissions in November 2007. The deadline on that was October 31, 2008.
There were 64 submissions including 51 which became first round candidates, i.e. NIST considered them "complete and proper" submissions. NIST has posted the design documents and code for those [1]. There is also an independent site with a list and links to design documents, the SHA-3 Zoo; as of early December, it lists 55 hashes.
Efforts toward cryptanalysis are intense; attacks, or at least observations that may lead to an attack, have been found for many of the hashes. As of mid-December, the Zoo labels 10 as "broken". At least three designers — Greg Rose for Boole, Michal Trojnara for StreamHash and Sean O'Neill for EnRUPT — have already conceded that at least the submitted versions of their hashes are broken.
The Skein team also have some analysis of "engineering aspects" of the competitors, mainly speed and memory usage. Sean O'Neill has tables comparing source code and object code sizes.
There will be more conferences, the first in February 2009 [2], then a narrowing of the field to a group of finalists, more analysis and another conference, then a final selection. Target date for completion of the process and release of the new standard is 2012.
Skein
From Bruce Schneier and others. It uses a new block cipher called Threefish to do the mixing. There is information both on Schneier's site and a Skein site.
MD6
From a team led by Ron Rivest. There is a web site at MIT.
CubeHash
From Dan Bernstein, [3]
Essence
From Jason Worth Martin [4]
Sgàil
Peter Maxwell [5]
EnRUPT
Sean O'Neil [6]
NaSha
Smile Markovski and Aleksandra Mileva [7]
Maraca
Robert Jenkins [8]