Cryptographic Hash Functions
Basic Characteristics of Hash Functions
Put very simply, a hash function is a measurement whose purpose is to detect very slight changes to data, or to determine whether or not two pieces of data are identical. Hash functions are detective security tools, not protective, Cryptographically useful hashing functions have some specific characteristics. While more secure algorithms have been available for several years, many organizations persist in using weaker functions.
The output of any one hash function is always of the same length, regardless of the length of the input string. Stated informally, a cryptographically strong hash is one for which a very slight change in the input causes a drastic change throughout the output, and that means you can't search for input strings producing a given hash value.
A good cryptographic hash has these four properties, where "infeasible" means "almost certainly beyond the capability of any adversary who must be kept from breaking the system for as long as the security is important":
1: It is easy to compute the hash value for any given message.
2: It is infeasible to find a message that has a given hash.
3: It is infeasible to modify a message without changing its hash.
4: It is infeasible to find two different messages with the same hash.
Here is how to calculate the SHA1 hash of the opening sentence of the CRC Handbook of Applied Cryptography:
$ echo n "Cryptography has a long and fascinating history."  openssl sha1 d1557b56f7274faf16d38fdf9241035ab71da799
We can replace the initial C
with K
and change only one bit out of the 48character string
because of the ASCII encodings:
C = 0x43 = 0100
0
011
K = 0x4b = 0100
1
011
That's a change of just one bit out of 384 bits in the input.
383 of the 384 bits, or 99.739583%, are identical.
But the resulting hash is very different:
$ echo n "Kryptography has a long and fascinating history."  openssl sha1 7215da622638fc3abdab4ffc91c6d2cf4c5599fe
Compare those two nearly identical inputs:
Cryptography has a long and fascinating history.
Kryptography has a long and fascinating history.
And now compare the two very different hash results,
first in hexadecimal (base 16) and then in binary:
d1557b56f7274faf16d38fdf9241035ab71da799 7215da622638fc3abdab4ffc91c6d2cf4c5599fe 110100010101010101111011010101101111011100100111010011111010111100010110110100111000111111011111100100100100000100000011010110101011011100011101101001111001100 111001000010101110110100110001000100110001110001111110000111010101111011010101101001111111111001001000111000110110100101100111101001100010101011001100111111110
One criteria for a good hash function is that a singlebit change in the input causes changes in about half of the output bits. That's what we're seeing here.
Let's use a much larger example input.
The King James translation of the Bible
is available as a plain text ASCII document
from Project Gutenberg.
That provides a commonly available, well known,
and fairly large data set.
According to the
wc
command
this file contains 4,445,260 bytes,
with 100,117 lines of text making up 823,156 words:
$ wc kjv10.txt 100117 823156 4445260 kjv10.txt
This document contains a specification for data integrity. The second and third to last verses of the entire document (that is, The Revelation of Saint John the Divine, chapter 22, verses 18 and 19) provide a relevant test point for investigating change detection. As included in the file downloaded from gutenberg.org those verses read:
22:18 For I testify unto every man that heareth the words of the prophecy of this book, If any man shall add unto these things, God shall add unto him the plagues that are written in this book: 22:19 And if any man shall take away from the words of the book of this prophecy, God shall take away his part out of the book of life, and out of the holy city, and from the things which are written in this book.
I have made a copy of that text file, changing the first
letter of 22:19 from an uppercase "A" to a lowercase "a".
That one changed character over 100,000 lines into the file
makes absolutely no change in the meaning of the text.
The ASCII codes for the characters
A
and
a
vary by only one bit:
Char Hex Binary A 0x41 10
00001 a 0x61 11
00001 $ wc kjv10.txt kjv10modified.txt 100117 823156 4445260 kjv10.txt 100117 823156 4445260 kjv10modified.txt 200234 1646312 8890520 total $ diff C 1 kjv10.txt kjv10modified.txt *** kjv10.txt  kjv10modified.txt *************** *** 100099,100101 **** shall add unto him the plagues that are written in this book: 22:19 !A
nd if any man shall take away from the words of the book of this prophecy, God shall take away his part out of the book of life, and  100099,100101  shall add unto him the plagues that are written in this book: 22:19 !a
nd if any man shall take away from the words of the book of this prophecy, God shall take away his part out of the book of life, and
One character out of 4,445,260, that's a 0.0000225% difference. But to be more accurate, the change is only one bit out of 35,562,080 bits, a 0.0000028% difference. Let's see what happens to the hash values for those files. We will calculate the MD5, SHA1, SHA2256, and SHA2512 hash values for the two files.
$ openssl md5 kjv10* # MD5 HASH98074d4575a9d6970cd2e69dfde4adab
kjv10modified.txta6d84a7fdfaa6906bae40e2c342e978f
kjv10.txt $ openssl sha1 kjv10* # SHA1 HASHafa2cd8f2e5797b1edde66a829bbbb20650b51fa
kjv10modified.txta4f709de0b5684bee071ac912432833a0c09db7d
kjv10.txt $ openssl sha256 kjv10* # SHA256 HASHf3df0027a219731d277f2ee5496a7351b7be2028ee5126128c89f40296b4b93d
kjv10modified.txtf9aad5f2fed41aa145ea03f6c40716cfeece5b61c6d5fb8350a48873a87614ca
kjv10.txt $ openssl sha512 kjv10* # SHA512 HASH699f43a0750dbb6b2104e402a3622a7491c18537bff18e083c60603edf2fb7b1cbf7f5eccbf36ca7fc20b12b6929d7cf230b671f0d04e1eae4f1601107499292
kjv10modified.txt29df561de4486a9c3e313c7e1515b104ed2d2cf17ff26ab255eba05bff47c5382ec438488f27d6497ad19f68b87fc942702997641f1eba672d6c6c11ce95938f
kjv10.txt
You can't reverse a hash, but you might find a collision
There is no such thing as the inverse of a hash function! There is no way to start with the hash of an input, and find an input for which you are certain that it was the input.
A typical hash function has a huge output space. You would expect to half to try half that many possible inputs before randomly stumbling across something that gave the same hash.
But even after you manage to find some input that produces the desired output, it is only one out of a possibly infinite number of possible inputs that produce the same hash!
A collision is a pair of different inputs
that produce the same hash output.
That is:
input1 != input2
but
hash(input1) = hash(input2)
A random collision would be the discovery of any two inputs with equal hashes.
A chosentarget collision is more difficult to
find.
That is the case where input1
or hash(input1)
is already specified, and the attacker needs to find
input2.
Another description of a good hash function is in "Deploying a New Hash Algorithm", by Steven M Bellovin and Eric K Rescorla, which follows the description of section 9.2 of the CRC Handbook of Applied Cryptography, Menezes, van Oorschot, and Vanstone, available as a free PDF download:

Collision resistance:
It is computationally infeasible to find
x
andy
wherex != y
andH(x) = H(y)
That is, it is impractically difficult to find a random collision. 
Preimage resistance:
Given a desired output
y,
it is computationally infeasible to findx
whereH(x) = y
That is, it is impractically difficult to find a chosentarget collision. 
Second preimage resistance:
Given an input
x',
it is computationally infeasible to findx
whereH(x) = H(x')
That is, it is impractically difficult to find a chosentarget collision.
Computers get faster and memory gets cheaper. So, a 2^{128} search space is no longer considered to be adequately large, and so the MD5 algorithm is not strong enough against bruteforce attacks with sophisticated modern hardware. What's worse, MD5 isn't even as strong as promised and researchers have published results for collisions (and see the results below for even more on this):
 Finding two input strings that have the same MD5 hash.
 Finding a second input string that has the same MD5 hash as a chosen "target" input string (a chosentarget collision, a much harder problem).
In early 2017 Google announced the first practical technique for generating a SHA1 collision, over 100,000 times faster than a bruteforce attack.
Use SHA2256 or SHA2512. Not only are the search spaces enormously larger, but SHA2 does not have the known weaknesses of MD5 and SHA1. Further details are available here.
Hash  Output bits  Possible outputs 
MD5  128  2^{128} ≈ 3.40×10^{38} 
SHA1  160  2^{160} ≈ 1.46×10^{48} 
SHA2256  256  2^{256} ≈ 1.16×10^{77} 
SHA2512  512  2^{512} ≈ 1.34×10^{154} 
That last number is awfully large, with 155 digits.
Searching theHash Output
Space
Do hash functions produce all possible outputs? Or might there be some "forbidden patterns"? Do hash functions enforce some form of parity preventing more than some maximum number of bits set to 0 or 1, or preventing sequential runs of identical bits longer than some threshold? Apparently not, although these "interesting" events take more and more bruteforce search to discover as they become more and more interesting. See the results of my bruteforce search for the details.
Yes, collisions are awfully unlikely to happen randomly, and still very difficult to find on purpose, but it is possible. Here are two 128byte data files with identical MD5 hashes! But their SHA1 hashes differ. Notice how similar the two files are, just six regularlyspaced bits differ:
$ openssl md5 cryptofile*.dat MD5(FILE1.DAT)= a4c0d35c95a63a805915367dcfe6b751 MD5(FILE2.DAT)= a4c0d35c95a63a805915367dcfe6b751 $ openssl sha1 cryptofile*.dat SHA1(FILE1.DAT)= 2783c4ff4a3f20d25f2598a8b052b890c37dcac4 SHA1(FILE2.DAT)= 3c35410823ef00b12d020981c1cf8564c0f89bcc $ hexdump C cryptofile1.dat 00000000 d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c gibberish 00000010 2f ca b5
87
12 46 7e ab 40 04 58 3e b8 fb 7f 89 deleted... 00000020 55 ad 34 06 09 f4 b3 02 83 e4 88 83 25
71
41 5a  00000030 08 51 25 e8 f7 cd c9 9f d9 1d bd
f2
80 37 3c 5b  00000040 96 0b 1d d1 dc 41 7b 9c e4 d8 97 f4 5a 65 55 d5  00000050 35 73 9a
c7
f0 eb fd 0c 30 29 f1 66 d1 09 b1 8f  00000060 75 27 7f 79 30 d5 5c eb 22 e8 ad ba 79
cc
15 5c  00000070 ed 74 cb dd 5f c5 d3 6d b1 9b 0a
d8
35 cc a7 e3  00000080 $ hexdump C cryptofile2.dat 00000000 d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c gibberish 00000010 2f ca b5
07
12 46 7e ab 40 04 58 3e b8 fb 7f 89 deleted... 00000020 55 ad 34 06 09 f4 b3 02 83 e4 88 83 25
f1
41 5a  00000030 08 51 25 e8 f7 cd c9 9f d9 1d bd
72
80 37 3c 5b  00000040 96 0b 1d d1 dc 41 7b 9c e4 d8 97 f4 5a 65 55 d5  00000050 35 73 9a
47
f0 eb fd 0c 30 29 f1 66 d1 09 b1 8f  00000060 75 27 7f 79 30 d5 5c eb 22 e8 ad ba 79
4c
15 5c  00000070 ed 74 cb dd 5f c5 d3 6d b1 9b 0a
58
35 cc a7 e3  00000080
If you want to play with these data files, rightclick and save them: File #1 File #2
Note that MD5 is a chained hash function, and this means that if you start with the above two 128byte files, and append identical sequences to each, the two resulting data files will also have identical MD5 hashes. So, an infinite number of collisions could be based on these.
Hash Collision Research
In the summer of 2004 some mathematicians presented a paper showing collision results for MD5 and others. See "Collisions for Hash Functions MD4, MD5, HAVAL128 and RIPEMD" by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu.
Someone else then published a paper on weaknesses in SHA0, which was a precursor to SHA1. Actually, SHA0 was the winning submission for the U.S. government's Secure Hash Algorithm, but the NSA specified a few tweaks for reasons they (of course) did not explain. For this intermediate work, see "NearCollisions of SHA0" by Eli Biham and Rafi Chen.
At the Crypto 2005 conference the following summer,
Xiaoyun Wang, Yiqun Yin, and Hongbo Yu (two authors of
the preceding summer's paper)
published attacks against full SHA1.
Then, at an informal meeting at that conference,
Xiaoyun Wang announced new results: the time complexity
was even less than the papers (so far) had announced.
Relevant papers include:
How to Break MD5 and Other Hash Functions
Efficient Collision Search Attacks on SHA0
Finding Collisions in the Full SHA1
Cryptanalysis of the Hash Functions MD4 and RIPEMD
An attack on hash function HAVAL128
Other papers by Xiaoyun Wang
Lucks and Daum have generated Postscript files that exploit the attack — the files have the same hash, although they contain contradictory messages!
Stevens, Lenstra and de Weger have generated several PDF files with very different contents but identical hashes as part of their general hash work.
In December, 2008, researchers reported finding MD5 collisions in one to two days using a system built from 200 Playstation 3's. Since six Certificate Authorities were still using MD5 signing on X.509 digital certificates, that means that with a day or two of computing you could generate forged certificates allowing a rogue system to masquerade as a "secure" server using one of those certificates. They generated an intermediate CA certificate so they could sign their own certificates. That means you would only need to do the computational attack once to generate as many apparently valid certificates as you want.
 Full details: slides from their presentation and so on.
 A demonstration of the certificate forgery using an intentionally backdated and expired forged Equifax certificate.
 A discussion at veracode.com.
 Further discussion from Information Security magazine.
 USCERT Vulnerability Note #836068 discusses this.
In 2012, Microsoft explained that the authors of the Flame malware used an MD5 collision to forge a Windows codesigning certificate.
In October 2015, Marc Stevens, Pierre Karpman, and Thomas Peyrin reported on discovering freestart collisions in SHA1 (this is a collision of the internal compression function, not all of SHA1 but effectively a SHA1 collision). They used a 64GPU cluster, and estimated the cost of finding SHA1 collisions to be US$ 75,000–120,000 by renting Amazon EC2 highperformance computing systems for a few months.
In February 2017 Google announced the first practical technique for generating a SHA1 collision based on Marc Stevens' earlier work. It's over 100,000 times faster than a bruteforce attack.
What does all this mean for us?
What does all that mean? The difficulty of finding random or chosentarget collisions is not as hard as we thought it was for anything in the family of hash functions including MD5 and SHA1. So we should conclude:
 Bad news: This research shows that the SHA1 hash function is significantly weaker than we thought!
 Good news: The numbers are still awfully large in comparison to current computer speeds. We're still talking theoretical rather than practical in most situations with today's hardware.
It is time to move away from SHA1 based designs, new systems should be designed around SHA2256 or SHA2512 with a plan to plug in replacement functions. Trusted implementations of SHA2256 and SHA2512 are available as part of the OpenSSL package, both commandline tools and shared libraries. And cryptologists spent several years developing and testing a replacement for SHA1 and the SHA2 family.
In February 2007 NIST announced a competition to replace the thencurrent FIPS1802, which specified SHA1, SHA2224, SHA2256, SHA2384, and SHA2512. FIPS1802 was issued in 2002, so the SHA family did not last as long as might have been expected. The new hash algorithms were selected in a process like that used for AES, the Advanced Encryption Standard, defined by government standard FIPS1402. Entries were submitted at the end of October, 2008, and the final selection was announced in October, 2012. You can read about the details of the contest if you're curious.
Good news for everyone came out of this. The development of the new hash function and its needed testing led to intense study of the SHA2 family (SHA2224, SHA2256, SHA2384, SHA2512). The weaknesses found in MD5 and SHA1 were not found in SHA2. So, developing SHA3 showed that it wasn't an emergency, SHA2 still seems to be quite good. But, we didn't know that until it was studied.
The Keccak algorithm was chosen as SHA3. Joan Daemen was on the Keccak development team, he was one of the two Belgian cryptographers behind the Rijndael cipher algorithm that was chosen as the Advanced Encryption Standard or AES.
Algorithm  Output Length 
MD1 [1]  128 
MD2 [1]  128 
MD3 [1]  128 
MD4 [1]  128 
MD5  128 
SHA1  160 
RIPEMD160  160 
SHA2224  224 
SHA3224  224 
SHA2256  256 
Snefru  256 
GOST [2]  256 
SHA3256  256 
SHA2384  384 
SHA3384  384 
EKSBlowfish [3]  448 
Whirlpool [4]  512 
SHA2512  512 
SHA3512  512 
[1]
MD1 and MD3 were computationally inefficient,
MD2 and MD4 were cryptographically flawed.
[2]
Государственный
Стандарт
Союза
ССР
or
Gosudarstvennyi Standart Soyuza SSR
[3]
Based on the Blowfish cipher, used in the
bcrypt
passwordbased key derivation function.
[4]
Designed by Vincent Rijmen, one of the Rijndael/AES
developers, and Paulo S.L.M. Barreto.
Also visit The Hashing Function Lounge.