Cryptographic Hash Functions
Basic Characteristics of Hash Functions
Put very simply, a hash function computes a short, fixed-size value from data; its purpose is to detect very slight changes to the data, or to determine whether or not two pieces of data are identical. Hash functions are detective security tools, not protective ones. Cryptographically useful hash functions have some specific characteristics. While more secure algorithms have been available for several years, many organizations persist in using weaker functions.
The output of any one hash function is always of the same length, regardless of the length of the input string. Stated informally, a cryptographically strong hash is one for which a very slight change in the input causes a drastic, unpredictable change throughout the output, so there is no shortcut for finding input strings that produce a specified hash value.
A good cryptographic hash has these four properties, where "infeasible" means "almost certainly beyond the capability of any adversary who must be kept from breaking the system for as long as the security is important":
1: It is easy to compute the hash value for any message.
2: It is infeasible to find a message that has a specified hash.
3: It is infeasible to modify a message without changing its hash.
4: It is infeasible to find two different messages with the same hash.
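A quick way to see the fixed-length property is a minimal Python sketch, using only the standard hashlib module: no matter how long the input, the output length never changes.

import hashlib

# SHA-256 always emits 32 bytes (256 bits), whatever the input size.
for message in [b"", b"short", b"a much longer input string" * 1000]:
    digest = hashlib.sha256(message).digest()
    print(len(message), "input bytes ->", len(digest) * 8, "output bits")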
Here is how to calculate the SHA-1 hash of the opening sentence of the CRC Handbook of Applied Cryptography (the -n flag keeps echo from appending a newline, which would change the hash):
$ echo -n "Cryptography has a long and fascinating history." | openssl sha1
d1557b56f7274faf16d38fdf9241035ab71da799
We can replace the initial C with K, changing only one bit
of the 48-character string, because of the ASCII encodings:
C = 0x43 = 0100 0 011
K = 0x4B = 0100 1 011
That's a change of just one bit out of 384 bits in the input.
383 of the 384 bits, or 99.739583%, are identical.
But the resulting hash is very different:
$ echo -n "Kryptography has a long and fascinating history." | openssl sha1 7215da622638fc3abdab4ffc91c6d2cf4c5599fe
Compare those two nearly identical inputs:
Cryptography has a long and fascinating history.
Kryptography has a long and fascinating history.
And now compare the two very different hash results,
first in hexadecimal (base 16) and then in binary:
d1557b56f7274faf16d38fdf9241035ab71da799
7215da622638fc3abdab4ffc91c6d2cf4c5599fe
1101000101010101011110110101011011110111001001110100111110101111000101101101001110001111110111111001001001000001000000110101101010110111000111011010011110011001
0111001000010101110110100110001000100110001110001111110000111010101111011010101101001111111111001001000111000110110100101100111101001100010101011001100111111110
One criterion for a good hash function is that a single-bit change in the input causes changes in about half of the output bits. That's what we're seeing here.
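We can count exactly how many output bits changed. Here is a minimal Python sketch, assuming only the standard hashlib module: it hashes both sentences and XORs the digests, so each 1 bit in the result marks an output bit that flipped. It should report something near 80 of the 160 bits.

import hashlib

a = hashlib.sha1(b"Cryptography has a long and fascinating history.").digest()
b = hashlib.sha1(b"Kryptography has a long and fascinating history.").digest()

# XOR matching bytes; each 1 bit in the result is a changed output bit.
changed = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
print(changed, "of 160 output bits changed")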
Let's use a much larger example input.
The King James translation of the Bible
is available as a plain text ASCII document
from Project Gutenberg.
That provides a commonly available, well known,
and fairly large data set.
According to the
wc
command
this file contains 4,445,260 bytes,
with 100,117 lines of text making up 823,156 words:
$ wc kjv10.txt
 100117  823156 4445260 kjv10.txt
This document even contains its own specification for data integrity. Verses 18 and 19 of the final chapter (that is, The Revelation of Saint John the Divine, chapter 22) provide a relevant test point for investigating change detection. As included in the file downloaded from gutenberg.org, those verses read:
22:18 For I testify unto every man that heareth the words of the prophecy of this book, If any man shall add unto these things, God shall add unto him the plagues that are written in this book: 22:19 And if any man shall take away from the words of the book of this prophecy, God shall take away his part out of the book of life, and out of the holy city, and from the things which are written in this book.
I have made a copy of that text file, changing the first
letter of 22:19 from an upper-case "A" to a lower-case "a".
That one changed character over 100,000 lines into the file
makes absolutely no change in the meaning of the text.
The ASCII codes for the characters A and a vary by only one bit:
Char  Hex   Binary
A     0x41  1000001
a     0x61  1100001
$ wc kjv10.txt kjv10-modified.txt
 100117  823156 4445260 kjv10.txt
 100117  823156 4445260 kjv10-modified.txt
 200234 1646312 8890520 total
$ diff -C 1 kjv10.txt kjv10-modified.txt
*** kjv10.txt
--- kjv10-modified.txt
***************
*** 100099,100101 ****
  shall add unto him the plagues that are written in this book: 22:19
! And if any man shall take away from the words of the book of this
  prophecy, God shall take away his part out of the book of life, and
--- 100099,100101 ----
  shall add unto him the plagues that are written in this book: 22:19
! and if any man shall take away from the words of the book of this
  prophecy, God shall take away his part out of the book of life, and
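To verify that the two files really differ in just one bit, here is a minimal Python sketch, assuming the two file names used above; it XORs the files byte by byte and counts the 1 bits.

# Count the bits that differ between the original and modified files.
a = open("kjv10.txt", "rb").read()
b = open("kjv10-modified.txt", "rb").read()
print(sum(bin(x ^ y).count("1") for x, y in zip(a, b)), "differing bit(s)")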
One character out of 4,445,260: that's a 0.0000225% difference. More precisely, the change is only one bit out of 35,562,080 bits, a 0.0000028% difference. Let's see what happens to the hash values for those files. We will calculate the MD5, SHA-1, SHA-2-256, and SHA-2-512 hash values for the two files.
$ openssl md5 kjv10*
MD5(kjv10-modified.txt)= 98074d4575a9d6970cd2e69dfde4adab
MD5(kjv10.txt)= a6d84a7fdfaa6906bae40e2c342e978f
$ openssl sha1 kjv10*
SHA1(kjv10-modified.txt)= afa2cd8f2e5797b1edde66a829bbbb20650b51fa
SHA1(kjv10.txt)= a4f709de0b5684bee071ac912432833a0c09db7d
$ openssl sha256 kjv10*
SHA256(kjv10-modified.txt)= f3df0027a219731d277f2ee5496a7351b7be2028ee5126128c89f40296b4b93d
SHA256(kjv10.txt)= f9aad5f2fed41aa145ea03f6c40716cfeece5b61c6d5fb8350a48873a87614ca
$ openssl sha512 kjv10*
SHA512(kjv10-modified.txt)= 699f43a0750dbb6b2104e402a3622a7491c18537bff18e083c60603edf2fb7b1cbf7f5eccbf36ca7fc20b12b6929d7cf230b671f0d04e1eae4f1601107499292
SHA512(kjv10.txt)= 29df561de4486a9c3e313c7e1515b104ed2d2cf17ff26ab255eba05bff47c5382ec438488f27d6497ad19f68b87fc942702997641f1eba672d6c6c11ce95938f
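The openssl command reads the file for you, but if you compute hashes in your own programs, don't load a multi-megabyte file into memory all at once. Here is a minimal Python sketch, assuming only the standard hashlib module, that hashes a file in 64 kB chunks:

import hashlib

def sha256_of_file(path, chunk_size=65536):
    # Feed the file to the hash in chunks; update() chains the internal
    # state, so the result equals hashing the whole file at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of_file("kjv10.txt"))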
You can't reverse a hash, but you might find a collision
There is no such thing as the inverse of a hash function! There is no way to start with a hash value and recover the input, or even to find an input that you can be certain was the original one.
A typical hash function has a huge output space. You would expect to try about half that many possible inputs before randomly stumbling across one that produces the specified hash.
But even after you manage to find some input that produces the desired output, it is only one out of a possibly infinite number of possible inputs that produce the same hash!
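To get a feel for the numbers, here is a minimal Python sketch, assuming only the standard hashlib module; the target string is an arbitrary stand-in. It brute-forces a preimage for just the first 20 bits of a SHA-256 value, which takes on the order of a million tries. Every additional bit doubles the expected work, so extending this to all 256 bits is hopeless.

import hashlib

# Target: only the first 5 hex digits (20 bits) of a SHA-256 hash.
target = hashlib.sha256(b"some unknown input").hexdigest()[:5]

i = 0
while hashlib.sha256(str(i).encode()).hexdigest()[:5] != target:
    i += 1
print("matched the 20-bit prefix after", i + 1, "tries")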
A collision is a pair of different inputs
that produce the same hash output.
That is:
input1 != input2
but
hash(input1) = hash(input2)
A random collision would be the discovery of any two inputs with equal hashes.
A chosen-target collision is more difficult to find:
that is the case where input1, or hash(input1),
is already specified, and the attacker needs to find input2.
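The birthday paradox makes random collisions much cheaper than chosen targets: for an n-bit hash you expect a random collision after roughly 2^(n/2) attempts, not 2^n. A minimal Python sketch, assuming only the standard hashlib module, demonstrates this on a hash truncated to 32 bits, where a collision typically appears within about a hundred thousand tries:

import hashlib

# Truncate SHA-256 to 32 bits, then hash counters until two collide.
seen = {}
i = 0
while True:
    h = hashlib.sha256(str(i).encode()).digest()[:4]  # keep 32 bits
    if h in seen:
        print("inputs", seen[h], "and", i, "collide on", h.hex())
        break
    seen[h] = i
    i += 1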
Another description of a good hash function appears in "Deploying a New Hash Algorithm" by Steven M. Bellovin and Eric K. Rescorla, which follows the description in section 9.2 of the CRC Handbook of Applied Cryptography by Menezes, van Oorschot, and Vanstone, available as a free PDF download:
- Collision resistance: It is computationally infeasible to find x and y where x != y and H(x) = H(y). That is, it is impractically difficult to find a random collision.
- Preimage resistance: Given a desired output y, it is computationally infeasible to find x where H(x) = y. That is, it is impractically difficult to find a chosen-target collision.
- Second preimage resistance: Given an input x', it is computationally infeasible to find a different x where H(x) = H(x'). That is, it is impractically difficult to find a chosen-target collision.
Computers get faster and memory gets cheaper. So a 2^128 search space is no longer considered to be adequately large, and the MD5 algorithm is not strong enough against brute-force attacks with sophisticated modern hardware. What's worse, MD5 isn't even as strong as promised. Researchers have published results for collisions (and see the results below for even more on this):
- Finding two input strings that have the same MD5 hash.
- Finding a second input string that has the same MD5 hash as a chosen "target" input string (a chosen-target collision, a much harder problem).
In early 2017 Google announced the first practical technique for generating a SHA-1 collision, over 100,000 times faster than a brute-force attack.
Use SHA-2-256 or SHA-2-512. Not only are the search spaces enormously larger, but SHA-2 does not have the known weaknesses of MD5 and SHA-1. Further details appear in the hash collision research section below.
Hash | Output bits | Possible outputs |
MD5 | 128 | 2^128 ≈ 3.40×10^38 |
SHA-1 | 160 | 2^160 ≈ 1.46×10^48 |
SHA-2-256 | 256 | 2^256 ≈ 1.16×10^77 |
SHA-2-512 | 512 | 2^512 ≈ 1.34×10^154 |
That last number is awfully large, with 155 digits.
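You can check these figures for yourself with Python's arbitrary-precision integers; a minimal sketch:

# 2**512 written out in full has 155 decimal digits;
# 2**128 has 39, i.e. about 3.40 x 10**38.
print(len(str(2**512)))   # 155
print(len(str(2**128)))   # 39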
Searching the Hash Output Space
Do hash functions produce all possible outputs? Or might there be some "forbidden patterns"? Do hash functions enforce some form of parity preventing more than some maximum number of bits set to 0 or 1, or preventing sequential runs of identical bits longer than some threshold? Apparently not, although these "interesting" events take more and more brute-force search to discover as they become more and more interesting. See the results of my brute-force search for the details.
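If you want to try such a search yourself, here is a minimal Python sketch, assuming only the standard hashlib module and using a simple counter as the input stream; it reports each new record run of leading zero bits. Longer runs do appear, but each extra bit roughly doubles the search time.

import hashlib

best = 0
for i in range(1_000_000):
    digest = hashlib.sha256(str(i).encode()).digest()
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(256)
    zeros = len(bits) - len(bits.lstrip("0"))   # leading zero bits
    if zeros > best:
        best = zeros
        print("input", i, "has", zeros, "leading zero bits")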
Yes, collisions are awfully unlikely to happen randomly, and still very difficult to find on purpose, but it is possible. Here are two 128-byte data files with identical MD5 hashes, although their SHA-1 hashes differ. Notice how similar the two files are: just six regularly-spaced bits differ:
$ openssl md5 crypto-file*.dat
MD5(crypto-file1.dat)= a4c0d35c95a63a805915367dcfe6b751
MD5(crypto-file2.dat)= a4c0d35c95a63a805915367dcfe6b751
$ openssl sha1 crypto-file*.dat
SHA1(crypto-file1.dat)= 2783c4ff4a3f20d25f2598a8b052b890c37dcac4
SHA1(crypto-file2.dat)= 3c35410823ef00b12d020981c1cf8564c0f89bcc
$ hexdump -C crypto-file1.dat
00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish deleted|
00000010  2f ca b5 87 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |...|
00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 71 41 5a  |...|
00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd f2 80 37 3c 5b  |...|
00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |...|
00000050  35 73 9a c7 f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |...|
00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 cc 15 5c  |...|
00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a d8 35 cc a7 e3  |...|
00000080
$ hexdump -C crypto-file2.dat
00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish deleted|
00000010  2f ca b5 07 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |...|
00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 f1 41 5a  |...|
00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd 72 80 37 3c 5b  |...|
00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |...|
00000050  35 73 9a 47 f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |...|
00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 4c 15 5c  |...|
00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a 58 35 cc a7 e3  |...|
00000080
If you want to play with these data files, save your own local copies of them.
Note that MD5 is a chained hash function: each 64-byte block is mixed into an internal state that is passed along to the next block. Since these two 128-byte files leave that internal state identical, appending identical sequences to each produces two new data files that also have identical MD5 hashes. So an infinite number of collisions can be built from these, as the sketch below shows.
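Here is a minimal Python sketch of that extension property, assuming the two colliding files above have been saved locally as crypto-file1.dat and crypto-file2.dat (hypothetical local names):

import hashlib

a = open("crypto-file1.dat", "rb").read()
b = open("crypto-file2.dat", "rb").read()
suffix = b"any common suffix at all"

# The original files collide...
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())
# ...and so do the extended files, since MD5 chains its internal
# state and both files leave that state identical.
print(hashlib.md5(a + suffix).hexdigest() == hashlib.md5(b + suffix).hexdigest())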
Hash Collision Research
In the summer of 2004 some mathematicians presented a paper showing collision results for MD5 and others. See "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD" by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu.
Someone else then published a paper on weaknesses in SHA-0, a precursor to SHA-1. SHA-0 was the original version of the U.S. government's Secure Hash Algorithm, but the NSA soon specified a few tweaks, for reasons they (of course) did not explain, producing SHA-1. For this intermediate work, see "Near-Collisions of SHA-0" by Eli Biham and Rafi Chen.
At the Crypto 2005 conference the following summer,
Xiaoyun Wang, Yiqun Yin, and Hongbo Yu (two authors of
the preceding summer's paper)
published attacks against full SHA-1.
Then, at an informal meeting at that conference,
Xiaoyun Wang announced new results: the time complexity
was even less than the papers (so far) had announced.
Relevant papers include:
How to Break MD5 and Other Hash Functions
Efficient Collision Search Attacks on SHA-0
Finding Collisions in the Full SHA-1
Cryptanalysis of the Hash Functions MD4 and RIPEMD
An attack on hash function HAVAL-128
Other papers by Xiaoyun Wang
Lucks and Daum have generated PostScript files that demonstrate the attack. The files have the same hash, although they contain contradictory messages!
Stevens, Lenstra and de Weger have generated several PDF files with very different contents but identical hashes as part of their general hash research.
In December 2008, researchers reported finding MD5
collisions in one to two days using a system built from
200 PlayStation 3s.
Since six Certificate Authorities were still using MD5
signatures on X.509 digital certificates, a day or two
of computing was enough to generate forged certificates
allowing a rogue system to masquerade as a "secure"
server using one of those certificates.
They generated an intermediate CA certificate
so they could sign their own certificates.
That means you would only need to do
the computational attack once to generate
as many apparently valid certificates as you want.
Discussion at veracode.com
Discussion from Information Security
US-CERT Vulnerability Note #836068
In 2012, Microsoft explained that the authors of the Flame malware used an MD5 collision to forge a Windows code-signing certificate.
In October 2015, Marc Stevens, Pierre Karpman, and Thomas Peyrin reported on discovering freestart collisions in SHA-1 (a collision of the internal compression function rather than of full SHA-1, but a major step toward one). They used a 64-GPU cluster, and estimated the cost of finding full SHA-1 collisions at US$ 75,000–120,000 by renting Amazon EC2 high-performance computing systems for a few months.
In February 2017 Google announced the first practical technique for generating a SHA-1 collision based on Marc Stevens' earlier work. It's over 100,000 times faster than a brute-force attack.
What does all this mean for us?
The difficulty of finding random or chosen-target collisions is not as great as we thought for the family of hash functions that includes MD5 and SHA-1. So we should conclude:
- Bad news: This research shows that the SHA-1 hash function is significantly weaker than we thought!
- Good news: The numbers are still awfully large in comparison to current computer speeds. We're still talking theoretical rather than practical in most situations with today's hardware.
It is time to move away from SHA-1 based designs. New systems should be designed around SHA-2-256 or SHA-2-512, with a plan to plug in replacement functions. Trusted implementations of SHA-2-256 and SHA-2-512 are available as part of the OpenSSL package, both command-line tools and shared libraries. And cryptologists spent several years developing and testing a replacement for SHA-1 and the SHA-2 family.
In 2007 NIST announced a competition to replace the then-current FIPS-180-2, which specified SHA-1, SHA-2-224, SHA-2-256, SHA-2-384, and SHA-2-512. FIPS-180-2 was issued in 2002, so the SHA family did not last as long as might have been expected. The new hash algorithm was selected in a process like that used for AES, the Advanced Encryption Standard, defined by government standard FIPS-197. Entries were submitted at the end of October 2008, and the final selection was announced in October 2012. You can read about the details of the contest if you're curious.
Good news for everyone came out of this. The development of the new hash function and its needed testing led to intense study of the SHA-2 family (SHA-2-224, SHA-2-256, SHA-2-384, SHA-2-512). The weaknesses found in MD5 and SHA-1 were not found in SHA-2. So developing SHA-3 showed that there was no emergency: SHA-2 still seems to be quite good. But we didn't know that until it was studied.
The Keccak algorithm was chosen as SHA-3. Joan Daemen of the Keccak development team was one of the two Belgian cryptographers behind the Rijndael cipher algorithm that was chosen as the Advanced Encryption Standard, or AES.
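SHA-3 is now widely available; for example, Python's hashlib module has included it since Python 3.6. A minimal sketch, assuming that module, showing that SHA-3-256 and SHA-2-256 are entirely different functions despite their equal output lengths:

import hashlib

message = b"Cryptography has a long and fascinating history."
# Same input, same output length, completely unrelated digests.
print("SHA-2-256:", hashlib.sha256(message).hexdigest())
print("SHA-3-256:", hashlib.sha3_256(message).hexdigest())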
Algorithm | Output length (bits) |
MD1 [1] | 128 |
MD2 [1] | 128 |
MD3 [1] | 128 |
MD4 [1] | 128 |
MD5 | 128 |
SHA-1 | 160 |
RIPEMD-160 | 160 |
SHA-2-224 | 224 |
SHA-3-224 | 224 |
SHA-2-256 | 256 |
Snefru | 256 |
GOST [2] | 256 |
SHA-3-256 | 256 |
SHA-2-384 | 384 |
SHA-3-384 | 384 |
EKS-Blowfish [3] | 448 |
Whirlpool [4] | 512 |
SHA-2-512 | 512 |
SHA-3-512 | 512 |
[1] MD1 and MD3 were computationally inefficient; MD2 and MD4 were cryptographically flawed.
[2] Государственный Стандарт Союза ССР, or Gosudarstvennyi Standart Soyuza SSR: "State Standard of the Union of SSR", that is, of the USSR.
[3] Based on the Blowfish cipher, used in the bcrypt password-based key derivation function.
[4] Designed by Vincent Rijmen, one of the Rijndael/AES developers, and Paulo S.L.M. Barreto.
Also visit The Hashing Function Lounge.