Cryptographic Hash Functions
Basic Characteristics of Hash Functions
Put very simply, a hash function computes a short, fixed-size value from data; its purpose is to detect very slight changes to the data, or to determine whether or not two pieces of data are identical. Hash functions are detective security tools, not protective ones. Cryptographically useful hash functions have some specific characteristics. While more secure algorithms have been available for several years, many organizations persist in using weaker functions.
The output of any one hash function is always of the same length, regardless of the length of the input string. Stated informally, a cryptographically strong hash is one for which a very slight change in the input causes a drastic, unpredictable change throughout the output, so there is no shortcut for finding input strings that produce a specified hash value.
A good cryptographic hash has these four properties, where "infeasible" means "almost certainly beyond the capability of any adversary who must be kept from breaking the system for as long as the security is important":
1: It is easy to compute the hash value for any message.
2: It is infeasible to find a message that has a specified hash.
3: It is infeasible to modify a message without changing its hash.
4: It is infeasible to find two different messages with the same hash.
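A quick way to see the fixed-length property is a minimal Python sketch, using only the standard hashlib module: no matter how long the input, the output length never changes.

import hashlib

# SHA-256 always emits 32 bytes (256 bits), whatever the input size.
for message in [b"", b"short", b"a much longer input string" * 1000]:
    digest = hashlib.sha256(message).digest()
    print(len(message), "input bytes ->", len(digest) * 8, "output bits")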
Here is how to calculate the SHA-1 hash of the opening sentence of the CRC Handbook of Applied Cryptography (the -n flag keeps echo from appending a newline, which would change the hash):
$ echo -n "Cryptography has a long and fascinating history." | openssl sha1
d1557b56f7274faf16d38fdf9241035ab71da799
We can replace the initial C with K, changing only one bit
of the 48-character string, because of the ASCII encodings:
C = 0x43 = 0100 0 011
K = 0x4B = 0100 1 011
That's a change of just one bit out of 384 bits in the input.
383 of the 384 bits, or 99.739583%, are identical.
But the resulting hash is very different:
$ echo -n "Kryptography has a long and fascinating history." | openssl sha1 7215da622638fc3abdab4ffc91c6d2cf4c5599fe
Compare those two nearly identical inputs:
Cryptography has a long and fascinating history.
Kryptography has a long and fascinating history.
And now compare the two very different hash results,
first in hexadecimal (base 16) and then in binary:
d1557b56f7274faf16d38fdf9241035ab71da799
7215da622638fc3abdab4ffc91c6d2cf4c5599fe
1101000101010101011110110101011011110111001001110100111110101111000101101101001110001111110111111001001001000001000000110101101010110111000111011010011110011001
0111001000010101110110100110001000100110001110001111110000111010101111011010101101001111111111001001000111000110110100101100111101001100010101011001100111111110
One criterion for a good hash function is that a single-bit change in the input causes changes in about half of the output bits. That's what we're seeing here.
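We can count exactly how many output bits changed. Here is a minimal Python sketch, assuming only the standard hashlib module: it hashes both sentences and XORs the digests, so each 1 bit in the result marks an output bit that flipped. It should report something near 80 of the 160 bits.

import hashlib

a = hashlib.sha1(b"Cryptography has a long and fascinating history.").digest()
b = hashlib.sha1(b"Kryptography has a long and fascinating history.").digest()

# XOR matching bytes; each 1 bit in the result is a changed output bit.
changed = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
print(changed, "of 160 output bits changed")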
Let's use a much larger example input.
The King James translation of the Bible
is available as a plain text ASCII document
from Project Gutenberg.
That provides a commonly available, well known,
and fairly large data set.
According to the
wc
command
this file contains 4,445,260 bytes,
with 100,117 lines of text making up 823,156 words:
$ wc kjv10.txt
 100117  823156 4445260 kjv10.txt
This document even contains its own specification for data integrity. Verses 18 and 19 of the final chapter (that is, The Revelation of Saint John the Divine, chapter 22) provide a relevant test point for investigating change detection. As included in the file downloaded from gutenberg.org, those verses read:
22:18 For I testify unto every man that heareth the words of the prophecy of this book, If any man shall add unto these things, God shall add unto him the plagues that are written in this book: 22:19 And if any man shall take away from the words of the book of this prophecy, God shall take away his part out of the book of life, and out of the holy city, and from the things which are written in this book.
I have made a copy of that text file, changing the first
letter of 22:19 from an upper-case "A" to a lower-case "a".
That one changed character over 100,000 lines into the file
makes absolutely no change in the meaning of the text.
The ASCII codes for the characters A and a vary by only one bit:
Char  Hex   Binary
A     0x41  1000001
a     0x61  1100001
$ wc kjv10.txt kjv10-modified.txt
 100117  823156 4445260 kjv10.txt
 100117  823156 4445260 kjv10-modified.txt
 200234 1646312 8890520 total
$ diff -C 1 kjv10.txt kjv10-modified.txt
*** kjv10.txt
--- kjv10-modified.txt
***************
*** 100099,100101 ****
  shall add unto him the plagues that are written in this book: 22:19
! And if any man shall take away from the words of the book of this
  prophecy, God shall take away his part out of the book of life, and
--- 100099,100101 ----
  shall add unto him the plagues that are written in this book: 22:19
! and if any man shall take away from the words of the book of this
  prophecy, God shall take away his part out of the book of life, and
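To verify that the two files really differ in just one bit, here is a minimal Python sketch, assuming the two file names used above; it XORs the files byte by byte and counts the 1 bits.

# Count the bits that differ between the original and modified files.
a = open("kjv10.txt", "rb").read()
b = open("kjv10-modified.txt", "rb").read()
print(sum(bin(x ^ y).count("1") for x, y in zip(a, b)), "differing bit(s)")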
One character out of 4,445,260: that's a 0.0000225% difference. More precisely, the change is only one bit out of 35,562,080 bits, a 0.0000028% difference. Let's see what happens to the hash values for those files. We will calculate the MD5, SHA-1, SHA-2-256, and SHA-2-512 hash values for the two files.
$ openssl md5 kjv10*
MD5(kjv10-modified.txt)= 98074d4575a9d6970cd2e69dfde4adab
MD5(kjv10.txt)= a6d84a7fdfaa6906bae40e2c342e978f
$ openssl sha1 kjv10*
SHA1(kjv10-modified.txt)= afa2cd8f2e5797b1edde66a829bbbb20650b51fa
SHA1(kjv10.txt)= a4f709de0b5684bee071ac912432833a0c09db7d
$ openssl sha256 kjv10*
SHA256(kjv10-modified.txt)= f3df0027a219731d277f2ee5496a7351b7be2028ee5126128c89f40296b4b93d
SHA256(kjv10.txt)= f9aad5f2fed41aa145ea03f6c40716cfeece5b61c6d5fb8350a48873a87614ca
$ openssl sha512 kjv10*
SHA512(kjv10-modified.txt)= 699f43a0750dbb6b2104e402a3622a7491c18537bff18e083c60603edf2fb7b1cbf7f5eccbf36ca7fc20b12b6929d7cf230b671f0d04e1eae4f1601107499292
SHA512(kjv10.txt)= 29df561de4486a9c3e313c7e1515b104ed2d2cf17ff26ab255eba05bff47c5382ec438488f27d6497ad19f68b87fc942702997641f1eba672d6c6c11ce95938f
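The openssl command reads the file for you, but if you compute hashes in your own programs, don't load a multi-megabyte file into memory all at once. Here is a minimal Python sketch, assuming only the standard hashlib module, that hashes a file in 64 kB chunks:

import hashlib

def sha256_of_file(path, chunk_size=65536):
    # Feed the file to the hash in chunks; update() chains the internal
    # state, so the result equals hashing the whole file at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of_file("kjv10.txt"))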
You can't reverse a hash, but you might find a collision
There is no such thing as the inverse of a hash function! There is no way to start with a hash value and recover the input, or even to find an input that you can be certain was the original one.
A typical hash function has a huge output space. You would expect to try about half that many possible inputs before randomly stumbling across one that produces the specified hash.
But even after you manage to find some input that produces the desired output, it is only one out of a possibly infinite number of possible inputs that produce the same hash!
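To get a feel for the numbers, here is a minimal Python sketch, assuming only the standard hashlib module; the target string is an arbitrary stand-in. It brute-forces a preimage for just the first 20 bits of a SHA-256 value, which takes on the order of a million tries. Every additional bit doubles the expected work, so extending this to all 256 bits is hopeless.

import hashlib

# Target: only the first 5 hex digits (20 bits) of a SHA-256 hash.
target = hashlib.sha256(b"some unknown input").hexdigest()[:5]

i = 0
while hashlib.sha256(str(i).encode()).hexdigest()[:5] != target:
    i += 1
print("matched the 20-bit prefix after", i + 1, "tries")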
A collision is a pair of different inputs
that produce the same hash output.
That is:
input1 != input2
but
hash(input1) = hash(input2)
A random collision would be the discovery of any two inputs with equal hashes.
A chosen-target collision is more difficult to find:
that is the case where input1, or hash(input1),
is already specified, and the attacker needs to find input2.
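The birthday paradox makes random collisions much cheaper than chosen targets: for an n-bit hash you expect a random collision after roughly 2^(n/2) attempts, not 2^n. A minimal Python sketch, assuming only the standard hashlib module, demonstrates this on a hash truncated to 32 bits, where a collision typically appears within about a hundred thousand tries:

import hashlib

# Truncate SHA-256 to 32 bits, then hash counters until two collide.
seen = {}
i = 0
while True:
    h = hashlib.sha256(str(i).encode()).digest()[:4]  # keep 32 bits
    if h in seen:
        print("inputs", seen[h], "and", i, "collide on", h.hex())
        break
    seen[h] = i
    i += 1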
Another description of a good hash function appears in "Deploying a New Hash Algorithm" by Steven M. Bellovin and Eric K. Rescorla, which follows the description in section 9.2 of the CRC Handbook of Applied Cryptography by Menezes, van Oorschot, and Vanstone, available as a free PDF download:
- Collision resistance: It is computationally infeasible to find x and y where x != y and H(x) = H(y). That is, it is impractically difficult to find a random collision.
- Preimage resistance: Given a desired output y, it is computationally infeasible to find x where H(x) = y. That is, it is impractically difficult to find a chosen-target collision.
- Second preimage resistance: Given an input x', it is computationally infeasible to find a different x where H(x) = H(x'). That is, it is impractically difficult to find a chosen-target collision.
Computers get faster and memory gets cheaper. So a 2^128 search space is no longer considered to be adequately large, and the MD5 algorithm is not strong enough against brute-force attacks with sophisticated modern hardware. What's worse, MD5 isn't even as strong as promised. Researchers have published results for collisions (and see the results below for even more on this):
- Finding two input strings that have the same MD5 hash.
- Finding a second input string that has the same MD5 hash as a chosen "target" input string (a chosen-target collision, a much harder problem).
In early 2017 Google announced the first practical technique for generating a SHA-1 collision, over 100,000 times faster than a brute-force attack.
Use SHA-2-256 or SHA-2-512. Not only are the search spaces enormously larger, but SHA-2 does not have the known weaknesses of MD5 and SHA-1. Further details appear in the hash collision research section below.
Hash | Output bits | Possible outputs |
MD5 | 128 | 2^128 ≈ 3.40×10^38 |
SHA-1 | 160 | 2^160 ≈ 1.46×10^48 |
SHA-2-256 | 256 | 2^256 ≈ 1.16×10^77 |
SHA-2-512 | 512 | 2^512 ≈ 1.34×10^154 |
That last number is awfully large, with 155 digits.
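You can check these figures for yourself with Python's arbitrary-precision integers; a minimal sketch:

# 2**512 written out in full has 155 decimal digits;
# 2**128 has 39, i.e. about 3.40 x 10**38.
print(len(str(2**512)))   # 155
print(len(str(2**128)))   # 39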
Searching the Hash Output Space
Do hash functions produce all possible outputs? Or might there be some "forbidden patterns"? Do hash functions enforce some form of parity preventing more than some maximum number of bits set to 0 or 1, or preventing sequential runs of identical bits longer than some threshold? Apparently not, although these "interesting" events take more and more brute-force search to discover as they become more and more interesting. See the results of my brute-force search for the details.
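If you want to try such a search yourself, here is a minimal Python sketch, assuming only the standard hashlib module and using a simple counter as the input stream; it reports each new record run of leading zero bits. Longer runs do appear, but each extra bit roughly doubles the search time.

import hashlib

best = 0
for i in range(1_000_000):
    digest = hashlib.sha256(str(i).encode()).digest()
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(256)
    zeros = len(bits) - len(bits.lstrip("0"))   # leading zero bits
    if zeros > best:
        best = zeros
        print("input", i, "has", zeros, "leading zero bits")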
Yes, collisions are awfully unlikely to happen randomly, and still very difficult to find on purpose, but it is possible. Here are two 128-byte data files with identical MD5 hashes, although their SHA-1 hashes differ. Notice how similar the two files are: just six regularly-spaced bits differ:
$ openssl md5 crypto-file*.dat
MD5(crypto-file1.dat)= a4c0d35c95a63a805915367dcfe6b751
MD5(crypto-file2.dat)= a4c0d35c95a63a805915367dcfe6b751
$ openssl sha1 crypto-file*.dat
SHA1(crypto-file1.dat)= 2783c4ff4a3f20d25f2598a8b052b890c37dcac4
SHA1(crypto-file2.dat)= 3c35410823ef00b12d020981c1cf8564c0f89bcc
$ hexdump -C crypto-file1.dat
00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish deleted|
00000010  2f ca b5 87 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |...|
00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 71 41 5a  |...|
00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd f2 80 37 3c 5b  |...|
00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |...|
00000050  35 73 9a c7 f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |...|
00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 cc 15 5c  |...|
00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a d8 35 cc a7 e3  |...|
00000080
$ hexdump -C crypto-file2.dat
00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish deleted|
00000010  2f ca b5 07 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |...|
00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 f1 41 5a  |...|
00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd 72 80 37 3c 5b  |...|
00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |...|
00000050  35 73 9a 47 f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |...|
00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 4c 15 5c  |...|
00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a 58 35 cc a7 e3  |...|
00000080
If you want to play with these data files, save your own local copies of them.
Note that MD5 is a chained hash function: each 64-byte block is mixed into an internal state that is passed along to the next block. Since these two 128-byte files leave that internal state identical, appending identical sequences to each produces two new data files that also have identical MD5 hashes. So an infinite number of collisions can be built from these, as the sketch below shows.
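Here is a minimal Python sketch of that extension property, assuming the two colliding files above have been saved locally as crypto-file1.dat and crypto-file2.dat (hypothetical local names):

import hashlib

a = open("crypto-file1.dat", "rb").read()
b = open("crypto-file2.dat", "rb").read()
suffix = b"any common suffix at all"

# The original files collide...
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())
# ...and so do the extended files, since MD5 chains its internal
# state and both files leave that state identical.
print(hashlib.md5(a + suffix).hexdigest() == hashlib.md5(b + suffix).hexdigest())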
Hash Collision Research
In the summer of 2004 some mathematicians presented a paper showing collision results for MD5 and others. See "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD" by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu.
Someone else then published a paper on weaknesses in SHA-0, a precursor to SHA-1. SHA-0 was the original version of the U.S. government's Secure Hash Algorithm, but the NSA soon specified a few tweaks, for reasons they (of course) did not explain, producing SHA-1. For this intermediate work, see "Near-Collisions of SHA-0" by Eli Biham and Rafi Chen.
At the Crypto 2005 conference the following summer,
Xiaoyun Wang, Yiqun Yin, and Hongbo Yu (two authors of
the preceding summer's paper)
published attacks against full SHA-1.
Then, at an informal meeting at that conference,
Xiaoyun Wang announced new results: the time complexity
was even less than the papers (so far) had announced.
Relevant papers include:
How to Break MD5 and Other Hash Functions
Efficient Collision Search Attacks on SHA-0
Finding Collisions in the Full SHA-1
Cryptanalysis of the Hash Functions MD4 and RIPEMD
An attack on hash function HAVAL-128
Other papers by Xiaoyun Wang
Lucks and Daum have generated PostScript files that demonstrate the attack. The files have the same hash, although they contain contradictory messages!
Stevens, Lenstra and de Weger have generated several PDF files with very different contents but identical hashes as part of their general hash research.
In December 2008, researchers reported finding MD5
collisions in one to two days using a system built from
200 PlayStation 3s.
Since six Certificate Authorities were still using MD5
signatures on X.509 digital certificates, a day or two
of computing was enough to generate forged certificates
allowing a rogue system to masquerade as a "secure"
server using one of those certificates.
They generated an intermediate CA certificate
so they could sign their own certificates.
That means you would only need to do
the computational attack once to generate
as many apparently valid certificates as you want.
Discussion at veracode.com
Discussion from Information Security
US-CERT Vulnerability Note #836068
In 2012, Microsoft explained that the authors of the Flame malware used an MD5 collision to forge a Windows code-signing certificate.
In October 2015, Marc Stevens, Pierre Karpman, and Thomas Peyrin reported on discovering freestart collisions in SHA-1 (a collision of the internal compression function rather than of full SHA-1, but a major step toward one). They used a 64-GPU cluster, and estimated the cost of finding full SHA-1 collisions at US$ 75,000–120,000 by renting Amazon EC2 high-performance computing systems for a few months.
In February 2017 Google announced the first practical technique for generating a SHA-1 collision based on Marc Stevens' earlier work. It's over 100,000 times faster than a brute-force attack.
What does all this mean for us?
The difficulty of finding random or chosen-target collisions is not as great as we thought for the family of hash functions that includes MD5 and SHA-1. So we should conclude:
- Bad news: This research shows that the SHA-1 hash function is significantly weaker than we thought!
- Good news: The numbers are still awfully large in comparison to current computer speeds. We're still talking theoretical rather than practical in most situations with today's hardware.
It is time to move away from SHA-1 based designs. New systems should be designed around SHA-2-256 or SHA-2-512, with a plan to plug in replacement functions. Trusted implementations of SHA-2-256 and SHA-2-512 are available as part of the OpenSSL package, both command-line tools and shared libraries. And cryptologists spent several years developing and testing a replacement for SHA-1 and the SHA-2 family.
In 2007 NIST announced a competition to replace the then-current FIPS-180-2, which specified SHA-1, SHA-2-224, SHA-2-256, SHA-2-384, and SHA-2-512. FIPS-180-2 was issued in 2002, so the SHA family did not last as long as might have been expected. The new hash algorithm was selected in a process like that used for AES, the Advanced Encryption Standard, defined by government standard FIPS-197. Entries were submitted at the end of October 2008, and the final selection was announced in October 2012. You can read about the details of the contest if you're curious.
Good news for everyone came out of this. The development of the new hash function and its needed testing led to intense study of the SHA-2 family (SHA-2-224, SHA-2-256, SHA-2-384, SHA-2-512). The weaknesses found in MD5 and SHA-1 were not found in SHA-2. So developing SHA-3 showed that there was no emergency: SHA-2 still seems to be quite good. But we didn't know that until it was studied.
The Keccak algorithm was chosen as SHA-3. Joan Daemen of the Keccak development team was one of the two Belgian cryptographers behind the Rijndael cipher algorithm that was chosen as the Advanced Encryption Standard, or AES.
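SHA-3 is now widely available; for example, Python's hashlib module has included it since Python 3.6. A minimal sketch, assuming that module, showing that SHA-3-256 and SHA-2-256 are entirely different functions despite their equal output lengths:

import hashlib

message = b"Cryptography has a long and fascinating history."
# Same input, same output length, completely unrelated digests.
print("SHA-2-256:", hashlib.sha256(message).hexdigest())
print("SHA-3-256:", hashlib.sha3_256(message).hexdigest())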
Algorithm | Output length (bits) |
MD1 [1] | 128 |
MD2 [1] | 128 |
MD3 [1] | 128 |
MD4 [1] | 128 |
MD5 | 128 |
SHA-1 | 160 |
RIPEMD-160 | 160 |
SHA-2-224 | 224 |
SHA-3-224 | 224 |
SHA-2-256 | 256 |
Snefru | 256 |
GOST [2] | 256 |
SHA-3-256 | 256 |
SHA-2-384 | 384 |
SHA-3-384 | 384 |
EKS-Blowfish [3] | 448 |
Whirlpool [4] | 512 |
SHA-2-512 | 512 |
SHA-3-512 | 512 |
[1] MD1 and MD3 were computationally inefficient; MD2 and MD4 were cryptographically flawed.
[2] Государственный Стандарт Союза ССР, or Gosudarstvennyi Standart Soyuza SSR: "State Standard of the Union of SSR", that is, of the USSR.
[3] Based on the Blowfish cipher, used in the bcrypt password-based key derivation function.
[4] Designed by Vincent Rijmen, one of the Rijndael/AES developers, and Paulo S.L.M. Barreto.
Also visit The Hashing Function Lounge.