# Cryptographic Hash Functions

## Basic Characteristics of Hash Functions

Put very simply, a hash function is a measurement whose purpose is to detect very slight changes to data, or to determine whether or not two pieces of data are identical. Hash functions are detective security tools, not protective, Cryptographically useful hashing functions have some specific characteristics. While more secure algorithms have been available for several years, many organizations persist in using weaker functions.

No, this is just flavored tobacco, in Göreme, Cappadocia, Turkey

The output of any one hash function is always of the same length, regardless of the length of the input string. Stated informally, a cryptographically strong hash is one for which a very slight change in the input causes a drastic change throughout the output, and that means you can't search for input strings producing a specified hash value.

A good cryptographic hash has these four properties, where "infeasible" means "almost certainly beyond the capability of any adversary who must be kept from breaking the system for as long as the security is important":

1: It is easy to compute the hash value for any message.

2: It is infeasible to find a message that has a specified hash.

3: It is infeasible to modify a message without changing its hash.

4: It is infeasible to find two different messages with the same hash.

Here is how to calculate the SHA-1 hash of the opening sentence of the CRC Handbook of Applied Cryptography:

```\$ echo -n "Cryptography has a long and fascinating history." | openssl sha1
d1557b56f7274faf16d38fdf9241035ab71da799 ```

We can replace the initial `C` with `K` and change only one bit out of the 48-character string because of the ASCII encodings:
`   C = 0x43 = 0100``0``011`
`   K = 0x4b = 0100``1``011`
That's a change of just one bit out of 384 bits in the input. 383 of the 384 bits, or 99.739583%, are identical. But the resulting hash is very different:

```\$ echo -n "Kryptography has a long and fascinating history." | openssl sha1
7215da622638fc3abdab4ffc91c6d2cf4c5599fe ```

Compare those two nearly identical inputs:
`   Cryptography has a long and fascinating history.`
`   Kryptography has a long and fascinating history.`
And now compare the two very different hash results, first in hexadecimal (base 16) and then in binary:

```d1557b56f7274faf16d38fdf9241035ab71da799
7215da622638fc3abdab4ffc91c6d2cf4c5599fe

110100010101010101111011010101101111011100100111010011111010111100010110110100111000111111011111100100100100000100000011010110101011011100011101101001111001100
111001000010101110110100110001000100110001110001111110000111010101111011010101101001111111111001001000111000110110100101100111101001100010101011001100111111110
```

One criteria for a good hash function is that a single-bit change in the input causes changes in about half of the output bits. That's what we're seeing here.

Let's use a much larger example input. The King James translation of the Bible is available as a plain text ASCII document from Project Gutenberg. That provides a commonly available, well known, and fairly large data set. According to the `wc` command this file contains 4,445,260 bytes, with 100,117 lines of text making up 823,156 words:

```\$ wc kjv10.txt
100117  823156 4445260 kjv10.txt ```

This document contains a specification for data integrity. The second and third to last verses of the entire document (that is, The Revelation of Saint John the Divine, chapter 22, verses 18 and 19) provide a relevant test point for investigating change detection. As included in the file downloaded from gutenberg.org those verses read:

```22:18 For I testify unto every man that heareth the words of the
prophecy of this book, If any man shall add unto these things, God
shall add unto him the plagues that are written in this book: 22:19
And if any man shall take away from the words of the book of this
prophecy, God shall take away his part out of the book of life, and
out of the holy city, and from the things which are written in this book. ```

I have made a copy of that text file, changing the first letter of 22:19 from an upper-case "A" to a lower-case "a". That one changed character over 100,000 lines into the file makes absolutely no change in the meaning of the text. The ASCII codes for the characters `A` and `a` vary by only one bit:

```Char  Hex   Binary
A   0x41  1`0`00001
a   0x61  1`1`00001

\$ wc kjv10.txt kjv10-modified.txt
100117  823156 4445260 kjv10.txt
100117  823156 4445260 kjv10-modified.txt
200234 1646312 8890520 total
\$ diff -C 1 kjv10.txt kjv10-modified.txt
*** kjv10.txt
--- kjv10-modified.txt
***************
*** 100099,100101 ****
shall add unto him the plagues that are written in this book: 22:19
! `A`nd if any man shall take away from the words of the book of this
prophecy, God shall take away his part out of the book of life, and
--- 100099,100101 ----
shall add unto him the plagues that are written in this book: 22:19
! `a`nd if any man shall take away from the words of the book of this
prophecy, God shall take away his part out of the book of life, and ```

One character out of 4,445,260, that's a 0.0000225% difference. But to be more accurate, the change is only one bit out of 35,562,080 bits, a 0.0000028% difference. Let's see what happens to the hash values for those files. We will calculate the MD5, SHA-1, SHA-2-256, and SHA-2-512 hash values for the two files.

```\$ openssl md5 kjv10*
# MD5 HASH
`98074d4575a9d6970cd2e69dfde4adab`  kjv10-modified.txt
`a6d84a7fdfaa6906bae40e2c342e978f`  kjv10.txt
\$ openssl sha1 kjv10*
# SHA1 HASH
`afa2cd8f2e5797b1edde66a829bbbb20650b51fa`  kjv10-modified.txt
`a4f709de0b5684bee071ac912432833a0c09db7d`  kjv10.txt
\$ openssl sha256 kjv10*
# SHA256 HASH
`f3df0027a219731d277f2ee5496a7351b7be2028ee5126128c89f40296b4b93d`  kjv10-modified.txt
`f9aad5f2fed41aa145ea03f6c40716cfeece5b61c6d5fb8350a48873a87614ca`  kjv10.txt
\$ openssl sha512 kjv10*
# SHA512 HASH
`699f43a0750dbb6b2104e402a3622a7491c18537bff18e083c60603edf2fb7b1cbf7f5eccbf36ca7fc20b12b6929d7cf230b671f0d04e1eae4f1601107499292`  kjv10-modified.txt
`29df561de4486a9c3e313c7e1515b104ed2d2cf17ff26ab255eba05bff47c5382ec438488f27d6497ad19f68b87fc942702997641f1eba672d6c6c11ce95938f`  kjv10.txt ```

A very different meaning of hash, in a brown cafe in Amsterdam.

## You can't reverse a hash, but you might find a collision

There is no such thing as the inverse of a hash function! There is no way to start with the hash of an input, and find an input for which you are certain that it was the input.

A typical hash function has a huge output space. You would expect to half to try half that many possible inputs before randomly stumbling across something that gave the same hash.

But even after you manage to find some input that produces the desired output, it is only one out of a possibly infinite number of possible inputs that produce the same hash!

A collision is a pair of different inputs that produce the same hash output. That is:
`   input1 != input2`     but
`   hash(input1) = hash(input2)`

A random collision would be the discovery of any two inputs with equal hashes.

A chosen-target collision is more difficult to find. That is the case where `input1` or `hash(input1)` is already specified, and the attacker needs to find `input2.`

Another description of a good hash function is in "Deploying a New Hash Algorithm", by Steven M Bellovin and Eric K Rescorla, which follows the description of section 9.2 of the CRC Handbook of Applied Cryptography, Menezes, van Oorschot, and Vanstone, available as a free PDF download:

• Collision resistance: It is computationally infeasible to find `x` and `y` where `x != y` and `H(x) = H(y)`
That is, it is impractically difficult to find a random collision.
• Preimage resistance: Given a desired output `y,` it is computationally infeasible to find `x` where `H(x) = y`
That is, it is impractically difficult to find a chosen-target collision.
• Second preimage resistance: Given an input `x',` it is computationally infeasible to find `x` where `H(x) = H(x')`
That is, it is impractically difficult to find a chosen-target collision.

Computers get faster and memory gets cheaper. So, a 2128 search space is no longer considered to be adequately large, and so the MD5 algorithm is not strong enough against brute-force attacks with sophisticated modern hardware. What's worse, MD5 isn't even as strong as promised and researchers have published results for collisions (and see the results below for even more on this):

• Finding two input strings that have the same MD5 hash.
• Finding a second input string that has the same MD5 hash as a chosen "target" input string (a chosen-target collision, a much harder problem).

In early 2017 Google announced the first practical technique for generating a SHA-1 collision, over 100,000 times faster than a brute-force attack.

Use SHA-2-256 or SHA-2-512. Not only are the search spaces enormously larger, but SHA-2 does not have the known weaknesses of MD5 and SHA-1. Further details are available here.

 Hash Output bits Possible outputs MD5 128 2128 ≈ 3.40×1038 SHA-1 160 2160 ≈ 1.46×1048 SHA-2-256 256 2256 ≈ 1.16×1077 SHA-2-512 512 2512 ≈ 1.34×10154

That last number is awfully large, with 155 digits.

Do hash functions produce all possible outputs? Or might there be some "forbidden patterns"? Do hash functions enforce some form of parity preventing more than some maximum number of bits set to 0 or 1, or preventing sequential runs of identical bits longer than some threshold? Apparently not, although these "interesting" events take more and more brute-force search to discover as they become more and more interesting. See the results of my brute-force search for the details.

Yes, collisions are awfully unlikely to happen randomly, and still very difficult to find on purpose, but it is possible. Here are two 128-byte data files with identical MD5 hashes! But their SHA-1 hashes differ. Notice how similar the two files are, just six regularly-spaced bits differ:

``````
\$ openssl md5 crypto-file*.dat
MD5(FILE1.DAT)= a4c0d35c95a63a805915367dcfe6b751
MD5(FILE2.DAT)= a4c0d35c95a63a805915367dcfe6b751
\$ openssl sha1 crypto-file*.dat
SHA1(FILE1.DAT)= 2783c4ff4a3f20d25f2598a8b052b890c37dcac4
SHA1(FILE2.DAT)= 3c35410823ef00b12d020981c1cf8564c0f89bcc
\$ hexdump -C crypto-file1.dat
00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish
00000010  2f ca b5 ````87```` 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |deleted...
00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 ````71```` 41 5a  |
00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd ````f2```` 80 37 3c 5b  |
00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |
00000050  35 73 9a ````c7```` f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |
00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 ````cc```` 15 5c  |
00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a ````d8```` 35 cc a7 e3  |
00000080
\$ hexdump -C crypto-file2.dat
00000000  d1 31 dd 02 c5 e6 ee c4  69 3d 9a 06 98 af f9 5c  |gibberish
00000010  2f ca b5 ````07```` 12 46 7e ab  40 04 58 3e b8 fb 7f 89  |deleted...
00000020  55 ad 34 06 09 f4 b3 02  83 e4 88 83 25 ````f1```` 41 5a  |
00000030  08 51 25 e8 f7 cd c9 9f  d9 1d bd ````72```` 80 37 3c 5b  |
00000040  96 0b 1d d1 dc 41 7b 9c  e4 d8 97 f4 5a 65 55 d5  |
00000050  35 73 9a ````47```` f0 eb fd 0c  30 29 f1 66 d1 09 b1 8f  |
00000060  75 27 7f 79 30 d5 5c eb  22 e8 ad ba 79 ````4c```` 15 5c  |
00000070  ed 74 cb dd 5f c5 d3 6d  b1 9b 0a ````58```` 35 cc a7 e3  |
00000080``````

If you want to play with these data files, right-click and save them:

Note that MD5 is a chained hash function, and this means that if you start with the above two 128-byte files, and append identical sequences to each, the two resulting data files will also have identical MD5 hashes. So, an infinite number of collisions could be based on these.

## Hash Collision Research

In the summer of 2004 some mathematicians presented a paper showing collision results for MD5 and others. See "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD" by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu.

Someone else then published a paper on weaknesses in SHA-0, which was a precursor to SHA-1. Actually, SHA-0 was the winning submission for the U.S. government's Secure Hash Algorithm, but the NSA specified a few tweaks for reasons they (of course) did not explain. For this intermediate work, see "Near-Collisions of SHA-0" by Eli Biham and Rafi Chen.

At the Crypto 2005 conference the following summer, Xiaoyun Wang, Yiqun Yin, and Hongbo Yu (two authors of the preceding summer's paper) published attacks against full SHA-1. Then, at an informal meeting at that conference, Xiaoyun Wang announced new results: the time complexity was even less than the papers (so far) had announced. Relevant papers include:

Lucks and Daum have generated Postscript files that demonstrate the attack. The files have the same hash, although they contain contradictory messages!

Stevens, Lenstra and de Weger have generated several PDF files with very different contents but identical hashes as part of their general hash research.

In December, 2008, researchers reported finding MD5 collisions in one to two days using a system built from 200 Playstation 3's. Since six Certificate Authorities were still using MD5 signing on X.509 digital certificates, that means that with a day or two of computing you could generate forged certificates allowing a rogue system to masquerade as a "secure" server using one of those certificates. They generated an intermediate CA certificate so they could sign their own certificates. That means you would only need to do the computational attack once to generate as many apparently valid certificates as you want.

In 2012, Microsoft explained that the authors of the Flame malware used an MD5 collision to forge a Windows code-signing certificate.

In October 2015, Marc Stevens, Pierre Karpman, and Thomas Peyrin reported on discovering freestart collisions in SHA-1 (this is a collision of the internal compression function, not all of SHA-1 but effectively a SHA-1 collision). They used a 64-GPU cluster, and estimated the cost of finding SHA-1 collisions to be US\$ 75,000–120,000 by renting Amazon EC2 high-performance computing systems for a few months.

In February 2017 Google announced the first practical technique for generating a SHA-1 collision based on Marc Stevens' earlier work. It's over 100,000 times faster than a brute-force attack.

## What does all this mean for us?

What does all that mean? The difficulty of finding random or chosen-target collisions is not as hard as we thought it was for anything in the family of hash functions including MD5 and SHA-1. So we should conclude:

• Bad news: This research shows that the SHA-1 hash function is significantly weaker than we thought!
• Good news: The numbers are still awfully large in comparison to current computer speeds. We're still talking theoretical rather than practical in most situations with today's hardware.

It is time to move away from SHA-1 based designs, new systems should be designed around SHA-2-256 or SHA-2-512 with a plan to plug in replacement functions. Trusted implementations of SHA-2-256 and SHA-2-512 are available as part of the OpenSSL package, both command-line tools and shared libraries. And cryptologists spent several years developing and testing a replacement for SHA-1 and the SHA-2 family.

In February 2007 NIST announced a competition to replace the then-current FIPS-180-2, which specified SHA-1, SHA-2-224, SHA-2-256, SHA-2-384, and SHA-2-512. FIPS-180-2 was issued in 2002, so the SHA family did not last as long as might have been expected. The new hash algorithms were selected in a process like that used for AES, the Advanced Encryption Standard, defined by government standard FIPS-140-2. Entries were submitted at the end of October, 2008, and the final selection was announced in October, 2012. You can read about the details of the contest if you're curious.

Good news for everyone came out of this. The development of the new hash function and its needed testing led to intense study of the SHA-2 family (SHA-2-224, SHA-2-256, SHA-2-384, SHA-2-512). The weaknesses found in MD5 and SHA-1 were not found in SHA-2. So, developing SHA-3 showed that it wasn't an emergency, SHA-2 still seems to be quite good. But, we didn't know that until it was studied.

The Keccak algorithm was chosen as SHA-3. Joan Daemen was on the Keccak development team, he was one of the two Belgian cryptographers behind the Rijndael cipher algorithm that was chosen as the Advanced Encryption Standard or AES.

 Algorithm Output Length MD1 [1] 128 MD2 [1] 128 MD3 [1] 128 MD4 [1] 128 MD5 128 SHA-1 160 RIPEMD-160 160 SHA-2-224 224 SHA-3-224 224 SHA-2-256 256 Snefru 256 GOST [2] 256 SHA-3-256 256 SHA-2-384 384 SHA-3-384 384 EKS-Blowfish [3] 448 Whirlpool [4] 512 SHA-2-512 512 SHA-3-512 512

[1] MD1 and MD3 were computationally inefficient, MD2 and MD4 were cryptographically flawed.
[2] Государственный Стандарт Союза ССР or Gosudarstvennyi Standart Soyuza SSR
[3] Based on the Blowfish cipher, used in the bcrypt password-based key derivation function.
[4] Designed by Vincent Rijmen, one of the Rijndael/AES developers, and Paulo S.L.M. Barreto.

Also visit The Hashing Function Lounge.