Hashing is an essential part of modern software systems, and we’re going to explain what you need to know to use them in your programs. We’ll be focusing mainly on the terrific hashlib module in Python. We also want to stick to the most secure and most widely used cryptographic hashing functions, although there certainly are plenty of others.
This is probably going to be a three part series of posts because I want to first explain what hashing is, why it’s useful and what types of things we would use it for. Hashing functions are widely used in programming and it’s not always clear why things are done the way they are, so a quick intro to the basic ideas will probably go a long way to address that. Slightly different types of hashing functions are useful for different tasks due to the subtle differences in their underlying properties, so it’s useful to compare and contrast specific hashing functions and look at specific progamming use cases.
Cryptographic hashing functions are mathematical functions that take arbitrary sized input and produce fixed length output and have desrable characteristics. For example we might have a hashing function h that takes input values x which are strings of any length and produces an integer output h(x) that is always 20 bytes in length. The reason cryptographic hashing functions are noteworthy is because they have some interesting properties.
These functions have a couple properties that make them valuable for many programming tasks, namely determinism and irreversibility. For a function to be deterministic just means that for an input a the function will always produce the same output b. Repeat the operation a million times and you’ll continue to get output b a million times in a row.
A function that is not reversible simply means that given the output b you cannot determine the input a. You might find some input i that produces the output b, but you can never be sure this was the input that did actually produce it. If you think about it, you’ll realize this is due to the constraints on input and output. An arbitrary length input means (for typical input sets like strings of characters or integers) that there are an infinite number of possible inputs. The output length is fixed however, so those infinite inputs get mapped onto a finite number of outputs. By the pigeonhole principle there must be output values which multiple inputs map onto. These are called collisions, and are guaranteed to exist for this reason.
That’s really all it takes to have a basic hashing function, but for it to be useful, we need to ask a little more from it. A useful hashing function is one that has uniformity, meaning that if there are 10 possible outputs, the infinite range of inputs needs to map onto those 10 values in a way that takes an input value and maps it to one of the outputs with roughly the same probability as any other.
Finally it is desirable in most cases to have functions that produce output that has a low degree of continuity. That is to say that two nearly identical inputs produce very different outputs. Without this property it might be easier to deduce the input value by using successive approximations to guess your way gradually to the actual input. So it’s important that even the smallest change in the input value produce a completely distinct output.
Cryptographic hashing functions are extremely useful if they have the properties described above. The property of determinism means that we can use them to ensure the integrity of messages or files. If we hash a file and even one or two bytes gets altered, the resulting hash will be completely different. And we can know with a high degree of confidence that a file that still produces the same hash value is identical to the original.
Similarly hashes can ensure the integrity of transmitted messages. If a message hashes to a different value then the recipient can be confident that it was corrupted in transmission somehow or tampered with. The same function is used in some blockchain implementations like Bitcoin, where each block contains the hash of the previous block ensuring that no prior data has been changed. The integrity of the entire chain of data is verified by hashing blocks and comparing the results.
Hashes are a secure way to compare user-supplied passwords with those provided at an earlier time. Storing actual passwords is risky and creates a vulnerability. So hashes of passwords are stored instead, and when the user provides credentials in the form of username and password that entered password is hashed and compared to the saved hash. For this use case we need all four of the properties described above. Our hashing function must be deterministic, hard to reverse, have few collisions and similar passwords need to generate completely different hashes.
The one twist we need to introduce for this use case is to prevent attacks where a collection of password hashes is acquired and some users might have used the same password. If simply hashed, the deterministic property ensures that there will also be duplicate hashes. This is undesirable because if one can be figured out then all other identical hashes are also known. We avoid this by using salted hashes; that is a distinct “salt” value is concatenated to the original password before it is hashed. Just by changing this salt value the entire string will always hash to completely different values. Due to our property of slight changes to the inputs producing completely different results the same passwords with different salts will not be able to be correlated.
These hashing functions are commonly used in cryptocurrency protocols as well. One such application is the proof of work mechanism in Bitcoin. Miners try to find a nonce value that when added to the contents of a block will produce a hash with a certain number of leading zeros. This difficulty increases exponentially as the number of leading zeros is increased linearly, and vice cersa, making this technique highly adaptive to the amount of hashing power on the network.
Hashing is useful for other tasks as well, such as constructing bloom filters and hash tables, for digital signatures, and for data deduplication. For example, if multiple data hash to the same value they can be considered to be duplicates. Obviously this is useful only when the data are large enough and the hashing function has few collisions. Another way to say that it’s difficult to find a collision for a given hash function is to observe that it has a relatively large sized output.
Well now that the introduction is out of the way we can proceed to talk about specific hashing functions that we encounter in programming, and to talk about use cases where they are commonly used and misused, and how specifically this is done using Python3 so stay tuned!