There are many hashing algorithms commonly used in cryptography and they have names like SHA1, MD5 and so on. These hashing methods have advanced in the past couple of decades, and although we commonly use the older ones in our everyday lives, it’s worth briefly looking at newer and better alternatives.
We explored the basic properties of hashing functions in the previous post about cryptographic hashing functions, what characteristics made them desirable, what types of programming applications they are commonly used in, and for what purpose.
Just to recap briefly what we covered in the previous post, cryptographic hash functions are functions that take arbitrary size inputs and map them to fixed length outputs. They are deterministic, non-reversible and are useful if they have fairly uniform coverage of the output domain as well as producing wholly different outputs for even slightly different inputs.
Hashing functions are commonly used to ensure file integrity; that is to say that a given file has not been altered or corrupted. Similarly message integrity is using hashes to provide evidence that a given message has not had errors introduced during transit, or been tampered with before delivery.
Another common use of hashes in the programming world is in authentication systems. Passwords are salted and hashed to ensure integrity and prevent rainbow table attacks. Digital signatures are hashes, and hashes are the basis of MAC (message authentication codes) and other authentication schemes. For example, instead of storing a large data set representing fingerprints of each user in a system, hashes of those data can be stored instead and compared with a hash of the fingerprint presented at authentication time.
We briefly outlined other use cases for hashing functions as well, including their use in blockchain protocols, using them for data deduplication, constructing hash tables and more. Now let’s take a closer look at the algorithms we can choose to use for the software we build.
Obviously in most cases we care about reversibility of the function, as mentioned earlier. If I know the hash I should not be able to construct the input that produced that hash. I should also not be able to produce a collision. Finally the algorithm should be able to run using a reasonable amount of resources including time, for whatever reasonable means for a given use case. There needs to be no obvious correlation between inputs and resulting hashes, as if it were random.
So how might a hashing algorithm be attacked? If you’re considering which one to use for a given application you’ll probably want to think about this. In general, attackers would like to find some way of getting clues about the message from the hash, find some way of producing collisions, or find a way to brute force the input space enabling them to effectively reverse the function.
Let’s talk about those in reverse order. Brute forcing the input domain is the least interesting of those because it can usually be avoided. If the input range for a hashing function is so limited that an attacker can enumerate the outputs until they find a match then a different strategy needs to be used. Do not build a web application that limits user passwords to twelve alphanumeric characters.
Being able to find collisions is the holy grail of hashing algorithm attackers. For more than a decade SHA1 vulnerabilities were known to exist before a method of producing collisions was found thanks to Google funding in 2017. A general purpose method was discovered in 2019.
Indeed in 2004 the MD5 and SHA0 hashing functions were broken, laying the groundwork for the inevitable cracking of successors like RIPEMD160 and SHA1. The SHA2 family (basically the SHA128 and SHA256) are really just fortified versions of the same underlying hashing mechanism too, and will surely be broken in the coming years.
SHA3 and Blake families are actually completely different algorithms, and have no known vulnerabilities yet. All the ones previously mentioned are susceptible to an attack method called a length extension attack. In this attack, one message is hashed, and then that message is extended and hashed in the hopes of finding some correlation between the two hashes. If some pattern is observed it gives the attacker some ability to predict the result of changes to the message being hashed, and therefore improved chances of constructing messages to get the hash results desired.
So enough about older algorithms and National Institute for Standards and technology (NIST) standards. Yes, all the algorithms we mentioned came from NIST challenges to find standardized hashing functions, including keccak which became SHA3 and blake2 which is probably the best choice for most new applications. In the next post we’ll finally look at using these hashing functions in Python programs, so stay tuned!
.