In our initial post on digital steganography basics, we mentioned that hiding text within text is the common ancestor of all steganographic techniques. In the digital age, however, there are some new wrinkles due to the existence of various kinds of computer files. Text messages can of course be concealed in any number of file types, in fact in every file type, but we need to begin somewhere, so let’s examine text-based steganography.
Text files are simply files that are interpreted as such by an operating system. The notion of files and filesystems is an abstraction the OS provides us so we can more easily manage data. These “files” are ones and zeros just the same as binary files, but they are composed of human readable text encoded using a standard scheme like ASCII or Unicode. Standard ASCII (American Standard Code for Information Interchange) is a 7-bit code that defines 128 characters. In practice each character is stored in one byte, and since a byte is 8 ones and zeros it can take 256 values (2 possibilities per bit, 8 bits so 2^8=256). The 8-bit “extended ASCII” encodings use the upper 128 values for additional characters.
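To make the byte-per-character idea concrete, here is a minimal Python sketch. The function name describe is our own, chosen just for illustration. It shows the numeric code and the 8-bit pattern stored for each character of a short ASCII string.

```python
def describe(text: str) -> list[tuple[str, int, str]]:
    """Return (character, numeric code, 8-bit binary string) for each character."""
    # encode("ascii") raises UnicodeEncodeError if the text is not pure ASCII
    encoded = text.encode("ascii")
    return [(chr(b), b, format(b, "08b")) for b in encoded]


for ch, code, bits in describe("Hi!"):
    print(repr(ch), code, bits)

# 7 bits give the 128 standard ASCII codes; a full 8-bit byte has 256 values
print(2 ** 7, 2 ** 8)
```

Every standard ASCII code fits in 7 bits, so the leading bit of each byte printed here is always 0.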
OK so why are we in the weeds here about character encoding schemes and such? The ASCII character set represents the English alphabet characters – 26 uppercase and 26 lowercase. It also encodes the 10 digits we use for numbers along with the “other” characters you find on English language keyboards. Some of these characters are for punctuation, some are symbols we commonly use for arithmetic. There are non-visible formatting characters as well, including the space character, the carriage return (a vestige from the days of typewriters), the line feed character and the tab character. ASCII also defines a number of other control characters that aren’t normally displayed to us in our software applications.
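As a quick sanity check on that inventory, the following sketch uses Python's standard string module to count the visible character classes and to show the codes of the formatting characters mentioned above. The dictionary names are ours, picked only for this example.

```python
import string

# Visible character classes, as defined by Python's string module
visible = {
    "uppercase letters": string.ascii_uppercase,
    "lowercase letters": string.ascii_lowercase,
    "digits": string.digits,
    "punctuation and symbols": string.punctuation,
}
for name, chars in visible.items():
    print(f"{name}: {len(chars)}")

# The formatting characters discussed in the text, with their ASCII codes
whitespace = {"space": " ", "tab": "\t", "line feed": "\n", "carriage return": "\r"}
for name, ch in whitespace.items():
    print(f"{name}: code {ord(ch)}")

# Control characters occupy codes 0-31 plus 127 (DEL)
control_codes = [c for c in range(128) if c < 32 or c == 127]
print("control characters:", len(control_codes))
```

The visible characters, the space and the control characters together account for all 128 standard ASCII codes. Note that tab, line feed and carriage return are themselves control characters.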
The idea here is that messages can be hidden using the characters that are not normally displayed to the users. I’m oversimplifying it by just sticking to this one common encoding scheme, when in fact there are other encoding schemes that have whitespace characters like wide spaces, and those schemes can be used as well. We’re going to keep it real simple by just talking about ASCII encoding and only those non-visible characters that I’m calling “whitespace” – spaces, tabs and newlines.
Text can be hidden in text files quite easily by adding extra spaces and tabs. Note that newlines are implemented differently by various operating systems and might be a carriage return, a line feed character or both – so we stick with spaces and tabs. This is exactly the technique that a program called snow uses to conceal text inside text. The message is encoded in spaces and tabs at the end of lines of text, including blank lines. This works because most text viewers will not display whitespace characters when they follow the end of a line of text.
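To illustrate the general idea, here is a deliberately simplified Python sketch of trailing-whitespace hiding. It is not snow's actual encoding scheme. The names hide and reveal, and the mapping of a space to a 0 bit and a tab to a 1 bit, are assumptions made up for this example. Each line of the cover text carries one byte of the message as eight trailing whitespace characters.

```python
SPACE, TAB = " ", "\t"


def hide(cover: str, message: str, bits_per_line: int = 8) -> str:
    """Append message bits to the ends of cover lines (space = 0, tab = 1)."""
    bits = "".join(format(b, "08b") for b in message.encode("ascii"))
    chunks = [bits[i:i + bits_per_line] for i in range(0, len(bits), bits_per_line)]
    lines = cover.split("\n")
    if len(chunks) > len(lines):
        raise ValueError("cover text has too few lines for this message")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip(" \t")  # drop any pre-existing trailing whitespace
        if i < len(chunks):
            line += "".join(TAB if bit == "1" else SPACE for bit in chunks[i])
        out.append(line)
    return "\n".join(out)


def reveal(carrier: str) -> str:
    """Read trailing spaces and tabs back into bits, then into ASCII text."""
    bits = ""
    for line in carrier.split("\n"):
        trailing = line[len(line.rstrip(" \t")):]
        bits += "".join("1" if ch == TAB else "0" for ch in trailing)
    usable = len(bits) - len(bits) % 8  # ignore any incomplete final byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("ascii")


cover = "line one\nline two\n\nline four"
carrier = hide(cover, "hi!")
print(reveal(carrier))
```

This toy version spends a whole line on each character and does no compression or encryption, so it hides far less per line than a real tool would. It does show why the carrier looks the same as the cover in most viewers while growing in size.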
Shown above is the snow program in action. We started with a small text file called plain.txt that is 212 bytes in size. We then asked snow to estimate how much data we can conceal:
$ snow -S plain.txt
and the reply is 10 or 11 bytes, so for this tiny file we need a short message.
We add the message to this file using the following command, which gives the message text (-m), asks snow to compress it (-C), and names the input file and the resulting output file:
$ snow -C -m "secret message" plain.txt carrier.txt
The program tells us that it used pretty much all available space but did not have an error, and we now see that although the text looks identical the file size has grown to 304 bytes.
We did not supply a password (the -p option) to enable snow’s built-in encryption, so we can reveal the hidden message by simply giving the filename to look in:
$ snow -C carrier.txt
Voila – it found our secret message hidden in the whitespace. So how could we tell without using the snow program? If we load this into a text editor we’ll see that the line endings are well past the last visible characters. This could just be extra spaces, but it could also indicate that a closer look at that whitespace is warranted.
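That manual check is easy to automate. The following Python sketch flags every line that ends in spaces or tabs and counts each kind. The function name trailing_whitespace_report is our own invention for this example. A hit is only a hint that something may be hidden, since plenty of innocent files have stray trailing whitespace.

```python
def trailing_whitespace_report(text: str) -> list[tuple[int, int, int]]:
    """Return (line number, trailing spaces, trailing tabs) for suspicious lines."""
    report = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Everything left over after stripping is the trailing whitespace
        trailing = line[len(line.rstrip(" \t")):]
        if trailing:
            report.append((number, trailing.count(" "), trailing.count("\t")))
    return report


sample = "clean line\nsuspicious \t \t\nalso clean\n\t\t"
for number, spaces, tabs in trailing_whitespace_report(sample):
    print(f"line {number}: {spaces} trailing spaces, {tabs} trailing tabs")
```

To try it on a real file, read the file's contents into a string and pass that in. A mix of spaces and tabs at the ends of many lines is a stronger signal than a single stray space.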
This is just one way to hide text in text files, of course. There are as many methods as there are creative people to imagine them. Binary files give more opportunities to hide content, largely due to their size. A Word doc or PDF file with this text would easily be an order of magnitude larger than a few hundred bytes.
Most readers will probably understand that HTML files are also text files, typically shown to the viewer by a web browser. There are important details about the nature of this markup language and the way browsers render this content that make it useful for information hiding using text techniques. The same can be said for associated files containing CSS or JavaScript, and they hold unique possibilities for information hiding.
I hope you enjoyed reading this description of hiding text inside text, and how it can be done using the snow program. Hiding text information inside HTML files will be discussed in a future post about text-based steganography, because there are a lot more possibilities than there are with simple text files and it deserves its own writeup. For those of you eager to dive into hiding information inside binary files – please be patient, we’ll get there too.