Thursday, 15 September 2011
The Perils of Encodings
Encodings are something that we usually take for granted in our daily lives. Every file we save to the file system uses one. Our web applications use them to determine how to receive and send data for each HTTP request. Our browsers use them to determine how bytes should be converted to characters when receiving the HTTP response. But most of us don’t understand the implications of using encodings incorrectly or not at all. The first case in point is converting between Strings and byte arrays.
Converting Strings to Bytes and Vice Versa
If you look across the web for samples of how to code encryption you will see the following many times:
static string GenerateKey()
{
    // Create an instance of the symmetric algorithm. The key and IV are generated automatically.
    DESCryptoServiceProvider desCrypto = (DESCryptoServiceProvider)DESCryptoServiceProvider.Create();
    // Use the automatically generated key for encryption.
    return ASCIIEncoding.ASCII.GetString(desCrypto.Key);
}
Similar code is all over the Web because it came from Microsoft’s Security Patterns and Practices as well as MSDN’s articles on doing cryptography. If .NET developers are following Microsoft’s guidance, then we are looking at a lot of encrypted data that is potentially breakable in hours, not years. Here are the links to the MSDN articles:
http://support.microsoft.com/kb/307010
http://support.microsoft.com/kb/301070
Aside from the fact that DES is a broken algorithm, can you spot the vuln? Hint: it has to do with encodings.
Well if you guessed:
return ASCIIEncoding.ASCII.GetString(desCrypto.Key);
Ding…Ding…Ding. Give that person a cigar.
Running a randomly generated key through ASCIIEncoding.ASCII.GetString(…) converts every byte outside the range 0-127 to char(63), which is the question mark (?). The remaining bytes are converted to characters within the 0-127 range. For a random key, the mistake above therefore produces a key in which, on average, half of the characters are question marks and the other half are limited to half of their byte key space.
To put this into perspective, most encryption keys are 128 bits (16 bytes), whereas the DES example above uses a 64 bit key with one parity bit in each key byte, for a 56 bit effective key size. After following the advice above, an AES 128 bit key turns into roughly a 56 bit key: 128 / 2 = 64 bits because half of the key is turned into question marks, and the remaining 64 bits (8 bytes) each have their 8th bit turned off, which leaves 64 - 8 = 56 bits. The DES key from the example above turns into roughly a 28 bit effective key: 64 / 2 = 32 bits, and the 4 bytes that were not turned into question marks have 0 as their high-order bit, which leaves 32 - 4 = 28 bits.
Recent experiments have shown 56 bit keys to be breakable in hours. If you are a .NET developer and have relied on MSDN articles for guidance on how to do encryption, I would strongly suggest that you revisit your encryption code.
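To see the damage for yourself, here is a minimal sketch that round-trips a random 16-byte key through an ASCII string. The class name AsciiKeyDemo and the helper RoundTripThroughAscii are made up for illustration; the only real APIs used are Encoding.ASCII, RandomNumberGenerator and BitConverter.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical demo class, not part of the MSDN sample.
public static class AsciiKeyDemo
{
    // Converts key bytes to an ASCII string and back, which is effectively
    // what happens to a key returned by GenerateKey() once it is reused.
    public static byte[] RoundTripThroughAscii(byte[] key)
    {
        string asText = Encoding.ASCII.GetString(key); // bytes above 127 become '?'
        return Encoding.ASCII.GetBytes(asText);
    }

    public static void Main()
    {
        // Generate a random 16-byte (128 bit) key.
        byte[] key = new byte[16];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(key);
        }

        byte[] damaged = RoundTripThroughAscii(key);

        Console.WriteLine("original: " + BitConverter.ToString(key));
        Console.WriteLine("after:    " + BitConverter.ToString(damaged));
        Console.WriteLine("bytes replaced by '?': " + key.Count(b => b > 127));
    }
}
```

Every byte printed as 3F in the second line that was not 3F in the first is a key byte the attacker no longer has to guess.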
If you are thinking of changing the errant code above to:
return UnicodeEncoding.UTF8.GetString(desCrypto.Key);
because you think that UTF8 (Unicode Transformation Format) should be able to represent all characters, then you will be both right and wrong, with basically the same result. UTF8 can represent every Unicode character and could be extended to support extraterrestrial character sets, but its encoding rules are going to trip you up if you try the code above as a replacement.
Remember that a crypto key has to be truly random, whereas UTF8 encoded strings have to follow a strict bit encoding pattern to identify characters.
To summarize UTF8 semantics:
If the character is an ASCII character, its byte will start with a 0 bit, because every byte holds 8 bits and 0-127 fits into 7 bits. So the bit sequence looks like:
0xxxxxxx
When you represent an ASCII character in UTF8 there is a direct one-for-one mapping. The problem arises when you need to represent characters outside of the 0-127 range. The general rule is that the number of ones before the first zero in the leading byte tells you how many 8 bit bytes make up the Unicode character, and every following byte in the sequence starts with 10. Remember that this is highly simplified, as there are also rules for combining Unicode characters to make new characters, but that is not relevant here. So a valid three-byte UTF8 sequence looks like the following:
1110xxxx 10xxxxxx 10xxxxxx
The x’s above represent the bits used to identify the code point for the Unicode character (given that every character in existence is assigned a code point value or set of code point values).
What happens if you get a byte sequence like the following (which could easily occur in a random key)?
1110xxxx 01xxxxxx 01110xxx 0001xxxx …
Well, it kind of depends on the UTF8 parser and interpreter. The lead byte (1110xxxx) promises two continuation bytes that never arrive. Some interpreters will drop the invalid sequence altogether as bad, while others may ignore only the lead byte (assuming that it was an error) and continue processing the following bytes (01xxxxxx 01110xxx 0001xxxx) as characters that they understand. In either case the key is significantly weakened.
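A quick way to check how your own runtime behaves is to round-trip such a sequence. The sketch below uses a made-up four-byte key fragment (0xE2 is a three-byte lead byte, and the next three bytes are plain ASCII values), and the names Utf8KeyDemo and RoundTripThroughUtf8 are hypothetical. In .NET, the default Encoding.UTF8 decoder substitutes the replacement character U+FFFD for bytes it cannot decode, so the lead byte does not survive the trip.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class Utf8KeyDemo
{
    // Converts key bytes to a UTF8 string and back again.
    public static byte[] RoundTripThroughUtf8(byte[] key)
    {
        string asText = Encoding.UTF8.GetString(key); // invalid sequences are replaced
        return Encoding.UTF8.GetBytes(asText);
    }

    public static void Main()
    {
        // Made-up key fragment: a lead byte with no continuation bytes after it.
        byte[] key = { 0xE2, 0x41, 0x72, 0x1F };
        byte[] result = RoundTripThroughUtf8(key);

        Console.WriteLine("original: " + BitConverter.ToString(key));
        Console.WriteLine("after:    " + BitConverter.ToString(result));
        Console.WriteLine("lengths:  " + key.Length + " vs " + result.Length);
    }
}
```

The bytes that come back are not the bytes that went in, so anything encrypted with the original key can no longer be decrypted with the restored one.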
So ASCII and UTF8 Are Messed Up. What About ISO8859-1? It’s an 8-bit Code Page
Pretty good that you suggested that, but there are problems with this as well. Initially you might think that single-byte code pages are OK because they represent all 8 bits as characters, so when you convert from key bytes to characters you should not suffer the loss that you would under UTF8 or ASCII. However, the problem is that there are a number of other 8-bit code pages (the rest of the ISO8859-x family and the Windows cp125x family) which use different characters for the high-order byte values (128-255), with overlap between the code pages. So if you store the characters as ISO8859-1 on one system and then retrieve them as cp125x on another, the byte mappings will get messed up and your encrypted data will be corrupted or fail to decrypt properly because the encryption key has been altered.
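The effect of a mismatch can be sketched as follows. This is an illustration under assumptions, not a reproduction of any particular system: the class Latin1KeyDemo and the helper Reload are hypothetical, and UTF8 stands in for "the other machine's character set" only because it is available on every .NET runtime without extra setup. Any pair of differing encodings would make the same point.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class Latin1KeyDemo
{
    // Treats the key as ISO8859-1 text, writes that text with one encoding,
    // reads it back with another, and recovers the "key" bytes.
    public static byte[] Reload(byte[] key, Encoding fileWriter, Encoding fileReader)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string keyText = latin1.GetString(key);          // key treated as text
        byte[] onDisk = fileWriter.GetBytes(keyText);    // what machine A stores
        string reloaded = fileReader.GetString(onDisk);  // what machine B reads
        return latin1.GetBytes(reloaded);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] key = { 0x41, 0xE9, 0x10, 0xFF };         // made-up key fragment

        Console.WriteLine("original:         " + BitConverter.ToString(key));
        Console.WriteLine("same encoding:    " + BitConverter.ToString(Reload(key, latin1, latin1)));
        Console.WriteLine("mismatched pair:  " + BitConverter.ToString(Reload(key, Encoding.UTF8, latin1)));
    }
}
```

With matching encodings the key comes back intact, which is exactly why this bug can hide until the data moves to a machine configured differently.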
Summary of the Problem
Another way of explaining the problem is that there is a disconnect between what encryption keys are and how they are being treated. Encryption keys are being treated as text when they are just a sequence of random bits. Text has to follow the bit pattern rules defined by the encoding it is rendered in to ensure that it displays correctly. Encryption keys should not follow any rules and should be as random as possible to protect against brute forcing. The only way I can explain the code above is that the MSDN code writer was looking for a way to store the encryption key in a format other than bytes.
So What The Heck Are We Supposed to Do?
You might think of writing the raw bytes directly to a text file, but avoid this because character sets are implicitly used when writing to or reading from text files. If the encryption key is written out to a file on a machine using one character set and later moved to a machine using a different character set, key bytes could be lost or corrupted when the file is read on the second machine.
If you are generating a key for use in a cryptographic operation, set the key on the crypter from the generated bytes directly. If you have to store the key, then you can Base64 encode the byte stream after encrypting it. When you need to use the key, Base64 decode the key, decrypt it, and set the key directly as a byte array on the crypter.
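Here is a minimal sketch of the Base64 part of that advice. The names KeyStorageDemo, EncodeKey and DecodeKey are hypothetical, and the sketch deliberately leaves out the step of encrypting the key before encoding it, so treat it as an illustration of the byte handling only.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical demo class for illustration only.
public static class KeyStorageDemo
{
    // Base64 maps arbitrary bytes onto plain ASCII characters without loss.
    public static string EncodeKey(byte[] key)
    {
        return Convert.ToBase64String(key);
    }

    public static byte[] DecodeKey(string stored)
    {
        return Convert.FromBase64String(stored);
    }

    public static void Main()
    {
        using (Aes aes = Aes.Create())
        {
            byte[] key = aes.Key;               // generated automatically, used as bytes
            // NOTE: in real code the key would be encrypted before this step.
            string stored = EncodeKey(key);     // safe to keep as text
            byte[] restored = DecodeKey(stored);

            using (Aes decryptor = Aes.Create())
            {
                decryptor.Key = restored;       // set the key directly as a byte array
            }

            Console.WriteLine("stored form: " + stored);
            Console.WriteLine("round trip intact: " +
                (BitConverter.ToString(key) == BitConverter.ToString(restored)));
        }
    }
}
```

Because Base64 output only uses characters that every common character set agrees on, the stored string survives a move between machines with different default encodings.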
For references on how the ASCIIEncoding object encoder works:
http://msdn.microsoft.com/en-us/library/system.text.encoding.ascii.aspx#Y1209
"The ASCIIEncoding object that is returned by this property might not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character. Instead, you can call the GetEncoding method to instantiate an ASCIIEncoding object whose fallback is either an EncoderFallbackException or a DecoderFallbackException, as the following example illustrates."
http://www.dijksterhuis.org/encoding-c-strings-as-byte-byte-arrays-and-back-again/
"Solution #1 – Convert Unicode to ASCII / String to an ASCII Byte[]
If you intend to send only the most basic of messages which can be satisfied with just A-Z, a-z & 0-9 and a few other characters you can convert the C# string using the ASCII encoder. You will however lose any characters that are not defined by ASCII. So while this is a good idea if your application is only used in North America, the rest of the world will probably not thank you for this design decision."
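The MSDN passage quoted above mentions asking GetEncoding for an ASCII encoding that throws instead of silently substituting question marks. A minimal sketch of that idea follows; it is not the MSDN example itself, and the names StrictAsciiDemo and DecodesAsStrictAscii are made up. The fallback classes in System.Text are EncoderExceptionFallback and DecoderExceptionFallback.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class StrictAsciiDemo
{
    // Returns true only if every byte is valid ASCII. Bytes above 127 raise
    // a DecoderFallbackException instead of quietly turning into '?'.
    public static bool DecodesAsStrictAscii(byte[] data)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strictAscii.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(DecodesAsStrictAscii(new byte[] { 0x41, 0x42 })); // plain ASCII
        Console.WriteLine(DecodesAsStrictAscii(new byte[] { 0x41, 0x80 })); // contains a high byte
    }
}
```

A strict encoding does not make it safe to treat keys as text, but it does turn silent key corruption into a loud failure that you will notice in testing.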
Special thanks to Brian Chess for reviewing this post for content and clarity.