Hazards of Converting Binary Data To A String


Back in November, someone asked a question on StackOverflow about converting arbitrary binary data (in the form of a byte array) to a string. I know this because I make it a habit to read randomly selected questions in StackOverflow written in November 2011. Questions about text encodings in particular really turn me on.

In this case, the person posing the question was encrypting data into a byte array and converting that data into a string. The conversion code he used was similar to the following:

string text = System.Text.Encoding.UTF8.GetString(data);

That isn’t exactly their code, but this is a pattern I’ve seen in the past. In fact, I have a story about this I want to tell you in a future blog post. But I digress.

The infamous Jon Skeet answers:

You should absolutely not use an Encoding to convert arbitrary binary data to text. Encoding is for when you’ve got binary data which genuinely is encoded text - this isn’t.

Instead, use Convert.ToBase64String to encode the binary data as text, then decode using Convert.FromBase64String.

Yes! Absolutely. Totally agree. As a general rule of thumb, agreeing with Jon Skeet is a good bet.
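To make that advice concrete, here’s roughly what the safe round trip looks like (my own sketch, not code from the original question):

var data = new byte[] { 128, 0, 255, 42 }; // arbitrary binary data, not text

// Encode the bytes as text...
string text = Convert.ToBase64String(data);

// ...and later get the exact same bytes back.
byte[] roundTripped = Convert.FromBase64String(text);

Console.WriteLine("As Base64:\t" + text);
Console.WriteLine("Original:\t" + String.Join(", ", data));
Console.WriteLine("Round Tripped:\t" + String.Join(", ", roundTripped));

Base64 only ever produces printable ASCII characters, so there’s nothing for a text encoding, file format, or database collation to mangle along the way.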

Not to give you the impression that I’m stalking Skeet, but I did notice that this wasn’t the first time he’d answered a question about using encodings to convert binary data to text. In response to an earlier question he states:

Basically, treating arbitrary binary data as if it were encoded text is a quick way to lose data. When you need to represent binary data in a string, you should use base64, hex or something similar.

This piqued my curiosity. I’ve always known that if you need to send binary data in text format, base64 encoding is the safe way to do so. But I didn’t really understand why the other encodings were unsafe. What are the cases in which you might lose data?

Round Tripping UTF-8 Encoded Strings

Well let’s look at one example. Imagine you’re receiving a stream of bytes and you store it as a UTF-8 string and pop it in the database. Later on, you need to relay that data so you take it out, encode it back to bytes, and send it on its merry way.

The following code simulates that scenario with a byte array containing a single byte, 128.

var data = new byte[] { 128 };
string text = Encoding.UTF8.GetString(data);
var bytes = Encoding.UTF8.GetBytes(text);

Console.WriteLine("Original:\t" + String.Join(", ", data));
Console.WriteLine("Round Tripped:\t" + String.Join(", ", bytes));

The first line of code creates a byte array with a single byte. The second line converts it to a UTF-8 string. The third line takes the string and converts it back to a byte array.

If you drop that code into the Main method of a Console app, you’ll get the following output.

Original:      128
Round Tripped: 239, 191, 189

WTF?! The data was changed and the original value is lost!

If you try it with 127 or less, it round trips just fine. What’s going on here?

UTF-8 Variable Width Encoding

To understand this, it’s helpful to understand what UTF-8 is in the first place. UTF-8 is a format that encodes each character in a string with one to four bytes. It can represent every Unicode character, but is also backwards compatible with ASCII.

ASCII is an encoding that represents each character with seven bits of a single byte, and thus consists of 128 possible characters. The high-order bit in standard ASCII is always zero. Why only seven bits and not the full eight?

Because seven bits ought to be enough for anybody:

When you counted all possible alphanumeric characters (A to Z, lower and upper case, numeric digits 0 to 9, special characters like “% * / ?” etc.) you ended up with a value of 90-something. It was therefore decided to use 7 bits to store the new ASCII code, with the eighth bit being used as a parity bit to detect transmission errors.

UTF-8 takes advantage of this decision to create a scheme that’s backwards compatible with the ASCII characters, but also able to represent all Unicode characters by leveraging the high-order bit that ASCII ignores. Going back to Wikipedia:

UTF-8 is a variable-width encoding, with each character represented by one to four bytes. If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127).

This explains why bytes 0 through 127 all round trip correctly. Those are simply ASCII characters.
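If you want to convince yourself of that, a quick variation on the earlier snippet (my own sketch, droppable into the same Console app) round trips every ASCII byte through UTF-8 without any loss:

// Every byte from 0 to 127 is a valid one-byte UTF-8 sequence (plain ASCII),
// so each one should come back unchanged.
for (int b = 0; b <= 127; b++)
{
    var data = new[] { (byte)b };
    var roundTripped = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));

    if (roundTripped.Length != 1 || roundTripped[0] != data[0])
    {
        Console.WriteLine("Round trip failed at: " + b);
    }
}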

But why does 128 expand into multiple bytes when round tripped?

If the character is encoded by a sequence of more than one byte, the first byte has as many leading “1” bits as the total number of bytes in the sequence, followed by a “0” bit, and the succeeding bytes are all marked by a leading “10” bit pattern.

How do you represent 128 in binary? 10000000

Notice that it’s marked with a leading “10” bit pattern, which means it’s a continuation byte. Continuation of what?

the first byte never has 10 as its two most-significant bits. As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF‑8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.

So in answer to the question of why 128 expands into multiple bytes when round tripped: a single byte of 128 isn’t a valid UTF-8 sequence on its own, so the decoder substitutes the Unicode replacement character (U+FFFD) used for invalid data, and 239, 191, 189 is simply that character’s UTF-8 encoding (thanks to RichB for the answer in the comments!).

I’ve noticed a lot of invalid UTF-8 values expand into these same three bytes. But that’s beside the point. The point is that using UTF-8 encoding to store binary data is a recipe for data loss and heartache.
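As an aside, if you’d rather the conversion blow up loudly than silently swap in replacement characters, you can construct a UTF-8 encoding that throws on invalid bytes (a small sketch; the try/catch is only there to show the failure):

// The second constructor argument tells the encoding to throw on invalid
// bytes instead of quietly substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false,
    throwOnInvalidBytes: true);

try
{
    Console.WriteLine(strictUtf8.GetString(new byte[] { 128 }));
}
catch (DecoderFallbackException)
{
    Console.WriteLine("128 by itself is not valid UTF-8.");
}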

What about Windows-1252?

Going back to the original question, you’ll note that the code didn’t use UTF-8 encoding. I took some liberties in describing his approach. What he did was use System.Text.Encoding.Default. That can be a different encoding on different machines, but on my machine it’s the Windows-1252 character encoding, also known as “Western European Latin”.

This is a single-byte encoding, and when I ran the same round trip code against it, I could not find a data-loss scenario. Wait, could Jon be wrong?

To prove this to myself, I wrote a little program that cycles through every possible byte and round trips it.

using System;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var encoding = Encoding.GetEncoding(1252);
        for (int b = Byte.MinValue; b <= Byte.MaxValue; b++)
        {
            var data = new[] { (byte)b };
            string text = encoding.GetString(data);
            var roundTripped = encoding.GetBytes(text);

            if (!roundTripped.SequenceEqual(data))
            {
                Console.WriteLine("Rount Trip Failed At: " + b);
                return;
            }
        }

        Console.WriteLine("Round trip successful!");
        Console.ReadKey();
    }
}

The output of this program shows that you can convert every byte to a string and back again, and get the same result every time.

So in theory, it could be safe to use the Windows-1252 encoding to convert binary data to a string, despite what Jon said.

But I still wouldn’t do it. Not just because I believe Jon more than my own eyes and code. If it were me, I’d still use Base64 encoding because it’s known to be safe.

There are five unmapped code points in Windows-1252. You never know if those might change in the future. Also, there’s just too much risk of corruption. If you were to store this string in a file that converted its encoding to Unicode or some other encoding, you’d lose data (as we saw earlier).

Or if you were to pass this string to some unmanaged API (perhaps inadvertently) that expected a null-terminated string, it’s possible this string would include an embedded null character and be truncated.
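To make that last point concrete, here’s a contrived sketch (my own example, not from the original question). Any zero byte in the source data becomes an embedded NUL character in the Windows-1252 string:

var windows1252 = Encoding.GetEncoding(1252);

// Binary data that happens to contain a zero byte.
var data = new byte[] { 65, 66, 0, 67, 68 };
string text = windows1252.GetString(data);

Console.WriteLine("Length in .NET:  " + text.Length);        // 5
Console.WriteLine("NUL at index:    " + text.IndexOf('\0'));  // 2

// Anything that treats the string as null terminated only ever sees "AB".
Console.WriteLine("As a C-style string: " + text.Substring(0, text.IndexOf('\0')));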

In other words, the safest bet is to listen to Jon Skeet as I’ve said all along. The next time I see Jon, I’ll have to ask him if there are other reasons not to use Windows-1252 to store binary data other than the ones I mentioned.


Comments


18 responses

  1. Kim January 30th, 2012

    "Yes! Absolutely. Totally agree. As a general rule of thumb, agreeing with Jon Skeet is a good bet."
    I agree.

  2. RichB January 30th, 2012

    It's the Unicode replacement character:
    www.fileformat.info/.../index.htm
    of which the UTF8 encoding is 0xEF 0xBF 0xBD or 239, 191, 189
    I'm sure there are also scenarios where the original data contains the 3 bytes which are used for the UTF8 BOM, and then a naive conversion to a UTF8 string and back again elides the BOM.
    However, my quick test with Mono shows that System.Text.Encoding.UTF8 doesn't lose the data in this way.

  3. Michiel January 30th, 2012

    I second the advice to use base64 encoding to transfer and/or save bytes as a string. Even if Windows-1252 doesn't seem to lead to data loss, you still shouldn't use it. There's a risk that code will assume System.Text.Encoding.Default but that value depends on the operating system's current settings. It could be different on another machine, it could be different on the same machine at a later point in time.
    Phil, in your code snippets you seem to have mixed up the names for the encoded and decoded variables. In .NET, when using the System.Text.Decoder and System.Text.Encoder classes, encoding is the process of turning a string into bytes, and decoding is the process of turning bytes into a string.
    I understand the confusion, because the question on Stack overflow regards the string as an encoding of arbitrary bytes (the bytes not representing text). But when we talk about text, the string is the decoded value, the byte array is the encoded value.
    PS. If you need to ask Jon a question, the best way to reach him is here: http://stackoverflow.com/questions/ask

  4. haacked January 30th, 2012

    @Michiel notice that in my round trip example, I didn't use Encoding.Default. I explicitly created a Windows-1252 encoding. So that would mitigate the risk of different values for different machines. But I agree with you that the consumer of text encoded as such might use Default and screw things up.
    As for the encoded/decoded, I took the name from the original problem domain which was encrypting data into a string. I changed the variable names completely to avoid confusion. Thanks for pointing that out!

  5. Bevan Coleman January 30th, 2012

    Wish I had seen this a month ago :)
    In my case I had a legacy product streaming out 8-bit binary.
    It took me an embarrassingly long time to work out that the UTF-8 Encoding wasn't the same thing as 8-bit binary :/

  6. Jon Skeet January 30th, 2012

    As ever, there's more to say than I find time to actually write down :)
    Your final caveats are very relevant. To state them in a different way, your round-tripping code assumes that the *string itself* will be round-tripped. This assumes that you're not using any protocol which might strip non-printable characters, or perform some different decomposition of non-ASCII characters etc. For example, XML only allows you to use a very few characters below U+0020... and your example of a "truncate on Unicode NUL" API is a good one too.
    Base64 is pleasant in that it only uses printable ASCII characters. It's a shame that the normal form of it doesn't use URL-safe ASCII characters, which is why there are web-targeted variations - but at least most other protocols are likely to round-trip the data (or mutate it in harmless ways, such as adding removable whitespace).
    Another option to mention is hex-encoding the data: this has the disadvantage of taking up more space than base64 (chars = 2x bytes, instead of chars = 4/3x bytes for base64) but it has advantages too: it's easier to decode by inspection, and it's URL-safe by default.

  7. angry guy January 30th, 2012

    Nice, but your post should have ended here: "Yes! Absolutely. Totally agree. As a general rule of thumb, agreeing with Jon Skeet is a good bet."
    Everything below is just a time waste.

  8. haacked January 30th, 2012

    @Jon Skeet thanks for the response! That's what I figured, but I wanted to double check. Thanks for dropping in. :)

  9. adamralph January 30th, 2012

    Good post. I suspect in some cases when such questions are asked the conversions between byte arrays and strings are, in fact, unnecessary. I get the feeling that people feel the need to move data around and/or store it as strings because they feel more comfortable with them, whereas sticking with byte arrays throughout might often be much more sensible.

  10. Artiom Chilaru January 31st, 2012

    The thing about encodings like win-1252 is you'll get the same characters when you decode binary data to a string.. The problem is what you do with it afterwards.
    If you just store it like that in a variable, it'll probably survive a double conversion (byte array - string - byte array). But if you try to export it somewhere (a text editor, for example), and then try to drop it back - this is where you'll get your data lost pretty fast.
    For example the 00-1F characters are control characters (bell character, anyone?) and they just don't "work" when displayed on screen. Some apps will just convert them to a space, or a "?" character.
    In conclusion - yes, there are cases when you can use a simple encoding like win-1252 to decode an arbitrary byte array (for example if you want to look for a specific raw "string" within a file) but in most cases, especially if this data is to be transferred somewhere as a string - use Base64 or Hex encoding!

  11. Ben February 8th, 2012

    Okay, it's probably irrelevant for me to say this, but this is the first time I've felt like maybe I've actually truly progressed as a programmer. Not just because I knew this years ago, but because I'd have bet any sum of money there wasn't anything I knew that the vaunted Mr. Haack did not. Apparently, I've learned things.
    However, knowing me, I probably learned this "the hard way" then researched why it wasn't working.

  12. Me February 10th, 2012

    "Yes! Absolutely. Totally agree. As a general rule of thumb, agreeing with Jon Skeet is a good bet."
    Jon Skeet agrees.

  13. John Bubriski February 14th, 2012

    Hey Phil,
    Thank you for pointing this out, not that I knew this was an issue. But more importantly, it's crazy how bad the sample encryption code is out there on the internet. Recently I tried to dive into it a little, but found an enormous amount of insecure or incorrect code.
    From that endeavor I decided to create a project on Github that is designed to be "idiot-proof" symmetrical encryption using AES in .NET. Well, maybe not idiot-proof but as close as we can get. I would love for you and the other readers to take a look, and let me know what you think!
    The Encryptamajig on Github

  14. Jeff February 24th, 2012

    Thanks.
    I am a bit confused now though, could you please check out this stackoverflow question? stackoverflow.com/...

  15. Larry May 11th, 2014

    Good post. There are many bad examples of encryption on the Net. One (on Microsoft's site), effectively uses this sequence to generate an encryption key:

    var desCrypto = new DESCryptoServiceProvider();
    desCrypto.GenerateKey();
    string sKey = ASCIIEncoding.ASCII.GetString(desCrypto.Key);

    byte[] actualKey = ASCIIEncoding.ASCII.GetBytes(sKey);

    For any non-ascii byte in desCrypto.Key (>= 0x80), the byte is replaced by a question mark in "actualKey".

    It's then a coin toss as to whether you lose any given byte of the key, so for an 8-byte key you have 64% probably of losing at least 4 bytes of key material!

  16. Larry May 11th, 2014

    meant to say "64% probability of losing at least 4 bytes..."

  17. HaakonKL August 8th, 2016

    If you use the system standard, then you're also in for heartache at some point due to computers having different standard encodings.

    Just as an example, you might have had an old computer that ran Windows XP. And now, in 2016, you decided to upgrade to Windows Vista 3.11 for Workgroups, aka, Windows 8.1
    (Windows 7 was Vista 2, and Windows 8 was Vista 3). :p

    Guess who just changed their system standard encoding?
    Now, you might say to yourself that this won't ever bother me, because I already run Windows Vista so I'm safe!

    And then in the future, Windows switches over to UTF-8 as the standard.

  18. firstpostcommenter July 7th, 2017

    ok