Converting Bytes To String In Python [Guide]

This tutorial covers converting bytes to a string (and a string to bytes) in Python versions 2 and 3.

What is the difference between a string and a byte string?

A string is a series of characters.

When stored on disk, the characters are converted to a series of bytes.

The character encoding defines which series of bytes represents which character.

When a string is stored to disk, it is converted to bytes (encoding). When it is read back into a program, the bytes must be converted back to a usable and displayable string (decoding).

To convert the bytes back into a string correctly, you must know what character encoding was used when they were written to disk.
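As a rough illustration of why the encoding matters, the sketch below encodes a string to bytes and then decodes it twice: once with the encoding it was written in, and once with a mismatched one. The sample text and variable names are made up for this example.

```python
# Encode a string containing a non-ASCII character to bytes.
text = 'café'
data = text.encode('utf-8')        # b'caf\xc3\xa9' (the é becomes two bytes)

# Decoding with the same encoding recovers the original string.
round_trip = data.decode('utf-8')  # 'café'

# Decoding with a different encoding raises no error here,
# but it silently produces the wrong characters.
garbled = data.decode('latin-1')   # 'cafÃ©'

print(data, round_trip, garbled)
```

The mismatched decode does not fail, which is why a wrong guess at the encoding can go unnoticed until the text is displayed.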

There are many different character encodings. The most popular are UTF-8 (an encoding of the Unicode character set), the default in Python 3, and ASCII, which is (usually) the default in Python 2.

Unicode rules the roost these days: it offers portability (meaning code written to use Unicode will work the same way on different systems), so you no longer rely on the operating system or locale to determine how characters are interpreted.

Turning a Byte String into a Character String (Decoding)

Python 3

Python 3 has a type called bytes. A bytes value can be written as a bytes literal by prefixing a string literal with b, which means the value is created as a bytes object rather than a string (str).
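A small sketch of how the two types differ in practice; the variable names here are only for illustration.

```python
text = 'abc'    # str: a sequence of Unicode characters
data = b'abc'   # bytes: a sequence of integers in the range 0-255

print(type(text))   # <class 'str'>
print(type(data))   # <class 'bytes'>

# Indexing shows the difference: a str yields a character, bytes yields an int.
first_char = text[0]   # 'a'
first_byte = data[0]   # 97, the ASCII code for 'a'

# The two types never compare equal, even with the same ASCII content.
same = (text == data)  # False
```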

Below, a byte string is decoded from UTF-8 into a string:

encoding = 'utf-8'
b'This is my string'.decode(encoding) # Notice the b! It means the string is a Byte Literal

The str() function can be used similarly:

encoding = 'utf-8'
str(b'This is my string', encoding)

In Python 3, the default encoding is utf-8, so you can omit the encoding if that’s the encoding in use:

b'This is my string'.decode()

Which is the same as:

b'This is my string'.decode(encoding="utf-8")
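The following sketch checks that the forms shown so far give the same result, and points out one pitfall that is easy to hit: calling str() on bytes without an encoding does not decode anything. Variable names are illustrative only.

```python
import sys

data = b'This is my string'

# All three calls produce the same str when the bytes are valid UTF-8.
via_default = data.decode()
via_explicit = data.decode('utf-8')
via_str = str(data, 'utf-8')

# Pitfall: without an encoding, str() returns the printable representation
# of the bytes object (including the b prefix and quotes), not decoded text.
not_decoded = str(data)   # "b'This is my string'"

print(sys.getdefaultencoding())   # 'utf-8' on Python 3
```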

Handling Errors in Python 3

Errors may be encountered when decoding if the bytes are not valid in the chosen encoding. The Python 3 str() function allows you to pass a third parameter telling it how to handle these errors:

str(b'This is my invalid string \x80abc', 'utf-8', 'ignore')

Above, we’ve told the str() function to ignore any errors. Running this without ‘ignore’ as the third parameter would result in:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 26: invalid start byte
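If you would rather handle the failure yourself than pick an error handler up front, one possible approach is to catch the exception. The sketch below is a minimal example; falling back to 'replace' is an assumption made for illustration, not a recommendation for every case.

```python
data = b'This is my invalid string \x80abc'

try:
    text = data.decode('utf-8')   # strict error handling is the default
except UnicodeDecodeError as exc:
    # The exception records where decoding failed and why.
    bad_position = exc.start      # 26
    reason = exc.reason           # 'invalid start byte'
    # Illustrative fallback: decode again, marking the bad byte with U+FFFD.
    text = data.decode('utf-8', errors='replace')

print(text)
```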

You can use the following options in the third parameter to determine how errors are handled (table pulled directly from the Python docs):

Value             Meaning
ignore            Ignore the malformed data and continue without further notice.
strict            Raise UnicodeError (or a subclass); this is the default.
replace           Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding and ‘?’ on encoding.
backslashreplace  Replace with backslashed escape sequences.

Additional error handlers for less common scenarios are detailed in the Python docs at:

https://docs.python.org/3/library/codecs.html

Using the error handler like so:

str(b'This is my invalid string \x80abc', 'utf-8', 'replace')

Will, therefore, produce the value:

This is my invalid string ?abc
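To compare the handlers side by side, the sketch below runs the same invalid bytes through three of them. The dictionary and loop are just scaffolding for the comparison.

```python
data = b'This is my invalid string \x80abc'

results = {}
for handler in ('ignore', 'replace', 'backslashreplace'):
    results[handler] = str(data, 'utf-8', handler)
    print(handler, '->', results[handler])

# ignore drops the invalid byte entirely.
# replace substitutes U+FFFD, the Unicode replacement character.
# backslashreplace substitutes the four characters \x80 as literal text.
```

When decoding, 'replace' always inserts U+FFFD; the '?' marker only appears when encoding.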

Python 2

In Python 2, the decode() method of a string can be used to decode it using a given character encoding:

encoding = 'utf-8'
'This is my string'.decode(encoding)

Alternatively, the unicode() function can be used to build a Unicode string from the bytes:

encoding = 'utf-8'
unicode('This is my string', encoding)

Strings in Python 2 do not have their encoding defined by default. This means that they may be:

  • Encoded in whatever your system locale’s default encoding is
  • Encoded in whatever character set was declared for the file when it was downloaded
  • Just binary data with no meaningful encoding

Essentially, a string in Python 2 is a direct representation of the bytes as read by Python, so your environment will affect it (this is why Python 3 treats all strings as Unicode, so that text is portable and works the same way regardless of your system).

Thus, if you’re writing data in Python 2, it’s useful to define what encoding is being used when writing the data, and to make sure the same encoding is used when converting the bytes back into a string if you are doing it manually.
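One way to follow this advice is to name the encoding explicitly at both ends. The sketch below uses io.open(), which accepts an encoding argument in Python 2.6+ as well as Python 3; the file name, sample text, and temporary directory are assumptions made for the example.

```python
import io
import os
import tempfile

encoding = 'utf-8'
text = u'Cr\xe8me br\xfbl\xe9e'   # escapes keep the source file pure ASCII

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Name the encoding explicitly when writing...
with io.open(path, 'w', encoding=encoding) as handle:
    handle.write(text)

# ...then read the raw bytes back and decode them with the same encoding.
with io.open(path, 'rb') as handle:
    raw = handle.read()

restored = raw.decode(encoding)
os.remove(path)
```

Keeping the encoding in a single variable makes it harder for the writing and reading sides to drift apart.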

Converting Strings into Bytes (Encoding)

The rationale and definitions above all apply below when converting Strings into Byte Strings – it’s just doing the reverse.

Python 3

The encode() method can be used in Python 3, with an optional parameter naming the character encoding (UTF-8 is used if it is omitted):

encoding = 'utf-8'
'This is my string'.encode(encoding) # returns the value as bytes

A second parameter can also be passed with one of the error handlers outlined above.
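The error handlers behave slightly differently when encoding. The sketch below tries to encode a non-ASCII character as ASCII, which cannot represent it; the sample text is made up for the example.

```python
text = 'café'

# With the default 'strict' handler, an unencodable character raises an error.
strict_failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode('ascii', 'ignore')            # b'caf'
replaced = text.encode('ascii', 'replace')          # b'caf?'
escaped = text.encode('ascii', 'backslashreplace')  # b'caf\\xe9'

print(ignored, replaced, escaped)
```

Note that 'replace' produces '?' here, whereas on decoding it produces U+FFFD.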

Python 2

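In Python 2, bytes is simply an alias for str and does not accept an encoding argument, so bytes('This is my string', encoding) raises a TypeError there. Instead, call the encode() method of a unicode string to get a byte string:

encoding = 'utf-8'
u'This is my string'.encode(encoding)

If the encoding is not supplied, the default encoding (usually ASCII) will be used; see above for how Python 2 handles default encoding.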

Checking a Variable’s Type

After converting a variable to another type, you may wish to verify the type of the new variable. We cover how to do this in Python in a separate article.
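As a quick sketch for Python 3, the built-in isinstance() function can tell the two types apart after a conversion. The describe() helper below is hypothetical and exists only for this example.

```python
def describe(value):
    """Return a short label for text vs. binary values (hypothetical helper)."""
    if isinstance(value, bytes):
        return 'bytes'
    if isinstance(value, str):
        return 'str'
    return type(value).__name__

data = 'This is my string'.encode('utf-8')
text = data.decode('utf-8')

print(describe(data))   # bytes
print(describe(text))   # str
```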


Brad Morton

I'm Brad, and I'm nearing 20 years of experience with Linux. I've worked in just about every IT role there is before taking the leap into software development. Currently, I'm building desktop and web-based solutions with NodeJS and PHP hosted on Linux infrastructure. Visit my blog or find me on Twitter to see what I'm up to.
