This tutorial covers converting bytes to a string (and a string to bytes) in Python versions 2 and 3.
What is the difference between a string and a byte string?
A string is a series of characters.
When stored on disk, the characters are converted to a series of bytes.
Which sequence of bytes represents which character is defined by the character encoding.
When a string is stored to disk, it is converted to bytes (encoding). When it is read back into a program, the bytes must be converted back to a usable and displayable string (decoding).
To convert the bytes back into a string correctly, you must know what character encoding was used when they were written to disk.
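As a small illustration of why the encoding matters, the following Python 3 sketch encodes a string as UTF-8 and then decodes the same bytes twice: once with the correct encoding and once with Latin-1. The example string and variable names are chosen purely for illustration.

```python
# Encode a string containing a non-ASCII character as UTF-8.
text = 'café'
data = text.encode('utf-8')       # b'caf\xc3\xa9' (the 'é' becomes two bytes)

# Decoding with the encoding that was used for writing recovers the original.
correct = data.decode('utf-8')    # 'café'

# Decoding with a different encoding raises no error here,
# but each of the two bytes is treated as a separate character.
garbled = data.decode('latin-1')  # 'cafÃ©'

print(correct == text)   # True
print(garbled == text)   # False
```

Note that the wrong encoding does not always raise an error; it may silently produce the wrong characters, which is why the encoding should be known rather than guessed.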
There are many different character encodings. The most popular are UTF-8 (an encoding of Unicode), the default in Python 3, and ASCII, which is (usually) the default in Python 2.
Unicode rules the roost these days – it offers portability (meaning code written to use Unicode will work the same way on different systems), rather than relying on the operating system or locale to provide information on how characters are displayed.
Turning a Byte String into a Character String (Decoding)
Python 3 has a type called bytes. A bytes value can be written as a byte literal by prefixing a string literal with b, which means the value is created as a bytes object rather than a string.
Below, a byte string is decoded to a UTF-8 string:
encoding = 'utf-8'
b'This is my string'.decode(encoding)  # Notice the b! It means the string is a byte literal
The str() function can be used similarly:
encoding = 'utf-8'
str(b'This is my string', encoding)
In Python 3, the default encoding is utf-8, so you can omit the encoding if that’s the encoding in use:
b'This is my string'.decode()
Which is the same as:
b'This is my string'.decode(encoding="utf-8")
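The equivalences described above can be checked with a short Python 3 sketch. It also shows a common pitfall: calling str() on bytes without an encoding does not decode them, it returns their printable representation instead. The variable names are illustrative only.

```python
raw = b'This is my string'

# All three forms decode the bytes as UTF-8 and return a str.
via_default = raw.decode()
via_keyword = raw.decode(encoding='utf-8')
via_str = str(raw, 'utf-8')

print(type(raw).__name__, type(via_default).__name__)  # bytes str
print(via_default == via_keyword == via_str)           # True

# Without an encoding, str() returns the repr of the bytes object,
# including the b prefix and the quotes.
print(str(raw))  # b'This is my string'
```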
Handling Errors in Python 3
Errors may be encountered when decoding a byte string if it contains bytes that are not valid in the chosen encoding. The Python 3 str() function allows you to pass a third parameter telling it how to handle these errors:
str(b'This is my invalid string \x80abc', 'utf-8', 'ignore')
Above, we’ve told the str() function to ignore any errors. Running this without 'ignore' as the third parameter would result in:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 26: invalid start byte
You can use the following options in the third parameter to determine how errors are handled (table pulled directly from the Python docs):
|Value|Meaning|
|---|---|
|ignore|Ignore the malformed data and continue without further notice.|
|strict|Raise UnicodeError (or a subclass); this is the default.|
|replace|Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding and '?' on encoding.|
|backslashreplace|Replace with backslashed escape sequences.|
Additional error handlers for less common scenarios are detailed in the Python documentation for the codecs module.
Using the error handler like so:
str(b'This is my invalid string \x80abc', 'utf-8', 'replace')
Will, therefore, produce the value:
This is my invalid string ?abc
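To compare the handlers side by side, the sketch below decodes the same invalid byte string with each option from the table. It uses the bytes.decode() method, which accepts the same error handler names as str(); the results dictionary and the use of ascii() for display are illustrative choices.

```python
raw = b'This is my invalid string \x80abc'

# Decode the same bytes with each non-strict error handler.
results = {
    handler: raw.decode('utf-8', handler)
    for handler in ('ignore', 'replace', 'backslashreplace')
}

for handler, value in results.items():
    # ascii() displays any non-ASCII character as an escape sequence.
    print(handler, ascii(value))
# ignore 'This is my invalid string abc'
# replace 'This is my invalid string \ufffdabc'
# backslashreplace 'This is my invalid string \\x80abc'

# The default handler, 'strict', raises an exception instead.
try:
    raw.decode('utf-8')
    strict_reason = None
except UnicodeDecodeError as exc:
    strict_reason = exc.reason

print('strict', strict_reason)  # strict invalid start byte
```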
In Python 2, the decode() method of a string can be used to decode it using a given character encoding:
encoding = 'utf-8'
'This is my string'.decode(encoding)
or, if you’re using Unicode, the unicode() function can be used:
encoding = 'utf-8'
unicode('This is my string', encoding)
Strings in Python 2 do not have their encoding defined by default. This means that they may be:
- Whatever your system locale’s default encoding is
- Whatever MIME type was defined by the file when it was downloaded
- Just binary data with no meaningful encoding
Essentially, a string in Python 2 is a direct representation of the bytes as read by Python, so your environment will affect it (this is why strings in Python 3 are always Unicode: they are portable and work the same way regardless of your system).
Thus, if you’re writing data in Python 2, it’s useful to define which encoding is used when writing the data, and to make sure the same encoding is used when converting the bytes back into a string if you are doing it manually.
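One way to follow this advice is to name the encoding explicitly whenever data is written or read. The sketch below assumes the io.open() function, which accepts an encoding argument in both Python 2 and Python 3; the ENCODING constant, the example text, and the temporary file name are hypothetical and used only for illustration.

```python
import io
import os
import tempfile

ENCODING = 'utf-8'   # the single place where the encoding is defined
text = u'caf\xe9'    # 'café' as a Unicode string

path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Write the text using the explicitly chosen encoding.
with io.open(path, 'w', encoding=ENCODING) as handle:
    handle.write(text)

# Read the raw bytes back, then decode them manually with the same encoding.
with io.open(path, 'rb') as handle:
    raw = handle.read()

restored = raw.decode(ENCODING)
print(restored == text)  # True

# Remove the temporary file and directory.
os.remove(path)
os.rmdir(os.path.dirname(path))
```

Keeping the encoding in one named constant makes it harder for the writing and reading code to drift apart.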
Converting Strings into Bytes (Encoding)
The rationale and definitions above all apply below when converting Strings into Byte Strings – it’s just doing the reverse.
The encode method can be used in Python 3, with an optional second parameter giving the character encoding:
encoding = 'utf-8'
str.encode('This is my string', encoding)  # returns the value as bytes
A third parameter can also be passed with one of the error handlers outlined above.
The following will return a bytes variable for the given string:
encoding = 'utf-8'
bytes('This is my string', encoding)
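If the third parameter (the error handler) is not supplied, the default, 'strict', is used. Note that in Python 3, bytes() requires the encoding to be given when it is passed a string, whereas the encode method falls back to UTF-8 if the encoding is omitted. See above for how Python 2 handles default encoding.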
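The error handlers apply to encoding in the same way as to decoding. The Python 3 sketch below encodes a string containing a character that ASCII cannot represent, using several of the handlers listed earlier; the example string and variable names are illustrative.

```python
text = 'café'

# With no arguments, encode() uses UTF-8; bytes() needs the encoding spelled out.
utf8_bytes = text.encode()
same_bytes = bytes(text, 'utf-8')
print(utf8_bytes == same_bytes)  # True

# ASCII cannot represent 'é', so the error handler decides what happens.
replaced = text.encode('ascii', 'replace')           # b'caf?'
ignored = text.encode('ascii', 'ignore')             # b'caf'
escaped = text.encode('ascii', 'backslashreplace')   # b'caf\\xe9'

# The default handler, 'strict', raises an exception.
try:
    text.encode('ascii')
    strict_reason = None
except UnicodeEncodeError as exc:
    strict_reason = exc.reason

# Calling bytes() on a string without an encoding is an error in Python 3.
try:
    bytes(text)
    needs_encoding = False
except TypeError:
    needs_encoding = True

print(replaced, ignored, escaped)
print(strict_reason, needs_encoding)
```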