Have you ever received a PDF or an image file from someone via email, only to see strange characters when you open it? This can happen if your email server was only designed to handle text data. Files with binary data, bytes that represent non-text information like images, can easily be corrupted when they are transferred to and processed by text-only systems.
By encoding our data, we improve the chances of it being processed correctly by various systems.
In this tutorial, we will learn how Base64 encoding and decoding work, and how they can be used. We will then use Python to Base64 encode and decode both text and binary data.
Python String to bytes, bytes to String
In mathematics, the base of a number system refers to how many different characters represent numbers. The name of this encoding comes directly from the mathematical definition of bases: we have 64 characters that represent numbers. When the computer converts Base64 characters to binary, each Base64 character represents 6 bits of information. Note: This is not an encryption algorithm, and should not be used for security purposes. Now that we know what Base64 encoding is and how it is represented on a computer, let's look deeper into how it works.
We will illustrate how Base64 encoding works by converting text data, as it is more standardized than the various binary formats we could choose from. If we were to Base64 encode a string, we would follow the steps described here. Recall that each Base64 character only represents 6 bits of data.
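The Base64 encoding steps can be sketched by hand as follows. This is a minimal illustration, not how the standard library implements it: the helper name manual_b64encode and the sample string "Python" are assumptions made for this example, and the result is compared against base64.b64encode from the standard library.

```python
import base64

# The standard Base64 alphabet: index 0-63 maps to exactly one character.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def manual_b64encode(data: bytes) -> str:
    """Hypothetical helper that walks through the encoding steps one by one."""
    # Step 1: write every byte as an 8-bit binary string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Step 2: pad with zero bits so the total length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Step 3: regroup into 6-bit chunks and read each chunk as a decimal index.
    indexes = [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]
    # Step 4: look each index up in the Base64 alphabet.
    encoded = "".join(ALPHABET[i] for i in indexes)
    # Step 5: pad the output with "=" to a multiple of 4 characters.
    return encoded + "=" * (-len(encoded) % 4)


text = "Python"
manual = manual_b64encode(text.encode("utf-8"))
library = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(manual)             # UHl0aG9u
print(manual == library)  # True
```

In real code, base64.b64encode and base64.b64decode are the functions to reach for; the manual version is only meant to make the bit regrouping visible.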
We now re-group the 8-bit binary sequences into chunks of 6 bits. Note: Sometimes we are not able to group the data into sequences of 6 bits. If that occurs, we have to pad the sequence. With our data in groups of 6 bits, we can obtain the decimal value for each group. Finally, we convert these decimals into the appropriate Base64 characters using the Base64 conversion table.
Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output an error message in English, French, Japanese, Hebrew, or Russian.
Web content can be written in any of these languages and can also include a variety of emoji symbols. The Unicode specifications are continually revised and updated to add new languages and symbols.
A character is the smallest possible component of a text. The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values). The Unicode standard contains a lot of tables listing characters and their corresponding code points. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal).
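This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding. If every code point were simply stored as a 32-bit integer, a lot of space would be occupied by 0x00 bytes, because in most texts the majority of the code points are less than 127, or less than 255. UTF-8 is one of the most commonly used encodings, and Python often defaults to using it.
UTF-8 uses the following rules: if the code point is less than 128, it is represented by the corresponding byte value; if the code point is 128 or greater, it is turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.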
Base64 Encoding a String in Python
UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes. UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
This avoids the byte-ordering issues that can occur with integer and word oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.
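A short sketch can make the variable-length and byte-order points concrete. The sample characters below are chosen for illustration only; each one needs a different number of bytes in UTF-8, and the UTF-16 lines show how the same character produces different byte sequences depending on byte order.

```python
# Four sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
sample = "A\u00e9\u20ac\U0001F600"

for ch in sample:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in sample]

# UTF-16 is word oriented, so the byte order matters.
little_endian = "A".encode("utf-16-le")
big_endian = "A".encode("utf-16-be")
print(little_endian, big_endian)
```

UTF-8 has no such variants: the byte sequence for a given string is the same on every machine.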
Be prepared for some difficult reading. A chronology of the origin and development of Unicode is also available on the site. To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables. Another good introductory article was written by Joel Spolsky. Since Python 3.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal. Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.
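As a small illustration, the capital delta can be written in a string literal in several equivalent ways. The variable names are made up for this sketch.

```python
# Three spellings of the same one-character string (U+0394).
delta_literal = "Δ"                             # typed directly in the source
delta_escape = "\u0394"                         # 4-digit hex escape
delta_named = "\N{GREEK CAPITAL LETTER DELTA}"  # escape by character name

print(delta_literal == delta_escape == delta_named)  # True
print(ascii(delta_literal))                          # '\u0394'
```

The ascii() built-in always shows the escaped form, which is handy when a terminal cannot display the glyph.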
In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument. Different values of the errors argument produce different results for the same malformed input.
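The effect of the errors argument can be sketched with a byte string that is not valid UTF-8. The input bytes are an arbitrary example; "strict", "replace", "backslashreplace" and "ignore" are standard error handler names.

```python
raw = b"\x80abc"  # 0x80 cannot start a UTF-8 sequence

# "strict" is the default: invalid input raises UnicodeDecodeError.
try:
    raw.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc)

replaced = raw.decode("utf-8", "replace")           # inserts U+FFFD
escaped = raw.decode("utf-8", "backslashreplace")   # inserts a \x80 escape
ignored = raw.decode("utf-8", "ignore")             # drops the bad byte

print(repr(replaced), repr(escaped), repr(ignored))
```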
Python ships with many different encodings; see the Python Library Reference at Standard Encodings for a list. One-character Unicode strings can also be created with the chr() built-in function, which takes an integer and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function, which takes a one-character Unicode string and returns the code point value.
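A brief sketch of the chr() and ord() round trip, using two arbitrarily chosen code points:

```python
# chr() maps a code point (an integer) to a one-character string.
delta = chr(0x394)
private_use = chr(57344)

# ord() is the reverse: one-character string to code point.
print(delta, ord(delta))        # Δ 916
print(ascii(private_use), ord(private_use))  # '\ue000' 57344

# Round trip over a small range of code points.
round_trip_ok = all(ord(chr(cp)) == cp for cp in range(0x20, 0x7F))
```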
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding. The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. The low-level routines for registering and accessing the available encodings are found in the codecs module.
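The following sketch shows str.encode() with several error handlers, plus a lookup through the codecs module. The sample string is built from two arbitrary non-ASCII code points so that the ASCII codec cannot represent it.

```python
import codecs

# Two non-ASCII code points around an ASCII core.
text = chr(40960) + "abcd" + chr(1972)

try:
    text.encode("ascii")  # default handler is "strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")
print(ignored, replaced, xml_refs, escaped, sep="\n")

# codecs.lookup() returns the registered codec information for a name.
codec_name = codecs.lookup("UTF8").name
```

The handlers xmlcharrefreplace and backslashreplace are examples of the extra handlers that are useful when encoding.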
Implementing new encodings also requires understanding the codecs module. You can also assemble strings using the chr built-in function, but this is even more tedious. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime. Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used.
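As a rough sketch of how an encoding declaration is recognized, the standard tokenize module can read the declaration from the first lines of a source file. The two-line source below is a made-up example stored as Latin-1 bytes; the variable names are assumptions for this illustration.

```python
import codecs
import io
import tokenize

# A tiny source file, stored as Latin-1 bytes, that declares its encoding
# in a special comment on the first line.
source_bytes = "# -*- coding: latin-1 -*-\nname = 'caf\u00e9'\n".encode("latin-1")

# detect_encoding() inspects the first two lines for the declaration.
encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
source_text = source_bytes.decode(encoding)

print(encoding)
print(source_text)
```

Without the declaration, the same bytes would be treated as UTF-8 and the byte 0xE9 would not decode.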
This is done by including a special comment as either the first or second line of the source file.
Some of the features described here may not be available in earlier versions of Python.
The codecs module provides stream and file interfaces for transcoding data in your program. It is most commonly used to work with Unicode text, but other encodings are also available for other purposes. In CPython 2, old-style str instances use a single 8-bit byte to represent each character of the string using its ASCII code.
In contrast, unicode strings are managed internally as a sequence of Unicode code points. The code point values are saved as a sequence of 2 or 4 bytes each, depending on the options given when Python was compiled.
Both unicode and str are derived from a common base class, and support a similar API. When unicode strings are output, they are encoded using one of several standard schemes so that the sequence of bytes can be reconstructed as the same string later.
The bytes of the encoded value are not necessarily the same as the code point values, and the encoding defines a way to translate between the two sets of values. Reading Unicode data also requires knowing the encoding so that the incoming bytes can be converted to the internal representation used by the unicode class. The most common encodings for Western languages are UTF-8 and UTF-16, which use sequences of one and two byte values respectively to represent each character.
Other encodings can be more efficient for storing languages where most of the characters are represented by code points that do not fit into two bytes. For more introductory information about Unicode, refer to the list of references at the end of this section. The best way to understand encodings is to look at the different series of bytes produced by encoding the same string in different ways.
The examples use a helper function to format the byte string to make it easier to read. The function uses binascii to get a hexadecimal representation of the input byte string, then inserts a space between every nbytes bytes before returning the value. The string is then encoded as UTF-8 and UTF-16 respectively, and the hexadecimal values resulting from each encoding are shown.
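The formatting helper described above might look roughly like this in Python 3. This is a reconstruction under assumptions, not the original listing: the name to_hex, its signature, and the sample string are chosen for illustration, and UTF-16 is encoded big-endian without a byte order mark so the output is the same on every machine.

```python
import binascii


def to_hex(data: bytes, nbytes: int) -> str:
    """Format bytes as hex digits, with a space after every nbytes bytes."""
    hex_digits = binascii.hexlify(data).decode("ascii")
    chars_per_item = nbytes * 2  # two hex digits per byte
    return " ".join(
        hex_digits[start:start + chars_per_item]
        for start in range(0, len(hex_digits), chars_per_item)
    )


text = "pi: \u03c0"
utf8_hex = to_hex(text.encode("utf-8"), 1)
utf16_hex = to_hex(text.encode("utf-16-be"), 2)
print(utf8_hex)   # 70 69 3a 20 cf 80
print(utf16_hex)  # 0070 0069 003a 0020 03c0
```

The ASCII characters keep their values in both encodings, while the final character takes two bytes in UTF-8 and a single two-byte unit in UTF-16.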
The result of encoding a unicode string is a str object.
Given a sequence of encoded bytes as a str instance, the decode method translates them to code points and returns the sequence as a unicode instance. The default encoding is set during the interpreter start-up process, when site is loaded. Refer to Unicode Defaults for a description of the default encoding settings accessible via sys. Whether you are writing to a file, socket, or other stream, you will want to ensure that the data is using the proper encoding.
The principal built-in types are numerics, sequences, mappings, classes, instances and exceptions.
Some collection classes are mutable. Some operations are supported by several object types; in particular, practically all objects can be compared for equality, tested for truth value, and converted to a string with the repr function or the slightly different str function. The latter function is implicitly used when an object is written by the print function.
Any object can be tested for truth value, for use in an if or while condition or as operand of the Boolean operations below. Operations and built-in functions that have a Boolean result always return 0 or False for false and 1 or True for true, unless otherwise stated. Important exception: the Boolean operations or and and always return one of their operands.
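The point that or and and return one of their operands, and evaluate the second operand only when needed, can be checked with a small sketch. The helper noisy and the list calls are hypothetical names used to record which operands were actually evaluated.

```python
calls = []


def noisy(value):
    """Record that this operand was evaluated, then return it unchanged."""
    calls.append(value)
    return value


# "or" returns the first truthy operand; here the first one is falsy,
# so the second operand is evaluated and returned.
result_or = noisy("") or noisy("fallback")

# "and" returns the first falsy operand; here the first one is falsy,
# so the second operand is never evaluated.
result_and = noisy(0) and noisy("never evaluated")

print(repr(result_or), repr(result_and), calls)
```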
The or operation is a short-circuit operator, so it only evaluates the second argument if the first one is false. The and operation is also a short-circuit operator, so it only evaluates the second argument if the first one is true. There are eight comparison operations in Python.
They all have the same priority which is higher than that of the Boolean operations. Objects of different types, except different numeric types, never compare equal.
The behavior of the is and is not operators cannot be customized; also they can be applied to any two objects and never raise an exception.
There are three distinct numeric types: integers, floating point numbers, and complex numbers.
In addition, Booleans are a subtype of integers. Integers have unlimited precision. Floating point numbers are usually implemented using double in C; information about the precision and internal representation of floating point numbers for the machine on which your program is running is available in sys.float_info. Complex numbers have a real and imaginary part, which are each a floating point number. To extract these parts from a complex number z, use z.real and z.imag.
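A few of these statements about numeric types can be verified directly. The values below are arbitrary examples; sys.float_info is the standard place to inspect the floating point representation.

```python
import sys

# Integers have unlimited precision: this value is exact, not rounded.
big = 2 ** 100
print(big)

# Booleans are a subtype of integers.
bool_is_int = isinstance(True, int)
bool_sum = True + True

# Complex numbers expose their parts as floats.
z = 3 + 4j
print(z.real, z.imag)

# Precision details of the platform's float type.
print(sys.float_info.dig, sys.float_info.max)
```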
The standard library includes the additional numeric types fractions.Fraction, for rationals, and decimal.Decimal, for floating-point numbers with user-definable precision.
Let us look at the encode and decode functions in detail in this article. The type of encoding to be followed is shown by the encoding parameter.
There are various types of character encoding schemes, out of which UTF-8 is used in Python by default. Although there is not much of a difference, you can observe that the encoded string is prefixed with a b.
This means that the string is converted to a stream of bytes, which is how it is stored on any computer. As bytes! The raw bytes are not human-readable; they are only displayed as the original string for readability, prefixed with a b to denote that it is not a string, but a sequence of bytes.
Let us look at the above concepts using a simple example. Similar to encoding a string, we can decode a stream of bytes to a string object, using the decode function. Since encode converts a string to bytes, decode simply does the reverse. Similar to that of encode, the encoding parameter decides the type of encoding from which the byte sequence is decoded.
The errors parameter denotes the behavior if the decoding fails, and takes the same values as that of encode. If we use the wrong format, it will result in the wrong output and can give rise to errors. For example, decoding a byte sequence that was encoded in the UTF-8 format with a different codec is incorrect; decoding is only correct when the encoding and decoding formats are the same.
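The difference between a matching and a mismatched decode can be sketched as follows. The word and the codecs are chosen for illustration: Latin-1 silently produces the wrong characters, while ASCII raises an error.

```python
data = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Correct: decode with the same format that was used to encode.
right = data.decode("utf-8")

# Wrong output: Latin-1 maps every byte to some character, so the
# two UTF-8 bytes of the accented letter become two unrelated characters.
wrong = data.decode("latin-1")

# Error: ASCII has no meaning for bytes above 0x7F.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(right, wrong, ascii_failed)
```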
In this article, we learned how to use the encode and decode methods to encode an input string and decode an encoded byte sequence. Keep in mind that encoding is not encryption: an encoded value can be decoded by anyone, so it should not be relied on to protect passwords or other secrets.
However, I'd like to work with the output as a normal Python string, so that I could print it. I thought that's what the binascii. How do I convert the bytes value back to string?
I mean, using the "batteries" instead of doing it manually. Because the encoding is unknown, expect non-English symbols to translate to characters of the chosen code page (English characters are not translated, because they match in most single-byte encodings and UTF-8). The same applies to latin-1, which was popular (the default?).
See the missing points in Codepage Layout - it is where Python chokes with the infamous ordinal not in range. UPDATE: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler.
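The two approaches for bytes of unknown encoding can be sketched like this. The sample bytes are arbitrary; latin-1 is used here as the single-byte encoding because it assigns a character to every byte value, which is what makes the round trip lossless.

```python
# Every possible byte value, to show that nothing is lost.
every_byte = bytes(range(256))

# A single-byte encoding that defines all 256 values never fails,
# and encoding again gives back the original bytes.
as_text = every_byte.decode("latin-1")
round_trip = as_text.encode("latin-1")

# Alternative: decode as UTF-8 and escape whatever is not valid.
mixed = b"caf\xe9 ok"
escaped = mixed.decode("utf-8", errors="backslashreplace")

print(round_trip == every_byte)  # True
print(escaped)                   # caf\xe9 ok
```

The first approach may display the wrong characters for non-English text, but it never raises; the second keeps valid text readable and makes the undecodable bytes visible.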
That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions. It should be slower than the code page solution, but it should produce identical results on every Python version. In Python 3, the default encoding is "utf-8", so you can call decode() on the bytes object without arguments. On the other hand, in Python 2, encoding defaults to the default string encoding.
Thus, you should pass the encoding explicitly. Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows'. It will only matter if you have some unusual non-ASCII characters in your content, but then it will make a difference.
By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation or read it here. While Aaron Maenpaa's answer just works, a user recently asked a related question. It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding.
I am new to python3, coming from python2, and I am a bit confused with unicode fundamentals.
I've read some good posts, that made it all much clearer, however I see there are 2 methods on python 3, that handle encoding and decoding, and I'm not sure which one to use. So the idea in python 3 is, that every string is unicode, and can be encoded and stored in bytes, or decoded back into unicode string again.
But there are 2 ways to do it: u'something'.encode('utf-8') will generate b'bytes', but so does bytes(u'something', 'utf-8'). And b'bytes'.decode('utf-8') seems to do the same thing as str(b'bytes', 'utf-8'). Now my question is, why are there 2 methods that seem to do the same thing, and is either better than the other and why? I've been trying to find an answer to this on Google, but no luck.
Neither is better than the other, they do exactly the same thing. However, using .encode() and .decode() is the more common way to do it. It is also compatible with Python 2.
To add to Lennart Regebro's answer: there is even a third way that can be used. Anyway, it is actually exactly the same as the first approach.
It may also look as if the second way is syntactic sugar for the third approach. A programming language is a means to express abstract ideas formally, to be executed by the machine.
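The approaches discussed in this thread can be compared side by side. The third spelling shown here, calling the method through the class, is an assumption about what the answer meant by a third way; all of the calls are standard Python 3.

```python
original = "something"

# Three spellings that produce the same bytes.
encoded1 = original.encode("utf-8")       # method on the string object
encoded2 = bytes(original, "utf-8")       # bytes constructor
encoded3 = str.encode(original, "utf-8")  # method called through the class

# And the corresponding ways back to a string.
decoded1 = encoded1.decode("utf-8")
decoded2 = str(encoded1, "utf-8")
decoded3 = bytes.decode(encoded1, "utf-8")

print(encoded1 == encoded2 == encoded3)              # True
print(decoded1 == decoded2 == decoded3 == original)  # True
```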
Byte Objects vs String in Python
A programming language is considered good if it contains the constructs that one needs. Python is a hybrid language -- i. Sometimes functions are more appropriate than object methods, sometimes the reverse is true. It depends on the mental picture of the problem being solved. In my opinion, this is a nice example that shows alternative ways of thinking about what is technically the same thing. In other words, calling an object method means thinking in terms of "let the object give me the wanted result".