Introduction
In modern computing, handling text data with diverse characters from various languages is crucial. This often involves dealing with Unicode strings, which can include special symbols like currency signs (£, $, €) or accented letters. Python provides robust tools for working with these Unicode strings, allowing you to convert and encode them efficiently.
Understanding Unicode
Unicode is a universal character encoding standard that assigns unique code points to every character in almost all human languages. In Python, strings can be either regular byte strings (str in Python 2, bytes in Python 3) or Unicode strings (unicode in Python 2, str in Python 3). Handling Unicode correctly is essential for ensuring that your application processes text data accurately across different systems and languages.
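To make the difference between code points and bytes concrete, here is a minimal sketch using the built-in ord and chr functions. It assumes Python 3, and the variable names are illustrative only.

```python
# Assumes Python 3, where str holds Unicode text and bytes holds raw bytes.
text = "\u00e9\u20ac"  # the two characters "é" and "€"

# Each character corresponds to exactly one Unicode code point.
code_points = [ord(ch) for ch in text]
print(code_points)  # [233, 8364]

# chr() goes the other way, from a code point to a character.
euro = chr(8364)
print(euro == "\u20ac")  # True

# Encoding turns text into bytes; the byte count depends on the encoding.
encoded = text.encode("utf-8")
print(type(text).__name__, len(text))        # str 2
print(type(encoded).__name__, len(encoded))  # bytes 5
```

The string has two characters but five UTF-8 bytes, which is why text length and byte length should not be treated as interchangeable.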
Converting Unicode Strings to ASCII
When you need to convert a Unicode string to an ASCII-compatible format (removing non-ASCII characters), Python provides several methods. Here’s how:
Using the encode Method
The encode method converts a Unicode string into bytes in a specified encoding, such as ASCII. Its optional second argument (errors) controls how characters that cannot be represented in the target encoding are handled:
# Original Unicode string
unicode_string = u"Klüft skräms inför på fédéral électoral grande"
# Convert to ASCII, ignoring non-ASCII characters
ascii_string = unicode_string.encode('ascii', 'ignore')
print(ascii_string)  # Output: b'Klft skrms infr p fdral lectoral grande'
Alternatively, you can replace non-ASCII characters with a placeholder:
# Replace non-ASCII characters with '?'
ascii_string_with_replacement = unicode_string.encode('ascii', 'replace')
print(ascii_string_with_replacement)  # Output: b'Kl?ft skr?ms inf?r p? f?d?ral ?lectoral grande'
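Besides 'ignore' and 'replace', the standard codecs also accept other error handlers. The sketch below, which assumes Python 3 and uses an illustrative sample string, shows the default 'strict' behavior alongside 'backslashreplace' and 'xmlcharrefreplace'.

```python
text = "f\u00e9d\u00e9ral \u20ac5"  # "fédéral €5"

# 'strict' is the default and raises on the first unencodable character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict failed at position", exc.start)

# 'backslashreplace' keeps the information as Python-style escape sequences.
escaped = text.encode("ascii", "backslashreplace")
print(escaped)  # b'f\\xe9d\\xe9ral \\u20ac5'

# 'xmlcharrefreplace' emits numeric character references for HTML/XML output.
xml_safe = text.encode("ascii", "xmlcharrefreplace")
print(xml_safe)  # b'f&#233;d&#233;ral &#8364;5'
```

Unlike 'ignore' and 'replace', these two handlers are lossless, so the original characters can still be recovered from the output if needed.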
Normalizing Unicode Strings
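Unicode normalization converts strings to a standard form, which is useful for comparing or processing text. The NFKD form decomposes each accented character into a base letter followed by combining marks, so encoding to ASCII with 'ignore' afterwards drops only the marks and keeps the base letters: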
import unicodedata
# Normalize the string using NFKD (Compatibility Decomposition)
normalized_string = unicodedata.normalize('NFKD', unicode_string).encode('ascii', 'ignore')
print(normalized_string)  # Output: b'Kluft skrams infor pa federal electoral grande'
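The comparison use case can be sketched as follows. The same visible character can be stored as one code point or as a base letter plus a combining mark, and normalizing both sides makes them compare equal. The strip_accents helper is hypothetical and shown only as one possible approach; it assumes Python 3.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two strings render the same but are different code point sequences.
print(composed == decomposed)  # False

# Normalizing both sides to the same form makes them comparable.
same = (unicodedata.normalize("NFC", composed)
        == unicodedata.normalize("NFC", decomposed))
print(same)  # True


def strip_accents(text):
    """Hypothetical helper: drop combining marks after NFKD decomposition."""
    decomposed_text = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed_text if not unicodedata.combining(ch)
    )


print(strip_accents("Kl\u00fcft skr\u00e4ms"))  # Kluft skrams
```

Filtering on unicodedata.combining keeps the result as a str and leaves other non-ASCII characters in place, whereas encoding to ASCII with 'ignore' discards every non-ASCII character.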
Encoding Unicode Strings for Storage
When you need to store or transmit Unicode strings, they must be encoded into a byte representation. Common encodings include UTF-8 and UTF-16:
# Example Unicode string with special characters
unicode_string = u"£10"
# Encode using UTF-8
utf8_encoded = unicode_string.encode('utf8')
print(utf8_encoded) # Output: b'\xc2\xa310'
# Encode using UTF-16
utf16_encoded = unicode_string.encode('utf16')
print(utf16_encoded)  # Output on a little-endian machine: b'\xff\xfe\xa3\x001\x000\x00' (\xff\xfe is the byte-order mark)
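Encoded bytes are only useful if they can be decoded again with the same codec. The following sketch, assuming Python 3, shows a round trip and the explicit byte-order variants of UTF-16, which write no byte-order mark; the variable names are illustrative.

```python
text = "\u00a310"  # "£10"

# Decoding with the same codec recovers the original text.
utf8_bytes = text.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
print(round_tripped == text)  # True

# 'utf-16-le' and 'utf-16-be' fix the byte order and omit the byte-order mark.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")
print(le_bytes)  # b'\xa3\x001\x000\x00'
print(be_bytes)  # b'\x00\xa3\x001\x000'

# Decoding with a mismatched codec raises an error here.
try:
    le_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```

Fixing the byte order explicitly gives the same bytes on every platform, which can matter when the data is compared or hashed.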
Writing and Reading Unicode Files
To handle files containing Unicode data, Python’s codecs module can be used to specify encoding:
import codecs
# Writing a Unicode string to a file with UTF-8 encoding
with codecs.open('example.txt', 'w', 'utf8') as f:
    f.write(u"Example text with € symbol")
# Reading from the same file
with codecs.open('example.txt', 'r', 'utf8') as f:
    content = f.read()
print(content)  # Output: Example text with € symbol
In Python 3, this functionality is built into the open function:
# Writing using open() in Python 3
with open('example.txt', 'w', encoding='utf8') as f:
    f.write("Example text with € symbol")
# Reading from the file
with open('example.txt', 'r', encoding='utf8') as f:
    content = f.read()
print(content)  # Output: Example text with € symbol
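The encoding passed when reading should match the one used when writing. The sketch below, assuming Python 3, writes to a temporary file (an assumption made so that nothing is left on disk) and shows that reading with a different encoding succeeds but yields the wrong characters.

```python
import os
import tempfile

text = "Example text with \u20ac symbol"  # contains the euro sign

# Use a temporary file so the sketch leaves nothing behind.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the matching encoding round-trips the text.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # Reading with a different encoding raises no error here,
    # but the euro sign comes back as three unrelated characters.
    with open(path, "r", encoding="latin-1") as f:
        mojibake = f.read()
finally:
    os.remove(path)

print(round_tripped == text)  # True
print(mojibake == text)       # False
```

Because a mismatch like this does not always raise an exception, passing the encoding explicitly on both sides is generally safer than relying on the platform default.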
Conclusion
Handling Unicode in Python is straightforward thanks to its built-in support for Unicode strings and flexible encoding/decoding capabilities. By understanding how to convert, encode, and store these strings properly, you can ensure your applications handle internationalization effectively.