XML (Extensible Markup Language) is a markup language used to store and transport data. When working with XML, it’s essential to understand which characters are valid and how to handle invalid ones. In this tutorial, we’ll explore the rules for valid characters in XML and provide guidance on how to work with them.
Valid Characters in XML
The XML specification defines a set of valid characters that can be used in an XML document. According to the XML 1.0 specification, the following characters are allowed:
- Unicode characters in the range #x20-#xD7FF
- Unicode characters in the range #xE000-#xFFFD
- Unicode characters in the range #x10000-#x10FFFF
- The following control characters: #x9, #xA, and #xD
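The ranges listed above translate directly into a code-point check. The following is a minimal sketch in C#, not a definitive implementation; the names `Xml10Chars`, `IsValid`, and `FirstInvalidIndex` are hypothetical and chosen only for illustration. It works on code points rather than individual UTF-16 `char` values so that the #x10000-#x10FFFF range is tested correctly.

```csharp
using System;

public static class Xml10Chars
{
    // True if the code point falls in one of the XML 1.0 ranges listed above.
    public static bool IsValid(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the UTF-16 index of the first invalid character, or -1 if there is none.
    public static int FirstInvalidIndex(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes one code point above #xFFFF.
                if (!IsValid(char.ConvertToUtf32(c, text[i + 1]))) return i;
                i++; // skip the low surrogate that was just consumed
            }
            else if (!IsValid(c))
            {
                // Unpaired surrogates lie in #xD800-#xDFFF and are rejected here.
                return i;
            }
        }
        return -1;
    }
}
```

A caller could use `Xml10Chars.FirstInvalidIndex(input)` to report where a string first breaks the rules before deciding whether to reject or clean it.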
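XML 1.1 extends the allowed set to cover #x1-#xD7FF, so most control characters become legal. However, the control characters in #x1-#x1F (other than #x9, #xA, and #xD) and in #x7F-#x9F (other than #x85) may appear only as character references such as &#x1;, and some characters are still not allowed at all, including:
- NUL (#x00)
- FFFE (#xFFFE)
- FFFF (#xFFFF)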
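To make the XML 1.1 difference concrete, here is a small sketch in C# of the two relevant productions from the XML 1.1 specification, `Char` and `RestrictedChar`. The class and method names are hypothetical. A character that is allowed by `Char` but is also a `RestrictedChar` may appear only as a character reference (for example `&#x1;`), not literally.

```csharp
using System;

public static class Xml11Chars
{
    // XML 1.1 Char: everything except NUL, the surrogate block, #xFFFE, and #xFFFF.
    public static bool IsChar(int cp)
    {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1 RestrictedChar: legal only when written as a character reference.
    public static bool IsRestricted(int cp)
    {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }
}
```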
Characters that Need to be Escaped
Some characters have special meanings in XML and need to be escaped using entity references. These characters include:
- < (less-than sign): must be escaped with &lt;
- > (greater-than sign): should be escaped with &gt;, although it's mandatory only when it appears in the sequence ]]> in ordinary content
- & (ampersand): must be escaped with &amp;
- ' (apostrophe): should be escaped with &apos;, especially in attribute values delimited by single quotes
- " (quotation mark): should be escaped with &quot;, especially in attribute values delimited by double quotes
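The escaping rules above can be sketched as a small helper. This is an illustrative C# example only, with a hypothetical class name `XmlEscaper`; it escapes all five characters unconditionally, which is safe in both element content and attribute values. It does not remove or repair characters that are invalid in XML, since escaping and validity are separate concerns.

```csharp
using System;
using System.Text;

public static class XmlEscaper
{
    // Replaces the five special characters with their predefined entity references.
    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

For example, `XmlEscaper.Escape("a < b && c")` returns `a &lt; b &amp;&amp; c`.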
Best Practices for Working with XML Characters
When working with XML, it’s essential to follow best practices to ensure that your documents are well-formed and valid. Here are some tips:
- Use a tool or library that writes XML for you and abstracts away character escaping.
- Always escape special characters using entity references.
- Avoid using control characters in your XML documents, as they can cause issues with parsing and processing.
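As an illustration of the first tip, the sketch below lets LINQ to XML (`System.Xml.Linq`) do the escaping. The element name `note`, the attribute name `title`, and the `LibraryEscapingDemo` class are made up for this example. The caller passes raw text and the library decides how to escape it when serializing.

```csharp
using System;
using System.Xml.Linq;

public static class LibraryEscapingDemo
{
    // Builds <note title="...">...</note> from unescaped input strings.
    public static string BuildNote(string title, string body)
    {
        var note = new XElement("note", new XAttribute("title", title), body);
        // Serialize without indentation; special characters are escaped by the library.
        return note.ToString(SaveOptions.DisableFormatting);
    }
}
```

Parsing the result back with `XElement.Parse` should yield the original, unescaped strings, which is a convenient way to check that nothing was double-escaped by hand along the way.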
Example Code
Here’s an example of how to clean invalid XML characters from a string in C#:
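using System.Text.RegularExpressions;

public static string CleanInvalidXmlChars(string text)
{
    // Valid characters per the XML 1.0 spec:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    // .NET regexes match UTF-16 code units and \u takes exactly four hex digits,
    // so #x10000-#x10FFFF cannot be written as \u10000-\u10FFFF. Instead, remove
    // forbidden BMP characters and unpaired surrogates, and keep valid pairs.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}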
This code uses a regular expression to remove any invalid characters from the input string.
Conclusion
In conclusion, working with valid characters in XML is essential for creating well-formed and valid documents. By understanding which characters are allowed and how to escape special characters, you can ensure that your XML documents are parsed and processed correctly. Remember to follow best practices and use tools or libraries that abstract away character escaping to make your life easier.