Understanding URL Validation: Characters and Restrictions

When working with URLs, it’s essential to understand which characters are allowed and which can cause problems. In this tutorial, we’ll explore the rules governing URL validation, including the types of characters that can be used and those that must be encoded or avoided.

URL syntax is defined in RFC 3986, which specifies the general structure of a URI and the characters each part may contain. According to this specification, the following characters may appear literally in a URL, although most of the special characters act as delimiters and are valid only in particular positions:

Uppercase letters (A-Z)
Lowercase letters (a-z)
Digits (0-9)
Unreserved marks: -, ., _, ~
Reserved characters: :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ;, =, and the comma (,)

However, not every character is allowed in every part of a URL. Reserved characters delimit components (for example, ? starts the query and # starts the fragment), so when one of them appears as ordinary data it must be percent-encoded to avoid being read as a delimiter.
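The character classes described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name classify and the label strings are assumptions made for this example, while the character sets follow the unreserved, gen-delims, and sub-delims rules of RFC 3986.

```python
import string

# Character sets from RFC 3986, sections 2.2 and 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def classify(ch: str) -> str:
    """Return the RFC 3986 class of one character (hypothetical helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch == "%":
        # Only valid as the start of a percent-encoded triplet such as %20.
        return "percent"
    return "not allowed"


for ch in "a~/?= ":
    print(repr(ch), classify(ch))
```

A character reported as "not allowed" has to be percent-encoded before it can appear anywhere in a URL.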

Excluded Characters

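RFC 2396, the predecessor of RFC 3986, lists the following US-ASCII characters as excluded from URI syntax. They must not appear literally because they are unprintable, are commonly used to delimit URLs in surrounding text, or have a dedicated syntactic role: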

Control characters (US-ASCII coded characters 00-1F and 7F hexadecimal)
Space character (US-ASCII coded character 20 hexadecimal)
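Delimiters: <, >, #, %, "

The angle brackets and the double quote are excluded because they are often used to delimit a URI in surrounding text. The # character is excluded because it separates a URI from a fragment identifier, while the % character introduces a percent-encoded (escaped) character.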
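As a rough illustration of the exclusion rules, the following Python sketch scans a string and reports every excluded character with its position. The function name find_excluded is hypothetical, and the delimiter set includes the double quote, which RFC 2396 lists alongside <, >, #, and %.

```python
# Delimiters excluded by RFC 2396, section 2.4.3.
EXCLUDED_DELIMS = set('<>#%"')


def find_excluded(text: str) -> list:
    """Return (index, character) pairs for excluded characters (hypothetical helper)."""
    found = []
    for index, ch in enumerate(text):
        code = ord(ch)
        is_control = code <= 0x1F or code == 0x7F
        if is_control or ch == " " or ch in EXCLUDED_DELIMS:
            found.append((index, ch))
    return found


print(find_excluded("my file<1>.txt"))
```

Note that # and % are flagged here even though they are legitimate when used in their syntactic roles; a check like this is only meaningful for raw data that has not yet been encoded.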

Unwise Characters

RFC 2396 classifies some further characters as "unwise" because gateways and other transport agents are known to modify them in transit:

{, }, |, \, ^, [, ], `

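RFC 3986 no longer uses the "unwise" category. The square brackets [ and ] became reserved characters that enclose IPv6 address literals in the host, while the remaining characters are not part of the allowed character set and should be percent-encoded wherever they appear.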

Reserved Characters

Reserved characters have special meanings within a URI/URL and include:

;, /, ?, :, @, &, =, +, $, ,

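This is the reserved set from RFC 2396; RFC 3986 extends it with #, [, ], !, ', (, ), and *. A reserved character acts as a delimiter only in the components where the syntax gives it that role, such as ? before the query or & and = inside it. Because of these roles, a reserved character and its percent-encoded form are not interchangeable: encoding or decoding one can change how the URL is interpreted.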
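To see reserved characters acting as delimiters, the sketch below splits a URL with Python's standard urllib.parse module. The URL itself is a made-up example; the point is that :, /, @, ?, &, =, and # each mark a boundary between components.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical URL chosen to exercise several reserved characters.
url = "http://user@example.com:8080/a/b?x=1&y=2#top"
parts = urlsplit(url)

print(parts.scheme)    # text before "://"
print(parts.username)  # text before "@" in the authority
print(parts.hostname)  # host between "@" and ":"
print(parts.port)      # number after ":" in the authority
print(parts.path)      # segments separated by "/"
print(parts.query)     # text after "?" and before "#"
print(parts.fragment)  # text after "#"

# "&" separates query pairs and "=" separates each key from its value.
print(parse_qsl(parts.query))
```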

Encoding Characters

Characters that are not allowed directly in a URL must be percent-encoded: each byte is written as % followed by two hexadecimal digits (%hh). Non-ASCII characters are first converted to bytes, normally using UTF-8, and each resulting byte is encoded separately. This ensures that any character can be represented in a URL without causing syntax errors or confusion.

For example, if you need to include a space in a URL parameter, it should be encoded as %20.
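The mechanism can be sketched by hand in Python and compared with the standard library. The helper percent_encode is a hypothetical name for this illustration, and it assumes UTF-8 as the byte encoding; in practice urllib.parse.quote does the same job.

```python
import string
from urllib.parse import quote, unquote

# Unreserved characters (RFC 3986, section 2.3) never need encoding.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode(text: str) -> str:
    """Encode everything except unreserved characters (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Convert the character to UTF-8 bytes, then write each byte as %HH.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


encoded = percent_encode("hello world")
print(encoded)                         # hello%20world
print(quote("hello world", safe=""))   # the standard library gives the same result
print(unquote(encoded))                # hello world
```

Passing safe="" to quote makes it encode / as well, which is usually what is wanted for a single path segment or parameter value.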

Example: Encoding Unwise and Reserved Characters

Consider a URL like http://example.com/path?query=[value]. The [ and ] characters are classified as unwise by RFC 2396, and RFC 3986 allows them literally only around an IP literal in the host, so inside a query they must be encoded. To do so, replace each one with its percent-encoded equivalent:

[ becomes %5B
] becomes %5D

So, the encoded URL would be http://example.com/path?query=%5Bvalue%5D.
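The same result can be produced programmatically. The following is a minimal sketch using Python's urllib.parse; it assumes the value [value] is data supplied separately from the rest of the URL, which is the safest point at which to encode it.

```python
from urllib.parse import parse_qs, quote, urlencode

value = "[value]"

# Encode only the parameter value, then attach it to the fixed part of the URL.
url = "http://example.com/path?query=" + quote(value, safe="")
print(url)  # http://example.com/path?query=%5Bvalue%5D

# urlencode builds the whole query string and encodes keys and values.
query = urlencode({"query": value})
print(query)  # query=%5Bvalue%5D

# Decoding recovers the original value.
print(parse_qs(query))
```

Encoding the value before assembling the URL, rather than encoding the finished URL, keeps the structural ? and = characters intact.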

Conclusion

Understanding which characters are valid in a URL and how to properly encode or avoid problematic ones is crucial for working with web technologies. By following the guidelines outlined in RFC 3986 and being mindful of excluded, unwise, and reserved characters, developers can ensure their URLs are correctly formatted and functional across different platforms and applications.