URL Encode Characters with Percentage Encoding
Learn about URL encoding in sitemaps and what percentage encoding does. Understand why XML sitemap generators and search engines often convert characters in URLs to their URL encoded form.
Characters in URLs are usually URL encoded when:
- The character appears in a context where its usage is reserved. This can often be seen in GET parameter values.
- The character is not ASCII, i.e. it does not fit within 7 bits. In such cases, the character is converted to UTF-8, and every byte of the resulting UTF-8 sequence is then encoded into the URL.
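As a minimal sketch of the two cases above, the following uses `quote` from Python's standard `urllib.parse` module. The host `example.com` and the sample values are made up for illustration.

```python
from urllib.parse import quote

# Case 1: reserved characters inside a GET parameter value.
# "&" and "=" normally separate parameters, so they must be encoded
# when they are part of the value itself.
value = "rock&roll=fun"                      # hypothetical parameter value
encoded_value = quote(value, safe="")
print("https://example.com/search?q=" + encoded_value)
# https://example.com/search?q=rock%26roll%3Dfun

# Case 2: a non-ASCII character. quote() converts it to UTF-8 by default
# and then encodes each byte of the UTF-8 sequence.
word = "café"
print(quote(word, safe=""))
# caf%C3%A9
```

Passing `safe=""` asks `quote` to encode every reserved character, which suits a single parameter value; for a whole path you would normally keep `/` unencoded.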
With URL encoding, each ASCII character / each byte in each UTF-8 character is converted into hexadecimal notation. In URLs, a hexadecimal value is written as % followed by two symbols, each in the range 0-9 or A-F.
Examples:
- ASCII space character has byte value 32 which when URL encoded becomes %20:
- In decimal: 32 = 3*10 + 2*1.
- In hexadecimal: 20 = 2*16 + 0*1.
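The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; `percent_encode_bytes` is a hypothetical helper name and it encodes every byte unconditionally, which a real encoder would not do for unreserved characters.

```python
def percent_encode_bytes(data: bytes) -> str:
    """Write every byte as % followed by two uppercase hexadecimal digits."""
    return "".join(f"%{byte:02X}" for byte in data)

# The space character has byte value 32 in decimal, which is 20 in hexadecimal.
print(ord(" "))                                  # 32
print(int("20", 16))                             # 32
print(percent_encode_bytes(b" "))                # %20

# A non-ASCII character is first converted to UTF-8 (two bytes for "é"),
# and each byte is then encoded separately.
print(percent_encode_bytes("é".encode("utf-8"))) # %C3%A9
```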
If you are unsure whether you are using URL encoding, or perhaps even unnecessary URL encoding, check the output page source first. Most browsers support a view source option.
With link checker and sitemap tools such as TechSEO360, it is debatable whether links with illegal or non-standard URL encoding should be ignored or converted before they are shown in the website scan results. You can therefore use the following options to control whether URLs are percentage encoded during a website scan:
- Scan website | Crawler options | Ensure URL "path" component is percentage encoded.
- Scan website | Crawler options | Ensure URL "query" component is percentage encoded.
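To illustrate what ensuring that the "path" and "query" components are percentage encoded could mean in practice, here is a rough sketch using Python's standard library. It is not how TechSEO360 implements these options; the function name `ensure_percent_encoded` and the chosen sets of safe characters are assumptions made for this example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component (an assumption for this sketch).
# "%" is kept so that sequences that are already encoded are not encoded twice.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = "/?%:@!$&'()*+,;="

def ensure_percent_encoded(url: str) -> str:
    """Percent encode unsafe characters in the path and query of a URL."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

fixed = ensure_percent_encoded("https://example.com/my page/café?q=a b&x=1")
print(fixed)
# https://example.com/my%20page/caf%C3%A9?q=a%20b&x=1
```

Because `%` is treated as safe, running the function on an already encoded URL leaves it unchanged. The trade-off is that a literal `%` that was never meant as an escape is not encoded either.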
Note: If you are fixing linking errors in your website, remember you can see information about all internal links and redirects.
If you have URLs that need to be URL encoded, it is an error not to encode them. Some search engines, web crawlers, browsers, servers etc. are able to correctly interpret URLs that are not properly encoded, but it is always safer to have your URLs properly URL encoded / URL escaped with percentage encoding.
Quote from the official sitemaps protocol website:
In addition, all URLs (including the URL of your sitemap) must be URL-escaped and encoded for readability by the web server on which they are located.
Note: We have seen some tools that do not percent encode URLs with UTF-8 byte values, but instead erroneously use byte values from another document character set or data representation they use internally.
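The difference is easy to reproduce. The sketch below uses Latin-1 as an example of "another character set"; which character set a faulty tool actually uses is not known from this article.

```python
from urllib.parse import quote, unquote

text = "é"

correct = quote(text, encoding="utf-8")    # UTF-8 bytes C3 A9
wrong = quote(text, encoding="latin-1")    # single Latin-1 byte E9

print(correct)   # %C3%A9
print(wrong)     # %E9

# A receiver that expects UTF-8 can decode the first form but not the second.
# unquote() replaces the invalid byte with U+FFFD by default.
print(unquote(correct) == text)       # True
print(unquote(wrong) == "\ufffd")     # True
```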
Before you start reading:
- Rules for URL encoding vary depending on the place and context in the URL.
- There are a few inconsistencies in RFC standards due to updates and revisions.
Resources about percent encoding in URLs:
- RFC 1738 - Uniform Resource Locators (URL). RFC 1738 is from December 1994.
- RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax. RFC 2396 is from August 1998.
- RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. RFC 3986 is from January 2005.
- Percent Encoding - Wikipedia about percent encoding / hexadecimal % URL encoding.