Sitemaps URL Encode of Characters with Percentage Encoding
Quick Explanation of URL Encoding
Characters in URLs are usually URL encoded when:
- Character appears in a context where its usage is reserved. This can often be seen in GET parameter values.
- Character is not ASCII, i.e. it does not fit within 7 bits. In such cases, the character is converted to UTF-8, and every byte of the resulting sequence is then encoded into the URL.
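The two cases above can be illustrated with a minimal sketch using Python's standard `urllib.parse.quote`. The sample values ("fish & chips" and "é") are assumptions chosen only for illustration.

```python
from urllib.parse import quote

# Case 1: a reserved character ("&") inside a GET parameter value.
# Left unencoded, it would be read as a separator between parameters.
value = "fish & chips"
encoded_value = quote(value, safe="")
print(encoded_value)  # fish%20%26%20chips

# Case 2: a non-ASCII character is first converted to UTF-8 bytes,
# and then every byte is percent-encoded.
char = "é"
utf8_bytes = char.encode("utf-8")  # two bytes: 0xC3 0xA9
encoded_char = quote(char, safe="")
print(encoded_char)  # %C3%A9
```

Passing `safe=""` tells `quote` to encode every reserved character, which is usually what is wanted for a single parameter value.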
URL Encode Uses Hex Percentage Encoding for Characters
With URL encoding, each ASCII character, or each byte of a UTF-8 encoded character, is converted into hexadecimal notation.
In URLs, a hexadecimal value is written as % followed by two symbols, each being either in 0-9 or A-F.
- ASCII space character has byte value 32 which when URL encoded becomes %20:
- In decimal: 32 = 3*10 + 2*1.
- In hexadecimal: 20 = 2*16 + 0*1.
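The arithmetic above can be checked with a short sketch. The helper name `percent_encode_byte` is hypothetical and exists only for this illustration.

```python
def percent_encode_byte(byte_value: int) -> str:
    """Return the %XX form of a single byte value (0-255)."""
    # Split the byte into its two hexadecimal digits:
    # high = number of 16s, low = remainder.
    high, low = divmod(byte_value, 16)
    digits = "0123456789ABCDEF"
    return "%" + digits[high] + digits[low]

space = ord(" ")
print(space)                       # 32
print(percent_encode_byte(space))  # %20
```

For the space character, `divmod(32, 16)` gives 2 and 0, which are the two digits after the % sign.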
URL Encoding in Website and Page HTML Source
If you are unsure whether you are using URL encoding, perhaps even unnecessary URL encoding,
you should check the page source that your website outputs.
Most browsers support a view source option.
With link checker and sitemap tools such as A1 Sitemap Generator,
it is debatable whether links with illegal or non-standard URL encoding
should be ignored or converted before they are shown in the website scan results.
You can therefore use the following options to control whether URLs are
percent encoded during the website scan:
- Scan website | Crawler options | Ensure URL "path" component is percentage encoded.
- Scan website | Crawler options | Ensure URL "query" component is percentage encoded.
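As a rough illustration of what ensuring that the "path" and "query" components are percent encoded can mean, here is a minimal sketch. It is not how A1 Sitemap Generator implements these options; the function name `ensure_percent_encoded`, the sets of characters left untouched, and the example URL are all assumptions.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def ensure_percent_encoded(url: str) -> str:
    """Percent-encode the path and query of a URL without double-encoding."""
    parts = urlsplit(url)
    # "%" is listed as safe so existing %XX sequences are left alone.
    # Assumption: any "%" already in the URL starts a valid %XX sequence.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=")
    query = quote(parts.query, safe="/?%:@!$&'()*+,;=")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(ensure_percent_encoded("http://example.com/my page/é.html?q=a b"))
# http://example.com/my%20page/%C3%A9.html?q=a%20b
```

A URL that is already encoded, such as `http://example.com/a%20b`, passes through unchanged because "%" is treated as safe.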
If you are fixing linking errors in your website,
remember that you can see information about all internal links and redirects.
URL Encoding in XML Sitemaps and Webserver
If you have URLs that need to be URL encoded, it is an error not to encode them.
Some search engines, web crawlers, browsers, servers etc. are able to correctly understand
URLs that are not properly encoded, but it is always safer to have your URLs properly URL encoded / URL escaped.
Quote from the official sitemaps protocol website:
In addition, all URLs (including the URL of your Sitemap) must be URL-escaped and encoded for readability by the web server on which they are located.
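A minimal sketch of writing one URL entry for an XML sitemap could look as follows. The function name `sitemap_entry` and the example URL are hypothetical. The sketch also assumes that, because the sitemap is an XML file, characters such as & in the finished URL must be XML entity escaped after the URL itself has been percent encoded.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

def sitemap_entry(base: str, page: str, query: str = "") -> str:
    """Build a <url><loc> element for one page (illustrative only)."""
    # Step 1: percent-encode the page path (non-ASCII becomes UTF-8 bytes).
    url = base + quote(page, safe="/")
    # Assumption: the query string passed in is already percent encoded.
    if query:
        url += "?" + query
    # Step 2: XML entity escape the finished URL (& becomes &amp;).
    return "<url><loc>" + escape(url) + "</loc></url>"

print(sitemap_entry("http://example.com/", "ümlat.html", "a=1&b=2"))
# <url><loc>http://example.com/%C3%BCmlat.html?a=1&amp;b=2</loc></url>
```

The two steps are kept separate on purpose: percent encoding makes the URL valid for the web server, while entity escaping makes it valid inside the XML document.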
Further Reading About URL Character Encoding
Before you start reading:
- Rules for URL encoding vary depending on the place and context in the URL.
- There are a few inconsistencies in the RFC standards due to updates and revisions.
Resources about percent encoding in URLs:
- RFC 1738 - Uniform Resource Locators (URL). RFC 1738 is from December 1994.
- RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax. RFC 2396 is from August 1998.
- RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. RFC 3986 is from January 2005.
- Percent Encoding - Wikipedia about percent encoding / hexadecimal % URL encoding.