Non-Latin characters in URLs - is it better to encode them or replace with their Latin "counterparts"?
We're implementing a blog for a site which supports six different languages and five of them have non-Latin characters in their alphabets. We are not sure whether we should have them encoded (that is what we're doing at the moment)
Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno.
or if we should replace them with their Latin "counterparts" (similar looking letters)
Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno.
I can't find a definitive answer as to which is better from an SEO perspective. Search engine optimization is very important for us. Which approach would you suggest?
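For reference, the two options in the question can be sketched in Python roughly as follows. This is only an illustration, not the asker's actual code: the function names `base_slug`, `encoded_slug` and `ascii_slug` are made up here, and the accent stripping relies on Unicode decomposition, so letters with no decomposition (for example Polish "ł" or German "ß") would simply be dropped unless a mapping table is added.

```python
import re
import unicodedata
from urllib.parse import quote, unquote

def base_slug(title: str) -> str:
    """Lowercase the title and join its words with hyphens (hypothetical helper)."""
    # NFC keeps accented letters as single code points so \w matches them.
    normalized = unicodedata.normalize("NFC", title).lower()
    return "-".join(re.findall(r"\w+", normalized))

def encoded_slug(title: str) -> str:
    """Option 1: keep the original letters, percent-encode them as UTF-8."""
    # quote() emits uppercase hex; lowercased only to match the form in the question.
    return quote(base_slug(title)).lower()

def ascii_slug(title: str) -> str:
    """Option 2: replace accented letters with their unaccented Latin base letters."""
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: anything still outside ASCII is dropped rather than mapped.
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")
    return base_slug(ascii_only)

title = "Létání s potravinami: Co je dovoleno?"
print(encoded_slug(title))           # l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno
print(unquote(encoded_slug(title)))  # létání-s-potravinami-co-je-dovoleno
print(ascii_slug(title))             # letani-s-potravinami-co-je-dovoleno
```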
Most of the time, search engines handle the Latin counterparts well, although results for, e.g., "létání" and "letani" sometimes differ slightly.
So, in terms of SEO, almost no harm is done: as long as your site has good content, good markup and all that other stuff, it won't suffer from having Latin URLs.
You don't always know what combination of system, browser and plugins your users have, so make URLs as easy as possible. Most websites stick to plain Latin characters in URLs, because non-Latin symbols can choke anything from the server through the browser to a plugin, and that can break the user's experience.
And I can't stress this enough: users before SEO!
"what's better from SEO perspective"
Who's your audience? Americans who think all those extra letters are a mistake?
Or folks who read (and search) for "non-ASCII" letters because those non-ASCII letters are part of their language?
SEO is a bad thing to chase. Complete, correct, consistent and usable is what you want to build first.
Well, I suggest you replace them with their Latin counterparts, because it's user friendly and your website will be accessible on every single computer (keyboards change from one computer to another, but all of them have Latin letters). From an SEO perspective, I don't think it's going to be a problem.
Pawel, first of all, you should decide whether you're going to optimize for the global Google (google.com) or the Polish one.
According to the URI specification, RFC 3986, only 7-bit ASCII characters are allowed, and the characters the specification reserves as delimiters must be percent-encoded when they are used as data. If you want to represent other characters directly, you should be using IRIs, RFC 3987. Keep in mind, however, that HTTP itself does not carry IRIs: an IRI has to be mapped to a URI, with its non-ASCII characters percent-encoded as UTF-8, before it goes on the wire.
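To illustrate that mapping, here is a minimal Python sketch that turns an IRI into an ASCII-only URI by percent-encoding the UTF-8 bytes of the path, query and fragment. The function name `iri_to_uri` and the `example.com` address are made up for the example, and it assumes the host name is already ASCII (internationalized domain names need separate handling that is not shown).

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters outside the host (hypothetical helper)."""
    parts = urlsplit(iri)
    # "%" stays in the safe set so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    # Assumption: parts.netloc is plain ASCII and is passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

uri = iri_to_uri("https://example.com/blog/létání-s-potravinami")
print(uri)           # https://example.com/blog/l%C3%A9t%C3%A1n%C3%AD-s-potravinami
print(unquote(uri))  # https://example.com/blog/létání-s-potravinami
```

This is essentially what browsers do behind the scenes: the address bar shows the readable form, while the request uses the escaped one.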
When in doubt RTFM.
Another issue is that there are Unicode code points whose glyphs look very much alike in most fonts, which is absolutely ideal for phishers. Stick to ASCII and the glyphs are visibly different when the characters are.
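A quick way to see the lookalike problem is to compare two strings that render almost identically. The Python sketch below is only an illustration: the helper name `non_ascii_report` and the sample strings are made up, and the check simply lists every character outside ASCII together with its Unicode name.

```python
import unicodedata

def non_ascii_report(text: str) -> list:
    """List (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

latin = "example"
spoof = "ex\u0430mple"  # third letter is Cyrillic, not Latin "a"

print(latin == spoof)           # False, although both look the same in most fonts
print(non_ascii_report(latin))  # []
print(non_ascii_report(spoof))  # [(2, 'а', 'CYRILLIC SMALL LETTER A')]
```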