It is currently possible to scan all single-language (and some multi-language) websites. This includes extracting titles, text, etc.
Currently A1 Sitemap Generator does the following when scanning:
- Encounters a website using some character set, usually UTF-8, UTF-16 or ISO 8859-1.
  The actual language of the website can be e.g. English, German, Arabic, Russian or Chinese.
- The website text is converted into the codepage that Windows is configured to use on the local computer. This requires:
  - The Windows codepage of the local computer contains the characters of the language(s) used in the website.
  - The website uses either UTF-8, UTF-16 (plus some others) or the same or a similar codepage as the local computer.
- All string scanning, processing etc. is handled in the local codepage, whether it is a single-byte or multibyte character set.
- When creating e.g. UTF-8 XML files, all text in local codepage is correctly converted into UTF-8.
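The conversion steps above can be sketched as follows. This is a minimal Python illustration and not the program's actual Delphi code: the function name `scan_page_text` is hypothetical, and windows-1252 and windows-1251 are only assumed examples of a local codepage.

```python
def scan_page_text(raw: bytes, site_charset: str, local_codepage: str) -> bytes:
    """Hypothetical sketch: website bytes -> local codepage -> UTF-8 output."""
    # Step 1: interpret the downloaded bytes using the website's character set.
    text = raw.decode(site_charset)
    # Step 2: convert into the local codepage. Characters the codepage
    # cannot represent are lost here and become "?".
    local_bytes = text.encode(local_codepage, errors="replace")
    # Step 3: all processing would happen on local_bytes. When writing e.g.
    # an XML sitemap, the local codepage text is converted into UTF-8.
    return local_bytes.decode(local_codepage).encode("utf-8")


german = "Größe".encode("utf-8")
russian = "Привет".encode("utf-8")

# German fits into windows-1252, so it survives the round trip.
print(scan_page_text(german, "utf-8", "cp1252").decode("utf-8"))   # Größe
# Russian does not fit into windows-1252 ...
print(scan_page_text(russian, "utf-8", "cp1252").decode("utf-8"))  # ??????
# ... but it does fit into windows-1251.
print(scan_page_text(russian, "utf-8", "cp1251").decode("utf-8"))  # Привет
```

The second call shows why the local codepage has to contain the language(s) of the website: anything it cannot represent is already lost before the UTF-8 output is written.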
If a website uses multiple languages, you can try one of the following solutions:
- Best:
  - Find and use a codepage that includes support for all languages in the website.
- Alternative:
  - Split the website scan into multiple projects, one for each language.
  - For each website scan project, make sure to use an appropriate codepage.
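One way to check whether a single codepage covers all languages in a website is to test sample text from the site against a few candidates. The sketch below is only an illustration in Python: the helper `find_covering_codepage` and the candidate list are assumptions, not part of A1 Sitemap Generator.

```python
from typing import Iterable, Optional

# Candidate codepages to try, in order of preference (an assumed list).
CANDIDATES = ["cp1252", "cp1250", "cp1251", "koi8-r"]


def find_covering_codepage(texts: Iterable[str],
                           candidates: Iterable[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that can encode every text, or None."""
    samples = list(texts)
    for codepage in candidates:
        try:
            for text in samples:
                text.encode(codepage)  # raises if a character is missing
        except UnicodeEncodeError:
            continue  # this codepage lacks a character, try the next one
        return codepage
    return None


print(find_covering_codepage(["Hello", "Grüße"]))    # cp1252
print(find_covering_codepage(["Hello", "Привет"]))   # cp1251
print(find_covering_codepage(["Grüße", "Привет"]))   # None
```

When the result is `None`, as for the German and Russian mix, no candidate covers the whole site, which is the case where splitting the scan into one project per language applies.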
There exist many codepages. Some are language-specific. Others cover several languages, usually related ones, e.g. those written in Latin or Cyrillic script.
- ISO 8859-1: Covers most Western European languages. This codepage, or a slightly different one, is what most "western" Windows computers default to.
- Code page windows-1252 charset: Very similar to ISO 8859-1.
- ISO 8859-2: Contains many central and eastern European languages.
- Code page windows-1250 charset: Quite similar to ISO 8859-2.
- Code page windows-1251 charset: Contains Cyrillic languages (e.g. Russian) and English.
- KOI8-R (KOI-8 family): Popular Russian codepage. Good for Russian/English documents.
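"Similar" codepages agree on many characters but not on all of them, and two codepages for the same language can use entirely different byte values. The following Python sketch makes this visible. The helper `encodes_identically` is hypothetical, and the codec names are the ones Python uses for the codepages listed above.

```python
def encodes_identically(text: str, codepage_a: str, codepage_b: str) -> bool:
    """True if both codepages produce the same bytes for the text."""
    try:
        return text.encode(codepage_a) == text.encode(codepage_b)
    except UnicodeEncodeError:
        # At least one of the codepages lacks a character in the text.
        return False


# ISO 8859-1 and windows-1252 agree on German text ...
print(encodes_identically("Grüße", "iso-8859-1", "cp1252"))   # True
# ... but only windows-1252 contains the euro sign.
print(encodes_identically("5 €", "iso-8859-1", "cp1252"))     # False
# ISO 8859-2 and windows-1250 place "Š" at different byte values.
print(encodes_identically("Škoda", "iso-8859-2", "cp1250"))   # False
# windows-1251 and KOI8-R both cover Russian, with different bytes.
print(encodes_identically("Привет", "cp1251", "koi8-r"))      # False
```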
We currently use Borland/CodeGear Delphi for native 32-bit Windows development.
While we have a long-term goal of making our source code target
multiple developer tools and operating system platforms,
Delphi currently remains our primary tool.
The current problem is that Delphi makes it hard to get
complete support for Unicode, even when using third-party solutions
(most of which are only concerned with user interface controls and not string processing as such).
A more serious obstacle is that Unicode solutions implemented in Delphi today risk having to be replaced and
reworked the day Delphi supports Unicode internally in its class libraries and native string data type.
We have written an almost complete file and string abstraction layer
that maps hundreds of function calls, both broad and specialized.
All functions currently use the Delphi native string
base type. This layer currently switches some of its mappings depending on what kind of codepage character set encoding is used:
single-byte or DBCS / MBCS.
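The idea of switching a mapping depending on the kind of codepage can be sketched as below. This is a simplified Python illustration and not the actual Delphi layer: all names are hypothetical, and the set of multibyte codepages is an assumed short list of Windows DBCS codepages.

```python
from typing import Callable

# Codepages treated as multibyte in this sketch (Windows DBCS codepages
# for Japanese, Simplified Chinese, Korean and Traditional Chinese).
MULTIBYTE_CODEPAGES = {"cp932", "cp936", "cp949", "cp950"}


def _char_count_single_byte(data: bytes, codepage: str) -> int:
    # One byte is always one character, so no decoding is needed.
    return len(data)


def _char_count_multibyte(data: bytes, codepage: str) -> int:
    # A character may take one or two bytes, so the data must be decoded.
    return len(data.decode(codepage))


def make_char_count(codepage: str) -> Callable[[bytes], int]:
    """Pick the mapping once, based on the codepage in use."""
    if codepage.lower() in MULTIBYTE_CODEPAGES:
        impl = _char_count_multibyte
    else:
        impl = _char_count_single_byte
    return lambda data: impl(data, codepage)


count_western = make_char_count("cp1252")
count_japanese = make_char_count("cp932")
print(count_western("Grüße".encode("cp1252")))   # 5
print(count_japanese("日本語".encode("cp932")))    # 3 (stored as 6 bytes)
```

Code that calls the function returned by `make_char_count` does not need to know which implementation is behind it, which is the property that lets a further mapping be added later without rewriting the call sites.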
When Borland/CodeGear has implemented its Unicode solution for native Windows development,
most likely based on either UTF-8 or UTF-16, we will expand
our string abstraction system to include this and get full internal
support for Unicode. The steps for this will be:
- Some rework and coding of our string abstraction layer. The amount of work will depend on the route Borland/CodeGear chooses, e.g. the need for conversion between Unicode types and the native Delphi "string" data type.
- Expand all string function mappings with Unicode version(s). The point is to avoid converting and rewriting the many thousands of places that call these string functions.
- Even with the above, many areas of code will need to be reviewed and changed slightly.
- Quality assurance, which means lots of testing.
We have been preparing for this for a long time. However, we hope you
understand that there is a large amount of work and quality assurance
associated with this. We are therefore waiting to see the path Borland/CodeGear
takes with Delphi before we commit the many man-hours to the final solution.
Currently, the roadmap for Delphi places complete Unicode support in Delphi "Tiburon" (codename), which is slated for the first half of 2008.
At some point before this, we will know the details of the path Borland/CodeGear has chosen for implementing Unicode support.
This means we can most likely implement the final steps of Unicode support sooner than the above date.