Unicode – Why Use It?

In this article, I’m going to answer a question many developers have before they begin to use Unicode: why should one bother to write code that supports Unicode, especially when not targeting any non-English market?  Before we dive in, I’d like to introduce character sets and their three varieties.  If you’re looking for a how-to guide on using Unicode, this article won’t help you; I suggest taking a look at my Unicode article instead.

Character Sets

Single-Byte Character Sets

In a single-byte character set, every character occupies exactly one byte.  A string is simply a series of bytes, one after another, and its end is indicated by a NULL character, which is a zero byte.  This is the simplest kind of character set.  Most of the well-known C runtime functions, like strlen, expect strings in this format.  But this kind of character set causes problems when you want to build a program for a language other than English.  Not only do you need a separate coding system (called a code page) for each language, but some languages (Japanese, with its Kanji, for example) need far more characters than the 256 values a single byte can hold.  This need led to the birth of double-byte character sets.
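To make the "one byte, one character" layout concrete, here is a minimal sketch of what a strlen-style routine does with such a string.  The name sbcs_length is made up for illustration; it is not a CRT function.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only: counts the characters of a
// single-byte string by walking to the terminating zero byte.
std::size_t sbcs_length(const char* s)
{
    std::size_t n = 0;
    while (s[n] != '\0')   // in an SBCS string, one byte is one character
        ++n;
    return n;
}
```

Because every character is one byte, the byte count and the character count are always the same number, which is exactly the assumption that breaks in the next section.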

Double-Byte Character Sets

Sometimes called Multi-Byte Character Sets (MBCS), double-byte character sets (DBCS) start out just like single-byte character sets.  The first 128 codes of each code page are the same as the ASCII character codes.  The other 128 codes are available for each language to define as whatever symbols it needs in writing.  So what’s the difference from single-byte character sets?  The difference is that in each code page, some codes (the lead bytes) specify that the byte following them should be interpreted together with them, forming a double-byte character.  This results in a code page in which some characters require one byte and others require two.  As with single-byte character sets, the end of the string is marked by a single NULL character.  This implementation is kind of painful.  Suppose you need to find out the length of a string.  You can’t pass such a string to strlen directly, because strlen assumes that each byte is a character.  You would need to write your own string-parsing replacement for each CRT function, one that checks whether a byte in the string specifies that the next byte should be interpreted together with it.  Fortunately, the MSVC CRT library has a set of functions, called the MBCS functions, that can handle both single-byte and double-byte character sets.  For example, the MBCS version of strlen is called _mbslen.

The Windows API offers some functions that let you manage MBCS strings to some extent.  These API functions are listed in the table below.
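The character-counting problem can be sketched in a few lines.  This is only an illustration under stated assumptions: the helper names is_lead_byte and dbcs_length are invented here, and the lead-byte ranges are hard-coded for the Japanese Shift-JIS code page (932), whereas real code would ask the system or use the CRT’s MBCS functions.

```cpp
#include <cstddef>

// Assumption: Shift-JIS (code page 932) lead-byte ranges, hard-coded for
// illustration. Production code would query the system instead, for
// example with the Windows IsDBCSLeadByteEx function.
bool is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: counts characters (not bytes) in a
// NULL-terminated DBCS string.
std::size_t dbcs_length(const char* s)
{
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte takes the following byte with it, unless the
        // string has been cut off right after the lead byte.
        if (is_lead_byte(b) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        ++count;
    }
    return count;
}
```

For the four-byte string "A\x83\x41" "B" (the middle two bytes form one Shift-JIS character), strlen reports 4 while dbcs_length reports 3.  That gap between bytes and characters is the pain point described above.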

LPTSTR CharNext( LPCTSTR pszCurrentChar ); Retrieves a pointer to the next character in a string.
LPSTR CharNextExA( WORD CodePage, LPCSTR pCurrentChar, DWORD ); Retrieves a pointer to the next character in a string.  This function can handle both SBCS and DBCS strings.
LPTSTR CharPrev( LPCTSTR pszStart, LPCTSTR pszCurrent ); Retrieves a pointer to the preceding character in a string.
LPSTR CharPrevExA( WORD CodePage, LPCSTR pStart, LPCSTR pCurrentChar, DWORD ); Retrieves a pointer to the preceding character in a string.  This function can handle both SBCS and DBCS strings.
BOOL IsDBCSLeadByte( BYTE TestChar ); Determines whether the specified byte is the first byte of a character in DBCS.
BOOL IsDBCSLeadByteEx( WORD CodePage, BYTE TestChar ); Determines whether the specified byte is the first byte of a character in DBCS.

Although these functions make DBCS programming a lot easier, DBCS is still more pain than anyone would want to take.  The solution to this mess is the Wide-Byte Character Set, or Unicode.

Wide-Byte Character Sets

Wide-byte character sets were born to eliminate the problems with DBCS.  What was the problem?  The characters in DBCS had different lengths.  What do wide-byte character sets offer to eliminate this problem?  They simply dictate that every character is made of two bytes.  (Strictly speaking, characters outside the first 65,536 Unicode code points take two such 16-bit units in the encoding Windows uses, but these are rare enough that the simple picture holds for most purposes.)  This makes working with strings a lot easier.  A string is a series of two-byte characters following each other, with two zero bytes at the end forming a NULL character, and hence the end of the string.  So, CharNext/CharPrev and the other MBCS-related APIs are no longer necessary.  Easy, eh?  But wait.  Like anything else, Unicode has a dark side too!  The dark side is that all your strings will take up twice the space of an SBCS string, and more space than an MBCS string.  This results in higher memory usage, as well as larger executables.  The answer to this problem (if one can call it a "problem") is that this extra space is necessary if you want international applications, and even if you only use English in your applications, you can simply *ignore* this price, for two reasons.  The second (!) reason is that these days, large-capacity hard disks and RAM are really cheap.  The first reason you’ll learn as you go on through this article.
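The size cost is easy to check at compile time.  In this small sketch, char16_t stands in for the two-byte wchar_t that Windows uses, so that the numbers come out the same on any compiler; the constant names are made up for the example.

```cpp
#include <cstddef>

// The same text stored with one byte per character and with two bytes
// per character. Both arrays include the terminating NULL character.
constexpr char     kNarrow[] = "Unicode";   // 7 letters + terminator
constexpr char16_t kWide[]   = u"Unicode";  // same text, 16-bit units

constexpr std::size_t narrow_bytes = sizeof(kNarrow);
constexpr std::size_t wide_bytes   = sizeof(kWide);

// Assumption: char16_t is exactly two bytes, as on mainstream platforms.
static_assert(wide_bytes == 2 * narrow_bytes,
              "the 16-bit string takes twice the space");
```

Here narrow_bytes is 8 and wide_bytes is 16, terminator included.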

Unicode Support on Different Platforms

Unicode and Windows 95/98

Windows 95/98 are not new operating systems; they are just a layer built upon the old 16-bit Windows.  Because Unicode was not considered in the design of the 16-bit Windows operating systems, Microsoft did not bring Unicode into Windows 95/98.  So, Windows 95/98 do all (well, almost all) of their internal work using ANSI strings.  Of course, they do offer some Unicode functions, which I’ve pointed out in my other article, but, as Jeffrey Richter states in his book, Programming Applications for Microsoft Windows, many of these functions are bogus.  So, if you’re only targeting Windows 95/98, you don’t need to think about Unicode at all.

Unicode and Windows NT/2000/XP

Windows NT was built using Unicode from the start.  All of the internal operating system functions do their jobs using Unicode strings.  And, it’s the same on both English and non-English versions of Windows NT.  Windows 2000, which was built on Windows NT, and also Windows XP, which is built upon Windows 2000, both inherit all the Unicode support.

So, it’s somewhat incorrect to say that Windows NT/2000/XP support Unicode.  Instead, one should say they support ANSI!  The ANSI support does not come free, of course.  All the ANSI functions in Windows NT/2000/XP are simply wrappers around the Unicode functions; their only job is to convert between ANSI and Unicode strings and to call the Unicode version.

This means that if an ANSI function which expects an ANSI string as input is called in Windows 2000 for example, the system internally copies the ANSI string to a Unicode buffer, and calls the Unicode version, passes the Unicode buffer to it, and returns the result of the Unicode function.  Also if the ANSI function has an ANSI buffer for output, the system internally allocates a Unicode buffer, passes it to the Unicode version which fills in that buffer with Unicode characters, and then converts that Unicode string to ANSI and copies it to the specified ANSI buffer that you pass to the function.  This introduces a lot of overhead which we’ll discuss later.

Unicode and Windows CE

Like Windows NT, Windows CE is natively Unicode.  This means that you can call the Unicode version of the APIs successfully on Windows CE.  But note that because Windows CE was designed to be as small as possible, it does not support ANSI at all.  So, if you are targeting Windows CE, you should always use Unicode; otherwise your application will not run.

Unicode and the Component Object Model (COM)

As a general rule, all COM interfaces expect strings to be Unicode.  This is true even on Windows 95/98.  So, if you are developing for Windows NT/2000/XP or Windows CE, you can easily use COM, and all your source code will be Unicode.  However, if you’re developing for Windows 95/98, this is where the pain starts!  You have to convert strings between Unicode and ANSI all the time.  The simplest way I know of to do this is to use the conversion macros I’ve discussed in my Unicode article.

To Unicode, or Not to Unicode – That’s the Question

My answer to the above question is simply "Use Unicode".  Of course, I mean you should use Unicode if you are developing for any platform other than Windows 9x.  You should also write generic source code that you can compile as either Unicode or ANSI, and then distribute the ANSI modules for Windows 9x and the Unicode modules for the rest of the operating systems.

I understand that unless you want to write a Windows CE application, you could stick to ANSI, but in a minute I’ll show you something that is reason enough not to use ANSI on Windows NT/2000/XP.  It is all about efficiency.
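The idea behind generic source code can be sketched as follows.  On Windows you would normally get this from the TCHAR type and the TEXT() family of macros; the names my_tchar, MY_TEXT, MY_UNICODE and my_tcslen below are hypothetical stand-ins, defined locally so that the sketch does not depend on any Windows header.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the TCHAR / TEXT() idea. Defining
// MY_UNICODE at compile time switches the whole file to wide characters.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_TEXT(s) L##s
#else
typedef char my_tchar;
#define MY_TEXT(s) s
#endif

// Written once against my_tchar; compiles unchanged for either build.
std::size_t my_tcslen(const my_tchar* s)
{
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

// The literal becomes "Hello" or L"Hello" depending on the build.
const my_tchar* const kGreeting = MY_TEXT("Hello");
```

The same source file then produces an ANSI module or a Unicode module depending on a single compiler switch, which is what makes shipping both flavors practical.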

Inside NTDLL.DLL, which is the core of Windows NT, you can find a lot of API functions that together form the ANSI support on Windows NT.  Some of these functions are RtlInitAnsiString, RtlInitUnicodeString, RtlAnsiStringToUnicodeString, RtlUnicodeStringToAnsiString, RtlUnicodeToMultiByteN, RtlMultiByteToUnicodeN, Basep8BitStringToUnicodeString, BasepUnicodeStringTo8BitString, etc.  I won’t go into describing these functions; you can find a good description of them in the December 1997 "Under the Hood" section.

Now let’s see how the ANSI support is implemented on Windows NT.  Each thread is associated with a Unicode buffer.  The ANSI functions that take ANSI strings as input need to convert them into a Unicode buffer.  Some of these functions convert the ANSI string into the thread’s Unicode buffer, and others allocate a Unicode buffer using HeapAlloc.  The latter group has to free the buffer using HeapFree once they are done with it.
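The input path just described can be modeled with a portable sketch.  Nothing here is a real Windows function: CountSpacesW and CountSpacesA are invented names that mimic the W/A naming pattern, the wrapper models the variant that allocates its own buffer, and the conversion assumes plain 7-bit ASCII input, whereas the real system converts through the active code page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical "native" function that only understands 16-bit strings,
// standing in for a W-suffixed API.
std::size_t CountSpacesW(const std::u16string& text)
{
    std::size_t n = 0;
    for (char16_t ch : text)
        if (ch == u' ')
            ++n;
    return n;
}

// Hypothetical ANSI-style wrapper: allocate, convert, then forward.
// Assumption: the input is 7-bit ASCII, so widening each byte is a
// valid conversion.
std::size_t CountSpacesA(const std::string& text)
{
    std::u16string wide;                  // extra buffer (heap allocation)
    wide.reserve(text.size());
    for (unsigned char ch : text)         // extra copy and conversion
        wide.push_back(static_cast<char16_t>(ch));
    return CountSpacesW(wide);            // the real work happens here
}                                         // buffer is freed on return
```

Calling CountSpacesA("a b c") gives the same answer as CountSpacesW(u"a b c"), but only after an allocation, a conversion pass and a deallocation that the wide caller never pays for.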

The ANSI functions that produce an output string work in a similar way.  They call the Unicode version, passing in either the thread’s Unicode buffer or a memory buffer they allocate themselves using HeapAlloc, and then convert the resulting string to ANSI and copy it to the buffer that the caller specified.  If they’ve allocated some memory, they have to free it using HeapFree.
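The output direction can be sketched the same way.  Again, this is a portable model rather than actual system code: GetNameW and GetNameA are hypothetical functions, the returned text is a fixed sample value, and the narrowing step assumes ASCII-only content.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wide function: writes a fixed sample name into the
// caller's buffer. Returns the number of characters written, excluding
// the terminator, or 0 if the buffer is too small.
std::size_t GetNameW(char16_t* buffer, std::size_t capacity)
{
    static const char16_t kName[] = u"C:\\Work";
    const std::size_t len = sizeof(kName) / sizeof(kName[0]) - 1;
    if (capacity <= len)
        return 0;
    for (std::size_t i = 0; i <= len; ++i)   // copies the terminator too
        buffer[i] = kName[i];
    return len;
}

// Hypothetical ANSI-style wrapper: allocate a temporary wide buffer,
// let the wide function fill it, then convert back for the caller.
std::size_t GetNameA(char* buffer, std::size_t capacity)
{
    std::vector<char16_t> temp(capacity);            // extra allocation
    const std::size_t len = GetNameW(temp.data(), temp.size());
    if (len == 0)
        return 0;
    for (std::size_t i = 0; i <= len; ++i)           // extra conversion pass
        buffer[i] = static_cast<char>(temp[i]);      // assumption: ASCII only
    return len;
}                                                    // temp is freed here
```

Every call to GetNameA does everything GetNameW does, plus an allocation, a second pass over the characters and a deallocation.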

So, let’s summarize.  The ANSI functions need to supply or allocate a memory buffer for the Unicode functions (overhead), convert the input ANSI strings to Unicode and the output Unicode strings to ANSI (overhead), copy the ANSI strings to Unicode buffers and vice versa (overhead), and most of the time free the allocated memory (overhead).  So, using the ANSI API does not add anything but overhead to your application, no?!  This is the first reason to ignore the dark side of Unicode that I promised earlier.  If you care about efficiency even a bit, you’ll never ever use ANSI on Windows NT/2000/XP again!

Ok, now you know about all the overhead this can cause, but, hey, how much is that overhead?  It’s quite a bit!  To prove it, use the following benchmarking application that is adapted from the "Under the Hood" article pointed to above.  When I ran it on my system, I got some results you might be interested to know!

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>

#define BUFFER_SIZE MAX_PATH
#define ITERATION 1000000

int main(int , char* [])
{
    LARGE_INTEGER before, after, freq;
    double dAnsi, dUnicode;
    char szAnsi[ BUFFER_SIZE ];
    wchar_t wszUnicode[ BUFFER_SIZE ];
    unsigned long ul = 0;

    // Ticks per second of the high-resolution counter.
    ::QueryPerformanceFrequency( &freq );

    // Reduce interference from other threads while timing.
    ::SetThreadPriority( ::GetCurrentThread(),
        THREAD_PRIORITY_TIME_CRITICAL );

    // Input string parameter, ANSI version.
    ::Sleep( 0 );
    ::QueryPerformanceCounter( &before );

    while (ul ++ < ITERATION)
    {
        HMODULE hMod = ::GetModuleHandleA( "kernel32.dll" );
    }

    ::QueryPerformanceCounter( &after );

    // Elapsed time in seconds.
    dAnsi = (double) (after.QuadPart - before.QuadPart);
    dAnsi /= freq.QuadPart;
    ul = 0;

    // Input string parameter, Unicode version.
    ::Sleep( 0 );
    ::QueryPerformanceCounter( &before );

    while (ul ++ < ITERATION)
    {
        HMODULE hMod = ::GetModuleHandleW( L"kernel32.dll" );
    }

    ::QueryPerformanceCounter( &after );

    dUnicode = (double) (after.QuadPart - before.QuadPart);
    dUnicode /= freq.QuadPart;
    ul = 0;

    ::printf( "GetModuleHandle:\n\tANSI version:\t\t%lf"
        "\n\tUnicode version:\t%lf\n\n\n", dAnsi, dUnicode );

    // Output string parameter, ANSI version.
    ::Sleep( 0 );
    ::QueryPerformanceCounter( &before );

    while (ul ++ < ITERATION)
    {
        ::GetCurrentDirectoryA( sizeof(szAnsi) / sizeof(szAnsi[ 0 ]),
            szAnsi );
    }

    ::QueryPerformanceCounter( &after );

    dAnsi = (double) (after.QuadPart - before.QuadPart);
    dAnsi /= freq.QuadPart;
    ul = 0;

    // Output string parameter, Unicode version.
    ::Sleep( 0 );
    ::QueryPerformanceCounter( &before );

    while (ul ++ < ITERATION)
    {
        ::GetCurrentDirectoryW( sizeof(wszUnicode) / sizeof(wszUnicode[ 0 ]),
            wszUnicode );
    }

    ::QueryPerformanceCounter( &after );

    dUnicode = (double) (after.QuadPart - before.QuadPart);
    dUnicode /= freq.QuadPart;

    ::printf( "GetCurrentDirectory:\n\tANSI version:\t\t%lf"
        "\n\tUnicode version:\t%lf\n\n\n", dAnsi, dUnicode );

    return 0;
}

And here are the results I got when I ran this program on my machine.  BTW my machine is a Windows XP Professional system, with an AMD 1GHz CPU, and 192MB of RAM:

H:\Projects\VC++\test\UniBenchmark\Release>UniBenchmark
GetModuleHandle:
        ANSI version:           1.500179
        Unicode version:        1.283605


GetCurrentDirectory:
        ANSI version:           0.497415
        Unicode version:        0.201535



H:\Projects\VC++\test\UniBenchmark\Release>UniBenchmark
GetModuleHandle:
        ANSI version:           1.507318
        Unicode version:        1.290783


GetCurrentDirectory:
        ANSI version:           0.498030
        Unicode version:        0.201641



H:\Projects\VC++\test\UniBenchmark\Release>UniBenchmark
GetModuleHandle:
        ANSI version:           1.496500
        Unicode version:        1.281841


GetCurrentDirectory:
        ANSI version:           0.497670
        Unicode version:        0.202688



H:\Projects\VC++\test\UniBenchmark\Release>UniBenchmark
GetModuleHandle:
        ANSI version:           1.498664
        Unicode version:        1.282629


GetCurrentDirectory:
        ANSI version:           0.496627
        Unicode version:        0.201643

The results are approximately the same each time.  On average, the ANSI version of GetModuleHandle, which has an input string parameter, takes 16.81 percent more time to complete than the Unicode version, and the ANSI version of GetCurrentDirectory, which has an output string parameter, takes 146.41 percent more time to complete than the Unicode version, which is well above two times the Unicode version!  While this may not be much of an issue in ordinary, small utility applications, it’s quite important in heavy applications that need to run fast, for example a server application or a graphics package.
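For anyone who wants to check the arithmetic, the percentages come from averaging the four runs of each function and comparing the averages.  The sketch below reproduces that calculation from the timings printed above; the helper names average and percent_slower are made up for the example.

```cpp
#include <cstddef>

// Arithmetic mean of n timings.
double average(const double* values, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += values[i];
    return sum / static_cast<double>(n);
}

// Percentage by which the slower time exceeds the faster time.
double percent_slower(double slower, double faster)
{
    return (slower / faster - 1.0) * 100.0;
}

// Timings in seconds, copied from the four runs shown above.
const double kModuleAnsi[]    = { 1.500179, 1.507318, 1.496500, 1.498664 };
const double kModuleUnicode[] = { 1.283605, 1.290783, 1.281841, 1.282629 };
const double kDirAnsi[]       = { 0.497415, 0.498030, 0.497670, 0.496627 };
const double kDirUnicode[]    = { 0.201535, 0.201641, 0.202688, 0.201643 };

// GetModuleHandle: about 16.81 percent.
const double kModuleOverhead =
    percent_slower(average(kModuleAnsi, 4), average(kModuleUnicode, 4));

// GetCurrentDirectory: about 146.41 percent.
const double kDirOverhead =
    percent_slower(average(kDirAnsi, 4), average(kDirUnicode, 4));
```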

So, the final word is that, on my machine, the ANSI functions took anywhere from about 17 percent to nearly 150 percent longer than their Unicode counterparts.  Also consider the fact that the ANSI versions need extra memory for the temporary Unicode buffers.  Now the choice is yours: you can forget about learning how to program in Unicode and have slower Windows NT applications, or you can learn it, have faster Windows NT applications, and distribute an ANSI version for Windows 9x if necessary.  Either way, always think twice before writing ANSI applications, at least for Windows NT!

This article originally appeared on BeginThread.com. It’s been republished here, and may contain modifications from the original article.
