Unicode

If you have developed applications for Windows, you have probably heard about Unicode before.  What is Unicode, and how do you use it when writing a program?  These are the questions I try to answer in this article.

What is Unicode?

Unicode is a standard for a character-coding system.  As Windows uses it, each Unicode character is stored in a two-byte (16-bit) unit, instead of the usual one-byte ANSI character.  So, a single unit can hold 65,536 different values.  This is room enough for the characters of most of the current languages of the world, plus a wide free range that can be used for symbols; characters outside this range are encoded as a pair of 16-bit units (a surrogate pair).  The Platform SDK documentation for Unicode can be found at Platform SDK\Base Services\International Features\Unicode and Character Sets.

Unicode and C/C++

To use Unicode in C/C++, you should learn about wide characters.  With Microsoft’s compiler, a wide character is two bytes wide, so wide characters are a good candidate for writing Unicode applications in C/C++.  As you know, the data type for a simple character in C is the char type.  The data type for a wide character is wchar_t.  wchar_t is defined like this inside stdlib.h:

typedef unsigned short wchar_t;

You can use wchar_t instead of char anywhere.  For example, to declare a wide character string, you can write code like this:

wchar_t * pwszMyString = "This is my wide character string";

Well, to correct myself, you can use wchar_t instead of char *almost* anywhere.  What does the above code try to do?  It tries to assign a string literal (which is a pointer to a normal character, or a char *) to a pointer to type wchar_t, or unsigned short.  What will that cause?  A compile error.  So, what should you do?  Should you just make an array of unsigned shorts?  The answer is no.  Microsoft’s C/C++ compiler that comes with Visual C++, like any standard C/C++ compiler, helps you quite a bit.  If you prefix a string literal with an L, the compiler considers that string to be a wide character string instead of a normal string.  So, to correct the above code, all you need to do is put an L right before the opening quotation mark (note that there shouldn’t be any space between the L and the quotation mark):

wchar_t * pwszMyString = L"This is my wide character string";

One thing to note:  the following two lines of code have the same effect, because when a normal character literal is assigned to a wide character, the compiler automatically widens it by adding a byte containing zero.

wchar_t wc1 = L'A';
wchar_t wc2 = 'A';
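
To try the L prefix outside of Windows, here is a minimal sketch in portable C.  The array and function names are made up for this illustration.  Note that sizeof(wchar_t) is 2 with Microsoft’s compiler but can be larger with other compilers, so the sketch counts elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical sample data: the same text as a normal and a wide literal. */
static const char    kNarrowText[] = "Wide";
static const wchar_t kWideText[]   = L"Wide";

/* Number of elements in each array, including the terminating zero. */
size_t narrow_element_count(void) { return sizeof(kNarrowText) / sizeof(kNarrowText[0]); }
size_t wide_element_count(void)   { return sizeof(kWideText) / sizeof(kWideText[0]); }

/* Checks the claims made in the text; aborts via assert if one is false. */
void check_wide_literals(void)
{
    wchar_t wc1 = L'A';
    wchar_t wc2 = 'A';   /* a normal literal, widened by the compiler */

    assert(wc1 == wc2);
    assert(wide_element_count() == narrow_element_count());
    assert(sizeof(kWideText) == wide_element_count() * sizeof(wchar_t));
}
```

Both arrays hold the same number of characters; only the size of each element differs.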

Unicode and the Standard C/C++ Library

The standard C/C++ library does support wide characters.  Suppose you want to calculate the length of a wide character string.  If you just call strlen on it, you won’t get the desired result, because strlen scans the string char by char, which is byte by byte, not character by character.  Do you see the difference?  A character could be one byte (char) or two bytes (wchar_t), but the char data type is always one byte, whether the application is Unicode or not.

For every C/C++ library function that deals with normal character strings, there is an equivalent for wide characters.  For functions which have "str" in their names, the wide character version has the same name with "wcs" instead of "str", like wcslen, wcschr, wcsstr, wcscmp, wcscpy, wcsrchr, wcscat, wcscoll, wcsncpy, wcscspn, wcsicmp, wcsdup, wcsftime, wcslwr, wcsupr, etc.  For other functions, there are also wide character versions, for example, swprintf for sprintf, _wfopen for fopen, _getws for gets, etc.

You use these "wcs" functions just as you use the normal character functions.  You only need to pass wchar_t’s to them, and expect to get back wchar_t’s.  Just to demonstrate:

size_t length = wcslen( L"My String" );
// length will be 9
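
A slightly longer sketch may help show that the "wcs" functions combine just like their "str" counterparts.  The function build_greeting and its parameters are hypothetical names chosen for this example; only wcslen, wcscpy and wcscat come from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Builds "Hello, <name>" in out.  capacity is the size of out in
   wchar_t elements (not bytes).  Returns the length of the result in
   characters, or 0 if the text does not fit. */
size_t build_greeting(wchar_t *out, size_t capacity, const wchar_t *name)
{
    static const wchar_t prefix[] = L"Hello, ";

    assert(out != NULL && name != NULL);

    /* +1 for the terminating zero character */
    if (wcslen(prefix) + wcslen(name) + 1 > capacity)
        return 0;

    wcscpy(out, prefix);   /* wide version of strcpy */
    wcscat(out, name);     /* wide version of strcat */
    return wcslen(out);    /* wide version of strlen */
}
```

Called with a 32-element buffer and L"World", it fills the buffer with L"Hello, World" and returns 12.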

Unicode and the Windows API

Windows 95 and 98 were built upon Windows 3.1, and they don’t support Unicode.  All applications written for Windows 98 should be ANSI applications.  Microsoft has considered Unicode from the beginning of the NT project, so Windows NT (and its successors, 2000 and XP) support Unicode from the ground up.  So, applications that are to be run on NT systems should be Unicode.

So, how come there are applications that work on both 98 and NT systems?  The reason is that Windows NT provides a translation layer between Unicode and ANSI.  Whenever an application calls an ANSI function, the system allocates a buffer for the strings used in that function, converts the ANSI strings to Unicode strings, and calls the Unicode version of the function with the temporary Unicode buffers as parameters.  After that call returns, the ANSI function is responsible for freeing the buffers it allocated.  This translation layer is completely invisible to the application, and hence to the application developer.  But the translation involves some overhead for allocating buffers and converting ANSI text to Unicode and vice versa.  So, if an application is targeted only at NT platforms, the Unicode version should run faster than the ANSI version.
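
The allocate, convert, call, free pattern of such a translation layer can be sketched in portable C.  This is only an illustration of the idea, not how Windows implements it:  CountSpacesA and CountSpacesW are hypothetical functions, and the standard mbstowcs function stands in for the system’s own conversion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "W" function: the real work is done on wide strings. */
size_t CountSpacesW(const wchar_t *text)
{
    size_t count = 0;

    assert(text != NULL);
    for (; *text != L'\0'; ++text)
        if (*text == L' ')
            ++count;
    return count;
}

/* Hypothetical "A" function: a thin wrapper that converts and forwards.
   Returns 0 if the conversion or the allocation fails. */
size_t CountSpacesA(const char *text)
{
    assert(text != NULL);

    /* A wide string never needs more elements than the narrow string
       has bytes, so strlen + 1 elements are always enough. */
    size_t capacity = strlen(text) + 1;
    wchar_t *wide = malloc(capacity * sizeof(wchar_t));
    if (wide == NULL)
        return 0;

    size_t result = 0;
    if (mbstowcs(wide, text, capacity) != (size_t)-1)
        result = CountSpacesW(wide);   /* call the wide version */

    free(wide);                        /* the wrapper frees its own buffer */
    return result;
}
```

The caller of CountSpacesA never sees the temporary buffer, but it pays for the allocation and the conversion on every call.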

Now, let’s take a look at the Unicode system at the API level.  Consider the function CreateFile.  Microsoft documents this function to be located in kernel32.dll.  So, now open up the DEPENDS.EXE tool, and open kernel32.dll.  Surprisingly, you won’t find any exported symbol named CreateFile.  So, is Microsoft lying?!  Not really.  If you look again, you’ll see two other functions:  CreateFileA, and CreateFileW.  The A stands for ANSI, and the W stands for wide character.  CreateFileA is the ANSI version of CreateFile, and CreateFileW is the wide character version of CreateFile.  CreateFile is actually only a preprocessor macro, defined like this:

#ifdef UNICODE
#define CreateFile CreateFileW
#else
#define CreateFile CreateFileA
#endif

The UNICODE symbol, as well as the _UNICODE symbol, indicates whether you want to build a Unicode application or not.  If you define UNICODE and _UNICODE before including the header files, your application will be Unicode.  If you don’t define them, which is the default mode, you’ll have an ANSI application.  So, what is a Unicode/ANSI application?  A Unicode application is an application that calls the wide character versions of API functions and uses wide character strings internally, and an ANSI application is an application that calls the ANSI versions of API functions and uses ANSI strings internally.
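
You can imitate this name-mapping trick in your own code.  The following is a small sketch, assuming a made-up pair of functions TextLengthA and TextLengthW and a made-up symbol DEMO_UNICODE that plays the role of UNICODE; none of these names exist in the Windows headers.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical API that exists in two versions, like CreateFileA/CreateFileW. */
size_t TextLengthA(const char *text)    { assert(text != NULL); return strlen(text); }
size_t TextLengthW(const wchar_t *text) { assert(text != NULL); return wcslen(text); }

/* DEMO_UNICODE stands in for UNICODE; define it to select the W version. */
#ifdef DEMO_UNICODE
#define TextLength TextLengthW
#else
#define TextLength TextLengthA
#endif

/* The caller uses the generic name; the preprocessor picks the real function. */
size_t demo_length(void)
{
#ifdef DEMO_UNICODE
    return TextLength(L"abc");   /* expands to TextLengthW */
#else
    return TextLength("abc");    /* expands to TextLengthA */
#endif
}
```

The generic name never reaches the compiler or the linker, which is why a tool that lists exported symbols shows only the A and W names.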

There are two important points.  The first is that some API functions, like GetDC, don’t have anything to do with strings at all, so they have only one version that is usable by both ANSI and Unicode applications.  The second is that Windows 98 actually implements a handful of Unicode APIs, which are listed in the table below.  Other than these functions, the system DLLs for Windows 98 export symbols for the wide character functions as well, but they all return FALSE, and GetLastError then returns ERROR_CALL_NOT_IMPLEMENTED, which is defined as 120.

Unicode functions implemented on Windows 98

EnumResourceLanguagesW    GetTextExtentPoint32W
EnumResourceNamesW        GetTextExtentPointW
EnumResourceTypesW        lstrlenW
ExtTextOutW               MessageBoxExW
FindResourceW             MessageBoxW
FindResourceExW           TextOutW
GetCharWidthW             WideCharToMultiByte
GetCommandLineW           MultiByteToWideChar

I should mention one point here.  Some programmers think that because Windows 98 doesn’t support Unicode, they can’t use wcslen, for example, on Windows 98.  This is completely wrong!  wcslen is not a part of the Windows operating system, but a part of the standard ANSI C/C++ runtime library.  So, it works on Windows 98 as well.

So, these macros defined in the Windows header files enable you to write Unicode-compatible code, which you can compile in Unicode mode simply by defining two preprocessor symbols, UNICODE and _UNICODE.  (Actually, the UNICODE symbol is used in the Windows header files, and _UNICODE is used in the headers of the Microsoft C/C++ runtime library.)  But this is not enough.  How should you handle string literals, and the char/wchar_t data types?  One possible way is to write your code like this:

#ifdef UNICODE
MessageBox( NULL, L"My message", L"My caption", MB_OK );
#else
MessageBox( NULL, "My message", "My caption", MB_OK );
#endif

Although it works, it’s overkill.  You must have #ifdef’s all around your code, making it less manageable and readable.  A better solution is provided by the tchar.h header file.

The Generic Macros

The generic macros defined in tchar.h let you write a single body of source code that can be compiled for both Unicode and ANSI.  Tchar.h solves three problems:

  1. Data Types
    Tchar.h defines the data type TCHAR.  Here’s a rough definition of TCHAR:

    #ifdef _UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif

    So, the TCHAR data type automatically maps to the correct version of character data type according to whether the application is Unicode (the _UNICODE symbol is defined) or not.  This solves the problem of defining compatible data types for Unicode and ANSI applications.

  2. String Literals
    If you write code like this using the TCHAR data type, it won’t compile in Unicode mode.

    TCHAR szMyString[ ] = "My String";

    Why?  Because if you compile your application in Unicode mode, the compiler tries to initialize an array of wchar_t (unsigned short) from a string of char, and that’s not acceptable.  And if you write your code like below, you’ll again get compile errors, this time in ANSI mode.

    TCHAR szMyString[ ] = L"My String";

    If you compile the above code in ANSI mode, you’ll be trying to initialize an array of char from a string of wchar_t (unsigned short), which again is not acceptable.  Tchar.h offers a solution to this problem.  You must put every string literal inside a _T macro, like this:

    TCHAR szMyString[ ] = _T("My String");

    If you’re building ANSI, the _T macro expands to nothing, and it is just removed.  However, if you build for Unicode, the _T macro expands to a capital letter L, which is what identifies a wide character string.  This is just what we needed, no?!  Note that the _T macro is defined in other forms, like _TEXT, and TEXT (the latter is defined in the Windows header files), but because typing _T is the simplest form, I myself always use it instead of the other forms.

  3. Standard Library Functions
    As we saw before, the Windows API functions are mapped to the correct version by preprocessor macros.  For example, lstrlen, which is the API equivalent of strlen, actually comes in two forms, lstrlenA and lstrlenW, for normal and wide characters.  So, it’s completely OK to write code like this:

    TCHAR szMyString[ ] = _T("My String");
    int length = lstrlen( szMyString );

    But, if you decide to use strlen instead of lstrlen, boom!  You’ll get build problems all the time for Unicode mode.  Tchar.h helps you by defining the generic form of all the C/C++ runtime functions that have two versions for normal and wide character strings.  Usually for functions with "str" or "wcs" in their names, you can replace them by "_tcs" to get the generic form of functions, like _tcslen, _tcsstr, _tcsupr, and so on.  For other functions, there are similar generic forms as well.  For instance, the generic form of fopen and _wfopen is _tfopen.  You can find the generic forms of all these functions by looking at MSDN documentation for each function.
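
To get a feeling for how the three mechanisms fit together, here is a simplified imitation of them in portable C.  It is only a sketch:  DEMO_UNICODE, DEMO_TCHAR, DEMO_T, demo_tcslen and demo_tcscmp are made-up names that stand in for _UNICODE, TCHAR, _T, _tcslen and _tcscmp, and the real tchar.h is more elaborate than this.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Define DEMO_UNICODE (for example on the compiler command line) to build
   the wide character version; leave it undefined for the normal version. */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;          /* 1. data type            */
#define DEMO_T(s)   L##s             /* 2. string literals      */
#define demo_tcslen wcslen           /* 3. library functions    */
#define demo_tcscmp wcscmp
#else
typedef char DEMO_TCHAR;
#define DEMO_T(s)   s
#define demo_tcslen strlen
#define demo_tcscmp strcmp
#endif

/* Written once, compiles in both modes.  Returns nonzero for "yes". */
int is_yes(const DEMO_TCHAR *answer)
{
    assert(answer != NULL);
    return demo_tcscmp(answer, DEMO_T("yes")) == 0;
}

/* Length of a generic string, in characters. */
size_t yes_length(void)
{
    static const DEMO_TCHAR text[] = DEMO_T("yes");
    return demo_tcslen(text);
}
```

Neither function contains an #ifdef; the only conditional code is in the definitions at the top, which is exactly the job that tchar.h takes over for you.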

Writing Unicode-compatible Source Code

Writing Unicode-compatible source code has an advantage even if you don’t want to build for Unicode at this time.  It frees you from having two sets of source files, one for ANSI and one for Unicode, and it makes it possible to build a Unicode version at any time without changing even one line of the source code.  You only need to follow a set of rules.

  1. Use TCHAR instead of char.  This solves the problem of incompatibilities between char and wchar_t data types.
  2. Wrap all the string literals inside a _T macro.   This would let the compiler know whether to consider the string literals as normal or wide character.
  3. Use the generic form of Standard Library as well as API functions.  For example, use _tcslen or MessageBox instead of strlen or MessageBoxA.
  4. Never assume that sizeof( character ) == 1!
    Of course, whether your application is Unicode or ANSI, the size of the char data type is always 1, but the size of a character (which may be a normal or a wide character) varies.  Always use sizeof( TCHAR ).  For example, instead of writing code like this:

    //allocate a 100-character string
    TCHAR * pszString = (TCHAR *) malloc( 101 );

    write it like this:

    //allocate a 100-character string
    TCHAR * pszString = (TCHAR *) malloc( 101 * sizeof(TCHAR) );

    In the wide character version, each character is 2 bytes, so 202 bytes should be allocated for a 100-character string (of course, the last 2 bytes are for the terminating null character, 0x0000).

Using these guidelines, you can write source code that compiles fine for both ANSI and Unicode versions.
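
The allocation rule is worth a small portable sketch.  TCHAR is not available outside the Windows headers, so the example uses wchar_t directly; the helper names bytes_for_string and duplicate_wide are invented for this illustration.  Keep in mind that sizeof(wchar_t) is 2 with Microsoft’s compiler but may differ with other compilers, which is one more reason never to hard-code the size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Bytes needed for a string of the given length in characters,
   including the terminating zero character. */
size_t bytes_for_string(size_t characters)
{
    return (characters + 1) * sizeof(wchar_t);
}

/* Returns a heap copy of text, or NULL if the allocation fails.
   The caller must free the result. */
wchar_t *duplicate_wide(const wchar_t *text)
{
    assert(text != NULL);

    size_t bytes = bytes_for_string(wcslen(text));
    wchar_t *copy = malloc(bytes);
    if (copy != NULL)
        memcpy(copy, text, bytes);   /* copies the terminator as well */
    return copy;
}
```

With a 2-byte wchar_t, bytes_for_string(100) gives the 202 bytes mentioned above.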

It’s also useful to point out some other types defined in the Windows header files.  LPCTSTR, or PCTSTR, is a generic pointer to a constant string.  You may think of these types as being defined like this in winnt.h:

typedef char CHAR;
typedef wchar_t WCHAR;

typedef WCHAR * LPWSTR, * PWSTR;
typedef const WCHAR * LPCWSTR, * PCWSTR;

typedef CHAR * LPSTR, * PSTR;
typedef const CHAR * LPCSTR, * PCSTR;

#ifdef UNICODE
typedef LPWSTR LPTSTR, PTSTR;
typedef LPCWSTR LPCTSTR, PCTSTR;
#else
typedef LPSTR LPTSTR, PTSTR;
typedef LPCSTR LPCTSTR, PCTSTR;
#endif

These types are used extensively in the Windows header files.  So, the following lines of code are equivalent:

LPTSTR pszString = TEXT("My String");
TCHAR * pszString = _T("My String");

Which one to use is only a matter of preference.

Conversion Macros

The ATL library offers a set of conversion macros declared in atlconv.h.  These conversion macros simplify the task of converting between Unicode and ANSI strings.  Normally, you convert Unicode strings to ANSI strings using the WideCharToMultiByte function, and ANSI strings to Unicode strings using the MultiByteToWideChar function, but that takes a lot of work.  To use the conversion macros (see MFC technical note 59), you first invoke the USES_CONVERSION macro, which creates a number of stack variables, and then you use the macros as appropriate.  Here’s a list of all the macros, with the type each one converts from and its return type.

Conversion Macros

macro     from type   to type
A2W       PSTR        PWSTR
W2A       PWSTR       PSTR
A2CW      PSTR        PCWSTR
W2CA      PWSTR       PCSTR
T2COLE    PTSTR       PCOLESTR
OLE2CT    POLESTR     PCTSTR
T2OLE     PTSTR       POLESTR
OLE2T     POLESTR     PTSTR
A2OLE     PSTR        POLESTR
OLE2A     POLESTR     PSTR
W2OLE     PWSTR       POLESTR
OLE2W     POLESTR     PWSTR
A2COLE    PSTR        PCOLESTR
OLE2CA    POLESTR     PCSTR
W2COLE    PWSTR       PCOLESTR
OLE2CW    POLESTR     PCWSTR
T2A       PTSTR       PSTR
A2T       PSTR        PTSTR
T2W       PTSTR       PWSTR
W2T       PWSTR       PTSTR
T2CA      PTSTR       PCSTR
A2CT      PSTR        PCTSTR
T2CW      PTSTR       PCWSTR
W2CT      PWSTR       PCTSTR
A2BSTR    PSTR        BSTR
W2BSTR    PWSTR       BSTR
T2BSTR    PTSTR       BSTR

Note:  POLESTR and PCOLESTR are identical to PWSTR and PCWSTR in Win32 environments when the OLE2ANSI preprocessor macro is not defined.  For Win16, Macintosh, and Win32 with OLE2ANSI defined, POLESTR and PCOLESTR are identical to PSTR and PCSTR.

Here’s an example of using these conversion macros in action:

USES_CONVERSION;
TCHAR szFunctionName[ 100 ];
GetWindowText( hwndEditControl, szFunctionName, 100 );
HMODULE hMod = LoadLibrary( _T("mydll.dll") );
FARPROC pProc = GetProcAddress( hMod, T2CA( szFunctionName ) );
FreeLibrary( hMod );

Because the GetProcAddress function takes a PCSTR parameter and szFunctionName is a PTSTR, we have used the T2CA macro to convert the string to a constant ANSI string (in a Unicode build this is a real conversion from Unicode) and pass it to the GetProcAddress function.

You don’t need to memorize the whole table of conversion macros.  For a tip on how to remember them, consider the following points.

  • Each macro name consists of two parts, the "from" part and the "to" part, separated by a ‘2’:  from2to.  Sometimes the "to" part receives a ‘C’ prefix, which indicates that the return value is a constant pointer, as in A2T and A2CT.  The "from" part never receives a ‘C’ prefix.
  • Here’s a list of the letters acceptable for "from" and "to", as well as their respective types:
    letter(s)   type
    A           PSTR
    W           PWSTR
    T           PTSTR
    OLE         POLESTR
    BSTR        BSTR

    Also note that the BSTR type does not get the ‘C’ prefix, and is specific to the "to" part.

These conversion macros come in handy mostly when writing COM code.  COM expects all strings to be Unicode strings, even on Windows 98.  Suppose you have an application that reads an ANSI text string from an edit control, passes it to a COM method, retrieves another string from the COM method, and prints it in a window.  This application should first convert the ANSI string to Unicode to pass it to the COM method, then convert the string returned by the COM method from Unicode back to ANSI, and then print it on the screen.  Writing this by hand could involve a lot of code just to convert the strings from Unicode to ANSI and vice versa, but with the conversion macros, it’s a piece of cake!
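
If you want to experiment with such a round trip without ATL, the standard C library has its own pair of conversion functions, mbstowcs and wcstombs.  The sketch below is a rough portable analogue of A2W and W2A, not a replacement for them:  the helper names narrow_to_wide and wide_to_narrow are invented here, the results are heap-allocated rather than placed on the stack, and the conversion follows the current C locale (plain ASCII text converts in the default locale).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Normal -> wide.  Returns a heap string the caller must free,
   or NULL if the allocation or the conversion fails. */
wchar_t *narrow_to_wide(const char *text)
{
    assert(text != NULL);

    /* The wide string never has more characters than the input has bytes. */
    size_t capacity = strlen(text) + 1;
    wchar_t *wide = malloc(capacity * sizeof(wchar_t));
    if (wide != NULL && mbstowcs(wide, text, capacity) == (size_t)-1) {
        free(wide);
        wide = NULL;
    }
    return wide;
}

/* Wide -> normal.  Returns a heap string the caller must free,
   or NULL if the allocation or the conversion fails. */
char *wide_to_narrow(const wchar_t *text)
{
    assert(text != NULL);

    /* Each wide character needs at most MB_CUR_MAX bytes. */
    size_t capacity = wcslen(text) * MB_CUR_MAX + 1;
    char *narrow = malloc(capacity);
    if (narrow != NULL && wcstombs(narrow, text, capacity) == (size_t)-1) {
        free(narrow);
        narrow = NULL;
    }
    return narrow;
}
```

Having to size, check and free every buffer by hand like this is the work that the conversion macros save you.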


Now you have all the knowledge you need to make your program support Unicode.  Here I want to mention a point that confuses many programmers.  Categorizing applications as Unicode or ANSI is only meaningful to application developers.  As far as the system is concerned, there’s no difference between a Unicode and an ANSI application at all.  Unicode applications usually call the wide character versions of API functions, whereas ANSI applications call the normal character versions.  But it is completely possible to use Unicode functions explicitly in an ANSI application, and vice versa.  Consider the following sample, which could be a piece of code in an ANSI application and runs fine on both Windows 98 and Windows NT:

MessageBoxW( NULL, L"A Unicode string", L"test", MB_OK );

Also, nothing prevents you from calling an ANSI function from a Unicode application.  You only need to keep the data types compatible, and revise your algorithms so that they take into account that a character can be either one or two bytes long and is not guaranteed to be a single byte.
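
One place where this bites in practice is buffer sizes:  functions that fill a string buffer usually expect its size in characters, not in bytes.  Here is a minimal sketch using the standard C form of swprintf, which takes the buffer size in characters; the ELEMENT_COUNT macro and the format_item function are names made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Number of elements in an array, whatever the size of one element is. */
#define ELEMENT_COUNT(array) (sizeof(array) / sizeof((array)[0]))

/* Writes "item <number>" to out.  capacity is the size of out in
   characters.  Returns the number of characters written, or a negative
   value if the text does not fit. */
int format_item(wchar_t *out, size_t capacity, int number)
{
    assert(out != NULL && capacity > 0);
    return swprintf(out, capacity, L"item %d", number);
}
```

For a buffer declared as wchar_t buffer[ 16 ], pass ELEMENT_COUNT( buffer ) rather than sizeof( buffer ); the latter is the size in bytes and would overstate the capacity.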

For a sample of a Unicode application that is revised to be compiled on both Unicode and ANSI modes, see the article Changing IE Show Picture Setting.

This article originally appeared on BeginThread.com. It’s been republished here, and may contain modifications from the original article.
