A character set is a set of assignments of integers to characters. It is very common for us to think of the number 65 as referring to the letter 'A', but that isn't always the case; in EBCDIC, for example, 'A' is 193. This concept is the most important thing to understand here, and not understanding it is the biggest source of confusion about character sets. Also, by "character" we generally mean "displayed glyph" and not the C++ "char" data type.
A code page is for all practical purposes the same thing as a character set.
An encoding is a way to represent character set values on a computer as a string of bytes. With ASCII encoding, each encoded byte is the same as the character set value it represents. But with encodings such as MBCS and UTF8, 1 character != 1 byte.
SBCS means "Single Byte Character Set"; each character is represented by one byte. MBCS means "Multi Byte Character Set"; some characters are represented by one byte, while others are represented by two or more bytes. DBCS means "Double Byte Character Set" and is a kind of MBCS where characters are represented by at most two bytes. UTF8 (see below) is a kind of MBCS as well.
ASCII is a simple SBCS character set whose name stands for "American Standard Code for Information Interchange". It has been around since the early 1960s, but the last important version of it is the 1968 revision (ANSI X3.4-1968). ASCII includes 128 characters with values of 0-127 (see the chart below). ASCII is useful for displaying English, but that's about it.
Extended ASCII is an SBCS character set that most properly describes the 1983 ANSI revision (ANSI_X3.110-1983) of ASCII. It includes ASCII plus another 128 characters from 128 to 255 (see the chart below). Extended ASCII is useful for displaying English and most Western European languages. It is the same as ISO 8859-1 (below) but has some extra control characters.
ISO 8859-1 is an SBCS character set similar to Extended ASCII. It is geared towards displaying characters useful for most Western European languages and not towards displaying symbolic characters. It is the character set that most unsophisticated web pages on the Internet use. There is an ISO 8859-2, -3, etc. but they are of little concern in the world of Unicode.
Unicode is a "universal" character set that is effectively a superset of all other conventional character sets. As the Unicode web site says, "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." While the Unicode standard stands alone, it is for all practical purposes the same as ISO 10646; the two groups intentionally define these to be the same. The Unicode standard was originally designed around a 16 bit space, but it has since grown beyond that: code points now range from 0 to 0x10FFFF, which takes 21 bits. The first 65,536 code points, called the Basic Multilingual Plane, hold nearly all characters in common use. An example of a font that may be on your Windows computer that has a very large percentage of glyphs from the Unicode character set is the 20MB "Arial Unicode MS" TrueType font. See http://www.unicode.org for more.
UCS means "Universal Character Set". UCS2 simply means that you are using two bytes (uint16_t) to store a Unicode character, which limits you to the first 65,536 code points. UCS2 is how Microsoft implements Unicode in Windows. Most Unix variants implement Unicode via UCS4, which means that 4 bytes (uint32_t) are used per character. UCS4 can hold any Unicode code point directly, at the cost of extra memory for text that is made up mostly of low-valued characters.
UTF means "Unicode Transformation Format". UTF8 is an MBCS encoding of Unicode that is designed to be a superset of ASCII. The idea behind UTF8 is that string processing and parsing code that works with ASCII will work with UTF8 without modification. If you have an ASCII string, it could just as well be a UTF8 string. UTF8 adds the ability to store Unicode values as multi-byte sequences made up entirely of byte values greater than 127. Thus, with UTF8, 1 byte sometimes equals 1 character set value and sometimes it doesn't. But either way, all UTF8 sequences map directly to Unicode characters; UTF8 is Unicode. Many web pages (especially Asian web pages) are implemented with UTF8 instead of ISO 8859-1.
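To make the byte layout concrete, here is a minimal sketch of a UTF8 encoder. The function name AppendUtf8 is hypothetical and is not part of any framework mentioned here. It assumes the input is a Unicode code point in the range 0 to 0x10FFFF and rejects the surrogate range 0xD800-0xDFFF, which is not valid in UTF8.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from any particular library): appends the UTF8
// encoding of one Unicode code point to 'out'. Returns false, leaving 'out'
// untouched, for surrogates (0xD800-0xDFFF) and values above 0x10FFFF.
bool AppendUtf8(std::uint32_t cp, std::string& out)
{
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

With this sketch, the value 0x41 ('A') comes out as the single byte 0x41, while 0xE9 (e with acute accent) comes out as the two bytes 0xC3 0xA9. Every byte of a multi-byte sequence has its high bit set, which is why ASCII-oriented parsing code never mistakes part of one for an ASCII character.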
UTF16 is an encoding of Unicode that uses two bytes per character for nearly all characters in common use. It reserves the range 0xD800-0xDFFF so that, much as UTF8 does with bytes, a pair of 16 bit values (a "surrogate pair") can encode a character above 0xFFFF. For text that contains no such characters, UTF16 is the same thing as UCS2.
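The multi-unit part of UTF16 can be sketched as follows. AppendUtf16 is a hypothetical name used only for illustration; the sketch assumes the input is a Unicode code point and uses the standard surrogate ranges (0xD800-0xDBFF for the first unit, 0xDC00-0xDFFF for the second).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from any particular library): appends the UTF16
// encoding of one Unicode code point to 'out'. Returns false for values in
// the surrogate range and for values above 0x10FFFF.
bool AppendUtf16(std::uint32_t cp, std::u16string& out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                       // reserved for surrogate pairs

    if (cp < 0x10000) {
        // One 16 bit unit; this is exactly what UCS2 would store.
        out += static_cast<char16_t>(cp);
    } else if (cp <= 0x10FFFF) {
        // Two 16 bit units: subtract 0x10000, then split the remaining
        // 20 bits into a high half and a low half.
        const std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    } else {
        return false;
    }
    return true;
}
```

Code that treats UTF16 data as UCS2 will see each half of a surrogate pair as a separate, meaningless character, so it is worth knowing which of the two a given API really expects.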
Q: Why is UTF8 used so often instead of UCS2? Windows NT/XP uses 16 bit
characters ("WCHAR") natively, right?
A: It's all about compatibility. There is a very large amount of
existing code that assumes one byte (Uint8) per character and it
would be prohibitive to port it all to Uint16. The fact that Windows XP
runs with UCS2 and has UCS2 APIs isn't enough to make it practical for
many to use it. Windows 98 doesn't use UCS2, Unix and Macintosh don't
use UCS2, and most third party tools you'll need to use don't use UCS2.
So unfortunately, as great as UCS2 is, it is impractical for many or
most projects to use it.
Q: Is Unicode really enough to hold all characters with 16 bits, or is
it going to need to be revised some day to 32 bits?
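A: No, and the revision has already happened. Unicode now defines code
points from 0 to 0x10FFFF, which takes 21 bits. The first 65,536 code
points (the Basic Multilingual Plane) cover nearly all characters in
everyday use. The space above that holds, among other things, many of
the rare and ancient Asian characters that scholars of classic Asian
script argued were missing from the 16 bit set.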
Q: What do I have to know in order to program with strings that are
UTF8-encoded?
A: For all values < 128, UTF8 strings are the same as ASCII, so any
code that you have that looks for control characters or characters such
as <>:?0^, etc. will work fine. However, you must remember that
for values >= 128, one byte doesn't necessarily correspond to one
character set value. You will need a decoding function to decipher a
UTF8 string.
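As a rough illustration of what such a decoding function does, here is a minimal sketch. DecodeUtf8 is a hypothetical name and signature, not the API of any library mentioned in this document. It reads one code point at a time and rejects truncated, malformed, and overlong sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from any particular library): decodes the code
// point that starts at s[pos]. On success it stores the value in 'cp',
// advances 'pos' past the sequence, and returns true. On failure it returns
// false and leaves 'pos' and 'cp' unchanged.
bool DecodeUtf8(const std::string& s, std::size_t& pos, std::uint32_t& cp)
{
    if (pos >= s.size())
        return false;

    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t   extra;    // number of continuation bytes that must follow
    std::uint32_t value;    // payload bits collected so far
    std::uint32_t minimum;  // smallest value that legitimately needs this length

    if (lead < 0x80)                { extra = 0; value = lead;        minimum = 0;       }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; value = lead & 0x1F; minimum = 0x80;    }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; value = lead & 0x0F; minimum = 0x800;   }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; value = lead & 0x07; minimum = 0x10000; }
    else return false;              // stray continuation byte or invalid lead byte

    if (s.size() - pos <= extra)
        return false;               // sequence is cut off by the end of the string

    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80)
            return false;           // continuation bytes must look like 10xxxxxx
        value = (value << 6) | (c & 0x3F);
    }

    if (value < minimum)                     return false;  // overlong encoding
    if (value > 0x10FFFF)                    return false;  // beyond Unicode
    if (value >= 0xD800 && value <= 0xDFFF)  return false;  // surrogate

    cp = value;
    pos += extra + 1;
    return true;
}
```

A caller would typically start with pos at 0 and call the function in a loop until pos reaches the end of the string, stopping early if it ever returns false.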
Q: What does the strlen() of a UTF8 string mean? Is it the number of
bytes or the number of characters ("glyphs")?
A: It is the number of bytes. If you want to know how many characters
(a.k.a. code points) a string has, you will need to apply a decoding
function.
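If all you need is the count, a full decode is not required. The following sketch (Utf8Length is a hypothetical name) counts code points by counting every byte that is not a continuation byte. It assumes the input is valid, NUL-terminated UTF8 and does no validation of its own.

```cpp
#include <cstddef>

// Hypothetical helper (not from any particular library): returns the number
// of code points in a NUL-terminated UTF8 string. Every code point starts
// with exactly one byte that is NOT of the form 10xxxxxx, so counting those
// bytes counts the code points. Assumes the input is valid UTF8.
std::size_t Utf8Length(const char* s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the six-byte string "na\xC3\xAFve" (the word "naive" spelled with a diaeresis over the i), strlen() reports 6 while Utf8Length() reports 5.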
Q: How do I convert characters between various character sets and
encodings?
A: There are various encoding and decoding functions to convert from
any recognized character set and encoding to any other. Unix-based
platforms use the iconv API, while Windows uses the MultiByteToWideChar
and WideCharToMultiByte APIs. The Maxis Core framework has the
ConvertEncoding function.
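For a feel of what such a conversion involves, here is a minimal hand-written sketch of one of the simplest cases, ISO 8859-1 to UTF8. It relies on the fact that the 256 values of ISO 8859-1 are the same as the first 256 Unicode code points. Latin1ToUtf8 is a hypothetical name; it is not the iconv, Windows, or ConvertEncoding API, and real projects would normally use one of those instead.

```cpp
#include <string>

// Hypothetical helper (not from any particular library): converts a string
// of ISO 8859-1 bytes to UTF8. Values below 0x80 are copied unchanged;
// values 0x80-0xFF become the two-byte form 110xxxxx 10xxxxxx.
std::string Latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    out.reserve(latin1.size() * 2);  // worst case: every byte doubles

    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += ch;
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Conversions between character sets that do not line up this neatly need lookup tables, which is why a general purpose API is usually the better choice.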
The Unicode character set is too big to display here, but charts for the entire character set can be found at: http://www.unicode.org/charts/. Microsoft has a very expansive font called "Arial Unicode MS" that includes much of the Unicode character set. You can use this font and the Windows "Character Map" applet to navigate the Unicode character set. Lastly, the Framework documentation includes a large Excel spreadsheet that contains every Unicode character, one per cell.