LIS 523/5 - Special Characters

There are several methods of incorporating special characters, such as mathematical symbols and characters used by various languages:
  1. for characters and diacritics from Western European languages, entering them as single bytes using the Latin-1 character set;
  2. for these and many other symbols, including those in some other writing systems, using HTML 4 character entity references,
    a. in mnemonic form (e.g., &alpha;) or
    b. in the form of decimal Unicode references (e.g., &#945;);
  3. on a page marked as using a particular character set,
    a. for a specific character set (e.g., Turkish) or
    b. for a byte-oriented encoding of Unicode (usually UTF-8);
  4. images.

Method 1 is frequently used for materials in languages other than English but does not mix with method 3. It may also produce strange results if the browser's auto-detection guesses the page's character encoding incorrectly.

Opera seems to be a little weaker than Internet Explorer and Netscape at supporting the complete range of character entity references. Support from older browsers may be much more limited.

Method 2, especially 2.b, is recommended as the safest approach, but can be clumsy if a lot of special characters and diacritics are required.

Method 2.b potentially supports many more characters than method 2.a, and even allows diacritics to be added to letters where precombined forms are not included in Unicode.
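
As a minimal sketch of the two reference forms (Python is used here only for illustration and is not part of the original notes), the standard-library function html.unescape resolves the mnemonic and decimal forms to the same character. The combining marks used below (769, combining acute; 771, combining tilde) are examples chosen for this sketch, not taken from the notes:

```python
import html
import unicodedata

# Mnemonic and decimal references for the same Greek letter.
mnemonic = html.unescape("&alpha;")
decimal = html.unescape("&#945;")

# A letter followed by a decimal reference to a combining diacritic.
e_acute = html.unescape("e&#769;")   # e + combining acute accent
q_tilde = html.unescape("q&#771;")   # q + combining tilde

# NFC normalization merges a pair only if Unicode has a precombined form.
e_acute_nfc = unicodedata.normalize("NFC", e_acute)
q_tilde_nfc = unicodedata.normalize("NFC", q_tilde)

print(mnemonic == decimal)   # both are U+03B1
print(len(e_acute_nfc))      # 1: a precombined e-acute exists
print(len(q_tilde_nfc))      # 2: no precombined q-tilde, so the pair remains
```

The last line is the case the paragraph above describes: the diacritic can only be attached through a separate reference, because no single precombined code point is available.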

Method 2.b seems to be the most common in page editors. For example, in Open Office, when you include non-Western-European special characters in an HTML file, Open Office translates the special characters to numeric character references. (When you ask to save the file, Open Office issues a warning, but still saves the file with the character references intact.)
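
The kind of translation an editor performs can be approximated in a few lines. This is only a sketch of the general idea, not Open Office's actual code; the function name to_numeric_references is hypothetical, and it relies on Python's built-in "xmlcharrefreplace" error handler:

```python
def to_numeric_references(text: str, charset: str = "latin-1") -> bytes:
    """Encode text in the given charset, replacing every character the
    charset cannot represent with a decimal reference such as &#945;."""
    return text.encode(charset, "xmlcharrefreplace")


# Western European characters stay as single Latin-1 bytes (method 1);
# everything else becomes a decimal reference (method 2.b).
print(to_numeric_references("café α"))           # b'caf\xe9 &#945;'

# Restricting the output to ASCII converts the e-acute as well.
print(to_numeric_references("café α", "ascii"))  # b'caf&#233; &#945;'
```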

To specify a particular character set in HTML for method 3, you can use a "Content-Type" meta tag such as

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
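
To illustrate how software might read the character set out of such a declaration, here is a small sketch using Python's standard email.message module to parse the content value. The helper name declared_charset and the fallback to iso-8859-1 are assumptions made for this example, not behaviour specified in these notes:

```python
from email.message import Message


def declared_charset(content_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, in lower case,
    or the (assumed) default when no charset is declared."""
    header = Message()
    header["Content-Type"] = content_value
    return header.get_content_charset() or default


charset = declared_charset("text/html;charset=iso-8859-1")
page_bytes = b"caf\xe9"             # sample bytes, as they might arrive
print(charset)                      # iso-8859-1
print(page_bytes.decode(charset))   # café
```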

Method 3.a works for many non-Western European alphabets, but generally does not allow combining different alphabets (even slight variations on the same alphabet, like Turkish and Icelandic, or Spanish and Polish, may conflict).
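
The Turkish and Icelandic conflict can be seen directly by decoding the same bytes under ISO 8859-1 (Latin-1) and ISO 8859-9 (Turkish). The byte values below were picked for this sketch because they are among the positions where the two sets differ; the helper fits is hypothetical:

```python
# Four byte values that Latin-1 assigns to Icelandic letters.
raw = bytes([0xD0, 0xDE, 0xF0, 0xFE])

as_latin1 = raw.decode("iso8859_1")    # eth and thorn, upper and lower case
as_turkish = raw.decode("iso8859_9")   # G-breve and S-cedilla, upper and lower case


def fits(text: str, charset: str) -> bool:
    """Report whether every character of text exists in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


mixed = "Þ and ş"                      # Icelandic thorn plus Turkish s-cedilla
print(as_latin1, as_turkish)
print(fits(mixed, "iso8859_1"))        # False
print(fits(mixed, "iso8859_9"))        # False
print(fits(mixed, "utf-8"))            # True
```

Neither single-byte set can hold both letters, which is why a page that mixes such alphabets needs method 2 or method 3.b.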

Other character sets use the byte values above #127, which lie outside the ASCII range, for their own purposes. If you are viewing this page in Latin-1, you should see the following two bytes (#196 and #197) as a capital A with a dieresis (or umlaut) followed by a capital A with a ring:

ÄÅ

Browsers allow the user to select the character set. To do this, for example, in Internet Explorer, select "Encoding" from the "View" menu.

As you select different character sets, you will typically see different special characters; for example, in Greek, capital delta and capital epsilon, and in Cyrillic, capital de and capital ye. If the desired character set is not installed, you may just see question marks or boxes; Internet Explorer also seems to mess up the display for Hebrew. Some character sets, such as those for Chinese, use two-byte codes; so, in those cases, you will see only one character.
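
The effect of switching encodings can be reproduced outside the browser by decoding the same two bytes (#196 and #197) several ways. The codec choices are assumptions for this sketch: Windows-1251 stands in for "Cyrillic" (ISO 8859-5 would give different letters) and GB2312 stands in for a two-byte Chinese set:

```python
raw = bytes([196, 197])

# Python codec names for Latin-1, Greek, Cyrillic, and simplified Chinese.
codecs_to_try = ("iso8859_1", "iso8859_7", "cp1251", "gb2312")
readings = {name: raw.decode(name) for name in codecs_to_try}

for name, text in readings.items():
    # The single-byte sets yield two characters; the two-byte set yields one.
    print(f"{name:10} {text}  ({len(text)} character(s))")
```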

Method 3.b is the most universal, but is not yet supported by many tools. More obscure Unicode characters, such as those for North American indigenous syllabics, are still not supported by common browsers; Internet Explorer does not even support IPA, though Netscape, Firefox, and Opera do. Even characters for widely used languages such as Chinese may require the user to download additional support; Internet Explorer may generate an error message instead of downloading the added support automatically as it is supposed to do. Even where the character set is supported, the relative positions of the characters in the rendering may be wrong (for example, a vowel mark may appear to the right instead of to the left of a related consonant).

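For each character, UTF-8 as defined in RFC 2279 uses from one to six bytes ("octets"); the later RFC 3629 limits sequences to four bytes. RFC 2279 describes the scheme as follows: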

The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character value. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the value of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.

(Yergeau, F. 1998. UTF-8, a transformation format of ISO 10646. http://www.ietf.org/rfc/rfc2279.txt).
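
The quoted rule can be turned into a short routine. The following is a sketch written for these notes, with a hypothetical function name utf8_octets; it follows the one-to-six-octet scheme of RFC 2279 rather than the narrower range accepted by current software:

```python
def utf8_octets(code_point: int) -> bytes:
    """Encode one character value following the RFC 2279 description."""
    if code_point < 0:
        raise ValueError("character value must not be negative")
    if code_point < 0x80:
        # A "sequence" of one: high-order bit 0, then 7 bits of value.
        return bytes([code_point])

    # A sequence of n octets carries (7 - n) + 6 * (n - 1) payload bits.
    for n, payload_bits in ((2, 11), (3, 16), (4, 21), (5, 26), (6, 31)):
        if code_point < (1 << payload_bits):
            break
    else:
        raise ValueError("character value needs more than 31 bits")

    # Following octets: prefix 10, then 6 bits each, filled from the right.
    following = []
    value = code_point
    for _ in range(n - 1):
        following.append(0x80 | (value & 0x3F))
        value >>= 6

    # Initial octet: n one-bits, a zero bit, then the remaining value bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | value] + following[::-1])


print(utf8_octets(0x105).hex())   # c485
print(utf8_octets(945).hex())     # ceb1
```

For values within the current Unicode range (outside the surrogate block), the result matches Python's own encoder, for example chr(0x105).encode("utf-8").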

If you choose to view the page as UTF-8, you will likely see above a single lower-case A with a hook opening to the right (an ogonek). To understand why this symbol appears, consider first its Unicode value: 105 in hexadecimal, 261 in decimal, and 100000101 in binary. This translates into UTF-8 as the following two bytes in binary: 11000100 10000101, where the prefixes 110 and 10 are followed by the "payload" bits 00100 and 000101. The first of these bytes is indeed the code for the capital A with dieresis (#196). The second, however, is not quite the code for the capital A with ring (#197), which is actually 11000101 in binary. Strictly speaking, a byte beginning with 11 is not a valid continuation byte, but lenient decoders, including browsers at the time of writing, generally ignore the second bit of UTF-8 bytes after the first.
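
The difference between a strict and a lenient reading of these bytes can be sketched as follows. The function lenient_two_byte is a hypothetical stand-in for the forgiving behaviour described above; Python's own decoder is strict and rejects the pair #196, #197:

```python
def lenient_two_byte(first: int, second: int) -> str:
    """Decode a two-byte sequence while ignoring the top two bits of the
    second byte, as a forgiving decoder might."""
    return chr(((first & 0x1F) << 6) | (second & 0x3F))


# The well-formed sequence 11000100 10000101 decodes normally.
well_formed = bytes([0b11000100, 0b10000101]).decode("utf-8")

# The bytes actually on the page are 11000100 11000101.
try:
    bytes([196, 197]).decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

lenient = lenient_two_byte(196, 197)

print(well_formed, hex(ord(well_formed)))   # a with ogonek, 0x105
print(strict_accepts)                       # False
print(lenient == well_formed)               # True
```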

Some more powerful page editing software (for example, Microsoft FrontPage) allows you to create pages in UTF-8 coding and will insert UTF-8 byte sequences rather than character entity references when you ask it to insert special characters.

For occasional hand coding of special characters, Calculator in Windows provides an easy translator among hexadecimal, decimal, and binary. Select "Scientific" under "View"; then, click the radio button for the base in which you want to input, type the number, and click the radio button for the base to which you want to convert.
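
For readers who prefer a scripting language to Calculator, the same conversions take one line each in Python. This is offered only as an alternative sketch; the value 261 is the example used earlier in these notes:

```python
value = 261

as_hex = format(value, "x")      # hexadecimal digits
as_binary = format(value, "b")   # binary digits

# Converting back: int() takes the digits and the base they are written in.
from_hex = int("105", 16)
from_binary = int("100000101", 2)

print(as_hex, as_binary, from_hex, from_binary)   # 105 100000101 261 261
```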

Method 4 is actually the safest of the lot, but using images instead of text has obvious disadvantages.


Last updated April 24, 2007.
This page maintained by Prof. Tim Craven
E-mail (text/plain only): craven@uwo.ca
Faculty of Information and Media Studies
University of Western Ontario,
London, Ontario
Canada, N6A 5B7