Background

At an abstract level, working with Unicode is just a type conversion. Data from outside your program arrives in the type List<Byte<Encoding>> and must be converted to the type String (List<Char>) before it's usable. After you're done, all Strings must be converted to List<Byte<Encoding>> before being printed or saved.

Languages that don't enforce or even acknowledge this process make this more difficult to do, but self-discipline can avoid problems.
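
As a minimal sketch of that conversion, assuming Python 3 (where bytes plays the role of List<Byte<Encoding>> and str the role of String), the hypothetical helper below decodes at the input boundary, works on text, and encodes again at the output boundary. The function name shout and the default of UTF-8 are illustrative choices, not part of any library.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode incoming bytes, work on text, then encode the result again."""
    text = raw.decode(encoding)      # List<Byte<Encoding>> -> String
    result = text.upper()            # all real work happens on text
    return result.encode(encoding)   # String -> List<Byte<Encoding>>


if __name__ == "__main__":
    incoming = b"caf\xc3\xa9"        # UTF-8 bytes for "cafe" with an acute accent
    print(shout(incoming))
```

Keeping decode and encode at the edges of the program means everything in between only ever sees one string type.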

Encoding

There's not much point in talking about a Unicode String in memory because the programming language takes care of that--most languages support all the usual string operations on Unicode Strings that they used to support on ASCII Strings. Some operations may be slower, or Unicode Strings may take up more memory, however. Here are some useful encodings. The pre-Unicode encodings are still useful to know because they were still widely used until 2000, at least in the US.
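
As a rough illustration of how encodings differ in what they can represent and how much space they need, the Python 3 sketch below encodes the same text several ways and reports the byte counts. The helper name encoded_sizes and the particular list of encodings are assumptions chosen for the example.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le", "latin-1")):
    """Return {encoding: byte length}, or None where text cannot be encoded."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # this encoding has no way to write the text
    return sizes


if __name__ == "__main__":
    for sample in ("hello", "h\u00e9llo", "\u263a"):
        # ascii() keeps the output printable on any terminal
        print(ascii(sample), encoded_sizes(sample))
```

Plain ASCII text costs one byte per character in UTF-8 and Latin-1, while the smiley U+263A takes three bytes in UTF-8 and cannot be written in Latin-1 at all.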

Unicode Encodings

Pre-Unicode Encodings

How to Tell Which Encoding Is Used

Start by reading the documentation for the corpus. Then move on to the file command. Usage is file filename; it makes a valiant effort to guess the encoding (along with a bunch of other properties; see man file for details).

If you don't want to use file, you can try to determine the encoding manually. If you know the language, load the file into a browser and cycle through the encodings for that language. For example, if it's a Western European language and the accented characters are coming out wrong, cycle through UTF-8, Latin1, MacRoman and DOS encodings, possibly in a different order if you know something about the age and author of the file.
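
The same cycling can be roughed out in code. The Python 3 sketch below tries each candidate encoding in turn and keeps the ones that decode without error. The helper name candidate_decodings is hypothetical; the codec names are standard Python codecs standing in for UTF-8, Latin1, MacRoman and one DOS code page. Note the limitation: single-byte encodings accept almost any byte sequence, so a successful decode only means "possible", and a human still has to look at the text to see which result reads correctly.

```python
def candidate_decodings(raw, encodings=("utf-8", "latin-1", "mac_roman", "cp437")):
    """Return (encoding, text) pairs for every candidate that decodes raw."""
    results = []
    for name in encodings:
        try:
            results.append((name, raw.decode(name)))
        except UnicodeDecodeError:
            pass  # raw is not valid in this encoding, so rule it out
    return results


if __name__ == "__main__":
    mystery = b"caf\xe9"  # "cafe" with an acute accent, saved as Latin1
    for name, text in candidate_decodings(mystery):
        # ascii() keeps the output printable on any terminal
        print(name, ascii(text))
```

Trying UTF-8 first is a reasonable default because it is the only candidate here that can actually fail, which makes a successful UTF-8 decode fairly strong evidence.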

Code

Perl

If you are just reading and writing files, the easiest thing is to specify the encoding when opening the file so that all strings will be automatically converted.

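#!/usr/bin/perl
use strict;
use warnings;

# The I/O layers decode UTF-8 on input and encode UTF-8 on output.
open(my $inf, "<:encoding(UTF-8)", "utf8.txt") or die "Cannot read utf8.txt: $!";
open(my $outf, ">:encoding(UTF-8)", "pl_utf8.txt") or die "Cannot write pl_utf8.txt: $!";
while (my $line = <$inf>) {
  print $outf "pl:" . $line;
}
close($inf);
close($outf);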

However, if you have obtained strings from some other source with unspecified encoding, Perl has a separate module, Encode, which provides decode_utf8 to turn UTF-8 bytes into Perl character strings. For non-UTF-8 data, Encode::decode takes the name of the encoding as its first parameter.

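#!/usr/bin/perl
use strict;
use warnings;
use Encode;

binmode(STDOUT, ":encoding(UTF-8)");  # encode character strings on output

my $ustring1 = "Hello \x{263A}!\n";   # character string written with an escape
my $ustring2 = <DATA>;                # raw UTF-8 bytes read from this file
$ustring2 = decode_utf8( $ustring2 ); # or decode('utf8', $ustring2);

print "$ustring1$ustring2";
__DATA__
Hello ☺!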

I used this summary to write this section. It covers everything in more detail. These instructions apply to Perl 5.8 and later; complain to your local system administrator if you are still on 5.6.

Java

Java is much like Perl. Here's an example from an online tutorial.

import java.io.*;

class VerboseJava {
    public static void main(String[] args) {
        // try-with-resources closes both streams even if an exception is thrown
        try (BufferedReader rdr = new BufferedReader(
                 new InputStreamReader(new FileInputStream("utf8.txt"), "UTF-8"));
             BufferedWriter out = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream("jv_utf8.txt"), "UTF-8"))) {
            // only copies one line because I got tired of typing
            String line = rdr.readLine();  // null if the file is empty
            if (line != null) {
                out.write(line);
                System.out.println(line);
            }
        } catch (IOException e) {
            System.err.println("This exception handling is done wrong: " + e.getMessage());
        }
    }
}

Python

Python is much like the previous two languages, with two differences. First, the Unicode-aware file object must be imported: the default file object is just a stream of bytes. Second, Unicode strings are of type unicode; normal strings are just a series of bytes. (This describes Python 2. In Python 3, str is the Unicode type, bytes is the byte type, and the built-in open accepts an encoding argument.) Here is how to use the Unicode-aware objects.

#!/usr/bin/python
# First, replace the default open with the Unicode-aware open
from codecs import open
inf = open('utf8.txt', encoding='utf8')
# the encodings can be different of course.
# also note that you are not required to use named arguments
outf = open('py_utf8.txt', 'w', 'utf8')
for line in inf:
  outf.write(line)
# close both files so the output is flushed to disk
inf.close()
outf.close()

You can also manually read in normal strings (byte sequences), convert them to Unicode, work with them, and convert them back to byte sequences for output. This means you must keep track of what type your strings are. In the real world, you will make constant mistakes.

#!/usr/bin/python
# this is the hard way
def process(u_str):  # placeholder for your own Unicode-in, Unicode-out code
  return u_str

line = open('utf8file.txt', 'rb').readline() # line is a normal string (bytes)
u_str = line.decode('utf8') # u_str is a Unicode string
answer = process(u_str) # answer SHOULD BE a Unicode string
output = answer.encode('utf8') # output is a normal string (bytes)
open('utf8output.txt', 'wb').write(output)

C++

Mac OS has a serious bug and does not support locales, as far as I can tell. However, this is the accepted way to do it, where supported (I have tried this on Ubuntu Linux):

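#include <fstream>
#include <iostream>
#include <locale>
using namespace std;

int main(int argc, char * argv[])
{
  // "" selects the user's locale from the environment (LC_ALL, LANG, ...)
  locale loc("");

  for (int a = 1; a < argc; ++a) {
    wifstream fs(argv[a]);
    fs.unsetf(ios_base::skipws);  // keep whitespace so it can be counted
    fs.imbue(loc);                // decode the file using the locale's encoding

    unsigned ccount = 0, wscount = 0;
    wchar_t ch;

    while (fs >> ch) {
      if (isspace(ch, loc)) {     // locale-aware test for wide characters
        ++wscount;
      }
      ++ccount;
    }

    cout << argv[a] << ": ";
    if (fs.bad() || !fs.eof()) {
      cout << "error encountered after " << ccount << " characters" << endl;
      return 1;
    } else {
      cout << wscount << " whitespace characters out of " << ccount << endl;
    }
  }
  return 0;
}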

This example program counts the number of whitespace characters in a file. Your current locale needs to be UTF-8 in order to read a UTF-8 file. One way to do this is to set LC_ALL for the program when you run it:

$ g++ unicode.cpp && LC_ALL=en_US.UTF-8 ./a.out example.txt

If you're stuck with Mac OS and C++, you might try using the iconv program to convert everything to UTF-16 and read it into a wstring. I did this with a Common Lisp implementation with no Unicode support and it seemed to work. The example program came from this blog post.

Practicalities

You must have a display able to view your particular encoding or you'll see garbage even if the bytes are correct. PuTTY on Windows and Terminal on Mac OS both support UTF-8, though I think PuTTY requires you to change the setting from the default of ASCII.
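
To get a feel for what that garbage looks like, the Python 3 sketch below simulates a display that receives correct bytes but decodes them with the wrong encoding. The helper name as_seen_on is hypothetical, and the Latin1 default is just one common mismatch.

```python
import sys


def as_seen_on(text, sent_as="utf-8", shown_as="latin-1"):
    """Simulate a display that decodes correct bytes with the wrong encoding."""
    return text.encode(sent_as).decode(shown_as, errors="replace")


if __name__ == "__main__":
    # The encoding Python believes the terminal expects.
    print("stdout encoding:", sys.stdout.encoding)
    # ascii() keeps this line printable on any terminal.
    print(ascii(as_seen_on("\u00e9")))
```

An e with an acute accent (U+00E9) sent as UTF-8 and shown as Latin1 comes out as the two characters U+00C3 and U+00A9, the familiar "Ã©" pattern. Seeing that pattern in your terminal is a hint that the bytes are fine and the display setting is wrong.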

Some programming languages have trouble with Unicode literals and variable names. So you can't necessarily write a literal string or variable name in Arabic without some extra work.
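
One form the extra work can take is writing the literal with escape sequences so the source file stays pure ASCII. The sketch below assumes Python 3; the variable name greeting and the helper describe are illustrative only.

```python
import unicodedata

# Escapes keep the source file ASCII; the string is still full Unicode.
# These five code points spell the Arabic greeting "marhaba".
greeting = "\u0645\u0631\u062d\u0628\u0627"


def describe(text):
    """Return the official Unicode name of each character in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    for name in describe(greeting):
        print(name)
```

Looking up the character names is a handy way to confirm that an escaped literal contains what you intended, even on a display that cannot render it.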

How to use Unicode (last edited 2008-01-24 21:42:54 by NathanSanders)