Background
At an abstract level, working with Unicode is just a type conversion. Data from outside your program arrives in the type List<Byte<Encoding>> and must be converted to the type String (List<Char>) before it's usable. After you're done, all Strings must be converted to List<Byte<Encoding>> before being printed or saved.
Languages that don't enforce or even acknowledge this process make this more difficult to do, but self-discipline can avoid problems.
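As a rough sketch of that discipline in Python 3, where bytes plays the role of List<Byte<Encoding>> and str plays the role of String, something like the following keeps the conversions at the edges of the program. The function name shout and the sample data are made up for illustration.

```python
# Sketch: decode at the boundary, work on text, encode on the way out.
# `shout` and the sample bytes are illustrative names, not from any library.

def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    text = raw.decode(encoding)      # bytes -> str (input boundary)
    result = text.upper()            # all processing happens on str
    return result.encode(encoding)   # str -> bytes (output boundary)

incoming = "caf\u00e9".encode("utf-8")   # pretend this arrived from a file or socket
outgoing = shout(incoming)
print(outgoing)                          # still bytes; decode again to display as text
```

Everything between decode and encode only ever sees str, so there is exactly one place to get the input encoding wrong and one place to get the output encoding wrong.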
Encoding
There's not much point in talking about a Unicode String in memory because the programming language takes care of that--most languages support all the usual string operations on Unicode Strings that they used to support on ASCII Strings. Some operations may be slower, or Unicode Strings may take up more memory, however. Here are some useful encodings. The pre-Unicode encodings are still useful to know because they were still widely used until 2000, at least in the US.
Unicode Encodings
- UTF-8 -- Backward compatible with ASCII, it uses a variable width encoding (two to four bytes) for characters numbered 128 and higher.
- UTF-16 -- Variable width encoding: two bytes per character up to 65535 (#FFFF), and four bytes (a "surrogate pair") for characters higher than that. You may rarely need those characters, but emoji and some less common CJK characters live there. Characters below 256 are padded with 00, which can cause problems in zero-terminated-string languages if your program assumes it is working with some other encoding.
- UTF-32 -- Fixed width encoding, four bytes per character. Unicode limits characters to #10FFFF (about 1.1 million), far fewer than the 4.29 billion values that fit in four bytes, so it will remain fixed width.
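To make the size differences concrete, here is a small Python 3 sketch that encodes four sample characters (my own choice, not from any standard test set) and counts the bytes. Note that a character above #FFFF takes four bytes in UTF-16.

```python
# Encoded size in bytes of a few characters under each Unicode encoding.
# The "-le" codec variants are used so that no byte-order mark is prepended.
samples = ["A", "\u00e9", "\u263a", "\U0001F600"]   # A, e-acute, smiley, emoji

sizes = {}
for ch in samples:
    sizes[ch] = tuple(len(ch.encode(enc))
                      for enc in ("utf-8", "utf-16-le", "utf-32-le"))
    print("U+%04X" % ord(ch), sizes[ch])

# An ASCII character in UTF-16 carries a zero byte, which is what trips up
# code that expects zero-terminated strings.
padded = "A".encode("utf-16-le")
print(padded)
```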
Pre-Unicode Encodings
- ASCII -- 7-bit encoding which is almost always seen today implemented as 8 bits. The 8th bit is left undefined by the standard, though.
- Latin1 / ANSI code page 1252 -- Strictly speaking these are two encodings: Latin1 is ISO-8859-1, and code page 1252 is the Windows variant, which differs only in the range 128-159, where it adds printable characters such as curly quotes. Code page 1252 was the standard Windows code page until Windows 2000; its high characters are mostly accented characters for Western European languages.
- "8-bit ASCII", "DOS ASCII" -- The real name is code page 437, the original IBM PC character set. The top 128 characters have a mixture of accented characters and line-drawing characters useful for creating character-user-interfaces. Hopefully these first four encodings are listed in order of frequency. DOS and MacRoman encodings should be quite rare by this time.
- MacRoman -- Similar to DOS ASCII, except that instead of line-drawing characters, there are a bunch of symbols.
- Arabic: Code page 1256 -- I don't know if this was the standard for Arabic before Unicode or not.
- S-JIS -- A Japanese encoding similar to UTF-8 in that it is variable width (one or two bytes per character), but of course only for Roman and Japanese characters. In the 1990s, it was promulgated primarily by Microsoft and hence failed to attract a united following.
- EUC -- Another Japanese encoding which has held on in opposition to S-JIS and then Unicode. Since 2000 its use has supposedly lessened, but you might still run into it.
There are others, of course; these are just the ones I have seen. The Python documentation has a long list. Glancing over that list, it appears that at least some of the facts in this list are wrong.
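The practical consequence of all these encodings is that one byte can mean several different characters. Here is a quick Python 3 sketch, using the codec names from Python's standard library (cp437 is the DOS code page); the sample byte and the Japanese word are just examples I picked.

```python
# The same byte means different things in different legacy encodings.
raw = b"\xe9"
decoded = {enc: raw.decode(enc)
           for enc in ("latin-1", "cp1252", "cp437", "mac_roman")}
for enc, ch in decoded.items():
    print("%-10s U+%04X" % (enc, ord(ch)))

# Japanese text produces different bytes under Shift JIS and EUC-JP.
word = "\u65e5\u672c"            # the two characters of "Nihon" (Japan)
sjis = word.encode("shift_jis")
euc = word.encode("euc_jp")
print(sjis, euc)
```

Latin1 and code page 1252 agree on this byte (it is a lowercase e with an acute accent), while the DOS and Mac encodings each map it to something else.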
How to Tell Which Encoding Is Used
Start by reading the documentation for the corpus. Then move on to the file command. Usage is file filename, and it makes a valiant effort to guess the encoding (along with a bunch of other properties, see man file for details).
If you don't want to use file, you can try to determine the encoding manually. If you know the language, load the file into a browser and cycle through the encodings for that language. For example, if it's a Western European language and the accented characters are coming out wrong, cycle through UTF-8, Latin1, MacRoman and DOS encodings, possibly in a different order if you know something about the age and author of the file.
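The cycling can also be done in code. The following Python 3 sketch tries a list of candidate encodings in order and reports the first one that decodes without error. The function name guess_encoding and the default candidate list are my own inventions, and a successful decode only proves the bytes are legal in that encoding, not that the result is what the author meant, so you still need to look at the text.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `data` without error, else None.

    Order matters: latin-1 accepts every possible byte sequence, so it has
    to come last or nothing after it will ever be tried.
    """
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None

print(guess_encoding(b"plain text"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
```

UTF-8 is a good early candidate because text in other 8-bit encodings almost never happens to be valid UTF-8.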
Code
Perl
If you are just reading and writing files, the easiest thing is to specify the encoding when opening the file so that all strings will be automatically converted.
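#!/usr/bin/perl
use strict;
use warnings;
# The :encoding(UTF-8) layer decodes on read and encodes on write.
open(my $inf, "<:encoding(UTF-8)", "utf8.txt") or die "utf8.txt: $!";
open(my $outf, ">:encoding(UTF-8)", "pl_utf8.txt") or die "pl_utf8.txt: $!";
while (<$inf>) {
    print $outf "pl:" . $_;
}
close($inf);
close($outf);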
However, if you have obtained strings from some other source with unspecified encoding, Perl has a separate module, Encode, which provides decode_utf8 to turn UTF-8 bytes into Perl character strings. Encode::decode allows you to specify an encoding as the first parameter if you have non-UTF-8 data. Remember to set an encoding on the output handle as well, or Perl will warn about wide characters when you print.
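#!/usr/bin/perl
use strict;
use warnings;
use Encode;
binmode(STDOUT, ":encoding(UTF-8)"); # encode character strings on output
my $ustring1 = "Hello \x{263A}!\n";
my $ustring2 = <DATA>; # raw bytes from the source file
$ustring2 = decode_utf8( $ustring2 ); # or decode('UTF-8', $ustring2);
print "$ustring1$ustring2";
__DATA__
Hello ☺!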
I used this summary to write this section. It covers everything in more detail. These instructions apply to Perl 5.8 and later. Complain to your local system administrator if you are still on 5.6.
Java
Java is much like Perl. Here's an example from an online tutorial.
import java.io.*;
class VerboseJava {
    public static void main(String[] args) {
        // only copies one line because I got tired of typing
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader rdr = new BufferedReader(new InputStreamReader(new FileInputStream("utf8.txt"), "UTF-8"));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("jv_utf8.txt"), "UTF-8"))) {
            String line = rdr.readLine(); // null if the file is empty
            if (line != null) {
                out.write(line);
                System.out.println(line);
            }
        } catch (IOException e) {
            System.out.println("This exception handling is done wrong.");
        }
    }
}
Python
Python is much like the previous two languages with two differences. (This describes Python 2; in Python 3 the built-in open accepts an encoding argument and the ordinary str type is already Unicode.) First, the Unicode-aware file object must be imported: the default file object is just a stream of bytes. Second, Unicode strings are of type unicode; normal strings are just a series of bytes. Here is how to use the Unicode-aware objects.
#!/usr/bin/python
# First, replace the default open with the Unicode-aware open
from codecs import open
inf = open('utf8.txt', encoding='utf8')
# the encodings can be different of course.
# also note that you are not required to use named arguments
outf = open('py_utf8.txt', 'w', 'utf8')
for line in inf:
    outf.write(line)
inf.close()
outf.close()
You can also manually read in normal strings (byte sequences), convert them to Unicode, work with them, and convert them back to byte sequences for output. This means you must keep track of what type your strings are. In the real world, you will make constant mistakes.
#!/usr/bin/python
# this is the hard way
def process(u_str):
    # placeholder for whatever work you do on the Unicode string
    return u_str.upper()
line = open('utf8file.txt', 'rb').readline() # line is a normal string (bytes)
u_str = line.decode('utf8') # u_str is a Unicode string
answer = process(u_str) # answer SHOULD BE a Unicode string
output = answer.encode('utf8') # output is a normal string
open('utf8output.txt', 'wb').write(output)
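For comparison, here is a sketch of the same two approaches in Python 3, where the built-in open accepts an encoding argument and str is the Unicode type. The file names, the temporary directory and the "py3:" prefix are invented so that the example can run on its own.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "utf8.txt")
dst = os.path.join(workdir, "py3_utf8.txt")

# Create a small UTF-8 input file so the example is self-contained.
with open(src, "w", encoding="utf-8") as f:
    f.write("Hello \u263a!\n")

# Easy way: let open() decode on read and encode on write.
with open(src, encoding="utf-8") as inf, \
     open(dst, "w", encoding="utf-8") as outf:
    for line in inf:
        outf.write("py3:" + line)

# Hard way: read raw bytes and decode them by hand.
with open(dst, "rb") as f:
    raw = f.read()               # bytes
text = raw.decode("utf-8")       # str
print(ascii(text))               # ascii() keeps the output printable anywhere
```

In Python 3, mixing the two types raises a TypeError instead of silently producing garbage, which catches many of the constant mistakes mentioned above.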
C++
Mac OS has a serious bug and does not support locales, as far as I can tell. However, this is the accepted way to do it, where supported (I have tried this on Ubuntu Linux):
#include <fstream>
#include <iostream>
#include <locale>
using namespace std;

int main(int argc, char * argv[])
{
    locale loc(""); // the user's locale, taken from the environment
    for (int a = 1; a < argc; ++a) {
        wifstream fs;
        fs.imbue(loc); // set the locale before any input is read
        fs.open(argv[a]);
        fs.unsetf(ios_base::skipws);
        unsigned ccount = 0, wscount = 0;
        wchar_t ch;
        while (fs >> ch) {
            if (isspace(ch, loc)) { // locale-aware test for wide characters
                ++wscount;
            }
            ++ccount;
        }
        cout << argv[a] << ": ";
        if (fs.bad() || !fs.eof()) {
            cout << "error encountered after " << ccount << " characters" << endl;
            return 1;
        } else {
            cout << wscount << " whitespace characters out of " << ccount << endl;
        }
    }
    return 0;
}
This example program counts the number of whitespace characters in a file. Your locale needs to be UTF-8 when the program runs (not when it is compiled) in order to read a UTF-8 file. Do this by running:
$ g++ unicode.cpp && LC_ALL=en_US.UTF-8 ./a.out example.txt
If you're stuck with Mac OS and C++, you might try using the iconv program to convert everything to UTF-16 and read it into a wstring. I did this with a Common Lisp implementation with no Unicode support and it seemed to work. The example program came from this blog post.
Practicalities
You must have a display able to view your particular encoding or you'll see garbage even if the bytes are correct. PuTTY on Windows and Terminal on Mac OS both support UTF-8, though I think PuTTY requires you to change the setting from the default of ASCII.
Some programming languages have trouble with Unicode literals and variable names. So you can't necessarily write a literal string or variable name in Arabic without some extra work.
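Python 3 is one of the languages where this mostly just works, provided the source file is saved as UTF-8 (its default source encoding). Here is a small sketch; the variable names and the Arabic word are my own examples. When you can't trust your editor or terminal, the escape forms are the safe fallback.

```python
# Three spellings of the same one-character string.
literal = "☺"
escaped = "\u263a"
named = "\N{WHITE SMILING FACE}"
print(literal == escaped == named)

# A literal in Arabic script; len() counts characters, not bytes.
salaam = "سلام"
print(len(salaam), len(salaam.encode("utf-8")))

# Non-ASCII identifiers are allowed in Python 3 (see PEP 3131).
größe = 3
print(größe)
```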
