(Formerly known as ICOPOST.)
This is a suite of tools for part-of-speech tagging corpora.
The last official version is 1.8.4, released Sep. 23, 2002, but there is a more recent unofficial version, 1.8.6-tresoldi.
Documentation:
A Case Study in POS Tagging Using the ICOPOST Toolkit
Conversion scripts:
ACOPOST uses a format that it calls "cooked" to extract ngrams. It includes some Perl scripts for converting various formats into cooked, including one that is supposed to convert the tagged Penn Treebank (PTb) format, but it is not completely successful.
First, there are actually multiple formats in the PTb. For example, in the atis section (which consists of one file), there is no punctuation, so each sentence is delimited by a line of equals signs. In the wsj section, there is punctuation, and the equals-sign lines sometimes surround multiple sentences. Putting each sentence on one line must therefore be done differently depending on which section of the PTb you are extracting from.
Second, the script doesn't even try to put sentences on one line, which makes a difference in the ngram files that ACOPOST creates. It also places a space at the beginning of the cooked file, which messes up ACOPOST's ability to figure out whether a given token is a word or a tag.
As a result, I created two C++ scripts that seem to successfully convert from two PTb formats:
- atis2cooked.cpp which, as the name suggests, turns the single atis file into a cooked file.
- wsj2cooked.cpp which, as the name suggests, turns a single wsj file into a cooked file. Because there are many wsj files, this is supported by a script which runs wsj2cooked.cpp on multiple wsj files. It can, of course, be modified by the user to run on whatever wsj files one needs it to run on. The way I use it, it keeps appending cooked data to the same file, so many wsj files end up in one cooked file, but of course, you can modify and use it however you want.
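To illustrate the kind of conversion involved, here is a minimal sketch for the atis-style case described above. It is not the code of atis2cooked.cpp. The function name toCooked is made up for this example, and it assumes that the tagged input consists of word/TAG tokens (with bare [ and ] bracket tokens that can be dropped), that a line starting with five or more equals signs ends a sentence, and that a cooked file has one sentence per line with alternating word and tag tokens and no leading space. Because it treats every equals-sign line as a sentence boundary, it would merge sentences in wsj files, where those lines can surround several sentences.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of ACOPOST): reads PTb-style tagged text
// and returns it with one sentence per line as "word TAG word TAG ...".
std::string toCooked(std::istream& in) {
    std::string out, sentence, line;

    // Emit the sentence collected so far, if any, without a leading space.
    auto flush = [&]() {
        if (!sentence.empty()) {
            out += sentence;
            out += '\n';
            sentence.clear();
        }
    };

    while (std::getline(in, line)) {
        // Assumption: a line of equals signs marks a sentence boundary.
        if (line.compare(0, 5, "=====") == 0) {
            flush();
            continue;
        }
        std::istringstream tokens(line);
        std::string tok;
        while (tokens >> tok) {
            // Split at the LAST slash so escaped slashes inside the word
            // (such as 1\/2/CD) stay with the word.
            std::string::size_type slash = tok.rfind('/');
            if (slash == std::string::npos || slash == 0 ||
                slash + 1 == tok.size()) {
                continue;  // skips "[" and "]" and anything malformed
            }
            if (!sentence.empty()) sentence += ' ';
            sentence += tok.substr(0, slash);
            sentence += ' ';
            sentence += tok.substr(slash + 1);
        }
    }
    flush();  // the last sentence may not be followed by a separator
    return out;
}
```

A small main that calls toCooked(std::cin) and writes the result to std::cout would be enough to use this as a filter. Building each sentence in a buffer and only adding a space between tokens is what avoids the leading-space problem mentioned above.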
Bugs:
When you try to compile the official version on OS X (this apparently does not apply to the Tresoldi version), it will give you an intimidating list of errors. These can be fixed by taking the following steps:
- In primes.c, add the line #define ulong unsigned long
- In t3.c, delete the line #include <values.h>, which refers to an obsolete header, and add the lines #include <limits.h> and #include <float.h>
Links:
