The ICE-GB has parsed transcriptions of conversations and parsed essays. The speakers (writers) have numerous properties recorded in an index file stored in Corpora/en/ice-gb/ice-gb-2/text/sspeaker.txt. This index file can be used to group the corpus files into useful arrangements. For example, I figured out which Government Office Region each place-of-birth field belonged to and grouped all corpus files by Government Office Region.
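
The grouping idea can be sketched in a few lines of Python. This is only an illustration, not the code in the britishdialects project: it assumes the index file is tab-delimited with one speaker per line, and the sample rows, the column number, the REGION_OF table and the group_rows function are all made up for the example. Check the real layout of the index file before relying on any of it.

```python
from collections import defaultdict

# Illustrative lookup only (hypothetical); a real table would need an entry
# for every place-of-birth value that occurs in the index file.
REGION_OF = {
    "London": "London",
    "Leeds": "Yorkshire and the Humber",
}

def group_rows(lines, column, key=lambda value: value, delimiter="\t"):
    """Group delimited rows by key(value found in the given column)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) <= column:
            continue  # skip blank or short lines
        groups[key(fields[column])].append(fields)
    return dict(groups)

# Made-up rows: text id, speaker id, place of birth (assumed layout).
sample = [
    "S1A-001\tA\tLondon\n",
    "S1A-001\tB\tLeeds\n",
    "S1A-002\tA\tLondon\n",
]
by_region = group_rows(sample, 2, key=lambda place: REGION_OF.get(place, "unknown"))
print(sorted(by_region))          # ['London', 'Yorkshire and the Humber']
print(len(by_region["London"]))   # 2
```

Passing the region lookup as a key function keeps the grouping code independent of any one column, so the same function can group by any other field in the index.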

The actual corpus files are in Corpora/en/ice-gb/ice-gb-2/data/???.cor. The parses use an indentation-based tree structure, with faux XML tags surrounding each parse. To get a reader for the corpus, check out the britishdialects project from subversion and use iceread.read to read a corpus. See ~ncsander/Public/ice.py for code that will process this. Here is an example of how to use iceread:

$ svn co /Volumes/Data/svnrep/britishdialects dialect
$ cd dialect/ice
$ python2.5
>>> import iceread
>>> from util import dct # this is Nathan's utility library for dictionaries
>>> speakers = iceread.read('sspeakers.txt', 12) # group speakers by column 12
>>> speakers.keys()
['A', 'B', 'C', 'D', 'E']
>>> dct.map(len, speakers)
{'A': 68, 'C': 62, 'B': 89, 'E': 16, 'D': 22}
>>> from pprint import pprint
>>> pprint(speakers["A"][0])
[('PU,CL',
  [('SU,NP', ['NPHD,PRON']),
   ('VB,VP', ['OP,AUX', 'MVB,V']),
   ('OD,NP',
    [('NPPR,AJP', [('AJPR,AVP', ['AVHD,ADV']), 'AJHD,ADJ']),
     'NPHD,N',
     ('NPPO,PP', ['P,PREP', ('PC,NP', ['NPHD,N', 'PAUSE,PAUSE'])])]),
   ('A,PP', ['P,PREP', ('PC,NP', ['NPHD,N'])]),
   ('A,PP',
    ['P,PREP',
     ('PC,NP', [('DT,DTP', ['DTCE,ART']), 'NPHD,N', 'PAUSE,PAUSE'])])]),
 '']

Note that the code currently discards the words themselves, which are enclosed in { } in the corpus files. If you need the words, you'll have to change the function that drops them, iceread.clean. -- NathanSanders
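
As a starting point for keeping the words, here is a minimal sketch of pulling the { } part out of a single parse line. It is not the actual iceread.clean: split_node is a hypothetical helper, and the line layout (a label followed by an optional {word}) is an assumption based only on the description above, so compare it against a real .cor file first.

```python
import re

# Assumption: a node line is "LABEL {word}", and the {word} part is optional.
WORD = re.compile(r"\{([^{}]*)\}")

def split_node(line):
    """Return (label, word) for one parse line; word is None if absent."""
    match = WORD.search(line)
    if match is None:
        return line.strip(), None
    label = line[:match.start()].strip()
    return label, match.group(1).strip()

print(split_node("  NPHD,PRON {I}"))
print(split_node(" SU,NP"))
```

Returning the word alongside the label, rather than dropping it, would let leaf nodes carry both pieces of information into the tree.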

The free ICE-GB demo is actually very useful because it has exactly the same index files as the full corpus. You can download the demo (only a few megabytes), figure out how to process the index files on a small sample, and then copy your code to jones to run on the full corpus. You can also run their crappy Windows client on your own machine, which you can't do from jones. (From their documentation, it looks like TIGERsearch, except not as good; I haven't run it.)

IceCorpus (last edited 2008-08-20 15:40:49 by NathanSanders)