Search by Language

Search by Directory

Search by Corpus Name

Just Starting

A good starting point is the NLTK data set, since it has an example corpus for almost any task. (Here is the NLTK site's summary).

/Volumes/Data/Corpora/nltk-data

  • abc: Australian Broadcasting Commission 2006. Articles from the website, Rural and Science sections, in two large files.

  • alpino: Alpino Treebank (Dutch) from the Eindhoven Corpus, syntactically parsed. Website.

  • biocreative_ppi: Critical Assessment of Information Extraction Systems in Biology. 1000 randomly selected sentences from the data set of gene/protein named entity recognition with additional annotations. Information.

  • brown: The Brown Corpus, texts from a variety of literature annotated with part of speech. Manual.

  • cess_cat: CESS-CAT Treebank, part of CESS-ECE. 500K words of Catalan, parsed syntactically and semantically (with WordNet).

  • cess_esp: CESS-ESP Treebank, also part of CESS-ECE. 500K words of Spanish, parsed the same way.

  • chat80: Natural language interface for world geography from 1979 to 1982. Details.

  • cmudict: Carnegie Mellon Pronouncing Dictionary 1998. 127K entries (uppercase word, number of pronunciations, and phonemes with stress indicated). Found here.

  • conll2000: Conference on Computational Natural Language Learning shared task from 2000. Chunking data from WSJ, with sentences chunked into phrases (noun, verb, prepositional).
  • conll2002: Conference on Computational Natural Language Learning shared task from 2002. Chunking data from Spanish and Dutch newswire articles.
  • floresta: Floresta Treebank (Portuguese), parsed in Penn Treebank format. Homepage.

  • genesis: Genesis Corpus, from the Bible in various translations (English, French, German, Swedish, and Finnish) and versions. Formatting, markup, and verse numbers stripped.
  • gutenberg: Texts from gutenberg.org, mostly from the 17th to 19th centuries (authors: Jane Austen (3), William Blake (2), G. K. Chesterton (3), King James Bible, John Milton, Shakespeare (3), Walt Whitman).
  • ieer: NIST 1999 Information Extraction-Entity Recognition Corpus, data from newswire.
  • inaugural: US Presidential Inaugural Address Corpus (C-Span), addresses from 1789-2005.
  • indian: Indian Language POS-tagged Corpus (includes Bangla, Hindi, Marathi, Telugu), creating tagsets for Indian languages.
  • kimmo: Morphological parser. Data in English, Spanish, Turkish.
  • mac_morpho: MacMorpho POS-tagged Corpus (Brazilian Portuguese). Over 1M words (Folha de São Paulo daily newspaper 1994) to train taggers. Details.

  • movie_reviews: Sentiment Polarity Dataset 2.0 (2004). User reviews extracted from imdb, divided into positive and negative. Details.

  • names: 5001 female and 2943 male names, alphabetically, one per line. From here.

  • nps_chat: NPS Chat Corpus 2008. Over 10K posts of the 500K extracted from various online chat services. Website.

  • paradigms: Paradigm Corpus, collection of morphological paradigms.
  • pil: Patient Information Leaflet Corpus 2.0 (2006). 471 documents of instructions to patients regarding their medications. Details.

  • ppattach: The data used in Ratnaparkhi's (1994) Maximum Entropy Model for Prepositional Phrase Attachment. From the Wall Street Journal.

  • problem_reports: Problem Report Corpus (2006). 10K lines of POS-tagged text (Stanford Log-linear POS Tagger) from problem report files (apache, eclipse, firefox, linux, openoffice). Details.

  • propbank: Proposition Bank Corpus 1.0. 113K annotated verb tokens, annotated for arguments and adjuncts and with inflectional info. LDC Documentation.

  • qc: Question Classification Corpus, data from experiments. Details.

  • reuters: Reuters-21578 Corpus (ApteMod version). Over 10K documents from Reuters financial newswire service, divided into training and test sets. Website.

  • rte: RTE Corpus (Challenges 1, 2, 3). Recognizing Textual Entailment. Details.

  • senseval: SENSEVAL 2 Corpus. POS-tagged, sense-tagged text data for four words: "hard", "interest", "line", "serve".
  • shakespeare: Shakespeare XML Corpus sample. Selection from the complete plays of Shakespeare marked up in XML (Jon Bosak). Full Corpus.

  • sinica_treebank: Sinica Treebank Corpus Sample (Chinese). 10K parsed sentences (Academia Sinica Balanced Corpus of Modern Chinese), raw versions included. Details.

  • state_union: US Presidential State of the Union Address Corpus (C-Span). All State of the Union addresses from 1945 to 2006.

  • stopwords: Stopwords Corpus (Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish), high frequency grammatical words ignored in text retrieval. English stoplist here. Other lists from here.

  • switchboard: derived from TalkBank Switchboard Corpus, v0.1. 36 calls selected from the original 2438 of the Switchboard corpus (LDC source corpus number no longer valid, LDC93S7), complete discourse and treebank annotations.

  • timit: TIMIT Acoustic-Phonetic Continuous Speech Corpus sample (American English). 16 speakers from 8 dialect regions (1 male and 1 female from each), total 130 sentences, total 160 sentence recordings, WAV format. Original corpus Documentation.

  • toolbox: Toolbox lexical data samples with a corpus reader.
  • treebank: 5% of the Penn Treebank, 1650 sentences from the Wall Street Journal. Raw, tagged, parsed and combined data. Original corpus.

  • udhr: Universal Declaration of Human Rights Corpus. The UDHR in hundreds of languages, various encodings for some. Project overview.

  • unicode.notes: Notes for encodings and such.
  • verbnet: VerbNet lexicon v2.1. Hierarchical classification of verbs from the online lexicon.

  • webtext: Collection from scraping public web postings, various genres.
  • wordnet: WordNet 3.0, lexical reference of English, words grouped into sets of cognitive synonyms. Words linked by meaning can be navigated by a browser online. Documentation.

  • wordnet_ic: WordNet v3.0 information content files.

  • ycoe: Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE). Corpus distributed here.
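
To check which of the corpora listed above are actually present on a given machine, a directory listing is usually enough. The following is a minimal sketch using only the Python standard library. The function name `list_corpora` is hypothetical, and it assumes each corpus sits directly under the given directory as either a subdirectory or a `.zip` archive; the real layout of the nltk-data directory may differ (for example, corpora may live one level down).

```python
from pathlib import Path


def list_corpora(root):
    """Return the sorted names of corpus entries directly under `root`.

    Assumption: a corpus is either a subdirectory (e.g. brown/) or a
    .zip archive (e.g. cmudict.zip). Other files are ignored.
    """
    names = set()
    for entry in Path(root).iterdir():
        if entry.is_dir():
            names.add(entry.name)
        elif entry.suffix == ".zip":
            # Report the archive under its corpus name, without the extension.
            names.add(entry.stem)
    return sorted(names)
```

Calling `list_corpora("/Volumes/Data/Corpora/nltk-data")` would return the names found there. If NLTK itself is installed, appending the data directory to the `nltk.data.path` list is the usual way to let its corpus readers locate the data.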

Additional NLTK corpora in the directory /Volumes/Data/Corpora/nltk-data-0.2 (Everything not listed here is also in the above nltk-data directory)

  • 20_newsgroups: 20K newsgroup documents from 20 various newsgroups. Info.

  • levin: Levin Verb Classes, probably useful as a gold standard for word sense disambiguation.
  • roget: Roget's Thesaurus from Project Gutenberg.
  • semcor1.7: Concordances of semantically tagged Brown Corpus files linked to WordNet 1.7 senses.

  • wordnet: WordNet 1.7.1, a lexical reference organized into synonym sets linked by different relations. Version 3.0 above.

  • words: Words and affixes for Czech, German, English (US and British), French, Hungarian, Italian, Dutch, Slovak. Taken from OpenOffice and /usr/dict/words.

By directories

Most top-level directories are also language codes:

  • Genji: Genji monogatari, morphologically parsed, 2 MB of text
  • Gutenberg: Presently four texts (Young Knights of the Empire, The Winds of Chance, The Master of Silence, Eben Holden: A Tale of the North Country) from Project Gutenberg.
  • North American News Text Corpus: Collection of journalistic text 1994-1997 (LA Times/Wash Post, Reuters General, Reuters Financial, WSJ, NY Times), minimal markup to separate articles. LDC Documentation.

  • North American News Text Supplement: One year supplement to the above corpus.
  • ar (Arabic)
    • BAMA_v2: Buckwalter Arabic Morphological Analyzer (BAMA) v2.0. Three Arabic-English lexicon files (prefixes, suffixes, and stems making up 78K entries). Includes three morphological compatibility tables, perl script for morphological analysis, and POS tagging. LDC Documentation.

    • ar_new_transl_p1: Arabic News Translation Text, Part 1. Selected Xinhua, AFP, and An Nahar source texts from 2002-2004 (total 441K Arabic words), each translated by one of eight translation agencies. LDC Documentation.

    • ar_treebank_p1v3: Part of a project to build an Arabic treebank of 1M words (Modern standard Arabic). Part 1 v3. POS with full vocalization and syntactic analysis (734 stories, 145K words). LDC Documentation.

    • ar_treebank_p2v2: Part 2 v2. Vocalization, Lemma IDs, more specific tags for verbs/particles (501 stories, 144K words). LDC Documentation.

    • ar_treebank_p3v1: Part 3 v1. Same features as Part 2, 600 stories from An Nahar (340K words). LDC Documentation.

    • ar_treebank_p3v2: Part 3 v2. Full corpus. Syntactic and POS annotation, gloss, and word segmentation (600 stories). LDC Documentation.

    • ar_treebank_p4v1: Part 4 v1. MPG and POS annotation, gloss, and word segmentation (397 stories, 161K words). LDC Documentation.

    • arabicenglishlexicon: Edward William Lane's work of the late 1800s in 8 parts, all PDF format. From this website.

    • bbn-aub-darpa-bablyon-levantine-arabic: BBN/AUB DARPA Babylon Levantine Arabic Speech Transcripts. Spontaneous recorded speech of the Levantine dialect (Lebanon, Syria, Jordan, Palestine). LDC Documentation.

    • callhome: LDC 1997 Arabic Lexicon (encoded in ISO-8859-6, with documentation) and Arabic Transcripts (80 training transcripts, in Arabic script and romanized versions).
    • gigaword_arb_3: Newswire data from Arabic news sources (6 agencies, 547 files), all UTF-8 supported. LDC Documentation.

    • levantine-arabic-qt-training-data: Arabic CTS Levantine Fisher Training Data Sets (3: transcripts only LDC Documentation and 4: speech and transcripts LDC Documentation).

    • multiple-translation-arabic: Multiple Translation Arabic (MTA) Part 2. 100 news files, with multiple human translations for each. LDC Documentation.

    • PADT: The Prague Arabic Dependency Treebank.
    • quaran: Original Quran obtained free from this website. Versions in UTF-8 with vowel diacritics, and versions stripped of XML and line markers here also.

  • bul (Bulgarian)
    • BulTreeBank: HPSG-based Syntactic Treebank of Bulgarian. Text for analysis available from the website.

  • cmn (Mandarin Chinese)
    • ctb_v6: Chinese Treebank 6.0. 780K words (2036 text files in GBK and UTF-8 encoding), Penn Treebank format. Raw, bracketed, segmented, and POS-tagged versions. LDC Documentation.

    • gigaword_cmn_3: Four news sources, all text files converted to UTF-8. LDC Documentation.

    • tagged_cgw2: Tagged Chinese Gigaword. POS-tagged version of Chinese Gigaword 2. LDC Documentation.

  • cs (Czech)
    • czech_broadnews: 286 audio news files and transcripts from three radio stations (50 hrs) by University of West Bohemia in Pilsen (2000). LDC Documentation.

  • de (German)
    • dictionaries:
      • german-english: German-to-English word list, based on wordlist by Frank Richter, extended by Paul Hemetsberger and users of this website, 2002-2004.

    • treebanks
      • tigercorpus: The Tiger Treebank (50000 sentences) from Frankfurter Rundschau (newspaper), tagged and annotated, includes query tool Tigersearch. Homepage. This is unfortunately no longer supported.

      • tuebadz: Tuebingen Treebank of Written German. 36K sentences from Die Tageszeitung, manually annotated for syntax. Details.

  • en (English)
    • 2000_comm_dialog_act: Speakers completing a task (plan a trip). 648 dialogues and annotations. LDC Documentation.

    • 2001_comm_dialog_act: Speakers completing a task (plan a trip). 1683 dialogues and annotations. LDC Documentation.

    • 2002_speaker_recognition: NIST Speaker Recognition Evaluation. Over 9K speech files for use in text-independent speech recognition. LDC Documentation.

    • 2004_HARD_topics-and-annotations: Training and Evaluation sets for topic creation, clarification form responses, and relevance assessment. LDC Documentation.

    • ICE-EA: ICE Corpus. East Africa (Kenya and Tanzania), lexicon.
    • ICE-HK: ICE Corpus. Hong Kong, lexicon.
    • ICE-India: ICE Corpus lexicon.
    • ICE-Philippines: ICE Corpus lexicon.
    • ICE-Singapore: ICE Corpus lexicon.
    • WordNet-2.0: See version 3.0 in the NLTK data set.

    • articulation-index: Recordings of speakers pronouncing real and nonsense syllables, used to determine if subjects could correctly identify syllables in the presence of noise. LDC Documentation.

    • bbn-pronoun-coreference-and-entity-type: Manual annotation of pronoun coreference entity types for the Penn Treebank. LDC Documentation.

    • brown: Brown Corpus: Standard Sample of Present-day American English 1979. 500 tagged texts from 16 genres. Documentation. Other versions on Jones:

    • brown-10percent
    • brown-clean
    • brown-clean10percent
    • brown-shorttag
    • ccgbank: Translation of the Penn Treebank into Combinatory Categorial Grammar derivations. Also corrects some inconsistencies and errors in the original annotation. LDC Documentation.

    • cslu_kids: CSLU Kids' Speech v1.1. Spontaneous and prompted speech from 1100 children (kindergarten to 10th grade) in Oregon, transcriptions included. LDC Documentation.

    • dictionaries: Project Gutenberg text of the 1913 Webster's unabridged dictionary of English.
    • discourse-graphbank: 135 texts from AP Newswire and WSJ, annotated with coherence relations. LDC Documentation.

    • english-gigaword_2e: International Newswire documents with minimal markup from five sources. LDC Documentation.

    • english-gigaword_3: Six sources in this version, through 2006. LDC Documentation.

    • hoosier-lexicon: Hoosier Mental Lexicon. 20K words from here at Indiana University, including POS, phonemic representation, and frequency from the Brown Corpus.
    • icle: International Corpus of Learner English. Essays written by EFL learners from 14 different native languages (Bulgarian, Chinese, Czech, Dutch, Finnish, Finland-Swedish, French, German, Italian, Japanese, Polish, Russian, Spanish, Swedish), typically university level, 3rd or 4th year of English study.
    • ice-gb: IceCorpus. Great Britain, tagged and parsed, spoken and written, future versions with audio.

    • icsi_meeting: International Computer Science Institute meetings (75 from 2000-2002, 72 hrs), recorded and transcribed. LDC Transcript Documentation. LDC Audio Documentation.

    • isl_meeting: Interactive System Laboratories, Carnegie Mellon 2000-2001 (18 meetings, 10 hrs speech), recorded and transcribed. LDC Transcript Documentation. LDC Audio Documentation.

    • mde-rt-03: Metadata Extraction (MDE). Transcripts and annotations (LDC Documentation) and 60 hrs speech from telephone conversations and broadcast news (LDC Documentation).

    • mde-rt-04: Metadata Extraction (MDE). Transcripts and annotations (LDC Documentation) and speech training data (LDC Documentation).

    • nist_meet_pilot_transcr: NIST Meeting Pilot Corpus. Transcriptions of 19 meetings. LDC Documentation.

    • nltk-penntreebank-clean: version of the below Penn Treebank.
    • nltk-penntreebankcombined-clean: version of the below Penn Treebank.
    • penntreebankv3: Penn Treebank v3. WSJ, Atis, and Brown Corpus data, parsed and tagged. Homepage.

    • sbcsae_p3: Santa Barbara Corpus of Spoken American English part 3. 16 WAV format speech files, part of the American subcorpus of the ICE corpus. Natural speech from different regions, origins, ages, and ethnic and social backgrounds. LDC Documentation.

    • susanne: 130K word subset of Brown Corpus. Taxonomy and annotation for the grammar of English. Information.

    • switchboard:
    • timit: TIMIT Acoustic-Phonetic Continuous Speech Corpus (American English), recordings of 630 speakers in 8 dialect regions. LDC Documentation.

  • eo (Esperanto)
    • varioustext: They all seem to be in Windows UTF-16 format.
    • wizard_of_oz: Translated from English
  • es (Spanish)
  • he (Hebrew)
    • treebank: Hebrew Treebank Version 2.0 from the Mila Knowledge Center for processing Hebrew. 6500 sentences from the Ha'aretz daily newspaper, full word segmentation and morphosyntactic analysis. Tag set is "as close as possible to that of the English Penn Treebank." Website.

  • hrv (Croatian)
    • west-point-croatian-speech-corpus: Database of speech recordings of prompted scripts from Zagreb in 2000 and 2001 by the DFL and CTELL. LDC Documentation.

  • ja (Japanese)
    • callhome: Lexicon of 80K words, each with morphological, phonological, and stress information. LDC Documentation. Transcripts of 120 telephone conversations, five to ten minute segments. LDC Documentation.

    • tueba-js: Tuebingen Treebank of Spoken Japanese. Spontaneous conversations manually transliterated (18K sentences), stylebook for the treebank included in the directory. Details.

  • ko (Korean)
    • klex: Finite-state lexical transducer for Korean. Relies on xfst. Useful for morphological analysis and generation. LDC Documentation.

    • morph_anot_kor_text: Morphologically annotated sections of the Korean Newswire corpus; 1500 sentences from 1994 to 2000, with POS tags and morphological analysis. LDC Documentation.

  • multilingual
    • CHILDES: Child Language Data Exchange System. Conversations between children and their playmates and caretakers. Languages: American English (bloom70, bloom73, brent, weist), British English (manchester), German (weissenborn, simone), Slavic (croatian). Documentation.

    • Celex2: Second release from the Dutch Centre for Lexical Information. Plain ASCII of lexical databases of English(2.5), Dutch(3.1), and German(2.5). For each includes detailed info on orthography, phonology, morphology, syntax, and word frequency. Documentation.

    • english_chinese_treebank: English Chinese Translation Treebank v1.0. 325 files (146K words) of news from Xinhua News Agency (the Xinhua data in Chinese Treebank 5.0), translated, POS-tagged, treebanked. LDC Documentation.

    • europarl: Parallel translations of the proceedings of the European Parliament. 25-30M words per language pair. Comes with a script to create a parallel corpus from any two of the languages (French, Italian, Spanish, Portuguese, English, Dutch, German, Danish, Swedish, Greek, Finnish). Documentation.

    • extr_multiling_train: TIDES Extraction (ACE) Multilingual Training Data (2003). Broadcast and newswire data from 2000. Arabic (over 40K words), Chinese (over 90K), and English (over 90K). LDC Documentation.

    • verbmobil: Parallel conversations with translation (English <--> German and German <--> Japanese). Project overview.

  • nic (Mawukakan)
    • mawukakan-lexicon: A language of the Mande group of the Niger-Congo family. Mawukakan-English and Mawukakan-French lexicon (UTF-8 and supports Doulos SIL). LDC Documentation.

  • nltk-data-0.2: see above under "Just Starting."
  • po (Polish)
    • ipi-pan: IPI-PAN Corpus of Polish. Over 250M segments by the Linguistic Engineering Group at ICS PAS, morphosyntactically annotated. Homepage. A few Perl scripts to test TNT's ability to POS tag Polish here.

  • primate
    • vervet-monkey-calls: 30 hours of vervet monkey calls collected in 1977 and 1978. From the Talkbank Ethology Corpus. 60 files, 5GB, all in WAV format, 60 annotation files of selected audio. LDC Documentation.

  • sv (Swedish)
    • talbanken05: A Swedish treebank of 300K words, half from text and half from speech. Can be freely copied. Modernized version of Talbanken76. Lexical, phrase structure, and dependency structure annotations. Details.

  • tr (Turkish)
    • metuTreebank: Metu-Sabanci Turkish Treebank (may be an alpha or beta version). 7262 grammatical sentences, morphologically and syntactically annotated from the METU Turkish corpus. Includes a viewer, evidently corrected at some point. Website.

  • zh (Chinese)
    • callhome: Lexicon of 44K words with phonological and morphological information, as well as word frequencies. LDC Documentation. Transcripts of 120 telephone conversations in 5-10 minute segments by native speakers, includes demographic information. LDC Documentation.

    • chinese-english-named-entity-lists: Bidirectional listing of proper names from news sources, ranging from 7K to nearly 300K words per pairing. LDC Documentation.

    • chinese-english-news-magazine-parallel-text: News stories and their translations from Sinorama Magazine, Taiwan 1976-2004 (6336 story pairs, corpus aligned at sentence level, Chinese in Big5). LDC Documentation.

    • chinese-gigaword-2e: Three news sources, file additions to version 1. LDC Documentation.

    • chinese-news-translation-text-p1: Selected texts from Xinhua and AFP (474K Chinese characters) translated between 2003 and 2005 by one of seven agencies. LDC Documentation.

    • chinese-proposition-bank: Annotations for files from the first 250K words of Chinese Treebank 5.0 (37K propositions total, no aux verbs). LDC Documentation.

    • chinese-treebank4: Part of the Penn Chinese Treebank project to create a 500K word corpus of Chinese. Articles from Xinhua, HKSAR, and Sinorama (404K words, 838 files GB encoded, follows English Penn Treebank format). LDC Documentation.

    • chinese-treebank5: Same sources (507K words, 890 files GB encoded). Files in raw, bracketed, segmented, and POS-tagged versions. LDC Documentation.

    • hkust-mandarin-telephone: Recordings and transcriptions of telephone conversations in Mandarin from mainland China, although collected by the Hong Kong University of Science and Technology. No phonetic or phonological transcription. 24 calls in WAV format, speaker demographics included. LDC Documentation. Corresponding transcripts, Chinese encoded in GBK. LDC Documentation.

    • mtc_p3: Multiple Translation Chinese Part 3. 100 news stories from Xinhua and AFP, each translated by four teams. LDC Documentation.
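
The entries above mention several text encodings (ISO-8859-6, GBK, GB, Big5, UTF-16, UTF-8), so files from different corpora often need to be normalized before they can be processed together. The following is a minimal sketch of converting one file to UTF-8 with the Python standard library. The function name `transcode_to_utf8` is hypothetical, and the caller is assumed to know the source encoding from the corpus documentation; nothing here detects it automatically.

```python
from pathlib import Path


def transcode_to_utf8(src, dst, source_encoding):
    """Read `src` using `source_encoding` and write it to `dst` as UTF-8.

    `source_encoding` is any Python codec name, for example "gbk",
    "big5", "iso-8859-6", or "utf-16" (which consumes a leading BOM).
    Returns the number of characters converted.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)
```

For example, `transcode_to_utf8("in.txt", "out.txt", "utf-16")` would convert one of the Esperanto texts, assuming they really are UTF-16 as the listing suggests. A file in a different encoding than the one given will typically raise `UnicodeDecodeError` rather than convert silently, although some wrong guesses decode without error and produce garbage.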

By language

  • Multilingual
    • CHILDES: Child Language Data Exchange System. Conversations between children and their playmates and caretakers. Languages: American English (bloom70, bloom73, brent, weist), British English (manchester), German (weissenborn, simone), Slavic (croatian). Documentation.

    • Celex 2: Second release from the Dutch Centre for Lexical Information. Plain ASCII of lexical databases of English(2.5), Dutch(3.1), and German(2.5). For each includes detailed info on orthography, phonology, morphology, syntax, and word frequency. Documentation.

    • Europarl: Parallel translations of the proceedings of the European Parliament. 25-30M words per language pair, includes a script to create a parallel corpus from any two languages (French, Italian, Spanish, Portuguese, English, Dutch, German, Danish, Swedish, Greek, Finnish). Documentation.

    • English Chinese Translation Treebank v1.0: 325 files (146K words) of news from Xinhua News Agency (the Xinhua data in Chinese Treebank 5.0), translated, POS-tagged, treebanked. LDC Documentation.

    • TIDES Extraction (ACE) Multilingual Training Data (2003): Broadcast and newswire data from 2000. Arabic (over 40K words), Chinese (over 90K), and English (over 90K). LDC Documentation.

    • VerbMobil: Parallel conversations with translation (English <--> German and German <--> Japanese). Project overview.

  • African
    • Mawukakan: Mawukakan-English and Mawukakan-French lexicon (UTF-8 and supports Doulos SIL). LDC Documentation.

  • Arabic
    • Arabic News Translation Text, Part 1: Selected Xinhua, AFP, and An Nahar source texts from 2002-2004 (total 441K Arabic words), each translated by one of eight translation agencies. LDC Documentation.

    • Arabic Treebanks: Part of a project to build an Arabic treebank of 1M words (Modern standard Arabic)
      • Part 1 v3: POS with full vocalization and syntactic analysis (734 stories, 145K words). LDC Documentation.

      • Part 2 v2: Vocalization, Lemma IDs, more specific tags for verbs/particles (501 stories, 144K words). LDC Documentation.

      • Part 3 v1: Same features as Part 2, 600 stories from An Nahar (340K words). LDC Documentation.

      • Part 3 v2: Full corpus. Syntactic and POS annotation, gloss, and word segmentation (600 stories). LDC Documentation.

      • Part 4 v1: MPG and POS annotation, gloss, and word segmentation (397 stories, 161K words). LDC Documentation.

    • Arabic English Lexicon: Edward William Lane's work of the late 1800s in 8 parts, all PDF format. From this website.

    • Buckwalter Arabic Morphological Analyzer (BAMA) v2.0: Three Arabic-English lexicon files (prefixes, suffixes, and stems making up 78K entries). Includes three morphological compatibility tables, perl script for morphological analysis, and POS tagging. LDC Documentation.

    • BBN/AUB DARPA Babylon Levantine Arabic Speech Transcripts: Spontaneous recorded speech of the Levantine dialect (Lebanon, Syria, Jordan, Palestine). LDC Documentation.

    • CALLHOME: LDC 1997
      • LDC Arabic Lexicon. Encoded in ISO-8859-6 with documentation.
      • LDC Arabic Transcripts. 80 training transcripts, in Arabic script and romanized versions.
    • Arabic Gigaword 3: Newswire data from Arabic news sources (6 agencies, 547 files), all UTF-8 supported. LDC Documentation.

    • Arabic CTS Levantine Fisher Training Data Sets: Levantine dialect
    • Multiple Translation Arabic (MTA) Part 2: 100 news files, with multiple human translations for each. LDC Documentation.

    • Prague Arabic Dependency Treebank (PADT).
    • Quran: Original obtained free from this website. Versions in UTF-8 with vowel diacritics, and versions stripped of XML and line markers here also.

  • Bulgarian
    • BulTreeBank: HPSG-based Syntactic Treebank of Bulgarian. Text for analysis available from the website.

  • Chinese
    • CALLHOME (Mandarin):
      • Lexicon: 44K words with phonological and morphological information, as well as word frequencies. LDC Documentation.

      • Transcripts: 120 telephone conversations in 5-10 minute segments by native speakers, includes demographic information. LDC Documentation.

    • Chinese <--> English Named Entity Lists v1.0: Bidirectional listing of proper names from news sources, ranging from 7K to nearly 300K words per pairing. LDC Documentation.

    • Chinese English News Magazine Parallel Text: News stories and their translations from Sinorama Magazine, Taiwan 1976-2004 (6336 story pairs, corpus aligned at sentence level, Chinese in Big5). LDC Documentation.

    • Chinese Gigaword 2 (Mandarin): Three news sources, file additions to version 1. LDC Documentation.

    • Chinese Gigaword 3 (Mandarin): Four news sources, all text files converted to UTF-8. LDC Documentation.

    • Tagged Chinese Gigaword: POS-tagged version of Chinese Gigaword 2. LDC Documentation.

    • Chinese News Translation Text Part 1: Selected texts from Xinhua and AFP (474K Chinese characters) translated between 2003 and 2005 by one of seven agencies. LDC Documentation.

    • Chinese Proposition Bank 1.0: Annotations for files from the first 250K words of Chinese Treebank 5.0 (37K propositions total, no aux verbs). LDC Documentation.

    • Chinese Treebank 4.0: Part of the Penn Chinese Treebank project to create a 500K word corpus of Chinese. Articles from Xinhua, HKSAR, and Sinorama (404K words, 838 files GB encoded, follows English Penn Treebank format). LDC Documentation.

    • Chinese Treebank 5.0: Same sources (507K words, 890 files GB encoded). Files in raw, bracketed, segmented, and POS-tagged versions. LDC Documentation.

    • Chinese Treebank 6.0: 780K words (2036 text files in GBK and UTF-8 encoding), Penn Treebank format. Raw, bracketed, segmented, and POS-tagged versions. LDC Documentation.

    • HKUST Mandarin: Part of a project to collect 200 hrs of phone conversations by the Hong Kong University of Science and Technology.
      • Telephone Speech Part 1: 24 calls in WAV format, speaker demographics included. LDC Documentation.

      • Telephone Transcripts Part 1: Corresponding transcripts, Chinese encoded in GBK. LDC Documentation.

    • Multiple Translation Chinese (MTC) Part 3: 100 news stories from Xinhua and AFP, each translated by four teams. LDC Documentation.

  • Croatian
    • West Point Croatian Speech Corpus: Database of speech recordings of prompted scripts from Zagreb in 2000 and 2001 by the DFL and CTELL. LDC Documentation.

  • Czech
    • Czech Broadnews: 286 audio news files and transcripts from three radio stations (50 hrs) by University of West Bohemia in Pilsen. LDC Documentation.

  • English
    • International Corpus of English: Written and spoken texts, 1M words. See IceCorpus and this site.

      • ICE-GB: Great Britain, tagged and parsed, spoken and written, future versions with audio.
      • ICE-EA: East Africa (Kenya and Tanzania), lexicon.
      • ICE-HK: Hong Kong, lexicon.
      • ICE-India: lexicon.
      • ICE-Philippines: lexicon.
      • ICE-Singapore: lexicon.
    • Brown Corpus: Standard Sample of Present-day American English 1979. 500 tagged texts from 16 genres. Documentation. Other versions on Jones:

      • Brown 10percent
      • Brown clean
      • Brown clean 10percent
      • Brown shorttag
    • Susanne: 130K word subset of Brown Corpus. Taxonomy and annotation for the grammar of English. Information.

    • Penn Treebank v3: WSJ, Atis, and Brown Corpus data, parsed and tagged. Homepage.

      • NLTK Penn Treebank versions (clean and combined clean) are here too.
    • North American News Text Corpus (and supplement): collection of journalistic text 1994-1997 (LA Times/Wash Post, Reuters General, Reuters Financial, WSJ, NY Times), minimal markup to separate articles. LDC Documentation.

    • Comm Dialog Act: User dialogues for use in improving speech-enabled interfaces, subject was planning trips.
    • 2002 Speaker Recognition: NIST Speaker Recognition Evaluation. Over 9K speech files for use in text-independent speech recognition. LDC Documentation.

    • 2004 HARD Topics and Annotations: Training and Evaluation sets for topic creation, clarification form responses, and relevance assessment. LDC Documentation.

    • WordNet 2.0: See version 3.0 in the NLTK data set.

    • Articulation Index: Recordings of speakers pronouncing real and nonsense syllables, used to determine if subjects could correctly identify syllables in the presence of noise. LDC Documentation.

    • BBN Pronoun Coreference and Entity Type: Manual annotation of pronoun coreference entity types for the Penn Treebank. LDC Documentation.

    • CCGBank: Translation of the Penn Treebank into Combinatory Categorial Grammar derivations. Also corrects some inconsistencies and errors in the original annotation. LDC Documentation.

    • CSLU Kids' Speech v1.1: Spontaneous and prompted speech from 1100 children (kindergarten to 10th grade) in Oregon, transcriptions included. LDC Documentation.

    • Dictionary: Project Gutenberg text of the 1913 Webster's unabridged dictionary of English.
    • Discourse Graphbank: 135 texts from AP Newswire and WSJ, annotated with coherence relations. LDC Documentation.

    • English Gigaword 2: International Newswire documents with minimal markup from five sources. LDC Documentation.

    • English Gigaword 3: Six sources in this version, through 2006. LDC Documentation.

    • Hoosier Mental Lexicon: 20K words from here at Indiana University, including POS, phonemic representation, and frequency from the Brown Corpus.
    • ICLE: International Corpus of Learner English. Essays written by EFL learners from 14 different native languages (Bulgarian, Chinese, Czech, Dutch, Finnish, Finland-Swedish, French, German, Italian, Japanese, Polish, Russian, Spanish, Swedish), typically university level, 3rd or 4th year of English study.
    • ICSI Meeting: International Computer Science Institute meetings (75 from 2000-2002, 72 hrs), recorded and transcribed. LDC Transcript Documentation. LDC Audio Documentation.

    • ISL Meeting: Interactive Systems Laboratories, Carnegie Mellon 2000-2001 (18 meetings, 10 hrs speech), recorded and transcribed. LDC Transcript Documentation. LDC Audio Documentation.

    • Metadata Extraction (MDE): Part of the DARPA Effective, Affordable, Reusable Speech-to-Text program (EARS).
    • NIST Meeting Pilot Corpus: Transcriptions of 19 meetings. LDC Documentation.

    • PARC 700 Dependency Bank: 700 sentences, randomly extracted from section 23 of the Penn Treebank, parsed with an LFG grammar, with gold-standard annotations of grammatical dependency relations. Homepage.

    • Proposition Bank Corpus 1.0: 113K verb tokens annotated for arguments and adjuncts, also with inflectional information. LDC Documentation.

    • Santa Barbara Corpus of Spoken American English part 3: 16 WAV format speech files, part of the American subcorpus of the ICE corpus. Natural speech from different regions, origins, ages, and ethnic and social backgrounds. LDC Documentation.

    • Switchboard:
    • TIMIT Corpus (American English): TIMIT Acoustic-Phonetic Continuous Speech Corpus, recordings of 630 speakers in 8 dialect regions. LDC Documentation.

    • Saarbruecken Corpus of Spoken English: Selections
      • Indianapolis Interviews: Professor Norrick interviews senior citizens, age 80 and up. PDF.
      • Jokes: Transcripts of Professor Norrick and his students at Northern Illinois University and Saarland University. PDF.
      • Stories: Transcripts of Professor Norrick and his students at Northern Illinois University. PDF.
    • Project Gutenberg: Four texts from the /Volumes/Data/Corpora directory
      • Young Knights of the Empire, The Winds of Chance, The Master of Silence, Eben Holden: A Tale of the North Country
  • Esperanto
    • Wizard of Oz: translated from English.
    • Various texts: all seem to be in UTF-16 encoding.
  • German
    • Dictionaries: German-to-English word list, based on wordlist by Frank Richter, extended by Paul Hemetsberger and users of this website, 2002-2004.

    • Treebanks:
      • Tigercorpus: The Tiger Treebank (50000 sentences) from Frankfurter Rundschau (newspaper), tagged and annotated, includes the query tool Tigersearch. Homepage. This is unfortunately no longer supported.

      • Tuebingen Treebank of Written German (tuebadz): 36K sentences from Die Tageszeitung, manually annotated for syntax. Details.

  • Hebrew
    • Hebrew Treebank Version 2.0: From the Mila Knowledge Center for processing Hebrew. 6500 sentences from the Ha'aretz daily newspaper, full word segmentation and morphosyntactic analysis. Tag set is "as close as possible to that of the English Penn Treebank." Website.

  • Japanese
    • CALLHOME:
      • Lexicon: 80K words, each with morphological, phonological, and stress information. LDC Documentation.

      • Transcripts: 120 telephone conversations, five to ten minute segments. LDC Documentation.

    • Tuebingen Treebank of Spoken Japanese (tueba-js): Spontaneous conversations manually transliterated (18K sentences), stylebook for the treebank included in the directory. Details.

  • Korean
    • Klex (Finite-State Lexical Transducer for Korean): Relies on xfst, useful for morphological analysis and generation. LDC Documentation.

    • Morphologically Annotated Korean Text: Collection of text extracted from the Korean Newswire Corpus (1500 sentences from 1994 to 2000) with POS tags and morphological analysis. LDC Documentation.

  • Polish
    • IPI-PAN Corpus: Over 250M segments by the Linguistic Engineering Group at ICS PAS, morphosyntactically annotated. Homepage.

  • Spanish
  • Swedish
    • Talbanken05 v1.1: Modernized version of Talbanken76. Swedish treebank of 300K words, written and spoken language. Lexical, phrase structure, and dependency structure annotations, all in ISO-8859-1 encoding. Details.

  • Turkish
    • Metu-Sabanci Turkish Treebank: (may be an alpha or beta version) 7262 grammatical sentences, morphologically and syntactically annotated from the METU Turkish corpus. Includes a viewer, evidently corrected at some point. Website.

  • Primate
    • Vervet monkey calls: From the Talkbank Ethology Corpus. 60 files, 5GB, 30 hours all in WAV format, 60 annotation files of selected audio. LDC Documentation.
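
Several entries above note a non-UTF-8 encoding (the Esperanto texts seem to be UTF-16, and Talbanken05 is in ISO-8859-1). The following is a minimal Python sketch of how such files might be read or converted to UTF-8 before further processing. The function names `read_corpus_text` and `convert_to_utf8` are hypothetical helpers invented for illustration, not part of NLTK or of any corpus distribution, and the encoding must be supplied by the caller based on the notes in this list.

```python
from pathlib import Path


def read_corpus_text(path, encoding):
    """Decode one corpus file using an explicitly named encoding.

    `encoding` is an assumption supplied by the caller, e.g. "utf-16"
    or "iso-8859-1"; nothing here detects the encoding automatically.
    """
    return Path(path).read_text(encoding=encoding)


def convert_to_utf8(src, dst, encoding):
    """Re-encode the file at `src` as UTF-8 and write it to `dst`.

    Returns the number of characters in the decoded text.
    """
    text = read_corpus_text(src, encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)
```

Calling `convert_to_utf8(src, dst, "utf-16")` on a file writes a UTF-8 copy and leaves the original untouched. If decoding raises a `UnicodeDecodeError`, the assumed encoding is probably wrong for that file.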

By corpus

  • Brown Corpus
  • Penn Treebank
    • v3 in the English directory (en)
  • NLTK Corpora
    • NLTK packages v0.2 and v0.9.4 at the top of the page.
  • LDC Gigaword Corpora
    • In Arabic (3e), Chinese (2e, 2e-tagged, 3e), and English (2e, 3e).
  • CALLHOME
    • In Arabic, Chinese, Japanese and Spanish.
  • International Corpus of English (See IceCorpus for notes about its organization)

    • ICE-GB (Great Britain), ICE-HK (Hong Kong), ICE-EA (East Africa: Kenya and Tanzania), ICE-India, ICE-Philippines, ICE-Singapore.
  • Others ... this list is not complete.

JonesCorpora (last edited 2009-06-12 20:46:47 by ScottLedbetter)