Zambezia (1993), XX (ii).

RESEARCH REPORT

CORPUS OF ZIMBABWEAN ENGLISH AT THE UNIVERSITY OF ZIMBABWE COMPUTER CENTRE

W. E. LOUW
Department of English, University of Zimbabwe

and

JOSEPHINE JORDAN
Department of Psychology, University of Zimbabwe

Abstract

A corpus of Zimbabwean English comprising Dawson's Structures and Skills in English and Grant et al., English for Zimbabwe: an English course for secondary schools is available on computer tapes at the University of Zimbabwe Computer Centre. Also on tape are the dictionary file, which lists in alphabetical order all the words contained in the books, and a frequency file, which lists the words in order of use. The corpus provides a readily accessible source of lexical items encountered in Zimbabwe secondary schools and demonstrates the employment of these lexical items in grammatical structure and idiom.

BETWEEN OCTOBER 1987 and December 1993 a partial corpus of Zimbabwean English was captured and installed at the University of Zimbabwe Computer Centre. The corpus contains two sets of secondary school English Language textbooks, each comprising four volumes. The series are Dawson, Structures and Skills in English and Grant et al., English for Zimbabwe.

The corpus is housed on a 2 400-foot magnetic tape; when fully loaded it occupies 3,5 megabytes and has 592 994 words (tokens) of running text. On the same tape are two lists for each book: a dictionary file and a frequency file. A dictionary file lists, in alphabetical order, all the distinct words in the corpus, otherwise known as types. The frequency file lists the same words in descending order of frequency. Examples of dictionary files and frequency files are given in Table I.

Even today language corpora are not a very widespread phenomenon. The reason for this is that, until recently, corpora have had to be developed by keying in material rather than by electronic loading.
There are no more than four well-known corpora of English language in the world, the oldest of these being the Brown University corpus (Francis and Kucera, 1982) and the LOB (Lancaster, Oslo, Bergen) corpus (Johannson, 1978). Neither of these two exceeds one million words. There is a third covering Indian English, the Kolhapur corpus (Shastry, 1988). The most recent, and by far the largest, corpus is the COBUILD (Collins Birmingham University International Language Database) which comprises 7,3 million words of running text in the main corpus, and which can be extended to 21 million words of running text using the reserve corpora. The COBUILD corpora took from 1978 to 1984 to load, mostly by Optical Character Recognition (OCR) techniques using the KDEM (Kurzweil Data Entry Machine) at Birmingham University. One and a half million words of the COBUILD corpus comprise the greatest existing corpus of spoken text (Sinclair, 1987b). This sub-corpus had to be keyed in, as voice recognition loading is not yet sufficiently sophisticated to load spoken text reliably.

Table I
EXAMPLES OF WORD FREQUENCIES & DICTIONARY FILE LISTINGS OF ENGLISH FOR ZIMBABWE BOOK I

Rank  Word  No. of     Rank  Word        No. of     Entry  Word            No. of
            Occurr.                      Occurr.    No.                    Occurr.
   1  the   2 930        39  write       191            1  a               1 463
   2  to    1 520        49  words       176            2  aardwolf            3
   3  a     1 463        59  sentences   153            3  aardwolfs           1
   4  in    1 325        62  following   150            4  abaluhya            4
   5  of    1 163        63  read        150            5  abbreviated         2
   6  and   1 057        92  paragraph    90            6  abbreviation        3
   7  you   1 057                                       7  abbreviations       5
   8  is      760                                       8  abc                 1
   9  it      577                                       9  abel                1
  10  he      552                                      10  ability             4

The first column indicates the frequency file rank of the word in the list. The second column constitutes a continuation of the ranking of the list in the first column. The final column indicates the frequency of the first ten words of the dictionary file. Occurr. = Occurrences.

The Zimbabwean corpus differs from the above corpora in that it is a specialized corpus drawn from the major English language textbooks encountered in Zimbabwean secondary schools.
All the corpora mentioned above have the standard language as their main concern and draw their texts from many carefully sampled provenances. This was especially the case with the COBUILD corpus, as the object of COBUILD was to produce the first computer-concordanced dictionary of English (Sinclair, 1987a).

The objective in loading the corpus of Zimbabwean secondary school materials was to provide a resource of interest to academics, language teachers and curriculum planners, psychologists, writers, publishers, and those interested in specific areas, such as research in reading. Typical research objectives might be investigations into the mismatch between these materials and authentic Zimbabwean spoken and written text; comparison of key language terms in these books with, for example, the same terms in the Zimbabwe Hansard, The Herald, news bulletins, interviews and the like; and comparison of the English curriculum and its coverage by the Zimbabwean textbooks with those, for example, in a one-million-word sub-corpus held by Renouf within COBUILD's research division.¹

This report is designed to afford Zimbabwean researchers in many fields some insight into the contents of the corpus and the techniques of corpus linguistics, both of which may be of interest in their disciplines.

CORPUS DEVELOPMENT

As funding for the project was too limited to provide a general corpus of Zimbabwean English drawn from all textual provenances, spoken and written, it was decided that the first step in Zimbabwean corpus development should be the creation of a specialized language corpus. The series included in the corpus were chosen from the textbooks prescribed by the Ministry of Education in 1986 for Forms One to Four in secondary schools. For a full description of the sampling techniques for general corpus development, see Renouf (1987).

The appearance of the selected textbooks was not very promising for successful loading on to the computer using OCR techniques.
Both sets of materials were printed on local Mutare bond paper, which affords insufficient contrast for OCR scanning. In addition the materials were broken up in two ways. Firstly, they contained many illustrations, which disrupt the digitizing of the text. KDEM software is self-educating and runs more quickly as it becomes used to the fonts involved. Each time the scanner encounters illustrative material, this process has to be reinitiated.

The second way in which the texts were broken up is that pages were printed in two columns. This means that each column has to be masked off by the KDEM operator during loading. If this is not done, the KDEM 'eye' runs linearly across the page and incorporates the text for both columns into a single span. Newer KDEM software avoids this problem.

The texts were loaded at Birmingham University and edited within MULTICS screen edit to correct the numerous scanner errors. Editing took [...] help of global replacement of consistently misread [...] or, where there was no consistency, strings from [...] to be called up and corrected piecemeal. This [...], working at off-peak times when computer [...].

¹ Antoinette Renouf, English Language Research Unit, Birmingham University. Since the corpus was installed it has been used successfully to support research into the difficulty of vocabulary tests (Jordan, 1989) and the use of personality words in Zimbabwean English (Jaynes, 1991).

Once the copy had been cleaned up by editing, it was possible to draw word-frequency and alphabetical lists for both sets of books.
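The drawing of an alphabetical list and a word-frequency list from cleaned text can be illustrated with a minimal Python sketch. It is not the procedure used for the corpus itself: the function names `tokenize` and `build_lists`, the tokenizing rule (lower-cased runs of letters with an optional internal apostrophe) and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens (assumed rule: letters,
    with an optional internal apostrophe as in "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def build_lists(text):
    """Return (dictionary_list, frequency_list) as lists of (word, count)."""
    counts = Counter(tokenize(text))
    # Dictionary list: every type once, in alphabetical order.
    dictionary_list = sorted(counts.items())
    # Frequency list: descending count, ties broken alphabetically.
    frequency_list = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return dictionary_list, frequency_list


# Invented sample text, not taken from the corpus.
sample = "Read the paragraph. Write the following words in the sentences."
dictionary_list, frequency_list = build_lists(sample)

for word, count in frequency_list[:3]:
    print(word, count)
```

Both lists come from one count of the tokens, so they always contain the same types and differ only in how they are sorted.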
These lists would disclose how effective the scanning and editing processes had been. The list generation processes operate on the machine-readable text once it has been converted into a string of single items, each occupying one line or record. In UNIX operating systems (available at the University of Zimbabwe Computer Centre), the 'uniq' facility provides an intermediate stage for the generation of these lists and affords the option of word-frequency counting. It is a simple procedure to reorganize the list in alphabetical order or in order of descending frequency.

A tape labelled CORP01 with the full texts and their dictionary and frequency files is available at the University of Zimbabwe Computer Centre or from the authors. UNIX manuals are available from the manager.

GENERAL CHARACTERISTICS OF CORPORA

Until the advent of computational linguistics, there was a generally held belief that the words of a language fall into two separable and monolithic categories: that there are full words and form words (a distinction which can be traced as far back as Aristotle). Full words are generally words that contain some obvious semantic content, such as a referent, e.g. 'tree' or 'house', or are verbal in character, such as 'fly' or 'run'. Form words were believed to be mainly grammatical in function, e.g. 'the', 'a', 'if', 'then', etc. The machine-based retrieval of full words from large corpora brought with it the surprise that, although these forms appeared to the intuition to have full lexical status in all of their uses, the more frequent the term, the more those uses shaded towards grammatical function. This phenomenon of 'washing out' of meaning has been described as progressive delexicalization by Sinclair (1987c). The most illuminating example offered by Sinclair is the term 'take', for which there is only one intuitively recoverable full meaning but some forty-six progressively delexicalized meanings cited in the COBUILD dictionary (Sinclair, 1987a, p. 1489).
At the far end of the progressive delexicalization scale, the term 'take' is readily interchangeable with grammatical terms of form word status. For example, it is not clear whether the status of 'take' in 'take a look at this book' is that of a full word or form word under the traditional labelling, given that it can be replaced by 'have a look' or omitted altogether as in 'look at the book'. Thus the distinction between full and form words is more in the nature of a continuum, or what linguists call a 'cline', and a separate judgement can be made about each form of 'take', e.g. 'to take a bus' will be coded as more full than 'to take a bath'. The latter use forms the basis of much poor humour in pantomime and situational comedy, for example, as in the response 'where shall I take it to?'

At first sight, a word-frequency list appears to divide itself between the form words (most frequent) and the full words (those which make up the tail). However, in a general corpus, such as the COBUILD corpus, research has discovered that the 2 000 most frequent words are those which suffer the most progressive delexicalization. The COBUILD dictionary is the first dictionary to make a detailed computer-assisted grammatical and lexicographical description of the terms. Furthermore, these 2 000 terms form the basis of a lexical syllabus developed by COBUILD and incorporated into language-teaching materials for the study of English. The researchers see these 2 000 words as being so powerful a teaching tool that the emphasis in language teaching can now be swung with confidence away from grammar and into vocabulary. Mastery of 2 000 terms, in all of their uses, will carry the fundamentals of English grammar with them.
In the case of COBUILD there are, of course, no made-up examples. The new COBUILD language materials (Willis and Willis, 1988), in common with the dictionary, draw all their examples from authentic text, although in the case of Willis and Willis such texts are often elicited from informants who become authentic characters within the texts; a far cry from the John and Mary of made-up examples.

It is this observation which makes the Zimbabwean corpus of particular interest. The Zimbabwean corpus is the product of the intuition of Zimbabwean materials writers. It carries with it the expectation that the lexical forms set out will cluster around full intuitive meaning, rather than delexical meaning, and that where sentences are made-up examples, for use in pattern practice drills, they will lack what Sinclair (1988) calls 'naturalness'. Sinclair offers the following three sentences for coding as natural or unnatural:

1. We searched.
2. We searched all night.
3. We searched all night for the missing climbers.

Of the three, coders identify sentence two quite readily as the most natural, and sentence three as the sentence most likely to feature in language-teaching materials, where the examples have been made up. Made-up examples are always self-contextualizing to a degree which is unnatural in authentic text.

Because made-up examples are never a product of genuine interaction, such interaction as does take place using them will generally take on a ritualized form. These rituals are readily discernible if one studies a word-frequency list from specialized corpora such as English Language Teaching (ELT). Very often, high up in the envelope of grammatical words, words such as 'paragraph' and 'groups' will be prominent.
In other words, within such corpora, some words become frequent because of their subject or ritual use and consequently move up into the grammatical envelope, while others, because of the way the text is divorced from authenticity, move down.

LINGUISTIC CHARACTERISTICS OF THE CORPUS AND ITS RESEARCH POTENTIAL

The specific characteristics of a specialized corpus will be immediately apparent to those with any experience in corpus linguistics. However, to the uninitiated the characteristics of such corpora often hold surprises. For example, it comes as a surprise to teachers of English that the language of classroom and textbook management and organization features so prominently in the word-frequency lists. In the list partly presented in Table I, as one descends from the most frequent form 'the' ('the' makes up about four per cent of all texts in English) and reaches words like 'following', 'words', 'write', 'sentences', 'read', it is remarkable to reflect how often these words appear in the texts and yet how infrequently their meanings are taught directly in the classroom.

A case in point is the word 'paragraph'. Teachers often express surprise that textbooks can contain more than 90 occurrences of a word, the meaning of which we all take for granted.
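How a word such as 'paragraph' is actually used can be checked by listing each occurrence with a few words of context on either side, a key-word-in-context (KWIC) listing. The Python sketch below is a minimal illustration only: the function name `kwic`, the window of three words and the sample text are assumptions, and the sample is invented rather than drawn from the textbooks.

```python
import re


def kwic(text, keyword, width=3):
    """Return one line per occurrence of `keyword`, showing up to `width`
    words of context on each side, with the keyword in square brackets."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample text, not taken from the corpus.
sample = ("Read the paragraph below. Then write a paragraph of your own "
          "using the following words.")
hits = kwic(sample, "paragraph")

for line in hits:
    print(line)
```

Read down the page, such a listing shows at a glance whether a word appears mainly in instructions to the pupil or in the reading passages themselves.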
There would be an argument for incorporating definitions of such basic subtechnical terms into the textbooks themselves. The list for inclusion could, of course, be furnished from an analysis of the corpus.

Corpora are equally revealing in their application to other disciplines. For example, researchers in reading have at their disposal a readily accessible source not only of lexical items encountered in the secondary school but also of the involvement of those items in grammatical structures and idioms. Indeed, grammatical structures can be assembled in a profile form for the entire text under discussion: for example, the percentage of active declarative sentences in relation to more complex forms such as relatives and passives. Information of this kind involves an examination not only of frequency lists but also of the transitional probabilities of each item. A concordancing programme or 'grep' within UNIX can provide this facility. The information discovered in this way can be cross-compared in reading research with other texts that the researcher might wish to match against the corpus.

Contemporary ethnographers do not leave an encounter with the corpus disappointed. There is strong prima facie evidence, in purely numerical terms, of sexism which must have a pernicious and profound effect in the first four years of secondary school.

Quite apart from the fact that normal expository text produces a higher proportion of the form 'he' than the form 'she', no amount of reasoning in that direction could justify the discrepancies set out in Table II. Furthermore, if the actual spans associated with these forms are sought (Louw, in press), the case is instantly decided. The most frequent non-grammar word collocate of 'girl', in Louw's research, is the word 'marry'.
There is no comparable form for 'boy', but the most frequent non-grammar word collocate for 'boy' is 'wonder', emanating, in its frequency, from a long story, sexist in character, entitled 'Wonder Boy'.

These are only some of the many research applications to which the Zimbabwean English Language corpus may finally be applied. The corpus will, doubtless, be enlarged into other provenances as OCR technology develops and, indeed, as Zimbabwean texts become available on floppy disks or CD-ROMs.

Table II
WORD-FREQUENCIES IN THE EIGHT BOOKS INCLUDED IN THE CORPUS

Frequency of selected personal pronouns
Text  He  She
12345678  552702828626441674692539180259161199174777169

Frequency of selected nouns
Text  Man  Woman  Boy  Girl
12345678  110105136627560911014263394210222120442924918732537172115134710

Acknowledgements

The development of the corpus was undertaken with funds from the University of Zimbabwe Research Board and the British Council, with copyright permission kindly supplied by Longman Zimbabwe and College Press. The loading and analysis of the text took place at Birmingham University, for which special acknowledgement is given to Professor John Sinclair of the Department of English and Antoinette Renouf and Jeremy Clear of the English Language Research Unit.

References

DAWSON, D. 1984 Structures and Skills in English: Book 4 (Harare, College Press).
DAWSON, D. 1985 Structures and Skills in English: Book 1 (Harare, College Press).
DAWSON, D. 1986 Structures and Skills in English: Book 2 (Harare, College Press).
DAWSON, D. 1986 Structures and Skills in English: Book 3 (Harare, College Press).
GRANT, N. J. H. 1984 English for Zimbabwe: an English course for secondary schools. Book 4 (Harare, Longman).
GRANT, N. J. H. and BIMHA, J. 1981 English for Zimbabwe: an English course for secondary schools. Book 2 (Harare, Longman).
GRANT, N. J. H. and MAMUTSE, E. 1983 English for Zimbabwe: an English course for secondary schools. Book 3 (Harare, Longman).
GRANT, N. J. H. and NDANGA, H.
1981 English for Zimbabwe: an English course for secondary schools. Book 1 (Harare, Longman).
FRANCIS, W. N. and KUCERA, H. 1982 Frequency Analysis of English Usage: Lexicon and Grammar (Boston, Houghton Mifflin).
JAYNES, K. 1991 'A bottom-up approach to personality testing' (Harare, University of Zimbabwe, Department of Psychology, Unpubl. dissertation).
JOHANNSON, S. 1978 'Manual of Information to accompany the Lancaster-Oslo-Bergen Corpus of British English for use with Digital Computers' (Oslo, Oslo University).
JORDAN, J. 1989 'Forging an environmental supports bank for vocabulary and verbal reasoning testing in Zimbabwe', Psychology and Developing Societies, 1, 165-175.
LOUW, W. E. in press 'Computer assisted materials evaluation: content vs national policy', English Language Research Journal, III, 29-42.
RENOUF, A. J. 1987 'Corpus Development', in J. M. Sinclair (ed.), Looking Up: An Account of the COBUILD Project in Lexical Computing (London, Collins), 1-40.
SHASTRY, S. V. 1988 'The Kolhapur Corpus of Indian English', ICAME Journal, XII, 15-26.
SINCLAIR, J. M. 1987a Collins COBUILD English Language Dictionary (London, Collins).
SINCLAIR, J. M. (ed.) 1987b Looking Up: An Account of the COBUILD Project in Lexical Computing (London, Collins).
SINCLAIR, J. M. 1987c 'Collocation: A progress report', in R. Steele (ed.), Essays Presented in Honour of Michael Halliday (Amsterdam, John Benjamins).
SINCLAIR, J. M. 1988 'Naturalness in Language', English Language Research Journal, II, 11-20.
SINCLAIR, J. M. and RENOUF, A. J. 1987 'A lexical syllabus for language learning', in M. J. McCarthy and R. A. Carter (eds.), Vocabulary in Language Teaching (London, Longman), 140-160.
WILLIS, J. R. and WILLIS, J. D. 1988 Collins COBUILD English Course (London, Collins).