Mid-frequency readers

Journal of Extensive Reading 2013, Volume 1
ISSN: 2187-5065

Paul Nation
LALS, Victoria University of Wellington, Wellington, New Zealand

Laurence Anthony
Waseda University, Tokyo, Japan

This article describes a new free extensive reading resource for learning the mid-frequency words of English and for reading well-known texts with minor vocabulary adaptation. A gap exists between the end of graded readers at around 3,000 word families and the vocabulary size needed to read unsimplified texts at around 8,000 word families. Mid-frequency readers are designed to fill this gap. They consist of texts from Project Gutenberg adapted for learners with a vocabulary size of 4,000 word families, 6,000 word families and 8,000 word families. Each text is available at these three different levels. The goal is to have at least fifty such texts at each of the three different levels freely available. The adaptation is done using the BNC/COCA word family lists and the AntWordProfiler program. The article also discusses research that needs to be done on learning mid-frequency vocabulary and on creating and using mid-frequency readers.

The vocabulary demands of reading

Research on vocabulary comprehension has shown that a learner of English needs to understand around 98% of the running words (tokens) in a text for unassisted comprehension (Hu & Nation, 2000; Schmitt, Jiang, & Grabe, 2011). Using corpora from various genres, Nation (2006) showed that this value equates to around 8,000 word families (see Table 1), which is an ambitious goal for most learners and would require a lot of deliberate and incidental learning of vocabulary.

Table 1: Vocabulary sizes needed to get 98% coverage (including proper nouns) of various kinds of texts (Nation, 2006)

Texts                 98% coverage
Novels                9,000 word families
Newspapers            8,000 word families
Spoken English        7,000 word families
Children’s movies     6,000 word families

To reach these high vocabulary sizes, extensive reading should play a large role in any vocabulary learning program, both in helping the learning of vocabulary and in improving its use (see Pigada & Schmitt, 2006; Waring & Takaki, 2003, for reviews). Unfortunately, most graded reading schemes end at around the 3,000 word family level. This means that if learners with a vocabulary size of 3,000 word families or more want to continue doing extensive reading which is at the right level for them, there is no suitable material. The 5,000-6,000 word-family gap between the end of graded readers and the requirements for unassisted comprehension is simply too large. Also, it is possible that even if a learner has the vocabulary knowledge required for unassisted reading, some of the vocabulary will not be accessed quickly enough for fluent extensive reading. Thus, the need to bridge the gap between graded readers and authentic texts is even more important.

In the past, a series to bridge this gap, appropriately called the Bridge series, was published by Longman, Green, and Co. The Bridge series contained 32 titles including fiction works, such as Animal Farm, Lucky Jim, Persuasion, The Red Badge of Courage, and Great Expectations, and non-fiction works, including The Mysterious Universe, Changing Horizons, and Mankind against the Killers. Although the series is now out of print, the number of printings for some of the books shows that they, at least, sold well. The following is a note describing the series that appeared in the introduction to Animal Farm.

The Bridge Series is intended for students of English as a second or foreign language who have progressed beyond the elementary graded readers and the Longman Simplified English Series but are not yet sufficiently advanced to read works of literature in their original form. The books in the Bridge Series are moderately simplified in vocabulary and often slightly reduced in length, but with little change in syntax. The purpose of the texts is to give practice in understanding fairly advanced sentence patterns and to help in the appreciation of English style. We hope that they will prove enjoyable to read for their own sake and that they will at the same time help students to reach the final objective of reading original works of literature in English with full understanding and appreciation.

Technical Note: In the Bridge Series words outside the commonest 7000 (in Thorndike and Lorge: A Teacher’s Handbook of 30,000 Words. Columbia University, 1944) have usually been replaced by commoner and more generally useful words. Words used which are outside the first 3,000 of the list are explained in a glossary and are so distributed throughout the book that they do not occur at a greater density than 25 per running 1000 words.

(from the introduction to the Bridge Series edition of Animal Farm, 1945)

The Bridge series involves a reasonable amount of glossing (the glossary is in the form of a list with definitions at the back of the book) and a small amount of adaptation. For example, Animal Farm contains a glossary of around 880 words which cover approximately 3.3% of the running words in the text. The number of glossed words for Animal Farm is high because no words were replaced in the text. Other glossaries range from 120 to 600 words. Although having an extensive glossary at the back of the book could interrupt the flow of reading, glossed words in the Bridge Series are not bolded or marked in any way in the text. Learners are supposed to look up words only when they need to.

With the growth of personal computers and the development of word family lists and computer programs that use them, the study of the vocabulary load of text has become increasingly more detailed. For example, Nation (2009) looked in detail at the number of changes that would need to be made to adapt texts for learners at various vocabulary size levels.
In Table 2, we can see that to adapt the Project Gutenberg version of the novel Lord Jim by Joseph Conrad for a learner who knew 4,000 word families, 5% of the word families would need to be glossed and 0.75% of the word families would need to be replaced. In column 3, the “target word families to gloss” is arbitrarily set at a maximum of 5%. If this percentage is lowered, then the percentage in column 4 needs to be increased. Several unknown words will be easy to guess from context, and words which are easy to guess should not be chosen for replacement. The lowest frequency level words are replaced unless they are
repeated within the text or they are easy to guess.

Table 2: Percentage of target word families to support and word families to replace in Lord Jim at various levels of previous knowledge

Assumed known word families          2,000  3,000  4,000  5,000  6,000  7,000  8,000  9,000
% coverage of known word families    [values missing from source]
% target word families to gloss      5.0    5.0    5.0    4.35   3.44   2.80   2.26   1.85
% of word families to replace        6.68   2.84   0.75   0      0      0      0      0
Total %                              100    100    100    100    100    100    100    100

Table 2 shows that as learners’ vocabulary size increases, the percentages of changes that need to be made become small. However, the weakness of this method of calculating changes is that a small percentage can still be a large number of word families. Lord Jim is 132,413 tokens long, so 5% of the tokens equals 6,621 tokens. This is well over 2,500 word families, which is far too heavy an unknown vocabulary load for a reader. What this means is that the number of words replaced needs to be greater so that only a small percentage of the running words (well under 2%) are unknown words. The critical figure is the raw number of unknown word families that need to be dealt with by the reader, not the percentage coverage of text by unknown words.

High-, mid-, and low-frequency vocabulary

It is useful to distinguish three broad frequency levels of vocabulary: high-frequency vocabulary, mid-frequency vocabulary, and low-frequency vocabulary. The idea of high-frequency words has a long history, and Michael West's (1953) A General Service List of English Words containing around 2,000 word families is the most well-developed and well-known example.

Following the lead of Schmitt and Schmitt (2012), here we consider the high-frequency vocabulary to include the most frequent and wide ranging 3,000 word families of English (see Table 3). The arguments in favour of including the first 3,000 word families in the high-frequency level are that the 3,000 word-family level is needed to gain 95% coverage of the running words in most texts (when the coverage of proper nouns and marginal words is included), and that most graded readers end at around the 3,000 word family level. Note that this figure differs from Nation (2001), who considered this level to contain only the first 2,000 word families.

In Table 3, the mid-frequency vocabulary consists of around 6,000 word families, which when added to high-frequency vocabulary adds up to 9,000 word families. The reason for making the arbitrary cut-off point between mid-frequency and low-frequency vocabulary after the 9th 1000 word-family level is that 9,000 word families provide 98% coverage of most texts, when the coverage of proper nouns and other marginal words is also included.
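The difference between percentage coverage and the raw number of unknown word families, and the effect of moving the assumed vocabulary size across the 3rd, 6th, and 9th 1000 levels, can be illustrated with a minimal sketch. It is not the procedure used by AntWordProfiler. The lookup table, the function name profile, and the sample sentence are hypothetical; the levels given to the b-words follow Table 4, while the inflected forms and the level-1 entries are assumptions made only for illustration.

```python
from collections import Counter

# Toy lookup: word form -> (family headword, 1000-level).
# Levels for the b-words follow Table 4; the inflected forms and the
# level-1 entries are illustrative assumptions, not the BNC/COCA lists.
FAMILY_LEVEL = {
    "the": ("the", 1), "a": ("a", 1), "in": ("in", 1), "was": ("be", 1),
    "old": ("old", 1), "man": ("man", 1),
    "barn": ("barn", 4), "barns": ("barn", 4),
    "balcony": ("balcony", 5),
    "bandage": ("bandage", 6), "bandaged": ("bandage", 6),
    "bamboo": ("bamboo", 7),
    "bandit": ("bandit", 8), "bandits": ("bandit", 8),
    "banister": ("banister", 9),
}

def profile(tokens, known_levels, proper_nouns=frozenset()):
    """Return (coverage %, unknown tokens, unknown word families)."""
    unknown = Counter()
    for tok in tokens:
        if tok in proper_nouns:
            continue  # proper nouns are counted as known
        family, level = FAMILY_LEVEL.get(tok.lower(), (tok.lower(), None))
        if level is None or level > known_levels:
            unknown[family] += 1
    unknown_tokens = sum(unknown.values())
    coverage = 100 * (len(tokens) - unknown_tokens) / len(tokens)
    return coverage, unknown_tokens, len(unknown)

text = ("The old bandit was in the barn Jim bandaged the bandits "
        "in a bamboo barn").split()

print(profile(text, 3, {"Jim"}))  # (60.0, 6, 4)
print(profile(text, 6, {"Jim"}))  # (80.0, 3, 2)
print(profile(text, 9, {"Jim"}))  # (100.0, 0, 0)

# The Lord Jim figure quoted in the text: 5% of 132,413 tokens.
print(round(132_413 * 0.05))      # 6621
```

The function reports the unknown tokens and the unknown families separately because, as argued above, the same percentage can hide very different numbers of distinct families for the reader to deal with.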
Table 3: High-frequency, mid-frequency, and low-frequency vocabulary

Vocabulary level   Word family levels (and total)   Nature of the vocabulary
High-frequency     1st 1000-3rd 1000 (3,000)        Wide range, very high-frequency, essential, general purpose vocabulary
Mid-frequency      4th 1000-9th 1000 (6,000)        Wide range, moderate frequency, general purpose vocabulary
Low-frequency      10th 1000 on                     Narrower range, low-frequency, some technical vocabulary unique to a particular discipline

In order to create the word family lists reported in the Nation (2009) study, an untagged version of the British National Corpus (BNC) was used. This was divided along genre divisions into 10 roughly equally sized sections each 10,000,000 word tokens long. At around the 9,000 word family level, the range figures for the most frequent words changed from a value of 10 to a value of 9. That is, at around the 9,000 word family level, the word families did not occur in all 10 sections of the BNC, but in only 9 of them. This can be seen as marking a change from generally useful vocabulary to more narrowly focused vocabulary.

Table 4 shows examples of word families from a revised list of mid-frequency word family level lists that were developed for this study on the basis of frequency information from the BNC combined with that from the Corpus of Contemporary American English (COCA) kindly supplied by Mark Davies (Nation, 2012).
The word families in Table 4 are taken from the lists beginning at the letter b and are shown here so that readers of this article can get a feel for the kinds of words in the mid-frequency vocabulary.

Table 4: Example word families from the six 1000 mid-frequency word-family levels using the BNC/COCA lists

Word family frequency level   Example word families
4th 1000                      ballet, balloon, ballot, bankrupt, barn, barrel, baseball
5th 1000                      badge, bail, bait, balcony, bald, banner, Baptist
6th 1000                      babe, bachelor, baffle, bandage, banish, banquet, barb
7th 1000                      badger, bale, ballad, bamboo, baptism, baptize, barbarian
8th 1000                      babble, backfire, baggy, ballistic, banal, bandit, barber
9th 1000                      backlog, bailiff, bandwagon, banister, banter, barbaric, bard
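The range figure used to mark the boundary at around the 9,000 word family level is the number of corpus sections in which an item occurs. A minimal sketch of that count is given here. It assumes three tiny invented sections instead of the ten 10,000,000-token BNC sections, and it counts word forms rather than word families; the function name range_counts and the sample data are hypothetical.

```python
from collections import defaultdict

def range_counts(sections):
    """Map each word form to the number of sections it occurs in."""
    seen_in = defaultdict(set)
    for idx, tokens in enumerate(sections):
        for tok in tokens:
            seen_in[tok.lower()].add(idx)
    return {word: len(ids) for word, ids in seen_in.items()}

# Three invented sections standing in for the ten BNC sections.
sections = [
    "the barn was old".split(),
    "the bailiff saw the barn".split(),
    "the old barn".split(),
]

ranges = range_counts(sections)

# Items found in every section have the maximum range value.
full_range = sorted(w for w, r in ranges.items() if r == len(sections))
print(full_range)      # ['barn', 'the']
print(ranges["old"])   # 2
```

With real lists, the same count taken per word family would show where families stop occurring in all sections, which is the change the article reports at around the 9th 1000.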
Mid-frequency words are commonly known by adult native speakers of the language, and we would expect native-speaking children beginning secondary school to know many of these words to some degree. Note that the related words Baptist, baptize, and baptism in Table 4 are separate word families. This is because the stem form of these words is a bound form, not a free form. That is, there is no word Bapt which stands as a free word. Note also that compound words, such as backfire and bandwagon, are included. This is because these are not transparent compounds where the meaning of the word can be explained directly from the word parts. The test for transparent compounds is to see if it is possible to state the meaning of the compound using the parts with few if any further content words needed. For example, your birthday is the day of your birth.

The low-frequency words of the language are a very large group. The BNC/COCA lists go up to the 25th 1000 word-family level, but the low-frequency words stretch far beyond this. It is not easy to say how many low-frequency word families there are in English, but various estimates put the number at somewhere around 100,000 word families (Nation, in press). The current BNC/COCA word family lists going up to and including the 25th 1000 plus the four lists of proper nouns, marginal words, transparent compounds and abbreviations provide over 99% coverage of the tokens in most texts and corpora. At least half of the words outside the lists turn out to be proper nouns, and a large number of the remainder are transparent low-frequency members of word families already in the existing lists but which have not yet been added to the families (Nation, in press).

Table 5 shows the typical coverage of high-frequency, mid-frequency and low-frequency word families.
The high-frequency words, mid-frequency words, and proper nouns, exclamations, transparent compounds and abbreviations add up to over 98% of the running words in the text. The high-frequency words, proper nouns, exclamations, transparent compounds and abbreviations add up to around 95% of the running words.

Table 5: Coverage of the British National Corpus (BNC) by high, mid- and low-frequency word families

Type of vocabulary                                                             % coverage
High-frequency (3,000 word families)                                           90%
Mid-frequency (6,000 word families)                                            5%
Low-frequency (10th 1000 word-family level on)                                 1-2%
Other (Proper nouns, exclamations, transparent compounds, abbreviations)       3-4%
Total                                                                          100%

Table 6: Distribution of high, mid- and low-frequency word families in a variety of genres

LevelHigh-frequency - 3,000SpokenTV/MoviesChildren’s 5%2.99%97.83%96.47%93.72%93.20%Mid-frequency -