Corpus-Based Frequency Profiling: Migration to a Word List Based on the British National Corpus

 

Leah Gilner* and Franc Morales

 

Abstract

 

The selection and assessment of ELT materials involve multiple criteria. The use of frequency word lists to profile the vocabulary makeup of a text is one such criterion. It provides a quantifiable characterization and classification of lexical material in terms of corpus-based frequency measures. The process of vocabulary profiling is not without challenges, first among which is the identification of a word list adequate for ELT. The choice will determine the amount of information, if any, that can be derived from a text. This paper provides an appraisal of a frequency word list based on the British National Corpus (BNC) and shows the benefits that can be gained by profiling with this list rather than with the long-established General Service List (West, 1953).

 

 

There are two basic reasons to move on from the GSL. First, the GSL was compiled based on data from corpora tallying up to 10 million tokens while the BNC is ten times larger. The difference in size makes it possible to obtain a more accurate account of the frequency organization of the English lexicon. Second, the GSL leaves uninformative gaps when deployed in profiling, ranging from 10% to 25% depending on the text (Nation and Waring, 1997). Due to the way in which the GSL was manufactured, expanding this word list is nearly impossible if one is to follow the original directives and criteria (Faucett et al., 1936; Lorge, 1949; West, 1953). While necessarily observing different criteria, word lists that supplement the GSL have been proposed (Coxhead, 2000; Xue and Nation, 1984) and investigated (Hyland and Tse, 2007). It can be said that expandability is, perhaps, the weakest point of the GSL and the best reason for seeking a replacement. The BNC affords the possibility of addressing this issue.

However, since the BNC was not designed to inform ELT, knowledge of its origin and composition is an important factor to take into consideration when seeking to derive information and insight from it. The BNC is a 100-million-word sample, synchronic, general, monolingual, mixed corpus of present-day British English. The compilation of the BNC was a collaborative undertaking carried out by dictionary publishers (Oxford University Press, Longman, Chambers Harrap) and academic institutions (Oxford University, Lancaster University, the British Library) with financial backing from British government agencies (Leech et al., 2001). About 90% of the corpus consists of written language, categorized as imaginative (i.e. fiction) or informative (i.e. non-fiction, expository); most of the written texts date from 1975 or later (20% of imaginative texts date from 1960). The 10-million-word subcorpus of spoken language contains samples recorded between 1991 and 1993; 40% of the samples represent conversational language use, that is to say, “spontaneous interactions engaged in by some 127 adults aged 15 and over” (Leech et al., 2001, p. 2); 60% of the samples represent task-oriented speech (lectures, sermons, TV/radio programs, consultations), “those types of […] spoken activity that were unlikely to be recorded by the conversational volunteers during a typical day of their lives” (Leech et al., 2001, p. 3).

It stands to reason that not every frequency word list based on the BNC has the potential to be equally informative in ELT. Nation (2004) inspected the 3,000 most frequent word families in the BNC and found that they contain material from the GSL as well as the Academic Word List (Coxhead, 2000). The presence of this academic vocabulary interwoven with words of general service, which are thought to provide the foundation for subsequent learning (Nation, 2001), made it difficult to “decide how the GSL could be replaced” (Nation, 2004, p. 12) when considering, precisely, core vocabulary. Following a different line of inquiry, Nation (2006) produced 14,000 word families, organized according to frequency data from the BNC, in order to assess “how large a receptive vocabulary is needed for typical language use” (p. 59). Of interest, his analyses reveal important information in those gaps uncharacterized by the GSL. For example, ‘topic words’ were identified in the 4,000 word family and beyond.

The evolution of this list of 14,000 word families (hereafter, BNC-ELT word list) was not limited to expandability (over the 3,000 originally used) but subsequently included the reorganization of the word families according to the spoken subcorpus of the BNC. As Nation explains in the “readme” file of the RANGE software available from his website, “previously the lists had been sequenced using figures from the whole BNC but because of the overwhelming amount of formal written material this resulted in lists that did not satisfactorily represent informal spoken uses of English”. It should be noted that range was the main criterion used in the creation of the list and that frequency was the second criterion (I.S.P. Nation, personal communication, February 13, 2008).

Although it is reasonable to assume that further refinement of the BNC-ELT list might take place, we have adopted this latest revision of the list and have used it in the analyses carried out in this investigation in order to characterize a relatively large and varied sample of the kinds of authentic materials that can be used in ELT. As findings will show, this characterization gives powerful reason to migrate from the GSL to the BNC-ELT.

We begin with a closer look at the BNC by inspecting the raw frequency measures it yields and how these metrics characterize the whole corpus, a generalizable insight into lexical choice in language use. Following that, we present an analysis that illustrates how these same metrics characterize a sample of materials used in ELT. Some structural properties of the BNC-ELT list will then be presented together with a description of the ELT corpus compiled for this investigation. We bring the two together, the BNC-ELT list and the ELT corpus, in a series of analyses that will allow us to explore the extent and manner in which profiling can be used to characterize the lexical content of ELT materials. We close by providing comparative measures between the GSL and the BNC-ELT for referential purposes that may facilitate the transition from one to the other.

The importance of lexical frequency in language use cannot be overstated. Analysis of any reasonable amount of language in use reveals substantial uniformity with regard to lexical content. The numbers are quite impressive. Nearly 50% of all language used is confined to 100 words, 75% to 2,000 words, and 85% to 5,000 words (Leech et al., 2001). These numbers show an extremely steep distribution curve where very few items in the vocabulary account for most discourse while the majority of items in the vocabulary occur with severe, even extreme, infrequency (Ellis, 2002). The implications for language instruction are clear. Speakers demonstrate marked preferences when it comes to lexical choice, making it feasible to isolate a vocabulary of objective value.

Table 1 presents data on the lexical coverage that the most frequent words of the English language provide for the British National Corpus. According to the data obtained from Leech et al. (2001), 397,041 of the 757,087 words (52.44%) in the corpus occur only once and thus together account for only about 0.4% of the roughly 100 million occurrences in the corpus, while 100 of the 757,087 words (0.0132%) account for 45.8786% of the occurrences in the corpus. The analysis we provide in Table 1 illustrates this phenomenon in greater detail.

It is evident that exceedingly few types (words) account for the vast majority of tokens (occurrences). While the 10 most frequent types occur over 21 million times, there are 397,041 types (52.44%) that occur only once in the entire corpus (Leech et al., 2001).

| Types | Tokens | % of BNC types | % of BNC corpus | Difference | Cumulative |
|---|---|---|---|---|---|
| 100 | 45,878,600 | 0.0132% | 45.8786% | | |
| 200 | 52,252,200 | 0.0264% | 52.2522% | 6.3736% | 6.3736% |
| 500 | 60,610,600 | 0.0660% | 60.6106% | 8.3584% | 14.7320% |
| 1,000 | 67,569,500 | 0.1321% | 67.5695% | 6.9589% | 21.6909% |
| 1,500 | 71,864,900 | 0.1981% | 71.8649% | 4.2954% | 25.9863% |
| 2,000 | 74,950,900 | 0.2642% | 74.9509% | 3.0860% | 29.0723% |
| 2,500 | 77,332,500 | 0.3302% | 77.3325% | 2.3816% | 31.4539% |
| 3,000 | 79,255,000 | 0.3963% | 79.2550% | 1.9225% | 33.3764% |
| 3,500 | 80,828,900 | 0.4623% | 80.8289% | 1.5739% | 34.9503% |
| 4,000 | 82,144,700 | 0.5283% | 82.1447% | 1.3158% | 36.2661% |
| 4,500 | 83,254,100 | 0.5944% | 83.2541% | 1.1094% | 37.3755% |
| 5,000 | 84,214,800 | 0.6604% | 84.2148% | 0.9607% | 38.3362% |
| 5,500 | 85,060,900 | 0.7265% | 85.0609% | 0.8461% | 39.1823% |
| 6,000 | 85,809,200 | 0.7925% | 85.8092% | 0.7483% | 39.9306% |
| 6,500 | 86,480,400 | 0.8586% | 86.4804% | 0.6712% | 40.6018% |
| 7,000 | 87,088,700 | 0.9246% | 87.0887% | 0.6083% | 41.2101% |

Table 1. Breakdown of the most frequent words in English.

Table 1 shows that the first 100 words (types) occur 45,878,600 times (tokens) in the BNC. The next set of 100 types contributes 6,373,600 tokens (or 6.3736% of the corpus) while the next 300 types add 8,358,400 tokens. From here, the column labeled ‘Difference’ shows how the number of tokens contributed by each subsequent addition of types gradually diminishes. The column labeled ‘Cumulative’ grows as the number of types increases, yet it shows that the number of tokens does not quite double even after the inclusion of the tokens corresponding to 7,000 types. Summing up, the 100 most frequent types (words) of the language amount to more tokens (occurrences) than the following 6,900 types (words) combined and, together, the 7,000 most frequent types account for 87.09% of all tokens in the BNC.
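The ‘Difference’ and ‘Cumulative’ columns of Table 1 follow directly from the token counts, and the arithmetic can be sketched in a few lines. This is only an illustration of the calculation, not the procedure the authors used: the function name `coverage_rows` and the choice of rows are assumptions made here, and because only a subset of rows is included, each difference is measured against the previous row listed.

```python
BNC_TOKENS = 100_000_000  # corpus size behind the percentages in Table 1

# (number of top types, tokens they account for); a subset of Table 1 rows
TABLE_1 = [
    (100, 45_878_600),
    (200, 52_252_200),
    (500, 60_610_600),
    (1_000, 67_569_500),
    (7_000, 87_088_700),
]

def coverage_rows(rows, corpus_size):
    """Return (types, coverage %, difference %, cumulative %) per row.

    'difference' is the gain over the previous listed row; 'cumulative' is
    the gain over the first row. Both are None for the first row.
    """
    first_tokens = rows[0][1]
    out, prev_tokens = [], None
    for n_types, tokens in rows:
        coverage = 100 * tokens / corpus_size
        if prev_tokens is None:
            difference = cumulative = None
        else:
            difference = 100 * (tokens - prev_tokens) / corpus_size
            cumulative = 100 * (tokens - first_tokens) / corpus_size
        out.append((n_types, coverage, difference, cumulative))
        prev_tokens = tokens
    return out

rows = coverage_rows(TABLE_1, BNC_TOKENS)
for n_types, coverage, difference, cumulative in rows:
    print(n_types, round(coverage, 4), difference, cumulative)
```

The last row makes the summary point concrete: the coverage of the first 100 types (`rows[0][1]`) exceeds the cumulative gain contributed by the next 6,900 types (`rows[-1][3]`).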

It is relevant to ask whether texts of smaller size exhibit frequency distributions similar to those observed in large corpora. Furthermore, it is pertinent to ask whether ELT materials are equally served by this information. Table 2 answers both questions affirmatively by analyzing eight texts of different sizes and types of discourse (spoken and written). These texts were randomly chosen from the ELT corpus compiled for this study, which will be introduced later on. For now, note that the texts are arranged so that “beginner” materials are displayed in the upper rows and “advanced” materials in the lower rows; that is, the Interview script is a transcription of a dialogue of little complexity while the NYT article is a news piece from the New York Times and, therefore, of reasonable difficulty for advanced learners.

| Source | Tokens | Top 100 types | Top 500 types | Top 1,000 types | Top 2,000 types | Top 3,000 types |
|---|---|---|---|---|---|---|
| BNC | 100,000,000 | 45.89% | 60.61% | 67.57% | 74.95% | 79.25% |
| All texts | 52,169 | 49.91% | 63.97% | 69.63% | 78.29% | 82.65% |
| Interview script | 560 | 62.11% | 80.33% | 84.88% | 88.89% | 91.62% |
| Short story | 15,935 | 54.95% | 69.65% | 74.97% | 80.95% | 84.76% |
| Family movie script | 17,730 | 45.81% | 58.65% | 63.39% | 75.93% | 80.48% |
| News article (intermediate) | 448 | 43.95% | 61.16% | 69.30% | 80.00% | 85.81% |
| Novel | 10,542 | 51.77% | 66.85% | 72.74% | 79.70% | 84.16% |
| ESP reading (technology) | 1,602 | 45.56% | 58.78% | 65.15% | 72.14% | 77.02% |
| IHT article | 1,555 | 45.71% | 58.80% | 67.64% | 75.61% | 79.00% |
| NYT article | 1,265 | 42.28% | 54.68% | 64.45% | 72.25% | 79.89% |

Table 2. Profile of a variety of texts based on BNC raw frequencies.

The first observation is that the coverage provided by the most frequent types (words) in English is better for the ELT materials than it is for the BNC (82.65% versus 79.25%). The second observation is that while values fluctuate, trends are uniform and correlate. Approximately half (or more) of all tokens of any text are confined to the 100 most frequent types in the language while approximately three-fourths (or more) of all tokens of any text are confined to the 2,000 most frequent types. The third observation is that, together, the 3,000 most frequent types offer better coverage of the Interview script than of the NYT article. The fourth observation is that, as we look from row to row in descending order, we can see that profiling does not provide a definite correlation between frequency and difficulty even though the data shows a tendency in that direction. In other words, better coverage (in this case, fewer infrequent words) does not necessarily imply lesser difficulty.
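The kind of profile reported in Table 2 can be approximated with a short script: count how many tokens of a text fall within the top N entries of a frequency-ranked list. The sketch below is hypothetical in every detail. The tokenizer is deliberately crude, the ten-word `ranked` list is a toy stand-in for a real BNC frequency list, and the names `tokenize` and `coverage_at_cutoffs` are introduced here for illustration only.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (crude on purpose)."""
    return re.findall(r"[a-z']+", text.lower())

def coverage_at_cutoffs(tokens, ranked_words, cutoffs):
    """Return {N: % of tokens covered by the top-N entries of ranked_words}."""
    rank = {}
    for i, word in enumerate(ranked_words):
        rank.setdefault(word, i)  # keep the best rank if a form is listed twice
    result = {}
    for n in cutoffs:
        covered = sum(1 for t in tokens if t in rank and rank[t] < n)
        result[n] = 100 * covered / len(tokens) if tokens else 0.0
    return result

# Toy ranked list (an assumption, not actual BNC ranks)
ranked = ["the", "of", "and", "a", "in", "to", "it", "is", "was", "cat"]
text = "The cat sat in the hat and it was a big cat."
result = coverage_at_cutoffs(tokenize(text), ranked, [5, 10])
print(result)
```

With a real ranked list and cutoffs of 100, 500, 1,000, 2,000, and 3,000, the same function would yield one row of a Table 2-style profile per text.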

The word lists employed in the two previous analyses were manufactured by identifying the most frequent unlemmatized types (words) in the English language (for example, the first ten types are the, of, and, a, in, to, it, is, to, and was). For the purposes of ELT, such a list is not as useful as a list where items are lemmatized and, more importantly, clustered into word families. A word family refers to a grouping containing a headword, its inflections, and its closest derivations (Nation, 2001). It is posited that awareness of word family relationships can “greatly decrease the learning burden of derived words containing known base forms,” (Nation, 2001, p. 8) although the extent to which this is true will depend on a given learner’s experience and linguistic background among other things (Mochizuki and Aizawa, 2000; Sakata, 2007).

One of the strengths of the BNC-ELT list is that types are clustered into families. The BNC-ELT list comprises a total of 50,598 types (words), grouped into 14 sublists of 1,000 families each, which in turn are ranked by descending frequency. That is, the first sublist contains the 1,000 most frequent word families in the English language, the second sublist contains the following 1,000 most frequent families, and so on. The detail and scope of the BNC-ELT list make this contribution an unparalleled resource for the identification of lexical distributions in ELT materials.

Table 3 shows some of the structural characteristics of the BNC-ELT sublists together with the code that will be used to refer to them in the discussion that follows. As mentioned, each sublist contains 1,000 word families. An example of a family from the sublist SL-01 is ABLE: able, ability, abler, ablest, ably, abilities, unable, and inability while an example of a family from the sublist SL-14 is ALLURE: allure, allured, allures, alluring, and alluringly.

 

| Code | Families | Words (types) | Average types per family |
|---|---|---|---|
| SL-01 | 1,000 | 6,348 | 6.35 |
| SL-02 | 1,000 | 5,593 | 5.59 |
| SL-03 | 1,000 | 4,517 | 4.52 |
| SL-04 | 1,000 | 4,287 | 4.29 |
| SL-05 | 1,000 | 3,992 | 3.99 |
| SL-06 | 1,000 | 3,494 | 3.49 |
| SL-07 | 1,000 | 3,272 | 3.27 |
| SL-08 | 1,000 | 3,192 | 3.19 |
| SL-09 | 1,000 | 3,050 | 3.05 |
| SL-10 | 1,000 | 2,840 | 2.84 |
| SL-11 | 1,000 | 2,794 | 2.79 |
| SL-12 | 1,000 | 2,568 | 2.57 |
| SL-13 | 1,000 | 2,426 | 2.43 |
| SL-14 | 1,000 | 2,225 | 2.23 |
| BNC-ELT | 14,000 | 50,598 | 3.61 |

Table 3. Characteristics of the BNC-ELT list.

 

We can see from Table 3 that less frequent families have fewer members (Nation, 2007). The column labeled ‘Average’ quantifies this trend by presenting the average number of types per family for each of the sublists and for the BNC-ELT list as a whole.
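A word family list of this kind is naturally represented as a mapping from headwords to member types, from which a reverse lookup (type to headword) and the ‘Average’ column can be derived. The following is a minimal sketch using the two example families quoted above; the dictionary layout and the helper names `build_lookup` and `average_family_size` are assumptions made for illustration and do not describe the actual file format of the BNC-ELT list.

```python
# Two families quoted in the text (ABLE from SL-01, ALLURE from SL-14)
families = {
    "ABLE": ["able", "ability", "abler", "ablest", "ably",
             "abilities", "unable", "inability"],
    "ALLURE": ["allure", "allured", "allures", "alluring", "alluringly"],
}

def build_lookup(families):
    """Map every member type to the headword of its family."""
    return {member: head
            for head, members in families.items()
            for member in members}

def average_family_size(n_types, n_families):
    """Average number of types per family, as in the 'Average' column."""
    return n_types / n_families

lookup = build_lookup(families)
print(lookup["inability"])                          # headword for a derived form
print(round(average_family_size(6348, 1000), 2))    # SL-01 figures from Table 3
print(round(average_family_size(50598, 14000), 2))  # whole-list figures from Table 3
```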

The ELT corpus compiled for this investigation consists of eight collections of texts, each taken from the kinds of sources (newspaper articles, movie scripts, short stories, novels, etc.) generally referred to as authentic material (Gilmore, 2007). Table 4 presents some general statistics which serve to inform on the composition of the ELT corpus compiled for this study. The collections were compiled from the kind of sources that are often used when the desire is to provide students with authentic models of naturally-occurring, fluent language use (Brown and Yule, 1983) and that, in our experience teaching at university level, often find their way into the classroom. These collections can be said to illustrate a natural grading (Gilmore, 2007) in terms of content, presentation, and register (Carter and McCarthy, 1994), ranging from Interview scripts and Short stories which are likely to be used with less experienced students (i.e. first-year university) to articles from the International Herald Tribune (IHT) and the New York Times (NYT) which might be selected for more experienced and advanced students.

 

| Code | Description | Items | Tokens | Avg. length (tokens) |
|---|---|---|---|---|
| CLT-01 | Interview scripts | 116 | 49,613 | 453 |
| CLT-02 | Short stories | 18 | 52,413 | 2,958 |
| CLT-03 | Family movie scripts | 19 | 325,581 | 18,937 |
| CLT-04 | News articles (intermediate) | 134 | 72,187 | 538 |
| CLT-05 | Novels | 14 | 398,854 | 29,614 |
| CLT-06 | ESP readings (technology) | 31 | 55,486 | 1,866 |
| CLT-07 | International Herald Tribune | 164 | 154,658 | 1,010 |
| CLT-08 | New York Times | 58 | 48,701 | 903 |
| ELT Corpus | All collections | 554 | 1,157,493 | 2,226 |

Table 4. Characteristics of the ELT corpus.

 

The design of the ELT corpus was approached from an ELT practitioner’s perspective in as much as we wanted to compile an assortment of material that might reflect the choices made by colleagues in the field. The ELT corpus deliberately contains collections that have markedly more tokens than others (i.e. Family movie scripts vs. Interview scripts) in order to highlight the effect text length has on profiling results. We now describe each collection in turn.

The Interview scripts collection is comprised of transcriptions of 116 interviews taking place between speakers of different backgrounds. The materials can be described as modeling naturally-occurring interactions, as when people are getting to know each other, and in which English is used as an international language. The Interview scripts have an average length of about 450 tokens (occurrences), making them relatively short. From the ELT practitioner’s perspective, the length, breadth, and depth of the interviews make them appropriate for less experienced students. Topics are discussed in general terms and speakers often provide narratives about personal experiences.

The Short stories collection contains 18 children’s stories (i.e. The Tale of Peter Rabbit, The Emperor’s New Clothes, Rapunzel). The stories are written for a young L1 reading audience but the fictional, imaginative aspects of the texts can make them entertaining and engaging for L2 learners of an older age. The stories are lengthy, on average about 3,000 tokens, and could be a resource for extensive reading in lower and intermediate levels.

The Family movie scripts collection contains 19 scripts, on average about 19,000 tokens long, from movies that seem to be widely-recognized, such as E.T. and Back to the Future. The nature of the genre implies that a substantial amount of the discourse comes in the form of dialogues. To a large extent, the structural and conceptual complexity of the material is bound by the target audience (parents and children). As family movies are less likely to involve in-depth development of ideas or elaborate argumentation, they are deemed most appropriate for intermediate-level students.

The News articles (intermediate) collection includes 134 newspaper articles from Voice of America and the English version of a well-known Japanese newspaper. Topics vary from economics and politics to health and education. They are relatively short, on average about 500 tokens long, and differ from the articles in the IHT and the NYT collections in terms of lexical and structural complexity as well as depth of exposition; these factors combine to make this collection accessible to intermediate-level students.

The Novels collection contains 14 full-length works of fiction written for adult audiences, averaging about 30,000 tokens in length. The nature of these kinds of texts implies a more in-depth development of plot and characters than in the other collections and is likely to include examples of spoken discourse in the form of dialogue interwoven in the narrative. Full-length novels are also likely to make use of a larger and more varied vocabulary than shorter texts. Given these characteristics, the texts in this collection represent extensive reading material for students at advanced levels.

The ESP readings (technology) collection contains 31 texts which describe aspects and constructs related to computers, the Internet, and electronics. The texts average about 2,000 tokens, are procedural in nature, and make use of domain-specific, specialized vocabulary. This collection is deemed to represent intermediate- and advanced-level material for Science and Engineering majors.

The International Herald Tribune (IHT) and New York Times (NYT) collections contain 164 and 58 news articles and special reports, respectively. In both cases, the average length is about 1,000 tokens. Topics vary widely and include: politics, economics, culture, society, travel, sports, health, fashion, etc. As the name suggests, the IHT targets an international audience while the NYT, although of international repute, is thought of as a newspaper of record in the U.S. Thus, even though the two newspapers are owned by the same company, the role of each may influence not only the treatment and perspective provided in the texts, but also the style of discourse and expression. Either collection might be used with advanced-level students.

With this outline of the collections and corpus in mind, we move on to the analysis of the ELT corpus by means of the BNC-ELT list. Results for all eight collections are presented throughout. However, we will focus the discussion on CLT-01 and CLT-08 as these collections are similar in size while clearly distinct in difficulty. This narrow focus will allow us to formulate (weak) propositions regarding the insights into “difficulty” that profiling affords. Once the data has been presented and discussed, we will proceed to take into account the results obtained from the other collections. The reader is encouraged to consider all results as analyses are presented.

The discussion elaborates on the assessment of difficulty by means of frequency profiling because we find that this relationship is established intuitively yet not addressed in the literature on vocabulary frequency lists. There seems to be a tendency to assume, for instance, that frequent words are “easier” than infrequent ones. Intuitively, it makes sense to consider a word that is rarely used as being, one, of very specialized application and/or, two, of such low occurrence that it is difficult for a learner to obtain repeated exposure to it. The interpretation of results presented hereafter aims to provide some observations regarding the equivocal relationship between frequency profiling and learning burden.

A profile of the ELT corpus using the BNC-ELT list is presented in Table 5. The lexical material in the ELT corpus belonging to the BNC-ELT list amounts to 98.63% of its tokens (last column, bottom row). By collection, the extremes are 99.64% coverage of CLT-01 (‘lower-level’ texts) and 97.42% coverage of CLT-08 (‘advanced-level’ texts).

 

| Code | CLT-01 | CLT-02 | CLT-03 | CLT-04 | CLT-05 | CLT-06 | CLT-07 | CLT-08 | ELT Corpus |
|---|---|---|---|---|---|---|---|---|---|
| SL-01 | 91.45% | 82.23% | 79.27% | 80.04% | 82.25% | 76.18% | 76.84% | 76.94% | 80.43% |
| SL-02 | 4.23% | 7.01% | 7.07% | 9.59% | 7.01% | 10.28% | 9.47% | 9.35% | 7.65% |
| SL-03 | 1.62% | 3.72% | 4.67% | 2.93% | 3.94% | 3.62% | 3.18% | 3.32% | 3.83% |
| SL-04 | 0.80% | 1.78% | 2.43% | 2.24% | 1.63% | 2.71% | 2.80% | 2.95% | 2.13% |
| SL-05 | 0.42% | 1.36% | 1.39% | 1.10% | 1.34% | 1.36% | 1.47% | 1.36% | 1.32% |
| SL-06 | 0.34% | 0.76% | 0.97% | 0.94% | 0.72% | 0.96% | 1.03% | 0.86% | 0.85% |
| SL-07 | 0.17% | 0.51% | 0.63% | 0.46% | 0.53% | 0.57% | 0.63% | 0.64% | 0.56% |
| SL-08 | 0.15% | 0.46% | 0.39% | 0.44% | 0.35% | 0.82% | 0.54% | 0.53% | 0.42% |
| SL-09 | 0.13% | 0.23% | 0.42% | 0.21% | 0.30% | 0.34% | 0.41% | 0.41% | 0.34% |
| SL-10 | 0.11% | 0.33% | 0.58% | 0.25% | 0.26% | 0.59% | 0.37% | 0.27% | 0.38% |
| SL-11 | 0.09% | 0.15% | 0.28% | 0.16% | 0.28% | 0.22% | 0.26% | 0.27% | 0.25% |
| SL-12 | 0.09% | 0.16% | 0.17% | 0.15% | 0.15% | 0.25% | 0.21% | 0.17% | 0.17% |
| SL-13 | 0.02% | 0.15% | 0.34% | 0.13% | 0.17% | 0.17% | 0.21% | 0.22% | 0.22% |
| SL-14 | 0.02% | 0.16% | 0.08% | 0.06% | 0.07% | 0.23% | 0.15% | 0.15% | 0.09% |
| BNC-ELT | 99.64% | 99.03% | 98.70% | 98.69% | 98.99% | 98.28% | 97.58% | 97.42% | 98.63% |

Table 5. Token coverage of the ELT corpus by the BNC-ELT list and sublists.

 

Inspection of the column labeled ‘ELT Corpus’ shows that sublist SL-01 (the first 1,000 families and, thus, the most frequent in the language) accounts for 80.43% of words in the ELT corpus. The second sublist of families, SL-02, accounts for 7.65% of the vocabulary and, together with the first sublist, amounts to 88.08% of all words in the ELT corpus. A clear drop in use is evident as families become more infrequent, a trend that is disrupted on only two occasions (SL-10 and SL-13).

The coverage analysis also reveals that collections deemed more adequate for lower-level learners have higher concentrations of vocabulary in the first sublist (SL-01) than those collections containing advanced material. The data for SL-01 indicates that this sublist accounts for 91.45% of CLT-01 yet for a much smaller share of CLT-06, CLT-07, and CLT-08 (76.18%, 76.84%, and 76.94% respectively). From this data, it is possible to propose that, in general, there might be a connection between lexical frequency and level of difficulty. In other words, a characteristic of advanced texts might reside in the use of a comparatively infrequent vocabulary and, conversely, that a characteristic of beginner texts might reside in the limitation of vocabulary to frequently occurring - i.e. more common - words. The proposition might be intuitively correct but it is of importance to note that the BNC-ELT list provides a means for the quantification of this characteristic.
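Token coverage by sublist, as tabulated in Table 5, amounts to looking up each token's sublist and tallying the shares. The sketch below illustrates the idea under stated assumptions: the `type_to_sublist` mapping holds only a handful of words whose sublists are mentioned in this paper, the token list is invented, and the function name `profile` and the label `off-list` are hypothetical choices rather than features of any existing profiling tool.

```python
from collections import Counter

def profile(tokens, type_to_sublist):
    """Return {sublist: % of tokens}; tokens not in the list go to 'off-list'."""
    counts = Counter(type_to_sublist.get(t, "off-list") for t in tokens)
    total = sum(counts.values())
    return {sublist: 100 * n / total for sublist, n in counts.items()}

# A tiny excerpt of a type-to-sublist mapping (illustrative only)
type_to_sublist = {
    "able": "SL-01", "unable": "SL-01", "point": "SL-01",
    "praise": "SL-03", "astronaut": "SL-11",
}

# Invented token sequence; "zorp" stands in for a word outside the list
tokens = ["able", "point", "point", "praise",
          "astronaut", "unable", "point", "zorp"]

result = profile(tokens, type_to_sublist)
print(result)
```

Running the same tally per collection, with the full 14-sublist mapping, would give one column of a Table 5-style profile; the off-list share corresponds to the material not covered by the BNC-ELT list.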

We now examine the number of families (and types) from each sublist in the BNC-ELT list that are used by each of the collections in the ELT corpus. The data is first presented globally, that is, regarding the ELT corpus as a whole and without detailing use by collection.

 

| Code | Families | % of families | Types | % of types | Tokens |
|---|---|---|---|---|---|
| SL-01 | 1,000 | 100.00% | 4,563 | 71.88% | 930,981 |
| SL-02 | 994 | 99.40% | 3,822 | 68.34% | 88,559 |
| SL-03 | 981 | 98.10% | 2,972 | 65.80% | 44,325 |
| SL-04 | 954 | 95.40% | 2,486 | 57.99% | 24,642 |
| SL-05 | 902 | 90.20% | 2,018 | 50.55% | 15,271 |
| SL-06 | 860 | 86.00% | 1,640 | 46.94% | 9,817 |
| SL-07 | 763 | 76.30% | 1,284 | 39.24% | 6,430 |
| SL-08 | 698 | 69.80% | 1,149 | 36.00% | 4,837 |
| SL-09 | 654 | 65.40% | 1,030 | 33.77% | 3,943 |
| SL-10 | 626 | 62.60% | 925 | 32.57% | 4,386 |
| SL-11 | 588 | 58.80% | 820 | 29.35% | 2,914 |
| SL-12 | 481 | 48.10% | 633 | 24.65% | 1,958 |
| SL-13 | 469 | 46.90% | 627 | 25.85% | 2,493 |
| SL-14 | 313 | 31.30% | 388 | 17.44% | 1,097 |
| BNC-ELT | 10,283 | 73.45% | 24,357 | 48.14% | 1,141,653 |

Table 6. Use of BNC-ELT lists in ELT corpus (global data).

 

Table 6 shows that all families in sublist SL-01 (the most frequent in the language) are used in the ELT corpus. As a family is a collection of inflected and derived forms, it is important to note that 71.88% of all types in SL-01 are found in the ELT corpus and that these types account for 930,981 of all of the tokens (80.43% as shown in Table 5). As in all other analyses, sublists of more infrequent families are used progressively less both at the family and type (word) level, a fact that correlates with the amount of tokens they account for in the ELT corpus. The data, again, supports the validity and adequacy of the BNC-ELT list for the purpose of assessing ELT materials beyond the scope of the GSL.
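The distinction between family-level and type-level use can be made concrete with a small helper: a family counts as used if any of its members occurs in the text, whereas type-level use counts each member separately. This is a sketch under assumptions, not the authors' procedure. The function name `usage` is hypothetical, and the toy "sublist" simply reuses the ABLE and ALLURE families quoted earlier, which in the real list belong to different sublists.

```python
def usage(tokens, sublist_families):
    """Return (% of families used, % of types used) for one sublist.

    sublist_families maps each headword to the list of its member types.
    """
    seen = set(tokens)
    all_types = [m for members in sublist_families.values() for m in members]
    used_types = [m for m in all_types if m in seen]
    used_families = [head for head, members in sublist_families.items()
                     if seen.intersection(members)]
    return (100 * len(used_families) / len(sublist_families),
            100 * len(used_types) / len(all_types))

# Toy sublist for illustration only
toy_sublist = {
    "ABLE": ["able", "ability", "abler", "ablest", "ably",
             "abilities", "unable", "inability"],
    "ALLURE": ["allure", "allured", "allures", "alluring", "alluringly"],
}

family_pct, type_pct = usage(["able", "unable", "ability", "the"], toy_sublist)
print(family_pct, round(type_pct, 2))
```

In the toy case one of two families is used but only three of thirteen types, which mirrors the pattern in Table 6 where family-level percentages are consistently higher than type-level percentages.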

From this data, it is possible to formulate a second proposition, namely, that there might be a connection between lexical variety and level of difficulty. In other words, one way to characterize advanced texts might be in terms of the use of a comparatively rich vocabulary. Table 7 makes a clear case for this proposition.

 

 

| Code | CLT-01 | CLT-02 | CLT-03 | CLT-04 | CLT-05 | CLT-06 | CLT-07 | CLT-08 | ELT Corpus |
|---|---|---|---|---|---|---|---|---|---|
| SL-01 | 85.00% | 83.60% | 99.00% | 96.50% | 98.60% | 92.80% | 98.40% | 95.90% | 100.00% |
| SL-02 | 47.80% | 57.50% | 92.40% | 78.80% | 91.80% | 77.50% | 94.70% | 79.70% | 99.40% |
| SL-03 | 23.10% | 43.70% | 83.70% | 49.30% | 82.30% | 50.80% | 80.20% | 52.90% | 98.10% |
| SL-04 | 14.20% | 29.20% | 69.90% | 39.10% | 67.20% | 39.30% | 71.80% | 42.60% | 95.40% |
| SL-05 | 7.90% | 21.40% | 59.40% | 26.70% | 54.50% | 27.70% | 58.30% | 29.80% | 90.20% |
| SL-06 | 6.10% | 17.30% | 48.80% | 19.60% | 45.70% | 20.40% | 45.90% | 21.40% | 86.00% |
| SL-07 | 3.60% | 11.10% | 37.60% | 11.70% | 36.10% | 14.30% | 38.00% | 17.00% | 76.30% |
| SL-08 | 3.00% | 8.60% | 33.60% | 12.10% | 32.50% | 11.70% | 30.90% | 12.90% | 69.80% |
| SL-09 | 2.20% | 7.40% | 31.60% | 8.30% | 30.40% | 8.40% | 26.20% | 11.80% | 65.40% |
| SL-10 | 1.80% | 6.30% | 28.40% | 8.60% | 26.50% | 7.70% | 24.30% | 9.70% | 62.60% |
| SL-11 | 1.10% | 4.80% | 25.50% | 6.60% | 24.80% | 6.70% | 24.10% | 7.70% | 58.80% |
| SL-12 | 1.00% | 4.20% | 17.60% | 4.10% | 18.70% | 5.70% | 15.20% | 6.00% | 48.10% |
| SL-13 | 0.70% | 5.10% | 18.70% | 3.80% | 18.90% | 5.50% | 15.20% | 6.30% | 46.90% |
| SL-14 | 0.50% | 2.40% | 10.50% | 2.60% | 10.70% | 3.80% | 11.30% | 4.50% | 31.30% |
| BNC-ELT | 14.14% | 21.61% | 46.91% | 26.27% | 45.62% | 26.59% | 45.32% | 28.44% | 73.45% |

Table 7. Use of BNC-ELT words in ELT corpus (family level data).

 

Collection CLT-01 makes use of 85.00% of the families in sublist SL-01, 47.80% of the families in SL-02, 23.10% of those in SL-03, 14.20% of those in SL-04, and, beyond this point, from 7.90% to 0.50% of the families in the remaining sublists. In contrast, collection CLT-08 makes use of significantly more families in every one of the sublists in the BNC-ELT list: about double for SL-02 and SL-03, triple for SL-04 through SL-06, quadruple for SL-07 through SL-09, and so on.

When considering the use of actual types (words) from each sublist in the BNC-ELT list, the data shown in Table 8 becomes more uniform although it still correlates with trends seen in Tables 6 and 7.

| Code | CLT-01 | CLT-02 | CLT-03 | CLT-04 | CLT-05 | CLT-06 | CLT-07 | CLT-08 | ELT Corpus |
|---|---|---|---|---|---|---|---|---|---|
| SL-01 | 26.54% | 28.39% | 48.72% | 41.73% | 52.03% | 37.15% | 53.04% | 39.74% | 71.88% |
| SL-02 | 12.05% | 17.63% | 40.78% | 27.96% | 44.06% | 25.73% | 44.09% | 27.71% | 68.34% |
| SL-03 | 6.24% | 14.43% | 38.90% | 16.29% | 40.85% | 16.67% | 32.79% | 16.27% | 65.80% |
| SL-04 | 3.76% | 9.07% | 28.57% | 13.32% | 29.44% | 12.74% | 28.97% | 13.58% | 57.99% |
| SL-05 | 2.33% | 6.99% | 23.12% | 8.59% | 23.72% | 8.92% | 22.62% | 9.44% | 50.55% |
| SL-06 | 2.12% | 6.01% | 20.89% | 6.78% | 20.78% | 7.04% | 18.20% | 7.33% | 46.94% |
| SL-07 | 1.28% | 4.22% | 15.77% | 4.10% | 16.44% | 5.17% | 14.52% | 5.93% | 39.24% |
| SL-08 | 1.07% | 3.07% | 14.10% | 4.42% | 14.25% | 4.23% | 12.06% | 4.73% | 36.00% |
| SL-09 | 0.95% | 2.46% | 13.90% | 2.92% | 13.57% | 3.05% | 10.43% | 4.33% | 33.77% |
| SL-10 | 0.63% | 2.64% | 12.68% | 3.35% | 12.43% | 3.20% | 9.89% | 3.70% | 32.57% |
| SL-11 | 0.47% | 1.90% | 10.95% | 2.58% | 11.42% | 2.61% | 9.81% | 2.86% | 29.35% |
| SL-12 | 0.39% | 1.75% | 8.14% | 1.64% | 9.27% | 2.49% | 6.50% | 2.45% | 24.65% |
| SL-13 | 0.33% | 2.39% | 9.27% | 1.94% | 9.93% | 2.47% | 7.09% | 2.76% | 25.85% |
| SL-14 | 0.22% | 1.17% | 5.44% | 1.30% | 5.48% | 1.89% | 5.53% | 2.25% | 17.44% |
| BNC-ELT | 6.18% | 9.66% | 24.94% | 13.34% | 26.14% | 12.70% | 24.29% | 13.57% | 48.14% |

Table 8. Use of BNC-ELT types in ELT corpus (type level data).

From the data shown in Table 3, we know that the number of types per sublist decreases according to the relative frequency of the sublist. Sublist SL-01 contains 6,348 types and sublist SL-02 contains 5,593 types while sublists SL-13 and SL-14 contain 2,426 and 2,225 types, respectively. Resolving the percentages shown in Table 8, we see that about 1,685 types from sublist SL-01 (26.54% of 6,348) are used in collection CLT-01 while 2,523 types from the same sublist are used in collection CLT-08. Considering all 14 sublists together, collection CLT-01 uses 3,127 types from the BNC-ELT list and this number accounts for 99.64% of the collection (see Table 5). In contrast, collection CLT-08 uses more than double that number, specifically 6,866 types, from the BNC-ELT list, accounting for 97.42% of the collection. It is easy to see that collection CLT-08 uses a larger vocabulary than collection CLT-01.
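Resolving a percentage into a type count is a single multiplication of the sublist size (Table 3) by the share of its types used (Table 8). As a brief check of the whole-list totals and of the CLT-08 figure for SL-01 quoted above, the arithmetic can be sketched as follows; the dictionary names and the `resolve` helper are introduced here purely for illustration.

```python
# Number of types per list, from Table 3
LIST_TYPES = {"SL-01": 6348, "BNC-ELT": 50598}

# Share of those types used by a collection (%), from Table 8
SHARE = {
    ("CLT-08", "SL-01"): 39.74,
    ("CLT-01", "BNC-ELT"): 6.18,
    ("CLT-08", "BNC-ELT"): 13.57,
}

def resolve(collection, sublist):
    """Convert a Table 8 percentage back into a rounded number of types."""
    return round(LIST_TYPES[sublist] * SHARE[(collection, sublist)] / 100)

clt01_total = resolve("CLT-01", "BNC-ELT")
clt08_total = resolve("CLT-08", "BNC-ELT")
print(resolve("CLT-08", "SL-01"), clt01_total, clt08_total)
print(round(clt08_total / clt01_total, 2))  # ratio of vocabulary sizes
```

The ratio printed on the last line is what underlies the statement that CLT-08 uses more than double the number of types used by CLT-01.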

As we mentioned, results must be interpreted with caution because the relationship between the frequency of words and their “difficulty” is not necessarily unequivocal. With this in mind, we now take into account the results from all collections. The profiling information obtained does not show a uniform progression from CLT-01 to CLT-08 (Tables 5, 7, and 8). For example, CLT-03 uses 24.94% of the types in the BNC-ELT list while CLT-08 uses a little more than half that amount (Table 8). Given that CLT-03 (Family movie scripts) makes use of a more varied vocabulary than CLT-08 (New York Times articles), our second proposition proves invalid as stated. It could not be otherwise: a New York Times article is almost necessarily more difficult for a learner than a family movie and, while this might not be so in particular instances, it is certainly so in general. Since these two collections contain not one but 19 scripts and 58 articles respectively, we must find a reason for the counter-intuitive results.

The explanation, and the reason for caution in interpretation, reside in at least two observations: first, frequency is a measure of probability of occurrence; second, frequency can influence semantic precision and collocational variation. Regarding probability, the word astronaut has a frequency of less than one occurrence per million words (Leech et al., 2001), meaning that any given token in a randomly chosen text has about a 1/1,000,000 chance of being astronaut (in contrast with the word people, for which the chance is above 1/1,000). If a text (ELT or otherwise) of, say, 300 tokens in length contains several occurrences of the type astronaut, we know that this word is behaving abnormally, i.e. it is defying its probability of occurrence. Profiling allows us to detect these divergences automatically and, as Nation (2006) points out, there is cause to consider whether the appearance of infrequent types in a text might imply that they are ‘topic words’ or, simply, words of particular import. A text that includes several occurrences of the word astronaut is likely to be related, in some way or other, to space exploration.
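The arithmetic behind this observation can be made explicit. What follows is a minimal sketch, not part of the original analysis: it assumes that tokens are independent draws (a binomial model, which real text does not obey), and it takes a rate of exactly 1 per million for astronaut and exactly 1,000 per million for people, whereas the text only gives these figures as an upper and a lower bound. The function name and constants are illustrative.

```python
from math import comb

def prob_at_least(k: int, n_tokens: int, per_million: float) -> float:
    """Chance that a word occurs at least k times in n_tokens tokens.

    Assumes independent tokens (binomial model). The model only shows what
    "normal" behavior would look like for a word of the given frequency.
    """
    p = per_million / 1_000_000
    fewer_than_k = sum(
        comb(n_tokens, i) * p**i * (1 - p) ** (n_tokens - i) for i in range(k)
    )
    return 1.0 - fewer_than_k

TEXT_LENGTH = 300        # tokens, as in the example in the text
ASTRONAUT_RATE = 1.0     # per million; assumed upper bound
PEOPLE_RATE = 1000.0     # per million; assumed lower bound

expected_astronaut = TEXT_LENGTH * ASTRONAUT_RATE / 1_000_000
p_astronaut_once = prob_at_least(1, TEXT_LENGTH, ASTRONAUT_RATE)
p_astronaut_thrice = prob_at_least(3, TEXT_LENGTH, ASTRONAUT_RATE)
p_people_once = prob_at_least(1, TEXT_LENGTH, PEOPLE_RATE)

print(f"expected occurrences of 'astronaut': {expected_astronaut:.4f}")
print(f"P('astronaut' at least once):  {p_astronaut_once:.6f}")
print(f"P('astronaut' at least thrice): {p_astronaut_thrice:.2e}")
print(f"P('people' at least once):     {p_people_once:.4f}")
```

Under these assumptions, three or more occurrences of astronaut in a 300-token text are vanishingly unlikely by chance alone, which is why repeated occurrences are better read as a signal of topic.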

The flip side of probability of occurrence is that it increases in tandem with the size of a text, that is, the larger the text, the better the chances that infrequent words appear. Simply put, the collection CLT-03 contains 325,581 tokens while CLT-08 contains 48,701 tokens, that is, CLT-08 amounts to 14.95% of the size of CLT-03. And so, the probability of occurrence of infrequent words is much larger in CLT-03 than it is in CLT-08 even though the former collection is deemed to be easier for learners. The data shown in Tables 5, 7, and 8 reflect the effect of size.
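The size effect can be illustrated with the two token counts given above. This is a sketch under stated assumptions rather than a reproduction of the authors' procedure: tokens are treated as independent, and the rate of 1 occurrence per million is an arbitrary stand-in for "an infrequent word". The function and constant names are hypothetical.

```python
from math import expm1, log1p

def prob_appears(n_tokens: int, per_million: float) -> float:
    """Chance that a word of the given rate appears at least once in n_tokens.

    Computes 1 - (1 - p)**n under an independence assumption, using
    log1p/expm1 so the result stays accurate for very small p.
    """
    p = per_million / 1_000_000
    return -expm1(n_tokens * log1p(-p))

CLT_03_TOKENS = 325_581  # family movie scripts (from the text)
CLT_08_TOKENS = 48_701   # New York Times articles (from the text)
RARE_RATE = 1.0          # per million; illustrative assumption

size_ratio = CLT_08_TOKENS / CLT_03_TOKENS
p_clt03 = prob_appears(CLT_03_TOKENS, RARE_RATE)
p_clt08 = prob_appears(CLT_08_TOKENS, RARE_RATE)

print(f"CLT-08 / CLT-03 size ratio: {size_ratio:.2%}")
print(f"P(rare word appears in CLT-03): {p_clt03:.3f}")
print(f"P(rare word appears in CLT-08): {p_clt08:.3f}")
```

For any single infrequent word, the larger collection is several times more likely to contain it, so type counts from collections of unequal size are not directly comparable.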

Semantic precision and collocational variation generally go hand in hand. A word such as point can be used as a noun and as a (transitive and intransitive) verb, each part of speech dictating different collocational partners. Furthermore, the word point has 46 different meanings (Webster’s New World, 2006) and its highly polysemous nature implies a wide range of collocational relationships. In contrast, the word astronaut has but one single meaning and part of speech (Webster’s New World, 2006), implying that its collocational complexity will be limited. Frequency-wise, the word point occurs 484 times per million words as a noun and 142 times per million as a verb, that is, the word point (SL-01) is over 600 times more frequent than the word astronaut (SL-11). Comparatively speaking, therefore, a higher frequency might imply a heavier learning burden. However, there are cases where the converse may also hold. The precision of meaning that infrequent words exhibit can also imply a greater degree of difficulty for a learner because of the fine distinction of meaning they might convey. Examples include fallacy (SL-08), assiduous (SL-11), adulation (SL-14), and maladroit (unlisted, 0.08 times per million), all of which have one or more approximate synonyms of relatively general meaning and applicability among more frequent words, for example, lie (SL-01), persistent (SL-04), praise (SL-03), and unskillful (SL-02), respectively.

These observations have greatly simplified the notion of learning burden as it is outside the scope of the paper. The intention has been to show that profiling should not be used to determine the difficulty of a text in general or its vocabulary in particular unless the results are interpreted with caution. Granted, results of this investigation have shown that, given similarly sized texts, it is possible to use profiling to differentiate the extreme cases, i.e. the easiest from the most difficult.

Summing up, the BNC-ELT list provides detailed and exhaustive information about the lexical composition of the ELT corpus. Unlike the GSL, it leaves minimal uninformative gaps, regardless of the difficulty of the text, ranging from 0.46% to 2.58% for each collection and 1.47% for the entire ELT corpus of 1,157,493 tokens (Table 5). Results also make it possible to formulate two (weak) propositions: first, comparatively more difficult texts demonstrate a tendency to use a larger amount of infrequent vocabulary; second, comparatively more difficult texts demonstrate a tendency to use a wider, more varied vocabulary.

The discussion so far has shown that the extension of coverage provided by the BNC-ELT (over the GSL) is informative. We now turn to the first two sublists (SL-01 and SL-02) to see what to expect when migrating from the GSL to the BNC-ELT. As previously mentioned, Nation (2004) conducted a comparison of the 3,000 most frequent families in the BNC against the GSL and AWL. His results showed that much of the content of the lists was shared and that the coverage each provided (of the corpora he employed) was quite similar. In general, the 2,000 most frequent families from the BNC provided marginally better coverage than the GSL, with the exception of fiction texts, in which case the coverage provided by the GSL was superior (again marginally) to that of the 2,000 families from the BNC.

 

Code       CLT-01   CLT-02   CLT-03   CLT-04   CLT-05   CLT-06   CLT-07   CLT-08  ELT Corpus
SL-01      91.45%   82.23%   79.27%   80.04%   82.25%   76.18%   76.84%   76.94%      80.43%
SL-02       4.23%    7.01%    7.07%    9.59%    7.01%   10.28%    9.47%    9.35%       7.65%

GSL-01     87.25%   84.39%   78.04%   79.33%   83.59%   73.74%   75.32%   75.08%      80.02%
GSL-02      5.11%    6.60%    7.87%    5.24%    6.58%    7.92%    5.65%    5.96%       6.71%

 

Table 9. Coverage of the ELT corpus by the GSL and BNC-ELT first two sublists.

 

The results shown in Table 9 concur with Nation’s. We used the first two sublists from the BNC-ELT (SL-01: 1,000 families; 6,348 types. SL-02: 1,000 families; 5,593 types) and from the GSL (GSL-01: 998 families; 4,119 types. GSL-02: 988 families; 3,708 types). Overall coverage of the ELT corpus by both lists is strikingly similar despite the fact that GSL-01 amounts to only 64.88% of the size of SL-01 and GSL-02 amounts to only 66.29% of the size of SL-02. This will not come as a surprise to those familiar with the origin and content of the GSL. It is a remarkably well-manufactured word list. Note that collections CLT-02 (short stories) and CLT-05 (novels) are marginally better served by the GSL, again in accord with Nation’s data.
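The comparison can be checked by summing the two sublists of each word list per collection. The sketch below simply transcribes the figures from Table 9 and adds them; the variable names are illustrative and the summation is our reading of "overall coverage", not a procedure spelled out in the text.

```python
COLLECTIONS = [
    "CLT-01", "CLT-02", "CLT-03", "CLT-04", "CLT-05",
    "CLT-06", "CLT-07", "CLT-08", "ELT Corpus",
]

# Token coverage (%) transcribed from Table 9, in the order of COLLECTIONS.
TABLE_9 = {
    "SL-01":  [91.45, 82.23, 79.27, 80.04, 82.25, 76.18, 76.84, 76.94, 80.43],
    "SL-02":  [4.23, 7.01, 7.07, 9.59, 7.01, 10.28, 9.47, 9.35, 7.65],
    "GSL-01": [87.25, 84.39, 78.04, 79.33, 83.59, 73.74, 75.32, 75.08, 80.02],
    "GSL-02": [5.11, 6.60, 7.87, 5.24, 6.58, 7.92, 5.65, 5.96, 6.71],
}

# Combined coverage of the first two sublists of each list.
bnc_total = [round(a + b, 2) for a, b in zip(TABLE_9["SL-01"], TABLE_9["SL-02"])]
gsl_total = [round(a + b, 2) for a, b in zip(TABLE_9["GSL-01"], TABLE_9["GSL-02"])]

# Collections for which the GSL covers more tokens than SL-01 + SL-02.
gsl_ahead = [
    name for name, bnc, gsl in zip(COLLECTIONS, bnc_total, gsl_total) if gsl > bnc
]

for name, bnc, gsl in zip(COLLECTIONS, bnc_total, gsl_total):
    print(f"{name:<11} SL-01+SL-02: {bnc:6.2f}%   GSL: {gsl:6.2f}%")
print("GSL ahead in:", gsl_ahead)
```

Summed this way, the GSL comes out ahead only in the two fiction collections, in line with the note above.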

With regard to range, it is of interest to see how these two lists (SL-01 + SL-02: 2,000 families; 11,941 types. GSL: 1,986 families; 7,827 types) work across collections. Table 10 shows the percentage of each word list that appears in all 8 collections (right-most column), only 7 collections, only 6 collections, and so on, until we are left with the percentage of words that do not appear in any collection (left-most column).

 

# of collections         0       1       2       3       4       5       6       7       8
SL-01 + SL-02        0.30%   0.60%   0.90%   2.05%   3.80%   7.45%  15.25%  24.70%  44.95%
GSL                  0.76%   0.50%   1.21%   4.03%   5.74%  10.32%  12.99%  21.90%  42.55%

 

Table 10. Amount of words from each word list that appears in up to 8 collections (range).
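A range distribution of the kind reported in Table 10 can be computed along the following lines. This is a minimal sketch with toy data: the function name, the four-word list, and the three small "collections" are all hypothetical, and the actual analysis worked with word families across 8 collections.

```python
from collections import Counter

def range_distribution(word_list, collections):
    """Percentage of word_list items found in exactly 0, 1, ..., n collections.

    word_list:   iterable of unique items (e.g. family headwords)
    collections: list of sets, each holding the items attested in one collection
    Returns a list of n + 1 percentages, indexed by number of collections.
    """
    items = list(word_list)
    n = len(collections)
    # For each item, count the collections that contain it, then tally.
    tally = Counter(sum(1 for c in collections if item in c) for item in items)
    return [100.0 * tally.get(k, 0) / len(items) for k in range(n + 1)]

# Toy data, for illustration only.
toy_list = ["point", "people", "astronaut", "fallacy"]
toy_collections = [
    {"point", "people"},
    {"point", "people", "astronaut"},
    {"point"},
]

distribution = range_distribution(toy_list, toy_collections)
for k, share in enumerate(distribution):
    print(f"in exactly {k} collection(s): {share:.2f}%")
```

The percentages necessarily sum to 100, as each row of Table 10 does.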

 

Again, results are strikingly similar for both lists, especially when one takes into consideration that only 3.4% of SL-03 is found in all 8 collections, with the trend continuing to decline sharply: 0.9% of SL-04, 0.3% of SL-05, 0.1% of SL-06, and zero beyond this point. It is with these data in mind that we echo Nation’s comment regarding the difficulty of finding a replacement for the GSL with regard to a core vocabulary of general service.

Despite the agreement in coverage and range between the GSL and SL-01 + SL-02, differences in content exist. Nation (2004) provided information from the GSL perspective, that is, showing how many of its word families could be found in the 3,000 most frequent BNC word families. His analysis also revealed, as we mentioned previously, that 80% of the AWL was also present among these 3,000 word families. Unsurprisingly, our analyses again reveal corresponding results, the only exception being the inclusion of academic vocabulary, which in our data is lowered to 67.36% (the remainder of the AWL is present in the BNC-ELT but at lower frequency levels). It appears that the reorganization of the 14,000 word families according to the spoken subcorpus of the BNC has, indeed, produced a word list more appropriate for ELT.

 

alright      Christ     kid        score
America      client     lad        Scotland
awful        county     London     switch
bet          Europe     minus      television
bloke        feed       non        thou
bother       France     okay       traffic
brilliant    Germany    pence      video
Britain      guy        pension    wee
budget       hell       quid       x
chap         Jesus      reckon

 

 

Table 11. The 39 families from SL-01 not present in the GSL + AWL.

 

For those familiar with and migrating from the GSL, Table 11 shows the 39 families in SL-01 that are not found in the GSL or AWL. The capitalized words are proper nouns, which the GSL makers intentionally excluded, as they did words such as bloke, chap, wee, or pence, on account of their lacking universality (Faucett et al., 1936). Differences increase when considering SL-02, in which 212 new families are introduced.

In conclusion, comparison of content and coverage between SL-01 + SL-02 and the GSL does not make a clear case as to which word list might be “better” with regard to the identification or isolation of a vocabulary of general service. However, the scope of the BNC-ELT is so much larger that there is no question that it provides a more detailed characterization and classification when used in vocabulary profiling. Moreover, from the perspective of the ELT practitioner, migrating from the GSL to the BNC-ELT does not involve a sacrifice of established expertise or practices, as the GSL can be considered a sublist of the BNC-ELT in terms of content and application.


Bibliography

 

Brown, G. and Yule, G. 1983. Teaching the spoken language. Cambridge: Cambridge University Press.

Carter, R. and McCarthy, M. 1988. Vocabulary and Language Teaching. New York: Longman.

Coxhead, A. 2000. A new Academic Word List. TESOL Quarterly, 34, 2, pp. 213-238.

Ellis, N. 2002. Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24, 2, pp. 143-188.

Faucett, L., H. Palmer, E.L. Thorndike, and M. West. 1936. Interim Report on Vocabulary Selection. London: P.S. King and Son, Ltd.

Gilmore, A. 2007. Authentic materials and authenticity in foreign language learning. Language Teaching, 40, pp. 97-118.

Hyland, K. and Tse, P. 2007. Is there an “Academic Vocabulary”? TESOL Quarterly, 41, 2, pp. 235-253.

Leech, G., P. Rayson, and A. Wilson. 2001. Word Frequencies in Written and Spoken English. Harlow: Pearson Education Limited.

Lorge, I. 1949. The Semantic Count of the 570 Commonest Words. New York: Teachers College, Columbia University.

Mochizuki, M. and K. Aizawa. 2000. An affix acquisitional order for EFL learners: An exploratory study. System, 28, pp. 291-304.

Nation, P. and R. Waring. 1997. Vocabulary size, text coverage, and word lists. In Schmitt and McCarthy (eds.).

Nation, I.S.P. 2001. Learning vocabulary in another language. Cambridge: Cambridge University Press.

Nation, I.S.P. 2004. A study of the most frequent word families in the British National Corpus. In Bogaards and Laufer (eds.).

Nation, I.S.P. 2006. How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63, 1, pp. 59-82.

Sakata, N. 2007. How do Japanese EFL learners comprehend derivatives?: A qualitative analysis from the perspective of vocabulary expansion. JACET Journal, 45, pp. 15-29.

Webster’s New World College Dictionary (4th ed.) 2006. Cleveland, OH: Wiley Publishing, Inc.

West, M. 1953. A General Service List of English Words. London: Longman, Green and Co.

Xue, G. and Nation, I.S.P. 1984. A university word list. Language Learning and Communication, 3, pp. 215-229.

 

 



* Nagoya University of Foreign Studies