The Buckingham Journal of Language and Linguistics Volume 3 pp 135-162

 

Functional load: Transcription and analysis of the 10,000 most frequent words in spoken English

 

Leah Gilner* and Franc Morales

 

 

Abstract

 

Not all aspects of a language have equal importance for speakers or for learners. From the point of view of language description, functional load is a construct that attempts to establish quantifiable hierarchies of relevance among elements of a linguistic class. This paper makes use of analyses conducted on the 10-million-word spoken subcorpus of the British National Corpus in order to characterize what amounts to approximately 97% of the phonological forms and components heard and produced by fluent speakers in a range of contexts. Our aim is to provide segmental, sequential, and syllabic level rankings of spoken English that can serve as the basis for reference and subsequent work by language educators and researchers.

 

Introduction

 

It has been posited that there are at least two important reasons why pronunciation is not being taught and why learners are left to their own devices when it comes to this crucial component of spoken interaction. First, there is a lack of understanding regarding what aspects of pronunciation have the most value for learners (Breitkreutz et al., 2002; Jenkins, 2000; MacDonald, 2002). Second, teaching pronunciation is apparently more prone to marginalization than other aspects of language instruction (Fraser, 2002; Setter and Jenkins, 2005). These two observations are interrelated. After all, if the approach adopted by teachers, program curricula, or materials is unsystematic and lacks a rationale concerning sequencing or selection of priorities, it is understandable that teachers and programs avoid pronunciation instruction and that little time and few resources are dedicated to it. This problem is compounded by the observation made by Derwing and Munro (2005, p. 383) that there is “little published research on pronunciation teaching and very little reliance on the research that does exist”.

Regardless of the preparation of teachers and the shortcomings of curricula and materials, the absence of priorities in contemporary education does not imply that these do not exist, since it is evident that not all elements of a language have equal bearing in its realization. In some languages, for example, vowels dominate word formation while in others consonants do. Moreover, not all vowels play the same role in word formation, as in English, where four vowels do more work than the remainder of the vowel class together. These observations apply to all elements of a language as well as to its realization. The importance of these observations, for both fluent speakers and learners, has been noted by a number of researchers (Catford, 1987; Kitahara, 2008; Stokes and Surendran, 2005).

Additionally, it has been observed that when speakers use language they do so by exercising selection preferences that give prominence to certain features (George, 1997; Leech et al., 2001; Nation, 2004; Sinclair, 1991). Investigation into the frequency of lexical occurrence in language use reveals that those features that work extensively in, for example, word formation are not necessarily prominent in language use, and vice versa. In English, for instance, the segment /ð/ plays a very small role in word formation, there being but a few words that include this segment. However, inspection of language in use shows that this sound is one of the most frequently heard and produced. The study reported in this paper takes into consideration these two modes of quantifying language (i.e., with and without accounting for frequency of occurrence), regarding them as complementary since each is able to offer information that the other one cannot.

The goal of this study is to contribute to the understanding and assistance of the development of perceptive and productive pronunciation skills. To this end, the 10,000 most frequent words in spoken English (as represented by the British National Corpus) have been identified, transcribed, and analyzed. In this manner, the study focuses on words in isolation rather than connected speech. Segments, clusters, and syllables have been investigated based on their role in word formation as well as their frequency of occurrence in language use. The presentation of results makes extensive use of the construct of functional load (FL) because of its roots in phonetic tradition (see Surendran and Niyogi, 2003 for discussion) and its applicability in pronunciation skill instruction and assessment (Brown, 1991; Catford, 1987; Munro and Derwing, 2006).

FL has been variously defined (Catford, 1987; King, 1967; Hockett, 1955), although the definitions share common ground. FL can be formulated as a means of quantifying the relative amount of work that elements from a linguistic class do in the language. For instance, if one considers the class of vowels in the context of word formation, a measure of FL reveals that the high-front and reduced vowels are used more often in the lexicon than any other vowels and, thus, do more work. Conversely, FL can be conceptualized as the amount of information lost if elements are eliminated from a linguistic class (Surendran, 2003). Regarding phonemic contrasts, for example, FL reveals that the conflation of the segments /d/ and /z/ would make it impossible to distinguish (in isolation) a larger number of words than the conflation of any other two consonants, thereby making this contrast of greater relevance in production and processing.

The usefulness of FL can be appreciated in the findings from two recent studies, Stokes and Surendran (2005) and Munro and Derwing (2006). The first study tested a range of measures in the prediction of the age of emergence of consonants among English-speaking children, finding that the FL measure was the best indicator. The second study investigated the relationship between FL and speech production in ESL adult learners, concluding that “high functional load errors had a greater impact on listeners’ perceptions of the accentedness and comprehensibility of L2 speech than did low functional load errors” (Munro and Derwing, 2006, p. 529).

The FL rankings used by Munro and Derwing (2006) came from Brown (1991) which, in turn, are based on raw analyses of language undertaken by Denes (1963). One of the motivations for the present investigation is that Denes’ study is one of a kind, thus forcing modern studies (for example, Munro and Derwing, 2006) to use data collected and analyzed some half century ago by a single researcher. A second motivation, and possibly of greater importance, is that inspection of the descriptive study presented in Denes (1963) shows that the size of the language sample used in the analyses was limited to 23,052 tokens (running words) and that the source of the sample was written material from two readers “prepared for teaching English to foreign students” (Denes, 1963, p. 893). The study presented in this paper uses a language sample approximately 400 times larger (9,174,650 running words) and, importantly, the source of the sample is actual spoken language, specifically, spontaneous conversation and task-oriented speech (Leech et al., 2001).

Stokes and Surendran’s (2005) raw analyses are of more recent origin (although mostly from written sources) but they are unavailable. This leaves the field without an up-to-date phonetic description of spoken English. Our interest, therefore, centers on the elicitation of a raw description of spoken language that uses a spoken corpus as its sole source and that is based on a sizeable amount of actual language in use.

 

Methodology

 

The data set comprised the 10,000 most frequent unlemmatized words from the analyses conducted by Kilgariff (1995) on the 10-million-word spoken subcorpus of the British National Corpus (BNC). The use of unlemmatized forms ensured that the study is faithful to the actual words produced by fluent speakers and, therefore, to those words that learners will ultimately be faced with. Note that Kilgariff’s word list includes certain words that we have excluded from our data set. Specifically, we have dismissed non-words (e.g. er, mm, or ah), unresolvable contracted forms (e.g. ‘s, ‘ll, or ‘ve), proper nouns (e.g. Leicester, Banbury, or Nottinghamshire), and lexical phrases (e.g. a bit, of course, or as well). Together, we estimate that the dismissed entries reduced the size of the subcorpus from 10,365,623 tokens to 9,399,232 tokens. In this manner, the occurrence of the 10,000 words used in this study amounts to 97.61% (9,174,650 tokens) of the total running words (tokens) in the subcorpus.
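The coverage figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the token counts reported in the text; the variable and function names are our own and purely illustrative.

```python
# Token counts as reported in the text.
SUBCORPUS_TOKENS_ALL = 10_365_623    # word list before any entries were dismissed
SUBCORPUS_TOKENS_KEPT = 9_399_232    # after dismissing non-words, proper nouns, etc.
DATASET_TOKENS = 9_174_650           # tokens covered by the 10,000 words


def coverage(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


dataset_coverage = coverage(DATASET_TOKENS, SUBCORPUS_TOKENS_KEPT)
dismissed_tokens = SUBCORPUS_TOKENS_ALL - SUBCORPUS_TOKENS_KEPT

print(f"coverage of the data set: {dataset_coverage:.2f}%")
print(f"tokens dismissed: {dismissed_tokens:,}")
```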

 

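Table 1. The vowel system of NAE (shaded areas = +round)

| | Front | Front | Central | Back | Back |
|---|---|---|---|---|---|
| High | i | ɪ | | ʊ | u |
| Mid | eɪ | ɛ | ə, ʌ | ɔ | oʊ, ɔɪ |
| Low | | æ, aɪ | aʊ | ɑ | |
| | Tense | Lax | | Lax | Tense |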

 

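Table 2. The consonant system of NAE

| Manner of Articulation | | Place of Articulation | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | Bilabial | Labiodental | Interdental | Alveolar | Palatal | Velar | Glottal |
| Plosive | −voice | p | | | t | | k | |
| | +voice | b | | | d | | g | |
| Fricative | −voice | | f | θ | s | ʃ | | h |
| | +voice | | v | ð | z | ʒ | | |
| Affricate | −voice | | | | | tʃ | | |
| | +voice | | | | | dʒ | | |
| Nasal | +voice | m | | | n | | ŋ | |
| Lateral approximant | +voice | | | | l | | | |
| Approximant | +voice | | | | r | | | |
| Glide | +voice | w | | | | j | | |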

 

Transcription procedures followed those described in Gilner and Morales (2008). Each of the 10,000 words was transcribed in broad citation form based on a North American English dialect model (Tables 1 and 2). All transcriptions included syllable boundaries and, if applicable, primary and secondary stress information. The vagaries of syllable boundary identification (Kreidler, 1997; Kreidler, 2004; Ladefoged, 2001) were addressed by consistent application of the Maximum Onset Principle (Anderson, 1982; Pulgram, 1970; Yavaş, 2006), that is, intervocalic consonants were affiliated with syllable-initial positions rather than syllable-final positions whenever the result was a clustering of consonants in accord with the phonotactic constraints outlined by Kreidler (1997, 2004). Syllables were also consistently transcribed with a vowel nucleus so that syllabic consonants were transcribed as schwa + consonant for the purposes of this study. The transcription procedure was conducted manually and meticulously: each word was inspected twice by both authors. Additionally, custom software was developed to facilitate this process and, notably, included a range of background integrity checks aimed at flagging faults and inconsistencies. The size of the task made human error an understandable concern and, therefore, the amount of work invested in securing the accuracy of the transcriptions was substantial.

Once the transcription process was completed, additional custom software was developed to carry out the analyses hereafter presented.
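The Maximum Onset Principle mentioned above can be illustrated with a short sketch. This is not the custom software used in the study; it is a minimal illustration under stated assumptions: segments are given as a list, the vowel inventory follows Table 1, and `LEGAL_ONSETS` is a hypothetical and deliberately incomplete stand-in for the phonotactic constraints outlined by Kreidler. Stress marks are ignored.

```python
# Vowel inventory as listed in Table 1.
VOWELS = {"i", "ɪ", "eɪ", "ɛ", "æ", "aɪ", "ə", "ʌ", "u", "ʊ", "oʊ", "ɔ", "ɔɪ", "aʊ", "ɑ"}

# Hypothetical, incomplete onset inventory (assumption for illustration only).
LEGAL_ONSETS = {
    ("p",), ("t",), ("k",), ("s",), ("n",), ("r",), ("l",), ("f",),
    ("t", "r"), ("s", "t"), ("s", "t", "r"), ("p", "l"),
}


def syllabify(segments, vowels=VOWELS, legal_onsets=LEGAL_ONSETS):
    """Split a segment list into syllables, giving each nucleus the longest legal onset."""
    nuclei = [i for i, seg in enumerate(segments) if seg in vowels]
    if not nuclei:
        return [list(segments)]
    boundaries = [0]
    for left, right in zip(nuclei, nuclei[1:]):
        split = right  # fallback: every intervocalic consonant stays in the coda
        # Try the longest candidate onset first, then progressively shorter ones.
        for start in range(left + 1, right):
            if tuple(segments[start:right]) in legal_onsets:
                split = start
                break
        boundaries.append(split)
    boundaries.append(len(segments))
    return [list(segments[a:b]) for a, b in zip(boundaries, boundaries[1:])]


# Assumed broad transcriptions of 'extra' and 'afternoon'.
extra = syllabify(["ɛ", "k", "s", "t", "r", "ə"])
afternoon = syllabify(["æ", "f", "t", "ə", "r", "n", "u", "n"])
print(extra)
print(afternoon)
```

In the first example the cluster /kstr/ is split as /k/ + /str/ because /str/ is the longest suffix of the cluster found in the assumed onset inventory; with a different inventory the boundary would move accordingly.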

 

General characteristics of the data set

 

As mentioned, the 10,000 words (types) in the data set account for 97.61% (9,174,650 tokens) of the total running words (tokens) in the BNC subcorpus. Table 3 shows the number of types and tokens for unique orthographic and transcribed forms.

 

Table 3. Number of types and tokens in the data set

| | Types | Tokens |
|---|---|---|
| Words | 10,000 | 9,174,650 |
| Transcriptions | 9,738 | 9,174,650 |

 

The number of transcriptions is smaller than the number of (orthographic) words because of the presence of 504 homophones although, naturally, the number of tokens is equal for both orthographic and transcribed forms. These 504 homophones (226 pairs, 12 triplets, 4 quadruplets) are distributed as follows: 332 monosyllabic words (~65.9% of the 504), 114 disyllabic words (~22.6%), 48 trisyllabic words (~9.5%), 8 tetrasyllabic words (~1.6%), and 2 pentasyllabic words (~0.4%).
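The relationship between the 504 homophones and the 9,738 unique transcriptions can be made explicit. The sketch below first groups a toy lexicon by transcription (the words and transcriptions are illustrative assumptions, not entries copied from the data set) and then checks the arithmetic using only the counts reported above.

```python
from collections import defaultdict


def homophone_sets(lexicon):
    """Group orthographic words by transcription; keep groups of two or more words."""
    by_transcription = defaultdict(list)
    for word, transcription in lexicon.items():
        by_transcription[transcription].append(word)
    return {t: words for t, words in by_transcription.items() if len(words) > 1}


# Toy lexicon with assumed broad transcriptions.
toy_lexicon = {"to": "tu", "two": "tu", "too": "tu", "see": "si", "sea": "si", "cat": "kæt"}
toy_sets = homophone_sets(toy_lexicon)

# Counts reported in the text: set size -> number of sets.
SET_SIZES = {2: 226, 3: 12, 4: 4}
homophone_words = sum(size * n for size, n in SET_SIZES.items())
homophone_groups = sum(SET_SIZES.values())
# Each homophone set contributes one transcription regardless of its size.
unique_transcriptions = 10_000 - homophone_words + homophone_groups

print(homophone_words, homophone_groups, unique_transcriptions)
```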

From this point on, results from analyses will be reported using side-by-side tables. The table on the left will reflect the data set as a collection of words without regard to their frequency of occurrence in language use. This will, for example, allow us to determine the amount of work particular segments do in word formation. The quantities reported in the table on the right will take into consideration the frequencies with which the words in the data set occur in the language as it is used. This will, for example, allow us to estimate the amount of work particular segments do in language use.

Results are reported by providing raw quantities together with simple descriptive statistics to assist interpretation. We hope that this approach facilitates subsequent application and work by others. Thus, results from analyses are given by providing actual amounts as these occur in the data and, for ease of interpretation, the percentage share that each element contributes to the whole. Last, we have adopted a measure of FL similar to Catford’s (1987), that is, the element with the highest amount is assigned a FL value of 1 while the FL values of other elements are made proportional to this value.

 

Table 4. Breakdown of data set by number of syllables

| Types | | | | Tokens | | | |
|---|---|---|---|---|---|---|---|
| Syllable # | Amount | Share | FL | Syllable # | Amount | Share | FL |
| 2 | 4,100 | 41.000% | 1.00 | 1 | 7,281,845 | 79.369% | 1.00 |
| 1 | 2,824 | 28.240% | 0.69 | 2 | 1,376,872 | 15.007% | 0.19 |
| 3 | 2,059 | 20.590% | 0.50 | 3 | 378,841 | 4.129% | 0.05 |
| 4 | 782 | 7.820% | 0.19 | 4 | 111,751 | 1.218% | 0.02 |
| 5 | 209 | 2.090% | 0.05 | 5 | 23,581 | 0.257% | 0.00 |
| 6 | 24 | 0.240% | 0.01 | 6 | 1,664 | 0.018% | 0.00 |
| 7 | 1 | 0.010% | 0.00 | 8 | 49 | 0.001% | 0.00 |
| 8 | 1 | 0.010% | 0.00 | 7 | 47 | 0.001% | 0.00 |
| Total | 10,000 | 100.000% | | Total | 9,174,650 | 100.000% | |

 

 

Table 4 presents a breakdown of the words in the data set by number of syllables (ranked according to FL). As just mentioned, analyses provide two views, namely, with and without considering frequency of occurrence. We can already appreciate a difference between the word choices made by fluent speakers (Types) and the frequency with which fluent speakers choose to use these words (Tokens). In terms of types, disyllabic words have the highest FL while, in terms of tokens, monosyllabic words have the highest FL.

From the entire lexicon, the 10,000 most frequent words preferred by speakers are largely disyllabics (41.0%), followed by monosyllabics (~28.2%) and trisyllabics (~20.6%). This distribution contrasts with the use speakers make of these words. Monosyllabics clearly dominate the utterances produced (~79.4%). If frequencies in language use were to be uniform (they are not by any means), each monosyllabic word would be used an average of 2,747 times in the collection of samples that makes up the subcorpus while each disyllabic word would be used an average of 340 times. Equally revealing, trisyllabic words amount to ~20.6% of the words in the data set but only to ~4.1% of those occurring in language use.
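The Share and FL columns used throughout the tables follow directly from the raw amounts: the share is each amount divided by the total, and the FL value is each amount divided by the largest amount. The following is a minimal sketch of that calculation, applied to the type counts of Table 4; the function name is our own.

```python
def share_and_fl(amounts):
    """Return rows (label, amount, share in %, FL) ranked by amount, largest first."""
    total = sum(amounts.values())
    top = max(amounts.values())
    ranked = sorted(amounts.items(), key=lambda item: item[1], reverse=True)
    return [(label, amount, 100.0 * amount / total, amount / top) for label, amount in ranked]


# Number of word types per syllable count (Table 4, left side).
types_by_syllables = {1: 2824, 2: 4100, 3: 2059, 4: 782, 5: 209, 6: 24, 7: 1, 8: 1}

rows = share_and_fl(types_by_syllables)
for label, amount, share, fl in rows:
    print(f"{label}\t{amount:,}\t{share:.3f}%\t{fl:.2f}")
```

The same function can be applied unchanged to token counts, segment counts, or MP counts, since only the amounts differ.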

 

Segments

 

The words that form the data set are made up of 59,793 segments, 21,533 vowels and 38,260 consonants, where the counts refer to occurrence of segments in word formation (in word types). For instance, the consonant /n/ occurs twice in the word afternoon and is, therefore, counted twice. When taking into consideration language in use, that is, the frequency of occurrence of the word afternoon, the consonant /n/ receives a value of 3,078 (2 x 1,539 where 1,539 is the frequency of the word afternoon). In this manner, English segments, as they occur in word tokens (i.e. as they occur in the language sample captured by the subcorpus), amount to 29,861,586 instances, 11,747,726 vowels and 18,113,860 consonants.
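The two counting modes described above can be sketched as follows. The lexicon format (word mapped to a list of segments and a frequency) is an assumption made for illustration, and the transcription of afternoon is a simplified broad form without stress or syllable marks; only the frequency (1,539) is taken from the text.

```python
from collections import Counter


def segment_counts(lexicon):
    """Count segments in word types and in word tokens.

    lexicon maps each word to (segments, frequency).
    """
    type_counts, token_counts = Counter(), Counter()
    for segments, frequency in lexicon.values():
        for segment in segments:
            type_counts[segment] += 1            # one count per occurrence in the word
            token_counts[segment] += frequency   # weighted by how often the word is used
    return type_counts, token_counts


# Assumed transcription; frequency as reported in the text.
lexicon = {"afternoon": (["æ", "f", "t", "ə", "r", "n", "u", "n"], 1539)}
type_counts, token_counts = segment_counts(lexicon)
print(type_counts["n"], token_counts["n"])
```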

The FL vowel/consonant ratios are 1:1.78 in word types and 1:1.54 in word tokens. Collectively, consonants do significantly more work in both word formation and language use. There are, of course, more consonants (n = 24) than vowels (n = 15). If we were to assume that all segments were employed with equal frequency (they are not), each vowel would appear an average of ~1,435 times in word types and ~783,182 times in word tokens while each consonant would appear an average of ~1,594 times in word types and ~754,745 times in word tokens.
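The ratios and per-segment averages in the preceding paragraph follow from the segment totals. A brief sketch of the arithmetic, using only figures reported in the text (the names are our own):

```python
TYPES = {"vowels": 21_533, "consonants": 38_260}
TOKENS = {"vowels": 11_747_726, "consonants": 18_113_860}
INVENTORY = {"vowels": 15, "consonants": 24}


def consonant_to_vowel_ratio(counts):
    """Number of consonant occurrences per vowel occurrence."""
    return counts["consonants"] / counts["vowels"]


def average_per_segment(counts, segment_class):
    """Average occurrences per segment if all segments in the class were equally frequent."""
    return counts[segment_class] / INVENTORY[segment_class]


ratio_types = consonant_to_vowel_ratio(TYPES)
ratio_tokens = consonant_to_vowel_ratio(TOKENS)
print(f"types  1:{ratio_types:.2f}")
print(f"tokens 1:{ratio_tokens:.2f}")
for name, counts in (("types", TYPES), ("tokens", TOKENS)):
    for segment_class in ("vowels", "consonants"):
        print(name, segment_class, f"{average_per_segment(counts, segment_class):,.1f}")
```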

Naturally, neither individual vowels nor consonants occur with equal frequency in word formation (in word types) or in language use (in word tokens). Table 5 provides a summary of results for vowel segments.

In word formation (in word types), the top four vowels account for ~62.7% of all occurrences. In language use (in word tokens), there is a rearrangement of the segments according to FL, particularly noticeable in the values of central vowels. If clustered by vowel type, front vowels amount to half of all occurrences in both types and tokens (~52.9% and ~52.4%, respectively). The two central vowels, however, drop eight percentage points (from ~26.6% in word types to ~18.8% in word tokens) in favor of back vowels and diphthongs. In other words, back vowels and diphthongs do more work in language use than they do in word formation.

Table 5. Frequency of occurrence of vowels

| In word types | | | | In word tokens | | | |
|---|---|---|---|---|---|---|---|
| Segment | Amount | Share | FL | Segment | Amount | Share | FL |
| ə | 4,623 | 21.47% | 1.00 | ɪ | 1,726,282 | 14.69% | 1.00 |
| ɪ | 4,523 | 21.00% | 0.98 | i | 1,624,791 | 13.83% | 0.94 |
| ɛ | 2,192 | 10.18% | 0.47 | æ | 1,152,510 | 9.81% | 0.67 |
| i | 2,164 | 10.05% | 0.47 | ə | 1,122,571 | 9.56% | 0.65 |
| æ | 1,341 | 6.23% | 0.29 | ʌ | 1,083,276 | 9.22% | 0.63 |
| | 1,177 | 5.47% | 0.25 | ɛ | 981,996 | 8.36% | 0.57 |
| ʌ | 1,100 | 5.11% | 0.24 | u | 834,943 | 7.11% | 0.48 |
| ɑ | 966 | 4.49% | 0.21 | | 742,069 | 6.32% | 0.43 |
| | 923 | 4.29% | 0.20 | | 670,444 | 5.71% | 0.39 |
| ɔ | 741 | 3.44% | 0.16 | ɔ | 573,513 | 4.88% | 0.33 |
| | 731 | 3.39% | 0.16 | | 460,022 | 3.92% | 0.27 |
| u | 664 | 3.08% | 0.14 | ɑ | 383,294 | 3.26% | 0.22 |
| | 202 | 0.94% | 0.04 | | 219,047 | 1.86% | 0.13 |
| ɔɪ | 101 | 0.47% | 0.02 | ʊ | 148,300 | 1.26% | 0.09 |
| ʊ | 84 | 0.39% | 0.02 | ɔɪ | 24,668 | 0.21% | 0.01 |
| Total | 21,533 | 100.00% | | Total | 11,747,726 | 100.00% | |

Similar but more uniform trends of distribution can be observed for consonants, possibly due to the larger number of elements in the class. Table 6 shows that the top four consonants account for ~43.7% of all occurrences in word formation (in word types). We also observe a rearrangement of segments according to FL in language use, particularly in the case of the voiced interdental fricative. The disparity of values for /ð/ is well known. Very few words (n = 66 or 0.17% of the total) in the language have this segment but these words are extremely frequent in use (n = 1,036,575 or 5.72% of the total).

The data in Table 6 shows that obstruents do more work than sonorants and that voiced consonants do more work than voiceless consonants. This is so regardless of whether we consider their role in word formation or in language use and, in all four cases, FL values coincide at an approximate 3:2 ratio.

From the point of view of place of articulation, alveolars account for ~63.1% of consonants in word formation (in word types) and ~56.7% in language use (in word tokens). Labials account for ~18.3% and ~19.8%, respectively, while velars/glottals account for ~12.3% and ~10.9%, respectively. Palatals also maintain their presence in both cases and do so at ~5.5%. The significant change takes place in interdentals as already mentioned.

Regarding manner of articulation, plosives account for ~33.7% of consonants in word formation and ~32.5% in language use, fricatives account for ~23.4% and ~26.3% respectively, liquids account for ~20.7% and ~14.9%, nasals for ~17.3% in both cases, affricates for ~2.4% and ~1.5%, and glides for ~2.4% and 7.6%.

 

Table 6. Frequency of occurrence of consonants

| In word types | | | | In word tokens | | | |
|---|---|---|---|---|---|---|---|
| Segment | Amount | Share | FL | Segment | Amount | Share | FL |
| r | 4,931 | 12.89% | 1.00 | t | 2,371,952 | 13.09% | 1.00 |
| t | 4,063 | 10.62% | 0.82 | n | 2,026,751 | 11.19% | 0.85 |
| n | 3,961 | 10.35% | 0.80 | r | 1,706,548 | 9.42% | 0.72 |
| s | 3,771 | 9.86% | 0.76 | d | 1,259,039 | 6.95% | 0.53 |
| l | 3,000 | 7.84% | 0.61 | s | 1,220,978 | 6.74% | 0.51 |
| k | 2,756 | 7.20% | 0.56 | ð | 1,036,575 | 5.72% | 0.44 |
| d | 2,551 | 6.67% | 0.52 | l | 987,701 | 5.45% | 0.42 |
| z | 1,874 | 4.90% | 0.38 | k | 881,913 | 4.87% | 0.37 |
| p | 1,864 | 4.87% | 0.38 | w | 849,144 | 4.69% | 0.36 |
| m | 1,704 | 4.45% | 0.35 | m | 752,233 | 4.15% | 0.32 |
| b | 1,050 | 2.74% | 0.21 | z | 691,865 | 3.82% | 0.29 |
| ŋ | 970 | 2.54% | 0.20 | j | 529,299 | 2.92% | 0.22 |
| f | 968 | 2.53% | 0.20 | b | 529,151 | 2.92% | 0.22 |
| v | 830 | 2.17% | 0.17 | p | 516,677 | 2.85% | 0.22 |
| ʃ | 803 | 2.10% | 0.16 | v | 487,260 | 2.69% | 0.21 |
| g | 611 | 1.60% | 0.12 | f | 459,060 | 2.53% | 0.19 |
| w | 573 | 1.50% | 0.12 | h | 424,178 | 2.34% | 0.18 |
| | 537 | 1.40% | 0.11 | ŋ | 353,916 | 1.95% | 0.15 |
| h | 379 | 0.99% | 0.08 | g | 321,165 | 1.77% | 0.14 |
| | 369 | 0.96% | 0.07 | θ | 238,255 | 1.32% | 0.10 |
| j | 353 | 0.92% | 0.07 | ʃ | 190,148 | 1.05% | 0.08 |
| θ | 219 | 0.57% | 0.04 | | 136,628 | 0.75% | 0.06 |
| ð | 66 | 0.17% | 0.01 | | 134,397 | 0.74% | 0.06 |
| ʒ | 57 | 0.15% | 0.01 | ʒ | 9,027 | 0.05% | 0.00 |
| Total | 38,260 | 100.00% | | Total | 18,113,860 | 100.00% | |

Segment contrasts (minimal pair analyses)

The relative importance of segments in comprehension and intelligibility is highlighted in those cases where they serve to differentiate words and, in particular, where a single segment is the only phonetic element that differentiates two words. Minimal pairs (MP) abound in the English language due to the large number of monosyllabic words and their frequent use.

For the purposes of this study, MP analyses took into consideration primary stress but not secondary stress information. A search of the data set yielded a total of 14,418 MPs, 3,688 vowel MPs and 10,730 consonant MPs. Following Brown (1991), the frequency of occurrence of the members of each pair was added in order to compute the weight of each MP and, consequently, each contrast. The totals obtained were 21,927,775 occurrences for vowel MPs and 58,120,215 for consonant MPs. The FL of consonant MPs is superior to that of vowel MPs whether as a collection of types (1 to 0.34) or as they appear in language use (1 to 0.38).
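The MP search and the frequency weighting described above can be sketched as follows. This is an illustration rather than the software used in the study. It assumes that an MP is a pair of transcriptions of equal length that differ in exactly one segment, it ignores stress, and it compares all pairs exhaustively, which is adequate for a sketch but quadratic in the size of the lexicon. The toy words and their frequencies are hypothetical.

```python
from collections import Counter
from itertools import combinations


def minimal_pairs(lexicon):
    """Return (word1, word2, contrast, weight) for every minimal pair in the lexicon.

    lexicon maps each word to (tuple of segments, frequency); the weight of a pair
    is the sum of the frequencies of its two members.
    """
    pairs = []
    for (w1, (t1, f1)), (w2, (t2, f2)) in combinations(lexicon.items(), 2):
        if len(t1) != len(t2):
            continue
        differing = [i for i, (x, y) in enumerate(zip(t1, t2)) if x != y]
        if len(differing) == 1:
            i = differing[0]
            pairs.append((w1, w2, frozenset((t1[i], t2[i])), f1 + f2))
    return pairs


# Hypothetical transcriptions and frequencies, for illustration only.
toy_lexicon = {
    "pit": (("p", "ɪ", "t"), 100),
    "bit": (("b", "ɪ", "t"), 250),
    "bid": (("b", "ɪ", "d"), 40),
    "sea": (("s", "i"), 500),
    "see": (("s", "i"), 900),
}

pairs = minimal_pairs(toy_lexicon)
mp_count, mp_weight = Counter(), Counter()
for _, _, contrast, weight in pairs:
    mp_count[contrast] += 1       # number of MPs per contrast
    mp_weight[contrast] += weight  # summed frequencies per contrast
print(pairs)
```

Homophones such as sea and see differ in no segment and are therefore not counted as an MP in this sketch.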

Of the 10,000 words, 4,542 participate in at least one MP relationship. There are 15 words that form 30 or more MPs (the maximum case is 34), 300 words form 20 or more MPs, 1,062 words form 10 or more MPs, and 1,952 words form 5 or more MPs. The majority of MPs are formed by monosyllabic words (~84.1%) even though monosyllabic words account for only about half (~54.3%) of the 4,542 words that participate in MPs. Out of the 2,824 monosyllabic words in the data set, ~87.3% form MPs in contrast with ~42.2% (1,731 of 4,100) of the disyllabics, ~13.7% of the trisyllabics, ~7.0% of the tetrasyllabics, ~3.8% of the pentasyllabics, and ~8.3% of the hexasyllabics (see Table 4 for reference). The FL of monosyllabic words in the formation of MPs is much higher than that of any other type of word.

MP analyses of the data set found 99 vowel and 254 consonant contrasts. However, not all contrasts are of equal importance for comprehension and intelligibility since some pairings are formed by segments that are highly dissimilar (e.g. /v/ and /h/). Deciding which contrasts can definitely be dismissed is not always a straightforward task since the learners’ L1 has a bearing on what segments could be problematic. For example, Munro and Derwing (2006) report on the difficulty Chinese L1 learners may have with the contrast /l/ and /n/, a contrast that other English learners do not struggle with. Since we cannot anticipate all possible L1 backgrounds, this paper reports on those contrasts between segments that differ in one distinctive feature, that is, segments that are objectively similar and that are likely to be of relevance to learners regardless of their previous linguistic experience.
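Selecting the contrasts that differ in one distinctive feature amounts to comparing feature matrices segment by segment. The sketch below illustrates the idea with a small, simplified matrix for ten consonants. The feature values are assumptions made for illustration and are not the matrices used in the study; they were chosen so that the resulting single-feature pairs agree with the contrasts discussed in this paper.

```python
from itertools import combinations

FEATURE_NAMES = ("voice", "continuant", "strident", "labial", "coronal")

# Simplified, assumed feature values (1 = +, 0 = -), for illustration only.
FEATURES = {
    "p": (0, 0, 0, 1, 0), "b": (1, 0, 0, 1, 0),
    "f": (0, 1, 0, 1, 0), "v": (1, 1, 0, 1, 0),
    "t": (0, 0, 0, 0, 1), "d": (1, 0, 0, 0, 1),
    "θ": (0, 1, 0, 0, 1), "ð": (1, 1, 0, 0, 1),
    "s": (0, 1, 1, 0, 1), "z": (1, 1, 1, 0, 1),
}


def differing_features(a, b):
    """Names of the features on which segments a and b disagree."""
    return [name for name, x, y in zip(FEATURE_NAMES, FEATURES[a], FEATURES[b]) if x != y]


def single_feature_contrasts(segments):
    """Map each pair of segments that differs in exactly one feature to that feature."""
    contrasts = {}
    for a, b in combinations(segments, 2):
        diff = differing_features(a, b)
        if len(diff) == 1:
            contrasts[frozenset((a, b))] = diff[0]
    return contrasts


contrasts = single_feature_contrasts(FEATURES)
for pair, feature in contrasts.items():
    print("-".join(sorted(pair)), feature)
```

Under these assumed values, /t/ and /s/ differ in both [continuant] and [strident] and are therefore excluded, while /s/ and /θ/ differ only in [strident].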

Before describing the results for the vowel and consonant contrasts that differ in one distinctive feature, we present in Chart 1 the FL ranking for all consonant contrasts (the smooth curve). The jagged line represents the frequency of occurrence of MPs per contrast. The y-axis represents the FL range and the x-axis represents the 254 consonant contrasts sorted so that the contrasts with more MPs precede those with fewer MPs (hence the gradual slope). The chart shows what we have been observing all along, namely, that there is a noticeable disparity between language as a static system where all elements have equal weight and language as a collection of utterances where elements are used with unequal frequency.

Chart 1. Consonant MPs among types and tokens

 

In this particular case (Chart 1), those contrasts that have the greatest number of MPs do not correspond, in general, with those MPs whose constituent words are most frequent. In other words, while some contrasts serve to distinguish a large number of words, these words hardly ever occur. Conversely, some contrasts serve to distinguish but a few words, yet these words are very frequent; the high level of activity of these words in language use makes these contrasts important for language users and language learners. This observation, we feel, is one that is important to keep in mind as teachers and materials designers make decisions concerning what should receive the highest priority.

 

Table 7. Vowel MPs and occurrence in language use

| Number of MPs | | | | MP frequencies | | | |
|---|---|---|---|---|---|---|---|
| Contrast | Amount | Share | FL | Contrast | Amount | Share | FL |
| i | 99 | 15.87% | 1.00 | i | 965,712 | 24.31% | 1.00 |
| ɪ ɛ | 80 | 12.82% | 0.81 | u | 589,902 | 14.85% | 0.61 |
| i ɪ | 77 | 12.34% | 0.78 | i ɪ | 518,996 | 13.07% | 0.54 |
| ɛ æ | 71 | 11.38% | 0.72 | ɛ æ | 508,453 | 12.80% | 0.53 |
| ɛ ʌ | 70 | 11.22% | 0.71 | ɛ ʌ | 414,270 | 10.43% | 0.43 |
| ɛ | 67 | 10.74% | 0.68 | ɪ ɛ | 332,316 | 8.37% | 0.34 |
| u | 65 | 10.42% | 0.66 | ɔ ʌ | 248,838 | 6.26% | 0.26 |
| ɔ ʌ | 54 | 8.65% | 0.55 | ɛ | 190,195 | 4.79% | 0.20 |
| ɔ | 30 | 4.81% | 0.30 | ɔ | 172,951 | 4.35% | 0.18 |
| ɔ ʊ | 6 | 0.96% | 0.06 | u ʊ | 17,524 | 0.44% | 0.02 |
| u ʊ | 5 | 0.80% | 0.05 | ɔ ʊ | 12,978 | 0.33% | 0.01 |
| Total | 624 | 100.00% | | Total | 3,972,135 | 100.00% | |

 

 

As mentioned, 99 vowel contrasts were found among the 10,000 words in the data set. Of these, 11 contrasts are between vowels that share all but one distinctive feature and these are shown in Table 7. In the case of, for example, /ɛ/ and /ʌ/ the distinctive feature is [back] while in the case of /i/ and /ɪ/ the distinctive feature is [tense]. The distinctive feature matrices employed for these analyses are based on O’Grady et al. (1993).

The share and FL values in Table 7 have been calculated in relation to only those contrasts that appear in the table rather than the total number of vowel contrasts found (the same applies to Tables 8, 9, and 10). All distinctive features (except [reduced]) are represented in the data set as are all the contrasts that are distinguished by a single feature. The feature [reduced] is the exception since the segments /ʌ/ and /ə/ cannot make MPs in this transcription system. There are four contrasts distinguished by the feature [high], four more by the feature [tense], while [low], [back], and [round] distinguish one contrast each. The number of MPs can, then, be characterized by saying that the feature [high] has the highest FL (it accounts for ~40.0% of all MPs), followed by the feature [tense] (~28.7% of the MPs). With regard to MP frequencies, there is an increase in the work done by the feature [high] (~47.9%) while there is a decrease in the amount of work the feature [tense] does (~22.7%). In other words, the preferences exhibited by fluent speakers in language use highlight the necessity for adequate command of the [high] feature to distinguish segments and, consequently, words in production and processing.

 

Table 8. Consonant MPs and occurrence in language use

| Number of MPs | | | | MP frequencies | | | |
|---|---|---|---|---|---|---|---|
| Contrast | Amount | Share | FL | Contrast | Amount | Share | FL |
| r l | 189 | 24.80% | 1.00 | t d | 1,367,847 | 42.56% | 1.00 |
| t d | 164 | 21.52% | 0.87 | r l | 463,540 | 14.42% | 0.34 |
| p b | 84 | 11.02% | 0.44 | ð d | 297,359 | 9.25% | 0.22 |
| p f | 77 | 10.10% | 0.41 | s ʃ | 195,394 | 6.08% | 0.14 |
| k g | 46 | 6.04% | 0.24 | p b | 173,826 | 5.41% | 0.13 |
| s ʃ | 45 | 5.91% | 0.24 | k g | 155,282 | 4.83% | 0.11 |
| s z | 39 | 5.12% | 0.21 | p f | 153,016 | 4.76% | 0.11 |
| t θ | 34 | 4.46% | 0.18 | v f | 129,799 | 4.04% | 0.09 |
| s θ | 24 | 3.15% | 0.13 | t θ | 112,067 | 3.49% | 0.08 |
| v f | 21 | 2.76% | 0.11 | s θ | 88,772 | 2.76% | 0.06 |
| v b | 15 | 1.97% | 0.08 | s z | 38,624 | 1.20% | 0.03 |
| | 13 | 1.71% | 0.07 | v b | 33,807 | 1.05% | 0.02 |
| ð d | 10 | 1.31% | 0.05 | | 4,459 | 0.14% | 0.00 |
| ð z | 1 | 0.13% | 0.01 | ð z | 299 | 0.01% | 0.00 |
| Total | 762 | 100.00% | | Total | 3,214,091 | 100.00% | |

 

 

Out of the 254 consonant contrasts found in the data set, there are 14 that pair two segments sharing all but one distinctive feature. Table 8 provides FL rankings for these contrasts. As is the case with vowels, the FL values for language use (MP frequencies) are more steeply ranked. Note that the contrast /r/ and /ð/ is not included in Tables 8, 9, and 10 because the single feature that distinguishes this pair of segments is the major class feature [sonorant].

The data set does not contain words that contrast /θ/ and /ð/, /ʒ/ and /z/, or /ʒ/ and /ʃ/ although these are distinguished by a single feature, [voice] in the case of the first two and [strident] in the case of the last contrast. The reason is that at least one of the words that would create such MPs is not used frequently enough to appear in the data set (fewer than 2.4 occurrences per million running words). The seven distinctive features [labial], [round], [coronal], [high], [back], [nasal], and [delayed release] cannot create contrasts on their own given the characteristics of the English consonant system.

 

Table 9. Consonant MPs and occurrence in language use (WI only)

| Number of MPs | | | | MP frequencies | | | |
|---|---|---|---|---|---|---|---|
| Contrast | Amount | Share | FL | Contrast | Amount | Share | FL |
| p b | 73 | 21.10% | 1.00 | t d | 424,156 | 26.70% | 1.00 |
| p f | 65 | 18.79% | 0.89 | ð d | 296,959 | 18.69% | 0.70 |
| t d | 47 | 13.58% | 0.64 | s ʃ | 191,450 | 12.05% | 0.45 |
| r l | 42 | 12.14% | 0.58 | p b | 171,420 | 10.79% | 0.40 |
| s ʃ | 31 | 8.96% | 0.42 | p f | 147,837 | 9.31% | 0.35 |
| k g | 30 | 8.67% | 0.41 | k g | 131,999 | 8.31% | 0.31 |
| t θ | 15 | 4.34% | 0.21 | r l | 80,225 | 5.05% | 0.19 |
| v b | 10 | 2.89% | 0.14 | s θ | 76,976 | 4.85% | 0.18 |
| v f | 9 | 2.60% | 0.12 | t θ | 49,962 | 3.15% | 0.12 |
| s θ | 8 | 2.31% | 0.11 | v b | 7,119 | 0.45% | 0.02 |
| | 8 | 2.31% | 0.11 | v f | 6,989 | 0.44% | 0.02 |
| ð d | 7 | 2.02% | 0.10 | | 2,966 | 0.19% | 0.01 |
| s z | 1 | 0.29% | 0.01 | s z | 462 | 0.03% | 0.00 |
| Total | 346 | 100.00% | | Total | 1,588,520 | 100.00% | |

 

 

Of the remaining five features, [voice] distinguishes six contrasts and accounts for ~48.2% of the 762 MPs under consideration while [lateral] distinguishes one contrast and accounts for ~24.8% of the MPs, [continuant] distinguishes four contrasts and accounts for ~17.8% of the MPs, [anterior] distinguishes one contrast and accounts for ~5.9% of the MPs, and [strident] distinguishes two contrasts and accounts for ~3.3% of the MPs. With regard to MP frequencies, there is an increase in the work done by the feature [voice] (~58.2%) while there is a decrease in the amount of work the feature [lateral] does (~14.4%). In other words, adequate command of the feature [voice] is more necessary than that of other features to distinguish segments and, consequently, words in production and processing, as demonstrated by the preferences exhibited by fluent speakers in language use.
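The per-feature shares reported above can be reproduced by summing the MP counts of Table 8 over the contrasts that each feature distinguishes. In the sketch below, the assignment of contrasts to [continuant], [strident], [lateral], and [anterior] is our own reading of the text and should be treated as an assumption. Since the five features together account for all 762 MPs, the count for [voice] is obtained as the remainder.

```python
TOTAL_MPS = 762

# MP counts from Table 8, grouped by the feature assumed to distinguish each contrast.
MPS_BY_FEATURE = {
    "lateral": {"r-l": 189},
    "continuant": {"p-f": 77, "t-θ": 34, "v-b": 15, "ð-d": 10},
    "anterior": {"s-ʃ": 45},
    "strident": {"s-θ": 24, "ð-z": 1},
}

feature_totals = {feature: sum(mps.values()) for feature, mps in MPS_BY_FEATURE.items()}
# The five features partition the MPs, so [voice] accounts for whatever is left.
feature_totals["voice"] = TOTAL_MPS - sum(feature_totals.values())

feature_shares = {f: 100.0 * n / TOTAL_MPS for f, n in feature_totals.items()}
for feature, share in sorted(feature_shares.items(), key=lambda item: item[1], reverse=True):
    print(f"[{feature}]\t{feature_totals[feature]}\t{share:.1f}%")
```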

The boundaries of words, especially the initial segments, are recognized as playing a determinant role in lexical access (Bent et al., 2007; Dell and Gordon, 2003; Gow et al., 1996; Marslen-Wilson and Zwitserlood, 1989). In the next section we will provide a broad characterization of word boundaries by inspecting word-initial (WI) onsets and word-final (WF) codas. Before moving on, however, it is relevant to ask what role MPs play in relation to the challenges of word identification and comprehensibility. To this end, we have isolated those consonant MPs (and contrasts) from Table 8 that occur in WI and WF positions.

Table 9 provides a FL ranking for those contrasts that are found to have MPs in WI position. Since lexical access is known to rely on WI segments, the proposition is that these contrasts increase the chances of incorrect word identification. In other words, failure to properly articulate or process a word-leading segment may trigger the activation of a MP partner and result in a breakdown of communication.

Comparing the results shown in Tables 8 and 9, one can see that a large proportion of the MPs for the contrasts /p/-/b/, /p/-/f/, /s/-/ʃ/, and /k/-/g/ occur in WI position (~86.9%, ~84.4%, ~68.8%, and ~65.2%, respectively). MP frequencies for these contrasts are even more striking (~98.6%, ~96.6%, ~98.0%, and ~85.0%, respectively). It can be safely said that these contrasts exert most of their influence in WI position and are, therefore, of special relevance to word identification and to those comprehensibility problems that may result if they are not properly distinguished by learners in production or perception.

Since most of the MPs are formed by monosyllabic words, it is unsurprising that most consonant contrasts are found in either WI (~45.4%) or WF position (~33.9%). Together, 79.3% of all MPs exhibit a contrast at a word boundary. Regarding MP frequencies, these MPs account for ~96.3% (~49.4% and ~46.8%, respectively) of the cumulative total (n = 3,214,091).
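The boundary shares quoted here follow from the totals reported in the text and in Tables 9 and 10. A short sketch of the arithmetic, assuming that the 762 MPs and the cumulative frequency of 3,214,091 quoted in the text are the all-position totals of Table 8 (the variable and function names are illustrative):

```python
# Totals quoted in the text and in Tables 9 (WI) and 10 (WF).
TOTAL_MPS, WI_MPS, WF_MPS = 762, 346, 258
TOTAL_FREQ, WI_FREQ, WF_FREQ = 3_214_091, 1_588_520, 1_505_448


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


wi_mp_share = pct(WI_MPS, TOTAL_MPS)
wf_mp_share = pct(WF_MPS, TOTAL_MPS)
boundary_mp_share = pct(WI_MPS + WF_MPS, TOTAL_MPS)
boundary_freq_share = pct(WI_FREQ + WF_FREQ, TOTAL_FREQ)
```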

Table 10 isolates those consonant contrasts from Table 8 that occur in WF position. In relation to word identification, distinction of words by a single phoneme in WF position implies that two (or more) words are viable candidates up to that point. That is to say, the WF segment offers a last chance for correct identification when processing might have already selected a (wrong) candidate due to the higher frequency of one of the words, part of speech and other collocational information, discourse context and expectations, and so on.

Contrasts in WF position (Table 10) are dominated by /t/-/d/ and /r/-/l/ in terms of number of MPs (~60.9%) and even more so in terms of MP frequencies (~84.6%). These two contrasts also have the greatest FL values when taking into consideration all positions (Table 8). Both these contrasts mostly occur at word boundaries (from Tables 8, 9, and 10; /t/-/d/ ~88.4% of MPs and ~99.4% of MP frequencies; /r/-/l/ ~53.4% and ~90.2%), particularly in WF position (from Tables 8 and 10; /t/-/d/ ~59.8% of MPs and ~68.4% of MP frequencies; /r/-/l/ ~31.2% and ~72.9%), implying that MPs for these contrasts that do not occur at word boundaries are highly infrequent. Since most MPs (~84.1%) are formed by monosyllabic words, this is unsurprising.
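The Share and FL columns of Table 10 can be reproduced from the raw amounts. The sketch below assumes that FL is the amount of an item divided by the largest amount in its column, an assumption that is consistent with the values shown in Table 10; the function name and the dictionary layout are illustrative.

```python
def share_and_fl(amounts, total):
    """Map each item to (share in %, FL), with FL relative to the top item."""
    top = max(amounts.values())
    return {
        item: (round(100 * n / total, 2), round(n / top, 2))
        for item, n in amounts.items()
    }


# Four highest-ranked WF contrasts by number of MPs (Table 10);
# 258 is the total number of WF MPs in that table.
wf_mps = {"t-d": 98, "r-l": 59, "s-z": 34, "t-θ": 16}
ranking = share_and_fl(wf_mps, total=258)

# Combined share of the two dominant contrasts discussed in the text.
top_two_share = round(100 * (wf_mps["t-d"] + wf_mps["r-l"]) / 258, 1)
```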

 

Table 10. Consonant MPs and occurrence in language use (WF only)

 

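| Contrast (by number of MPs) | Amount | Share | FL | Contrast (by MP frequency) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| /t/-/d/ | 98 | 37.98% | 1.00 | /t/-/d/ | 935,727 | 62.16% | 1.00 |
| /r/-/l/ | 59 | 22.87% | 0.60 | /r/-/l/ | 337,781 | 22.44% | 0.36 |
| /s/-/z/ | 34 | 13.18% | 0.35 | /v/-/f/ | 95,030 | 6.31% | 0.10 |
| /t/-/θ/ | 16 | 6.20% | 0.16 | /t/-/θ/ | 61,501 | 4.09% | 0.07 |
| /s/-/θ/ | 12 | 4.65% | 0.12 | /s/-/z/ | 35,831 | 2.38% | 0.04 |
| /k/-/g/ | 11 | 4.26% | 0.11 | /k/-/g/ | 20,594 | 1.37% | 0.02 |
| /p/-/f/ | 8 | 3.10% | 0.08 | /s/-/θ/ | 11,107 | 0.74% | 0.01 |
| /v/-/f/ | 7 | 2.71% | 0.07 | /p/-/f/ | 3,772 | 0.25% | 0.00 |
| /s/-/ʃ/ | 5 | 1.94% | 0.05 | /s/-/ʃ/ | 2,082 | 0.14% | 0.00 |
| /p/-/b/ | 4 | 1.55% | 0.04 |  | 1,285 | 0.09% | 0.00 |
|  | 3 | 1.16% | 0.03 | /p/-/b/ | 615 | 0.04% | 0.00 |
| /ð/-/d/ | 1 | 0.39% | 0.01 | /ð/-/d/ | 123 | 0.01% | 0.00 |
| Total | 258 | 100.00% |  | Total | 1,505,448 | 100.00% |  |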

 

We conclude the section on MPs by observing that some researchers have gone beyond this level of explanation to suggest that contrasts where MPs are seriously imbalanced in favor of one of the members of the pair are less relevant to the computation of the FL of contrasts (and possibly to learners) than those where the frequency of occurrence of both members of MPs is balanced. Brown (1991) quotes Rischel (1962, pp. 18-19) as saying: “the functional load of a contrast in the text depends on the existence of minimal pairs of words that are both frequent”, so that when one member is relatively infrequent, the “minimal pair can hardly be said to have any importance” (Brown, 1991, p. 219). It is easy to see the logic and relevance of such an observation. However, we feel that imbalance of occurrence does not necessarily rule out the importance of an MP since, after all, the infrequent member is obscured both by its own infrequency and by the dominating frequency of a highly similar word. For these reasons, these observations have not been taken into consideration in the analyses reported here. Our study, however, included additional analyses that explore these observations and an investigation is ongoing.

 

Onsets and codas

 

A presentation of the structure of the English syllable can benefit from first investigating onsets and codas, that is, those consonant segments that precede and follow the vowel nucleus of a syllable. In particular, it is of interest to inspect the manner and frequency with which consonants are sequenced into clusters for this is a well-known source of problems for learners in both production (Celce-Murcia et al., 1996; Gilner and Morales, 2000; Jenkins, 2000; Suenobu, 1992) and comprehension (Altenberg, 2005; Dupoux et al., 1999; Flege, 2003; McAllister et al., 1999; Tarone, 1987).

We begin with a discussion of onsets. Table 11 shows that there are 21,533 onsets in the data set (in word types). This number coincides with the total number of syllables in the 10,000 words since empty onsets are counted too. Taking frequency into consideration, the collection of utterances in the subcorpus contains 11,747,726 occurrences of onsets and, therefore, syllables. Note that results are always going to be influenced by the method of syllabification adopted.

 

Table 11. Breakdown of onsets by length

 

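| Length (in word types) | Amount | Share | FL | Length (in word tokens) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 12,238 | 56.83% | 1.00 | 1 | 7,476,015 | 63.64% | 1.00 |
| 0 | 6,989 | 32.46% | 0.57 | 0 | 3,719,423 | 31.66% | 0.50 |
| 2 | 2,135 | 9.92% | 0.17 | 2 | 529,430 | 4.51% | 0.07 |
| 3 | 171 | 0.79% | 0.01 | 3 | 22,858 | 0.19% | 0.00 |
| Total | 21,533 | 100.00% |  | Total | 11,747,726 | 100.00% |  |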

 

The onset with the highest FL in both types and tokens is a single consonant, followed by the absence of a consonant (the empty onset). Double-segment (CC) and triple-segment (CCC) clusters in onset position amount to ~10.7% of the types and, significantly, to ~4.7% of the tokens. In short, consonant clusters in onset position are relatively rare and even more so in actual speech. The implication is that, from an instructional point of view, frequent words exhibiting CC or CCC clusters in onset position may be of interest in their own right rather than as exemplars of a phonotactic characteristic that, results show, is not abundant in word formation or in language use. The most frequent of these words are: CC clusters, from (~2,265 per million running words), three (~1,786), through (~800), still (~798), probably (~606), start (~454), school (~448), try (~421), etc.; CCC clusters, straight (~169), street (~143), structure (~83), etc. Patterns can clearly be seen across the onsets of these words but acquisition of the actual exemplars provides learners with the precise words that they are most likely to encounter in production and perception. Whether or not there is agreement on this point, the strongest argument is that, in any case, these are the very words that should be used to illustrate this particular characteristic of the onset of syllables precisely because of their role in language use.
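The distinction between word types and word tokens that underlies Table 11 can be sketched as two tallies over the same syllabified word list: each onset counts once per word in the type tally and once per occurrence of the word in the token tally. The word list, the syllabifications, and the frequencies below are hypothetical and serve only to illustrate the counting; they are not taken from the subcorpus.

```python
from collections import Counter

# Hypothetical toy data: word -> (onset of each syllable, corpus frequency).
# An onset is a tuple of consonant segments; () is the empty onset.
WORDS = {
    "from": ([("f", "r")], 2000),
    "the": ([("ð",)], 50000),
    "street": ([("s", "t", "r")], 150),
    "about": ([(), ("b",)], 3000),
}


def onset_lengths(words):
    """Tally onsets by length in word types and in word tokens."""
    types, tokens = Counter(), Counter()
    for onsets, freq in words.values():
        for onset in onsets:
            types[len(onset)] += 1      # one count per word
            tokens[len(onset)] += freq  # one count per occurrence
    return types, tokens


types, tokens = onset_lengths(WORDS)
```

Applied to the amounts in Table 11, the same logic gives the cluster shares quoted in the text: (2,135 + 171) / 21,533 for types and (529,430 + 22,858) / 11,747,726 for tokens.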

The attentive reader may have noticed that the most frequent words listed above all have the CC or CCC cluster onset in word-initial (WI) position. This is unsurprising since most running words (from Table 4, ~79.4%) in the subcorpus are monosyllabic words. Inspection of the 21,533 onsets reveals that although ~46.4% of these onsets fall in WI position, they account for ~78.1% of the 11,747,726 tokens. Table 12 presents results for onsets in WI position.

 

Table 12. Breakdown of onsets by length (WI only)

 

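| Length (in word types) | Amount | Share | FL | Length (in word tokens) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 6,615 | 66.15% | 1.00 | 1 | 6,261,142 | 68.24% | 1.00 |
| 0 | 1,734 | 17.34% | 0.26 | 0 | 2,475,842 | 26.99% | 0.40 |
| 2 | 1,544 | 15.44% | 0.23 | 2 | 422,328 | 4.60% | 0.07 |
| 3 | 107 | 1.07% | 0.02 | 3 | 15,338 | 0.17% | 0.00 |
| Total | 10,000 | 100.00% |  | Total | 9,174,650 | 100.00% |  |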

 

The first observation is that empty onsets in types (word formation) drop by approximately half while all other kinds of onsets increase their share (Tables 11 and 12). However, the values for frequency of occurrence generally hold, with an increase for both single consonant and empty onsets. Fewer empty onsets are doing more work. In an empty onset situation, the vowel nucleus leads the word. The segment /ə/ accounts for ~25.3% of the words in the data set that start with a vowel, /ɪ/ accounts for ~21.1%, /ɛ/ accounts for ~15.3%, and /æ/ accounts for ~12.8%. Together, these four segments amount to ~74.5% of all words that start with vowels. In terms of tokens, /ə/ and /ɛ/ are relatively infrequent (~6.7% and ~4.2%, respectively) while /ɪ/ and /æ/ are the most frequent of all vowels (~24.8% and ~16.8%, respectively). Of interest, /aɪ/ plays a small role in word formation (~2.7%) but ranks third in terms of frequency (~14.2%).

 

 

 

Chart 2. FL values of single consonant WI onsets.

 

Type-wise, although the consonant /k/ has the greatest FL, this value is not significantly greater than those immediately following. Token-wise, the distribution is more pronounced, with the consonant /ð/ having the largest FL value.

In word formation (in word types), plosives account for ~39.8% of all single consonant WI onsets, while fricatives account for ~24.5%, liquids for ~14.9%, and nasals for ~10.8%. In terms of frequency of occurrence, the reorganization yields fricatives (~35.6%), plosives (~27.8%), and glides (~19.8%). Nasals maintain their share at ~10.0% while liquids drop to ~6.1%. The observations made about the general distributions of consonants (Table 6) apply to the information shown in Chart 2, that is, obstruents do more work than sonorants and voiced consonants do more work than voiceless consonants.

As mentioned, there are relatively few WI onsets made of a CC cluster and these do not occur frequently in language use. Phonotactic constraints limit which consonants can pair and which can precede and follow, that is, there are restrictions in terms of variety. In particular, analyses show that CC clusters starting with the consonants /s/, /p/, and /k/ (listed in order of frequency) amount to ~61.3% of all WI-CC onsets and account for ~54.7% of all occurrences (all segments drop their share from word formation to language use). WI-CC onsets starting with the consonants /f/, /t/, /θ/ are more active in language use than in word formation.

All CCC clusters in WI onset position necessarily start with the consonant /s/, followed by one of the plosives /t/, /k/, /p/ and ending with a liquid or a glide. Of the five WI-CCC onsets found among the 10,000 words, the cluster /str/ amounts to ~55.1% of those found, accounting for ~69.1% of all tokens. This cluster is, therefore, used often in word formation and the words in which it is found occur frequently.
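The pattern just described can be written as a simple check on a three-consonant onset. This is a sketch under assumptions: the liquids are taken to be /l/ and /r/ and the glides /w/ and /j/, and the function name is illustrative. The pattern is a necessary condition only, since it admits more combinations than the five WI-CCC onsets attested in the data.

```python
PLOSIVES = {"t", "k", "p"}
LIQUIDS_AND_GLIDES = {"l", "r", "w", "j"}  # assumed inventory for the sketch


def fits_wi_ccc_pattern(onset):
    """True if the onset is /s/ + /t, k, p/ + a liquid or a glide."""
    if len(onset) != 3:
        return False
    first, second, third = onset
    return first == "s" and second in PLOSIVES and third in LIQUIDS_AND_GLIDES


examples = [("s", "t", "r"), ("s", "p", "l"), ("s", "k", "w"), ("t", "s", "r")]
results = [fits_wi_ccc_pattern(onset) for onset in examples]
```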

We now move on to codas. As mentioned, the 10,000 words in the data set are formed by a total of 21,533 syllables, a number that naturally coincides with the number of onsets and codas (since, again, empty onsets and codas are counted too). Similarly, the frequency of occurrence of codas is the same as that of onsets and of syllables. Table 13 presents a breakdown of codas by length. It shows that, as with onsets, single consonant codas have the highest FL values in both types and tokens, followed by the absence of a consonant (the empty coda). Differences between onsets and codas can be observed once we start to look a bit closer.

In word formation (in word types), single consonant onsets account for ~56.8% of all onsets while single consonant codas account for ~66.6% of all codas. Also, empty codas (~18.3%) do a lot less work than empty onsets (~32.5%), whereas CC onsets (~9.9%) do less work than CC codas (~13.2%). Quadruple (CCCC) codas exist but their presence is limited to five instances; such clusters do not occur in onset position. In terms of language use (in word tokens), FL values are quite similar to those of onsets with the exception of CC codas, which more than double their presence. Inspection of the data reveals that this increase reflects the use of inflection suffixes.

Table 13. Breakdown of codas by length

| Length (in word types) | Amount | Share | FL | Length (in word tokens) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 14,347 | 66.63% | 1.00 | 1 | 6,939,450 | 59.07% | 1.00 |
| 0 | 3,942 | 18.31% | 0.27 | 0 | 3,574,740 | 30.43% | 0.52 |
| 2 | 2,845 | 13.21% | 0.20 | 2 | 1,130,022 | 9.62% | 0.16 |
| 3 | 394 | 1.83% | 0.03 | 3 | 103,139 | 0.88% | 0.01 |
| 4 | 5 | 0.02% | 0.00 | 4 | 375 | 0.00% | 0.00 |
| Total | 21,533 | 100.00% |  | Total | 11,747,726 | 100.00% | 1.69 |

What held true for onsets regarding their presence in WI position holds true for codas in word-final (WF) position. The numbers are, naturally, identical and ~46.4% of codas fall in WF position, accounting for ~78.1% of the 11,747,726 syllable tokens. Table 14 presents a breakdown of codas in WF position.

 

Table 14. Breakdown of codas by length (WF only)

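| Length (in word types) | Amount | Share | FL | Length (in word tokens) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 5,580 | 55.80% | 1.00 | 1 | 4,961,903 | 54.08% | 1.00 |
| 2 | 2,647 | 26.47% | 0.47 | 0 | 3,013,293 | 32.84% | 0.61 |
| 0 | 1,377 | 13.77% | 0.25 | 2 | 1,096,209 | 11.95% | 0.22 |
| 3 | 391 | 3.91% | 0.07 | 3 | 102,870 | 1.12% | 0.02 |
| 4 | 5 | 0.05% | 0.00 | 4 | 375 | 0.00% | 0.00 |
| Total | 10,000 | 100.00% |  | Total | 9,174,650 | 100.00% |  |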

 

As is the case with empty onsets, empty codas drop significantly when in word boundary position and, in fact, do less work than double consonant WF codas (in word types). The exposed vowel nuclei that end these words and that do the most work are /i/ (~39.7% in word formation and ~36.4% in language use), /eɪ/ (~16.7% and ~14.0%), /aɪ/ (~11.2% and ~12.5%), /oʊ/ (~11.2% and ~9.3%), and /u/ (~11.1% and ~21.4%). Grouping WF empty codas, front vowels amount to ~75.6% in word formation and ~66.2% in language use while high vowels amount to ~71.5% and ~60.2%, respectively.

As shown in Table 14, single consonant WF codas have noticeably higher FL values in both word types and word tokens than any other kind of WF coda. Chart 3 shows the distribution of segments ranked according to FL in word formation so that the segment with the highest FL value is placed left-most and the segment with the lowest FL value is placed right-most. Along the x-axis are all consonants, including those that do not occur in WF position (/h/, /w/, /j/). Note that /ð/ and /ʒ/ occur in WF position in two less frequent words each but their respective FL values are too low to be visibly appreciated in Chart 3.

 

Chart 3. FL values of single consonant WF codas.

 

From Chart 3, we can appreciate the dominant role of /t/ in language use despite ranking 7th in terms of FL in word formation. Thus, those WF-C words that end in /t/ are relatively few (n = 410) yet very frequent (1,087,942 occurrences or ~12% of all running words in the subcorpus). Problems with these words and with this segment in this position will, therefore, contribute to accentedness and perhaps to unintelligible speech. Again, this situation points to the advantages of selecting exemplars in instruction well (that is, of building word lists of frequent words with adequate range), so that even if a particular feature is mastered only in these exemplars, learners will be able to deal with most of the language encountered and required until mastery of the feature is generalized.

In terms of word formation, the sonorants /n/, /ŋ/, /r/, and /l/ have the highest FL values of all WF-C codas and account for ~52.1% of all words that end with a single consonant. As a class, sonorants account for ~55.8% and obstruents for ~44.3% of the 5,580 words that end with a single consonant. Regarding voicing, the large majority of WF-C codas are voiced (~77.7%). Regarding manner of articulation, nasals (~32.1%), plosives (~23.8%), and liquids (~23.7%) dominate although fricatives (~18.6%) do substantial work. In terms of language use, obstruents (~55.9%) do more work than sonorants (~44.1%). Regarding voicing, WF-C voiced codas still do most of the work but the voiceless consonants increase their share. Regarding manner of articulation, plosives (~33.1%) overtake nasals (~24.4%) and fricatives (~21.6%) overtake liquids (~19.7%).

The share of work done by CC codas in word formation doubles when considering only those codas that occur in WF position (Tables 13 and 14) and, in fact, the FL value of WF-CC codas is second only to that of WF-C codas. In word tokens, however, the amount of work done by WF-CC codas is only slightly greater than when considering CC codas in all positions. The implication is clear. Although most CC codas occur in WF position (2,647 out of 2,845) and empty codas in WF position are relatively fewer (1,377 out of 3,942), words ending in a vowel are substantially more frequent in use than words ending in CC codas.

Inspection of the WF-CC clusters by final segment reveals that there are only five types of WF-CC codas ending in a sonorant, accounting for ~2.0% of all WF-CC. There are 68 different WF-CC that end in an obstruent and these dominate both word formation and language use (~98.0% and ~98.4%, respectively). In word types, the WF-CC codas with the highest FL values belong to cluster types that mostly correspond to inflectional markers (/rz/ FL = 1.00, /nz/ 0.94, /ts/ 0.56, /lz/ 0.53, /ks/ 0.36, /ns/ 0.36) or that, at least sometimes, correspond to inflectional markers (/st/ 0.61, /nd/ 0.52, /rd/ 0.47). The exception is the cluster /nt/, which ranks third with an FL value of 0.78. In word tokens, the cluster /nd/ accounts for ~31.0% and the cluster /st/ accounts for ~9.0% of the occurrences of WF-CC words in the subcorpus. Grouping WF-CC codas by the final consonant and looking only at language use, we find that plosives account for ~71.1% and fricatives for ~26.1% of all occurrences of WF-CC words (obstruents account for ~98.0% of the occurrences). Affricate, nasal, and liquid ending WF-CC are both few and infrequent.

All 59 WF-CCC cluster types include inflection markers with the exception of four cluster types (i.e., /rld/, /ksθ/, /lfθ/, and /rmθ/). Discounting these four cluster types, the plural and third person marker ends as many clusters as does the past tense marker. The 59 WF-CCC cluster types are distributed among 391 WF-CCC words and, specifically, 24 of them occur in only one word and seven occur in only two words each. The WF-CCC cluster type that does the most work in word formation is /nts/, which occurs in 84 words. When grouped by the final consonant, codas that end in a fricative account for ~77.2% of the WF-CCC words and ~61.9% of the occurrences of these words (n = 102,870) in the subcorpus. There are only five WF-CCCC words (twelfths, attempts, lengths, sixths, and worlds) and each corresponds to one of the five WF-CCCC cluster types (/lfθs/, /mpts/, /ŋkθs/, /ksθs/, and /rldz/, respectively).

 

Syllables

 

As mentioned in the previous section, the words in the data set are composed of 21,533 syllables and these syllables occur 11,747,726 times in the subcorpus. The average length, therefore, is ~2.15 syllables per word and the average occurrence of a syllable in the subcorpus is ~545.6 times.
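Both averages follow from the three totals already reported. A brief sketch of the arithmetic (the variable names are illustrative):

```python
WORDS_IN_DATA_SET = 10_000
SYLLABLES_IN_WORDS = 21_533       # syllables across the 10,000 words
SYLLABLE_OCCURRENCES = 11_747_726  # occurrences in the subcorpus

avg_syllables_per_word = round(SYLLABLES_IN_WORDS / WORDS_IN_DATA_SET, 2)
avg_occurrences_per_syllable = round(SYLLABLE_OCCURRENCES / SYLLABLES_IN_WORDS, 1)
```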

Naturally, not all 21,533 syllables are unique in terms of their constituent segments and the order in which these segments occur. Inspection of the data set reveals 4,600 different syllable types, that is, unique combinations of segments. Of the 4,600 syllable types, 2,650 correspond to monosyllabic words since there are exactly that many unique monosyllabic transcribed forms (2,824 monosyllabic words minus 174 duplicates from homophones). Table 15 shows a breakdown of syllable types based on their position in words. The table separates syllables based on whether their role can be confined to a specific position in the word (e.g., WF position) or they can be found in multiple positions in word formation (e.g., WI and WF position).

 

Table 15. Breakdown of syllable types based on their position

| Position (syllable types) | Amount | Share | FL | Position (syllable tokens) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Monosyllabic only | 1,659 | 36.07% | 1.00 | Multiple pos (mono) | 7,356,448 | 62.62% | 1.00 |
| Multiple pos (mono) | 991 | 21.54% | 0.60 | Multiple pos (other) | 2,137,536 | 18.20% | 0.29 |
| Final only | 750 | 16.30% | 0.45 | Monosyllabic only | 1,503,131 | 12.80% | 0.20 |
| Initial only | 589 | 12.80% | 0.36 | Final only | 358,741 | 3.05% | 0.05 |
| Multiple pos (other) | 446 | 9.70% | 0.27 | Initial only | 340,750 | 2.90% | 0.05 |
| Mid only | 165 | 3.59% | 0.10 | Mid only | 51,120 | 0.44% | 0.01 |
| Total | 4,600 | 100.00% |  | Total | 11,747,726 | 100.00% |  |

 

Several general observations can be made based on the information displayed in Table 15. First, ~68.8% of the syllable types (3,163 out of 4,600) occur always in the same position in the word. Of these, 1,659 syllable types correspond to monosyllabic words and the remaining 1,504 to syllable types that only occur in polysyllabic words. Second, there are 991 syllable types corresponding to monosyllabic words that also play a role in word formation. Third, when considering syllable tokens, those syllable types that occur in multiple positions are far more frequent (do far more work) than those that occur in specific ones only.
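The categories of Table 15 can be illustrated by recording, for every syllable type, the set of positions in which it is found and then labelling that set. The sketch below makes several assumptions: the toy word list and its syllabifications are hypothetical, "Multiple pos (mono)" is read as a syllable type that forms a monosyllabic word and also occurs inside polysyllabic words, and the function name is illustrative. It is not the procedure used in the study.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> syllables (each a string of segments).
WORDS = {
    "in": ["ɪn"],
    "to": ["tu"],
    "into": ["ɪn", "tu"],
    "cat": ["kæt"],
    "begin": ["bɪ", "gɪn"],
    "beginning": ["bɪ", "gɪn", "ɪŋ"],
}


def classify_positions(words):
    """Label each syllable type with its Table 15 position category."""
    seen = defaultdict(set)
    for syllables in words.values():
        last = len(syllables) - 1
        for i, syl in enumerate(syllables):
            if last == 0:
                seen[syl].add("mono")
            elif i == 0:
                seen[syl].add("initial")
            elif i == last:
                seen[syl].add("final")
            else:
                seen[syl].add("mid")
    labels = {}
    for syl, positions in seen.items():
        if positions == {"mono"}:
            labels[syl] = "Monosyllabic only"
        elif len(positions) == 1:
            labels[syl] = next(iter(positions)).capitalize() + " only"
        elif "mono" in positions:
            labels[syl] = "Multiple pos (mono)"
        else:
            labels[syl] = "Multiple pos (other)"
    return labels


labels = classify_positions(WORDS)
```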

The breakdown presented in Table 15 does not take into consideration stress. If primary stress is taken into consideration (effectively eliminating monosyllabic words), ~56.1% of all syllable types that can be found in WI-only position receive primary stress compared to ~5.6% of those syllable types only found in word internal positions and ~19.9% of those syllable types found in WF-only position. The remaining ~18.4% of syllable types can take several positions in the word and these syllable types amount, in terms of syllable tokens, to ~58.2% of all occurrences. Significantly, primary-stressed syllable types that occur in WF-only or internal positions are quite infrequent. Primary-stressed syllable types that occur in WI-only position amount to ~32.5% of all occurrences.

 

Table 16. List of syllable shapes found in the subcorpus

| Shape (in word types) | Amount | Share | FL | Shape (in word tokens) | Amount | Share | FL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CVC | 8,060 | 37.43% | 1.00 | CVC | 4,172,188 | 35.51% | 1.00 |
| VC | 4,956 | 23.02% | 0.61 | CV | 2,589,865 | 22.05% | 0.62 |
| CV | 2,197 | 10.20% | 0.27 | VC | 2,419,814 | 20.60% | 0.58 |
| CVCC | 1,715 | 7.96% | 0.21 | V | 857,400 | 7.30% | 0.21 |
| CCVC | 1,228 | 5.70% | 0.15 | CVCC | 636,256 | 5.42% | 0.15 |
| V | 1,187 | 5.51% | 0.15 | VCC | 423,509 | 3.61% | 0.10 |
| VCC | 754 | 3.50% | 0.09 | CCVC | 331,083 | 2.82% | 0.08 |
| CCV | 529 | 2.46% | 0.07 | CCV | 124,472 | 1.06% | 0.03 |
| CCVCC | 339 | 1.57% | 0.04 | CVCCC | 77,591 | 0.66% | 0.02 |
| CVCCC | 263 | 1.22% | 0.03 | CCVCC | 66,931 | 0.57% | 0.02 |
| CCCVC | 103 | 0.48% | 0.01 | VCCC | 18,624 | 0.16% | 0.00 |
| VCCC | 91 | 0.42% | 0.01 | CCCVC | 16,365 | 0.14% | 0.00 |
| CCVCCC | 38 | 0.18% | 0.00 | CCVCCC | 6,760 | 0.06% | 0.00 |
| CCCVCC | 37 | 0.17% | 0.00 | CCCVCC | 3,326 | 0.03% | 0.00 |
| CCCV | 29 | 0.13% | 0.00 | CCCV | 3,003 | 0.03% | 0.00 |
| CVCCCC | 3 | 0.01% | 0.00 | CCVCCCC | 184 | 0.00% | 0.00 |
| CCCVCCC | 2 | 0.01% | 0.00 | CCCVCCC | 164 | 0.00% | 0.00 |
| CCVCCCC | 1 | 0.00% | 0.00 | CVCCCC | 115 | 0.00% | 0.00 |
| VCCCC | 1 | 0.00% | 0.00 | VCCCC | 76 | 0.00% | 0.00 |
| Total | 21,533 | 100.00% |  | Total | 11,747,726 | 100.00% |  |

 

If the analysis is conducted based on secondary stress, the relative weight of WI-only and WF-only syllable types is reversed. While the amount of secondary-stressed syllable types in WF-only position is ~43.7% and that of WI-only syllable types is ~33.1%, in terms of occurrence WF-only syllable types amount to ~32.2% of the tokens while WI-only syllable types account for ~16.0% of the tokens. Again, syllable types that occur in multiple positions are the most frequent in terms of tokens (~44.3%) even though they are comparatively fewer in number (~13.3%).

The distribution of unstressed syllables is as follows: WI-only syllable types (~9.1%) and syllable tokens (~1.1%), WF-only syllable types (~36.4%) and syllable tokens (~4.2%), and word internal only syllable types (~8.9%) and syllable tokens (~0.5%). Again, syllable types that occur in multiple positions are more numerous both in terms of types (~45.6%) and, interestingly, tokens (~94.2%). It should be noted that ~70.5% of unstressed syllable types that occur in multiple positions correspond to monosyllabic words.

We conclude this overview of syllables by looking at types and tokens of syllable shapes. Table 16 presents all syllable shapes found in both word formation and language use. As expected from the discussion on onsets and codas, the CVC, CV, and VC shapes are the most frequent types of syllable shapes found in the subcorpus.

Together, the CVC, CV, and VC shapes amount to ~70.7% of all syllable shapes employed in word formation and ~78.2% of all occurrences in language use. Interestingly, the VC shape does twice as much work as the CV shape in word formation but both shapes do a similar amount of work in terms of language use. Syllable shapes with CC onsets and codas follow the top three in terms of word formation but are not as frequent in language use as single vowel syllables. Shapes with CCC onsets and codas are rare in both word formation and use. This reinforces the observation made elsewhere that these types of consonant clusters may be best taught by using the few specific and actual high frequency words in which they appear.
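Syllable shapes such as those listed in Table 16 can be derived by mapping each segment of a syllable to C or V and tallying the results. A minimal sketch, assuming that syllables are given as tuples of segments, that a diphthong is transcribed as a single segment, and that the small vowel set below (an illustrative subset, not the inventory used in the study) covers the examples:

```python
from collections import Counter

# Assumed vowel subset for the sketch; a diphthong counts as one V.
VOWELS = {"i", "ɪ", "ɛ", "æ", "ə", "u", "eɪ", "aɪ", "oʊ"}


def shape(syllable):
    """Return the CV skeleton of a syllable given as a tuple of segments."""
    return "".join("V" if seg in VOWELS else "C" for seg in syllable)


# Hypothetical syllables for illustration (street, twelfths, I, at, to).
SYLLABLES = [
    ("s", "t", "r", "i", "t"),
    ("t", "w", "ɛ", "l", "f", "θ", "s"),
    ("aɪ",),
    ("æ", "t"),
    ("t", "u"),
]

shape_counts = Counter(shape(syl) for syl in SYLLABLES)
```

A token tally would add each word's corpus frequency instead of 1 for every syllable it contains.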

 

Closing remarks

 

As previously stated, the purpose of this paper has been to provide an up-to-date description of spoken English relevant to teaching. The BNC spoken subcorpus has provided the language samples to do so while the analyses undertaken have made it possible to quantify segmental, sequential, and syllabic features as they occur in word formation as well as in language use. Employing the construct of functional load as a means of reference has served to highlight the relative importance of the elements within a given linguistic class.

These findings are of immediate pedagogical application in, at least, three ways. First, this description of spoken English is an alternative to intuition worthy of consideration. Second, there are a number of cases where FL provides a clear rationale for selection and sequencing of material. Third, while exemplification of any aspect of the pronunciation of English should rely on frequent words (Gilner and Morales, 2008), there are situations when the actual frequent words that exhibit a given feature are few in number. Using these words as illustrative material addresses production and perception problems even if the feature itself is not learned beyond these words.

The information reported makes it possible for teachers, curriculum planners, and material designers to make informed decisions regarding what to teach and when. Moreover, researchers now have at their disposal raw data reflecting some of the phonetic characteristics of a spoken corpus of substantial size.

 

References

 

Altenberg, E. P. (2005). The Judgment, Perception, and Production of Consonant Clusters in a Second Language. International Review of Applied Linguistics in Language Teaching (IRAL), 43(1), 53-80.

Anderson, S. R. (1982). The Analysis of French Schwa: Or, How to Get Something from Nothing. Language Learning, 58, 121–138.

Bent, T., Bradlow, A. R., & Smith, B. L. (2007). Phonemic Errors in Different Word Positions and Their Effects on Intelligibility of Non-Native Speech. In O.S. Bohn & M. J. Munro (Eds.), Language Experience in Second Language Speech Learning: In Honor of James Emil Flege (pp. 331-347). Amsterdam; Philadelphia: John Benjamins Publishing.

Breitkreutz, J. A., Derwing, T. M., & Rossiter, M. J. (2002). Pronunciation Teaching Practices in Canada. TESL Canada Journal, 19, 51-61.

Brown, A. (1991). Teaching English Pronunciation: A Book of Readings. London; New York: Routledge.

Catford, J. C. (1987). Phonetics and the Teaching of Pronunciation. In J. Morley (Ed.), Current Perspectives on Pronunciation: Practices Anchored in Theory (pp. 83-100). Washington DC: TESOL.

Celce-Murcia, M., Brinton, D., & Goodwin, J. (1996). Teaching Pronunciation. Cambridge: Cambridge University Press.

Dell, G. S., & Gordon, J. K. (2003). Neighbors in the Lexicon. In N. O. Schiller & A. Meyer (Eds.), Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities (pp. 8-37). Berlin; New York: Mouton de Gruyter.

Derwing, T. M., & Munro, M. J. (2005). Second Language Accent and Pronunciation Teaching: A Research-Based Approach. TESOL Quarterly, 39(3), 379-397.

Denes, P. B. (1963). On the Statistics of Spoken English. The Journal of the Acoustical Society of America, 35(6), 892-904.

Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., & Mehler, J. (1999). Epenthetic Vowels in Japanese: A Perceptual Illusion? Journal of Experimental Psychology: Human Perception and Performance, 25(6), 1568-1578.

Flege, J. E. (2003). Assessing Constraints on Second-Language Segmental Production and Perception. In N. O. Schiller & A. Meyer (Eds.), Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities (pp. 319-355). Berlin; New York: Mouton de Gruyter.

Fraser, H. (2002). Change, Challenge, and Opportunity in Pronunciation and Oral Communication. Paper presented at the English Australia Conference. Retrieved November 2002 from http://www-personal.une.edu.au/~hfraser/docs/HFChangeChallengeOpp.pdf.

George, H. V. (1997). Essays in Informational English Grammar with Reference to English Language Teaching. Victoria, AU: La Trobe University.

Gilner, L., & Morales, F. (2000). Interlanguage Development: Phonological Processes and Complexity. Studies in International Relations, Nihon University, Department of International Relations Research Institute Bulletin, 20(3), 269-282.

Gilner, L., & Morales, F. (2008). Elicitation and Application of a Phonetic Description of the General Service List. System, 36(4), 517-533.

Gow Jr., D. W., Melvold, J., & Manuel, S. (1996). How Word Onsets Drive Lexical Access and Segmentation: Evidence from Acoustics, Phonology and Processing. Paper presented at the International Conference of Spoken Language Process (ICSLP), Philadelphia.

Hockett, C. F. (1955). A Manual of Phonology. Baltimore: Waverly Press.

Jenkins, J. (2000). The Phonology of English as an International Language: New Models, New Norms, New Goals. Oxford: OUP.

Kilgarriff, A. (1995). BNC Database and Word Frequency Lists. Available from http://www.kilgarriff.co.uk/bnc-readme.html.

King, R. D. (1967). Functional Load and Sound Change. Language, 43, 831-852.

Kitahara, M. (2008). Context of Oppositions for an Estimation of Phonemic Functional Load. Journal of the Phonetic Society of Japan, 12(1), 15-23.

Kreidler, C. W. (1997). Describing Spoken English: An Introduction. London; New York: Routledge.

Kreidler, C. W. (2004). The Pronunciation of English: A Course Book in Phonology. Oxford, UK; New York, NY, USA: B. Blackwell.

Ladefoged, P. (2001). A Course in Phonetics. Fort Worth: Harcourt College Publishers.

Leech, G., Rayson, P., & Wilson, A. (2001). Word Frequencies in Written and Spoken English: Based on the British National Corpus. Harlow: Longman.

Macdonald, S. (2002). Pronunciation - Views and Practices of Reluctant Teachers. Prospect: An Australian Journal of TESOL, 17(3), 3–18.

Marslen-Wilson, W., & Zwitserlood, P. (1989). Accessing Spoken Words: The Importance of Word Onsets. Journal of Experimental Psychology: Human Perception and Performance, 15(3), 576-585.

McAllister, R., Flege, J. E., & Piske, T. (1999). Second Language Comprehension: A Discussion of Some Influencing Factors. Paper presented at the Ninth annual conference on the European Second Language Association (EUROSLA 9), Lund, Sweden.

Munro, M. J., & Derwing, T. M. (2006). The Functional Load Principle in ESL Pronunciation Instruction: An Exploratory Study. System, 34, 520-531.

Nation, I. S. P. (2004). Study of the Most Frequent Word Families in the British National Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a Second Language (pp. 3-13). Amsterdam: John Benjamins.

O'Grady, W. D., Dobrovolsky, M., & Aronoff, M. (1993). Contemporary Linguistics: An Introduction. New York: St. Martin's Press.

Pulgram, E. (1970). Syllable, Word, Nexus, Cursus. The Hague: Mouton.

Rischel, J. (1962). On Functional Load in Phonemics. Statistical Methods in Linguistics, 1, 13-23.

Setter, J., & Jenkins, J. (2005). Pronunciation. Language Teaching, 38(1), 1-17.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Stokes, S. F., & Surendran, D. (2005). Articulatory Complexity, Ambient Frequency, and Functional Load as Predictors of Consonant Development in Children. Journal of Speech Language and Hearing Research, 48(3), 577-591.

Suenobu, M. (1992). An Experimental Study of Intelligibility of Japanese English. IRAL, 30(2), 146-156.

Surendran, D. (2003). The Functional Load of Phonological Contrasts. The University of Chicago, Chicago, Illinois.

Surendran, D., & Niyogi, P. (2006). Quantifying the Functional Load of Phonemic Oppositions, Distinctive Features, and Suprasegmentals. In O. Nedergaard Thomsen (Ed.), Competing Models of Linguistic Change (Vol. 279, pp. 43-58). Amsterdam: John Benjamins Publishing.

Tarone, E. (1987). Some Influences on the Syllable Structure of Interlanguage Phonology. In G. Ioup & S. Weinberger (Eds.), Interlanguage Phonology: The Acquisition of a Second Language Sound System (pp. 232-247). Cambridge, MA: Newbury House Publishers.

Yavaş, M. S. (2006). Applied English Phonology. Malden, MA; Oxford: Blackwell Publishing.



* Bunkyo Gakuin University, Japan