Do we really need to know this?

In his 2007 TED Talk, Sugata Mitra quoted the late Arthur C. Clarke: “A teacher that can be replaced by a machine should be.” Clarke’s comment rings true, and in this age of information and technology we should not underestimate the importance of using computers to enhance (not replace) our teaching practice.

Corpus data, which is now available online through various websites, provides ESL teachers with essential information that can refine course curriculums and increase relevance in the classroom. In her 2012 talk for The New School on Corpus Linguistics, Randi Reppen stated that corpus data “can provide insight into language where intuitions often fail – or worse – give us the wrong information.” This article describes why corpus research is essential in language classrooms and highlights how corpus data can supplement classroom learning.

Making the case for corpus: how corpus data can help supplement classroom learning and provide learner relevance

Although I had spoken English all my life, I had never even heard of the grammatical rules that distinguish ‘will’ from ‘going to’ for making future arrangements. When I first learned these rules in a CELTA teacher training course I was mortified, ashamed and frustrated. I felt grammatically robbed. I wondered why I had not learned this grammar before.

Ashamed of my lack of grammatical know-how, I briefly became an English grammar fundamentalist – reading grammar books became my favorite hobby and my students were exposed to what I had failed to know for so long … the rules of proper (or what I thought to be proper) English. Throughout this English grammar purification phase, I taught and reinforced grammatical rules prescriptively.

One cold December morning, my pedantic choices were questioned. I had decided to dedicate an entire lesson to teaching the differences between ‘will’, ‘going to’ and the present continuous for future plans. During a grammar exercise on this topic, a bright and eager student inquired, “Emilia, do we really need to know this?” “Of course, Markus,” I retorted. But he continued, “… are you really saying that you would have difficulty understanding what I meant if I said, ‘I am visiting …’ instead of ‘I am going to visit’?” Once again, I was quick to respond “Yes” and went on referring to our grammar book: “One is a future plan and one is a future intention. There is a difference.” But he pressed on: “But do people really use this when they speak?” I lied, saying convincingly what I did not really believe: “Yes, those who speak English well certainly do.”

I reflected for days, then months, and I still reflect now on Markus’s question and on my response. Initially, I thought listening to native speakers would prove that these rules reflect reality. However, during a summer stay in New York City, I listened intently to random native conversations and heard endless violations of these prescriptive rules for futurity. Appalled, I turned to the BBC and read official news transcripts from their website, hoping that the BBC’s English would prove that these rules are indeed used. I was saddened to see the same grammatical violations. At this point I felt utterly confused. I no longer knew what to teach my students and questioned whether I should teach prescriptively (by the book) or descriptively (as used).

Upon reflection I realized those students who questioned my pedantic rules were absolutely right. They did not need to know these rules because these rules did not reflect their reality. Instead of using class time to teach relevant, frequent language items, I had wasted hours teaching rules that were immaterial anyway.

What is corpus?

While the battle between prescriptive and descriptive teaching continues in our field, corpus data provides some interesting insight into what we should be teaching. A corpus is a collection of samples of language use across various registers, and corpus linguistics attempts to capture how language is used in ‘the real world.’ Prior to the internet boom, this data was captured manually. Given computers’ ability to collect, manage and analyze data more efficiently, corpus data and research have boomed in recent years.

There are various corpora which can be used for research, including the Corpus of Contemporary American English (COCA), the British National Corpus (BNC) and the Longman Spoken and Written English Corpus (LSWE). According to their website, COCA “contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.” Similarly, the BNC contains about 100 million words of written and spoken language, while the LSWE Corpus has about 20 million words from four registers. Together, these tools can provide valuable insight for language teachers.

Corpus to help shape your curriculum

While theoretical grammatical rules may still serve a purpose and should by no means be completely ignored, to avoid being overly prescriptive these rules must be checked against how language is really used. Corpus data helps provide this insight. Had I used corpus data to research the use of ‘will’, ‘going to’ and the present continuous for futurity, I might have found results similar to those of Biber et al. (1999): the prescriptive rules I was preaching do not necessarily apply in spoken or written, academic or general discourse. In fact, corpus studies show that, regardless of the rules, ‘will’ is used in most situations that express futurity. (See Image 1)

McEnery and Wilson (2001: 120 in Nordberg, 2010: 94) say that “…non-empirically based teaching materials can be positively misleading and … corpus studies should be used to inform the production of materials, so that the more common choices of usage are given more attention than those which are less common”. While using corpus data as part of my action research, I realized that I spent most of my class time focusing on rather moot grammatical topics while avoiding highly frequent language usage. For example, I realized that in all my lower intermediate groups I never reviewed the 12 most commonly used verbs (say, get, go, know, think, see, make, come, take, want, give, mean) and their irregular past forms, even though, according to Biber and Conrad, these 12 verbs account for about 45% of occurrences of all lexical verbs (Biber et al., 1999; Biber and Conrad, 2012).
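To make the idea of "coverage" concrete, here is a minimal Python sketch of how one might check what share of verb occurrences a short list of verbs accounts for. It assumes the verbs have already been extracted and reduced to their base forms, which real corpus tools do for you. The function name `coverage` and the ten-word sample are invented for illustration and are not real corpus data.

```python
from collections import Counter

# The 12 most common lexical verbs listed above.
COMMON_VERBS = {"say", "get", "go", "know", "think", "see",
                "make", "come", "take", "want", "give", "mean"}

def coverage(verb_lemmas, targets=COMMON_VERBS):
    """Return the share of verb tokens whose base form is in `targets`."""
    counts = Counter(verb_lemmas)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for lemma, n in counts.items() if lemma in targets)
    return covered / total

# Invented toy sample of verb base forms, NOT real corpus data.
sample = ["say", "go", "walk", "know", "eat",
          "get", "read", "think", "say", "write"]

print(f"{coverage(sample):.0%}")  # share of the sample covered by the 12 verbs
```

Run on a transcript of your own classroom materials, a check like this could show quickly whether the verbs you spend time on are the ones learners are most likely to meet.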

For teachers, corpus data helps guide the curriculum where intuition might fail. Biber and Conrad (2012: 3) state, “In many cases, we simply don’t notice the most typical grammatical features because they are so common.” Using corpus data helps us avoid these mistakes.

In addition to shaping the course curriculum, corpus data can help raise language awareness. For example, many course books provide very clear-cut rules regarding how the four conditionals work. However, Maule’s (1988) and later Jones and Waller’s (2011) works show these rules are not reflective of reality. In fact, Maule used his own set of data to prove to students that their books’ grammatical rules were limiting. While some critics, like Penny Ur, questioned the extremity of Maule’s findings, in her response to Maule, Ur (1989) recognized the importance of making our students aware of the many types and forms of conditionals (beyond what is prescriptively presented in course books). Corpus data is an invaluable tool for doing this.

Corpus for teaching vocabulary

Corpus data can also be an essential source for teaching vocabulary. For example, for one of my lessons I used corpus data to research how the word ‘actually’ is used in American English. According to COCA, ‘actually’ is ranked as the 396th most commonly used word in American English and has over 124,000 occurrences in COCA, with the majority of occurrences in speech. COCA also shows that the use of ‘actually’ is on the rise. (See Image 2)

Corpus data resources also provide a KWIC (Key Word in Context) view that gives teachers and students additional information regarding how a word is used in different registers. For example, using concordance data one can see the typical patterns, various meanings and common collocations for the word ‘actually’.
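For readers curious about what a KWIC view does behind the scenes, the following Python sketch builds simple concordance lines from a piece of text. It is only an illustration of the idea, not how COCA or the BNC implement it. The function name `kwic`, the `width` parameter and the sample sentences are all invented for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit: left context, [keyword], right context."""
    lines = []
    # Match the keyword as a whole word, ignoring upper/lower case.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

# Invented sample sentences, NOT taken from a real corpus.
sample_text = ("I actually thought the class was over. "
               "Actually, it had only just started.")

for line in kwic(sample_text, "actually"):
    print(line)
```

Lining the keyword up in a single column is what lets the eye pick out recurring words to its left and right, which is how collocations and typical patterns become visible.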

Having information on a word’s frequency, connotations and patterns helps teachers decide what to teach and also helps learners become more aware of vocabulary and its various meanings and uses in different registers.

Keeping it simple

While using corpus data may seem time-consuming and daunting, at the very least teachers should incorporate some essential elements from corpus data in the classroom. This may include:

Thinking about selecting corpus-certified course books: Given the importance of corpus data, leading publishers like Cambridge and Oxford have started including corpus data (CEC: Cambridge English Corpus / OUP / BNC: British National Corpus) in their textbooks. When selecting a new course book, look for the ‘Corpus certified’ symbol on the front cover. While this should not be the only guide to selecting a course book, using a book where the author has thought about language relevance and frequency is clearly important.

Enabling students to use corpus data themselves: While some teachers may hesitate to teach students how to use corpus databases, encouraging students to track vocabulary, to check a word’s use / meaning / form and to derive a word’s connotation by using relevant corpus data sources will provide students with a more complete insight into the target language. At a minimum, teachers should consider promoting the use of online tools (like Google) to check word / phrase frequency. Although Google is not as refined as COCA or the BNC, it can easily provide some additional insight into vocabulary usage.

Avoiding prescriptive teaching: At the very least, it behooves teachers to try using corpus databases so that they are reminded that there are no black and white rules in English. I did a great disservice to my students when I presented them with the ‘hard and fast’ rules for futurity. These rules do not necessarily reflect reality and, as a teacher, being aware of English’s ‘shades of grey’ helps reinforce the fact that I should be a ‘guide that helps foster meaningful communication’ rather than ‘a policewoman who enforces language rules’. Corpus data reminds teachers of this fact and helps guide the teacher’s approach in the classroom.


Through their research, Biber, Conrad and Reppen (1994: 171 in Thornbury, 2010) found that “Materials used in the teaching of grammar have commonly been based on intuition… In fact, corpus-based research shows that the actual patterns of function and use in English often differ radically from prior expectations… Some relatively common linguistic constructions are overlooked in pedagogic grammars, while some relatively rare constructions receive considerable attention.” Knowing this, it is essential that teachers consider corpus data to guide their curriculum.


Biber, D., Conrad, S., and Reppen, R. (1994). Corpus-based approaches to issues in applied linguistics. Applied Linguistics, Volume 15(2), p. 171.

Biber, D., Conrad, S., and Reppen, R. (1998). Corpus linguistics: investigating language structure and use. Cambridge: Cambridge University Press.

Biber, D., Johansson, S., Leech, G., Conrad, S., and Finegan, E. (1999). Longman grammar of spoken and written English. Harlow: Pearson Education.

Jones, C., and Waller, D. (January 2011). If only it were true: the problem with the four conditionals. ELT Journal, Volume 65(1), pages 24-32. doi: 10.1093/elt/ccp101

Maule, D. (April 1988). Sorry, but if he comes, I go. ELT Journal, Volume 42(2), pages 117-123. doi: 10.1093/elt/42.2.117

Nordberg, T. (2010). Modality as portrayed in Finnish Upper Secondary School EFL Textbooks: A corpus based approach. Department of English: University of Helsinki. Retrieved from

Ur, P. (January 1989). Response to Sorry, but if he comes, I go. ELT Journal, Volume 43(1), pages 73-74. doi: 10.1093/elt/43.1

Online resources

Mitra, S. (2010, September 7th). How kids teach themselves. TED. Audio podcast retrieved from

Reppen, R. (2012, February 22nd). Corpus Linguistics. Video retrieved from:

Thornbury, S. (January 2010). C is for Corpus. A-Z Blog. Retrieved from:

Corpus Data Sources: