1. Theoretical Concepts
There are two approaches to corpus linguistics: corpus-driven and corpus-based. Corpus-driven linguists tend to use a corpus inductively to form hypotheses about language, without making reference to existing linguistic frameworks. Corpus-based linguists tend to use corpora to test or refine existing hypotheses taken from other sources.
2. Building and Annotating Corpora
Kennedy (1998: 68) suggests that ‘for the study of prosody’ (rhythm, stress, and intonation), a corpus of 100,000 words will usually be big enough to make generalizations for most descriptive purposes. He also says that an analysis of verb-form morphology would require half a million words. For lexicography (the analysis of words and their uses, often for dictionary building), a million words is unlikely to be large enough, as up to half the words will only occur once. Biber (1993) suggests that a million words would be enough for grammatical studies. The British National Corpus covers a very wide range of written and spoken language genres.
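The point about words that occur only once can be checked on any text. The following is a minimal Python sketch, not a method taken from Kennedy: the function name `hapax_ratio`, the simple regular-expression tokenizer, and the sample sentence are all illustrative assumptions.

```python
import re
from collections import Counter


def hapax_ratio(text: str) -> float:
    """Return the share of distinct word types that occur exactly once."""
    # Assumption: a crude tokenizer (lowercase letters and apostrophes only).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Hypothetical sample text, far too small to be a real corpus.
sample = "the cat sat on the mat while the dog slept"
print(hapax_ratio(sample))
```

On a real corpus the tokenizer would need more care (hyphens, numbers, non-English characters), but the calculation stays the same.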
Sampling, balance, and representativeness are key theoretical concepts in corpus linguistics. Because a corpus ought to be representative of a particular language, language variety, or topic, the texts within it must be chosen and balanced carefully in order to ensure that some texts do not skew the corpus as a whole. Corpora are often annotated with additional information, allowing more complex calculations to be performed on them. Such information can take several forms.
3. Types and Applications of Corpora
A specialized corpus, however, can be smaller and contain a more restricted set of texts. There could be restrictions on genre, time, place, or language variety. Specialized corpora are generally easier to build than general corpora.
Another distinction involves whether a corpus contains spoken, written or computer-mediated texts. Spoken corpora generally tend to be smaller than written or computer-based corpora, due to complexities surrounding gathering and transcribing data.
Written corpora are generally easier to build (and large archives of texts that were originally published on paper can be found on the internet, meaning that such texts are already electronically coded). However, unless specifically encoded, formatting information such as font size and colour, as well as pictures, can be absent from written corpora.
Corpora of computer-mediated texts are expected to become increasingly popular as societies make more use of electronic forms of communication. Such texts can be very easy to gather, since mining programs can store whole websites at a time. However, computer-mediated texts can contain a lot of noise, such as spam, hidden keywords designed to make a page attractive to search engines, and navigation menus, which may need to be stripped out of individual pages before the text can be included in the corpus.
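As an illustration of stripping noise from a web page, the sketch below uses Python's standard `html.parser` module to keep visible text while skipping scripts, styles, and navigation blocks. The class name `TextExtractor`, the function `clean_page`, the list of skipped tags, and the sample page are assumptions made for this example; real pages usually need more careful, site-specific cleaning.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that is not inside a tag treated as noise."""

    # Assumption: these tags hold non-corpus material on the pages at hand.
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many skipped tags we are currently inside
        self.parts = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_page(html: str) -> str:
    """Return the text of a page with the skipped blocks removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page used only for demonstration.
page = (
    "<html><nav><a href='/'>Home</a></nav>"
    "<p>Corpora are useful.</p>"
    "<script>var x = 1;</script></html>"
)
print(clean_page(page))
```

Keywords hidden in tag attributes (for example in `meta` tags) never reach `handle_data`, so they are dropped as a side effect. Spam within the visible text would need separate filtering.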
A third distinction involves the language or languages in which a corpus is encoded. A growing area of corpus linguistics involves the comparison of different languages, which is useful in fields such as language testing, language teaching and translation. A multilingual corpus usually contains equal amounts of texts from a number of different languages, often in the same genre. A parallel corpus is a more carefully designed type of multilingual corpus, where the texts are exact equivalents (i.e. translations) of each other. Parallel corpora are often sentence-aligned (i.e. tags are added to the corpus data which act as markers to indicate which sentences are translations of each other).
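Sentence alignment can be pictured as pairing each source sentence with its translation under a shared identifier. The Python sketch below assumes the simplest case, a strict one-to-one alignment. The names `AlignedPair` and `align`, the identifier format, and the English and French sentences are hypothetical; real aligned corpora also have to handle sentences that are split or merged in translation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    align_id: str  # marker shared by a sentence and its translation
    source: str
    target: str


def align(source_sents, target_sents, prefix="s"):
    """Pair sentences one-to-one and give each pair an alignment marker."""
    # Assumption: both texts have the same number of sentences, in order.
    if len(source_sents) != len(target_sents):
        raise ValueError("one-to-one alignment needs equal sentence counts")
    return [
        AlignedPair(f"{prefix}{i}", src, tgt)
        for i, (src, tgt) in enumerate(zip(source_sents, target_sents), start=1)
    ]


# Hypothetical two-sentence parallel text.
english = ["The house is small.", "I like tea."]
french = ["La maison est petite.", "J'aime le thé."]

pairs = align(english, french)
for pair in pairs:
    print(pair.align_id, "|", pair.source, "|", pair.target)
```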
Finally, a learner corpus is a corpus of a particular language produced by learners of that language. Learner corpora can be useful in allowing teachers to identify common errors at various stages of development, as well as showing over- and underuse of lexis or grammar when compared to an equivalent corpus of native speaker language.
4. Corpus Software and Analysis
A related form of frequency analysis involves calculating keywords. A keyword is a word which occurs statistically more frequently in one file or corpus than in another. We could refer to our own knowledge in order to hypothesize explanations for our results, but hypotheses are not always validated upon closer investigation. According to Leech, between 1961 and 1991 both American and British users showed a trend towards decreased use of modal verbs (have to, need to).
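One widely used statistic for deciding whether a word is a keyword is log-likelihood. The notes above do not say which test is intended, so the choice of statistic here is an assumption. The sketch compares a word's frequency in a target corpus with its frequency in a reference corpus. The function names, the toy token lists, and the cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) are illustrative choices.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood for a word with frequency a in a corpus of c tokens
    and frequency b in a corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target_tokens, reference_tokens, threshold=3.84):
    """Words relatively more frequent in the target, ranked by score."""
    if not target_tokens or not reference_tokens:
        return []
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    results = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # Keep only words that are over-represented in the target corpus.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                results.append((word, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical toy corpora of 100 tokens each.
target = ["tax"] * 20 + ["the"] * 80
reference = ["tax"] * 2 + ["the"] * 98
print(keywords(target, reference))
```

A high score only shows that the difference in frequency is unlikely to be due to chance. Explaining why the word is key still requires reading it in context.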
A concordance is simply a list of the occurrences of a word or phrase, each with a few words of context, so we can see at a glance how the word tends to be used. The examination of concordances also helps to reveal discourse prosodies, which are often indicative of attitudes. A concordance analysis therefore combines aspects of quantitative and qualitative analysis. A statistical procedure which helps make the information more manageable is collocation. Collocation refers to the statistically significant co-occurrence of words.
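Both ideas can be sketched in a few lines of Python. The concordance function lists each occurrence of a node word with its left and right context. The collocation function scores neighbouring words with a simple Mutual Information (MI) variant, which is one of several statistics in use and is an assumption here rather than the measure the notes have in mind. The function names, the window sizes, and the toy sentence are likewise illustrative.

```python
import math
from collections import Counter


def concordance(tokens, node, width=3):
    """Return (left context, node, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, width=3):
    """Score words found within `width` tokens of the node using MI."""
    freq = Counter(tokens)
    n = len(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            observed.update(window)
    scores = {}
    for word, o in observed.items():
        # Assumption: expected co-occurrence if words were spread at random,
        # scaled by the total window size (width on each side of the node).
        expected = freq[node] * freq[word] * 2 * width / n
        scores[word] = math.log2(o / expected)
    return scores


# Hypothetical toy text; real scores need a corpus of realistic size.
tokens = "strong tea and strong coffee but weak tea".split()

for left, node, right in concordance(tokens, "tea", width=2):
    print(f"{left:>12} [{node}] {right}")
print(collocates(tokens, "tea", width=1))
```

MI tends to favour rare words, so analysts usually combine it with a minimum frequency threshold or with a second statistic.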
5. Critical Considerations
- Corpora can be time-consuming, expensive, and difficult to build, requiring careful decisions to be made regarding sampling and representativeness.
- Researchers who are not computer literate may initially find it off-putting to have to engage with analytical software or statistical tests.
- Corpus analysis works best at identifying certain types of patterns (Baker, 2010: 109-110).