The EULE project (Entwicklung von Unterrichtskonzepten zum Lesen lernen im Englischunterricht der Grundschule; 'Development of teaching concepts for learning to read in primary school English lessons')

Frequent lemmas in SCETCH and FLOWN

Several MS Excel spreadsheets were created that contain information about frequently occurring lemmas in the SCETCH and FLOWN corpora. They were produced with freely available software packages (the Stanford POS Tagger, Morpha, the Lexical Complexity Analyzer and AntConc), tools that come as standard with the Linux operating system, and some Perl scripts written for specific purposes. A lemma is considered here to occur frequently if it occurs at least 10 times in the respective corpus section as a whole and in at least 4 different corpus texts. The lemma spreadsheet for SCETCH can be obtained from here; that for FLOWN from here. Note that the automatic tagging and lemmatisation procedure converts all upper-case letters to lower case. Conspicuous errors produced by the tagging or lemmatisation procedure have been corrected (for example, charles and texas had been lemmatised as charle and texa respectively). A spreadsheet that lists the differences between the lemma lists in the previous two spreadsheets can be obtained from here.
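The frequency criterion stated above (at least 10 occurrences overall and presence in at least 4 different texts) can be illustrated with a minimal Python sketch. This is not the project's own tooling, which relied on Perl scripts and spreadsheets; the function name, the input layout (a dictionary mapping text identifiers to lists of lower-cased lemmas) and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def frequent_lemmas(texts, min_tokens=10, min_texts=4):
    """Return the lemmas that meet both thresholds.

    texts: dict mapping a text id to a list of lemmas (hypothetical layout).
    """
    total = Counter()   # occurrences of each lemma across the whole corpus
    spread = Counter()  # number of different texts each lemma occurs in
    for lemmas in texts.values():
        total.update(lemmas)
        spread.update(set(lemmas))  # count each lemma once per text
    return {
        lemma
        for lemma, count in total.items()
        if count >= min_tokens and spread[lemma] >= min_texts
    }


# Toy corpus (invented data): "dog" occurs 10 times in 4 texts,
# "cat" occurs 11 times but in only 2 texts.
toy_corpus = {
    "text1": ["dog"] * 4 + ["cat"] * 10,
    "text2": ["dog"] * 3,
    "text3": ["dog"] * 2,
    "text4": ["dog"] * 1 + ["cat"] * 1,
}

result = frequent_lemmas(toy_corpus)
print(sorted(result))
```

In the toy corpus only "dog" qualifies: "cat" is more frequent overall but fails the dispersion threshold, which is the point of requiring a minimum number of texts.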

The frequently occurring lemmas in SCETCH represent 107,863 grammatical word tokens. The total number of grammatical words in SCETCH is 128,680. That is, the frequently occurring lemmas account for ca. 84% of all the word tokens in the corpus. The corresponding share in FLOWN is ca. 81% (103,676 grammatical word tokens represented by the frequently occurring lemmas; 128,535 grammatical words in all).
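The two coverage shares can be re-derived from the token counts given above. The following short Python sketch only repeats that arithmetic; the function name is chosen for illustration.

```python
def coverage_percent(frequent_tokens, total_tokens):
    """Share of all word tokens accounted for by the frequent lemmas, in percent."""
    return 100 * frequent_tokens / total_tokens


# Token counts as reported in the text above.
scetch_share = coverage_percent(107_863, 128_680)
flown_share = coverage_percent(103_676, 128_535)

print(f"SCETCH: ca. {round(scetch_share)}%")
print(f"FLOWN:  ca. {round(flown_share)}%")
```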

The spreadsheets reveal that among the 1158 frequently occurring lemmas in SCETCH there are 284 (ca. 25%) that do not figure among the 1234 frequently occurring lemmas in FLOWN. The number of frequently occurring FLOWN lemmas that do not figure among the frequently occurring SCETCH lemmas is 360 (ca. 29%). These appear to be quite large shares of mutually 'exclusive' lemmas, an impression that is worth investigating further by comparing other corpora of texts for children and texts for adults. If the impression is confirmed, it would suggest that, when supporting German primary school students in learning to read narrative English children's literature, it may not be optimal to base decisions about which lexical items to introduce on considerations relating to narrative adult literature.
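The comparison described above amounts to two set differences. The Python sketch below shows the idea on two small invented lemma sets (they are not taken from the spreadsheets) and then checks that the reported counts are mutually consistent: both subtractions must yield the same number of shared lemmas.

```python
def exclusive_share(own, other):
    """Return the lemmas found only in `own` and their share of `own` in percent."""
    only_own = own - other
    return only_own, 100 * len(only_own) / len(own)


# Invented toy lemma sets, for illustration only.
children = {"the", "dog", "wizard", "nor"}
adults = {"the", "dog", "mortgage", "whom", "despite"}

only_children, share_children = exclusive_share(children, adults)
only_adults, share_adults = exclusive_share(adults, children)
print(sorted(only_children), share_children)
print(sorted(only_adults), share_adults)

# Consistency check on the reported figures: the number of shared lemmas
# must be the same whichever corpus it is computed from.
shared_from_scetch = 1158 - 284
shared_from_flown = 1234 - 360
print(shared_from_scetch, shared_from_flown)
```

Both subtractions give 874 shared lemmas, so the counts reported for the two corpora agree with each other.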

As was perhaps to be expected, the great majority of the mutually exclusive lemmas belong to the open classes (nouns, verbs, adjectives, adverbs). Only very few of the exclusive items belong to closed classes: 1) in SCETCH: the conjunctions nor and either; 2) in FLOWN: the conjunction neither; the prepositions (subsuming subordinating conjunctions) ago, despite and unless; the predeterminer half; the pronouns one and yours; the wh-pronoun whom; the wh-pronoun or wh-determiner whose. (Interjections are ignored here.)

Last update 2017/01/15