The EULE project (Entwicklung von Unterrichtskonzepten zum Lesen lernen im Englischunterricht der Grundschule)

The SCETCH and FLOWN corpora

One strand of the linguistic work in the project is a corpus-based comparison of lexical and syntactic characteristics of narrative English texts for children with those of narrative English texts for adults. To this end, a small corpus with two corpus sections was compiled, one containing 125 texts aimed at a child readership, the other containing 125 texts aimed at an adult readership.

The corpus section with texts for children is called SCETCH, an acronym of Small Corpus of English Texts for Children. Information about the texts contained in SCETCH is recorded in an MS Excel file that can be obtained from here. SCETCH consists of 125 texts with a length of 1000 to 1074 word tokens (with 'word token' here understood as a character sequence matched by the regular expression \w+). The great majority of these corpus texts are extracts from longer stories. A few SCETCH texts contain one or more short complete stories plus an extract from one or more additional stories. The sources for the texts in SCETCH are: texts from the British National Corpus (BNC) categorised as having a child audience; e-books whose content is not DRM-protected; printed books that were OCR-scanned; and unprotected e-books or other text files that are freely downloadable from websites where authors offer their work. The selection of the sources was rather opportunistic. Beyond availability, the only selection criterion strictly applied was that the source obviously is, or is claimed to be, a narrative English text, or a collection of such texts, for a child readership. From each source, a passage was randomly extracted that comprises 1000 orthographic word tokens (i.e. character sequences followed by punctuation or space) plus, where necessary, the words needed to complete a sentence that was cut through by the 1000-word limit. Occasionally, the source is a merger of more than one e-book or file whose texts were authored by the same person and are very similar in content and in the age range of the intended readership.
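The extraction step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the project: the function name extract_sample, the treatment of whitespace-delimited strings as orthographic word tokens, the choice of an arbitrary token position as starting point and the simple test for the end of a sentence are all assumptions made for the example.

```python
import random
import re
from typing import Optional

# Assumption: a token closes a sentence if it ends in . ! or ?,
# optionally followed by a closing quotation mark.
SENTENCE_END = re.compile(r"[.!?]['\"]?$")

def extract_sample(source: str, target: int = 1000,
                   seed: Optional[int] = None) -> str:
    """Return a random extract of `target` orthographic tokens,
    extended to the end of the sentence that the limit cuts through."""
    # Assumption: orthographic tokens are whitespace-delimited,
    # so punctuation stays attached to the preceding word.
    tokens = source.split()
    if len(tokens) <= target:
        return " ".join(tokens)

    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - target + 1)
    end = start + target

    # Add further tokens until the last one completes a sentence.
    while end < len(tokens) and not SENTENCE_END.search(tokens[end - 1]):
        end += 1
    return " ".join(tokens[start:end])
```

Called as extract_sample(text, target=1000), the function returns at least 1000 whitespace-delimited tokens whenever the source is long enough, which is in line with the text lengths of "1000+" reported for the corpus.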

The corpus section that contains the 125 texts for adults is called FLOWN (rhyming with brown). The name reflects the fact that these texts were taken from the widely known FLOB and FROWN corpora, more specifically, from the text categories K ("General Fiction"), L ("Mystery and Detective Fiction"), M ("Science Fiction"), N ("Adventure and Western") and P ("Romance and Love Story") of these corpora. Information about the texts contained in FLOWN is recorded in an MS Excel file that can be obtained from here. As with the texts for SCETCH, the texts for FLOWN were randomly extracted from their sources and cut to a length of 1000+ orthographic word tokens (they are 1005 to 1068 word tokens long, with 'word token' understood as a character sequence matched by the regular expression \w+).
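The two token definitions used above give different counts for the same text. The following sketch illustrates this; the function names are hypothetical, and equating orthographic tokens with whitespace-delimited strings is an assumption made for the example.

```python
import re

def count_word_tokens(text: str) -> int:
    """Count word tokens as character sequences matched by \\w+."""
    return len(re.findall(r"\w+", text))

def count_orthographic_tokens(text: str) -> int:
    """Count whitespace-delimited strings (assumed orthographic tokens)."""
    return len(text.split())

sentence = "'You're ill.' said the doctor."
print(count_word_tokens(sentence))          # You, re, ill, said, the, doctor
print(count_orthographic_tokens(sentence))  # 'You're, ill.', said, the, doctor.
```

Because \w+ splits contractions such as You're into two matches, a text of 1000 orthographic tokens will typically contain somewhat more than 1000 word tokens under the \w+ definition. Together with the words added to complete the final sentence, this is one plausible reason why the reported lengths lie above 1000.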

The texts that were extracted from their sources for inclusion in SCETCH and FLOWN were modified and standardised in certain respects. The BNC version available was the XML edition as described online here. The BNC documents had to be cleared of their XML codes and of the textual material added by the BNC compilers in the header of each document. FLOB and FROWN texts were available from the second edition (1999) of the ICAME CD (see here). Apart from stripping away mark-up codes and the textual material added by the FLOB and FROWN compilers (editorial comments), several decisions had to be taken concerning the treatment of what the manuals accompanying the FLOB and FROWN corpora call "unusable characters" (i.e. non-ASCII characters) and of passages or expressions in languages other than English, including transliterations from Greek and misspellings. If a text was considered to be too heavily burdened by such characteristics, it was excluded from consideration for FLOWN. Texts that showed such characteristics to a lesser extent were not discarded but modified in the following ways: sentences containing foreign expressions that are not commonly used in English and whole sentences in a language other than English were deleted; misspellings were corrected; each character that was coded as 'unusable' was replaced by an ASCII character (e.g. naïve > naive, señor > senor). For FLOWN, 125 texts were randomly sampled from 215 narrative FLOB and FROWN texts thus modified.

Further modifications of the texts for SCETCH and FLOWN were carried out in order to standardise their format in view of the syntactic parsing procedure that they are supposed to undergo. A period was inserted after chapter headings and chapter numbers. All kinds of quotation marks were replaced by a pair of single quotation marks. The ellipsis marker (...) was replaced by a period. Paragraph breaks were deleted. Variations in the punctuation of the end of reported speech or thought, as exemplified by 'You're ill' said the doctor. / 'You're ill,' said the doctor. / 'You're ill', said the doctor. / 'You're ill.' said the doctor., were all standardised to the pattern exemplified by 'You're ill.' said the doctor.
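Several of the format modifications described above can be approximated automatically. The sketch below is an illustration under stated assumptions, not the procedure actually applied to SCETCH and FLOWN: the function names, the short list of reporting verbs and the use of Unicode decomposition for the replacement of non-ASCII characters are all assumptions. The insertion of periods after chapter headings, the deletion of foreign-language sentences and the correction of misspellings are not covered.

```python
import re
import unicodedata

# Straight double quote plus curly double and single quotation marks.
QUOTES = re.compile("[\"\u201c\u201d\u2018\u2019]")
# Three or more periods, or the single ellipsis character.
ELLIPSIS = re.compile(r"\.{3,}|\u2026")
# Closing quote after a word, before a reporting verb.
# Assumption: this short verb list is only illustrative.
REPORTED = re.compile(r"(\w)(?:,'|',|')(?=\s+(?:said|asked|replied|thought)\b)")

def to_ascii(text: str) -> str:
    """Replace accented letters by their base letters (naïve -> naive).
    Characters without an ASCII decomposition are silently dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def standardise(text: str) -> str:
    text = QUOTES.sub("'", text)         # all quotation marks -> '
    text = ELLIPSIS.sub(".", text)       # ellipsis marker -> period
    text = to_ascii(text)                # senor for señor, etc.
    text = REPORTED.sub(r"\1.'", text)   # 'You're ill,' said -> 'You're ill.' said
    text = re.sub(r"\s*\n+\s*", " ", text)  # delete paragraph breaks
    return text.strip()

raw = "\u201cYou\u2019re ill,\u201d said the doctor.\n\nShe was na\u00efve... Se\u00f1or Diaz left."
print(standardise(raw))
```

The quotation marks are replaced before the ASCII conversion because the conversion would otherwise delete curly quotation marks instead of replacing them. Rules of this kind overgenerate and undergenerate on real text, so the output would still need the proofreading pass.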

All texts contained in SCETCH and FLOWN were proofread with the aim of correcting obvious typos and ensuring that the modifications just described were carried out consistently. (Lack of capitalisation of the first letter of a sentence after a period was not necessarily corrected, since this has no effect on any of the envisioned computational linguistic applications.)

Last update 2017/01/15