The EULE project (Entwicklung von Unterrichtskonzepten zum Lesen lernen im Englischunterricht der Grundschule)

Syntactic complexity indices in FLOWN and SCETCH: BLLIP Parser vs. Stanford Parser

The syntactic complexity indices relevant here are those discussed and computationally implemented in work by Xiaofei Lu (2010, 2011, 2014). I draw heavily on that work, and provide full references to it, in the paper "Lexical and syntactic complexity of narrative English texts for children versus for adults: computational and corpus-based approaches to a comparison" (see here). The syntactic complexity indices for the FLOWN and SCETCH corpora described in that paper were computed from parses of the FLOWN and SCETCH texts produced by the Stanford Parser (version 3.5.2 with the PCFG model for English). I wanted to find out how well those values correlate with a new data set for the same indices computed from parses of the same texts produced by the BLLIP Parser (also known as the Charniak or Charniak-Johnson parser). The BLLIP Parser model that I used was the WSJ+Gigaword-v2 model (see here for an introduction to BLLIP Parser models).

For the original data set, the Stanford Parser was instructed to parse only those sentences from FLOWN and SCETCH that are no longer than 100 words. I therefore had the BLLIP Parser parse exactly those sentences that had also been parsed by the Stanford Parser. The programme that retrieves the data for the computation of the syntactic indices was adapted to handle two differences between the output of the two parsers. First, the root node of the syntactic trees generated by the Stanford Parser carries the label ROOT, whereas the BLLIP Parser labels it S1. Second, the BLLIP Parser uses the labels AUX (auxiliary verb not in the -ing form) and AUXG (auxiliary verb in the -ing form), whereas the Stanford Parser does not.
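The kind of adaptation described above might look roughly as follows. This is a minimal sketch and not the original programme: the function names, the choice to treat AUX and AUXG as verbal tags, and the whitespace-based word count are all assumptions made for illustration.

```python
import re

# Hypothetical tag set: the two BLLIP-specific auxiliary labels.
# Treating them like verb tags is an assumption, not the documented method.
BLLIP_AUX_TAGS = {"AUX", "AUXG"}


def normalize_root(tree: str) -> str:
    """Rewrite a BLLIP-style root label (S1) to the Stanford-style label (ROOT)."""
    return re.sub(r"^\(\s*S1\b", "(ROOT", tree.strip())


def is_verbal_tag(tag: str) -> bool:
    """Accept Penn Treebank verb tags (VB, VBD, ...) plus BLLIP's AUX and AUXG."""
    return tag.startswith("VB") or tag in BLLIP_AUX_TAGS


def within_length_limit(sentence: str, max_words: int = 100) -> bool:
    """Keep only sentences of at most max_words whitespace-separated words."""
    return len(sentence.split()) <= max_words


bllip_tree = "(S1 (S (NP (PRP It)) (VP (AUX is) (ADJP (JJ cold))) (. .)))"
stanford_style_tree = normalize_root(bllip_tree)
```

Normalizing the labels once, before any pattern matching, would let the same retrieval patterns run unchanged on the output of both parsers.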

Three lists follow. List A gives the abbreviations and definitions (quoted from Lu 2011) of the 11 constructional syntactic complexity indices, i.e. those that are not syntactic length indices such as MLS (mean length of sentence), MLT (mean length of T-unit) and MLC (mean length of clause). List B gives, in decreasing order, the Spearman correlation coefficients between the Stanford-based and the BLLIP-based values of these 11 indices for the texts in FLOWN and SCETCH. List C gives the minima, maxima, means and standard deviations of each of the 11 indices as based on the parsing of the sentences in FLOWN and SCETCH by the Stanford Parser and by the BLLIP Parser. All values are rounded to three decimal places.

A: syntactic complexity indices and their definitions

T/S: "number of T-units divided by number of sentences"

VP/T: "number of verb phrases divided by number of T-units"

C/S: "number of clauses divided by number of sentences"

C/T: "number of clauses divided by number of T-units"

DC/C: "number of dependent clauses divided by number of clauses"

DC/T: "number of dependent clauses divided by number of T-units"

CT/T: "number of complex T-units divided by number of T-units"

CN/C: "number of complex nominals divided by number of clauses"

CN/T: "number of complex nominals divided by number of T-units"

CP/C: "number of coordinate phrases divided by number of clauses"

CP/T: "number of coordinate phrases divided by number of T-units"
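Since every index in List A is a ratio of two counts, the computation can be sketched in a few lines. The counts dictionary, its keys and the example numbers below are hypothetical; the sketch assumes that the counts of sentences (S), T-units (T) and clauses (C) are non-zero.

```python
def complexity_indices(counts: dict) -> dict:
    """Compute the 11 constructional indices from raw structure counts.

    Expected keys (hypothetical naming): S (sentences), T (T-units),
    C (clauses), VP (verb phrases), DC (dependent clauses),
    CT (complex T-units), CN (complex nominals), CP (coordinate phrases).
    """
    s, t, c = counts["S"], counts["T"], counts["C"]
    indices = {
        "T/S": t / s,
        "VP/T": counts["VP"] / t,
        "C/S": c / s,
        "C/T": c / t,
        "DC/C": counts["DC"] / c,
        "DC/T": counts["DC"] / t,
        "CT/T": counts["CT"] / t,
        "CN/C": counts["CN"] / c,
        "CN/T": counts["CN"] / t,
        "CP/C": counts["CP"] / c,
        "CP/T": counts["CP"] / t,
    }
    # Round to three decimal places, as in the lists on this page.
    return {name: round(value, 3) for name, value in indices.items()}


# Made-up counts for a single text, for illustration only.
example_counts = {"S": 10, "T": 12, "C": 15, "VP": 18,
                  "DC": 3, "CT": 3, "CN": 9, "CP": 6}
example_indices = complexity_indices(example_counts)
```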

B: correlations Stanford Parser vs. BLLIP Parser

CP/T: 0.980

C/S: 0.977

CN/T: 0.977

DC/T: 0.976

CP/C: 0.973

CT/T: 0.971

VP/T: 0.970

CN/C: 0.969

DC/C: 0.957

C/T: 0.950

T/S: 0.915
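For readers who want to reproduce coefficients of this kind, a Spearman correlation can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below uses only the Python standard library; the two short value lists are invented and are not taken from FLOWN or SCETCH.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-text values of one index under the two parsers.
stanford_values = [1.00, 1.20, 1.10, 1.50]
bllip_values = [1.05, 1.10, 1.15, 1.60]
rho = spearman(stanford_values, bllip_values)
```

Each position in the two lists must refer to the same text, since the coefficient measures how similarly the two parsers rank the texts on a given index.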

C: Stanford Parser vs. BLLIP Parser: min/max/mean/sd

CP/T: Stanford: 0.020/1.212/0.222/0.138 BLLIP: 0.020/1.267/0.221/0.141

C/S: Stanford: 0.750/2.853/1.325/0.314 BLLIP: 0.776/2.824/1.321/0.306

CN/T: Stanford: 0.233/3.364/0.906/0.454 BLLIP: 0.223/3.533/0.874/0.462

DC/T: Stanford: 0.012/1.212/0.348/0.168 BLLIP: 0.006/1.300/0.336/0.173

CP/C: Stanford: 0.018/0.717/0.182/0.096 BLLIP: 0.019/0.746/0.180/0.099

CT/T: Stanford: 0.012/0.697/0.281/0.112 BLLIP: 0.006/0.733/0.268/0.111

VP/T: Stanford: 1.026/3.121/1.685/0.287 BLLIP: 0.929/3.600/1.635/0.332

CN/C: Stanford: 0.270/2.108/0.730/0.269 BLLIP: 0.241/1.915/0.700/0.271

DC/C: Stanford: 0.016/0.579/0.276/0.085 BLLIP: 0.007/0.574/0.264/0.087

C/T: Stanford: 0.735/2.212/1.205/0.206 BLLIP: 0.753/2.267/1.212/0.214

T/S: Stanford: 0.944/1.706/1.091/0.094 BLLIP: 0.918/1.676/1.083/0.090
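Entries in the min/max/mean/sd format of List C can be produced as sketched below. Whether the standard deviations above are sample or population standard deviations is not stated; the sketch assumes the sample standard deviation, and the input values are invented.

```python
from statistics import mean, stdev


def summarize(values):
    """Format min/max/mean/sd with three decimal places (sample sd assumed)."""
    stats = (min(values), max(values), mean(values), stdev(values))
    return "/".join(f"{value:.3f}" for value in stats)


# Invented per-text values of one index, for illustration only.
summary = summarize([1.0, 2.0, 3.0])
```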

As can be seen, the syntactic complexity index values for the combined FLOWN and SCETCH corpora based on the parsing by the Stanford Parser on the one hand and by the BLLIP Parser on the other correlate very strongly. The minima, maxima, means and standard deviations are also very similar between the Stanford-based and the BLLIP-based data. In fact, according to Welch two-sample t-tests, the difference between the Stanford-based and the BLLIP-based means is not statistically significant at the 0.05 significance level for any index. This suggests that the results of the comparison of FLOWN and SCETCH documented in my "Lexical and syntactic complexity of narrative English texts for children versus for adults: computational and corpus-based approaches to a comparison" (see here) would not differ much if the BLLIP Parser were used instead of the Stanford Parser for the retrieval of the syntactic complexity indices.
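The Welch two-sample t-test mentioned above compares two means without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library; obtaining a p-value additionally requires the t distribution, for which a statistics package would normally be used (for example `scipy.stats.ttest_ind` with `equal_var=False`). The sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard errors of the two sample means.
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Invented values of one index under the two parsers.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

A difference counts as non-significant at the 0.05 level when the absolute value of t stays below the critical value of the t distribution for the computed degrees of freedom.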

Last update 2017/02/12