Chen, X. B., & Meurers, D. (2016).
Natural language processing (NLP) methodologies have been widely adopted for readability assessment and have greatly enhanced predictive accuracy. In the present study, we examine a well-established feature, the frequency of a word in common language use, and systematically explore how such a word-level feature is best used to characterize the reading levels of texts, a text-level classification problem. While such word-level features are traditionally simply averaged over all words of a given text, we show that a richer representation leads to significantly better predictive models.
A basic approach that adds the standard deviation as a feature already shows clear gains, and two more complex options that systematically integrate more frequency information are explored: (i) encoding separate means for the words of a text according to which frequency band of the language they occur in, and (ii) encoding the mean of each cluster of words obtained by agglomerative hierarchical clustering of the words in the text based on their frequency. The former organizes frequency around general language characteristics, whereas the latter aims to lose as little information as possible about the distribution of word frequencies in a given text. To investigate the generalizability of the results, we compare cross-validation experiments within a corpus with cross-corpus experiments testing on the Common Core State Standards reference texts. We also contrast two different frequency norms and compare frequency with a measure of contextual diversity.
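The three text-level representations described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the band edges, and the number of clusters are hypothetical, and the clustering step is a simplified one-dimensional variant that repeatedly merges the two adjacent clusters with the closest means. The input is assumed to be a list with one frequency value (for example, a log frequency from some norm) per word of the text.

```python
from statistics import mean, pstdev


def mean_sd_features(freqs):
    """Baseline plus spread: mean and (population) standard deviation."""
    return [mean(freqs), pstdev(freqs)]


def band_mean_features(freqs, band_edges):
    """Mean frequency of the words falling into each language-wide band.

    band_edges is a sorted list of cut points (hypothetical values);
    n edges define n + 1 bands. Empty bands are encoded as 0.0.
    """
    bands = [[] for _ in range(len(band_edges) + 1)]
    for f in freqs:
        # The band index is the number of edges the value reaches or exceeds.
        idx = sum(f >= edge for edge in band_edges)
        bands[idx].append(f)
    return [mean(b) if b else 0.0 for b in bands]


def cluster_mean_features(freqs, k):
    """Means of k clusters built bottom-up from the text's own frequencies.

    Simplified agglomerative clustering for one-dimensional data: start
    with one cluster per word, then merge the two neighbouring clusters
    whose means are closest until k clusters remain. A text with fewer
    than k words yields fewer than k features.
    """
    clusters = [[f] for f in sorted(freqs)]
    while len(clusters) > k:
        gaps = [
            mean(clusters[i + 1]) - mean(clusters[i])
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [mean(c) for c in clusters]


if __name__ == "__main__":
    word_freqs = [1.0, 1.2, 3.0, 3.2, 6.0]  # made-up values for illustration
    print(mean_sd_features(word_freqs))
    print(band_mean_features(word_freqs, band_edges=[2.0, 4.0]))
    print(cluster_mean_features(word_freqs, k=3))
```

The band-based features depend on cut points fixed in advance for the whole language, whereas the cluster-based features adapt to the distribution of frequencies inside each individual text, which mirrors the contrast drawn between options (i) and (ii).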