Module 2: Categorical data analysis with R

General description

This course offers an introduction to categorical data analysis (CDA) geared specifically towards researchers in corpus linguistics, variational linguistics, sociolinguistics and translation/interpreting studies. Categorical data analysis is the quantitative analysis of datasets in which the dependent (or response) variable is nominal or ordinal, i.e. takes discrete values that are either unordered or ordered, and in which one tests to what extent one or more independent (or explanatory) variables affect the outcome of the dependent variable. Typical examples include research designs in which the choice between two or more synonymous lexemes (dependent variable) is related to independent variables such as genre, region, text topic and intended audience; other examples include the alternation between functionally equivalent morphosyntactic constructions, such as the that alternation or the genitive alternation in English, or variable word order in Dutch.

Various inferential statistical analyses will be presented and discussed, with a strong focus on hypothesis testing rather than data exploration; authentic data examples from linguistics and translation studies will be provided. The following topics will be covered:

  • Univariate and bivariate CDA: frequencies, proportions, odds, odds ratios; chi-squared test, Fisher’s exact test, Cramér’s V, kappa test
  • Multivariate CDA (1): Binary logistic regression
  • Multivariate CDA (2): Generalized Linear Mixed Models
  • Multivariate CDA (3): Classification and Regression Trees
  • Multivariate CDA (4): Random Forests
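
To give a flavour of the first topic, the following is a minimal sketch in base R of the bivariate measures listed above, applied to a 2x2 table of genre by lexical variant. The counts, the table name tab and the labels news, fiction, variant_a and variant_b are invented for illustration and are not part of the course data.

```r
# Hypothetical counts, invented for illustration only (not course data):
# rows = genre (explanatory variable), columns = lexical variant (response).
tab <- matrix(
  c(30, 20,
    10, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(genre  = c("news", "fiction"),
                  lexeme = c("variant_a", "variant_b"))
)

# Proportions of each variant within each genre (each row sums to 1)
row_props <- prop.table(tab, margin = 1)

# Odds of variant_a versus variant_b within each genre
odds <- tab[, "variant_a"] / tab[, "variant_b"]

# Sample odds ratio: odds of variant_a in news relative to fiction
odds_ratio <- unname(odds["news"] / odds["fiction"])   # 6

# Chi-squared test of independence (no continuity correction)
chi <- chisq.test(tab, correct = FALSE)

# Fisher's exact test, an alternative when expected counts are small
fisher <- fisher.test(tab)

# Cramer's V: effect size derived from the chi-squared statistic
n <- sum(tab)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))

print(row_props)
print(odds_ratio)
print(chi)
print(cramers_v)
```

Note that fisher.test() reports a conditional maximum likelihood estimate of the odds ratio, which will generally differ slightly from the sample odds ratio computed by hand here.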

After an introduction and demonstration of each of the topics, students will get the opportunity to apply the analysis techniques to their own data or a data set which is provided at the start of the course.

As of 2024, this module can be combined with module 10, Multivariate data analysis with R, to form a complete Expert track within this Summer School.

Target audience

Master's students, PhD students or postdocs with a keen interest in using statistical analysis in their research.

Contact the teacher if you are uncertain about the prerequisites: gert.desutter@ugent.be

Course materials

Copies of slides will be provided.

Data sets will be provided.

Recommended but optional handbooks:

  • Agresti, A. (2018). An introduction to categorical data analysis. Wiley.
  • Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. Benjamins.
  • Winter, B. (2019). Statistics for linguists: An introduction using R. Routledge.

Teacher bio

Gert De Sutter is a full professor of Dutch linguistics and translation studies at Ghent University. He is a corpus linguist with a keen interest in quantitative research on syntactic variation in Dutch, German, French and English on the one hand, and on differences between translated language varieties (written translation, audiovisual translation, interpreting) and original, non-translated language varieties on the other. In his research, he uses a variety of statistical techniques (including mixed-effects regression, random forests and conditional inference trees), which he also teaches at international summer schools throughout Europe. He has published widely in leading journals (International Journal of Corpus Linguistics, Target, Across Languages and Cultures, Perspectives).

Schedule

  • Monday 15/07/2024, 9:00-10:30 & 11:00-12:30 & 14:00-15:30 & 16:00-17:30
  • Tuesday 16/07/2024, 9:00-10:30 & 11:00-12:30 & 14:00-15:30 & 16:00-17:30
  • Wednesday 17/07/2024, 9:00-10:30 & 11:00-12:30