Module 10: Multivariate data analysis with R

General description

This module offers an overview of the most important techniques for analyzing multivariate data, i.e. data involving several (correlated) variables. Such multivariate data often arise in studies involving language, e.g. research into language attitudes, reaction times to stimuli or cooccurrence frequencies in corpora. In addition, word embeddings in NLP share various ideas with multivariate statistical techniques, so these similarities will also be touched upon. In particular, this module will cover, more or less in order:

  • Principal Components Analysis: loadings, dimension reduction, scree plot, interpretation of PCs, scores
  • Factor Analysis: difference with PCA, factoring methods (e.g. Principal Axis Factoring vs Maximum Likelihood Estimation), rotation methods (orthogonal, e.g. Varimax, vs oblique, e.g. Oblimin)
  • Biplots
  • Gifi methods: Non-linear PCA & Homogeneity Analysis
  • Correspondence Analysis: Simple Correspondence Analysis, Multiple Correspondence Analysis, relation to LSA & NMF
  • Multi-dimensional Scaling: Unfolding, Procrustes Analysis, deriving dissimilarities from similarity or distance measures (e.g. cosine distance)
  • Analysis of network data (e.g. correlation networks)
  • Cluster Analysis (i.e. unsupervised learning): Hierarchical (clustering methods, tree cuts, silhouette widths), Non-hierarchical (K-means & Partitioning Around Medoids) & Model-based (EM clustering, Latent Class Analysis, relation to PLSA & NMF)
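
As a rough illustration of the first item in the list above, the following minimal sketch runs a PCA on the built-in iris data with base R's prcomp(). It is only an assumption about what such an analysis might look like, not the actual course material, and the object names (pca, loadings, scores, var_explained) are chosen here purely for illustration.

```r
# Minimal PCA sketch on the four numeric iris variables (illustrative only).
# scale. = TRUE standardizes the variables, so the PCA is based on correlations.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

loadings <- pca$rotation   # variable weights per component (eigenvectors)
scores   <- pca$x          # coordinates of the 150 flowers on the components

# Proportion of variance per component: the numbers behind a scree plot.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scree plot and biplot of the first two components.
screeplot(pca, type = "lines", main = "Scree plot (iris)")
biplot(pca)
```

Standardizing the variables is a choice made for this sketch; with scale. = FALSE the components would instead be dominated by the variables with the largest variances.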

All techniques will be presented in five classes, each consisting of a theoretical lecture followed by a practical session in R (or RStudio). The R practicals will introduce the most important R packages for multivariate data analysis, e.g. psych, GPArotation, FactoMineR, Gifi, ca, smacof, igraph, cluster and mixtools. However, this module will not cover structural equation models (e.g. with lavaan).
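
Most of the packages named above have to be installed from CRAN first. As a hedged sketch that runs without them, the example below uses only functions from base R's stats package (dist, hclust, cutree, kmeans, cmdscale) to illustrate hierarchical clustering, K-means and classical multidimensional scaling on the iris data. The choice of three clusters, Ward linkage and the random seed are assumptions made for illustration, not recommendations taken from the course.

```r
# Illustrative sketch with base R only; k = 3 and Ward linkage are arbitrary choices.
x <- scale(iris[, 1:4])          # standardize the four numeric variables
d <- dist(x)                     # Euclidean distances between the 150 flowers

# Hierarchical clustering, followed by a tree cut into three clusters.
hc <- hclust(d, method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)

# Non-hierarchical alternative: K-means with several random starts.
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)

# Cross-tabulate both solutions against the known species.
print(table(hierarchical = hc_clusters, species = iris$Species))
print(table(kmeans = km$cluster, species = iris$Species))

# Classical (metric) MDS: a two-dimensional configuration of the same distances.
mds <- cmdscale(d, k = 2)
plot(mds, col = hc_clusters, xlab = "Dimension 1", ylab = "Dimension 2")
```

The cross-tabulations are only a sanity check that is possible because iris comes with known species labels; in a genuinely unsupervised setting no such reference grouping is available.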

As of 2024, this module can be followed together with module 2 Categorical data analysis with R in a complete Expert track for this Summer School.

Target audience

This module is suitable for anyone doing research on multivariate linguistic data. Typical use cases are language attitudes, reaction times or cooccurrence frequencies in corpora.

Course prerequisites

This module assumes that all participants have a good understanding of basic concepts in both statistics and the R programming language. If you have no clue what the following R commands do, then this module is probably not appropriate for you:

aggregate(. ~ Species, data = iris, sd)

apply(iris[, 1:4], 2, kruskal.test, g = iris[, "Species"])

pairs(iris)

A good preparation for this module is module 1 Introduction to R of this Summer School.

Course materials

Copies of slides and scripts with R code will be provided.

Recommended but optional handbook: Mair, Patrick (2018). Modern Psychometrics with R. Cham (Switzerland): Springer Nature.

Teacher bio

Koen Plevoets (koen.plevoets@ugent.be) is the coordinator of the Master in Statistical Data Analysis and a research consultant at the Department of Translation, Interpreting and Communication, both at Ghent University. He holds a Master in Artificial Intelligence and a Master of Statistics, and he defended his Ph.D. in 2008 on the socio-stylistic variation of morphological features in colloquial Belgian Dutch. From 2010 until 2016, he worked on several projects on corpus-based translation and interpreting studies. His research interests are the cognitive load and disfluencies of interpreters. He is the author of the R packages corregp and svs.

Schedule

  • Wednesday 17/07/2024, 14:00-15:30, 16:00-17:30
  • Thursday 18/07/2024, 9:00-10:30, 11:00-12:30, 14:00-15:30, 16:00-17:30
  • Friday 19/07/2024, 9:00-10:30, 11:00-12:30, 14:00-15:30, 16:00-17:30

In addition to these contact hours, this module requires about five hours of self-study in total.