| Title: | Functions for Multi-Dimensional Analysis |
|---|---|
| Description: | Multi-Dimensional Analysis (MDA) is an adaptation of factor analysis developed by Douglas Biber (1992) <doi:10.1007/BF00136979>. Its most common use is to describe language as it varies by genre, register, and use. This package contains functions for carrying out the calculations needed to describe and plot MDA results: dimension scores, dimension means, and factor loadings. |
| Authors: | David Brown [aut, cre] (ORCID: <https://orcid.org/0000-0001-7745-6354>), Alex Reinhart [aut] (ORCID: <https://orcid.org/0000-0002-6658-514X>) |
| Maintainer: | David Brown <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.1 |
| Built: | 2026-06-08 11:11:36 UTC |
| Source: | https://github.com/browndw/mda.biber |
Combine scaled vectors of the relevant factor loadings and boxplots of dimension scores.
boxplot_mda(mda_data, n_factor = 1)
mda_data |
An mda data.frame produced by the |
n_factor |
The factor to be plotted. |
A combined plot of scaled vectors and boxplots.
Multi-Dimensional Analysis is a statistical procedure developed by Biber and is commonly used in descriptions of language as it varies by genre, register, and task. The procedure is a specific application of factor analysis, which is used as the basis for calculating a 'dimension score' for each text.
mda_loadings(obs_by_group, n_factors, cor_min = 0.2, threshold = 0.35)
obs_by_group |
A data frame containing exactly 1 categorical (factor) variable and multiple continuous (numeric) variables. Each row represents one document/observation. |
n_factors |
The number of factors to be calculated in the factor analysis. |
cor_min |
The correlation threshold for including variables in the factor analysis. Variables whose (absolute) Pearson correlation with any other variable is greater than this threshold will be included in the factor analysis. Set to 0 to disable thresholding. |
threshold |
The loading threshold above which variables should be included in factor score calculations. Set to 0 to include all variables. |
MDA is fundamentally factor analysis using the promax rotation, applied to
the numeric variables in obs_by_group. However, MDA adds two screening steps:
1. Only variables with a nontrivial correlation with any other variable are
   included; the correlation threshold is configurable with the cor_min
   argument.
2. The factor scores are based only on variables whose loadings are greater
   (in absolute value) than the threshold argument. (Variables are
   standardized to ensure loadings are comparable.)
These two choices eliminate variables that are uncorrelated with others, and essentially enforce sparsity in each factor, ensuring it is loaded only on a smaller set of variables.
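The two screening steps can be sketched in base R. This is a minimal illustration on simulated data, not the package's own implementation: the `toy` data frame, its variable names, and the scoring rule (sum of standardized positively loaded variables minus sum of standardized negatively loaded ones, as in Biber's usual procedure) are assumptions made for the example.

```r
# Simulated data (hypothetical): two latent dimensions drive six variables;
# `noise` is unrelated to both and will usually be screened out in step 1.
set.seed(1)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = n / 3)),
  x1 = f1 + rnorm(n, sd = 0.6),
  x2 = f1 + rnorm(n, sd = 0.6),
  x3 = -f1 + rnorm(n, sd = 0.6),
  y1 = f2 + rnorm(n, sd = 0.6),
  y2 = f2 + rnorm(n, sd = 0.6),
  y3 = -f2 + rnorm(n, sd = 0.6),
  noise = rnorm(n)
)
num <- toy[vapply(toy, is.numeric, logical(1))]

# Screening step 1: keep variables whose largest absolute Pearson correlation
# with any other variable exceeds cor_min.
cor_min <- 0.2
cm <- abs(cor(num))
diag(cm) <- 0
keep <- apply(cm, 2, max) > cor_min
screened <- num[, keep, drop = FALSE]

# Factor analysis with promax rotation on the screened variables.
fa <- factanal(screened, factors = 2, rotation = "promax")
load_mat <- unclass(loadings(fa))

# Screening step 2: for each factor, score documents using only variables
# whose absolute loading exceeds the threshold, on standardized values.
threshold <- 0.35
z <- scale(screened)
scores <- sapply(seq_len(ncol(load_mat)), function(j) {
  pos <- load_mat[, j] > threshold
  neg <- load_mat[, j] < -threshold
  rowSums(z[, pos, drop = FALSE]) - rowSums(z[, neg, drop = FALSE])
})
colnames(scores) <- paste0("Factor", seq_len(ncol(scores)))

# Mean dimension score per group.
group_means <- aggregate(as.data.frame(scores),
                         list(group = toy$group), mean)
```

Because every retained variable is standardized, each dimension score has a mean of zero across all documents, so the group means can be read as deviations from the corpus-wide average.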
An mda data frame with one row per document, holding the factor scores
for each document. Attributes include the number of factors
(n_factors), the loading threshold (threshold), the factor loadings
(loadings), and the mean factor score for each group (group_means).
Biber (1988). Variation across Speech and Writing. Cambridge University Press.
Biber (1992). "The multi-dimensional approach to linguistic analyses of genre variation: An overview of methodology and findings." Computers and the Humanities 26 (5/6), 331-345. doi:10.1007/BF00136979
screeplot_mda(), stickplot_mda(), boxplot_mda()
# Extract the subject area from each document ID and use it as the grouping
# variable
micusp_biber$doc_id <- factor(substr(micusp_biber$doc_id, 1, 3))
m <- mda_loadings(micusp_biber, n_factors = 2)
attr(m, "group_means")
heatmap_mda(m)
The Michigan Corpus of Upper-Level Student Papers (MICUSP) contains 828 student papers. Here each document is tagged with Biber features using the pseudobibeR package. The type-token ratio is computed as a moving-average type-token ratio (MATTR).
micusp_biber
A data frame with 828 rows and 68 columns:
Document ID (from MICUSP)
Rate of past tense per 1,000 tokens
Rate of perfect aspect per 1,000 tokens
Rate of present tense per 1,000 tokens
Rate of place adverbials per 1,000 tokens
Rate of time adverbials per 1,000 tokens
Rate of first person pronouns per 1,000 tokens
Rate of second person pronouns per 1,000 tokens
Rate of third person pronouns per 1,000 tokens
Rate of pronoun 'it' per 1,000 tokens
Rate of demonstrative pronouns per 1,000 tokens
Rate of indefinite pronouns per 1,000 tokens
Rate of proverb 'do' per 1,000 tokens
Rate of wh-questions per 1,000 tokens
Rate of nominalizations per 1,000 tokens
Rate of gerunds per 1,000 tokens
Rate of other nouns per 1,000 tokens
Rate of agentless passives per 1,000 tokens
Rate of by-passives per 1,000 tokens
Rate of 'be' as main verb per 1,000 tokens
Rate of existential 'there' per 1,000 tokens
Rate of that-verb complements per 1,000 tokens
Rate of that-adjective complements per 1,000 tokens
Rate of wh-clauses per 1,000 tokens
Rate of infinitives per 1,000 tokens
Rate of present participles per 1,000 tokens
Rate of past participles per 1,000 tokens
Rate of past participle whiz-deletions per 1,000 tokens
Rate of present participle whiz-deletions per 1,000 tokens
Rate of that-subject clauses per 1,000 tokens
Rate of that-object clauses per 1,000 tokens
Rate of wh-subject clauses per 1,000 tokens
Rate of wh-object clauses per 1,000 tokens
Rate of pied-piping per 1,000 tokens
Rate of sentence relatives per 1,000 tokens
Rate of 'because' per 1,000 tokens
Rate of 'though' per 1,000 tokens
Rate of 'if' per 1,000 tokens
Rate of other adverbial subordinators per 1,000 tokens
Rate of prepositions per 1,000 tokens
Rate of attributive adjectives per 1,000 tokens
Rate of predicative adjectives per 1,000 tokens
Rate of adverbs per 1,000 tokens
Type-token ratio (MATTR)
Mean word length
Rate of conjuncts per 1,000 tokens
Rate of downtoners per 1,000 tokens
Rate of hedges per 1,000 tokens
Rate of amplifiers per 1,000 tokens
Rate of emphatics per 1,000 tokens
Rate of discourse particles per 1,000 tokens
Rate of demonstratives per 1,000 tokens
Rate of possibility modals per 1,000 tokens
Rate of necessity modals per 1,000 tokens
Rate of predictive modals per 1,000 tokens
Rate of public verbs per 1,000 tokens
Rate of private verbs per 1,000 tokens
Rate of suasive verbs per 1,000 tokens
Rate of 'seem' verbs per 1,000 tokens
Rate of contractions per 1,000 tokens
Rate of that-deletions per 1,000 tokens
Rate of stranded prepositions per 1,000 tokens
Rate of split infinitives per 1,000 tokens
Rate of split auxiliaries per 1,000 tokens
Rate of phrasal coordination per 1,000 tokens
Rate of clausal coordination per 1,000 tokens
Rate of synthetic negation per 1,000 tokens
Rate of analytic negation per 1,000 tokens
Michigan Corpus of Upper-Level Student Papers, https://elicorpora.info/main, tagged with the pseudobibeR package.
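The MATTR measure named in the dataset description can be illustrated with a short base R sketch. The function `mattr()` and its `window` argument are hypothetical names introduced here; the window size used when tagging this dataset is not stated, so the value below is chosen only to keep the example small.

```r
# Moving-average type-token ratio: the mean of the type-token ratios of all
# overlapping windows of `window` consecutive tokens.
mattr <- function(tokens, window = 4L) {
  n <- length(tokens)
  # Texts shorter than the window fall back to the plain type-token ratio.
  if (n < window) {
    return(length(unique(tokens)) / n)
  }
  starts <- seq_len(n - window + 1L)
  ttr <- vapply(starts, function(i) {
    length(unique(tokens[i:(i + window - 1L)])) / window
  }, numeric(1))
  mean(ttr)
}

tokens <- c("the", "cat", "saw", "the", "dog", "and",
            "the", "dog", "saw", "the", "cat")
mattr(tokens, window = 4L)
```

Averaging over fixed-size windows makes the measure far less sensitive to text length than the plain type-token ratio, which falls as a text grows longer.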
The scree plot shows each factor along the X axis, and the proportion of common variance explained by that factor on the Y axis. The proportion of common variance explained is given by the factor eigenvalue.
screeplot_mda(obs_by_group, cor_min = 0.2)
obs_by_group |
A data frame containing 1 categorical (factor) variable and continuous (numeric) variables. |
cor_min |
The correlation threshold for including variables in the factor analysis. |
A wrapper for the nFactors::nScree() and nFactors::plotnScree() functions.
Nothing returned
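The idea behind a scree plot can be sketched in base R without nFactors. This is a simplified stand-in, not what screeplot_mda() does internally: it uses the eigenvalues of the full correlation matrix (total rather than common variance), and the built-in `mtcars` data serve only as a placeholder for a table of feature rates.

```r
# Placeholder numeric data; in practice this would be the screened variables.
num <- mtcars

# Eigenvalues of the correlation matrix, returned in decreasing order.
ev <- eigen(cor(num), symmetric = TRUE, only.values = TRUE)$values

# Share of total variance associated with each component.
prop <- ev / sum(ev)

# Scree plot: look for the "elbow" where the curve flattens out.
plot(seq_along(ev), ev, type = "b",
     xlab = "Factor", ylab = "Eigenvalue", main = "Scree plot (sketch)")
abline(h = 1, lty = 2)  # reference line at eigenvalue = 1
```

The number of factors retained is usually chosen near the point where successive eigenvalues stop dropping sharply.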
Stick plots show each group's mean loading along a factor, plotted along a
positive/negative cline. Heatmaps show each variable's loading on a factor.
stickplot_mda() produces just a stick plot, while heatmap_mda() places a
heatmap alongside the stick plot.
stickplot_mda(mda_data, n_factor = 1)
heatmap_mda(mda_data, n_factor = 1)
mda_data |
An mda data frame produced by the |
n_factor |
Index of the factor to be plotted. |
ggplot object
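A possible usage pattern, assuming the mda.biber and ggplot2 packages are installed and that the object returned by stickplot_mda() accepts ordinary ggplot2 layers (an assumption based on the stated return value). The plot title is made up for illustration.

```r
# Run only when both packages are available.
has_pkgs <- requireNamespace("mda.biber", quietly = TRUE) &&
  requireNamespace("ggplot2", quietly = TRUE)

if (has_pkgs) {
  # Use the subject-area prefix of each document ID as the grouping variable.
  d <- mda.biber::micusp_biber
  d$doc_id <- factor(substr(d$doc_id, 1, 3))
  m <- mda.biber::mda_loadings(d, n_factors = 2)

  # Plot group means on the second factor and add a title (hypothetical).
  p <- mda.biber::stickplot_mda(m, n_factor = 2)
  p <- p + ggplot2::ggtitle("Group means on factor 2")
  print(p)
}
```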