Title: Aggregate Counts of Linguistic Features
Description: Calculates the lexicogrammatical and functional features described by Biber (1985) <doi:10.1515/ling.1985.23.2.337> and widely used for text-type, register, and genre classification tasks.
Authors: David Brown [aut, cre]
Maintainer: David Brown <[email protected]>
License: MIT + file LICENSE
Version: 1.2
Built: 2025-02-18 05:27:00 UTC
Source: https://github.com/cran/pseudobibeR
Takes data that has been part-of-speech tagged and dependency parsed and extracts counts of features that have been used in Douglas Biber's research since the late 1980s.
biber(tokens, measure = c("MATTR", "TTR", "CTTR", "MSTTR", "none"), normalize = TRUE)

## S3 method for class 'spacyr_parsed'
biber(tokens, measure = c("MATTR", "TTR", "CTTR", "MSTTR", "none"), normalize = TRUE)

## S3 method for class 'udpipe_connlu'
biber(tokens, measure = c("MATTR", "TTR", "CTTR", "MSTTR", "none"), normalize = TRUE)
tokens: A dataset of tokens created by spacyr::spacy_parse() or udpipe::udpipe_annotate()

measure: Measure to use for the type-token ratio, passed on to the underlying type-token ratio calculation

normalize: If TRUE, count features are normalized to the rate per 1,000 tokens
Refer to spacyr::spacy_parse() or udpipe::udpipe_annotate() for details on parsing texts. These must be configured to do part-of-speech and dependency parsing. For spacyr::spacy_parse(), use the dependency = TRUE, tag = TRUE, and pos = TRUE arguments; for udpipe::udpipe_annotate(), set the tagger and parser arguments to "default".
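A typical tagging pipeline meeting these requirements might look like the following sketch. It assumes spaCy (via spacyr) and a downloaded udpipe English model are available locally; the document IDs and text are illustrative only.

```r
library(spacyr)
library(udpipe)
library(pseudobibeR)

txt <- c(doc1 = "The dog that bit me was caught by the neighbor.")

# spacyr: both dependency parses and part-of-speech tags are required
spacy_initialize()
parsed_spacy <- spacy_parse(txt, dependency = TRUE, tag = TRUE, pos = TRUE)
biber(parsed_spacy)

# udpipe: enable both the tagger and the parser
m <- udpipe_download_model(language = "english")
model <- udpipe_load_model(m$file_model)
parsed_udpipe <- udpipe_annotate(model, x = txt,
                                 tagger = "default", parser = "default")
biber(parsed_udpipe)
```

If either the tagger or the parser is disabled, the columns biber() needs (tags and dependency relations) will be missing and feature detection cannot proceed.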
Feature extraction relies on a dictionary (included as dict) and word lists (word_lists) to match specific features; see their documentation and values for details on the exact patterns and words matched by each. The function identifies other features based on local cues, which are approximations. Because they rely on the probabilistic taggers provided by spaCy or udpipe, the accuracy of the resulting counts is dependent on the accuracy of those models. Thus, texts with irregular spellings, non-normative punctuation, etc. will likely produce unreliable outputs, unless the taggers are tuned specifically for those purposes.
The following features are detected. Square brackets in example sentences indicate the location of the feature.
Verbs in the past tense.
Verbs in the perfect aspect, indicated by "have" as an auxiliary verb (e.g., I [have] written this sentence.)
Verbs in the present tense.
Place adverbials (e.g., above, beside, outdoors; see list in dict$f_04_place_adverbials)
Time adverbials (e.g., early, instantly, soon; see dict$f_05_time_adverbials)
First-person pronouns; see dict$f_06_first_person_pronouns
Second-person pronouns; see dict$f_07_second_person_pronouns
Third-person personal pronouns (excluding it); see dict$f_08_third_person_pronouns
Pronoun it, its, or itself
Pronouns being used to replace a noun (e.g. [That] is an example sentence.)
Indefinite pronouns (e.g., anybody, nothing, someone; see dict$f_11_indefinite_pronouns)
Pro-verb do
Direct wh- questions (e.g., When are you leaving?)
Nominalizations (nouns ending in -tion, -ment, -ness, -ity, e.g. adjustment, abandonment)
Gerunds (participial forms functioning as nouns)
Total other nouns
Agentless passives (e.g., The task [was done].)
by- passives (e.g., The task [was done by Steve].)
be as main verb
Existential there (e.g., [There] is a feature in this sentence.)
that verb complements (e.g., I said [that he went].)
that adjective complements (e.g., I'm glad [that you like it].)
wh- clauses (e.g., I believed [what he told me].)
Infinitives
Present participial adverbial clauses (e.g., [Stuffing his mouth with cookies], Joe ran out the door.)
Past participial adverbial clauses (e.g., [Built in a single week], the house would stand for fifty years.)
Past participial postnominal (reduced relative) clauses (e.g., the solution [produced by this process])
Present participial postnominal (reduced relative) clauses (e.g., the event [causing this decline])
that relative clauses on subject position (e.g., the dog [that bit me])
that relative clauses on object position (e.g., the dog [that I saw])
wh- relatives on subject position (e.g., the man [who likes popcorn])
wh- relatives on object position (e.g., the man [who Sally likes])
Pied-piping relative clauses (e.g., the manner [in which he was told])
Sentence relatives (e.g., Bob likes fried mangoes, [which is the most disgusting thing I've ever heard of].)
Causative adverbial subordinator (because)
Concessive adverbial subordinators (although, though)
Conditional adverbial subordinators (if, unless)
Other adverbial subordinators (e.g., since, while, whereas)
Total prepositional phrases
Attributive adjectives (e.g., the [big] horse)
Predicative adjectives (e.g., The horse is [big].)
Total adverbs
Type-token ratio (including punctuation), using the statistic chosen in measure, or TTR if there are fewer than 200 tokens in the smallest document.
Average word length (across tokens, excluding punctuation)
Conjuncts (e.g., consequently, furthermore, however; see dict$f_45_conjuncts)
Downtoners (e.g., barely, nearly, slightly; see dict$f_46_downtoners)
Hedges (e.g., at about, something like, almost; see dict$f_47_hedges)
Amplifiers (e.g., absolutely, extremely, perfectly; see dict$f_48_amplifiers)
Emphatics (e.g., a lot, for sure, really; see dict$f_49_emphatics)
Discourse particles (e.g., sentence-initial well, now, anyway; see dict$f_50_discourse_particles)
Demonstratives (that, this, these, or those used as determiners, e.g. [That] is the feature)
Possibility modals (can, may, might, could)
Necessity modals (ought, should, must)
Predictive modals (will, would, shall)
Public verbs (e.g., assert, declare, mention; see dict$f_55_verb_public)
Private verbs (e.g., assume, believe, doubt, know; see dict$f_56_verb_private)
Suasive verbs (e.g., command, insist, propose; see dict$f_57_verb_suasive)
seem and appear
Contractions
Subordinator that deletion (e.g., I think [he went].)
Stranded prepositions (e.g., the candidate that I was thinking [of])
Split infinitives (e.g., He wants [to convincingly prove] that ...)
Split auxiliaries (e.g., They [were apparently shown] to ...)
Phrasal co-ordination (N and N; Adj and Adj; V and V; Adv and Adv)
Independent clause co-ordination (clause-initial and)
Synthetic negation (e.g., No answer is good enough for Jones.)
Analytic negation (e.g., That isn't good enough.)
A data.frame of features containing one row per document and one column per feature. If normalize is TRUE, count features are normalized to the rate per 1,000 tokens.
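The effect of normalize can be seen directly with the sample data shipped in the package; this brief sketch assumes pseudobibeR is installed and uses the bundled udpipe_samples object.

```r
library(pseudobibeR)

# Default: count features reported as rates per 1,000 tokens
rates <- biber(udpipe_samples)

# Raw counts instead of normalized rates
counts <- biber(udpipe_samples, normalize = FALSE)

# Both return one row per document and one column per feature
dim(rates)
dim(counts)
```

Normalized rates are usually preferable when comparing documents of different lengths, since raw counts scale with document size.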
Biber, Douglas (1985). "Investigating macroscopic textual variation through multifeature/multidimensional analyses." Linguistics 23(2), 337-360. doi:10.1515/ling.1985.23.2.337
Biber, Douglas (1988). Variation across Speech and Writing. Cambridge University Press.
Biber, Douglas (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press.
Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100. doi:10.1080/09296171003643098
# Parse the example documents provided with the package
biber(udpipe_samples)
biber(spacy_samples)
For Biber features defined by matching text against dictionaries of word patterns (such as third-person pronouns or conjunctions), or features that can be found by matching patterns against text, this gives the dictionary of patterns for each feature. These are primarily used internally by biber(), but are exported so users can examine the feature definitions.

dict

A named list with one entry per feature. The name is the feature name, such as f_33_pied_piping; values give a list of terms or patterns. Patterns are matched to spaCy tokens using quanteda::tokens_lookup() with the glob valuetype.
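Since dict is exported, its feature definitions can be inspected directly. A small sketch, assuming pseudobibeR is installed:

```r
library(pseudobibeR)

# One entry per pattern-based feature, keyed by feature name
head(names(dict))

# Glob patterns for a single feature, e.g. pied-piping relative clauses
dict$f_33_pied_piping
```

Because matching uses glob patterns, an entry like "t*ion" would match any token beginning with "t" and ending in "ion"; consult the printed patterns for the exact wildcards each feature uses.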
Examples of spaCy and udpipe tagging output from excerpts of several public-domain texts. Can be passed to biber() to see examples of its feature detection.

udpipe_samples
spacy_samples

An object of class udpipe_connlu of length 3.

An object of class spacyr_parsed (inherits from data.frame) with 1346 rows and 9 columns.

Texts consist of early paragraphs from several public-domain books distributed by Project Gutenberg (https://gutenberg.org). Document IDs are the Project Gutenberg book numbers.

See udpipe::udpipe_annotate() and spacyr::spacy_parse() for further details on the data format produced by each package.
For Biber features defined by matching texts against certain exact words, rather than patterns, this list defines the exact words defining the features. These lists are primarily used internally by biber(), but are exported so users can examine the feature definitions.

word_lists

A named list with one entry per word list. Each entry is a vector of words.
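As with dict, the exported word_lists object can be examined to see exactly which words trigger each exact-match feature. A sketch, assuming pseudobibeR is installed; the specific list names shown by names() should be checked rather than assumed:

```r
library(pseudobibeR)

# The available word lists, keyed by name
names(word_lists)

# Each entry is a plain character vector of exact words
str(word_lists[[1]])
```

Inspecting these vectors is the quickest way to audit why a particular token was (or was not) counted toward an exact-match feature.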