In this data demo, we will be using an open dataset from the Language Use & Social Interaction Lab at Texas Tech University. The data from the project, called “Gendered Language Styles”, were made available on OSF: https://osf.io/963gp/
I do not own these data, nor was I involved in any way in the design of the study or the data collection. I am also not in any way affiliated with the authors of this project.
According to the authors’ lab website: “In this project, we manipulated the language of text prompts to be more feminine or more masculine in style and topic, as well as the labeled sex of the author. Participants read these prompts and rated the author.” (https://www.depts.ttu.edu/psy/lusi/research.php)
The authors conducted four studies, where they gathered different information from participants (summarized in key+measures.xlsx).
The authors posed two main research questions on their lab website:
Additionally, as psycholinguists, I think it is interesting for us to examine which sociolinguistic aspects of the text predict whether the writing style is perceived to be feminine vs. masculine (Q3).
We will aim to answer these three questions in this analysis.
Life course feminine About two years now, we’ve really just been bearing down and keeping stable, because we had loans to pay off and unfortunate but necessary purchases to make, and a new baby arriving just then. Now though, I’ve been thinking back about my old fantasies of living in Sweden. I even mentioned this, talking with my spouse and kids, and as expected, everyone was immediately thrilled; they love adventures, and have always been really supportive. I’m torn because, while I am really excited about just going and doing something so new and unknown, I also feel I should be considering things like stability, roots, and prospects, which is what we have now, even though moving has really always been the plan. Basically, I’m the deciding factor here, and that’s pretty scary. If it were just me, I would be leaving right away, but because I have my family and the foundation that’s here already to think about, I almost want to slap myself for even entertaining these silly dreams. When I really think about it, both options seem like the clearly right choice. Doing something long term rewarding, and doing something pragmatic and reasonable sound like great options, but I can only do one, which is why this decision seems just about impossible.
Life course masculine The last couple of years were ‘bear down to stabilize’ years, with some loans to pay off, some unfortunately necessary large purchases, and a new baby on top of all that. With a lot of that behind us, I feel a bit of room to breathe. In this mind set, some old dreams of living in Sweden are coming up to the surface. Of course, everyone was thrilled at the idea. Part of me wants to go all in; this was always the plan from the start. Another part looks at what we have in place at the moment: stability, roots, prospects, etc.. I almost want to slap myself for continuing to hold on to these silly fantasies. Support from the spouse and kids is firmly in place; they love a good adventure, particularly into the unknown. I’m the deciding factor in all of this, which is scary. With nothing else to consider, I’d be on the road in a heartbeat, but with a family and the partial foundation we put down, the decision seems impossible. In the best case, it’s between something rewarding in the long term, and something pragmatically reasonable. In a sense, each side seems like the right choice, depending on which one I’m focusing on at the time.
Relational feminine A coworker and I have been working closely on a big project assignment. When we take breaks, they talk about their children, one being autistic. I think generally they sound really attentive and caring, basically like a good parent, but with this particular child, they apparently sometimes have to ‘startle’ them by hitting them, because, as they say, this is the only way they can even start calming them. I would usually keep to myself here, but this seems abusive, or just about. I did actually bring my concerns to them once, but they just said someone who didn’t have an autistic child simply wouldn’t understand. I could imagine that reporting them to CPS might just make things worse, particularly if their child is actually where they should be now. Our working relationship would also be destroyed. Then again, I do feel responsible to do something, if only gather additional information, and I don’t think the actual project would suffer, even if we couldn’t work together. I might even get input from someone actually familiar with autistic children, but this might all be me just getting overly nosy. Basically, however I think about it, everything seems really unclear, and I’m not sure what would really be right.
Relational masculine I started an assignment on a big project, working closely with a coworker. They talk a lot about their children during breaks, one of which has some form of autism. For the most part, they seem like a good parent; attentive, caring. But with this particular child, they say that, on occasion, the only way to get them to calm down is to ‘startle’ them by hitting them. I normally stay out of this kind of thing, but this seems at least on the verge of abuse. At one point, I brought up these concerns to them. They said someone without an autistic child wouldn’t understand, leaving it at that. On the one hand, reporting them to CPS could make the situation worse, particularly in the case that the child is in the best place for them. The working relationship would also be destroyed. On the other, I feel somewhat responsible to at least look more into this, and the project itself wouldn’t suffer without us specifically working on it. I want to talk to someone more familiar with dealing with autistic children for their input. All of this might be overly nosy on my part. From any direction, the most right course of action seems unclear.
We will build a data analysis pipeline to answer the three research questions using the Tidyverse in R.
Figure taken from “R for Data Science”
library(tidyverse)
library(convenience) # non-CRAN; install with remotes::install_github("jasongullifer/convenience")
library(ez)
library(lmerTest)
Read in the data (“data.csv”)
df = read.csv("data.csv")
How many observations (rows)?
df %>% nrow()
## [1] 1990
How many variables (columns)?
df %>% ncol()
## [1] 225
Take a look at how many NAs are in each column of df. I’m not going to display the output here because it is very long.
map(df, ~sum(is.na(.))) # using purrr
df %>% summarise(across(everything(), ~sum(is.na(.x)))) # using dplyr; across() replaces the deprecated funs()
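If the wide one-row summary is hard to scan, the counts can be pivoted into a long two-column table. A minimal sketch on a toy data frame (with the real data, you would swap `toy` for `df`):

```r
library(dplyr)
library(tidyr)

# Toy data frame standing in for df
toy <- tibble(a = c(1, NA, 3), b = c(NA, NA, "x"))

# One NA count per column, pivoted long so each column gets its own row
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing")

na_counts
## column a has 1 NA, column b has 2
```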
Familiarize yourself with the column names in the df
df %>% names()
## [1] "reply" "Study" "ParticipantID"
## [4] "demo_age" "demo_sex" "demo_sexor"
## [7] "demo_attraction" "demo_eth" "demo_first_lang"
## [10] "demo_langs" "demo_edu" "demo_pol"
## [13] "demo_sage" "demo_gender" "BSRIFem"
## [16] "BSRIMascPos" "BSRIMascNeg" "SDS"
## [19] "SexismHostile" "SexismBenevolent" "IDWG"
## [22] "EQ" "SQ" "TextQuality"
## [25] "PP1" "PP2" "Behavioral"
## [28] "sds_smile" "sds_preach" "sds_comit"
## [31] "sds_lie" "BSRI_Warm" "BSRI_Gentle"
## [34] "BSRI_Affectionate" "BSRI_Sympathetic" "BSRI_Sensitive"
## [37] "BSRI_Tender" "BSRI_leaderLike" "BSRI_leaderability"
## [40] "BSRI_strongpers" "BSRI_actleader" "BSRI_Dominant"
## [43] "BSRI_Forceful" "BSRI_defend" "BSRI_Decisive"
## [46] "BSRI_attention" "BSRI_aggressive" "BSRI_fem"
## [49] "BSRI_masc" "genfem_act" "genfem_ability"
## [52] "genfem_dictates" "genfem_equal" "genfem_vilify"
## [55] "genfem_important" "genfem_refined" "genfem_behindEveryMan"
## [58] "genfem_notEveryMan" "genfem_cherish" "genfem_offended"
## [61] "genfem_favors" "genfem_control" "IDWG_true"
## [64] "IDWG_offence" "IDWG_responsible" "IDWG_worry"
## [67] "IDWG_camaraderie" "IDWG_important" "ESQ_reaction"
## [70] "ESQ_awkward" "ESQ_understand" "ESQ_intuit"
## [73] "ESQ_masking" "ESQ_machine" "ESQ_building"
## [76] "ESQ_train" "ESQ_road" "ESQ_river"
## [79] "prompt" "Style" "Topic"
## [82] "Label" "Text" "ck_sex"
## [85] "ck_region" "ck_age" "ck_topic"
## [88] "fu_readable" "fu_informative" "fu_compelling"
## [91] "fu_grammatical" "fu_honest" "fu_interesting"
## [94] "fu_wellwritten" "fu_thoughtful" "fu_recommend"
## [97] "pp_fem" "pp_masc" "pp_good"
## [100] "pp_control" "pp_educated" "pp_intelligent"
## [103] "pp_anxious" "pp_tentative" "pp_independent"
## [106] "pp_interested" "pp_average" "pp_likable"
## [109] "pp_wellIntentioned" "pp_powerless" "pp_capable"
## [112] "pp_knowledge" "pp_competent" "relativeSes"
## [115] "socialClass" "pp_success" "pp_similar"
## [118] "pp_relatable" "pp_like" "pp_complex"
## [121] "pp_olike" "pp_noticable" "pp_typical"
## [124] "pp_typicalOfSex" "beh_hearback" "beh_meet"
## [127] "beh_turnsout" "beh_tellSpec" "beh_moreAdv"
## [130] "beh_tellGen" "beh_email" "beh_contact"
## [133] "WC" "Analytic" "Clout"
## [136] "Authentic" "Tone" "WPS"
## [139] "Sixltr" "Dic" "function."
## [142] "pronoun" "ppron" "i"
## [145] "we" "you" "shehe"
## [148] "they" "ipron" "article"
## [151] "prep" "auxverb" "adverb"
## [154] "conj" "negate" "verb"
## [157] "adj" "compare" "interrog"
## [160] "number" "quant" "affect"
## [163] "posemo" "negemo" "anx"
## [166] "anger" "sad" "social"
## [169] "family" "friend" "female"
## [172] "male" "cogproc" "insight"
## [175] "cause" "discrep" "tentat"
## [178] "certain" "differ" "percept"
## [181] "see" "hear" "feel"
## [184] "bio" "body" "health"
## [187] "sexual" "ingest" "drives"
## [190] "affiliation" "achieve" "power"
## [193] "reward" "risk" "focuspast"
## [196] "focuspresent" "focusfuture" "relativ"
## [199] "motion" "space" "time"
## [202] "work" "leisure" "home"
## [205] "money" "relig" "death"
## [208] "informal" "swear" "netspeak"
## [211] "assent" "nonflu" "filler"
## [214] "AllPunc" "Period" "Comma"
## [217] "Colon" "SemiC" "QMark"
## [220] "Exclam" "Dash" "Quote"
## [223] "Apostro" "Parenth" "OtherP"
Make a subset of the dataframe that includes the first 3 columns, the demographic columns, the factor score columns, the condition columns, the follow-up check columns, the text perception columns, and the person perception columns. Make sure to also grab relativeSes and socialClass.
Call this ‘df.sub’
df.sub = df %>% select(reply, Study, ParticipantID,
contains("demo"),
BSRIFem, BSRIMascPos, BSRIMascNeg, SDS, SexismHostile, SexismBenevolent, IDWG, EQ, SQ, TextQuality, PP1, PP2, Behavioral,
prompt, Style, Topic, Label,
contains("ck"),
contains("fu"),
contains("pp"), relativeSes, socialClass)
View the first six rows of the subset
df.sub %>% head()
## reply Study ParticipantID demo_age demo_sex demo_sexor
## 1 First 1 1070688528 29 Male Straight or Heterosexual
## 2 First 1 1127145579 32 Female Straight or Heterosexual
## 3 First 1 1264874158 26 Female Straight or Heterosexual
## 4 First 1 1284195027 25 Female Bisexual
## 5 First 1 1344464075 27 Male Straight or Heterosexual
## 6 First 1 1361139138 44 Female Straight or Heterosexual
## demo_attraction demo_eth demo_first_lang
## 1 <NA> White English
## 2 <NA> White English
## 3 <NA> White English
## 4 <NA> Hispanic or Latin English
## 5 <NA> White english
## 6 <NA> Black or African-American, White english
## demo_langs demo_edu demo_pol
## 1 English, Portuguese, Spanish Post-graduate degree Very conservative
## 2 English only College degree Very liberal
## 3 English Some college Somewhat liberal
## 4 English Some college Very liberal
## 5 english College degree Somewhat conservative
## 6 English Some college Somewhat conservative
## demo_sage demo_gender BSRIFem BSRIMascPos BSRIMascNeg SDS SexismHostile
## 1 <NA> <NA> NA NA NA NA NA
## 2 <NA> <NA> NA NA NA NA NA
## 3 <NA> <NA> NA NA NA NA NA
## 4 <NA> <NA> NA NA NA NA NA
## 5 <NA> <NA> NA NA NA NA NA
## 6 <NA> <NA> NA NA NA NA NA
## SexismBenevolent IDWG EQ SQ TextQuality PP1 PP2 Behavioral
## 1 NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA
## prompt Style Topic Label ck_sex ck_region ck_age
## 1 Masculine - Relational Masculine Relational <NA> Female <NA> NA
## 2 Masculine - Relational Masculine Relational <NA> Female <NA> NA
## 3 Masculine - Relational Masculine Relational <NA> Female <NA> NA
## 4 Feminine - Relational Feminine Relational <NA> Male <NA> NA
## 5 Feminine - Relational Feminine Relational <NA> Female <NA> NA
## 6 Feminine - Relational Feminine Relational <NA> Female <NA> NA
## ck_topic beh_hearback BSRI_Forceful fu_readable fu_informative fu_compelling
## 1 <NA> NA NA NA NA NA
## 2 <NA> NA NA NA NA NA
## 3 <NA> NA NA NA NA NA
## 4 <NA> NA NA NA NA NA
## 5 <NA> NA NA NA NA NA
## 6 <NA> NA NA NA NA NA
## fu_grammatical fu_honest fu_interesting fu_wellwritten fu_thoughtful
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## fu_recommend function. focusfuture pp_fem pp_masc pp_good pp_control
## 1 <NA> 68.21 4.62 9 0 NA 5
## 2 <NA> 66.42 3.73 6 4 NA 4
## 3 <NA> 58.33 2.31 6 4 NA 4
## 4 <NA> 63.26 1.86 5 5 NA 2
## 5 <NA> 65.55 2.87 5 3 NA 7
## 6 <NA> 67.66 5.97 9 1 NA 6
## pp_educated pp_intelligent pp_anxious pp_tentative pp_independent
## 1 NA NA NA NA 8
## 2 NA NA NA NA 5
## 3 NA NA NA NA 6
## 4 NA NA NA NA 4
## 5 NA NA NA NA 10
## 6 NA NA NA NA 8
## pp_interested pp_average pp_likable pp_wellIntentioned pp_powerless
## 1 8 NA NA NA NA
## 2 8 NA NA NA NA
## 3 7 NA NA NA NA
## 4 7 NA NA NA NA
## 5 7 NA NA NA NA
## 6 6 NA NA NA NA
## pp_capable pp_knowledge pp_competent pp_success pp_similar pp_relatable
## 1 NA NA NA 3 NA NA
## 2 NA NA NA 4 NA NA
## 3 NA NA NA 7 NA NA
## 4 NA NA NA 5 NA NA
## 5 NA NA NA 3 NA NA
## 6 NA NA NA 6 NA NA
## pp_like pp_complex pp_olike pp_noticable pp_typical pp_typicalOfSex ppron
## 1 2 0 2 2 4 NA 10.77
## 2 5 8 4 8 6 NA 13.81
## 3 6 7 7 4 6 NA 10.65
## 4 5 7 7 5 6 NA 13.02
## 5 3 2 5 2 3 NA 13.40
## 6 10 7 7 8 7 NA 10.95
## relativeSes socialClass
## 1 -3 NA
## 2 -2 NA
## 3 2 NA
## 4 5 NA
## 5 -2 NA
## 6 4 NA
Some cleaning is almost always necessary before beginning data analysis.
When participants are given the opportunity to type their responses, a myriad of inconsistencies can occur: spelling mistakes, inconsistent capitalization, abbreviations, and so on.
This is the case for demo_first_lang. View all unique values in this column.
df.sub %>% distinct(demo_first_lang) # Same as unique(df.sub$demo_first_lang)
## demo_first_lang
## 1 English
## 2 english
## 3 Spanish
## 4 english
## 5 English
## 6 ENGLISH
## 7 Englisgh
## 8 Russian
## 9 Eneglish
## 10 Korean
## 11 Slovak
## 12 Arabic
## 13 Hindi
## 14 English/Malay bilingual
## 15 Urdu
## 16 Marathi
## 17 <NA>
## 18 Chinese
## 19 Mandarin
## 20 english and spanish, now i speak only english fluently
## 21 Pashto
## 22 Swedish
## 23 Urdu
## 24 French
## 25 Other
Clean up this column by trimming whitespace, lowercasing, and fixing typos.
Call this ‘clean’
clean = df.sub %>% mutate(demo_first_lang = trimws(demo_first_lang)) %>%
mutate(demo_first_lang = tolower(demo_first_lang)) %>%
mutate(demo_first_lang = gsub(pattern = "englisgh", replacement = "english", demo_first_lang)) %>%
mutate(demo_first_lang = gsub(pattern = "eneglish", replacement = "english", demo_first_lang))
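When there are more than a couple of misspellings, a named lookup vector plus `dplyr::recode()` keeps all the fixes in one place. A sketch on a toy vector (the misspellings mirror the ones in this column; the `fixes` lookup is illustrative):

```r
library(dplyr)

# Toy responses mirroring the typos seen in demo_first_lang
langs <- c(" English", "ENGLISH", "Englisgh", "Eneglish", "Russian")

# Named lookup: misspelling -> correction
fixes <- c("englisgh" = "english", "eneglish" = "english")

cleaned <- langs %>%
  trimws() %>%
  tolower() %>%
  recode(!!!fixes)

cleaned
## "english" "english" "english" "english" "russian"
```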
Check your work.
clean %>% distinct(demo_first_lang)
## demo_first_lang
## 1 english
## 2 spanish
## 3 russian
## 4 korean
## 5 slovak
## 6 arabic
## 7 hindi
## 8 english/malay bilingual
## 9 urdu
## 10 marathi
## 11 <NA>
## 12 chinese
## 13 mandarin
## 14 english and spanish, now i speak only english fluently
## 15 pashto
## 16 swedish
## 17 french
## 18 other
Take a look at the unique values of demo_sex.
clean %>% distinct(demo_sex)
## demo_sex
## 1 Male
## 2 Female
## 3 Other
## 4 Prefer not to say
## 5 Male; Female
## 6 Female; Other; Transgender
## 7 <NA>
## 8 Other; agender
## 9 Male; Prefer not to say
## 10 Male; Female; Other; Non-binary gender, biologically female
## 11 Other; non binary
## 12 Male; Female; Other; Non-Binary They/Them pronouns
Given the responses, it seems like this question actually reflects gender identity. It also seems to have been a multiple-selection question (selections separated by semicolons).
It’s super important to be inclusive in the way gender is assessed in questionnaires because you want to capture the reality of your participants (see resources). Nevertheless, any multiple-selection question poses challenges for analysis: the number of possible selection combinations is large, so some response categories end up with only one or two respondents while others have hundreds.
I don’t think anyone has figured out the best way to balance these demands (if you know, please let me know!). So what I think we will do here is group responses that are not “Female” or “Male” into “Other”.
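For what it’s worth, if you did want to preserve the full multiple-selection information rather than collapsing it, `tidyr::separate_rows()` can split the semicolon-separated responses into one row per selection. A sketch on toy data (we don’t take this route in this analysis):

```r
library(dplyr)
library(tidyr)

toy <- tibble(id = c(1, 2),
              demo_sex = c("Male", "Female; Other; Transgender"))

# One row per selected option; sep is a regex that eats the semicolon plus spaces
toy_long <- toy %>% separate_rows(demo_sex, sep = ";\\s*")

toy_long
## 4 rows: Male / Female / Other / Transgender
```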
You may be wondering why we don’t use demo_gender. Well, this was seemingly only assessed in Study 4. So we are going to replace this with our own column (again because I think the question was about gender identity).
clean = clean %>% mutate(demo_gender = ifelse(demo_sex == "Female", "Female",
ifelse(demo_sex == "Male", "Male", "Other")))
Check your work.
clean %>% distinct(demo_gender)
## demo_gender
## 1 Male
## 2 Female
## 3 Other
## 4 <NA>
Let’s say we want to limit our analysis to L1 English participants.
clean = clean %>% filter(demo_first_lang == "english")
Check your work.
clean %>% distinct(demo_first_lang)
## demo_first_lang
## 1 english
In studies 3 and 4, each participant gave two responses (coded in reply). I think they received some sort of feedback between responses, but it’s unclear. So let’s just keep each participant’s first response.
clean = clean %>% filter(reply=="First")
Check your work.
clean %>% distinct(reply)
## reply
## 1 First
Visualize the distribution of demo_age with a dotplot.
clean %>% ggplot(aes(demo_age)) + geom_dotplot(binwidth=5, color = "black", fill = "#E7B800", alpha=0.5) + theme_bw() + ylab("Count") + xlab("Age")
It looks like there is someone with an age of over 300, which is…impossible! Filter out this value, and then check your work.
clean = clean %>% filter(demo_age < 100)
clean %>% summarise(max(demo_age))
## max(demo_age)
## 1 83
Check to make sure there is only one row per participant.
clean %>% group_by(ParticipantID) %>% summarise(count=n()) %>% arrange(desc(count))
## # A tibble: 1,206 x 2
## ParticipantID count
## <dbl> <int>
## 1 1000322124 1
## 2 1011291035 1
## 3 1012091624 1
## 4 1028387854 1
## 5 1035132554 1
## 6 1041903476 1
## 7 1046231287 1
## 8 1057053640 1
## 9 1070688528 1
## 10 1075923140 1
## # … with 1,196 more rows
Perhaps we want to export our clean dataframe to a csv so that in the future we don’t need to run through all the cleaning steps again. What would this look like?
write.csv(clean, "clean.csv", row.names = FALSE) # row.names = FALSE avoids writing an extra index column
Our third research question is about whether perceptions of the text relate to perceptions of the author. In particular, we are interested in whether the author is perceived to be feminine or masculine.
In the dataset, these values are represented in separate columns called pp_fem and pp_masc. This makes it hard to summarize or visualize both simultaneously (e.g., in the same figure).
We can apply some Tidyr functions to alter the structure of the data so that we can easily visualize pp_fem and pp_masc on the same figure.
We want to pivot the data so that there is a new column called ‘pp_gender’ with rows for pp_fem and pp_masc.
Make this as a new dataframe, rather than overwriting clean. Call it ‘clean.pivot’
clean.pivot = clean %>% pivot_longer(c("pp_fem", "pp_masc"), names_to = "pp_gender", values_to = "pp_gender_score")
Check that this worked by comparing the number of rows in ‘clean’ and ‘clean.pivot’.
clean %>% nrow()
## [1] 1206
clean.pivot %>% nrow()
## [1] 2412
How many participants do we have?
clean %>% summarise(count=n())
## count
## 1 1206
Calculate the number of participants per gender group.
clean %>% group_by(demo_gender) %>% summarise(count=n())
## # A tibble: 3 x 2
## demo_gender count
## <chr> <int>
## 1 Female 645
## 2 Male 541
## 3 Other 20
Calculate the average age and the standard deviation.
clean %>% summarise(mean.age = mean(demo_age, na.rm=T), sd.age = sd(demo_age, na.rm=T))
## mean.age sd.age
## 1 33.35572 10.5257
Summarize the education of the sample. Arrange the summary table by education level.
education = c("Prefer not to say", "None, to Some junior high school", "Currently in high school", "Some high school", "High school diploma or equivalent", "Currently in college", "Some college", "College degree", "Currently in graduate school", "Post-graduate degree")
clean = clean %>% mutate(demo_edu = factor(demo_edu, levels = education))
clean %>% group_by(demo_edu) %>% summarise(count=n())
## # A tibble: 11 x 2
## demo_edu count
## <fct> <int>
## 1 Prefer not to say 4
## 2 None, to Some junior high school 1
## 3 Currently in high school 2
## 4 Some high school 4
## 5 High school diploma or equivalent 105
## 6 Currently in college 55
## 7 Some college 319
## 8 College degree 488
## 9 Currently in graduate school 20
## 10 Post-graduate degree 149
## 11 <NA> 59
Summarize the political affiliation of the sample. Arrange from very conservative to very liberal.
politic = c("Very conservative", "Somewhat conservative", "Neither liberal nor conservative", "Somewhat liberal", "Very liberal")
clean = clean %>% mutate(demo_pol = factor(demo_pol, levels=politic))
clean %>% group_by(demo_pol) %>% summarise(count=n())
## # A tibble: 5 x 2
## demo_pol count
## <fct> <int>
## 1 Very conservative 55
## 2 Somewhat conservative 196
## 3 Neither liberal nor conservative 257
## 4 Somewhat liberal 452
## 5 Very liberal 246
Let’s calculate some summary statistics related to our first two research questions. Use the convenience package to do this (https://github.com/jasongullifer/convenience).
sem(clean, dv=relativeSes, id=ParticipantID, Style) #Q1
## # A tibble: 2 x 7
## Style mean_relativeSes sd_relativeSes N SEM upper lower
## <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Feminine 0.566 1.26 594 0.0519 0.618 0.514
## 2 Masculine 0.569 1.30 612 0.0525 0.621 0.516
sem(clean, dv=relativeSes, id=ParticipantID, Style, Topic) #Q2
## # A tibble: 4 x 8
## Style Topic mean_relativeSes sd_relativeSes N SEM upper lower
## <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Feminine Life Cour… 0.704 1.18 334 0.0647 0.768 0.639
## 2 Feminine Relational 0.388 1.34 260 0.0833 0.472 0.305
## 3 Masculine Life Cour… 0.870 1.21 353 0.0642 0.934 0.805
## 4 Masculine Relational 0.158 1.31 259 0.0814 0.240 0.0769
We can create a correlation table of our text perception items to see how they pattern with each other, and how they relate to feminine vs. masculine perceptions of the author.
Text perception was not measured in all four studies. So let’s start by looking at how the following variables correlate with pp_fem and pp_masc in STUDY 3:
fu_readable, fu_informative, fu_interesting, fu_wellwritten, fu_thoughtful.
Drop NAs here (and make sure any non-numerics are transformed to numeric).
corr.vars.s3 = clean %>% select(pp_fem, pp_masc, fu_readable, fu_informative, fu_interesting, fu_wellwritten, fu_thoughtful) %>% drop_na() %>% mutate(across(everything(), as.numeric)) # across() replaces the deprecated funs()
corr.vars.s3 %>% cor()
## pp_fem pp_masc fu_readable fu_informative
## pp_fem 1.00000000 -0.68098187 0.03581227 0.04992335
## pp_masc -0.68098187 1.00000000 -0.03597360 -0.02044515
## fu_readable 0.03581227 -0.03597360 1.00000000 0.50801965
## fu_informative 0.04992335 -0.02044515 0.50801965 1.00000000
## fu_interesting 0.02025197 0.06491528 0.60864460 0.56546081
## fu_wellwritten 0.10143782 -0.05267171 0.72984941 0.54741956
## fu_thoughtful 0.06787330 -0.01434940 0.61829975 0.56297755
## fu_interesting fu_wellwritten fu_thoughtful
## pp_fem 0.02025197 0.10143782 0.0678733
## pp_masc 0.06491528 -0.05267171 -0.0143494
## fu_readable 0.60864460 0.72984941 0.6182997
## fu_informative 0.56546081 0.54741956 0.5629775
## fu_interesting 1.00000000 0.59113577 0.6458886
## fu_wellwritten 0.59113577 1.00000000 0.6165451
## fu_thoughtful 0.64588864 0.61654507 1.0000000
Now, let’s look at how the following variables correlate with pp_fem and pp_masc in STUDY 4:
fu_readable, fu_informative, fu_honest, fu_recommend.
Drop NAs here (and make sure any non-numerics are transformed to numeric).
corr.vars.s4 = clean %>% select(pp_fem, pp_masc, fu_readable, fu_informative, fu_honest, fu_recommend) %>% drop_na() %>% mutate(across(everything(), as.numeric)) # across() replaces the deprecated funs()
corr.vars.s4 %>% cor()
## pp_fem pp_masc fu_readable fu_informative fu_honest
## pp_fem 1.000000000 -0.76370990 0.01994533 -0.08170893 -0.003998515
## pp_masc -0.763709904 1.00000000 0.03042508 0.06851482 0.017831507
## fu_readable 0.019945334 0.03042508 1.00000000 0.43886273 0.458595362
## fu_informative -0.081708930 0.06851482 0.43886273 1.00000000 0.421368272
## fu_honest -0.003998515 0.01783151 0.45859536 0.42136827 1.000000000
## fu_recommend 0.030707347 -0.01708787 0.21013809 0.24831969 0.115439841
## fu_recommend
## pp_fem 0.03070735
## pp_masc -0.01708787
## fu_readable 0.21013809
## fu_informative 0.24831969
## fu_honest 0.11543984
## fu_recommend 1.00000000
It’s super important to look at the distribution of the data, not just the summary statistics: datasets with identical summary statistics can have wildly different distributions.
Figure taken from Matejka et al. 2017
Use the column prompt to represent topic x style. Calculate the mean for each group, and add a horizontal line at that value.
Reorder prompt so that it is grouped by Topic.
pr.lev = c("Feminine - Life Course","Masculine - Life Course","Feminine - Relational", "Masculine - Relational")
clean = clean %>% mutate(prompt = factor(prompt, levels = pr.lev))
clean %>% group_by(prompt) %>% summarise(mean.ses = mean(relativeSes, na.rm=T))
## # A tibble: 4 x 2
## prompt mean.ses
## <fct> <dbl>
## 1 Feminine - Life Course 0.704
## 2 Masculine - Life Course 0.870
## 3 Feminine - Relational 0.388
## 4 Masculine - Relational 0.158
mean.ses = data.frame(prompt = c("Feminine - Life Course", "Feminine - Relational", "Masculine - Life Course", "Masculine - Relational"), relativeSes = c(0.70, 0.39, 0.87, 0.16))
clean %>% ggplot(aes(prompt, relativeSes)) + geom_jitter(aes(color=prompt)) +
geom_segment(aes(y = relativeSes, yend=relativeSes, x = as.numeric(prompt) - 0.5, xend=as.numeric(prompt) + 0.5), mean.ses) + ylab("Relative SES") + xlab("") + scale_color_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) + theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
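As an aside, the `mean.ses` data frame above hardcodes rounded means. The same table can be computed directly, so it stays in sync if the upstream cleaning changes. A self-contained sketch on toy data (with the real data, you would replace `toy` with `clean` and group by `prompt`):

```r
library(dplyr)

toy <- tibble(prompt = rep(c("A", "B"), each = 3),
              relativeSes = c(1, 2, NA, 0, -1, 4))

# Group means, ignoring NAs, ready to feed into geom_segment()
mean_ses <- toy %>%
  group_by(prompt) %>%
  summarise(relativeSes = mean(relativeSes, na.rm = TRUE))

mean_ses
## A -> 1.5, B -> 1
```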
Our correlation tables suggested that fu_wellwritten correlated slightly with pp_fem. Let’s make a scatterplot of this relationship and add a smooth regression line to it. Then, compare to masculine perceptions.
clean %>% ggplot(aes(fu_wellwritten, pp_fem)) + geom_jitter(color = "#E7B800") +
geom_smooth(method="lm", color="black") + ylab("Feminine Perceptions") + xlab("Well Written") + theme_bw() + theme(legend.position="none", axis.text.x = element_text(size=12), axis.title.y=element_text(size=12))
clean %>% ggplot(aes(fu_wellwritten, pp_masc)) + geom_jitter(color = "#E7B800") +
geom_smooth(method="lm", color="black") + ylab("Masculine Perceptions") + xlab("Well Written") + theme_bw() + theme(legend.position="none", axis.text.x = element_text(size=12), axis.title.y=element_text(size=12))
Recall that we created a pivoted dataframe with tidyr called ‘clean.pivot’. Use this dataframe to combine the previous two figures into a single figure. Decide which visual tool (color, shape, size, etc.) gives you the best contrast between the groups.
clean.pivot %>% ggplot(aes(fu_wellwritten, pp_gender_score)) + geom_jitter(aes(color = pp_gender), alpha=0.6) +
geom_smooth(method="lm", fill = "gray60", color = "black", aes(linetype=pp_gender)) + ylab("Gender Perceptions") + xlab("Well Written") + scale_color_manual(values = c("#C93312", "#F2AD00"), name= "Perceived Gender", breaks=c("pp_fem", "pp_masc"), labels = c("Feminine", "Masculine")) + scale_linetype_discrete(name= "Perceived Gender", breaks=c("pp_fem", "pp_masc"), labels = c("Feminine", "Masculine")) + theme_bw() + theme(legend.position="bottom", axis.text.x = element_text(size=12), axis.title.y=element_text(size=12))
clean %>% ggplot(aes(prompt, relativeSes)) + geom_boxplot(aes(color=prompt)) +
ylab("Relative SES") + xlab("") + scale_color_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) + theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Now that we have seen the distribution of the data, let’s visualize the summary statistics (using convenience package).
Visualize the relationship between relativeSes and Style.
conv.q1 = clean %>% sem(dv=relativeSes, id = ParticipantID, Style)
conv.q1 %>% ggplot(aes(Style, mean_relativeSes)) +
geom_bar(aes(fill=Style), color="black", stat="identity", position = "dodge") +
geom_errorbar(aes(ymin=lower, ymax=upper), width = 0.5, position="dodge") +
scale_fill_manual(values = c("#F5CDB4", "#F8AFA8")) +
ylab("Mean Relative SES") + xlab("") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Now, add in the topic. You can use the prompt column to get both style and topic.
conv.q2 = clean %>% sem(dv=relativeSes, id = ParticipantID, prompt)
conv.q2 %>% ggplot(aes(prompt, mean_relativeSes)) +
geom_bar(aes(fill=prompt), color="black", stat="identity", position = "dodge") +
geom_errorbar(aes(ymin=lower, ymax=upper), width = 0.5, position="dodge") +
scale_fill_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) +
ylab("Mean Relative SES") + xlab("") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Visualize the relationship between relativeSes and prompt with a pointrange figure.
conv.q2 %>% ggplot(aes(prompt, mean_relativeSes)) +
geom_pointrange(aes(color=prompt, ymin=lower, ymax=upper)) +
scale_color_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) +
ylab("Mean Relative SES") + xlab("") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Last but not least, it’s valuable to visualize the distribution of your continuous variables (this helps check the normality assumptions of many analyses).
Create a histogram of our DV. Is it normally distributed?
clean %>% ggplot(aes(relativeSes)) + geom_histogram(binwidth = 1, color = "black", fill = "#E7B800", alpha=0.5) + ylab("Count") + xlab("Relative SES") + theme_bw()
Perhaps we want to export one of our figures to a png file. For example, export the histogram to a file called ‘ses.hist.png’
histogram = clean %>% ggplot(aes(relativeSes)) + geom_histogram(binwidth = 1, color = "black", fill = "#E7B800", alpha=0.5) + ylab("Count") + xlab("Relative SES") + theme_bw()
ggsave("ses.hist.png", plot = histogram) # ggsave() takes the filename first, then the plot
Based on our summary statistics and visualizations, we have certain expectations for what the answers to our three research questions could be.
We have seen evidence that perceptions of socioeconomic status of an author don’t seem to vary across feminine vs. masculine writing styles in general.
However, texts that are relational in nature pattern with far lower perceived SES when the writing style is masculine. In other words, authors who write in a stereotypically masculine manner tend to be perceived as lower in SES when the topic is relational, as opposed to life course.
Moreover, our correlations showed that few of the individual text perception items patterned with perceptions of femininity or masculinity. One potential exception was how well written the text appeared: this correlated weakly and positively with feminine perceptions of the author, and essentially not at all with masculine perceptions.
Now, we will conduct some statistical tests so we can say with greater confidence whether these observed patterns are likely generalizable to a wider population.
We can answer Q1 with a t-test: is the perceived SES of a feminine vs. masculine style text significantly different from each other?
Is this paired or unpaired?
Is this one- or two-sided?
#Unpaired (default)
#Two-sided (default)
t.test(clean$relativeSes~clean$Style)
##
## Welch Two Sample t-test
##
## data: clean$relativeSes by clean$Style
## t = -0.040258, df = 1204, p-value = 0.9679
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1477557 0.1418140
## sample estimates:
## mean in group Feminine mean in group Masculine
## 0.5656566 0.5686275
How would we report these results in a paper?
clean %>% group_by(Style) %>% summarise(mean=mean(relativeSes, na.rm=T), sd=sd(relativeSes, na.rm=T))
## # A tibble: 2 x 3
## Style mean sd
## <fct> <dbl> <dbl>
## 1 Feminine 0.566 1.26
## 2 Masculine 0.569 1.30
There was not a significant difference in the relative socioeconomic status of feminine (M = 0.566, SD = 1.264) vs. masculine (M = 0.569, SD = 1.298) texts; t(1204) = -0.04, p = .968.
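When reporting a t-test, many journals also ask for a standardized effect size such as Cohen’s d. A minimal sketch, computed by hand from the group summaries (assuming the clean data frame with relativeSes and Style, as used throughout this demo):

```r
# Cohen's d by hand: mean difference divided by the pooled standard deviation.
# Assumes `clean` contains relativeSes (numeric) and Style (two-level factor).
stats <- clean %>%
  group_by(Style) %>%
  summarise(m = mean(relativeSes, na.rm = TRUE),
            s = sd(relativeSes, na.rm = TRUE),
            n = sum(!is.na(relativeSes)))

pooled_sd <- sqrt(((stats$n[1] - 1) * stats$s[1]^2 +
                   (stats$n[2] - 1) * stats$s[2]^2) /
                  (stats$n[1] + stats$n[2] - 2))

d <- (stats$m[1] - stats$m[2]) / pooled_sd
d
```

Given the near-identical group means here, d should come out essentially zero, consistent with the non-significant t-test.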
We can answer Q2 with an ANOVA, because we have more than 2 groups. Use ezANOVA from the ez package.
Is this a one- or two-way ANOVA?
Is this a within or between design?
Is this a repeated measures design?
Should we use Type I, II, or III?
#Two way (comparing two variables)
#Between (each person views one condition)
#Not repeated measures because only one row per subject (and one DV value)
#II (default) because our levels are not ordered, and we want to compare main effects to each other.
ezANOVA(clean
, dv = .(relativeSes)
, wid = .(ParticipantID)
, between = .(Style, Topic)
, type = 2)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Style 1 1202 0.003896599 9.502366e-01 3.241752e-06
## 2 Topic 1 1202 49.990925442 2.613663e-12 * 3.992914e-02
## 3 Style:Topic 1 1202 7.386548302 6.665802e-03 * 6.107682e-03
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 3 1202 0.5807482 888.9159 0.2617643 0.8529648
If the interaction is statistically significant, calculate the simple main effects to determine which groups were or were not significantly different from each other.
clean.life = clean %>% filter(Topic=="Life Course")
ezANOVA(clean.life
, dv = .(relativeSes)
, wid = .(ParticipantID)
, between = .(Style)
, type = 2)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Style 1 685 3.317284 0.06899104 0.004819411
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 1 685 0.07821472 418.5681 0.1280009 0.7206241
clean.relate = clean %>% filter(Topic=="Relational")
ezANOVA(clean.relate
, dv = .(relativeSes)
, wid = .(ParticipantID)
, between = .(Style)
, type = 2)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Style 1 517 3.903775 0.04870901 * 0.007494235
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 1 517 0.09529833 470.3479 0.1047506 0.7463324
How would we report these results in a paper?
A two-way ANOVA was run on a sample of 1206 participants to examine the effect of style and topic on perceived relative socioeconomic status of the author. There was a significant interaction between the effects of style and topic on relative socioeconomic status, F(1, 1202) = 7.387, p = 0.007. Simple main effects analysis showed that masculine styles were perceived as lower in socioeconomic status than feminine styles when the topic was relational (p = .049), but there was no difference between styles when the topic was life course (p = .069).
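Because we ran two follow-up tests (one per topic), we may want to correct the simple main effect p-values for multiple comparisons. A quick sketch using base R’s p.adjust(), with the p-values copied from the output above:

```r
# Bonferroni-adjust the two simple main effect p-values reported above.
p_raw <- c(life_course = 0.069, relational = 0.049)
p.adjust(p_raw, method = "bonferroni")
```

After Bonferroni correction (each p multiplied by 2, capped at 1), neither simple effect falls below .05, so the relational-topic effect should be interpreted with some caution.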
We can answer Q3 with a simple linear regression, using base R’s lm() function. (The package ‘lmerTest’, related to lme4, extends this approach to mixed-effects models, but we don’t need it here.)
reg1 = lm(pp_fem ~ fu_wellwritten, data = clean)
summary(reg1)
##
## Call:
## lm(formula = pp_fem ~ fu_wellwritten, data = clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2417 -1.2417 0.0025 1.7583 3.4910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3869 0.2874 11.784 <2e-16 ***
## fu_wellwritten 0.1221 0.0514 2.376 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.8 on 723 degrees of freedom
## (481 observations deleted due to missingness)
## Multiple R-squared: 0.007746, Adjusted R-squared: 0.006373
## F-statistic: 5.644 on 1 and 723 DF, p-value: 0.01778
How would we report these results in a paper?
A simple linear regression was calculated to predict feminine author perceptions from how well-written the text was perceived to be. A significant regression equation was found (F(1, 723) = 5.644, p = .018), with an R^2 of 0.008. The predicted feminine perception is equal to 3.387 + 0.122 × (well-written rating). In other words, feminine perceptions of the author increased by 0.122 for each unit increase in how well-written the text was perceived to be.
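To see this relationship, we can overlay the fitted regression line on a scatterplot with geom_smooth(method = "lm"). A sketch reusing the same variables as the model above (the jitter and color choices are illustrative):

```r
# Scatterplot of well-written ratings against feminine author perceptions,
# with the fitted least-squares line and its 95% confidence band.
clean %>%
  ggplot(aes(x = fu_wellwritten, y = pp_fem)) +
  geom_jitter(alpha = 0.3, width = 0.2, height = 0.2) +
  geom_smooth(method = "lm", color = "#FD6467") +
  xlab("How well-written") + ylab("Feminine perception") +
  theme_bw()
```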
Bravo!
A work by Mehrgol Tiv
mehrgoltiv@gmail.com