In this data demo, we will be using an open dataset from the Language Use & Social Interaction Lab at Texas Tech University. The data from the project, called “Gendered Language Styles”, were made available on OSF: https://osf.io/963gp/
I do not own these data, nor was I involved in any way in the design of the study or the data collection. I am also not in any way affiliated with the authors of this project.
According to the authors’ lab website: “In this project, we manipulated the language of text prompts to be more feminine or more masculine in style and topic, as well as the labeled sex of the author. Participants read these prompts and rated the author.” (https://www.depts.ttu.edu/psy/lusi/research.php)
The authors conducted four studies, where they gathered different information from participants (summarized in key+measures.xlsx).
The authors posed two main research questions on their lab website:
Additionally, as psycholinguists, I think it is interesting for us to examine which sociolinguistic aspects of the text predict whether the writing style is perceived to be feminine vs. masculine (Q3).
We will aim to answer these three questions in this analysis.
Life course feminine About two years now, we’ve really just been bearing down and keeping stable, because we had loans to pay off and unfortunate but necessary purchases to make, and a new baby arriving just then. Now though, I’ve been thinking back about my old fantasies of living in Sweden. I even mentioned this, talking with my spouse and kids, and as expected, everyone was immediately thrilled; they love adventures, and have always been really supportive. I’m torn because, while I am really excited about just going and doing something so new and unknown, I also feel I should be considering things like stability, roots, and prospects, which is what we have now, even though moving has really always been the plan. Basically, I’m the deciding factor here, and that’s pretty scary. If it were just me, I would be leaving right away, but because I have my family and the foundation that’s here already to think about, I almost want to slap myself for even entertaining these silly dreams. When I really think about it, both options seem like the clearly right choice. Doing something long term rewarding, and doing something pragmatic and reasonable sound like great options, but I can only do one, which is why this decision seems just about impossible.
Life course masculine The last couple of years were ‘bear down to stabilize’ years, with some loans to pay off, some unfortunately necessary large purchases, and a new baby on top of all that. With a lot of that behind us, I feel a bit of room to breathe. In this mind set, some old dreams of living in Sweden are coming up to the surface. Of course, everyone was thrilled at the idea. Part of me wants to go all in; this was always the plan from the start. Another part looks at what we have in place at the moment: stability, roots, prospects, etc.. I almost want to slap myself for continuing to hold on to these silly fantasies. Support from the spouse and kids is firmly in place; they love a good adventure, particularly into the unknown. I’m the deciding factor in all of this, which is scary. With nothing else to consider, I’d be on the road in a heartbeat, but with a family and the partial foundation we put down, the decision seems impossible. In the best case, it’s between something rewarding in the long term, and something pragmatically reasonable. In a sense, each side seems like the right choice, depending on which one I’m focusing on at the time.
Relational feminine A coworker and I have been working closely on a big project assignment. When we take breaks, they talk about their children, one being autistic. I think generally they sound really attentive and caring, basically like a good parent, but with this particular child, they apparently sometimes have to ‘startle’ them by hitting them, because, as they say, this is the only way they can even start calming them. I would usually keep to myself here, but this seems abusive, or just about. I did actually bring my concerns to them once, but they just said someone who didn’t have an autistic child simply wouldn’t understand. I could imagine that reporting them to CPS might just make things worse, particularly if their child is actually where they should be now. Our working relationship would also be destroyed. Then again, I do feel responsible to do something, if only gather additional information, and I don’t think the actual project would suffer, even if we couldn’t work together. I might even get input from someone actually familiar with autistic children, but this might all be me just getting overly nosy. Basically, however I think about it, everything seems really unclear, and I’m not sure what would really be right.
Relational masculine I started an assignment on a big project, working closely with a coworker. They talk a lot about their children during breaks, one of which has some form of autism. For the most part, they seem like a good parent; attentive, caring. But with this particular child, they say that, on occasion, the only way to get them to calm down is to ‘startle’ them by hitting them. I normally stay out of this kind of thing, but this seems at least on the verge of abuse. At one point, I brought up these concerns to them. They said someone without an autistic child wouldn’t understand, leaving it at that. On the one hand, reporting them to CPS could make the situation worse, particularly in the case that the child is in the best place for them. The working relationship would also be destroyed. On the other, I feel somewhat responsible to at least look more into this, and the project itself wouldn’t suffer without us specifically working on it. I want to talk to someone more familiar with dealing with autistic children for their input. All of this might be overly nosy on my part. From any direction, the most right course of action seems unclear.
We will build a data analysis pipeline to answer the three research questions using the Tidyverse in R.
Figure taken from “R for Data Science”
library(tidyverse)
library(convenience) # non-CRAN; install with remotes::install_github("jasongullifer/convenience")
library(ez)
library(lmerTest)
Read in the data (“data.csv”)
df = read.csv("data.csv")
How many observations (rows)?
df %>% nrow()
## [1] 1990
How many variables (columns)?
df %>% ncol()
## [1] 225
Take a look at how many NAs are in each column of df. I’m not going to display the output here because it is very long.
map(df, ~sum(is.na(.))) # using purrr
df %>% summarise(across(everything(), ~sum(is.na(.x)))) # using dplyr; across() replaces the deprecated funs()
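If the wide one-row summary is hard to scan, the counts can be pivoted into a long two-column table. A minimal sketch on a toy data frame (with the real data, you would swap `toy` for `df`):

```r
library(dplyr)
library(tidyr)

# Toy data frame standing in for df
toy <- tibble(a = c(1, NA, 3), b = c(NA, NA, "x"))

# One NA count per column, pivoted long so each column gets its own row
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing")

na_counts
## column a has 1 NA, column b has 2
```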
Familiarize yourself with the column names in the df
df %>% names()
## [1] "reply" "Study" "ParticipantID"
## [4] "demo_age" "demo_sex" "demo_sexor"
## [7] "demo_attraction" "demo_eth" "demo_first_lang"
## [10] "demo_langs" "demo_edu" "demo_pol"
## [13] "demo_sage" "demo_gender" "BSRIFem"
## [16] "BSRIMascPos" "BSRIMascNeg" "SDS"
## [19] "SexismHostile" "SexismBenevolent" "IDWG"
## [22] "EQ" "SQ" "TextQuality"
## [25] "PP1" "PP2" "Behavioral"
## [28] "sds_smile" "sds_preach" "sds_comit"
## [31] "sds_lie" "BSRI_Warm" "BSRI_Gentle"
## [34] "BSRI_Affectionate" "BSRI_Sympathetic" "BSRI_Sensitive"
## [37] "BSRI_Tender" "BSRI_leaderLike" "BSRI_leaderability"
## [40] "BSRI_strongpers" "BSRI_actleader" "BSRI_Dominant"
## [43] "BSRI_Forceful" "BSRI_defend" "BSRI_Decisive"
## [46] "BSRI_attention" "BSRI_aggressive" "BSRI_fem"
## [49] "BSRI_masc" "genfem_act" "genfem_ability"
## [52] "genfem_dictates" "genfem_equal" "genfem_vilify"
## [55] "genfem_important" "genfem_refined" "genfem_behindEveryMan"
## [58] "genfem_notEveryMan" "genfem_cherish" "genfem_offended"
## [61] "genfem_favors" "genfem_control" "IDWG_true"
## [64] "IDWG_offence" "IDWG_responsible" "IDWG_worry"
## [67] "IDWG_camaraderie" "IDWG_important" "ESQ_reaction"
## [70] "ESQ_awkward" "ESQ_understand" "ESQ_intuit"
## [73] "ESQ_masking" "ESQ_machine" "ESQ_building"
## [76] "ESQ_train" "ESQ_road" "ESQ_river"
## [79] "prompt" "Style" "Topic"
## [82] "Label" "Text" "ck_sex"
## [85] "ck_region" "ck_age" "ck_topic"
## [88] "fu_readable" "fu_informative" "fu_compelling"
## [91] "fu_grammatical" "fu_honest" "fu_interesting"
## [94] "fu_wellwritten" "fu_thoughtful" "fu_recommend"
## [97] "pp_fem" "pp_masc" "pp_good"
## [100] "pp_control" "pp_educated" "pp_intelligent"
## [103] "pp_anxious" "pp_tentative" "pp_independent"
## [106] "pp_interested" "pp_average" "pp_likable"
## [109] "pp_wellIntentioned" "pp_powerless" "pp_capable"
## [112] "pp_knowledge" "pp_competent" "relativeSes"
## [115] "socialClass" "pp_success" "pp_similar"
## [118] "pp_relatable" "pp_like" "pp_complex"
## [121] "pp_olike" "pp_noticable" "pp_typical"
## [124] "pp_typicalOfSex" "beh_hearback" "beh_meet"
## [127] "beh_turnsout" "beh_tellSpec" "beh_moreAdv"
## [130] "beh_tellGen" "beh_email" "beh_contact"
## [133] "WC" "Analytic" "Clout"
## [136] "Authentic" "Tone" "WPS"
## [139] "Sixltr" "Dic" "function."
## [142] "pronoun" "ppron" "i"
## [145] "we" "you" "shehe"
## [148] "they" "ipron" "article"
## [151] "prep" "auxverb" "adverb"
## [154] "conj" "negate" "verb"
## [157] "adj" "compare" "interrog"
## [160] "number" "quant" "affect"
## [163] "posemo" "negemo" "anx"
## [166] "anger" "sad" "social"
## [169] "family" "friend" "female"
## [172] "male" "cogproc" "insight"
## [175] "cause" "discrep" "tentat"
## [178] "certain" "differ" "percept"
## [181] "see" "hear" "feel"
## [184] "bio" "body" "health"
## [187] "sexual" "ingest" "drives"
## [190] "affiliation" "achieve" "power"
## [193] "reward" "risk" "focuspast"
## [196] "focuspresent" "focusfuture" "relativ"
## [199] "motion" "space" "time"
## [202] "work" "leisure" "home"
## [205] "money" "relig" "death"
## [208] "informal" "swear" "netspeak"
## [211] "assent" "nonflu" "filler"
## [214] "AllPunc" "Period" "Comma"
## [217] "Colon" "SemiC" "QMark"
## [220] "Exclam" "Dash" "Quote"
## [223] "Apostro" "Parenth" "OtherP"
Make a subset of the dataframe that includes the first 3 columns, the demographic columns, the factor score columns, the condition columns, the follow-up check columns, the text perception columns, and the person perception columns. Make sure to also grab relativeSes and socialClass.
Call this ‘df.sub’
df.sub = df %>% select(reply, Study, ParticipantID,
contains("demo"),
BSRIFem, BSRIMascPos, BSRIMascNeg, SDS, SexismHostile, SexismBenevolent, IDWG, EQ, SQ, TextQuality, PP1, PP2, Behavioral,
prompt, Style, Topic, Label,
contains("ck"),
contains("fu"),
contains("pp"), relativeSes, socialClass)
View the first six rows of the subset
df.sub %>% head()
## reply Study ParticipantID demo_age demo_sex demo_sexor
## 1 First 1 1070688528 29 Male Straight or Heterosexual
## 2 First 1 1127145579 32 Female Straight or Heterosexual
## 3 First 1 1264874158 26 Female Straight or Heterosexual
## 4 First 1 1284195027 25 Female Bisexual
## 5 First 1 1344464075 27 Male Straight or Heterosexual
## 6 First 1 1361139138 44 Female Straight or Heterosexual
## demo_attraction demo_eth demo_first_lang
## 1 <NA> White English
## 2 <NA> White English
## 3 <NA> White English
## 4 <NA> Hispanic or Latin English
## 5 <NA> White english
## 6 <NA> Black or African-American, White english
## demo_langs demo_edu demo_pol
## 1 English, Portuguese, Spanish Post-graduate degree Very conservative
## 2 English only College degree Very liberal
## 3 English Some college Somewhat liberal
## 4 English Some college Very liberal
## 5 english College degree Somewhat conservative
## 6 English Some college Somewhat conservative
## demo_sage demo_gender BSRIFem BSRIMascPos BSRIMascNeg SDS SexismHostile
## 1 <NA> <NA> NA NA NA NA NA
## 2 <NA> <NA> NA NA NA NA NA
## 3 <NA> <NA> NA NA NA NA NA
## 4 <NA> <NA> NA NA NA NA NA
## 5 <NA> <NA> NA NA NA NA NA
## 6 <NA> <NA> NA NA NA NA NA
## SexismBenevolent IDWG EQ SQ TextQuality PP1 PP2 Behavioral
## 1 NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA
## prompt Style Topic Label ck_sex ck_region ck_age
## 1 Masculine - Relational Masculine Relational <NA> Female <NA> NA
## 2 Masculine - Relational Masculine Relational <NA> Female <NA> NA
## 3 Masculine - Relational Masculine Relational <NA> Female <NA> NA
## 4 Feminine - Relational Feminine Relational <NA> Male <NA> NA
## 5 Feminine - Relational Feminine Relational <NA> Female <NA> NA
## 6 Feminine - Relational Feminine Relational <NA> Female <NA> NA
## ck_topic beh_hearback BSRI_Forceful fu_readable fu_informative fu_compelling
## 1 <NA> NA NA NA NA NA
## 2 <NA> NA NA NA NA NA
## 3 <NA> NA NA NA NA NA
## 4 <NA> NA NA NA NA NA
## 5 <NA> NA NA NA NA NA
## 6 <NA> NA NA NA NA NA
## fu_grammatical fu_honest fu_interesting fu_wellwritten fu_thoughtful
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## fu_recommend function. focusfuture pp_fem pp_masc pp_good pp_control
## 1 <NA> 68.21 4.62 9 0 NA 5
## 2 <NA> 66.42 3.73 6 4 NA 4
## 3 <NA> 58.33 2.31 6 4 NA 4
## 4 <NA> 63.26 1.86 5 5 NA 2
## 5 <NA> 65.55 2.87 5 3 NA 7
## 6 <NA> 67.66 5.97 9 1 NA 6
## pp_educated pp_intelligent pp_anxious pp_tentative pp_independent
## 1 NA NA NA NA 8
## 2 NA NA NA NA 5
## 3 NA NA NA NA 6
## 4 NA NA NA NA 4
## 5 NA NA NA NA 10
## 6 NA NA NA NA 8
## pp_interested pp_average pp_likable pp_wellIntentioned pp_powerless
## 1 8 NA NA NA NA
## 2 8 NA NA NA NA
## 3 7 NA NA NA NA
## 4 7 NA NA NA NA
## 5 7 NA NA NA NA
## 6 6 NA NA NA NA
## pp_capable pp_knowledge pp_competent pp_success pp_similar pp_relatable
## 1 NA NA NA 3 NA NA
## 2 NA NA NA 4 NA NA
## 3 NA NA NA 7 NA NA
## 4 NA NA NA 5 NA NA
## 5 NA NA NA 3 NA NA
## 6 NA NA NA 6 NA NA
## pp_like pp_complex pp_olike pp_noticable pp_typical pp_typicalOfSex ppron
## 1 2 0 2 2 4 NA 10.77
## 2 5 8 4 8 6 NA 13.81
## 3 6 7 7 4 6 NA 10.65
## 4 5 7 7 5 6 NA 13.02
## 5 3 2 5 2 3 NA 13.40
## 6 10 7 7 8 7 NA 10.95
## relativeSes socialClass
## 1 -3 NA
## 2 -2 NA
## 3 2 NA
## 4 5 NA
## 5 -2 NA
## 6 4 NA
Some cleaning is almost always necessary before beginning data analysis.
When participants are given the opportunity to type their responses, a myriad of inconsistencies can occur: spelling mistakes, inconsistent capitalization, abbreviations, and so on.
This is the case for demo_first_lang. View all unique values in this column.
df.sub %>% distinct(demo_first_lang) # Same as unique(df.sub$demo_first_lang)
## demo_first_lang
## 1 English
## 2 english
## 3 Spanish
## 4 english
## 5 English
## 6 ENGLISH
## 7 Englisgh
## 8 Russian
## 9 Eneglish
## 10 Korean
## 11 Slovak
## 12 Arabic
## 13 Hindi
## 14 English/Malay bilingual
## 15 Urdu
## 16 Marathi
## 17 <NA>
## 18 Chinese
## 19 Mandarin
## 20 english and spanish, now i speak only english fluently
## 21 Pashto
## 22 Swedish
## 23 Urdu
## 24 French
## 25 Other
Clean up this column by trimming whitespace, lowercasing, and fixing typos.
Call this ‘clean’
clean = df.sub %>% mutate(demo_first_lang = trimws(demo_first_lang)) %>%
mutate(demo_first_lang = tolower(demo_first_lang)) %>%
mutate(demo_first_lang = gsub(pattern = "englisgh", replacement = "english", demo_first_lang)) %>%
mutate(demo_first_lang = gsub(pattern = "eneglish", replacement = "english", demo_first_lang))
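When there are more than a couple of misspellings, a named lookup vector plus `dplyr::recode()` keeps all the fixes in one place. A sketch on a toy vector (the misspellings mirror the ones in this column; the `fixes` lookup is illustrative):

```r
library(dplyr)

# Toy responses mirroring the typos seen in demo_first_lang
langs <- c(" English", "ENGLISH", "Englisgh", "Eneglish", "Russian")

# Named lookup: misspelling -> correction
fixes <- c("englisgh" = "english", "eneglish" = "english")

cleaned <- langs %>%
  trimws() %>%
  tolower() %>%
  recode(!!!fixes)

cleaned
## "english" "english" "english" "english" "russian"
```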
Check your work.
clean %>% distinct(demo_first_lang)
## demo_first_lang
## 1 english
## 2 spanish
## 3 russian
## 4 korean
## 5 slovak
## 6 arabic
## 7 hindi
## 8 english/malay bilingual
## 9 urdu
## 10 marathi
## 11 <NA>
## 12 chinese
## 13 mandarin
## 14 english and spanish, now i speak only english fluently
## 15 pashto
## 16 swedish
## 17 french
## 18 other
Take a look at the unique values of demo_sex.
clean %>% distinct(demo_sex)
## demo_sex
## 1 Male
## 2 Female
## 3 Other
## 4 Prefer not to say
## 5 Male; Female
## 6 Female; Other; Transgender
## 7 <NA>
## 8 Other; agender
## 9 Male; Prefer not to say
## 10 Male; Female; Other; Non-binary gender, biologically female
## 11 Other; non binary
## 12 Male; Female; Other; Non-Binary They/Them pronouns
Given the responses, it seems like this question actually reflects gender identity. It also seems to have been a multiple-selection question (selections separated by semicolons).
It’s super important to be inclusive in the way gender is assessed in questionnaires because you want to capture the reality of your participants (see resources). Nevertheless, any multiple-selection question poses challenges for analysis: the number of possible selection combinations is large, so some response categories end up with only one or two respondents while others have hundreds.
I don’t think anyone has figured out the best way to balance these demands (if you know, please let me know!). So what I think we will do here is group responses that are not “Female” or “Male” into “Other”.
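For what it’s worth, if you did want to preserve the full multiple-selection information rather than collapsing it, `tidyr::separate_rows()` can split the semicolon-separated responses into one row per selection. A sketch on toy data (we don’t take this route in this analysis):

```r
library(dplyr)
library(tidyr)

toy <- tibble(id = c(1, 2),
              demo_sex = c("Male", "Female; Other; Transgender"))

# One row per selected option; sep is a regex that eats the semicolon plus spaces
toy_long <- toy %>% separate_rows(demo_sex, sep = ";\\s*")

toy_long
## 4 rows: Male / Female / Other / Transgender
```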
You may be wondering why we don’t use demo_gender. Well, this was seemingly only assessed in Study 4. So we are going to replace this with our own column (again because I think the question was about gender identity).
clean = clean %>% mutate(demo_gender = ifelse(demo_sex == "Female", "Female",
ifelse(demo_sex == "Male", "Male", "Other")))
Check your work.
clean %>% distinct(demo_gender)
## demo_gender
## 1 Male
## 2 Female
## 3 Other
## 4 <NA>
Let’s say we want to limit our analysis to L1 English participants.
clean = clean %>% filter(demo_first_lang == "english")
Check your work.
clean %>% distinct(demo_first_lang)
## demo_first_lang
## 1 english
In studies 3 and 4, each participant gave two responses (coded in reply). I think they received some sort of feedback between responses, but it’s unclear. So let’s just keep each participant’s first response.
clean = clean %>% filter(reply=="First")
Check your work.
clean %>% distinct(reply)
## reply
## 1 First
Visualize the distribution of demo_age with a dotplot.
clean %>% ggplot(aes(demo_age)) + geom_dotplot(binwidth=5, color = "black", fill = "#E7B800", alpha=0.5) + theme_bw() + ylab("Count") + xlab("Age")
It looks like there is someone with an age of over 300, which is…impossible! Filter out this value, and then check your work.
clean = clean %>% filter(demo_age < 100)
clean %>% summarise(max(demo_age))
## max(demo_age)
## 1 83
Check to make sure there is only one row per participant.
clean %>% group_by(ParticipantID) %>% summarise(count=n()) %>% arrange(desc(count))
## # A tibble: 1,206 x 2
## ParticipantID count
## <dbl> <int>
## 1 1000322124 1
## 2 1011291035 1
## 3 1012091624 1
## 4 1028387854 1
## 5 1035132554 1
## 6 1041903476 1
## 7 1046231287 1
## 8 1057053640 1
## 9 1070688528 1
## 10 1075923140 1
## # … with 1,196 more rows
Perhaps we want to export our clean dataframe to a csv so that in the future we don’t need to run through all the cleaning steps again. What would this look like?
write.csv(clean, "clean.csv", row.names = FALSE) # row.names = FALSE avoids writing an extra index column
Our third research question is about whether perceptions of the text relate to perceptions of the author. In particular, we are interested in whether the author is perceived to be feminine or masculine.
In the dataset, these values are represented in separate columns called pp_fem and pp_masc. This makes it hard to summarize or visualize both simultaneously (e.g., in the same figure).
We can apply some Tidyr functions to alter the structure of the data so that we can easily visualize pp_fem and pp_masc on the same figure.
We want to pivot the data so that there is a new column called ‘pp_gender’ with rows for pp_fem and pp_masc.
Make this as a new dataframe, rather than overwriting clean. Call it ‘clean.pivot’
clean.pivot = clean %>% pivot_longer(c("pp_fem", "pp_masc"), names_to = "pp_gender", values_to = "pp_gender_score")
Check that this worked by comparing the number of rows in ‘clean’ and ‘clean.pivot’.
clean %>% nrow()
## [1] 1206
clean.pivot %>% nrow()
## [1] 2412
How many participants do we have?
clean %>% summarise(count=n())
## count
## 1 1206
Calculate the number of participants per gender group.
clean %>% group_by(demo_gender) %>% summarise(count=n())
## # A tibble: 3 x 2
## demo_gender count
## <chr> <int>
## 1 Female 645
## 2 Male 541
## 3 Other 20
Calculate the average age and the standard deviation.
clean %>% summarise(mean.age = mean(demo_age, na.rm=T), sd.age = sd(demo_age, na.rm=T))
## mean.age sd.age
## 1 33.35572 10.5257
Summarize the education of the sample. Arrange the summary table by education level.
education = c("Prefer not to say", "None, to Some junior high school", "Currently in high school", "Some high school", "High school diploma or equivalent", "Currently in college", "Some college", "College degree", "Currently in graduate school", "Post-graduate degree")
clean = clean %>% mutate(demo_edu = factor(demo_edu, levels = education))
clean %>% group_by(demo_edu) %>% summarise(count=n())
## # A tibble: 11 x 2
## demo_edu count
## <fct> <int>
## 1 Prefer not to say 4
## 2 None, to Some junior high school 1
## 3 Currently in high school 2
## 4 Some high school 4
## 5 High school diploma or equivalent 105
## 6 Currently in college 55
## 7 Some college 319
## 8 College degree 488
## 9 Currently in graduate school 20
## 10 Post-graduate degree 149
## 11 <NA> 59
Summarize the political affiliation of the sample. Arrange from very conservative to very liberal.
politic = c("Very conservative", "Somewhat conservative", "Neither liberal nor conservative", "Somewhat liberal", "Very liberal")
clean = clean %>% mutate(demo_pol = factor(demo_pol, levels=politic))
clean %>% group_by(demo_pol) %>% summarise(count=n())
## # A tibble: 5 x 2
## demo_pol count
## <fct> <int>
## 1 Very conservative 55
## 2 Somewhat conservative 196
## 3 Neither liberal nor conservative 257
## 4 Somewhat liberal 452
## 5 Very liberal 246
Let’s calculate some summary statistics related to our first two research questions. Use the convenience package to do this (https://github.com/jasongullifer/convenience).
sem(clean, dv=relativeSes, id=ParticipantID, Style) #Q1
## # A tibble: 2 x 7
## Style mean_relativeSes sd_relativeSes N SEM upper lower
## <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Feminine 0.566 1.26 594 0.0519 0.618 0.514
## 2 Masculine 0.569 1.30 612 0.0525 0.621 0.516
sem(clean, dv=relativeSes, id=ParticipantID, Style, Topic) #Q2
## # A tibble: 4 x 8
## Style Topic mean_relativeSes sd_relativeSes N SEM upper lower
## <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Feminine Life Cour… 0.704 1.18 334 0.0647 0.768 0.639
## 2 Feminine Relational 0.388 1.34 260 0.0833 0.472 0.305
## 3 Masculine Life Cour… 0.870 1.21 353 0.0642 0.934 0.805
## 4 Masculine Relational 0.158 1.31 259 0.0814 0.240 0.0769
We can create a correlation table of our text perception items to see how they pattern with each other, and how they relate to feminine vs. masculine perceptions of the author.
Text perception was not measured in all four studies. So let’s start by looking at how the following variables correlate with pp_fem and pp_masc in STUDY 3:
fu_readable, fu_informative, fu_interesting, fu_wellwritten, fu_thoughtful.
Drop NAs here (and make sure any non-numerics are transformed to numeric).
corr.vars.s3 = clean %>% select(pp_fem, pp_masc, fu_readable, fu_informative, fu_interesting, fu_wellwritten, fu_thoughtful) %>% drop_na() %>% mutate(across(everything(), as.numeric)) # across() replaces the deprecated funs()
corr.vars.s3 %>% cor()
## pp_fem pp_masc fu_readable fu_informative
## pp_fem 1.00000000 -0.68098187 0.03581227 0.04992335
## pp_masc -0.68098187 1.00000000 -0.03597360 -0.02044515
## fu_readable 0.03581227 -0.03597360 1.00000000 0.50801965
## fu_informative 0.04992335 -0.02044515 0.50801965 1.00000000
## fu_interesting 0.02025197 0.06491528 0.60864460 0.56546081
## fu_wellwritten 0.10143782 -0.05267171 0.72984941 0.54741956
## fu_thoughtful 0.06787330 -0.01434940 0.61829975 0.56297755
## fu_interesting fu_wellwritten fu_thoughtful
## pp_fem 0.02025197 0.10143782 0.0678733
## pp_masc 0.06491528 -0.05267171 -0.0143494
## fu_readable 0.60864460 0.72984941 0.6182997
## fu_informative 0.56546081 0.54741956 0.5629775
## fu_interesting 1.00000000 0.59113577 0.6458886
## fu_wellwritten 0.59113577 1.00000000 0.6165451
## fu_thoughtful 0.64588864 0.61654507 1.0000000
Now, let’s look at how the following variables correlate with pp_fem and pp_masc in STUDY 4:
fu_readable, fu_informative, fu_honest, fu_recommend.
Drop NAs here (and make sure any non-numerics are transformed to numeric).
corr.vars.s4 = clean %>% select(pp_fem, pp_masc, fu_readable, fu_informative, fu_honest, fu_recommend) %>% drop_na() %>% mutate(across(everything(), as.numeric)) # across() replaces the deprecated funs()
corr.vars.s4 %>% cor()
## pp_fem pp_masc fu_readable fu_informative fu_honest
## pp_fem 1.000000000 -0.76370990 0.01994533 -0.08170893 -0.003998515
## pp_masc -0.763709904 1.00000000 0.03042508 0.06851482 0.017831507
## fu_readable 0.019945334 0.03042508 1.00000000 0.43886273 0.458595362
## fu_informative -0.081708930 0.06851482 0.43886273 1.00000000 0.421368272
## fu_honest -0.003998515 0.01783151 0.45859536 0.42136827 1.000000000
## fu_recommend 0.030707347 -0.01708787 0.21013809 0.24831969 0.115439841
## fu_recommend
## pp_fem 0.03070735
## pp_masc -0.01708787
## fu_readable 0.21013809
## fu_informative 0.24831969
## fu_honest 0.11543984
## fu_recommend 1.00000000
It’s super important to look at the distribution of the data, not just the summary statistics: datasets with identical summary statistics can have wildly different distributions.
Figure taken from Matejka et al. 2017
Use the column prompt to represent topic x style. Calculate the mean for each group, and add a horizontal line at that value.
Reorder prompt so that it is grouped by Topic.
pr.lev = c("Feminine - Life Course","Masculine - Life Course","Feminine - Relational", "Masculine - Relational")
clean = clean %>% mutate(prompt = factor(prompt, levels = pr.lev))
clean %>% group_by(prompt) %>% summarise(mean.ses = mean(relativeSes, na.rm=T))
## # A tibble: 4 x 2
## prompt mean.ses
## <fct> <dbl>
## 1 Feminine - Life Course 0.704
## 2 Masculine - Life Course 0.870
## 3 Feminine - Relational 0.388
## 4 Masculine - Relational 0.158
mean.ses = data.frame(prompt = c("Feminine - Life Course", "Feminine - Relational", "Masculine - Life Course", "Masculine - Relational"), relativeSes = c(0.70, 0.39, 0.87, 0.16))
clean %>% ggplot(aes(prompt, relativeSes)) + geom_jitter(aes(color=prompt)) +
geom_segment(aes(y = relativeSes, yend=relativeSes, x = as.numeric(prompt) - 0.5, xend=as.numeric(prompt) + 0.5), mean.ses) + ylab("Relative SES") + xlab("") + scale_color_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) + theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
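As an aside, the `mean.ses` data frame above hardcodes rounded means. The same table can be computed directly, so it stays in sync if the upstream cleaning changes. A self-contained sketch on toy data (with the real data, you would replace `toy` with `clean` and group by `prompt`):

```r
library(dplyr)

toy <- tibble(prompt = rep(c("A", "B"), each = 3),
              relativeSes = c(1, 2, NA, 0, -1, 4))

# Group means, ignoring NAs, ready to feed into geom_segment()
mean_ses <- toy %>%
  group_by(prompt) %>%
  summarise(relativeSes = mean(relativeSes, na.rm = TRUE))

mean_ses
## A -> 1.5, B -> 1
```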
Our correlation tables suggested that fu_wellwritten correlated slightly with pp_fem. Let’s make a scatterplot of this relationship and add a smooth regression line to it. Then, compare to masculine perceptions.
clean %>% ggplot(aes(fu_wellwritten, pp_fem)) + geom_jitter(color = "#E7B800") +
geom_smooth(method="lm", color="black") + ylab("Feminine Perceptions") + xlab("Well Written") + theme_bw() + theme(legend.position="none", axis.text.x = element_text(size=12), axis.title.y=element_text(size=12))
clean %>% ggplot(aes(fu_wellwritten, pp_masc)) + geom_jitter(color = "#E7B800") +
geom_smooth(method="lm", color="black") + ylab("Masculine Perceptions") + xlab("Well Written") + theme_bw() + theme(legend.position="none", axis.text.x = element_text(size=12), axis.title.y=element_text(size=12))
Recall that we created a pivoted dataframe with tidyr called ‘clean.pivot’. Use this dataframe to combine the previous two figures into a single figure. Decide which visual tool (color, shape, size, etc.) gives you the best contrast between the groups.
clean.pivot %>% ggplot(aes(fu_wellwritten, pp_gender_score)) + geom_jitter(aes(color = pp_gender), alpha=0.6) +
geom_smooth(method="lm", fill = "gray60", color = "black", aes(linetype=pp_gender)) + ylab("Gender Perceptions") + xlab("Well Written") + scale_color_manual(values = c("#C93312", "#F2AD00"), name= "Perceived Gender", breaks=c("pp_fem", "pp_masc"), labels = c("Feminine", "Masculine")) + scale_linetype_discrete(name= "Perceived Gender", breaks=c("pp_fem", "pp_masc"), labels = c("Feminine", "Masculine")) + theme_bw() + theme(legend.position="bottom", axis.text.x = element_text(size=12), axis.title.y=element_text(size=12))
clean %>% ggplot(aes(prompt, relativeSes)) + geom_boxplot(aes(color=prompt)) +
ylab("Relative SES") + xlab("") + scale_color_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) + theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Now that we have seen the distribution of the data, let’s visualize the summary statistics (using convenience package).
Visualize the relationship between relativeSes and Style.
conv.q1 = clean %>% sem(dv=relativeSes, id = ParticipantID, Style)
conv.q1 %>% ggplot(aes(Style, mean_relativeSes)) +
geom_bar(aes(fill=Style), color="black", stat="identity", position = "dodge") +
geom_errorbar(aes(ymin=lower, ymax=upper), width = 0.5, position="dodge") +
scale_fill_manual(values = c("#F5CDB4", "#F8AFA8")) +
ylab("Mean Relative SES") + xlab("") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Now, add in the topic. You can use the prompt column to get both style and topic.
conv.q2 = clean %>% sem(dv=relativeSes, id = ParticipantID, prompt)
conv.q2 %>% ggplot(aes(prompt, mean_relativeSes)) +
geom_bar(aes(fill=prompt), color="black", stat="identity", position = "dodge") +
geom_errorbar(aes(ymin=lower, ymax=upper), width = 0.5, position="dodge") +
scale_fill_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) +
ylab("Mean Relative SES") + xlab("") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Visualize the relationship between relativeSes and prompt with a pointrange figure.
conv.q2 %>% ggplot(aes(prompt, mean_relativeSes)) +
geom_pointrange(aes(color=prompt, ymin=lower, ymax=upper)) +
scale_color_manual(values = c("#F1BB7B", "#FD6467", "#5B1A18", "#D67236")) +
ylab("Mean Relative SES") + xlab("") +
theme_bw() + theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, size=12), axis.title.y=element_text(size=12))
Last but not least, it’s valuable to visualize the distribution of your continuous variables (this helps check the normality assumptions of many analyses).
Create a histogram of our DV. Is it normally distributed?
clean %>% ggplot(aes(relativeSes)) + geom_histogram(binwidth = 1, color = "black", fill = "#E7B800", alpha=0.5) + ylab("Count") + xlab("Relative SES") + theme_bw()
Perhaps we want to export one of our figures to a png file. For example, export the histogram to a file called ‘ses.hist.png’
histogram = clean %>% ggplot(aes(relativeSes)) + geom_histogram(binwidth = 1, color = "black", fill = "#E7B800", alpha=0.5) + ylab("Count") + xlab("Relative SES") + theme_bw()
ggsave("ses.hist.png", plot = histogram) # ggsave() takes the filename first, then the plot
Based on our summary statistics and visualizations, we have certain expectations for what the answers to our three research questions could be.
We have seen evidence that perceptions of socioeconomic status of an author don’t seem to vary across feminine vs. masculine writing styles in general.
However, texts that are relational in nature pattern with far lower perceived SES when the writing style is masculine. In other words, authors who write in a stereotypically masculine manner tend to be perceived as lower in SES when the topic is relational, as opposed to life course.
Moreover, our correlations showed that few of the individual text perception items patterned with perceptions of femininity or masculinity. One potential exception was how well written the text appeared: this correlated weakly and positively with feminine perceptions of the author, and essentially not at all with masculine perceptions.
Now, we will conduct some statistical tests so we can say with greater confidence whether these observed patterns are likely generalizable to a wider population.
We can answer Q1 with a t-test: is the perceived SES of a feminine vs. masculine style text significantly different from each other?
Is this paired or unpaired?
Is this one- or two-sided?
#Unpaired (default)
#Two-sided (default)
t.test(clean$relativeSes~clean$Style)
##
## Welch Two Sample t-test
##
## data: clean$relativeSes by clean$Style
## t = -0.040258, df = 1204, p-value = 0.9679
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1477557 0.1418140
## sample estimates:
## mean in group Feminine mean in group Masculine
## 0.5656566 0.5686275
How would we report these results in a paper?
clean %>% group_by(Style) %>% summarise(mean=mean(relativeSes, na.rm=T), sd=sd(relativeSes, na.rm=T))
## # A tibble: 2 x 3
## Style mean sd
## <fct> <dbl> <dbl>
## 1 Feminine 0.566 1.26
## 2 Masculine 0.569 1.30
There was not a significant difference in the relative socioeconomic status of feminine (M = 0.566, SD = 1.264) vs. masculine (M = 0.569, SD = 1.298) texts; t(1204) = -0.04, p = .968.
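When reporting a t-test, many journals also ask for a standardized effect size such as Cohen’s d. A minimal sketch, computed by hand from the group summaries (assuming the clean data frame with relativeSes and Style, as used throughout this demo):

```r
# Cohen's d by hand: mean difference divided by the pooled standard deviation.
# Assumes `clean` contains relativeSes (numeric) and Style (two-level factor).
stats <- clean %>%
  group_by(Style) %>%
  summarise(m = mean(relativeSes, na.rm = TRUE),
            s = sd(relativeSes, na.rm = TRUE),
            n = sum(!is.na(relativeSes)))

pooled_sd <- sqrt(((stats$n[1] - 1) * stats$s[1]^2 +
                   (stats$n[2] - 1) * stats$s[2]^2) /
                  (stats$n[1] + stats$n[2] - 2))

d <- (stats$m[1] - stats$m[2]) / pooled_sd
d
```

Given the near-identical group means here, d should come out essentially zero, consistent with the non-significant t-test.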
We can answer Q2 with an ANOVA, because we have more than 2 groups. Use ezANOVA from the ez package.
Is this a one- or two-way ANOVA?
Is this a within or between design?
Is this a repeated measures design?
Should we use Type I, II, or III?
#Two way (comparing two variables)
#Between (each person views one condition)
#Not repeated measures because only one row per subject (and one DV value)
#II (default) because our levels are not ordered, and we want to compare main effects to each other.
ezANOVA(clean
, dv = .(relativeSes)
, wid = .(ParticipantID)
, between = .(Style, Topic)
, type = 2)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Style 1 1202 0.003896599 9.502366e-01 3.241752e-06
## 2 Topic 1 1202 49.990925442 2.613663e-12 * 3.992914e-02
## 3 Style:Topic 1 1202 7.386548302 6.665802e-03 * 6.107682e-03
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 3 1202 0.5807482 888.9159 0.2617643 0.8529648
If the interaction is statistically significant, calculate the simple main effects to determine which groups were or were not significantly different from each other.
clean.life = clean %>% filter(Topic=="Life Course")
ezANOVA(clean.life
, dv = .(relativeSes)
, wid = .(ParticipantID)
, between = .(Style)
, type = 2)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Style 1 685 3.317284 0.06899104 0.004819411
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 1 685 0.07821472 418.5681 0.1280009 0.7206241
clean.relate = clean %>% filter(Topic=="Relational")
ezANOVA(clean.relate
, dv = .(relativeSes)
, wid = .(ParticipantID)
, between = .(Style)
, type = 2)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Style 1 517 3.903775 0.04870901 * 0.007494235
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 1 517 0.09529833 470.3479 0.1047506 0.7463324
How would we report these results in a paper?
A two-way ANOVA was run on a sample of 1206 participants to examine the effect of style and topic on perceived relative socioeconomic status of the author. There was a significant interaction between the effects of style and topic on relative socioeconomic status, F(1, 1202) = 7.387, p = 0.007. Simple main effects analysis showed that masculine styles were perceived as lower in socioeconomic status than feminine styles when the topic was relational (p = .049), but there was no difference between styles when the topic was life course (p = .069).
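Because we ran two follow-up tests (one per topic), we may want to correct the simple main effect p-values for multiple comparisons. A quick sketch using base R’s p.adjust(), with the p-values copied from the output above:

```r
# Bonferroni-adjust the two simple main effect p-values reported above.
p_raw <- c(life_course = 0.069, relational = 0.049)
p.adjust(p_raw, method = "bonferroni")
```

After Bonferroni correction (each p multiplied by 2, capped at 1), neither simple effect falls below .05, so the relational-topic effect should be interpreted with some caution.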
We can answer Q3 with a simple linear regression, using base R’s lm() function. (The package ‘lmerTest’, related to lme4, extends this approach to mixed-effects models, but we don’t need it here.)
reg1 = lm(pp_fem ~ fu_wellwritten, data = clean)
summary(reg1)
##
## Call:
## lm(formula = pp_fem ~ fu_wellwritten, data = clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2417 -1.2417 0.0025 1.7583 3.4910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3869 0.2874 11.784 <2e-16 ***
## fu_wellwritten 0.1221 0.0514 2.376 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.8 on 723 degrees of freedom
## (481 observations deleted due to missingness)
## Multiple R-squared: 0.007746, Adjusted R-squared: 0.006373
## F-statistic: 5.644 on 1 and 723 DF, p-value: 0.01778
How would we report these results in a paper?
A simple linear regression was calculated to predict feminine author perceptions from how well-written the text was perceived to be. A significant regression equation was found (F(1, 723) = 5.644, p = .018), with an R^2 of 0.008. The predicted feminine perception is equal to 3.387 + 0.122 × (well-written rating). In other words, feminine perceptions of the author increased by 0.122 for each unit increase in how well-written the text was perceived to be.
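To see this relationship, we can overlay the fitted regression line on a scatterplot with geom_smooth(method = "lm"). A sketch reusing the same variables as the model above (the jitter and color choices are illustrative):

```r
# Scatterplot of well-written ratings against feminine author perceptions,
# with the fitted least-squares line and its 95% confidence band.
clean %>%
  ggplot(aes(x = fu_wellwritten, y = pp_fem)) +
  geom_jitter(alpha = 0.3, width = 0.2, height = 0.2) +
  geom_smooth(method = "lm", color = "#FD6467") +
  xlab("How well-written") + ylab("Feminine perception") +
  theme_bw()
```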
Bravo!
A work by Mehrgol Tiv
mehrgoltiv@gmail.com