1 Introducing the dataset

 

In this data demo, we will be using an open dataset from the Language Use & Social Interaction Lab at Texas Tech University. The data from the project, called “Gendered Language Styles”, were made available on OSF: https://osf.io/963gp/

I do not own these data, nor was I involved in any way in the design of the study or the data collection. I am also not in any way affiliated with the authors of this project.

According to the authors’ lab website: “In this project, we manipulated the language of text prompts to be more feminine or more masculine in style and topic, as well as the labeled sex of the author. Participants read these prompts and rated the author.” (https://www.depts.ttu.edu/psy/lusi/research.php)

The authors conducted four studies, where they gathered different information from participants (summarized in key+measures.xlsx).

  • Demographics of participant (age, gender, political identity, first language, etc.)
  • Factor scores from various questionnaires
    • SDS = Self-rated depression scale ? (Source unclear from datafile)
    • BSRI = Bem Sex-Role Inventory
    • Sexism = ? (Source unclear from datafile)
    • IDWG = ? (Source unclear from datafile)
    • ESQ = Emotional State Questionnaire
  • Conditions (Style, topic, label, text)
  • Follow-up checks (author sex, author age, etc.)
  • Text perception items (readable, informative, compelling, honest, well-written, etc.)
  • Person perception items (feminine, masculine, educated, goodness, social class, relative SES, etc.)
  • Quasi-behavioral items (how interested would you be in hearing back from the author? Meeting the author? Etc.)
  • LIWC categories (automated linguistic analysis: https://liwc.wpengine.com/wp-content/uploads/2015/11/LIWC2015_LanguageManual.pdf)

1.1 Research questions

 

The authors posed two main research questions on their lab website:

  1. How is a person’s socioeconomic status perceived from text written in feminine vs. masculine writing styles?
  2. Does the above pattern vary for different topics of text (i.e., relational vs. life course)?

We can add a third question from a psycholinguistics perspective.

  3. How do sociolinguistic aspects of the text predict whether the writing style is perceived to be feminine vs. masculine?

We will aim to answer these three questions in this analysis.


1.1.1 Example texts

 

Life course feminine About two years now, we’ve really just been bearing down and keeping stable, because we had loans to pay off and unfortunate but necessary purchases to make, and a new baby arriving just then. Now though, I’ve been thinking back about my old fantasies of living in Sweden. I even mentioned this, talking with my spouse and kids, and as expected, everyone was immediately thrilled; they love adventures, and have always been really supportive. I’m torn because, while I am really excited about just going and doing something so new and unknown, I also feel I should be considering things like stability, roots, and prospects, which is what we have now, even though moving has really always been the plan. Basically, I’m the deciding factor here, and that’s pretty scary. If it were just me, I would be leaving right away, but because I have my family and the foundation that’s here already to think about, I almost want to slap myself for even entertaining these silly dreams. When I really think about it, both options seem like the clearly right choice. Doing something long term rewarding, and doing something pragmatic and reasonable sound like great options, but I can only do one, which is why this decision seems just about impossible.

Life course masculine The last couple of years were ‘bear down to stabilize’ years, with some loans to pay off, some unfortunately necessary large purchases, and a new baby on top of all that. With a lot of that behind us, I feel a bit of room to breathe. In this mind set, some old dreams of living in Sweden are coming up to the surface. Of course, everyone was thrilled at the idea. Part of me wants to go all in; this was always the plan from the start. Another part looks at what we have in place at the moment: stability, roots, prospects, etc.. I almost want to slap myself for continuing to hold on to these silly fantasies. Support from the spouse and kids is firmly in place; they love a good adventure, particularly into the unknown. I’m the deciding factor in all of this, which is scary. With nothing else to consider, I’d be on the road in a heartbeat, but with a family and the partial foundation we put down, the decision seems impossible. In the best case, it’s between something rewarding in the long term, and something pragmatically reasonable. In a sense, each side seems like the right choice, depending on which one I’m focusing on at the time.

Relational feminine A coworker and I have been working closely on a big project assignment. When we take breaks, they talk about their children, one being autistic. I think generally they sound really attentive and caring, basically like a good parent, but with this particular child, they apparently sometimes have to ‘startle’ them by hitting them, because, as they say, this is the only way they can even start calming them. I would usually keep to myself here, but this seems abusive, or just about. I did actually bring my concerns to them once, but they just said someone who didn’t have an autistic child simply wouldn’t understand. I could imagine that reporting them to CPS might just make things worse, particularly if their child is actually where they should be now. Our working relationship would also be destroyed. Then again, I do feel responsible to do something, if only gather additional information, and I don’t think the actual project would suffer, even if we couldn’t work together. I might even get input from someone actually familiar with autistic children, but this might all be me just getting overly nosy. Basically, however I think about it, everything seems really unclear, and I’m not sure what would really be right.

Relational masculine I started an assignment on a big project, working closely with a coworker. They talk a lot about their children during breaks, one of which has some form of autism. For the most part, they seem like a good parent; attentive, caring. But with this particular child, they say that, on occasion, the only way to get them to calm down is to ‘startle’ them by hitting them. I normally stay out of this kind of thing, but this seems at least on the verge of abuse. At one point, I brought up these concerns to them. They said someone without an autistic child wouldn’t understand, leaving it at that. On the one hand, reporting them to CPS could make the situation worse, particularly in the case that the child is in the best place for them. The working relationship would also be destroyed. On the other, I feel somewhat responsible to at least look more into this, and the project itself wouldn’t suffer without us specifically working on it. I want to talk to someone more familiar with dealing with autistic children for their input. All of this might be overly nosy on my part. From any direction, the most right course of action seems unclear.


1.2 Our goal

 

…build a data analysis pipeline to answer the three research questions using Tidyverse in R

  1. Understand the structure of the data
  2. Clean the data
  3. Summarize the data
  4. Visualize patterns in the data
  5. Apply inferential statistics to answer the research questions

Figure taken from “R for Data Science”

 

2 Load Packages

library(tidyverse)
library(convenience)
library(ez)
library(lmerTest)

3 Data structure

 

Read in the data (“data.csv”)

Your code here.

 

How many observations (rows)?

Your code here.

 

How many variables (columns)?

Your code here.

 

Take a look at how many NAs are in each column of df.

Your code here.
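
One way the steps so far could look, sketched on a tiny invented stand-in for “data.csv” so that it runs on its own. The toy columns and values are assumptions; with the real file you would pass its path to read_csv() instead.

```r
library(readr)
library(dplyr)

# Toy stand-in for "data.csv"; the values are invented for illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,demo_age,pp_fem",
    "1,23,4",
    "2,NA,5",
    "3,31,NA",
    "4,45,NA"),
  toy_path
)

# Read the file into a tibble
df <- read_csv(toy_path, show_col_types = FALSE)

n_obs  <- nrow(df)  # observations (rows)
n_vars <- ncol(df)  # variables (columns)

# One row holding the NA count for every column
na_counts <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

names(df) or glimpse(df) would then list the column names for the next step.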

 

Familiarize yourself with the column names in the df

Your code here.

 

Make a subset of the dataframe that includes the first 3 columns, the demographic columns, factor score columns, condition columns, follow-up check columns, text perception columns, and person perception columns. Make sure to also grab relativeSes.

Call this ‘df.sub’

Your code here.
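
A minimal sketch of the subsetting idea with dplyr’s select() and its helper starts_with(). The toy dataframe is invented: the names of the first three columns and the WC column are hypothetical, and the sketch assumes that related columns share a prefix (demo_, fu_, pp_). The factor score and condition columns would need their own selectors, which depend on the real column names.

```r
library(dplyr)

# Toy dataframe; column names other than the demo_/fu_/pp_ prefixes and
# relativeSes are hypothetical.
df <- tibble(
  id          = 1:2,
  study       = c(3, 4),
  reply       = c(1, 1),
  demo_age    = c(20, 30),
  demo_sex    = c("Female", "Male"),
  fu_readable = c(5, 6),
  pp_fem      = c(4, 3),
  pp_masc     = c(2, 5),
  relativeSes = c(5, 6),
  WC          = c(200, 210)   # stand-in for a column we do not keep
)

df.sub <- df %>%
  select(
    1:3,                    # first 3 columns, by position
    starts_with("demo_"),   # demographics
    starts_with("fu_"),     # follow-up / text perception items
    starts_with("pp_"),     # person perception items
    relativeSes             # named explicitly, since it has no prefix
  )

head(df.sub, 10)
```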

 

View first 10 rows of subset

Your code here.

4 Clean

 

There is always some cleaning that is necessary before beginning data analysis.

4.1 Standardize

 

When participants are given the opportunity to type their responses, a myriad of issues can occur: spelling mistakes, inconsistent capitalization, abbreviations, etc.

This is the case for demo_first_lang. View all unique values in this column.

Your code here.

 

Clean up this column by trimming whitespace, lowercasing, and fixing typos.

Call this ‘clean’

Your code here.
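
A sketch of the three cleaning steps using stringr, run on invented responses. The specific typo (“englsih”) is a made-up example; the real column will have its own set of misspellings to map.

```r
library(dplyr)
library(stringr)

# Invented free-text responses with stray whitespace, mixed case, and a typo
df.sub <- tibble(
  demo_first_lang = c(" English", "english ", "ENGLISH", "Englsih", "Spanish")
)

clean <- df.sub %>%
  mutate(
    # 1) trim whitespace, 2) lowercase
    demo_first_lang = str_to_lower(str_trim(demo_first_lang)),
    # 3) map known typos onto the intended value, leave the rest alone
    demo_first_lang = case_when(
      demo_first_lang == "englsih" ~ "english",
      TRUE                         ~ demo_first_lang
    )
  )

# Check the result
unique(clean$demo_first_lang)
```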

 

Check your work.

Your code here.

Take a look at the unique values of demo_sex.

Your code here.

It seems like this question actually reflects gender identity, given the responses. Also, it seems like this is a multiple selection question (selections separated by semicolon).

It’s super important to be inclusive in the way gender is assessed in questionnaires because you want to capture the reality of your participants (see resources). Nevertheless, any multiple selection question poses challenges for data science. This is because it allows a large number of possible combinations of selections, resulting in some categories with only one or two respondents and others with potentially hundreds.

I don’t think anyone has figured out the best way to balance these demands (if you know, please let me know!). So what I think we will do here is group responses that are not “Female” or “Male” into “Other”.

You may be wondering why we don’t use demo_gender. Well, this was seemingly only assessed in Study 4. So we are going to replace this with our own column (again because I think the question was about gender identity).

Your code here.
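
The grouping described above could be sketched like this. The response labels in the toy data are invented, and note one assumption baked into the sketch: a missing demo_sex would also land in “Other”, which you may or may not want.

```r
library(dplyr)

# Invented responses; multiple selections are separated by a semicolon
clean <- tibble(
  demo_sex = c("Female", "Male", "Female;Non-binary", "Non-binary", "Male")
)

clean <- clean %>%
  mutate(
    # Keep "Female" and "Male" as they are, group everything else as "Other"
    demo_gender = if_else(demo_sex %in% c("Female", "Male"), demo_sex, "Other")
  )

# Check the result
gender_counts <- clean %>% count(demo_gender)
gender_counts
```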

 

Check your work.

Your code here.

4.2 Filter

 

Let’s say we want to limit our analysis to L1 English participants.

Your code here.

 

Check your work.

Your code here.

In studies 3 and 4, each participant gave two responses (coded in reply). I think they received some sort of feedback between responses, but it’s unclear. So let’s just keep the first response of everyone.

Your code here.
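
Both filters in this section can be sketched with dplyr’s filter(). The toy data are invented, and the sketch assumes that reply is coded 1 and 2 and is NA in the studies that collected only one response; check the real coding before relying on this.

```r
library(dplyr)

# Invented rows; `id` and `study` are hypothetical column names
clean <- tibble(
  id              = c(1, 1, 2, 3, 3, 4),
  study           = c(3, 3, 1, 4, 4, 2),
  reply           = c(1, 2, NA, 1, 2, NA),
  demo_first_lang = c("english", "english", "spanish",
                      "english", "english", "english")
)

clean <- clean %>%
  filter(demo_first_lang == "english") %>%   # L1 English participants only
  filter(is.na(reply) | reply == 1)          # first (or only) response

clean
```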

 

Check your work.

Your code here.

Visualize the distribution of demo_age with a dotplot.

Your code here.

 

It looks like there is someone with an age of over 300, which is…impossible! Filter out this value, and then check your work.

Your code here.

Check to make sure there is only one row per participant.

Your code here.
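
A sketch of the age filter and the one-row-per-participant check, on invented data. The cut-off of 120 and the decision to keep participants with a missing age are assumptions, and `id` is a hypothetical name for the participant identifier.

```r
library(dplyr)

# Invented ages, including one impossible value and one missing value
clean <- tibble(
  id       = 1:4,
  demo_age = c(23, 31, 345, NA)
)

# Drop impossible ages but keep rows where age is simply missing
clean <- clean %>%
  filter(is.na(demo_age) | demo_age < 120)

# Any participant id that appears more than once ends up in `dupes`
dupes <- clean %>%
  count(id) %>%
  filter(n > 1)

nrow(dupes)  # 0 rows means one row per participant
```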

4.3 Export

 

Perhaps we want to export our clean dataframe to a csv so that in the future we don’t need to run through all the cleaning steps again. What would this look like?

Your code here.
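
Exporting could look like the sketch below, which uses readr’s write_csv(). The file name “clean.csv” is hypothetical, and the sketch writes to a temporary folder so that it does not touch your project directory.

```r
library(readr)
library(dplyr)

# Invented stand-in for the cleaned dataframe
clean <- tibble(id = 1:3, demo_age = c(23, 31, 45))

# Hypothetical file name, written to a temporary folder
out_path <- file.path(tempdir(), "clean.csv")
write_csv(clean, out_path)

# Reading it back confirms the round trip
clean_again <- read_csv(out_path, show_col_types = FALSE)
```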

4.4 Tidy

 

Our third research question is about whether perceptions of the text relate to perceptions of the author. In particular, we are interested in whether the author is perceived to be feminine or masculine.

In the dataset, these values are represented in separate columns called pp_fem and pp_masc. This makes it hard to summarize or visualize both simultaneously (e.g., in the same figure).

We can apply some Tidyr functions to alter the structure of the data so that we can easily visualize pp_fem and pp_masc on the same figure.

We want to pivot the data so that there is a new column called ‘pp_gender’ with rows for pp_fem and pp_masc.

Make this as a new dataframe, rather than overwriting clean. Call it ‘clean.pivot’

Your code here.
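
The pivot can be sketched with tidyr’s pivot_longer(). The toy data are invented, and the name of the value column (“rating”) is an assumption, since only ‘pp_gender’ is specified above.

```r
library(dplyr)
library(tidyr)

# Invented stand-in for the cleaned dataframe
clean <- tibble(
  id      = 1:3,
  pp_fem  = c(4, 5, 3),
  pp_masc = c(2, 3, 5)
)

clean.pivot <- clean %>%
  pivot_longer(
    cols      = c(pp_fem, pp_masc),
    names_to  = "pp_gender",   # holds "pp_fem" or "pp_masc"
    values_to = "rating"       # hypothetical name for the rating itself
  )

# Each participant now has two rows, so the row count should double
nrow(clean.pivot) == 2 * nrow(clean)
```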

 

Check that this worked by comparing the number of rows in ‘clean’ and ‘clean.pivot’.

Your code here.

5 Summarize

5.1 Sample size

 

How many participants do we have?

Your code here.

5.2 Demographics

5.2.1 Gender

 

Calculate the number of participants per gender group.

Your code here.

5.2.2 Age

 

Calculate the average age and the standard deviation.

Your code here.
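
The gender and age summaries could be sketched as follows, on invented data. demo_gender here is the recoded column created during cleaning.

```r
library(dplyr)

# Invented stand-in for the cleaned dataframe
clean <- tibble(
  demo_gender = c("Female", "Male", "Female", "Other"),
  demo_age    = c(20, 30, 40, NA)
)

# Number of participants per gender group
gender_n <- clean %>% count(demo_gender)

# Mean and standard deviation of age, ignoring missing ages
age_summary <- clean %>%
  summarise(
    mean_age = mean(demo_age, na.rm = TRUE),
    sd_age   = sd(demo_age, na.rm = TRUE)
  )

gender_n
age_summary
```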

5.2.3 Education

 

Summarize the education of the sample. Arrange the summary table by education level.

Your code here.

5.2.4 Political affiliation

 

Summarize the political affiliation of the sample. Arrange from very conservative to very liberal.

Your code here.

5.3 Summary statistics

 

Let’s calculate some summary statistics related to our first two research questions. Use the convenience package to do this (https://github.com/jasongullifer/convenience).

Your code here.
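
The sketch below does not use the convenience package, whose functions are not shown in this document. Instead it computes comparable group summaries with plain dplyr as a stand-in. The toy data and the condition labels are invented; Topic is assumed to be a separate column alongside Style.

```r
library(dplyr)

# Invented ratings for a 2 (Style) x 2 (Topic) design
clean <- tibble(
  Style       = rep(c("feminine", "masculine"), each = 4),
  Topic       = rep(c("life course", "relational"), times = 4),
  relativeSes = c(5, 6, 7, 6, 6, 3, 6, 5)
)

ses_summary <- clean %>%
  group_by(Style, Topic) %>%
  summarise(
    n    = n(),
    mean = mean(relativeSes, na.rm = TRUE),
    sd   = sd(relativeSes, na.rm = TRUE),
    se   = sd / sqrt(n),   # standard error of the mean
    .groups = "drop"
  )

ses_summary
```

Dropping Topic from group_by() gives the Style-only summary for the first research question.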

5.4 Correlation

 

We can create a correlation table of our text perception items to see how they pattern with each other, and how they relate to feminine vs. masculine perceptions of the author.

Text perception was not measured in all four studies. So let’s start by looking at how the following variables correlate with pp_fem and pp_masc in STUDY 3:

fu_readable, fu_informative, fu_interesting, fu_wellwritten, fu_thoughtful.

Drop NAs here (and make sure any non-numerics are transformed to numeric).

Your code here.
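
A correlation table for one study could be built as in the sketch below. The data are invented, `study` is a hypothetical name for the column that identifies the study, and only two of the text perception items are included to keep the example short. The same pattern applies to the Study 4 variables.

```r
library(dplyr)
library(tidyr)

# Invented data; fu_readable is stored as text to mimic a non-numeric column
clean <- tibble(
  study          = c(3, 3, 3, 3, 3, 4),
  fu_readable    = c("1", "2", "3", "4", NA, "5"),
  fu_wellwritten = c(2, 4, 6, 8, 5, 1),
  pp_fem         = c(3, 5, 7, 9, 4, 2),
  pp_masc        = c(9, 7, 5, 3, 6, 8)
)

cor_s3 <- clean %>%
  filter(study == 3) %>%
  select(fu_readable, fu_wellwritten, pp_fem, pp_masc) %>%
  mutate(across(everything(), as.numeric)) %>%  # force every column to numeric
  drop_na() %>%                                 # complete cases only
  cor()                                         # Pearson correlation matrix

round(cor_s3, 2)
```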

 

Now, let’s look at how the following variables correlate with pp_fem and pp_masc in STUDY 4:

fu_readable, fu_informative, fu_honest, fu_recommend.

Drop NAs here (and make sure any non-numerics are transformed to numeric).

Your code here.

6 Visualize

 

It’s super important to look at the distribution of the data. Take a look at this.

Figure taken from Matejka et al. 2017


6.1 Scatterplot

 

Use the column prompt to represent topic x style. Calculate the mean for each group, and add a horizontal line at that value.

Reorder prompt so that it is grouped by Topic.

Your code here.

Our correlation tables suggested that fu_wellwritten correlated slightly with pp_fem. Let’s make a scatterplot of this relationship and add a smooth regression line to it. Then, compare to masculine perceptions.

Your code here.
Your code here.

Recall that we created a pivoted dataframe with tidyr called ‘clean.pivot’. Use this dataframe to converge the previous two figures onto a single figure. Decide what visual tool (color, shape, size, etc.) will give you the best contrast between the groups.

Your code here.
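
One way to put both perceptions on a single figure, sketched with ggplot2 on invented data. Color is used here to separate the two groups, which is one choice among several, and “rating” is a hypothetical name for the value column of the pivoted dataframe.

```r
library(ggplot2)

# Invented stand-in for the pivoted dataframe
clean.pivot <- data.frame(
  fu_wellwritten = rep(1:6, times = 2),
  pp_gender      = rep(c("pp_fem", "pp_masc"), each = 6),
  rating         = c(2, 3, 3, 4, 5, 5, 4, 4, 3, 4, 4, 3)
)

p <- ggplot(clean.pivot,
            aes(x = fu_wellwritten, y = rating, color = pp_gender)) +
  geom_point() +
  # One straight regression line per group
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p
```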

6.2 Boxplot

Your code here.

Now that we have seen the distribution of the data, let’s visualize the summary statistics (using convenience package).

6.3 Bar graph

 

Visualize the relationship between relativeSes and Style.

Your code here.

Now, add in the topic. You can use the prompt column to get both style and topic.

Your code here.

6.4 Pointrange

 

Visualize the relationship between relativeSes and prompt with a pointrange figure.

Your code here.
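
A pointrange figure can be sketched with ggplot2’s stat_summary(), which computes the mean and standard error per group on the fly. The data and the prompt labels are invented; the workshop’s own solution may build the figure from a summary table instead.

```r
library(ggplot2)

# Invented ratings; the prompt labels are hypothetical
d <- data.frame(
  prompt = rep(c("life course feminine", "life course masculine",
                 "relational feminine", "relational masculine"), each = 3),
  relativeSes = c(5, 6, 7, 6, 6, 5, 6, 5, 7, 3, 4, 5)
)

p <- ggplot(d, aes(x = prompt, y = relativeSes)) +
  # Point = group mean, range = mean plus/minus one standard error
  stat_summary(fun.data = mean_se, geom = "pointrange")

p
```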

6.5 Histogram

 

Last, but not least, it’s valuable to visualize the distribution of your continuous variables (contributes to checking normality assumptions of different analyses).

Create a histogram of our DV. Is it normally distributed?

Your code here.

6.6 Export

 

Perhaps we want to export one of our figures to a png file. For example, export the histogram to a file called ‘ses.hist.png’

Your code here.
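
A sketch of building the histogram and exporting it with ggsave(). The data are invented, the bin width and figure dimensions are arbitrary choices, and the file is written to a temporary folder rather than the project directory.

```r
library(ggplot2)

# Invented ratings standing in for the DV
d <- data.frame(relativeSes = c(3, 4, 4, 5, 5, 5, 6, 6, 7, 8))

p_hist <- ggplot(d, aes(x = relativeSes)) +
  geom_histogram(binwidth = 1)

# Save the figure as a png; width and height are in inches
out_file <- file.path(tempdir(), "ses.hist.png")
ggsave(out_file, plot = p_hist, width = 6, height = 4, dpi = 300)
```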

7 Analyze

 

Based on our summary statistics and visualizations, we have certain expectations for what the answers to our three research questions could be.

We have seen evidence that perceptions of socioeconomic status of an author don’t seem to vary across feminine vs. masculine writing styles in general.

Your code here.

 

However, relational texts pattern with far lower perceived SES when the writing style is masculine. In other words, authors who write in a stereotypically masculine manner tend to be perceived as lower in SES if the topic is relational, as opposed to life course.

Your code here.

 

Moreover, our correlations showed that not many of the individual text perception items patterned with perceptions of femininity or masculinity. One potential exception seemed to be how well-written the text appeared. This seemed to positively correlate with a feminine writing style, and not correlate with a masculine writing style.

Your code here.

 

Now, we will conduct some statistical tests so we can say with greater confidence whether these observed patterns are likely generalizable to a wider population.


7.1 T-Test

 

We can answer Q1 with a t-test: does perceived SES differ significantly between feminine-style and masculine-style texts?

 

Is this paired or unpaired?

Is this one- or two-sided?

Your code here.
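
A sketch of the test with base R’s t.test(), on invented ratings. It assumes that each participant read only one text (so the groups are unpaired) and that we have no directional prediction (so the test is two-sided); verify both against the real design.

```r
# Invented ratings for the two style groups
fem  <- c(4, 5, 6, 5, 4, 6)
masc <- c(5, 6, 7, 6, 5, 7)

# Unpaired, two-sided Welch t-test (R's default does not assume equal variances)
res <- t.test(x = fem, y = masc, paired = FALSE, alternative = "two.sided")

res
```

The printed output contains the t statistic, the degrees of freedom, the p-value, and the group means, which are the pieces a written report needs.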

 

How would we report these results in a paper?

Your code here.
Your code here.

7.2 ANOVA

 

We can answer Q2 with an ANOVA, because we have more than 2 groups. Use ezANOVA from the ez package.

 

Is this a one- or two-way ANOVA?

Is this a within or between design?

Is this a repeated measures design?

Should we use Type I, II, or III?

Your code here.
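
A sketch of a two-way ANOVA with ezANOVA(), run on simulated ratings. It assumes a fully between-participants design (each participant contributes one rating), Type III sums of squares, and hypothetical column names `id` and `Topic`; treat those choices as starting points to check against the questions above.

```r
library(ez)

set.seed(1)

# Simulated 2 (Style) x 2 (Topic) between-participants data, 5 per cell
d <- expand.grid(
  rep   = 1:5,
  Style = c("feminine", "masculine"),
  Topic = c("life course", "relational")
)
d$id          <- factor(seq_len(nrow(d)))              # one row per participant
d$relativeSes <- round(rnorm(nrow(d), mean = 5, sd = 1))

res <- ezANOVA(
  data    = d,
  dv      = relativeSes,
  wid     = id,
  between = .(Style, Topic),
  type    = 3
)

res$ANOVA  # rows for Style, Topic, and the Style:Topic interaction
```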

 

If the interaction is statistically significant, follow up with the simple main effects to find out which groups do and do not differ significantly from each other.

Your code here.

 

How would we report these results in a paper?

Your code here.

7.3 Linear Regression

 

We can answer Q3 with a simple linear regression from the package ‘lmerTest’ (related to lme4).

Your code here.
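
A minimal sketch of a simple linear regression on invented data. It uses base R’s lm() rather than a function from lmerTest, which is an assumption about what fits this exercise, and it picks fu_wellwritten as the single predictor because that item stood out in the correlations.

```r
# Invented ratings
d <- data.frame(
  fu_wellwritten = 1:6,
  pp_fem         = c(2, 3, 3, 4, 5, 5)
)

# Predict perceived femininity from how well-written the text appeared
fit <- lm(pp_fem ~ fu_wellwritten, data = d)

summary(fit)  # coefficients, standard errors, t values, p-values, R-squared
```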

 

How would we report these results in a paper?

Your code here.

Bravo!

 

A work by Mehrgol Tiv

mehrgoltiv@gmail.com