Use R to specify factors, recode variables and begin by-group analyses.
Files
The file lap_hernia.csv contains data on pain score after laparoscopic vs. open hernia repair. Age, gender and primary/recurrent hernia are also included. The ultimate aim here is to work out which of these factors are associated with more pain after this operation.
Script
##########################
# Organise your data #
# Ewen Harrison #
# April 2013 #
# www.datasurg.net #
##########################
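# stringsAsFactors=TRUE reads text columns (such as gender) as factors.
# This was the default before R 4.0.0 and the recoding below relies on it.
data<-read.table("lap_hernia.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)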
# This is how to check your data, recode variables and
# begin to analyse group differences
str(data)
# First look and make sure that all your grouped (categorical) data
# are factors. Not all of them are here.
# Check that the continuous data are integer or numeric.
# The data is in a dataframe we have called data.
# To access variables within that dataframe, use the "$" sign.
data$recurrent
summary(data$recurrent)
# Recurrent is a variable describing whether a hernia is
# being repaired for the first time or is recurrent.
# It is a factor, yes/no, and should be specified as such.
# Change a variable to a factor
data$recurrent<-factor(data$recurrent)
# Check
summary(data$recurrent)
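# A minimal sketch on a made-up vector (not the hernia data), assuming
# a yes/no variable stored as 0/1. The labels "no" and "yes" are
# illustrative, the coding in your own data may differ.
recurrent.toy<-c(0, 1, 0, 0, 1)
summary(recurrent.toy)   # treated as a number: min, mean, max etc.
recurrent.toy<-factor(recurrent.toy, levels=c(0, 1), labels=c("no", "yes"))
summary(recurrent.toy)   # treated as a factor: a count for each level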
# Do the same for others.
data$laparoscopic<-factor(data$laparoscopic)
summary(data$laparoscopic)
# Check full dataset again and note what has changed
str(data)
summary(data)
data$gender
# This variable has a number of different representations of the same thing.
# It needs to be recoded.
# Do this by picking out the matching values with [ ] and assigning with "<-"
data$gender[data$gender=="female"]<-"f"
data$gender[data$gender=="fem "]<-"f"
data$gender[data$gender=="m ale"]<-"m"
data$gender[data$gender=="male"]<-"m"
# This is important. R uses "NA" for missing data.
# All missing data should be specified NA.
# This often happens automatically, but hasn't happened in this case.
data$gender[data$gender==""]<-NA
summary(data$gender)
# Note that all counts are now under the correct levels -
# "m" and "f"
# Get rid of unused levels by re-defining as a factor:
data$gender<-factor(data$gender)
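# The same recoding steps as a minimal sketch on a made-up factor
# (gender.toy is hypothetical, not the hernia data). Assigning a value
# to a factor this way only works if it is already one of its levels,
# as "f" and "m" are here. Anything else becomes NA with a warning.
gender.toy<-factor(c("f", "female", "fem ", "m", "male", "m ale", ""))
gender.toy[gender.toy=="female"]<-"f"
gender.toy[gender.toy=="fem "]<-"f"
gender.toy[gender.toy=="male"]<-"m"
gender.toy[gender.toy=="m ale"]<-"m"
gender.toy[gender.toy==""]<-NA
summary(gender.toy)   # the old levels are still listed, with a count of 0
gender.toy<-factor(gender.toy)
summary(gender.toy)   # only "f", "m" and the count of NAs remain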
# This may all seem like a drag, but when you have had to import
# your data 7 times (as usually happens) because of errors
# that someone else made, just being able to ctrl-R this whole page
# to get back to where you were is amazing, rather than click-click
# which you have to do in SPSS etc.
#---------------------------------------------------------------
# Summarise data by subgroup
# There are lots of ways of doing this. Here are a couple.
# By
help(by)
# Use "by" followed by the dependent variable you want to summarise
# then what you want to summarise by
# then what you want the summary to be.
by(data$pain.score, data$gender, mean)
by(data$pain.score, data$gender, sd)
by(data$pain.score, data$gender, median)
#etc.
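# A minimal sketch on made-up numbers (toy is hypothetical, not the
# hernia data). If a group contains a missing value, mean, sd and
# median return NA for that group. Extra arguments given to "by" are
# passed on to the summary function, so na.rm=TRUE can go at the end.
toy<-data.frame(pain.score=c(2, 4, NA, 5, 7),
                gender=factor(c("f", "f", "f", "m", "m")))
by(toy$pain.score, toy$gender, mean)               # NA for "f"
by(toy$pain.score, toy$gender, mean, na.rm=TRUE)   # 3 for "f", 6 for "m"
# tapply works the same way and returns a simple named array
toy.means<-tapply(toy$pain.score, toy$gender, mean, na.rm=TRUE)
toy.means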
# Make a group comparison by graph, boxplots are great
# They show the distribution very well.
boxplot(data$pain.score~data$gender)
# Split
# This is often taught but I don't use it that much.
# This splits the dataframe into a list containing two dataframes,
# one for each group
data2<-split(data, data$gender)
str(data2)
summary(data2$f)
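# A minimal sketch on made-up numbers (toy.split is hypothetical, not
# the hernia data). split also works on a single variable, and sapply
# then runs a function on each piece of the resulting list.
toy.split<-data.frame(pain.score=c(2, 4, 5, 7),
                      gender=factor(c("f", "f", "m", "m")))
pieces<-split(toy.split$pain.score, toy.split$gender)
str(pieces)
group.means<-sapply(pieces, mean)
group.means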
# Plyr
# This seems intimidating and is.
# It will be very useful in the future, especially with large datasets
# Try this.
# install.packages("plyr") #remove "#" first time to install
library(plyr)
help(package=plyr)
# Plyr takes data as an array, list or dataframe and outputs any of these.
# The first two letters of the function name give the input and output.
# Here the "dd" means take a dataframe and give me one back.
ddply(data, .(gender), summarise, mean=mean(pain.score), sd=sd(pain.score))
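# If plyr is not installed, aggregate in base R gives a similar table.
# A minimal sketch on made-up numbers (toy.agg is hypothetical, not the
# hernia data), run once for each summary.
toy.agg<-data.frame(pain.score=c(2, 4, 5, 7),
                    gender=factor(c("f", "f", "m", "m")))
toy.mean<-aggregate(pain.score~gender, data=toy.agg, FUN=mean)
toy.sd<-aggregate(pain.score~gender, data=toy.agg, FUN=sd)
toy.mean
toy.sd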
# Please post questions or anything that is not clear.

