A map of the world by tweets

With geo-tagging enabled, tweets include information on the location of the user when the tweet was sent. Miguel Rios (@miguelrios) has plotted the locations of billions of tweets to create maps of the world. This is pretty amazing stuff – a world map rendered purely from Twitter posts!

The maps are created using every tweet from 2009, with R and the ggmap package.
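
Something in the same spirit can be sketched in a few lines of ggplot2, assuming a data frame called tweets with a longitude and latitude column for each geo-tagged tweet (illustrative only – not Miguel's code):

# install.packages("ggplot2") # remove "#" first time to install
library(ggplot2)
ggplot(tweets, aes(x=lon, y=lat))+
	geom_point(size=0.1, alpha=0.05, colour="white")+  # one point per tweet
	coord_equal()+
	theme(panel.background=element_rect(fill="black")) # the map emerges from point density alone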

 

Europe by tweets, by Miguel Rios @miguelrios

US by tweets, by Miguel Rios @miguelrios

The original post is here, with more on Flickr.

Tweets of Surgical Colleges – what does it say about them?

What do the UK and Ireland Surgical Royal Colleges tweet about and how do they compare to the American College of Surgeons?

Twitter allows retrieval of the last 3200 tweets of a given user. Here are all the tweets ever sent by the Royal Colleges, retrieved a few days ago. The American College has tweeted over 6000 times, so only the latest 3200 are included. The Glasgow College is just getting going.
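
For anyone wanting to repeat this, a rough sketch of the retrieval step using the twitteR package, assuming an authenticated Twitter session (the handle below is illustrative only):

# install.packages("twitteR") # remove "#" first time to install
library(twitteR)
tweets<-userTimeline("RCSnews", n=3200)  # up to the last 3200 tweets from one account
tweet_text<-twListToDF(tweets)$text      # plain text of each tweet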

There is a bit of processing first. Charts are generated after removal of “stop words” – all the little words that go in between. Words then have common endings removed (e.g. -ing; stemming), and each stem is then completed with the most common ending in its group (stem completion).
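
A minimal sketch of that processing with the tm and SnowballC packages, assuming the tweet text is in the character vector tweet_text from the snippet above:

# install.packages(c("tm", "SnowballC")) # remove "#" first time to install
library(tm)
library(SnowballC)
corpus<-Corpus(VectorSource(tweet_text))
corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, removePunctuation)
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
dictionary<-corpus                                         # keep an unstemmed copy for completion
corpus<-tm_map(corpus, stemDocument)                       # strip common endings (stemming)
tdm<-TermDocumentMatrix(corpus)                            # terms x tweets
# stemCompletion() can then replace each stem with its most frequent
# completion from `dictionary` before plotting word frequencies.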

So what can be said? I was interested in whether Colleges tweet about training. I was pleased to see that the UK colleges do – a fair amount. Terms that are associated with training were less apparent in tweets from the RCSI and ACS.

Frequency of words in tweets from five surgical colleges

The figures below show clustering of terms within tweets, with term frequency increasing from left to right. There are some nice themes that emerge. In the RCSEng tweets there are themes relating to “training”, “events”, “working time”, and “the NHS”. Similar subjects are apparent in RCSEd tweets, with prominence of their medical students’ surgical skills competition and issues specifically relating to the NHS in Scotland. As the RCPSG have only just started tweeting, associations are greatly influenced by individual tweets. The RCSI’s “Transition Year Mini Med School Programme” and “MiniMed School Open Lecture Series” (updated 22/04/13) can be seen together with conference promotion. The ACS appear to use Twitter to communicate issues relating to patient health improvement programmes more prominently than the other Colleges.

Term association in surgical college tweets (cluster dendrogram)
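
For what it’s worth, a minimal sketch of how such a dendrogram can be produced from the term–document matrix built above (an illustration, not the exact code used for the figures):

d<-dist(scale(as.matrix(removeSparseTerms(tdm, sparse=0.95)))) # distances between reasonably frequent terms
fit<-hclust(d)                                                 # hierarchical clustering (complete linkage by default)
plot(fit, cex=0.8)                                             # cluster dendrogram of terms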

Network plots illustrate the strength of association of terms (weight of edges) and frequency of terms (font size of vertices). Do the terms in these plots represent the core values of these organisations?

Term network of College tweets
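
Again purely as a sketch, network plots of this kind can be approximated with the igraph package, using term co-occurrence within tweets as the edge weight (illustrative code, not the original):

# install.packages("igraph") # remove "#" first time to install
library(igraph)
m<-as.matrix(removeSparseTerms(tdm, sparse=0.98))
freq<-rowSums(m)                            # term frequencies
m[m>1]<-1                                   # term present in a tweet or not
adj<-m %*% t(m)                             # term-term co-occurrence matrix
g<-graph.adjacency(adj, weighted=TRUE, mode="undirected", diag=FALSE) # graph_from_adjacency_matrix() in newer igraph
V(g)$label.cex<-0.7+1.5*freq/max(freq)      # font size of vertices reflects term frequency
E(g)$width<-4*E(g)$weight/max(E(g)$weight)  # edge width reflects strength of association
plot(g, vertex.size=5, vertex.color="lightblue")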

Publication of paediatric cardiac surgery results

The National Institute for Cardiovascular Outcomes Research (NICOR) has published the results of its investigation into mortality after paediatric heart surgery in England 2009-12.

The short report has two main findings: the quality of data collection at Leeds General Infirmary (LGI) was woeful, and differences in mortality between hospitals are likely to be explained by natural variation.

[Table: nicor_data]

The ability of an institution to collect and audit its own results can be viewed as a measure of organisational health. As can be seen in the table, the performance of LGI in this respect was terrible, and much worse than that of other units – a cause for concern in itself.

On the more controversial point of whether the mortality rate in LGI was worse than other centres, no convincing proof of this has been found.

The funnel plot below shows the number of expected deaths along the bottom. Centres performing greater numbers of procedures have a greater number of expected deaths, just by common sense.

These numbers have been corrected for the difference in the types of patients and surgery performed in hospitals – the specific procedure performed, patient age, weight, diagnosis, and previous medical conditions. All these factors impact on the risk of death following surgery.

Any hospital above the black horizontal line has a greater number of deaths than predicted and any hospital below has fewer.
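
Put another way: if each operation carries a predicted risk of death from the case-mix model, the expected number of deaths for a hospital is simply the sum of those risks, and its position on the plot reflects the ratio of observed to expected deaths:

$latex E=\sum_i p_i, \qquad SMR=O/E$

where $latex p_i$ is the predicted risk for case $latex i$ and $latex O$ is the observed number of deaths.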

[Figure: nicor_funnel]

By “the law of averages”, it would be expected that there was a roughly equal spread of hospitals above and below the line.

As can be seen, Alder Hey, Guys, and LGI are all close to triggering an “alert”.

The report rightly states that these units “may deserve additional scrutiny and monitoring of current performance”.

The 3-year risk adjusted mortality rate in LGI is 1.47 times the national average – lower than the “twice the national average” first reported.

The unambiguous message? Data collection and real-time analysis are core business in healthcare. Government and the NHS still do not have a grip on this. There are many more stories of significant differences between hospitals hidden in poor-quality data that no one is looking at.

Mortality after paediatric heart surgery using public domain data

This post comes with some big health warnings.

The recent events in Leeds highlight the difficulties faced in judging the results of surgery by individual hospital. A clear requirement is timely access to data in a form easily digestible by the public.

Here I’ve scraped the publicly available data from the central cardiac audit database (CCAD). All the data are available at the links provided and are as presented this afternoon (06/04/2013). Please read the caveats carefully.

Hospital-specific 30-day mortality data are available for certain paediatric heart surgery procedures for 2009-2012. These data are not complete for 2011-12 and there may be missing data for earlier years. There may be important data for procedures not included here that should be accounted for. There is no case-mix adjustment.

All data are included in spreadsheets below as well as the code to run the analysis yourself, to ensure no mistakes have been made. Hopefully these data will be quickly superseded with a quality-assured update.

Mortality by centre

The funnel plot below has been generated by taking all surgical procedures performed from pages such as this and expressing all deaths within 30 days as a proportion of the total procedures performed by hospital.

The red horizontal line is the mean mortality rate for these procedures – 2.3%. The green, blue and red curved lines are 90%, 95% and 99.7% control limits respectively, within which unit results may vary by chance.

[Figure: ccad_funnel_april_2013]

Mortality by procedure

The mortality associated with different procedures can be explored with this Google motion chart. Note the great year-to-year variation when a procedure is uncommon (to the left of the chart). These bouncing balls trace out the limits of a funnel plot, and highlight why year-to-year differences in mortality rates for rare procedures must be interpreted with caution.

[Embedded googleVis motion chart]

 

Data

ccad_public_data_april_2013.xls

ccad_public_data_april_2013_centre

ccad_public_data_april_2013_aggregate

ccad_public_data_april_2013_lookup

 

Script

####################################
# CCAD public domain data analysis #
# 6 April 2013                     #
# Ewen Harrison                    #
# www.datasurg.net                 #
#################################### 

# Read data
data<-read.table("ccad_public_data_april_2013_centre.csv", sep=",", header=TRUE)

# Correct variable-type
data$centre_code<-factor(data$centre_code)

# Read centre names table
centre<-read.table("ccad_public_data_april_2013_lookup.csv", sep=",", header=TRUE)

# Combine
data<-merge(data, centre, by="centre_code")

# Subset by only procedures termed "Surgery" and remove procedures with no data. 
surg<-subset(data, type=="Surgery" & !is.na(data$centre_code))

# Display data
surg
str(surg)

# install.packages("plyr") # remove "#" first time to install
library(plyr)

# Aggregate data by centre
agg.surg<-ddply(surg, .(centre_code), summarise, observed_mr=sum(death_30d)/sum(count), 
  sum_death=sum(death_30d), count=sum(count))

# Overall mortality rate for procedures listed in all centres
mean.mort<-sum(surg$death_30d)/sum(surg$count)
mean.mort #2.3%

# Generate binomial confidence limits

# install.packages("binom") # remove "#" first time to install
library(binom)
binom_n<-seq(90, 1100, length.out=40)
ci.90<-binom.confint(mean.mort*binom_n, binom_n, conf.level = 0.90, methods = "agresti-coull")
ci.95<-binom.confint(mean.mort*binom_n, binom_n, conf.level = 0.95, methods = "agresti-coull")
ci.99<-binom.confint(mean.mort*binom_n, binom_n, conf.level = 0.997, methods = "agresti-coull")

# Plot chart
# install.packages("ggplot2") # remove "#" first time to install
library(ggplot2)
ggplot()+
	geom_point(data=agg.surg, aes(count,observed_mr))+
	geom_line(aes(ci.90$n, ci.90$lower, colour = "90% CI"))+ #hack to get legend
	geom_line(aes(ci.90$n, ci.90$upper), colour = "green")+
	geom_line(aes(ci.95$n, ci.95$lower, colour = "95% CI"))+ 
	geom_line(aes(ci.95$n, ci.95$upper), colour = "blue")+
	geom_line(aes(ci.99$n, ci.99$lower, colour = "99.7% CI"))+ 
	geom_line(aes(ci.99$n, ci.99$upper), colour = "red")+
	geom_text(data=agg.surg, aes(count,observed_mr,label=centre_code), size=3,vjust=-1)+
	geom_line(aes(x=90:1100, y=mean.mort), colour="red")+
	ggtitle("Observed mortality rate following paediatric heart surgery\nby centre using CCAD public domain data 2009-2012 (incomplete)")+
	scale_x_continuous(name="Number cases performed per centre 2009-2012", limits=c(90,1100))+
	scale_y_continuous(name="Observed mortality rate")+
	scale_colour_manual("",
		breaks=c("90% CI", "95% CI", "99.7% CI"),
		values=c("green", "blue", "red"))+
	theme_bw()+
	theme(legend.position=c(.9, .9))

# Google motion chart
# Load national aggregate data by procedure
agg_data<-read.table("ccad_public_data_april_2013_aggregate.csv", sep=",", header=TRUE)

# check
str(agg_data)

# install.packages("googleVis") # remove "#" first time to install
library(googleVis)
Motion=gvisMotionChart(agg_data, idvar="procedure", timevar="year",
	options=list(height=500, width=600,
		state='{"showTrails":true,"yZoomedDataMax":1,"iconType":"BUBBLE","orderedByY":false,"playDuration":9705.555555555555,"xZoomedIn":false,"yLambda":1,"xAxisOption":"3","nonSelectedAlpha":0.4,"xZoomedDataMin":0,"iconKeySettings":[],"yAxisOption":"5","orderedByX":false,"yZoomedIn":false,"xLambda":1,"colorOption":"2","dimensions":{"iconDimensions":["dim0"]},"duration":{"timeUnit":"Y","multiplier":1},"xZoomedDataMax":833,"uniColorForNonSelected":false,"sizeOption":"3","time":"2000","yZoomedDataMin":0.33};'),
		chartid="Survival_by_procedure_following_congenital_cardiac_surgery_in_UK_2000_2010")
plot(Motion)

 

 

Two simple tests for summary data


Here are two handy scripts for hypothesis testing of summary data. I seem to use these a lot when checking work:

  • Chi-squared test of association for categorical data.
  • Student’s t-test for difference in means of normally distributed data.

The actual equations are straightforward, but get involved when group sizes and variances are not equal. Why do I use these a lot?!

I wrote about a study from Hungary in which the variability in the results seemed much lower than expected. We wondered whether the authors had made a mistake in saying they were showing the standard deviation (SD), when in fact they had presented the standard error of the mean (SEM).

This is a bit of table 1 from the paper. It shows the differences in baseline characteristics between the treated group (IPC) and the active control group (IP). In it, they report no difference between the groups for these characteristics, p>0.05.

But taking “age” as an example and using the simple script for a Student’s t-test with these figures, the answer we get is different. Mean (SD) for group A vs. group B: 56.5 (2.3) vs. 54.8 (1.8); t=4.12, df=98, p<0.001.
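
For reference, these numbers come from the usual pooled (equal-variance) two-sample t-test:

$latex t=\frac{\bar{x}_1-\bar{x}_2}{s_p\sqrt{1/n_1+1/n_2}}, \qquad s_p^2=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$

which, with the means and SDs above and $latex n_1=n_2=50$ (as in the script below), gives t of about 4.12 on 98 degrees of freedom.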

There are lots of similar examples in the paper.

Using standard error of the mean rather than standard deviation gives a non-significant difference as expected.

$latex SEM=SD/\sqrt{n}.$

See here for how to get started with R.

####################
# Chi-sq test from #
# two by two table #
####################

#           Factor 1
# Factor 2  a   |   b
#           c   |   d

a<-32
b<-6
c<-43
d<-9

m<-rbind(c(a,b), c(c,d))
m
chisq.test(m, correct = FALSE)
# Details here
help(chisq.test)

############################
# T-test from summary data #
############################

# install.packages("BSDA") # remove first "#" to install first time only
library(BSDA)
x1<-56.5     # group 1 mean
x1_sd<-2.3   # group 1 standard deviation
n1<-50       # group 1 n
x2<-54.8     # group 2 mean
x2_sd<-1.8   # group 2 standard deviation
n2<-50       # group 2 n

tsum.test(x1, x1_sd, n1, x2, x2_sd, n2, var.equal = TRUE)
# Details here
help(tsum.test)
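
And to close the loop on the SEM point above: treating the reported 2.3 and 1.8 as standard errors, the implied standard deviations are $latex SEM\times\sqrt{n}$, and the same test is no longer significant.

# Re-run treating the reported values as SEMs rather than SDs
tsum.test(x1, x1_sd*sqrt(n1), n1, x2, x2_sd*sqrt(n2), n2, var.equal = TRUE)
# t is now about 0.6 and p well above 0.05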

 

R function to retrieve pubmed citations from pmid number


This is useful if you have hundreds of PMIDs and need specific fields from the PubMed/Medline citation.

#################################
# Retrieve pubmed citation data #
# Ewen Harrison                 #
# March 2013                    #
# www.datasurg.net              #
#################################

# Sample data
pmid<-c(8339401,8485120,8427851,3418853,3195585,2643302,7947591,8274474,8243847)
#-----------------------------------------------------------------------------------------------------------------------------------
# Batch list of PMIDs into groups of 200
fn_batch<-function(pmid=pmid,n=200){
	require(plyr)	
	b<-list()
	max=ceiling(length(pmid)/n)
	for (i in 1:max){
		b[[i]]<-pmid[(1+((i-1)*n)):(n*i)]
	}
	b[[max]]<-b[[max]][!is.na(b[[max]])]	# drop missing values in the final block
	c<-llply(b, function(a){			# convert each block from a vector to a comma separated string
		paste(a, collapse=",")
	})
	return(c)
}
#-----------------------------------------------------------------------------------------------------------------------------------
# Run
pmid_batch<-fn_batch(pmid)
#-----------------------------------------------------------------------------------------------------------------------------------
# Function to fetch, parse and extract medline citation data. Use wrapper below.
get_pubmed<-function(f){
	require(RCurl)
	require(XML)
	require(plyr)
		# Post PMID (UID) numbers
	url<-paste("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=", f, "&retmode=XML", sep="")
		# Medline fetch
	g<-llply(url, .progress = progress_tk(label="Fetching and parsing Pubmed records ..."), function(x){
		xmlTreeParse(x, useInternalNodes=TRUE)
		})
		# Using given format and xml tree structure, paste here the specific fields you wish to extract
	k<-ldply(g, .progress = progress_tk(label="Creating dataframe ..."), function(x){
		a<-getNodeSet(x, "/PubmedArticleSet/*/MedlineCitation")
		pmid_l<-sapply (a, function(a) xpathSApply(a, "./PMID", xmlValue))
		pmid<-lapply(pmid_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		nct_id_l<-sapply (a, function(a) xpathSApply(a, "./Article/DataBankList/DataBank/AccessionNumberList/AccessionNumber", xmlValue))
		nct_id<-lapply(nct_id_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		year_l<-sapply (a, function(a) xpathSApply(a, "./Article/Journal/JournalIssue/PubDate/Year", xmlValue))
		year<-lapply(year_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		month_l<-sapply (a, function(a) xpathSApply(a, "./Article/Journal/JournalIssue/PubDate/Month", xmlValue))
		month<-lapply(month_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		day_l<-sapply (a, function(a) xpathSApply(a, "./Article/Journal/JournalIssue/PubDate/Day", xmlValue))
		day<-lapply(day_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		year1_l<-sapply (a, function(a) xpathSApply(a, "./Article/ArticleDate/Year", xmlValue))
		year1<-lapply(year1_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		month1_l<-sapply (a, function(a) xpathSApply(a, "./Article/ArticleDate/Month", xmlValue))
		month1<-lapply(month1_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		day1_l<-sapply (a, function(a) xpathSApply(a, "./Article/ArticleDate/Day", xmlValue))
		day1<-lapply(day1_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		medlinedate_l<-sapply (a, function(a) xpathSApply(a, "./Article/Journal/JournalIssue/PubDate/MedlineDate", xmlValue))
		medlinedate<-lapply(medlinedate_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		journal_l<-sapply (a, function(a) xpathSApply(a, "./Article/Journal/Title", xmlValue))
		journal<-lapply(journal_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		title_l<-sapply (a, function(a) xpathSApply(a, "./Article/ArticleTitle", xmlValue))
		title<-lapply(title_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		author_l<-sapply (a, function(a) xpathSApply(a, "./Article/AuthorList/Author[1]/LastName", xmlValue))
		author<-lapply(author_l, function(x) ifelse(class(x)=="list" | class(x)=="NULL", NA, x))
		return(data.frame(nct_pm=unlist(nct_id), pmid=unlist(pmid), year=unlist(year), month=unlist(month), day=unlist(day),
			year1=unlist(year1), month1=unlist(month1), day1=unlist(day1), medlinedate=unlist(medlinedate), journal=unlist(journal), 
			title=unlist(title), author=unlist(author) ))
	})
}

#-----------------------------------------------------------------------------------------------------------------------------------
# Wrapper function uses batched PMID list and get_pubmed to run pubmed search
# Path takes desired name for folder to save data frames referred to as data files.

fn_pubmed<-function(pmid = pmid_batch, path="pmid_data", from=1, to=max){
	require(plyr)
	max<-length(pmid)
	if (file.exists(path)==FALSE){
		dir.create(path)
	}
	for (i in from:to){
		df<-data.frame()
		df<-get_pubmed(pmid[[i]])
		file<-paste(path, "/data", i,".txt", sep="")
		write.table(df, file=file, sep=";")
	}
}
#-----------------------------------------------------------------------------------------------------------------------------------
fn_pubmed(pmid_batch)
#-----------------------------------------------------------------------------------------------------------------------------------
# Merge back saved tables
fn_merge<-function(path="pmid_data"){
	require(plyr)
	data_files<-list.files(path, full.names=T)
	df<-ldply(data_files, function(x){
		df1<-read.csv(x, header=TRUE, sep=";")
		df2<-data.frame(data_files=gsub(pattern=paste(path, "/", sep=""), replacement="",x), df1)
		return(df2)
	})
	return(df)
}
#-----------------------------------------------------------------------------------------------------------------------------------
# Run
data<-fn_merge()