Prediction is very difficult, especially about the future

As Niels Bohr, the Danish physicist, put it, “prediction is very difficult, especially about the future”. Prognostic models are commonplace and seek to help patients and the surgical team estimate the risk of a specific event, for instance, the recurrence of disease or a complication of surgery. “Decision-support tools” aim to help patients make difficult choices, with the most useful providing personalized estimates to assist in balancing the trade-offs between risks and benefits. As we enter the world of precision medicine, these tools will become central to all our practice.

In the meantime, there are limitations. Overwhelming evidence shows that the quality of reporting of prediction model studies is poor. In some instances, the details of the actual model are considered commercially sensitive and are not published, making the assessment of the risk of bias and potential usefulness of the model difficult.

In this edition of HPB, Beal and colleagues aim to validate the American College of Surgeons National Quality Improvement Program (ACS NSQIP) Surgical Risk Calculator (SRC) using data from 854 gallbladder cancer and extrahepatic cholangiocarcinoma patients from the US Extrahepatic Biliary Malignancy Consortium. The authors conclude that the “estimates of risk were variable in terms of accuracy and generally calculator performance was poor”. The SRC underpredicted the occurrence of all examined end-points (death, readmission, reoperation and surgical site infection) and discrimination and calibration were particularly poor for readmission and surgical site infection. This is not the first report of predictive failures of the SRC. Possible explanations cited previously include small sample size, homogeneity of patients, and too few institutions in the validation set. That does not seem to the case in the current study.

The SRC is a general-purpose risk calculator and while it may be applicable across many surgical domains, it should be used with caution in extrahepatic biliary cancer. It is not clear why the calculator does not provide measures of uncertainty around estimates. This would greatly help patients interpret its output and would go a long way to addressing some of the broader concerns around accuracy.

Preserving liver while removing all the cancer

“Radical-but-conservative” parenchymal-sparing hepatectomy (PSH) for colorectal liver metastases (Torzilli 2017) is increasing reported. The PSH revolution has two potential advantages: avoiding postoperative hepatic failure (POHF) and increasing the possibility of re-do surgery in the common event of future recurrence. However, early series reported worse long-term survival and higher positive margin rates with a parenchymal-sparing approach, with a debate ensuing about the significance of the latter in an era where energy-devices are more commonly employed in liver transection. No randomised controlled trials exist comparing PSH with major hepatectomy and case series are naturally biased by selection.

In this issues of HPB, Lordan and colleagues report a propensity-score matched case-control series of PSH vs. major hepatectomy. The results are striking. The PSH approach was associated with less blood transfusion (10.1 vs 27.7%), fewer major complications (3.8 vs 9.2%), and lower rates of POHF (0 vs 5.5%). Unusually, perioperative mortality (0.8 vs 3.8%) was also lower in the PSH group and longer-term oncologic and survival outcomes were similar.

Results of propensity-matched analyses must always be interpreted with selection bias in mind. Residual confounding always exists: the patients undergoing major hepatectomy almost certainly had undescribed differences from the PSH group and may not have been technically suitable for PSH. Matching did not account for year of surgery, so with PSH becoming more common the generally improved outcomes over time will bias in favour of the parenchymal-sparing approach. Yet putting those concerns aside, there are two salient results. Firstly, PSH promises less POHF and in this series, there was none. Secondly, PSH promises greater opportunity for redo liver surgery. There was 50% liver-only recurrence in both groups. Although not reported by the authors, a greater proportion of PSH patients underwent redo surgery (35/119 (29.4%) vs. 23/130 (17.7%) (p=0.03). Perhaps for some patients, the PSH revolution is delivering some of its promised advantages.

Realistic medicine

Realistic medicine is a useful concept describing healthcare that puts patients at the centre of decision making and treatment, with an aim to reduce harm, waste and unwarranted variation. One of the great challenges in medicine today is supporting patients with incurable disease in their treatment choices. Advising patients on interventions that offer reducing benefits in the face of increasing potential harms, when they may feel obliged to “take all treatments going”, requires honesty, candour and data. Realism is a better term than futility, but they are two sides of the same coin.

In HPB, Kim and colleagues examine survival after recurrence of bile duct cancer. The facts of this disease are always sobering: the median survival after diagnosis of recurrence is 7 months. The study is useful in that the authors have sufficient numbers to examine subgroups of those with recurrence to identify which patients may potentially benefit from salvage treatment (which was mostly chemotherapy). For those with poorly differentiated primary tumours, a short time to recurrence, poor performance status and elevated CA19-9, survival was only a handful of months.

This is a pragmatic non-randomised study with inherent selection bias, but the aim was not to determine the potential benefit of salvage treatment (we await the full publication of studies such as BILCAP). Also, the predictive ability of the model was not particularly high (c-statistic= 0.65). However, it does serve to illustrate the important point that for some very unfortunate patients with poor-prognosis recurrence, survival will be short and they may be better advised to focus on priorities other than chemotherapy. As Atul Gawande remarks in Being Mortal, “We’ve been wrong about what our job is in medicine. We think our job is to ensure health and survival. But really it is larger than that. It is to enable well-being.”

Having a low blood count increase complications from liver surgery

A low blood count is common with cancer. There are now more studies showing that this can contribute to complications after surgery. Blood transfusion increases blood count but is best avoided in cancer unless the blood count is very low. This new study in the journal HPB shows the effect of anaemia after liver surgery. Here is the editorial highlight I wrote for the journal.

Preoperative anaemia is common and affects 30-60% of patients undergoing major elective surgery. In major non-cardiac surgery, anaemia is associated with increased morbidity and mortality, as well as higher blood transfusion rates.

The importance of preoperative anaemia in liver resection patients is becoming recognised. In this issue, Tohme and colleagues present an evaluation of the American College of Surgeons’ National Surgical Quality Improvement Program (ACS-NSQIP) database.

Of around 13000 patients who underwent elective liver resection from 2005 to 2012, one third were anaemic prior to surgery. After adjustment, anaemia was associated with major complications after surgery (OR 1.21, 1.09-1.33) but not death.

Patients who are anaemic have different characteristics to those who are not, characteristics that are likely to make them more susceptible to complications. While this analysis extensively adjusts for observed factors, residual confounding almost certainly exists.

The question remains, does anaemia itself contribute to the occurrence of complications, or is it just a symptom of greater troubles? The authors rightly highlight the importance of identifying anaemia prior to surgery, but it remains to be seen whether treatment is possible and whether it will result in better patient outcomes.

Perioperative transfusion is independently associated with major complications. Although there is no additive effect in anaemic patients, the benefits of treating anaemia may be offset by the detrimental effect of transfusion. For those with iron deficiency, treatment with intravenous iron may be of use and is currently being studied in an RCT of all major surgery (preventt.lshtm.ac.uk). Results of studies such as these will help determine causal relationships and whether intervention is possible and beneficial.

Predicting liver failure and death after liver surgery

There have been many attempts to define predictive models for the identification of patients at risk of liver failure after surgery (posthepatectomy liver failure (PHLF)) and death. These have previously been hindered by the lack of a robust definition of PHLF and the two most commonly used definitions – the 50-50 and International Study Group of Liver Surgery (ISGLS) criteria – have now helped with this. These definitions are based on a measure of blood clotting (prothrombin time) and the serum bilirubin concentration, reflecting the synthetic and excretory/detoxifying functions of the liver. One criticism of these is that the criteria are taken on day 5 after surgery, a time-point some have argued is too late to intervene upon.

In a new analysis, Herbert and colleagues present an analysis of 1528 major liver resection patients and examine the changes in serum phosphate levels and creatinine immediate after surgery. It was previously shown a failure of phosphate levels to fall after surgery is associated with liver failure and death (Squires, HPB, 2014). Low serum phosphate after liver resection is well recognised and originally thought to be a consequence of consumption during liver growth (hypertrophy). However, while active take-up of phosphate into the liver after surgery does happen, this is insufficient to fully explain low phosphate levels. The authors point to studies demonstrating a significant increase in the urinary excretion of phosphate following hepatectomy which may also contribute.

Herbert provides a practical definition: creatinine on day 1 post surgery (PoD1) > day of surgery (DoS) and phosphate fails to decrease by 20% from DoS to PoD1. There is a strong association in multivariable analyses with death (Odds ratio 2.53, 1.36–4.71) and PHLF (3.89, 1.85–8.37).

The serum phosphate/creatinine definition identified 52% of those that died, but also 25% that survived without evidence of PHLF. It may be that this can be improved by incorporating other parameters, or my identifying a high risk group a priori. Given the lack of specific therapies beyond that of high quality intensive care, whether death can actually be averted is separate question.

An alternative presentation of the ProPublica Surgeon Scorecard

ProPublica, an independent investigative journalism organisation, have published surgeon-level complications rates based on Medicare data. I have already highlighted problems with the reporting of the data: surgeons are described as having a “high adjusted rate of complications” if they fall in the red-zone, despite there being too little data to say whether this has happened by chance.

4
This surgeon should not be identified as having a “high adjusted rate of complications” as there are too few cases to estimate the complication rate accurately.

I say again, I fully support transparency and public access to healthcare. But the ProPublica reporting has been quite shocking. I’m not aware of them publishing the number of surgeons out of the 17000 that are statistically different to the average. This is a small handful.

ProPublica could have chosen a different approach. This is a funnel plot and I’ve written about them before.

A funnel plot is a summary of an estimate (such as complication rate) against a measure of the precision of that estimate. In the context of healthcare, a centre or individual outcome is often plotted against patient volume. A horizontal line parallel to the x-axis represents the outcome for the entire population and outcomes for individual surgeons are displayed as points around this. This allows a comparison of individuals with that of the population average, while accounting for the increasing certainty surrounding that outcome as the sample size increases. Limits can be determined, beyond which the chances of getting an individual outcome are low if that individual were really part of the whole population.

In other words, a surgeon above the line has a complication rate different to the average.

I’ve scraped the ProPublica data for gallbladder removal (laparoscopic cholecystectomy) from California, New York and Texas for surgeons highlighted in the red-zone. These are surgeons ProPublica says have high complication rates.

As can be seen from the funnel plot, these surgeons are no where near being outliers. There is insufficient information to say whether any of them are different to average. ProPublica decided to ignore the imprecision with which the complication rates are determined. For red-zone surgeons from these 3 states, none of them have complication rates different to average.

ProPublica_lap_chole_funnel
Black line, population average (4.4%), blue line 95% control limit, red line 99% control limit.

How likely is it that a surgeon with an average complication rate (4.4%) will appear in the red-zone just by chance (>5.2%)? The answer is, pretty likely given the small numbers of cases here: anything up to a 25% chance depending on the number of cases performed. Even at the top of the green-zone (low ACR, 3.9%), there is still around a 1 in 6 chance a surgeon will appear to have a high complication rate just by chance.

chance_of_being_in_redzoneProPublica have failed in their duty to explain these data in a way that can be understood. The surgeon score card should be revised. All “warning explanation points” should be removed for those other than the truly outlying cases.

Data

Download

Git

Link to repository.

Code

# ProPublica Surgeon Scorecard 
# https://projects.propublica.org/surgeons

# Laparoscopic cholecystectomy (gallbladder removal) data
# Surgeons with "high adjusted rate of complications"
# CA, NY, TX only

# Libraries needed ----
library(ggplot2)
library(binom)

# Upload dataframe ----
dat = read.csv("http://www.datasurg.net/wp-content/uploads/2015/07/ProPublica_CA_NY_TX.csv")

# Total number reported
dim(dat)[1] # 59

# Remove duplicate surgeons who operate in more than one hospital
duplicates = which(
    duplicated(dat$Surgeon)
)

dat_unique = dat[-duplicates,]
dim(dat_unique) # 27

# Funnel plot for gallbladder removal adjusted complication rate -------------------------
# Set up blank funnel plot ----
# Set control limits
pop.rate = 0.044 # Mean population ACR, 4.4%
binom_n = seq(5, 100, length.out=40)
ci.90 = binom.confint(pop.rate*binom_n, binom_n, conf.level = 0.90, methods = "wilson")
ci.95 = binom.confint(pop.rate*binom_n, binom_n, conf.level = 0.95, methods = "wilson")
ci.99 = binom.confint(pop.rate*binom_n, binom_n, conf.level = 0.99, methods = "wilson")

theme_set(theme_bw(24))
g1 = ggplot()+
    geom_line(data=ci.95, aes(ci.95$n, ci.95$lower*100), colour = "blue")+ 
    geom_line(data=ci.95, aes(ci.95$n, ci.95$upper*100), colour = "blue")+
    geom_line(data=ci.99, aes(ci.99$n, ci.99$lower*100), colour = "red")+ 
    geom_line(data=ci.99, aes(ci.99$n, ci.99$upper*100), colour = "red")+
    geom_line(aes(x=ci.90$n, y=pop.rate*100), colour="black", size=1)+
    xlab("Case volume")+
    ylab("Adjusted complication rate (%)")+
    scale_colour_brewer("", type = "qual", palette = 6)+
    theme(legend.justification=c(1,1), legend.position=c(1,1))
g1

g1 + 
    geom_point(data=dat_unique, aes(x=Volume, y=ACR), colour="black", alpha=0.6, size = 6, 
                         show_guide=TRUE)+
    geom_point(data=dat_unique, aes(x=Volume, y=ACR, colour=State), alpha=0.6, size=4) +
    ggtitle("Funnel plot of adjusted complication rate in CA, NY, TX")


# Probability of being shown as having high complication rate ----
# At 4.4%, what are the changes of being 5.2% by chance?
n <- seq(15, 150, 1)
average = 1-pbinom(ceiling(n*0.052), n, 0.044)
low = 1-pbinom(ceiling(n*0.052), n, 0.039)

dat_prob = data.frame(n, average, low)

ggplot(melt(dat_prob, id="n"))+
    geom_point(aes(x=n, y=value*100, colour=variable), size=4)+
    scale_x_continuous("Case volume", breaks=seq(10, 150, 10))+
    ylab("Adjusted complication rate (%)")+
    scale_colour_brewer("True complication rate", type="qual", palette = 2, labels=c("Average (4.4%)", "Low (3.9%)"))+
    ggtitle("ProPublica chance of being in high complication rate zone by\nchance when true complication rate \"average\" or \"low\"")+
    theme(legend.position=c(1,0), legend.justification=c(1,0))

The problem with ProPublica’s surgeon scorecards

ProPublica is an organisation performing independent, non-profit investigative journalism in the public interest. Yesterday it published an analysis of surgeon-level complications rates based on Medicare data.

Publication of individual surgeons results is well established in the UK. Transparent, easily accessible healthcare data is essential and initiatives like this are welcomed.

It is important that data are presented in a way that can be clearly understood. Communicating risk is notoriously difficult. This is particularly difficult when it is necessary to describe the precision with which a risk has been estimated.

Unfortunately that is where ProPublica have got it all wrong.

There is an inherent difficulty faced when we dealing with individual surgeon data. In order to be sure that a surgeon has a complication rate higher than average, that surgeon needs to have performed a certain number of that particular procedure. If data are only available on a small number of cases, we can’t be certain whether the surgeon’s complication rate is truly high, or just appears to be high by chance.

If you tossed a coin 10 times and it came up with 7 heads, could you say whether the coin was fair or biased? With only 10 tosses we don’t know.

Similarly, if a surgeon performs 10 operations and has 1 complication, can we sure that their true complication rate is 10%, rather than 5% or 20%? With only 10 operations we don’t know.

The presentation of the ProPublica data is really concerning. Here’s why.

For a given hospital, data are presented for individual surgeons. Bands are provided which define “low”, “medium” and “high” adjusted complication rates. If the adjusted complication rate for an individual surgeon falls within the red-zone, they are described as having a “high adjusted rate of complications”.

1How confident can we be that a surgeon in the red-zone truly has a high complication rate? To get a handle on this, we need to turn to an off-putting statistical concept called a “confidence interval”. As it’s name implies, a confidence interval tells us what degree of confidence we can treat the estimated complication rate.

2If the surgeon has done many procedures, the confidence interval will be narrow. If we only have data on a few procedures, the confidence interval will be wide.

To be confident that a surgeon has a high complication rate, the 95% confidence interval needs to entirely lie in the red-zone.

A surgeon should be highlighted as having a high complication rate if and only if the confidence interval lies entirely in the red-zone.

Here is an example. This surgeon performs the procedure to remove the gallbladder (cholecystectomy). There are data on 20 procedures for this individual surgeon. The estimated complication rate is 4.7%. But the 95% confidence interval goes from the green-zone all the way to the red-zone. Due to the small number of procedures, all we can conclude is that this surgeon has either a low, medium, or high adjusted complication rate. Not very useful.

8Here are some other examples.

Adjusted complication rate: 1.5% on 339 procedures. Surgeon has low or medium complication rate. They are unlikely to have a high complication rate.

5Adjusted complication rate: 4.0% on 30 procedures. Surgeon has low or medium or high complication rate. Note due to the low numbers of cases, the analysis correctly suggests an estimated complication rate, despite the fact this surgeon has not had any complications for the 30 procedures.
3Adjusted complication rate: 5.4% on 21 procedures. ProPublica conclusion: surgeon has high adjusted complication rate. Actual conclusion: surgeon has low, medium or high complication rate.
4Adjusted complication rate: 6.6% on 22 procedures. ProPublica conclusion: surgeon has high adjusted complication rate. Actual conclusion: surgeon has medium or high complication rate, but is unlikely to have a low complication rate.
6Adjusted complication rate: 7.6% on 86 procedures. ProPublica conclusion: surgeon has high adjusted complication rate. Actual conclusion: surgeon has high complication rate. This is one of the few examples in the dataset, where the analysis suggest this surgeon does have a high likelihood of having a high complication rate.

7In the UK, only this last example would to highlighted as concerning. That is because we have no idea whether surgeons who happen to fall into the red-zone are truly different to average.

The analysis above does not deal with issues others have highlighted: that this is Medicare data only, that important data may be missing , that the adjustment for patient case mix may be inadequate, and that the complications rates seem different to what would be expected.

ProPublica have not moderated the language used in reporting these data. My view is that the data are being misrepresented.

ProPublica should highlight cases like the last mentioned above. For all the others, all that can be concluded is that there are too few cases to be able to make a judgement on whether the surgeon’s complication rate is different to average.

Bayesian statistics and clinical trial conclusions: Why the OPTIMSE study should be considered positive

Statistical approaches to randomised controlled trial analysis

The statistical approach used in the design and analysis of the vast majority of clinical studies is often referred to as classical or frequentist. Conclusions are made on the results of hypothesis tests with generation of p-values and confidence intervals, and require that the correct conclusion be drawn with a high probability among a notional set of repetitions of the trial.

Bayesian inference is an alternative, which treats conclusions probabilistically and provides a different framework for thinking about trial design and conclusions. There are many differences between the two, but for this discussion there are two obvious distinctions with the Bayesian approach. The first is that prior knowledge can be accounted for to a greater or lesser extent, something life scientists sometimes have difficulty reconciling. Secondly, the conclusions of a Bayesian analysis often focus on the decision that requires to be made, e.g. should this new treatment be used or not.

There are pros and cons to both sides, nicely discussed here, but I would argue that the results of frequentist analyses are too often accepted with insufficient criticism. Here’s a good example.

OPTIMSE: Optimisation of Cardiovascular Management to Improve Surgical Outcome

Optimising the amount of blood being pumped out of the heart during surgery may improve patient outcomes. By specifically measuring cardiac output in the operating theatre and using it to guide intravenous fluid administration and the use of drugs acting on the circulation, the amount of oxygen that is delivered to tissues can be increased.

It sounds like common sense that this would be a good thing, but drugs can have negative effects, as can giving too much intravenous fluid. There are also costs involved, is the effort worth it? Small trials have suggested that cardiac output-guided therapy may have benefits, but the conclusion of a large Cochrane review was that the results remain uncertain.

A well designed and run multi-centre randomised controlled trial was performed to try and determine if this intervention was of benefit (OPTIMSE: Optimisation of Cardiovascular Management to Improve Surgical Outcome).

Patients were randomised to a cardiac output–guided hemodynamic therapy algorithm for intravenous fluid and a drug to increase heart muscle contraction (the inotrope, dopexamine) during and 6 hours following surgery (intervention group) or to usual care (control group).

The primary outcome measure was the relative risk (RR) of a composite of 30-day moderate or major complications and mortality.

OPTIMSE: reported results

Focusing on the primary outcome measure, there were 158/364 (43.3%) and 134/366 (36.6%) patients with complication/mortality in the control and intervention group respectively. Numerically at least, the results appear better in the intervention group compared with controls.

Using the standard statistical approach, the relative risk (95% confidence interval) = 0.84 (0.70-1.01), p=0.07 and absolute risk difference = 6.8% (−0.3% to 13.9%), p=0.07. This is interpreted as there being insufficient evidence that the relative risk for complication/death is different to 1.0 (all analyses replicated below). The authors reasonably concluded that:

In a randomized trial of high-risk patients undergoing major gastrointestinal surgery, use of a cardiac output–guided hemodynamic therapy algorithm compared with usual care did not reduce a composite outcome of complications and 30-day mortality.

A difference does exist between the groups, but is not judged to be a sufficient difference using this conventional approach.

OPTIMSE: Bayesian analysis

Repeating the same analysis using Bayesian inference provides an alternative way to think about this result. What are the chances the two groups actually do have different results? What are the chances that the two groups have clinically meaningful differences in results? What proportion of patients stand to benefit from the new intervention compared with usual care?

With regard to prior knowledge, this analysis will not presume any prior information. This makes the point that prior information is not always necessary to draw a robust conclusion. It may be very reasonable to use results from pre-existing meta-analyses to specify a weak prior, but this has not been done here. Very grateful to John Kruschke for the excellent scripts and book, Doing Bayesian Data Analysis.

The results of the analysis are presented in the graph below. The top panel is the prior distribution. All proportions for the composite outcome in both the control and intervention group are treated as equally likely.

The middle panel contains the main findings. This is the posterior distribution generated in the analysis for the relative risk of the composite primary outcome (technical details in script below).

The mean relative risk = 0.84 which as expected is the same as the frequentist analysis above. Rather than confidence intervals, in Bayesian statistics a credible interval or region is quoted (HDI = highest density interval is the same). This is philosphically different to a confidence interval and says:

Given the observed data, there is a 95% probability that the true RR falls within this credible interval.

This is a subtle distinction to the frequentist interpretation of a confidence interval:

Were I to repeat this trial multiple times and compute confidence intervals, there is a 95% probability that the true RR would fall within these confidence intervals.

This is an important distinction and can be extended to make useful probabilistic statements about the result.

The figures in green give us the proportion of the distribution above and below 1.0. We can therefore say:

The probability that the intervention group has a lower incidence of the composite endpoint is 97.3%.

It may be useful to be more specific about the size of difference between the control and treatment group that would be considered equivalent, e.g. 10% above and below a relative risk = 1.0. This is sometimes called the region of practical equivalence (ROPE; red text on plots). Experts would determine what was considered equivalent based on many factors. We could therefore say:

The probability of the composite end-point for the control and intervention group being equivalent is 22%.

Or, the probability of a clinically relevant difference existing in the composite endpoint between control and intervention groups is 78%

optimise_primary_bayesFinally, we can use the 200 000 estimates of the probability of complication/death in the control and intervention groups that were generated in the analysis (posterior prediction). In essence, we can act like these are 2 x 200 000 patients. For each “patient pair”, we can use their probability estimates and perform a random draw to simulate the occurrence of complication/death. It may be useful then to look at the proportion of “patients pairs” where the intervention patient didn’t have a complication but the control patient did:

Using posterior prediction on the generated Bayesian model, the probability that a patient in the intervention group did not have a complication/death when a patient in the control group did have a complication/death is 28%.

Conclusion

On the basis of a standard statistical analysis, the OPTIMISE trial authors reasonably concluded that the use of the intervention compared with usual care did not reduce a composite outcome of complications and 30-day mortality.

Using a Bayesian approach, it could be concluded with 97.3% certainty that use of the intervention compared with usual care reduces the composite outcome of complications and 30-day mortality; that with 78% certainty, this reduction is clinically significant; and that in 28% of patients where the intervention is used rather than usual care, complication or death may be avoided.

# OPTIMISE trial in a Bayesian framework
# JAMA. 2014;311(21):2181-2190. doi:10.1001/jama.2014.5305
# Ewen Harrison
# 15/02/2015

# Primary outcome: composite of 30-day moderate or major complications and mortality
N1 <- 366
y1 <- 134
N2 <- 364
y2 <- 158
# N1 is total number in the Cardiac Output–Guided Hemodynamic Therapy Algorithm (intervention) group
# y1 is number with the outcome in the Cardiac Output–Guided Hemodynamic Therapy Algorithm (intervention) group
# N2 is total number in usual care (control) group
# y2 is number with the outcome in usual care (control) group

# Risk ratio
(y1/N1)/(y2/N2)

library(epitools)
riskratio(c(N1-y1, y1, N2-y2, y2), rev="rows", method="boot", replicates=100000)

# Using standard frequentist approach
# Risk ratio (bootstrapped 95% confidence intervals) = 0.84 (0.70-1.01) 
# p=0.07 (Fisher exact p-value)

# Reasonably reported as no difference between groups.

# But there is a difference, it just not judged significant using conventional
# (and much criticised) wisdom.

# Bayesian analysis of same ratio
# Base script from John Krushcke, Doing Bayesian Analysis

#------------------------------------------------------------------------------
source("~/Doing_Bayesian_Analysis/openGraphSaveGraph.R")
source("~/Doing_Bayesian_Analysis/plotPost.R")
require(rjags) # Kruschke, J. K. (2011). Doing Bayesian Data Analysis, Academic Press / Elsevier.
#------------------------------------------------------------------------------
# Important
# The model will be specified with completely uninformative prior distributions (beta(1,1,).
# This presupposes that no pre-exisiting knowledge exists as to whehther a difference
# may of may not exist between these two intervention. 

# Plot Beta(1,1)
# 3x1 plots
par(mfrow=c(3,1))
# Adjust size of prior plot
par(mar=c(5.1,7,4.1,7))
plot(seq(0, 1, length.out=100), dbeta(seq(0, 1, length.out=100), 1, 1), 
         type="l", xlab="Proportion",
         ylab="Probability", 
         main="OPTIMSE Composite Primary Outcome\nPrior distribution", 
         frame=FALSE, col="red", oma=c(6,6,6,6))
legend("topright", legend="beta(1,1)", lty=1, col="red", inset=0.05)

# THE MODEL.
modelString = "
# JAGS model specification begins here...
model {
# Likelihood. Each complication/death is Bernoulli. 
for ( i in 1 : N1 ) { y1[i] ~ dbern( theta1 ) }
for ( i in 1 : N2 ) { y2[i] ~ dbern( theta2 ) }
# Prior. Independent beta distributions.
theta1 ~ dbeta( 1 , 1 )
theta2 ~ dbeta( 1 , 1 )
}
# ... end JAGS model specification
" # close quote for modelstring

# Write the modelString to a file, using R commands:
writeLines(modelString,con="model.txt")


#------------------------------------------------------------------------------
# THE DATA.

# Specify the data in a form that is compatible with JAGS model, as a list:
dataList =  list(
    N1 = N1 ,
    y1 = c(rep(1, y1), rep(0, N1-y1)),
    N2 = N2 ,
    y2 = c(rep(1, y2), rep(0, N2-y2))
)

#------------------------------------------------------------------------------
# INTIALIZE THE CHAIN.

# Can be done automatically in jags.model() by commenting out inits argument.
# Otherwise could be established as:
# initsList = list( theta1 = sum(dataList$y1)/length(dataList$y1) , 
#                   theta2 = sum(dataList$y2)/length(dataList$y2) )

#------------------------------------------------------------------------------
# RUN THE CHAINS.

parameters = c( "theta1" , "theta2" )     # The parameter(s) to be monitored.
adaptSteps = 500              # Number of steps to "tune" the samplers.
burnInSteps = 1000            # Number of steps to "burn-in" the samplers.
nChains = 3                   # Number of chains to run.
numSavedSteps=200000           # Total number of steps in chains to save.
thinSteps=1                   # Number of steps to "thin" (1=keep every step).
nIter = ceiling( ( numSavedSteps * thinSteps ) / nChains ) # Steps per chain.
# Create, initialize, and adapt the model:
jagsModel = jags.model( "model.txt" , data=dataList , # inits=initsList , 
        n.chains=nChains , n.adapt=adaptSteps )
# Burn-in:
cat( "Burning in the MCMC chain...\n" )
update( jagsModel , n.iter=burnInSteps )
# The saved MCMC chain:
cat( "Sampling final MCMC chain...\n" )
codaSamples = coda.samples( jagsModel , variable.names=parameters , 
        n.iter=nIter , thin=thinSteps )
# resulting codaSamples object has these indices: 
#   codaSamples[[ chainIdx ]][ stepIdx , paramIdx ]

#------------------------------------------------------------------------------
# EXAMINE THE RESULTS.

# Convert coda-object codaSamples to matrix object for easier handling.
# But note that this concatenates the different chains into one long chain.
# Result is mcmcChain[ stepIdx , paramIdx ]
mcmcChain = as.matrix( codaSamples )

theta1Sample = mcmcChain[,"theta1"] # Put sampled values in a vector.
theta2Sample = mcmcChain[,"theta2"] # Put sampled values in a vector.

# Plot the chains (trajectory of the last 500 sampled values).
par( pty="s" )
chainlength=NROW(mcmcChain)
plot( theta1Sample[(chainlength-500):chainlength] ,
            theta2Sample[(chainlength-500):chainlength] , type = "o" ,
            xlim = c(0,1) , xlab = bquote(theta[1]) , ylim = c(0,1) ,
            ylab = bquote(theta[2]) , main="JAGS Result" , col="skyblue" )

# Display means in plot.
theta1mean = mean(theta1Sample)
theta2mean = mean(theta2Sample)
if (theta1mean > .5) { xpos = 0.0 ; xadj = 0.0
} else { xpos = 1.0 ; xadj = 1.0 }
if (theta2mean > .5) { ypos = 0.0 ; yadj = 0.0
} else { ypos = 1.0 ; yadj = 1.0 }
text( xpos , ypos ,
            bquote(
                "M=" * .(signif(theta1mean,3)) * "," * .(signif(theta2mean,3))
            ) ,adj=c(xadj,yadj) ,cex=1.5  )

# Plot a histogram of the posterior differences of theta values.
thetaRR = theta1Sample / theta2Sample # Relative risk
thetaDiff = theta1Sample - theta2Sample # Absolute risk difference

par(mar=c(5.1, 4.1, 4.1, 2.1))
plotPost( thetaRR , xlab= expression(paste("Relative risk (", theta[1]/theta[2], ")")) , 
    compVal=1.0, ROPE=c(0.9, 1.1),
    main="OPTIMSE Composite Primary Outcome\nPosterior distribution of relative risk")
plotPost( thetaDiff , xlab=expression(paste("Absolute risk difference (", theta[1]-theta[2], ")")) ,
    compVal=0.0, ROPE=c(-0.05, 0.05),
    main="OPTIMSE Composite Primary Outcome\nPosterior distribution of absolute risk difference")

#-----------------------------------------------------------------------------
# Use posterior prediction to determine proportion of cases in which 
# using the intervention would result in no complication/death 
# while not using the intervention would result in complication death 

chainLength = length( theta1Sample )

# Create matrix to hold results of simulated patients:
yPred = matrix( NA , nrow=2 , ncol=chainLength ) 

# For each step in chain, use posterior prediction to determine outcome
for ( stepIdx in 1:chainLength ) { # step through the chain
    # Probability for complication/death for each "patient" in intervention group:
    pDeath1 = theta1Sample[stepIdx]
    # Simulated outcome for each intervention "patient"
    yPred[1,stepIdx] = sample( x=c(0,1), prob=c(1-pDeath1,pDeath1), size=1 )
    # Probability for complication/death for each "patient" in control group:
    pDeath2 = theta2Sample[stepIdx]
    # Simulated outcome for each control "patient"
    yPred[2,stepIdx] = sample( x=c(0,1), prob=c(1-pDeath2,pDeath2), size=1 )
}

# Now determine the proportion of times that the intervention group has no complication/death
# (y1 == 0) and the control group does have a complication or death (y2 == 1))
(pY1eq0andY2eq1 = sum( yPred[1,]==0 & yPred[2,]==1 ) / chainLength)
(pY1eq1andY2eq0 = sum( yPred[1,]==1 & yPred[2,]==0 ) / chainLength)
(pY1eq0andY2eq0 = sum( yPred[1,]==0 & yPred[2,]==0 ) / chainLength)
(pY10eq1andY2eq1 = sum( yPred[1,]==1 & yPred[2,]==1 ) / chainLength)

# Conclusion: in 27% of cases based on these probabilities,
# a patient in the intervention group would not have a complication,
# when a patient in control group did. 

House of God and bile leak after liver resection

While death after liver resection is reported at ever lower levels, complication rates remain stubbornly high. Morbidity is associated with longer intensive care and hospital stay, and poorer oncological outcomes. Variability in the reported rate of complications may partly be due to differences in definitions. The International Study Group for Liver Surgery (ISGLS) has now published definitions in three areas: liver failure and haemorrhage after hepatectomy, and bile leak after liver and pancreas surgery. These have stimulated debate and different predictive models vie for supremacy. In HPB January 2015, the ISGLS use their definition and grading system to prospectively evaluate bile leak after liver resection.

Of 949 patients in 11 centres undergoing liver resection for predominately colorectal liver metastases, 7.3% were diagnosed with a bile leak. Of these, just over half required something done about it. “If you don’t take a temperature you can’t find a fever”, a medical truism from Samuel Shem’s 1978 novel The House of God, equally applies here: grade A bile leaks requiring no/little change in patients’ management are only diagnosed in the presence of an abdominal drain. Of course, a patient without a drain found to have a bile leak, by definition, has a grade B leak. Yet, even in those with seemingly inconsequential grade A bile leaks, a greater number and severity of other complications were seen, together with a longer hospital stay (median 14 versus 7 days on average). Indeed, bile leak was significantly associated with intra-operative blood loss which may explain these poor outcomes.

There is little strong evidence supporting drainage after liver resection, yet in this series drains were used in 64% of patients. In nearly half of patients with a bile leak and a drain, there was no significant change in the clinical course; the authors suggest that up to 94% of patients did not benefit from intra-operative drainage.

In this up-to-date series, the overall complication rate of 38% is striking. Although only 8.8% of complications were classified as severe, this rate is not improving. Interventions to reduce this rate should surely be a priority in seeking to improve long-term liver resection outcomes.

From HPB January 2015

Considerations in the Early Termination of Clinical Trials in Surgery

One of the most difficult situations when running a clinical trial is the decision to terminate the trial early. But it shouldn’t be a difficult decision. With clear stopping rules defined before the trial starts, it should be straightforward to determine when the effect size is large enough that no further patients require to be randomised to definitively answer the question.

Whether there is benefit to leaving a temporary plastic tube drain in the belly after an operation to remove the head of the pancreas is controversial. It may help diagnose and treat the potential disaster that occurs when the join between pancreas and bowel leaks. Others think that the presence of the drain may in fact make a leak more likely.

This question was tackled in an important randomised clinical trial.

A randomised prospective multicenter trial of pancreaticoduodenectomy with and without routine intraperitoneal drainage

The trial was stopped early because there were more deaths in the group who didn’t have a drain. The question that remains: was it the absence of the drain which caused the deaths? As important, was stopping the trial at this point the correct course of action?

My feeling, the lack of a drain was not definitively demonstrated to be the cause of the deaths. And I think the trial was stopped too early. Difficult issues discussed in our letter in Annals of Surgery about it.

Ethics and statistics collide in decisions relating to the early termination of clinical trials. Investigators have a fundamental responsibility to stop a trial where an excess of harm is seen in one of the arms. Decisions on stopping are not straightforward and must balance the potential risk to trial patients against the likelihood that in fact there is no difference in outcome between groups. Indeed, in early termination, the potential loss of generalizable knowledge may itself harm future patients.

We therefore read with interest the article by Van Buren and colleagues (1) and congratulate the authors on the first multicenter randomized trial on the controversial topic of surgical drains after pancreaticoduodenectomy. As the authors report, the trial was stopped by the Data Safety Monitoring Board after only 18% recruitment due to a numerical excess of deaths in the “no-drain” arm.

We would be interested in learning from the process that led to the decision to terminate the trial. A common method to monitor adverse events advocated by the CONSORT group is to define formal sequential stopping rules based on the limit of acceptable adverse event rates (2). These guidelines suggest that authors report the number of planned “looks” at the data, the statistical methods used including any formal stopping rules, and whether these were planned before trial commencement.

This information is often not included in published trial reports, even when early termination has occurred (3). We feel that in the context of important surgical trials, these guidelines should be adhered to.

Early termination can reduce the statistical power of a trial. This can be addressed by examining results as data accumulate, preferably by an independent data monitoring committee. However, performing multiple statistical examinations of accumulating data without appropriate correction can lead to erroneous results and interpretation (4). For example, if accumulating data from a trial are examined at 5 interim analyses that use a P value of 0.05, the overall false-positive rate is nearer to 19% than to the nominal 5%.

Several group sequential statistical methods are available to adjust for multiple analyses (5,6) and their use should be prespecified in the trial protocol. Stopping rules may be formed by 2 broad methods, either using a Bayesian approach to evaluate the proportion of patients with adverse effects or using a hypothesis testing approach with a sequential probability ratio test to determine whether the acceptable adverse effects rate has been exceeded. Data are compared at each interim analysis and decisions based on prespecified criteria. As an example, stopping rules for harm from a recent study used modified Haybittle-Peto boundaries of 3 SDs in the first half of the study and 2 SDs in the second half (7). The study of Van Buren and colleagues is reported to have been stopped after 18% recruitment due to an excess of 6 deaths in the “no-drain” arm. The relative risk of death at 90 days in the “no-drain” group versus the “drain” group was 3.94 (95% confidence interval, 0.87–17.90), equivalent to a difference of 1.78 SD. The primary outcome measure was any grade 2 complication or more and had a relative risk of 1.32 (5% confidence interval, 1.00–1.75), or 1.95 SD.

The decision to terminate a trial early is not based on statistics alone. Judgements must be made using all the available evidence, including the biological and clinical plausibility of harm and the findings of previous studies. Statistical considerations should therefore be used as a starting point for decisions, rather than a definitive rule.

The Data Safety Monitoring Board for the study of Van Buren and colleagues clearly felt that there was no option other than to terminate the trial. However, at least on statistical grounds, this occurred very early in the trial using conservative criteria. The question remains therefore is the totality of evidence convincing that the question posed has been unequivocally answered? We would suggest that this is not the case. In general terms, stopping a clinical trial early is a rare event that sends out a message that, because of the “sensational” effect, may have greater impact on the medical community than intended, making future studies in that area challenging.

1. Van Buren G, Bloomston M, Hughes SJ, et al. A randomised prospective multicenter trial of pancreaticoduodenectomy with and without routine intraperitoneal drainage. Ann Surg. 2014;259: 605–612.

2. Moher D, Hopewell S, Schulz KF, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trial. BMJ. 2010;340:c869.

3. Montori VM, Devereaux PJ, Adhikari NK, et al. Randomized trials stopped early for benefit: a systematic review. JAMA. 2005;294:2203–2209.

4. Geller NL, Pocock SJ. Interim analyses in randomized clinical trials: ramifications and guidelines for practitioners. Biometrics. 1987;43:213–223.

5. Pocock SJ. When to stop a clinical trial. BMJ. 1992;305:235–240.

6. Berry DA. Interim analyses in clinical trials: classical vs. Bayesian approaches. Stat Med. 1985;4:521– 526.

7. Connolly SJ, Pogue J, Hart RG, et al. Effect of clopidogrel added to aspirin in patients with atrial fibrillation. N Engl J Med. 2009;360:2066– 2078.