You can see this in the standard Spiegelhalter paper – http://www.medicine.cf.ac.uk/media/filer_public/2010/10/11/journal_club_-_spiegelhalter_stats_in_med_funnel_plots.pdf – particularly Figure 2 on page 1187. Spiegelhalter basically works backwards from exact binomial confidence intervals to define the control limits, while the APHO tool simply uses Wilson score confidence intervals.

The problem, as you know, is that a CI is a statement of uncertainty about the population mean given the sample mean, not, in general, the other way round.
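To make the difference concrete, here is a minimal sketch. It assumes the Wilson interval is simply evaluated at the target rate, and it uses a crude quantile-based limit taken from the binomial sampling distribution under that target. The target rate of 0.05, the sample sizes and all function names are illustrative assumptions, and the quantile version leaves out any interpolation or continuity adjustment, so it is not a reproduction of either tool.

```python
import math


def wilson_interval(p, n, z=1.96):
    """Wilson score interval for a proportion p based on n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def binom_quantile(q, n, p):
    """Smallest count k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    cum = 0.0
    for k in range(n + 1):
        cum += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        if cum >= q:
            return k
    return n


def quantile_limits(target, n, alpha=0.05):
    """Limits read off the sampling distribution of the observed proportion
    when the true rate equals the target (no interpolation)."""
    lower = binom_quantile(alpha / 2, n, target) / n
    upper = binom_quantile(1 - alpha / 2, n, target) / n
    return lower, upper


target = 0.05  # illustrative target rate, not taken from any real data
for n in (20, 21, 22, 50, 100, 400):
    w_lo, w_hi = wilson_interval(target, n)
    q_lo, q_hi = quantile_limits(target, n)
    print(f"n={n:4d}  Wilson=({w_lo:.3f}, {w_hi:.3f})  quantile=({q_lo:.3f}, {q_hi:.3f})")
```

The Wilson limits vary smoothly with n, whereas the quantile limits jump whenever the critical count changes, because an observed proportion can only take the values k/n.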

The APHO spreadsheet is quite widely used, and so far our efforts to get it updated have not come to anything. One of my colleagues has been working on post-operative mortality, so she’s been putting a bit more effort into getting the APHO spreadsheet corrected; we even have a replacement Excel spreadsheet as of a couple of weeks ago. Now we just need to continue convincing other people until we can get it corrected.

A comparison with binomial limits might still be interesting, but certainly the more appropriate comparison would come from the model. Comparing to the overall mean is probably daft. If the model were fully adjusting for everything appropriately (in some magic manner), then comparing to carefully chosen target values could be interesting.

Aargh dissertations – actually, I’m also not too far off track myself, though it’s not quite as exciting as that.

Interested in your line of thinking. The funnel plot control limits here are produced in a standard manner based on a population mean and, as you know, simply represent the sampling distribution around that mean. This is just how Public Health England would do them: http://www.apho.org.uk/default.aspx?RID=39403

However, they are definitely not correct for the purpose they are used for here, as the points are random effects estimates and so are shrunk towards the mean. With the full model, control limits could be simulated. More broadly, comparing the individuals to a population mean is probably not useful anyway. Have now got a full Bayesian model working with cross validation that is probably a more robust way of identifying divergent practice. Coming to a dissertation near you!
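A rough illustration of the shrinkage point, and of what simulating limits might look like. This is only a sketch: it swaps the full random-effects model for a simple beta-binomial style shrinkage towards the overall rate, and the rate, caseload, prior strength and function names are all hypothetical choices.

```python
import random
import statistics

random.seed(1)

TARGET = 0.05   # assumed overall event rate
N_CASES = 100   # assumed number of cases per unit
PRIOR_N = 200   # hypothetical prior strength standing in for the random-effects model


def simulate_unit():
    """Simulate one unit whose true rate equals the overall rate."""
    events = sum(random.random() < TARGET for _ in range(N_CASES))
    raw = events / N_CASES
    # Shrunken estimate: weighted average of the raw proportion and the overall rate
    shrunk = (events + PRIOR_N * TARGET) / (N_CASES + PRIOR_N)
    return raw, shrunk


def empirical_limits(values, alpha=0.05):
    """Approximate central (1 - alpha) range of simulated estimates."""
    ordered = sorted(values)
    lower = ordered[int(alpha / 2 * len(ordered))]
    upper = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lower, upper


pairs = [simulate_unit() for _ in range(2000)]
raw_values = [raw for raw, _ in pairs]
shrunk_values = [shrunk for _, shrunk in pairs]

raw_sd = statistics.pstdev(raw_values)
shrunk_sd = statistics.pstdev(shrunk_values)
raw_limits = empirical_limits(raw_values)
shrunk_limits = empirical_limits(shrunk_values)

print("raw    sd and limits:", raw_sd, raw_limits)
print("shrunk sd and limits:", shrunk_sd, shrunk_limits)
```

Even when every unit shares the same true rate, the shrunken estimates spread far less than the raw proportions, so limits built for raw proportions are too wide for them.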

I’m going to ignore the topic entirely and instead point out that the control limits on your funnel plot are incorrect: methods designed for making statements about the population mean based on the sample mean don’t work well in the other direction. You need to work backwards a bit instead, and perhaps take account of the discrete nature of the sample data (to get some really interesting and spiky ‘funnels’).

That said, it works well enough for a blogpost.

Yours pedantically,

Matt

Mr. Marshall, are you under the impression that Dr. Allen is representative of your average reader in terms of his grasp of the analysis and reporting of biostatistics? If so, then his point is made all the more strongly because, I assure you, the average reader does not understand confidence intervals, and is therefore very likely to believe that the data point you present and the band it’s located in represents statistical “fact.” (I know many colleague physicians who struggle with these nuances of biostatistics, so I speak from some perspective on this.) In fact, your follow-up editorial comment on your site, which amounts to “some data is better than no data,” indicates that you continue to miss the point. Bad data, which is what you are in most cases offering on your site, is in fact worse than no data because it misleads and leads to mistakenly confident decision making.

That a hospital participates in NSQIP seems to be insufficient in itself to improve outcomes. At least here:

http://www.ncbi.nlm.nih.gov/m/pubmed/25647205/

Which makes sense. Measurement is not enough if the results are not considered and acted upon.

Collaboration between surgeons and hospitals helps improvement. The building of trust in the early days of collaboration is important: trust in data sharing and in sharing quality improvement strategies.

Where does the public release of outcome data fit into that? I’m not sure. We had a famous incident of a politician stating that it was completely unacceptable that half of hospitals were below average.

From the blog you referenced:

“The right question is – will it leave Bobby better off? I think it will. Instead of choosing based on a sample size of one (his buddy who also had lung surgery), he might choose based on sample size of 40 or 60 or 80. Not perfect. Large confidence intervals? Sure. Lots of noise? Yup. Inadequate risk-adjustment? Absolutely. But, better than nothing? Yes. A lot better.”

As things stand, this is where I disagree. Bobby is currently not better off.

And shouldn’t they be tracking outcomes using their own data anyway? Because most hospitals do not. For example, the NSQIP is a better metric, based on clinical data, and 600 hospitals use it to track outcomes… while 3000 do not. (See e.g. https://blogs.sph.harvard.edu/ashish-jha/the-propublica-report-card-a-step-in-the-right-direction/)
