JAMA retraction after miscoding – new Finalfit function to check recoding

Riinu and I are sitting in Frankfurt airport discussing the paper retracted in JAMA this week.

During analysis, the treatment variable coded [1,2] was recoded in error to [1,0]. The results of the analysis were therefore reversed. The lung-disease self-management program actually resulted in more attendances at hospital, rather than fewer as had been originally reported.  

Recode check

Checking of recoding is such an important part of data cleaning – we emphasise this a lot in HealthyR courses – but of course mistakes happen.

Our standard approach is this:

The miscode should be obvious.

check_recode()

However, mistakes may still happen and be missed. So we’ve bashed out a useful function that can be applied to your whole dataset. This is not to replace careful checking, but may catch something that has been missed. 

The function takes a data frame or tibble and fuzzy matches variable names. It produces crosstables similar to above for all matched variables. 

So if you have coded something from sex to sex.factor it will be matched. The match is hungry so it is more likely to match unrelated variables than to miss similar variables. But if you recode death to mortality it won’t be matched. 

Here’s a walk through.

As can be seen, the output takes the form of a list length 2. The first is an index of matched variables. The second is crosstables as tibbles for each variable combination. sex.factor2 can be seen as being miscoded. sex.factor2 and age.factor2 have been matched, but should be ignored.

Numerics are not included by default. To do so:

Miscoding in survival::colon dataset?

When doing this just today, we noticed something strange in our example dataset, survival::colon.

The variable node4 should be a binary recode of nodes greater than 4. But as can be seen, something is not right!

We’re interested in any explanations those working with this dataset might have.

There we are then, a function that may be useful in detecting miscoding. So useful in fact, that we have immediately found probable miscoding in a standard R dataset.