
A Response to Matt Crawford


In response to my last blog post Matt Crawford has written a critique. 

I found it deeply unsettling and uncomfortable to read. It's a bit creepy: he seems to have searched the internet for 10+ year old photos of me at university, amongst others, and the piece is riddled with ad hominem and name calling. This doesn't surprise me terribly; he's been harassing and abusing me from behind a Twitter block, with frankly weird transphobic stuff like this:



So seeing that he's been spending time searching for old photos of me unrelated to the actual question is... yeah, a bit unsettling?

The entire article is pretty much nit-picking edge cases and technicalities, combined with some statistical sleight of hand.

In this response I will:
  • Examine each of his six claims
  • Summarise what they mean for the credibility of the original blog post
  • Try to "steelman", put together the harshest possible criticism I can of my own work to highlight potential weaknesses.
  • Try and put together what the analysis would look like if I did follow all his rules.

Part 1: The six claims made:

Before getting into the meat of his six criticisms, I have to say it's reasonably clear he doesn't understand the study design, or matching in clinical studies in general. That's OK, but to be clear: there is no possibility here that there are "details" that "slipped into an appendix or out of the paper" that change this.


If this study had simply said "two cohorts" then I would have been obligated to check whether matching occurred but was not described. Matching did not occur, by definition: matching and the recruitment of consecutive patients are contradictory.

For those without a background in medical research, patient matching occurs when we only recruit/include patients for one group that have the same characteristics (on some pre-defined subset of characteristics) as patients already recruited in the other arm. This is most commonly "age and sex" matching, where a 47 year old female in one group means we recruit/include a 46-48 year old female in the other group, although it can be on other factors such as disease severity or the presence of important comorbidities. On the other hand, inclusion of consecutive patients means we include every patient that meets our criteria with no breaks; this is done to stop researchers cherry-picking. So if you include 50 consecutive patients then it's the next 50 that come through the doors, no questions asked.

A little thinking will explain why no appendix specifying "oh yes, we also matched patients" can turn up for a study of consecutive patients.

I will now read through each of the six issues.

"1:  Correlations"


I did actually point out in the letter that the underlying independence assumption was not technically valid. Crawford claims that this is "frightening", and also that "having built an artificial intelligence system that crunched dozens of variables for the purpose of stock trading, I can guarantee that he is very wrong".

Perhaps having worked in intensive care units, as I have, and having practiced medicine, as I have, is not as good a background for commenting on the expected strength of correlation between medical diagnoses and clinical features as... stock trading? On the other hand, though, it potentially... is.

There is no actual reason given in this critique beyond "well actually I think it will have a large effect". 

There is no perfect solution here (other than studies routinely providing IPD), as the correlation between outcomes cannot be derived from summary data. Virtually all similar analyses are susceptible to the same criticism, and similarly treat different baseline characteristics as independent. Such analyses exposed the largest and most deadly medical frauds in history.

The de facto call to not analyse overly similar baseline groups unless we have "correlation tables" would throw out the entire Carlisle body of work, would mean we would never have caught about half of the most retracted scientists, among them Dr Boldt, and we would still be killing patients by their thousands with HES.

No calculation of the degree of correlation required to make this distribution of results plausible under the null hypothesis is given. Instead this is just an argument from eminence based on having traded stocks. There's not a lot to respond to here.

"2: Malignancy"


In this section the author points out that for "malignancy" the chance of getting a p value over 0.4 is more than 60%, and that these discrepancies might add up to an order of magnitude over a set of 22 variables. I agree, which is why I said it was "quick and dirty" rather than calculating the exact likelihood for each. For the record, here is the exact value of P(p > 0.4) for every variable, and their product. It does indeed shift the figure from the order of 1 in 100,000 to the order of 1 in 10,000. This seems a lot like nit-picking at the edges.
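For anyone who wants to reproduce that kind of calculation, here is a minimal sketch (Python with scipy, not the code behind the original post): for each dichotomous variable, sum the null probabilities of every split whose two-sided Fisher p value exceeds 0.4, then take the product across variables. The event totals below are hypothetical placeholders, not the actual counts from the paper.

```python
from math import prod
from scipy.stats import hypergeom, fisher_exact

def prob_p_above(threshold, n1, n2, events):
    """Exact null probability that a two-sided Fisher p value exceeds
    `threshold`, with group sizes n1/n2 and the total event count fixed."""
    dist = hypergeom(n1 + n2, events, n1)   # how many events land in group 1
    total = 0.0
    for k in range(max(0, events - n2), min(events, n1) + 1):
        table = [[k, n1 - k], [events - k, n2 - (events - k)]]
        _, p = fisher_exact(table)
        if p > threshold:
            total += dist.pmf(k)
    return total

# Placeholder event totals, one per dichotomous variable (not the real counts).
event_totals = [30, 21, 14, 9, 5]
per_variable = [prob_p_above(0.4, 47, 47, e) for e in event_totals]
print(per_variable)
print(prod(per_variable))   # joint probability if the variables were independent
```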


He also has misread the statement as saying that the assumption of independence underestimates improbability. I did no such thing, and he needs to read whole sentences in their entirety.

"3: Primary Diagnosis: Dumbass"


Aside from the title being a bit odious, this is the first section where we see possible statistical sleight of hand. He points out that as primary diagnoses are definitionally correlated, because being in one precludes being in the others, they should be handled as one. As far as that goes it's reasonable. I did consider doing a 5x2 Pearson chi-square table here but decided to prioritise consistency (ordering of matrices for a 2x2 and an nx2 Fisher can actually be done slightly differently even by the same programs). So up to that point it's not unfair.


But then we get to the possible sleight of hand. To calculate this he sequentially takes the probabilities of the 4 outcomes with the smallest numbers of events without replacement, multiplies them, and so treats the probability of the outcome with the most events as one. That's simply not how this works, for obvious reasons. One of those is that if he started at the top of the table and just worked down he'd get a different result. Also his numbers are badly wrong, but we'll get to that after we see the actual result.

We do actually have valid techniques to estimate the probability of 5x2 tables without falsely "dropping" a particularly convenient variable; the most common is a 5x2 Pearson chi-square.

A 5x2 Pearson chi-square gives a p value of 0.99932.
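As an illustration of the technique (not the original analysis code), here is a sketch using scipy; the per-arm counts are placeholders with roughly the right margins rather than the actual published table, so the p value it prints will differ from the 0.99932 quoted above.

```python
from scipy.stats import chi2_contingency

# Illustrative 5x2 table: five mutually exclusive primary diagnoses (rows)
# by study arm (columns), 47 patients per arm. Placeholder counts only.
primary_dx = [
    [19, 18],
    [10, 11],
    [ 7,  7],
    [ 6,  6],
    [ 5,  5],
]
chi2, p, dof, _ = chi2_contingency(primary_dx)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.5f}")
```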

So the chance for this "primary diagnosis" table result alone is going to be roughly 1000 to 1 (I say roughly for reasons I'll outline at the end of this post, where I criticise my own work as harshly as possible).

Again this seems like nitpicking a result in a way that doesn't actually change conclusions.

This is partly why I chose to treat every diagnosis as pairwise so that roughly the same test was being applied throughout.

We can apply a 5x2 test (in which case we get a p value of >0.999) which is probably more valid for this particular variable, but it doesn't mean quite the same thing as the test we've used for the others. It's a choice. Neither is particularly good for the original paper though?

What we can't do is hack together some test from high school combinatorics and apply it as though decades of biostatisticians examining 2xn contingency tables in clinical research just... haven't thought of or heard of high school maths.

For starters let's just examine the starting point of his method that "there are 1,793,220 distinguishable ways to paint the 94 balls evenly split into two urns each one of 5 colors with numeracy (ordered) (37, 21, 14, 12, 10) as in the example above"

That's... wildly wrong?

This number obviously arises from the product of each of the numbers above plus one. I.e. there are 38 ways to separate the red balls (from 0 in the first urn and 37 in the second to the reverse), and for each of those ways there are then 22 ways to spread the blue balls, and for each of those there are 15 ways to spread the violet balls, etc. Therefore the number of ways to "paint the balls", as he puts it, is 38*22*15*13*11, or 1,793,220.

Can you see the massive problems here? Firstly, there aren't 22 ways to spread the blue balls for all 38 ways of spreading the red balls. If the first urn of 47 balls has, say, 35 painted red... then you can't paint another 16 of them blue... because there are only 12 balls left?

So that's obviously just... not correct. Secondly, once all the red, blue, violet, and green balls have been split, the indigo balls obviously only have a single arrangement, not 11 for every other arrangement?

This number is just indefensible. It requires a series of nested mistakes and no examination of assumptions. Not all problems can be solved with high school techniques.
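For anyone who wants to check, here is a quick brute-force sketch (an illustration, not a probability calculation) contrasting the naive product with the number of colour splits that are actually possible when the first urn must hold exactly 47 balls, and where the last colour's count is forced rather than free:

```python
from itertools import product

totals = [37, 21, 14, 12, 10]       # balls of each colour
naive = 1
for n in totals:
    naive *= n + 1                  # the unconstrained product: 1,793,220

# Count only the splits where the first urn actually holds 47 balls;
# the indigo count is determined by the other four, not chosen freely.
valid = 0
for r, b, v, g in product(range(38), range(22), range(15), range(13)):
    i = 47 - (r + b + v + g)
    if 0 <= i <= 10:
        valid += 1

print(naive, valid)
```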

As one last side note: if we don't allow definitionally correlated variables then, for instance, the "comorbidities" would have to be dropped completely, because "no comorbidity" must be strongly dependent on the other variables; but those categories aren't mutually exclusive, so then we'd be left with no test at all.

"4:  Practicality"


In this section Crawford asks "Why would anyone fake a non-essential aspect of the paper like this?"

Of course that question is only relevant if one assumes the rest of the paper is genuine, that is, that there is a real set of 47 consecutive patients in each arm who actually exist, and that whoever was responsible (and I have explicitly not blamed any particular author, or even specified that it must be an author) went back and changed their baseline characteristics.

I have of course made no such assumption.

"5: Cherry Picking":


The author claims I looked only at 16 p values without considering the study as a whole. That's... simply not true? He then launches into a (slightly tangential but correct) explanation that 10 p values of >0.9 are very suspicious across 10 variables, but expected if there are 1,000 variables. This is correct (and an example of "floating numerators").

But it's also irrelevant and a bit of a sleight of hand. 

The implication seems to be that I have done this without reporting how many dichotomous variables there were (which is factually and trivially untrue) or perhaps that the number of total variables is somehow unknowable (which is bizarre).

Yes this number of high p values wouldn't be unusual if a study reported 1,000 dichotomous baseline variables, but... it didn't?
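To put a number on his (correct but irrelevant) point, here is a tiny sketch; it assumes a continuous uniform p value under the null, so that P(p > 0.9) = 0.1, and ignores the discreteness discussed further down:

```python
from scipy.stats import binom

# Ten p values above 0.9 is astonishing among ten reported variables,
# but unremarkable if you could hunt through a thousand.
for n_reported in (10, 1000):
    print(n_reported, binom.sf(9, n_reported, 0.1))   # P(at least 10 values > 0.9)
```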

"6: Lack of consideration for rates of an illness"


In this section Crawford attempts to argue that a fairly constant risk of a rare illness in the population as a whole will lead to a fairly constant rate of such admissions to a single ICU. This is based on his "real world" experience, one assumes of working in many ICUs and examining their data?

Of course this is silly. Saying that an illness occurring at a fairly constant rate over a national population should also produce a constant rate of admissions to a single ICU over short periods of time is a bit silly?

I would not be surprised at all if we saw this degree of similarity over these two consecutive time periods for a national dataset, but that's not what we have. This unit only saw most primary diagnoses at a rate of one admission per month or less.

This is frankly a bit of a weird argument: it's trying to say that because a much larger population would have a more constant rate of admissions, we should also expect that in a smaller population. That's not even close to valid.
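A tiny simulation makes the point (the rate and period length are assumptions for illustration, not the unit's actual figures): even with a perfectly constant underlying rate of about one admission a month, two consecutive periods at a single unit routinely differ by several admissions.

```python
import numpy as np

rng = np.random.default_rng(0)
rate_per_month, months = 1.0, 7                 # assumed rate and period length
sims = rng.poisson(rate_per_month * months, size=(100_000, 2))  # two periods
diff = np.abs(sims[:, 0] - sims[:, 1])
print("median |difference| between the two periods:", int(np.median(diff)))
print("fraction of runs differing by 3 or more:", round((diff >= 3).mean(), 2))
```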

Conclusions:

This critique is a mix. I see four broad elements:
  • Some of these criticisms are of the form of "you could have done x instead" and gotten a different number, although if we follow these through to their conclusions they usually also show wildly unlikely/implausible numbers.
  • Some are technically correct nitpicking that doesn't change the conclusions
  • Some of these are arguing from eminence and "trust my expertise" about the expected degree of correlation between different medical risk factors and admission patterns to intensive care units, which is difficult to believe when the author doesn't actually claim to have any experience with either and instead cites work on the stock market.
  • Some is just ad hominem, name calling (dumbass, douchenozzle), mixed through with a slightly creepy search for 10+ year old photos of me at university?

Part 2: How I would criticise my original post if I wanted to try and tear it down.

How I would make this argument

The one argument he seems to have come right up to a bunch of times, but then chosen not to develop, is the difference between one-tailed and two-tailed p values, and left- and right-sided p values. Because there are a discrete number of results possible for dichotomous data, p values are slightly offset towards 1 when reporting left-sided p values, or towards 0 when reporting right-sided ones. In the case of Fisher's exact test, let's work through an example with much smaller numbers to make the maths easier and the difference more obvious.

Imagine there are 10 patients in each of 2 groups, with 2 events in each (4 total). What's the p value of this, and what does it mean? Here is a graph of the probabilities for the distribution of those 4 events between the two arms.




As usually performed, a Fisher exact test measures how "close" each variable is to the middle, that is, "what proportion of replicates would show this degree of imbalance or greater" (although technically any matrix-ordering technique is possible). What this means is we essentially reflect the graph around its middle, giving results like this.



Now, a Fisher exact test as usually performed will assign p values describing the chance of results being "this similar or less similar" assuming the null hypothesis (that is, that events are not intrinsically more likely in either arm). For the most similar possible result this will always be 1, and a result of exactly 0 is mathematically impossible. This matters less and less for larger numbers of events and patients, but a lot when you only have 4 events in 20 patients.
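Here is a minimal sketch of that example (Python with scipy, as an illustration rather than the code behind the original graphs): it lists each possible split of the 4 events, its probability under the null, and the standard two-sided Fisher p value it receives.

```python
from scipy.stats import hypergeom, fisher_exact

n_per_arm, events = 10, 4
dist = hypergeom(2 * n_per_arm, events, n_per_arm)   # events in arm 1 under the null

for k in range(events + 1):
    table = [[k, n_per_arm - k], [events - k, n_per_arm - (events - k)]]
    _, p = fisher_exact(table)                       # standard two-sided Fisher
    print(f"split {k}:{events - k}  P(split) = {dist.pmf(k):.3f}  p = {p:.3f}")
```

The most balanced split (2:2) gets p = 1.0, no split can produce a p anywhere near 0, and the values are heavily discretised.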



Now, while that is the usual way of presenting p values from a test, and the one journal editors will recognise, the scatter graph with all the "1"s could invite people to intuit things that are not true, e.g. assume that p = 1.0 should only happen 1% of the time, and I'm open to that as a criticism. I've definitely prioritised presenting p values in a familiar way.

In this extreme example we can work through a solution: we can take the p value for "this even or less even" and the p value for "less even", and average them to get a value that is definitionally centred on 0.5 under the null. That looks like this.
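Continuing the sketch above, here is that averaging applied to the same 10-versus-10, 4-event example (again an illustration, not the original code); the point is that the expectation of the averaged value under the null comes out at 0.5.

```python
from scipy.stats import hypergeom

n_per_arm, events = 10, 4
pmf = [hypergeom(2 * n_per_arm, events, n_per_arm).pmf(k) for k in range(events + 1)]

def averaged_p(k):
    """Average of the tail including this split and the tail excluding it,
    where the tail is every split as or less probable (here: as or more
    imbalanced, since the arms are the same size)."""
    obs = pmf[k]
    inclusive = sum(p for p in pmf if p <= obs + 1e-12)
    exclusive = sum(p for p in pmf if p < obs - 1e-12)
    return (inclusive + exclusive) / 2

values = [averaged_p(k) for k in range(events + 1)]
print([round(v, 3) for v in values])
expected = sum(pmf[k] * values[k] for k in range(events + 1))
print(round(expected, 3))   # ~0.5 under the null
```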




Obviously that's a bigger shift in p values than we see in most of the Marik et al data (except for a single variable that had only 3 events and has a whopping 0.38 shift), but I think this might have been a point Crawford was circling but didn't quite make. Whether that was because he didn't quite get there or he did actually calculate them but didn't like what he found I can't answer. 

Part of me leans towards "didn't quite get there": the "balls in urns" claim isn't one you'd make if you understood what you were doing and were trying to trick people; it only comes from deficits in understanding. The other part of me wonders why I'm giving the benefit of the doubt to a creepy dude who spreads transphobic rumours about me on the internet? But that's a philosophical and not a technical question.

If we graph these composite values, which must be centred on 0.5 in the absence of bias, this is what we actually see:



There is still an obvious bias, with the results substantially offset above 0.5. Keep in mind that while these values should be centred on 0.5, there are many disadvantages to this presentation and many intuitions you may have will be incorrect. For example, values >0.90 will be mathematically impossible for most variables on this scale, so for more than three quarters of those dots above 0.5 the value shown is the maximum possible for that number of events. This graph should NOT be taken to mean that no variables were perfectly matched.

There are pluses and minuses to each form of the graph; some might choose to present this orange one instead of my original blue one (although you'd have to explain in detail how it's calculated and what it does and does not mean, rather than just presenting a standard test you'd expect people to be familiar with), but it would be a valid choice.

So the weaknesses I would see if I was trying to be as critical as possible of my own original post:

1.) I chose to present the p values from the Fisher exact test as usually performed and, while I never actually claimed this, that might have invited people to infer that p values of 1 should only occur 1 in 100 or 1 in 1,000 times. This is me putting a degree of trust in readers not to read incorrectly between the lines, but I guess stranger things have happened at sea.

2.) As a follow-on, it might also have invited readers to compare the dots to the 0.5 line. There is still an obvious, large imbalance when "corrected", although a lot of other valid intuitions are lost, so I wouldn't recommend using that graph alone or without a detailed explanation of what it is.

3.) By describing the assumption that the probability of a p value over 0.4 is uniformly 0.6 as a "quick and dirty" method, it might not have been clear that this introduces a small imprecision for each variable. This cumulative imprecision increases the likelihood of the observed result from the order of 1 in 100,000 to the order of 1 in 10,000.

So none of these change the conclusion, but all are arguably better ways to present the data (especially if the reader had only a very basic understanding of statistics). 

Part 3: What if we actually tried to apply Crawford's criticisms?

So if we take away the innumerate "1,793,220 combinations in 2 urns" thing which is just provably wrong, we are left with:

 - don't present definitionally correlated variables as pairwise comparisons
 - don't present p values that aren't strictly centred on 0.5.
 
We have to drop the comorbidities completely since they are also definitionally related, but unlike Primary Diagnosis there's no valid test to apply here where some relationships are exclusive and some are not. This obviously lowers our statistical power by quite a lot.

This only leaves us with Primary Diagnosis (as a single 5x2 variable), Mechanical Ventilation, Vasopressors, Kidney Injury, +ve Blood Cultures.



So even if we absolutely gut our dataset removing most information, and strictly treat correlated variables as a single outcome, it's still on the order of about 1 in 1,000.

I think this is why you don't see actual content from these guys (other than crazy stuff where an urn holding 47 balls is filled with 60 or 70 or 80 balls etc), or why they don't replot p values so they're strictly 0.5 centric, or why they don't work out the exact chance of that many p values over 0.4, etc etc etc, because even if you do that the effect is slightly changed but still wildly obvious.

And if you're honest and say "this makes the likelihood 1 in 10,000 rather than 1 in 100,000", most people would go "... OK?"

One Final Thought:

This post took much longer than I expected to write, and meant making a whole bunch more graphs and running hundreds more calculations just to show that, no, "nitpick x" is not a fatal flaw; all to respond to an article filled with massive, glaring schoolchild errors, personal abuse, and a creepy collection of unrelated photos of me.

Just how much time it's reasonable to spend on replying to innumerate trolls publicly spreading transphobic rumours about me on the internet is still an open question in my mind.
 
Kyle









