**Andrew J. Vickers, PhD**

The origins of our recent economic troubles are complex, but there is widespread agreement that people like me — a statistician — bear a large share of the responsibility. In brief, math types built statistical models to predict whether homeowners would default on their mortgages, which worked absolutely great until lots of homes started getting foreclosed. Then the models didn’t work so well anymore. Banks had sold securities backed by mortgages, and the value of those securities depended on the probability that the mortgages would be repaid. The statistical models predicted a high probability of repayment, making the securities extremely valuable. When the models turned out to be wrong, a whole lot of investments suddenly became worth very little indeed.

Forecasting whether a homeowner will pay back a mortgage is, in theory, a relatively straightforward prediction problem. If you gave me 2 sets of historical data from US homeowners, one with details — such as their size of mortgage, value of house, age, income, and assets — and the other set showing whether the homeowner defaulted, I could run a logistic regression and build a pretty good statistical model to predict someone’s chance of foreclosure. This would be particularly robust, given the millions of homeowners in the United States and the large data set.
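As a rough illustration of the kind of model described above, here is a minimal Python sketch of a logistic regression fitted to default data. Everything in it is assumed for illustration: the data are synthetic, the two predictors (loan-to-value and payment-to-income ratios) and the "true" coefficients are invented, and plain gradient descent stands in for whatever fitting routine a real analyst would use. It is not the model any bank actually ran.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic data are reproducible

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic "historical" data (an assumption, not real mortgage records).
# Each row: (loan-to-value ratio, payment-to-income ratio).
rows, defaults = [], []
for _ in range(400):
    ltv = random.uniform(0.5, 1.2)
    pti = random.uniform(0.1, 0.6)
    true_p = sigmoid(-6.0 + 4.0 * ltv + 6.0 * pti)  # invented "true" risk
    rows.append((ltv, pti))
    defaults.append(1 if random.random() < true_p else 0)

# Standardize each predictor so plain gradient descent converges quickly.
n = len(rows)
means = [sum(r[j] for r in rows) / n for j in range(2)]
sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(2)]

def standardize(row):
    return [(row[j] - means[j]) / sds[j] for j in range(2)]

X = [standardize(r) for r in rows]

# Fit logistic regression by batch gradient descent on the log-loss.
w, b = [0.0, 0.0], 0.0
learning_rate = 0.5
for _ in range(500):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, y in zip(X, defaults):
        err = sigmoid(b + w[0] * x[0] + w[1] * x[1]) - y
        grad_w[0] += err * x[0]
        grad_w[1] += err * x[1]
        grad_b += err
    w = [w[j] - learning_rate * grad_w[j] / n for j in range(2)]
    b -= learning_rate * grad_b / n

def predicted_default_risk(ltv, pti):
    """Predicted probability of default for one hypothetical borrower."""
    x = standardize((ltv, pti))
    return sigmoid(b + w[0] * x[0] + w[1] * x[1])

safe = predicted_default_risk(0.6, 0.2)     # small loan, light payments
risky = predicted_default_risk(1.1, 0.55)   # loan exceeds house value
print(f"safe borrower:  {safe:.2f}")
print(f"risky borrower: {risky:.2f}")
```

The fitted model can only score a borrower whose inputs are known and whose circumstances resemble the historical data it was trained on.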

The problems with the mathematical models began when banks started offering mortgages without knowing a borrower’s assets or income. (These were called “NINJA” loans, as in “No Income? No Assets? Here is a loan anyway.”) Models don’t generally run very well unless you have good data to plug into them; therefore, analysts were forced to make what amounted to an educated guess as to the chance that a NINJA loan would end in default. What really kicked the models into a tailspin was that housing prices then began to fall, dramatically increasing foreclosure rates. One theory as to what happened is that banks first made it easier to get a cheap mortgage, and this drove up house prices. Statistical models therefore predicted low rates of default, which encouraged more lending and higher house prices. When reality eventually kicked in and the bubble burst, house prices fell; borrowers went into foreclosure; banks lost money on mortgage loans; and the economy went pear-shaped.

All of this was pretty avoidable. Anyone on Main Street can tell you that something is going badly wrong when a housecleaner can borrow $700,000 to flip an apartment or when a mediocre condo is priced at 12 times the median income of the county in which it’s located. The math types sitting on Wall Street were stuck with the data they had fed into models that said everything was going to be fine: The housecleaner’s $700,000 was just a line on a spreadsheet, with a probability of default given by a nice neat formula.

#### Where Was That Very Loud Bang?

During the Second World War, convoys of supply ships would cross the Atlantic from the United States to the United Kingdom carrying soldiers, weapons, and other supplies. German U-boats would try to sink the convoys, and battleships protecting the convoys would attempt to sink the U-boats. The British Navy wanted to know how best to set exploding depth charges to have the best chance of hitting submarines, and therefore sent data to a couple of statisticians working at the War Office in London. The data consisted of the direction in which the depth charge had been fired relative to the direction of the ship and then whether the submarine had been hit. After a couple of months, the statisticians had gotten precisely nowhere in working out how best to target depth charges.

One of them, bravely enough, volunteered to go out on a battleship and observe a sea battle. He saw a bunch of depth charges fired off, the ship going in one direction and the wind in another, and then massive explosions. The data that the statisticians had been working with — eg, “depth charge at 35° to starboard” — were totally unreliable: Pretty much the only thing one could tell for sure was whether the depth charge went to the left or right. When the statistician got back to London and reanalyzed the problem, ignoring most of the data that he’d been sent, he solved it pretty quickly. This is one of the reasons I am writing this in English, rather than German.

#### My Formula Looks Great, So What’s the Problem?

I was once asked to help design a phase 1 trial looking at the effects of a drug on immune function. The basic idea was to give patients different doses and see which had the best effect. As far as I could make out, there were 3 endpoints — cytokines, T cells, and neutrophils — so I essentially wrote a statistics section stating that we’d work out the best dose for each of the 3 endpoints and then compare them. When the trial was completed, I was sent a spreadsheet with over 600 data points per patient. It turns out that there are a number of different ways of measuring cytokines, T cells, and neutrophils, and the laboratory had done them all. As a result, I had to develop a different way of analyzing data on the fly, without reference to the protocol.

My problem was that I didn’t understand the scientific content of the study because it included complex immune assays, such as “IFN-gamma production from CD45RO+ CD4+ cells, unstimulated.” Therefore, I was no different from the Wall Street “quants” who had no insight into the securities that they were valuing (including $500,000 loans to semiemployed laborers) or from the statisticians in the War Office who thought that huge explosions at sea could be pinpointed to within 5°.

#### Why Statisticians Prefer to Stick to Math

A common mathematical problem is figuring out when to leave home to catch a plane. You might figure that it takes 5 minutes to load the car, and that the 10 miles to the airport would be driven at 55 mph. Here is the formula:

Total time = Loading time + distance ÷ average speed × 60

This gives 5 + 10 ÷ 55 × 60 = 15.9 minutes. There is nothing wrong with the math here; it is 100% correct. However, if I used this formula to get my family to the airport, I’d miss my flight. It takes more than 5 minutes to load a bunch of children plus booster seats into a taxi, and you can’t drive to John F. Kennedy International Airport through Brooklyn at 55 mph.
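The arithmetic can be checked in a few lines of Python. The function name `travel_minutes` and the second set of inputs (20 minutes to load, 15 mph on average) are made up for illustration; they are not figures from the text.

```python
def travel_minutes(loading_min, distance_miles, avg_speed_mph):
    """Total time in minutes: loading time + (distance / average speed) * 60."""
    return loading_min + distance_miles / avg_speed_mph * 60

# Inputs used in the text: 5 minutes to load, 10 miles at 55 mph.
textbook = travel_minutes(5, 10, 55)
print(round(textbook, 1))  # 15.9

# Hypothetical inputs for a family trip through city traffic (assumed values).
realistic = travel_minutes(20, 10, 15)
print(round(realistic, 1))  # 60.0
```

The formula and the math are identical in both calls; only the inputs differ, and the answer moves from about 16 minutes to an hour.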

If you read statistics journals, you’ll find a lot of papers that are the equivalent of tinkering with Total time = Loading time + distance ÷ average speed × 60. For example, an academic statistician might point out that the formula makes the assumption that variations in speed are logarithmically distributed, and develop new formulas for Gaussian, uniform, and Weibull distributions. However, unless the statistician running the formula knows me, my family, and my neighborhood, we’re still going to be late.

You wouldn’t have to spend much time in a statistics department to understand why knowledge of Gauss is favored over knowledge of the scientific background of a study, such as Brooklyn traffic patterns. The study that I analyzed looking at the immune drug was published in a specialist chemotherapy journal, and I was the seventh author out of 10. I don’t think any statisticians read it. A paper published in *The Journal of the Royal Statistical Society* on extending the travel time formula for alternative parameterizations of the velocity distribution would have the statistician who wrote it as first author. Of course, plenty of statisticians read statistical journals. Authors of statistical papers become well known in the statistical community, and they get promotions and invitations to lecture because of their contributions to mathematical theory, not their knowledge of the immune system. Making a fuss about data hardly endears statisticians to their scientific collaborators; there is often huge pressure to “just give us the *P* value.”

I am all for advancing the science of statistics. Indeed, I’ve written papers proposing new statistical methods. However, at some point, theory has to link up with reality. Compare the statistician who went out to watch the naval battle, got a feel for the data, and helped to win the war, with the Wall Street statisticians who sat programming in their cubicles, never daring to set foot in a mortgage broker’s office (let alone a battleship), and who brought much of the world economy to its knees.

If you liked this article, you’ll love Andrew Vickers’ collection of stories on statistics: “What is a p-value anyway?”

**Disclosures**

Assistant Attending Research Methodologist, Memorial Sloan-Kettering Cancer Center, New York, NY

Disclosure: Andrew J. Vickers, PhD, has disclosed no relevant financial relationships.