1st interview ——————————
Overview of Tetlock findings
The headline result of the tournaments was the chimp sound-bite, but EPJ’s central findings were more nuanced. It is hard to condense them into fewer than five propositions, each a mouthful in itself:
- Overall, EPJ found over-confidence: experts thought they knew more about the future than they did. The subjective probabilities they attached to possible futures they deemed to be most likely exceeded, by statistically and substantively significant margins, the objective frequency with which those futures materialized. When experts judged events to be 100 percent slam-dunks, those events occurred, roughly, 80 percent of the time, and events assigned 80 percent probabilities materialized, on average, roughly 65 percent of the time.
- In aggregate, experts edged out the dart-tossing chimp but their margins of victory were narrow. And they failed to beat: (a) sophisticated dilettantes (experts making predictions outside their specialty, whom I labeled “attentive readers of the New York Times”—a label almost as unpopular as the dart-tossing chimp); (b) extrapolation algorithms which mechanically predicted that the future would be a continuation of the present. Experts’ most decisive victory was over Berkeley undergraduates, who pulled off the improbable feat of doing worse than chance.
- But we should not let terms like “overall” and “in aggregate” obscure key variations in performance. The experts surest of their big-picture grasp of the deep drivers of history, the Isaiah Berlin–style “hedgehogs”, performed worse than their more diffident colleagues, or “foxes”, who stuck closer to the data at hand and saw merit in clashing schools of thought. That differential was particularly pronounced for long-range forecasts inside experts’ domains of expertise. The more remote the day of reckoning with reality, the freer the well-informed hedgehogs felt to embellish their theory-driven portraits of the future, and the more embellishments there were, the steeper the price they eventually paid in accuracy. Foxes seemed more attuned to how rapidly uncertainty compounds over time—and more resigned to the eventual appearance of inherently unpredictable events, Black Swans, that will humble even the most formidable forecasters.
- A tentative composite portrait of good judgment emerged in which a blend of curiosity, open-mindedness, and unusual tolerance for dissonance were linked both to forecasting accuracy and to an awareness of the fragility of forecasting achievements. For instance, better forecasters were more aware of how much our analyses of the present depend on educated guesswork about alternative histories, about what would have happened if we had gone down one policy path rather than another (chapter 5). This awareness translated into openness to ideologically discomfiting counterfactuals. So, better forecasters among liberals were more open to the possibility that the policies of a second Carter administration could have prolonged the Cold War, whereas better forecasters among conservatives were more open to the possibility that the Cold War could have ended just as swiftly under Carter as it did under Reagan. Greater open-mindedness also protected foxier forecasters from the more virulent strains of cognitive bias that handicapped hedgehogs in recalling their inaccurate forecasts (hindsight bias) and in updating their beliefs in response to failed predictions (cognitive conservatism).
- Most important, beware of sweeping generalizations. Hedgehogs were not always the worst forecasters. Tempting though it is to mock their belief-system defenses for their often too-bold forecasts—like “off-on-timing” (the outcome I predicted hasn’t happened yet, but it will) or the close-call counterfactual (the outcome I predicted would have happened but for a fluky exogenous shock)—some of these defenses proved quite defensible. And, though less opinionated, foxes were not always the best forecasters. Some were so open to alternative scenarios (in chapter 7) that their probability estimates of exclusive and exhaustive sets of possible futures summed to well over 1.0. Good judgment requires balancing opposing biases. Over-confidence and belief perseverance may be the more common errors in human judgment but we set the stage for over-correction if we focus solely on these errors and ignore the mirror image mistakes, of under-confidence and excessive volatility.
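The fox failure mode in the last bullet (probabilities for exclusive and exhaustive futures summing to well over 1.0) is easy to check mechanically. A minimal sketch in Python, assuming a forecaster's estimates arrive as a plain list; the function names and the proportional rescaling are my own illustration, not something EPJ prescribes:

```python
def coherence_excess(probs):
    """How far a set of probabilities for mutually exclusive, exhaustive
    outcomes overshoots 1.0 (0.0 means the set is coherent)."""
    return sum(probs) - 1.0


def normalize(probs):
    """One simple repair (an assumption, not the book's method):
    rescale every estimate proportionally so the set sums to 1.0."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    return [p / total for p in probs]


if __name__ == "__main__":
    # Hypothetical fox who finds three exclusive scenarios all quite plausible.
    scenarios = [0.5, 0.4, 0.4]
    print("excess over 1.0:", round(coherence_excess(scenarios), 3))
    print("rescaled:", [round(p, 3) for p in normalize(scenarios)])
```

Rescaling keeps the relative ranking of the scenarios but says nothing about which estimate was the inflated one, so it is a diagnostic more than a cure.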
Extremizing (explained again)
The example I used in the Superforecasting book was that of the advisors to President Obama when he was deciding whether to launch the Navy SEALs at a large house in the Pakistani city of Abbottabad.
The thought experiment runs like this: if, when the President went around the room and asked his advisors how likely Osama is to be in this compound, this mystery compound, each advisor had said 0.7, what probability should the President conclude is the correct probability? Most people sort of look at you and say well, it’s kind of obvious, the answer is 0.7, but the answer is only obvious if the advisors are clones of each other. If the advisors all share the same information and are reaching the same conclusion from the same information, the answer is probably very close to 0.7.
Imagine that one of the advisors reaches the 0.7 conclusion because she has access to satellite intelligence. Another reaches that conclusion because he has access to human intelligence. Another one reaches that conclusion because of code breaking, and so forth. So the advisors are reaching the same conclusion, 0.7, but are basing it on quite different data sets processed in different ways. What’s the probability now? Most people have the intuition that the probability should be more extreme than 0.7, and the question then becomes how much more extreme?
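To get a feel for "how much more extreme", here is a minimal sketch of the two limiting cases. It assumes (my assumptions, not Tetlock's) that every advisor started from the same prior of 0.5 and that, in the non-clone case, their evidence is fully conditionally independent, so their odds multiply. Real advisors sit somewhere between the two.

```python
def prob_to_odds(p):
    return p / (1.0 - p)


def odds_to_prob(odds):
    return odds / (1.0 + odds)


def combine_clones(probs):
    """Advisors share the same information: averaging adds nothing new."""
    return sum(probs) / len(probs)


def combine_independent(probs, prior=0.5):
    """Advisors hold conditionally independent evidence and a shared prior
    (both assumptions): each one's likelihood ratio multiplies the odds."""
    prior_odds = prob_to_odds(prior)
    posterior_odds = prior_odds
    for p in probs:
        posterior_odds *= prob_to_odds(p) / prior_odds
    return odds_to_prob(posterior_odds)


if __name__ == "__main__":
    advisors = [0.7, 0.7, 0.7]
    print("clones:", round(combine_clones(advisors), 3))
    print("independent evidence:", round(combine_independent(advisors), 3))
```

With three advisors at 0.7, the clone case stays at 0.7 and the fully independent case lands at roughly 0.93, which brackets the range a sensible extremizing rule should pick from.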
More notes on that
At the Centre for Effective Altruism, where I’ve been working the last few years, we often get people to independently come up with probability estimates for different things before we discuss something, and then after we discuss it.
We’ve never done this thing of then combining them and then saying well, if we’re all on one side, then that should make us even more confident than the average of our answers. But perhaps we shouldn’t, anyway, because we’re all clones of one another or something like that or we all have access to too similar information, but that’s maybe something we should consider doing.
Philip Tetlock: Well, well-functioning groups that are very good at overcoming biases like failing to share distinctive information, groups that are effective at that, you want to be careful about extremising. For example, it wasn’t a good idea to extremise the judgements of super forecasting teams.
How to solve – does your research mean that we shouldn’t trust experts?
skeptics are over-claiming
It’s very hard to strike the right balance between justified skepticism of pseudo-expertise, and there’s a lot of pseudo-expertise out there and there’s a lot of over-claiming by legitimate experts, even. So justified skepticism is very appropriate, obviously, but then you have this kind of know-nothingism, which you don’t want to blur over into that. So you have to strike some kind of balance between the two, and that’s what the preface is about in large measure.
Experts good and bad
Bad - Laws of diminishing returns in
I do not have evidence that PhD level academics are worse than the average person off the street. What we do have evidence for is something that Daniel Kahneman called The Attentive Reader of the New York Times hypothesis. He coined that phrase when this research was in a very early phase back in 1987, 1988 when we were colleagues at Berkeley.
What that means is you get a boost from being an attentive reader of The New York Times, or The Wall Street Journal to be bipartisan here. You get a boost from being an attentive reader of the news. So moving from nothing to being an attentive reader of the elite press, The Economist, Financial Times, whatever, the elite press, there is a boost. That boost is substantially greater than the boost you get moving from being an attentive reader of the elite press to having a PhD in China Studies. You hit a point of diminishing marginal predictive returns for knowledge depressingly quickly if you’re an academic.
Hedgehogs are especially bad at long-term predictions
Robert Wiblin: And you haven’t found that experts can be poorly calibrated, even the hedgehogs?
Philip Tetlock: Well, the hedgehogs were certainly worse calibrated in the early work, especially for their longer-range forecasts. That’s true.
Experts are better when feedback loops are quick
Cognitive diversity is good, and maybe diversity in general
Do you have any evidence about how much that kind of demographic diversity, like across ethnic or gender lines … Does that match up with cognitive diversity?
Philip Tetlock: I suspect it does to some degree. It’s an empirical question. We’re not in a great position to answer that question because the people who volunteer for these forecasting tournaments tend to be quite disproportionately male, and well-educated, and having somewhat of a quantitative inclination. So a bit of a Silicon Valley personality profile there.
How does the super forecaster algorithm work?
If I had simply done what the research literature tells me would’ve been the right thing and looked at the best algorithm that distills the most recent forecast or the best forecast and then extremises as a function of the diversity of the views within, if I had simply followed that, I would’ve been the second best forecaster out of all the super forecasters. I would have been like a super, super forecaster.
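A rough sketch of what such an algorithm could look like, to make the moving parts concrete. This is not the Good Judgment Project's actual code: the log-odds averaging, the weights (standing in for recency and track record), and the exponent `a` are all assumptions for illustration. In the description above the amount of extremizing depends on the diversity of views; here `a` is simply passed in by hand.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(forecasts, weights, a=1.5):
    """Weighted mean of forecasts in log-odds space, then extremized.

    forecasts: probabilities strictly between 0 and 1
    weights:   non-negative weights (hypothetical: higher for more recent
               forecasts and for forecasters with better track records)
    a:         extremizing exponent; a == 1.0 means no extremizing,
               a > 1.0 pushes the result away from 0.5
    """
    if not forecasts or len(forecasts) != len(weights):
        raise ValueError("need one weight per forecast")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    mean_logit = sum(w * logit(p) for p, w in zip(forecasts, weights)) / total
    # Multiplying log-odds by a is the same as p**a / (p**a + (1 - p)**a).
    return inv_logit(a * mean_logit)


if __name__ == "__main__":
    crowd = [0.65, 0.70, 0.75]
    weights = [1.0, 2.0, 3.0]  # hypothetical: the last forecast is freshest
    print(round(aggregate(crowd, weights, a=1.0), 3))  # no extremizing
    print(round(aggregate(crowd, weights, a=2.0), 3))  # extremized
```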
Techniques – fermi estimation, outside/inside view
Fermi estimation is the tendency to take problems and decompose them into their unknowns. That’s a very useful heuristic, especially for really weird problems like whether Yasser Arafat’s remains are going to score positive for polonium in either the Swiss or the French autopsy.
Watching the super forecasters take a question like that, which initially seemed like a hopeless head scratcher, and turn it into something tractable, that was an interesting thing to behold. That ties into another thing they do, which is they look for what would… Kahneman draws a distinction between the inside and the outside view approaches to forecasting, and the super forecasters are much more likely than regular mortals to look at things from the standpoint of the outside view.
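The decomposition step can be sketched in a few lines. Everything specific here is hypothetical: the split into "some path put the substance there" and "a test would still detect it", the independence between paths, and all of the numbers are placeholders of mine, not estimates from the interview.

```python
def any_of(path_probs):
    """P(at least one path occurred), assuming the paths are independent."""
    none_occurred = 1.0
    for p in path_probs:
        none_occurred *= 1.0 - p
    return 1.0 - none_occurred


def fermi_estimate(path_probs, p_detect):
    """P(positive result) = P(some path put the substance there)
    times P(the test detects it, given that it is there)."""
    return any_of(path_probs) * p_detect


if __name__ == "__main__":
    # Hypothetical sub-questions, each easier to guess than the whole.
    paths = [0.2, 0.1, 0.05]  # distinct ways the substance could be present
    p_detect = 0.6            # chance a test still finds it years later
    print(round(fermi_estimate(paths, p_detect), 3))
```

The point is less the final number than that each input is a question one can actually research or find a base rate for.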
The division between the inside and the outside view is blurry on close inspection. I mean, if you start off your date with a base rate probability of divorce for the couple being 35%, then you … Information comes in about quarrels or about this or about that, you’re going to move your probabilities up or down. That’s kind of inside view information, and that’s proper belief updating.
Start with the outside view
I think a categorical prohibition on the inside view is way too extreme, but starting off with your first guess with the most plausible outside views is pretty demonstrably sound practice.
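A minimal sketch of "outside view first, then adjust", using the 35% divorce base rate from the example above. The likelihood ratios are made-up stand-ins for inside-view evidence (a ratio above 1 means the evidence is more typical of couples who divorce); treating the pieces of evidence as independent is also an assumption.

```python
def update(prior, likelihood_ratio):
    """One Bayesian update in odds form: posterior odds = prior odds * LR."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


def update_all(base_rate, likelihood_ratios):
    """Start from the outside view, then fold in each piece of inside-view
    evidence in turn (assumes the pieces are independent of one another)."""
    p = base_rate
    for lr in likelihood_ratios:
        p = update(p, lr)
    return p


if __name__ == "__main__":
    base_rate = 0.35        # outside view: divorce rate for the reference class
    evidence = [2.0, 0.8]   # hypothetical: frequent quarrels, shared finances
    print(round(update_all(base_rate, evidence), 3))
```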
Resolution vs. calibration
I mean, one of the things we looked at in our early work on Expert Political Judgment was whether the well-calibrated forecasters were just being cowards, right? So it rains 60% of the time in Seattle. So you always predict 60%, so you get a perfect calibration score. Whereas what you really want are people who say it’s a 95% chance of rain when it rains and a 5% chance when it doesn’t rain. And that gets you a great resolution score as well as being well-calibrated because you’re right. But when you say 95% and it doesn’t happen, you take a big hit so there is a trade-off in people’s minds. I think if you tell people you’re judging them on both properties, it’s going to force them to be more mentally agile and they’ll be making more trade-offs in their head. They’ll say, oh well, I don’t want to be overconfident. On the other hand, I don’t want to be a chicken.
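The Seattle example can be worked through with the standard decomposition of the Brier score into reliability (the calibration penalty), resolution, and uncertainty, where Brier = reliability - resolution + uncertainty. A minimal sketch with ten hypothetical days, six of them rainy; the function name and the toy data are mine.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Split the Brier score into reliability, resolution and uncertainty.

    forecasts: stated probabilities (grouped by exact value)
    outcomes:  1 if the event happened, 0 otherwise
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = 0.0  # lower is better: stated probability vs. observed rate
    resolution = 0.0   # higher is better: how far groups sit from the base rate
    for f, obs in groups.items():
        freq = sum(obs) / len(obs)
        reliability += len(obs) * (f - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return {
        "brier": brier,
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1.0 - base_rate),
    }


if __name__ == "__main__":
    rain = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]          # it rains 60% of the time
    coward = [0.6] * 10                            # always predicts the base rate
    bold = [0.95 if r else 0.05 for r in rain]     # confident and right
    print("coward:", brier_decomposition(coward, rain))
    print("bold:  ", brier_decomposition(bold, rain))
```

The always-60% forecaster gets a reliability penalty of zero but also zero resolution, so the Brier score is stuck at the uncertainty term. The bold forecaster pays a tiny calibration penalty and earns nearly all the available resolution.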
Different outcomes depends on
So he noted that there used to be this page on the Good Judgment Project’s website, that used to break down various different ways to get better forecasts and suggested that you got a 40% boost from talent-spotting forecasters, as you just mentioned, and a further 10% boost from giving them training tools, 10% from putting them on teams and getting them to talk to one another. And then maybe a 25% boost in accuracy from using algorithms to process and then aggregate their various different predictions. Does that ring true to you still today?
Philip Tetlock: I guess I would have to characterize these as stylized facts. The baseline here is the unweighted average of the regular forecasters. It’s true that once you’ve identified the super forecasters, and you put them into teams, they have a big advantage over regular teams and individuals working alone. That is true and is in the vicinity of 40%, yes. The training number of 10% is approximately right, the teaming number of 10% is approximately right. The algorithm number really conflates a couple of things. I mean, the algorithm number could be larger or smaller depending on how you calculate it. Since the aggregation algorithms are piggybacking on the most recent forecast of the best forecasters, that means they’re drawing on super forecasters.
Philip Tetlock: So the question is, how much better can the aggregation algorithms do if you just put the super forecasters out of the equation? And I think that number is about right, 25%.
“The strongest predictor of rising into the ranks of super forecasters is perpetual beta, the degree to which one is committed to belief-updating and self-improvement. Perpetual beta is roughly three times as powerful a predictor as its closest rival, raw intelligence.”
What does the work for us is a measure of the frequency with which people engage in belief-updating; low-magnitude, frequent belief-updating is a powerful driver.
Picking the right base-rate is very important
What works: good algorithms
Like predict no change
People should predict no change in the short term (maybe even the medium term)
Philip Tetlock: Especially for the shorter-term forecasts, because change is less likely in the short term. I think one finding, when you cross both books, is that people somewhat exaggerate change in the short term, but they understate it in the longer term. But that’s not entirely true even there. I think they’re exaggerating change, even in the five year range. Okay, I’m thinking out loud.
and probably you understate in the long term (but this part is not certain)
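A "predict no change" rule is the simplest possible benchmark, and it is cheap to backtest before trusting anything fancier. A minimal sketch, assuming the quantity of interest is a numeric series; the series and the error metric (mean absolute error) are illustrative choices of mine.

```python
def persistence_forecast(history):
    """Predict that the next value equals the last observed one."""
    if not history:
        raise ValueError("need at least one observation")
    return history[-1]


def mean_abs_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)


def backtest_persistence(series):
    """Walk through the series, forecasting each point from the ones before."""
    predictions = [persistence_forecast(series[:i]) for i in range(1, len(series))]
    return mean_abs_error(predictions, series[1:])


if __name__ == "__main__":
    # Hypothetical yearly indicator that mostly stays put, then jumps once.
    series = [3.0, 3.0, 3.1, 3.0, 3.0, 5.0, 5.0]
    print(round(backtest_persistence(series), 3))
```

Any forecaster or model claiming skill on the same series should beat this error; the one big miss at the jump is exactly the kind of change the rule cannot see.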
What doesn’t work: predicting black swans
Quote: But apparently, Karl Marx also said that, I’m not a Marxist by the way, “When the train of history hits a curve, the intellectuals fall off.”
Robert Wiblin: Yeah. And probably said by someone else as well.
Philip Tetlock: Well, it’s ironic given how often the Marxists have fallen in the 20th century, but it’s an all the more apropos remark. There’s a lot of truth to that, predicting change is hard and predicting dramatic change is really, really, really hard.
How long into the future one can forecast - a decade? 7 times to shuffle the deck
Quote: The magician statistician, Persi Diaconis at Stanford, once asked the question: “How many times do you have to shuffle a deck of cards before all information is lost?” So you have a deck of cards, open up a new deck of cards, all the cards are perfectly ordered from deuces up through aces. And they’re all in exact order, same order. And how many times you have to shuffle, I mean do a proper shuffle? I guess there’s a definition of what a proper shuffle is, and how many proper shuffles do you have to do before all order is lost? I think the answer is five or six (ed: It’s 7).
I mean, there are things that are happening that are random and how much randomness, and you’re not getting full card shuffles every day or every month or every year. There are substantial pockets of stability in history. But how fast is the randomness compounding, so the optimal forecasting frontier is going to be very, very close to chance when you reach a certain point.
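One way to picture "how fast is the randomness compounding" is a toy model of my own, not something from the interview: the world is in one of two states, and each period it stays put with probability `p_stay`. The chance that the best forecast (it is still in today's state) is right after n periods is 0.5 + 0.5 * (2 * p_stay - 1) ** n, which slides toward the coin-flip level of 0.5.

```python
def prob_same_state(p_stay, n_steps):
    """Chance a symmetric two-state Markov chain is back in its starting
    state after n_steps, given per-step persistence p_stay."""
    return 0.5 + 0.5 * (2.0 * p_stay - 1.0) ** n_steps


def edge_over_chance(p_stay, n_steps):
    """How much better than a coin flip the best possible forecast is."""
    return prob_same_state(p_stay, n_steps) - 0.5


if __name__ == "__main__":
    p_stay = 0.9  # hypothetical: 90% chance nothing changes in a given year
    for horizon in (1, 5, 10, 25, 50):
        print(horizon, round(edge_over_chance(p_stay, horizon), 4))
```

Even with a very stable 90% year-to-year persistence, the edge over chance shrinks geometrically with the horizon, which matches the intuition that a small advantage may survive at five to ten years and very little beyond that.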
I mean, if you want to call me a pessimist, because I wouldn’t think they’re going to do a very good job a century out- a generation out. Now, when they get to five to 10 years, maybe there’s going to be some advantage, but it’s going to be increasingly small.
Yeah, I guess it just depends on the nature of the question. Because if you’re saying, yeah, who’s going to be prime minister of the UK in 50 years time? I mean no superforecasters are going to get that, everyone is just back to chance. It’s just like guessing names at that point. But something like, which party will be in power, maybe you can get a little bit of resolution there.
Robert Wiblin: So for example, if you’re trying to forecast progress in artificial intelligence, like forecasting at what point do you get transformative change? Like at what point will the algorithms reach the point at which you can get transformative change, is very, very hard. But trying to potentially forecast just the amount of computational ability that we will have or how fast will computer chips be? Seems like potentially we can have something to say about that even looking 50 or 100 years out. Just because we have enough of a historical record and adjusting to the trends there. So it gets much harder, no doubt. But I think, superforecasters might be able to do better than just chimps throwing at dartboards.
And maybe I’m naive, but I think when astronomers and astrophysicists tell me that the sun is going to go supernova in three or 4 billion years, I think they’re probably right. It’s going to come close to the Earth’s orbit, it’s going to destroy all life on the planet.
Robert Wiblin: Some things are kind of mechanistic.
Philip Tetlock: And yeah, there are some categories of things, right? There are timescales and there are levels of determinism and certain operating laws where you have enough confidence that we think we can extrapolate out. I mean, where’s climate on that continuum?
Best extremising is when people who disagreed start to agree
I think the best extremizing algorithms are ones that are sensitive to the cognitive diversity of the crowd. So when people who normally disagree suddenly start agreeing, that’s a cue for increasing your confidence and extremizing; it’s not just because you’re taking the most recent forecasts or the best forecasters.
I think prediction markets are very valuable. I mean, you really … tournaments are really only beating prediction markets under certain conditions. And they’re beating prediction markets when you use the very best extremizing algorithms and you have a lot of data to fine-tune the algorithm. And the prediction markets themselves are not deep and liquid.
But when you think about it, prediction markets in a sense are doing almost automatically a lot of the things that the aggregation algorithms are doing. They’re upweighting the most recent forecasts of the most highly capitalized, confident bettors.
Mechanism of why social status blocks prediction markets
Philip Tetlock: Well, forecasting tournaments are a difficult sell because they challenge stale status hierarchies. Imagine that instead of being an academic, I started in the intelligence community as an analyst 40 years ago and I was a China specialist. And then I worked my way up the ranks and finally I made it onto the National Intelligence Council, became a big shot and maybe I get to participate in an occasional presidential daily briefing or maybe be close to the President when he’s meeting with Xi Jinping. After 40 years, that’s a lifetime of accomplishment, working and doing these reports and playing by the rules, playing by the epistemic ground rules of the intelligence community of drafting concise, compelling narratives that are sent up to the policy community, where people say, “Yeah, I think I learned something I didn’t know before… and promote that guy.”
Robert Wiblin: Yeah.
Philip Tetlock: And then someone comes along from the R&D branch of the Intelligence Community and says, “Hey, you know what? We want to run these forecasting tournaments, and we want to see whether 25-year-olds can do a better job than 65-year-olds, at predicting the course of events in China. You know, what’s going on with the Chinese economy, Chinese domestic political system, how are they going to respond to the trade war? What are they going to do with Hong Kong? What are they going to do with Taiwan? What are they going to do with this island? What are they going to do with North Korea?”
Philip Tetlock: All those things and it turns out that the 25-year-olds are outperforming the 65-year-olds like me. It doesn’t take a lot of imagination to suppose that the more senior people are going to be decidedly unenthusiastic about this innovation.
Robert Wiblin: Yeah.
Philip Tetlock: This looks like a superficial idea that a bunch of nerdy computer programmer, neopositivist types might come up with. But this is not something serious people should entertain. So we talked about Tom Friedman in the Superforecasting book as a prototype of a style of journalism that seems, from a forecasting point of view, to put it charitably, superficial and, to put it more harshly, reckless. But at any rate, it’d be very difficult to figure out what he’s saying.
Philip Tetlock: So I mean, you’re not going to expect him to be very favorable, you’re not gonna expect most op-ed people to be very favorable, because this is perceived as an attack on how they do things. They’re very intelligent people there, they’re very verbal. They’re very knowledgeable. They have a career, they have a lot of reputational capital in the approach that they’ve taken. And the notion that these forecasting tournaments could even partly supplant or complement the expertise they have is dissonant. The notion that they would completely displace them is anathema. But even partial displacement, I think produces resistance.
Problem with percentage estimations
Why doctors and lawyers don’t want to use probabilistic reasoning with percentage estimates: people are afraid that, once numbers are on the table, you will anchor too much on them. People don’t understand how noisy numbers are.
Diversity of minds or inner agents (draft)
I think that the best individual forecasters do often seem to talk to themselves as if they have more than one mind.
So I think the Kahneman insight on outside view and reference classes is profound. (I think it’s taking base rates and projecting it forward – easy forecasting)
More notes found in Matter (to process)
An economist I really like, Bent Flyvbjerg, I’m not sure quite how to pronounce his name, has done some really great work on the intense and consistent overoptimism in planning deadlines and cost forecasts for megaprojects such as massive bridges and huge buildings-
Philip Tetlock: I think it’s lovely work, yes.
Robert Wiblin: For example his UK team that looks into this estimated that successful delivery was likely for only 20% of the UK government’s major infrastructure projects. Have you considered using forecasting teams to call bullshit on some of the megaproject overoptimism that you get from, I guess, contractors and government?
attribute-substitution heuristic, we nicknamed it the bait and switch heuristic.
Philip Tetlock: But it’s what you were pointing to earlier, it’s that you confront a really difficult question about climate change or a really difficult question about the economy or the Persian Gulf or God knows what. You can find a really difficult question and someone comes along and seems to have a lot of status and well dressed-
Robert Wiblin: Smooth talker.
Philip Tetlock: -smooth talker, well connected, well networked. And gives you a story, and you slip into answering another question. You’re no longer trying to answer the question at hand about China or the economy or climate. You’re trying to ask, does this person look like the sort of person who would know the answer to this question? And the answer there is a resounding, yes. This person clearly looks like it, and you run with it. You then act as though the answer to the easy question is also the answer to the hard question. You conflate the things in your mind.
Robert Wiblin: What do you think it would look like to live in a world where elites around the world are better at predicting social and political trends? How confident are you that this world would be safer, especially from war and catastrophes?
Philip Tetlock: I think that in a competitive nation state system where there’s no world government, that even intelligent self-aware leaders will have serious conflicts of interest and that there’s no guarantee of peace and comity. But I think you’re less likely to observe gross miscalculations, either in trade negotiations or nuclear negotiations. I think you’re more likely to see an appreciation of the need to have systems that prevent accidental war and prevent, and put constraints on cyber and bio warfare competition as well as nuclear.
Philip Tetlock: So those would be things I think would fall out fairly naturally from intelligent leaders who want to preserve their power, and the influence of their nations, but also want to avoid cataclysms. So in that sense, yes. I think there … I’m not utopian about it. I think we would still live in a very imperfect world. But if we lived in a world in which the top leadership of every country was open to consulting competently technocratically run forecasting tournaments for estimates on key issues, we would, on balance, be better off.
So it seems being a super forecaster is a fool’s game:
- because you will be punished by outliers, and outliers are what you really care about most, but nobody can reward you for that