Suppose you need information about the future population of Metropolia City in the year 2020. You solicit two experts to assess this quantity, and each provides an expert judgment in the form of a probability distribution, as shown.
The experts' distributions in this example are
LogNormal(100K, 1.2), and
Arguably the most important criterion for judging a subjective assessment is for the source of that assessment (i.e., the expert) to be well-calibrated. Calibration is a property of the assessor, not a property of a single assessment, and reflects the accuracy with which an expert can quantify his degree of uncertainty. Being well-calibrated means that, over the course of many assessments, his assessed probabilities correspond to the empirical frequency of occurrence. There is much to be said on this topic, and a rich literature about it, so instead of going deeper here, I refer you to Morgan and Henrion (1998).
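As a rough illustration of what calibration means operationally, the sketch below scores a track record of 90% intervals by counting how often the realized value fell inside the stated interval. The records and the function name `interval_hit_rate` are hypothetical, invented only for this example; over many assessments, a well-calibrated assessor would be expected to score close to 0.90.

```python
def interval_hit_rate(records):
    """Fraction of (low, high, outcome) records whose outcome lies in [low, high]."""
    hits = sum(1 for low, high, outcome in records if low <= outcome <= high)
    return hits / len(records)

# Hypothetical track record (not from the article): each tuple is
# (5th percentile, 95th percentile, realized value) for one past assessment.
track_record = [
    (80, 120, 95), (10, 30, 22), (200, 400, 310), (5, 9, 8), (50, 70, 74),
    (1, 3, 2), (40, 90, 61), (15, 25, 19), (300, 500, 420), (60, 100, 88),
]

hit_rate = interval_hit_rate(track_record)
print(f"Observed hit rate for stated 90% intervals: {hit_rate:.2f}")
```

A hit rate far below the stated 90% would suggest overconfidence (intervals too narrow), and a rate far above it would suggest underconfidence. Ten records are far too few to judge a real assessor; the point is only the bookkeeping.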
Which expert in the example is more correct? If both experts are considered well-calibrated, then both can be considered equally valid or equally credible. Each specific distribution is a reflection of the well-calibrated assessor's degree of uncertainty.
You could conceivably receive a distribution from a well-calibrated non-expert, who has never even heard of Metropolia City and knows little about urban population analysis. However, because he is extremely well-calibrated, his distribution could be considered as credible as the one provided by the urban population analyst whose Ph.D. thesis studied urban planning for Metropolia. Although we might consider the two distributions equally credible in terms of calibration, we expect the population analyst's assessment to somehow be superior.
A well-calibrated assessor who knows very little about a topic can be expected to produce a distribution with wide variance, reflecting a state of high uncertainty. The well-calibrated specialist, being more knowledgeable on this topic, would be expected to produce a narrower distribution. We consider the narrower distribution as conveying more information.
Shannon information theory defines a measure of the amount of information (or uncertainty) conveyed by a probability distribution. The amount of overall uncertainty is measured by entropy, defined as $H = -\int f(x)\,\ln f(x)\,dx$, where $f(x)$ is the probability density. The two distributions in the first graph have entropies of 11.23 and 12.46 respectively. The smaller entropy of the first distribution reveals that it has less uncertainty, and hence is the more informative.
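As a check on the first entropy figure, the differential entropy $-\int f(x)\,\ln f(x)\,dx$ of a lognormal distribution has a closed form. The sketch below assumes that LogNormal(100K, 1.2) is parameterized by its median and geometric standard deviation, and that entropy is measured in nats (natural logarithm). Both are assumptions, although under them the result agrees with the 11.23 quoted above. The function name is illustrative.

```python
import math

def lognormal_entropy(median, gsdev):
    """Differential entropy (nats) of a lognormal given median and geometric std. dev."""
    mu = math.log(median)     # mean of ln(X)
    sigma = math.log(gsdev)   # standard deviation of ln(X)
    # Closed form: mu + 1/2 + (1/2) ln(2 pi sigma^2)
    return mu + 0.5 + 0.5 * math.log(2.0 * math.pi * sigma ** 2)

h1 = lognormal_entropy(100_000, 1.2)
print(f"Entropy of Expert 1's distribution: {h1:.2f} nats")
```

For a fixed median, a smaller geometric standard deviation gives a smaller entropy, which is the sense in which the narrower distribution is the more informative.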
Since each assessment conveys information, we ought to be able to obtain an aggregated distribution that is superior to each individual assessment. Dozens, if not hundreds, of methods for aggregating expert judgments have been proposed and studied (Clemen & Winkler (1999), French (2011)). The mixture of densities method, which traces back to Laplace, combines the assessed distributions as a weighted average. For equal weighting
Future_population[ Expert = ChanceDist(1/2,Expert) ]
which yields the following distribution
This combination seems perfectly reasonable, but presents a paradox. The average of assessments from two well-calibrated experts is not, in general, well-calibrated (Hora (2004)), and the entropy of this aggregated distribution, 12.09, is greater than the entropy of Expert 1's distribution. So by the criteria of calibration and informativeness, the aggregated distribution does not appear to be superior.
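The entropy comparison can be reproduced numerically. The sketch below builds an equally weighted mixture of two lognormal densities and integrates $-f \ln f$ on a log-spaced grid. Expert 1 uses LogNormal(100K, 1.2), read as median and geometric standard deviation. Expert 2's parameters are not given in the text, so the values used here (median 150K, geometric standard deviation 1.5) are a hypothetical stand-in, and the printed entropies for Expert 2 and the mixture will not match the article's figures exactly.

```python
import math

def lognormal_pdf(x, median, gsdev):
    """Density of a lognormal parameterized by median and geometric std. dev."""
    mu, sigma = math.log(median), math.log(gsdev)
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def entropy(pdf, lo=1e3, hi=1e8, steps=40_000):
    """Approximate -integral of f ln f (nats) by the midpoint rule on a log-spaced grid."""
    u_lo, u_hi = math.log(lo), math.log(hi)
    du = (u_hi - u_lo) / steps
    total = 0.0
    for i in range(steps):
        x = math.exp(u_lo + (i + 0.5) * du)
        f = pdf(x)
        if f > 0.0:                            # f ln f tends to 0 as f tends to 0
            total -= f * math.log(f) * x * du  # change of variable: dx = x du
    return total

def expert1(x):
    return lognormal_pdf(x, 100_000, 1.2)      # parameters from the text

def expert2(x):
    return lognormal_pdf(x, 150_000, 1.5)      # hypothetical stand-in, not from the text

def mixture(x):
    return 0.5 * expert1(x) + 0.5 * expert2(x) # equal-weight mixture of densities

h1, h2, h_mix = entropy(expert1), entropy(expert2), entropy(mixture)
print(f"Expert 1: {h1:.2f}  Expert 2: {h2:.2f}  Mixture: {h_mix:.2f}")
```

Because entropy is concave in the density, the mixture's entropy is never below the weighted average of the individual entropies, so it exceeds the entropy of the more informative expert whenever the two entropies differ.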
From a pragmatic and intuitive perspective, you will probably agree that the average of the two assessments is likely better than either of the original assessments. But how do we reconcile this intuition with the fact that it fails our well-calibrated criterion and scores lower on informativeness? Those criteria should remain a gold standard for individual assessments, but should they not apply to aggregated assessments?
One resolution to this paradox is to concede that the ideal of perfect calibration is unattainable. If both experts really were perfectly and absolutely calibrated, then we should go with Expert 1's distribution, and not the aggregate. But because we don't really believe this ideal is attainable, the real purpose of aggregation is to average out inaccuracies in calibration. We intuitively sense that averaging helps in this way, but the strict framework of calibration + informativeness fails to capture the subtleties of calibration errors. For example, one common cognitive bias that causes poor calibration is overconfidence, so we can view the increase in entropy of the aggregate as having the desirable effect of averaging out the overconfidence of Expert 1. Hora (2004) showed that aggregation can either increase or decrease the level of calibration in particular situations, but empirical studies have found that the combination is usually a better estimate, and better calibrated, than the single estimate from the best expert (Cooke (2007), Clemen (2008), Lin and Cheng (2009), French (2011)).
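Both effects can be seen in a small simulation. This is a toy model chosen for convenience, not the construction used by Hora (2004) or the other cited studies. It assumes a standard normal quantity and two experts who each observe it with independent standard normal noise, then report a normal distribution. A hypothetical `shrink` factor below 1 narrows the reported spread to mimic overconfidence. The sketch measures how often the truth lands in the central 50% interval of a single expert and of the equal-weight mixture; perfect calibration corresponds to 0.50.

```python
import math
import random

def norm_cdf(x, mean, sd):
    """Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def coverage(shrink, n_trials=20_000, seed=1):
    """Return (single-expert, mixture) hit rates for the central 50% interval."""
    rng = random.Random(seed)
    hits_single = hits_mix = 0
    for _ in range(n_trials):
        truth = rng.gauss(0.0, 1.0)
        signals = [truth + rng.gauss(0.0, 1.0) for _ in range(2)]
        # The calibrated posterior given one signal s is Normal(s / 2, sqrt(1/2));
        # shrink < 1 makes the reported distribution too narrow (overconfident).
        sd = shrink * math.sqrt(0.5)
        cdf_at_truth = [norm_cdf(truth, s / 2.0, sd) for s in signals]
        mix_cdf = 0.5 * (cdf_at_truth[0] + cdf_at_truth[1])
        hits_single += int(0.25 <= cdf_at_truth[0] <= 0.75)
        hits_mix += int(0.25 <= mix_cdf <= 0.75)
    return hits_single / n_trials, hits_mix / n_trials

single_cal, mix_cal = coverage(shrink=1.0)    # well-calibrated experts
single_over, mix_over = coverage(shrink=0.6)  # overconfident experts
print(f"Calibrated experts:    single {single_cal:.2f}, mixture {mix_cal:.2f}")
print(f"Overconfident experts: single {single_over:.2f}, mixture {mix_over:.2f}")
```

Under these assumptions, the mixture of two calibrated experts covers the truth more often than 50% of the time (it is underconfident), while the mixture of two overconfident experts moves the hit rate back toward 50%.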
- Robert Clemen and Robert Winkler (1999), Aggregation of Expert Probability Judgments, Methods 19(2):1-39.
- Clemen (2008)
- Roger M. Cooke (2007), Expert Judgment Studies, Reliability Engineering Systems Safety 93:655-777.
- Simon French (2011), Aggregating Expert Judgment, Revista de la Real Academia de Ciencias, Fisicas Y Naturales, Serie A. Matematicas 105(1):181-206.
- Steven C. Hora (2004), Probability Judgments for Continuous Quantities: Linear Combinations and Calibration, Management Science, 50:597-604.
- Shi-Woei Lin and Chih-Hsing Cheng (2009), The Reliability of Aggregated Probability Judgments Obtained through Cooke's Classical Model, J. of Modelling Management 4(2):149-161.
- Morgan and Henrion (1998), Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis, Cambridge University Press, second edition, esp. Chapter 6.