# How the strange Cauchy distribution proved useful

Share Now

On Tuesday I had an interesting exchange with Jorge Muro Arbulú, a professor in Peru, about the Cauchy distribution, which also called the Lorenzian distribution. Unlike most probability distributions you encounter, the mean and variance for strange distribution do not exist! How is it then, Jorge asked, that Analytica is able to compute the mean and variance? As he asked the question, he demonstrated Analytica computing the mean and variance in an example model. Not only is this an excellent question, but it also reminded me of how I encountered this distribution last year in a theorem by Martin Arjovsky and Léon Bottou that resolved a problem in a deep learning model that I had been debugging at the time. That’s pretty cool, because I find it unusual for such an oddity of a distribution to be relevant to a real problem, let alone one that I’m working on! The Cauchy distribution arises in resonance and spectroscopy applications in physics, but in my case it came to the rescue in a deep learning project. Read on and I’ll introduce you to this strange distribution, address Jorge’s question, and tell you about how it ended up relevant to machine learning.

## The Cauchy Distribution

The cumulative probability function for the Cauchy is given by The parameters of the distribution are m, the mode, and s, the scale. The Cauchy distribution describes the position of x in the following triangle when the angle a is uniformly distributed between –/2 and /2. From trigometry, you’ll remember that for this triangle, . Let ax be the angle corresponding to a particular x, then which gives us the cumulative probability function. The second equality above expresses the uniform distribution on a, and the third equality substitutes the trigometry relationship. The density is obtained by taking the derivative of F(x;m,s). It is easy to generate Monte Carlo samples from a Cauchy. Just select the angle a from a Uniform(90,90) — in degrees here, and solve for x — since Analytica’s tan wants the angle in degrees, use m+s*Tan(Uniform(90,90)).

## Mean and Variance undefined??

[Pillai & Meng, 2015] write

The Cauchy distribution is usually presented as a mathematical curiosity, an exception to the Law of Large Numbers, or even as an “Evil” distribution in some introductory courses.

Many of us may recall the surprise or even a mild shock we experienced the first time we encountered the Cauchy distribution. What does it mean that it does not have a mean?

I find the non-existance of the mean to be mindblowing. It isn’t a case where the mean is unbounded, it is that it really doesn’t exist, more akin to zero divided by zero. When I first wrote down the integral for the expected value as out popped a mean. So why do they say the mean doesn’t exist? Eventually, I had to ask my daughter to resolve this one for me. Take the same approach, but change the integration bounds: In both cases, were taking a limit to the same -∞ to ∞ bounds, but getting two different answers. And because of that, we say the mean doesn’t exist. In fact, for different choices of integration bounds, out pop different limits. The mean isn’t uniquely defined, you can essentially make it anything you want. I find it hard to wrap my head around that one. The variance diverges in a more obvious way, where it makes sense to talk about it having infinite variance. To use the Cauchy aka Lorenzian distribution in Analytica, add the Distribution Variations library to your model, and then use the function Lorenzian(m,s). Here is a Lorenzian(5,2) plotted from 1000 latin hypecube samples: From the statistics view, you can read off a mean of 5.00 and standard deviation of 63.25. So if the mean and variance don’t exist, how is it that Analytica was able to compute them? To understand the paradox, I ran a simple experiment. I generated 10 million independent random samples from the Lorenzian(5,2) distribution (monte carlo), and at each step, at each sample I computed the variance of the samples so far. Here is the running variance over the first 4M samples: When you run an experiment like this on a better behaved distribution, what you’ll usually see is the variance converging to its true value as sample size grows. Here though, there is a trend towards larger and larger variances as the experiment proceeds. Basically, the vast majority of the samples are between -10 and 10, but every now and then it gets a whopping outlier, so big that it pushes the variance up. Then as it goes for a while without another whopping outlier, additional small samples cause the variance to decrease until wham!, another big outliner drives it up even further. Near the 3.75Mth sample, it got one that was so big (actually big negative), it pushed the variance up so much it became hard to see the interesting detail in the <4M range: If you were to continue with this experiment to infinity, you could set any positive limit on the Y-axis — even at 10^100, and if run long enough, the variance will eventually plow through that limit! Here is the running mean during this experiment: With the mean, these wild outliers clearly cause large jumps, but there isn’t any clear net growth effect. It is hard to say whether it is converging to something. So found the mean of 1 billion samples and repeated that experiment 50 separate times. No single convergence value stands out, but it does seem interesting that the sample mean landed between 0 and 3 in all 50 trials. Anyway, coming back to Jorge’s original question — how is it that Analytica can compute the mean and variance if these don’t exist? We’ve seen that it is easy to sample from a Cauchy distribution, and that you can always compute the sample mean and the sample variance for any sample. This is what Analytica was doing. But neither sample mean nor sample variance is meaningful in the case of Cauchy, and if you were to repeat the experiment, or up the sample size, it would not compute same same estimates consistently.

## Relevance to Machine Learning

My wife will attest that my obsession with Generative Adversarial Networks (GANs) began last fall, since she had to put up with me talking about it constantly. The GAN is a deep learning architecture introduced by [Goodfellow et al., 2014] which learns to generate random samples from highly complex huge-dimensional complex probability distributions. Papers showing applications of GANs are being published at a phenomenal rate (see The GAN zoo). In a GAN, two deep learning networks are trained: G(z) and D(x). The first network, G(z), is called the generator and accepts a random vector z as input, and outputs a “sample” point, x∈Rⁿ, where n is usually very large, and where the n dimensions have a highly complex correlational structure. In other words, G(z) learns to map from a random vector z to a representative sample from the target distribution P(x) that training instances are drawn from. The second network, D(x), is called the discriminater, and its job is to detect whether x was sampled from the “real world” (actually sampled from P(x)), or is a fake generated by G(z). During training, G’s objective is to fool D, whereas D’s objective is to learn not to be fooled. The training process is thus an adversarial “game” between D and G. Training a GAN is notoriously difficult. In my own attempts, I witnessed good progress near the beginning of training. G started producing samples that looked fairly convincing and had pretty good coverage of the full sample space. But as training continued, G would fixate on the cases it was best at and suddenly stop producing samples other cases that it was less skilled at. For example, when simulating hypothetical loads on the electrical grid, initially it would generate data that resembled sunny days, data that resembled run-of-the-mill days, and samples that resembled windy days. But as training progressed, it seemed to figure out that it was really good at simulating the typical winter day and after a while that’s all it generated. Extremely few or no sunny days, run-of-the-mill summer days, windy days, etc. In the GAN field, this is called “mode dropping”. But really there are two phenomena here — one is the eventual absence of some “modes”, the other is the divergence from a reasonable representative “concept” to one that is less representative. The divergence issue was to me the most puzzling, because the objective function(s) promote the generation of samples that more closely match the true distribution. Given that the objective function points in that direction, why would it diverge? That was the mystery, and my key barrier to successful training. This became an impetus for me to do lots of theorizing, experimentation, reading and analysis (this was my obsession at the time, after all). The Cauchy distribution made its appearance with the first experimentally-consistent explanation for this behavior that I found, appearing in Theorem 2.6 of [Arjovsky and Bottou, 2017].

### Training on MNIST

The MNIST dataset contains 60,000 black & white images of hand-written digits, 28×28 pixels in size. The goal is to train a network, G, that can generate random images that are indistinguishable from actual hand-written digits. Furthermore, since MNIST contains the same proportion of each of the 10 digits (0 thru 9), the learned network ought to generate images of “3” with the same frequence as images of “1”s. GANs don’t just memorize the training data, they learn to generate representative samples that they’ve never seen before, but which appear to come from the target distribution. Notice here that the sample space is R784, i.e., 784 dimensions. I conducted a variety of experiments using two 3-layer Linear-ReLu nets, D and G. In the base case, D was alternatively exposed to a true training image and then to a G(z) for a uniform random z vector using the currently trained generator up to that time, and then G was updated by backpropagating through the generator. This was standard stuff direct from the original paper

They are starting to become recognizable, and there apperas to be a fairly reasonable sampling of nearly every digit. By 200 epochs, it has refined them even further:

but then things went downhill from there. As it progressed, it showed a propensity for “1”s, “7”s and “9”s, and I saw the other digits appear less and less frequently, until by the 1000th epoch, it was completely fixated on its favorites.

To aid with experimentation, I wanted a way to automatically measure the coverage of the different digits as training progressed and as I compared variations on network architectures, algorithms, and parameters. For that, I independently trained a Convolutional Neural Network (CNN) on the MNIST dataset, and used it to classify G’s output into one of the 10 digits. To measure the amount of coverage, as training progressed, I periodically had G generate 100 random images, classified them into digits, computed the entropy and graphed it. Perfectly uniform coverage of all 10 digits would have an entropy of Ln(10) = 2.3. The graph makes the good initial coverage followed by the gradual decline with more training apparent. At the time, [Arjovsky & Bottou, 2017] hadn’t been published yet, but there were a lot of other people exploring this same thing and a lot being written. One point of broad consensus at the time seemed to be that these phenomena occurred when G trained faster than D (we now know that if anything, this wisdom was backward). So I conducted experiments to test these theories in various combinations. For example, for every 1 epoch of G, train D for 2 epochs (blue) or train D for 5 epochs (green): Not only did this collapse faster, they collapsed more thoroughly to the point where they only generated “1”s.  I conducted numerous other experiments to test many other hypotheses and alleged wisdom by others in the field, but they all experienced this mode collapsing divergence, until…

### Testing the Cauchy hypothesis

The Cauchy result from Theorem 2.5 and 2.6 of [Arjovky & Button, 2017] sent a strong signal that you shouldn’t use the -log(D(G(z)) alternative from [Goodfellow et al, 2014]. To that point, I had not actually explored the impact of that approximation. In my final MNIST experiment reported here, I returned to the original base-case model, but replaced the -Log(D(G(z)) objective term with the vanilla Log(1-D(G(z))) objective — i.e., a straight Jenson-Shannon divergence metric. I cal this “D1G1 pure”, since D and G are each updated once per epoch and the objective is the pure one. The result is the red line. Notice that the “pure” Jensen-Shannon divergence metric shows no divergence in coverage.  As predicted by [Goodfellow et al, 2014], it did start out a bit slower, but that was far offset by the stability. The next two images show generated samples after 100, 500, and 1000 epochs respectively.

In short, their Cauchy theorem seems to have pinpointed the underlying cause for the divergence in my own experiments.

## Summary

The Cauchy distribution, also called a Lorenzian distribution, is a strange creature. It’s primary application seems to be for teaching students of probability that some distributions don’t have a well-defined mean and variance. The original question was: How is it that Analytica can compute the mean and variance when these don’t actually exist? The resolution to this paradox is that it is easy to generate samples from a Cauchy, and you can always compute the sample mean and sample variance. But as you increase sample sizes, these don’t stabilize. This distribution is strange, which is why I found it so interesting when I saw it appear last year in the context of training Generative Adversarial Networks, in what actually turned out to be a very useful result for a pragmatic problem I was working on.

## References

Share Now

### 1 thought on “How the strange Cauchy distribution proved useful”

1. Lonnie …

does a .pdf version of this Cauchy distribution blog post exist … ?

i would be interested in that version since it seems that regardless
of my choice of browsers and settings therein, i am unable to get
the graphics and images to load for the version at this URL:

How the strange Cauchy distribution proved useful – Analytica (lumina.com)

perhaps there is some other issue … ?

in any case, please count me as quite interested in consuming
the content as it was intended to be presented.

thanks,
cdm

Scroll to Top

Automated page speed optimizations for fast site performance