Analytica-provided statistics don't match extracted datapoint statistics
I have a model with inputs that are probability distributions that I entered using fractiles of the cumulative density functions. I am now trying to extract the datapoints from the CDFs (that Analytica made using median latin hypercube sampling with 1,000 samples) to use with R to produce boxplots of each distribution. However, I've noticed that the statistics (mean, median, etc.) Analytica provides for a distribution are different than the mean and median that I calculate in Excel and R from the datapoints.
I am struggling to figure out why these are different, if anyone has ideas I would greatly appreciate it!! And I can provide more info if it would help.
BTW, to create a box plot in Analytica (we call them Tukey Plots or Tukey Bar Plots), you can do the following.
- Add the OnGraphDraw Annotations library to your model. (From a diagram, select File / Add Library...., then select). Link.
- Bring up the Object Window for the variable you are plotting.
- Make the OnGraphDraw attribute visible. Easiest way, right-click in the left part of the window where attribute names appear. In the context menu, turn on OnGraphDraw.
- In the OnGraphDraw attribute, paste this expression:
continue := Plot_Tukey_bars( canv, info, roles, showMean:true )
- Check "after axes, before data". (doesn't matter whether "after fully drawn is checked. You can optionally uncheck it).
- Display the graph and select Probability Bands view.
- Optionally: In Uncertainty Settings (Ctrl+U), Probability bands, also check 0% and 100%.
The Wiki page on Tukey Plots, as well as the description of the Plot_Tukey_Bars function, give these instructions, and so these are the places to refer to when you want to do this.
Regarding the discrepancy in the stats and mean that you asked about, I assume that the second column in your screenshot that is labelled "Extracted from excel" are the stats computed by R. The difference is large, so it is unlikely to be due to a sampling error (i.e., where the 1000 random points aren't the same random points). Instead, I think you are using two totally different distributions. You didn't mention what distribution you are dealing it, but make sure that whatever distribution you are using uses the same parameterization in R as in Analytica, or adjust the parameters accordingly.
An example of this occurs with the LogNormal distribution. Analytica expects you to specify the median of your variable (X) for the median parameter, but I've seen other libraries that expect you to provide the median of Exp(X) for the median parameter. If this is the case in R, you might have to specify the distribution in R as LogNormal( Exp(median_X), Exp(gsd_X)), instead of the Analytica convention of specifying LogNormal( median_X, gsd_X ).
Oh I guess Analytica files are not supported attachments here - instead here is a screenshot of the inputs to one of the nodes if that helps?
The first column in your original screenshot contains the correct median and mean for the distribution defined in your second screenshot, i.e. the one defined using the Fractiles distribution function. I obtained the exact same values when using your Fractile distribution.
I guess that means there is an error in your Excel calculation that used the extracted CDF points.
I assume you'll need to debug your Excel logic. When doing so, debugging the median from the CDF points should be pretty easy. For example, look at the Table view of the CDF in Analytica (which, I assume, is the data you exported to Excel). Scroll down to here the Y column is 0.500 (or as close as it gets). The X column has the median. Since that one is already directly in the data, you can immediately detect when your Excel code comes up with something else. If the Excel code is computing the median correctly, it should simple play back that value (or a linear interpolation of the two straddling values at y=0.499 and y=0.503, for example). Perhaps one you figure out why you're getting a different median, it'll be the same problem causing the mean to be different. There isn't a direct way to read the mean off a CDF plot, so your Excel code must be doing something more sophisticated with the CDF points.