Galaxy Zoo Starburst Talk

Binning - how to decide which method is best to use?

  • JeanTate

    Consider these two (sets of) plots [1]:

    [Image: first plot]

    [Image: second (set of) plots]

    They illustrate two different ways of binning:

    • in the first, the bins are each 0.3 dex (in log_mass) wide [2], and the x-axis values of the plotted points are bin mid-values (sorta) [2]; the number of objects in each bin is an output

    • in the second, the width of each bin is an output - the QS bins are created by having equal numbers of objects in each bin, those bins are then used to bin the QC objects - and the x-axis values of the plotted points are means.
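As a rough illustration of the two approaches described above, here is a minimal Python sketch. It uses synthetic log_mass values rather than the real QS/QC catalogues, and the function names (`fixed_width_edges`, `equal_count_edges`, `count_in_bins`) are made up for this example. The 0.3 dex width and the 10.0 to 11.2 range come from the description of the first plot; the open-ended first and last bins are left out for brevity.

```python
import bisect
import math
import random

def fixed_width_edges(lo, hi, width):
    """Bin edges spaced `width` apart from lo up to (at least) hi."""
    n_bins = math.ceil((hi - lo) / width - 1e-9)  # tolerance for float round-off
    return [lo + i * width for i in range(n_bins + 1)]

def equal_count_edges(values, n_bins):
    """Bin edges chosen so each bin holds (nearly) the same number of values."""
    s = sorted(values)
    inner = [s[(i * len(s)) // n_bins] for i in range(1, n_bins)]
    return [s[0]] + inner + [s[-1]]

def count_in_bins(values, edges):
    """Count values per bin; bins are [left, right), the last also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        if edges[0] <= x <= edges[-1]:
            i = min(bisect.bisect_right(edges, x) - 1, len(counts) - 1)
            counts[i] += 1
    return counts

# Synthetic stand-ins for the QS and QC log_mass values (NOT the real data).
random.seed(0)
qs = [random.gauss(10.6, 0.33) for _ in range(600)]
qc = [random.gauss(10.5, 0.40) for _ in range(600)]

# Method 1: bin width is the input, counts per bin are the output.
fw_edges = fixed_width_edges(10.0, 11.2, 0.3)
fw_counts = count_in_bins(qs, fw_edges)

# Method 2: counts per bin are the input, bin widths are the output.
eq_edges = equal_count_edges(qs, 6)
eq_counts_qs = count_in_bins(qs, eq_edges)  # equal by construction
eq_counts_qc = count_in_bins(qc, eq_edges)  # QS-derived bins reused for QC
eq_widths = [b - a for a, b in zip(eq_edges, eq_edges[1:])]

print("fixed-width counts:", fw_counts)
print("equal-count widths:", [round(w, 2) for w in eq_widths])
print("QC counts in QS bins:", eq_counts_qc)
```

With equal-count bins the sparse tails of the distribution get wide bins and the crowded middle gets narrow ones, while the QC counts in those same bins are generally unequal.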

    And there are surely other ways to 'bin data'! 😃

    My question: how do you decide which method of binning to use, for a particular analysis/presentation of results?

    Assume, in this case, that we want to show there is a clear 'mass dependent merger fraction' difference between the QS and QC objects. Which approach would be best (and why) [3]?


    [1] They're both from the Mass Dependent Merger Fraction (Control vs Post-quenched Sample) thread: the first is by jules, on p5, about half-way down (December 5 2013 5:29 PM); the second (set) is by me, last post on p8 (May 1 2014 11:23 AM)

    [2] The first bin is log mass < 10.0, the last > 11.2; the x-axis values are rounded up (and the first and last are somewhat arbitrary)

    [3] And ignore any other issues, e.g. what threshold to use, how to derive 'error bars' (and present them), ...


  • KWillett (scientist)

    Hi Jean,

    Good question. I usually don't consider equal numbers the optimal way to bin data; you might end up with equal numbers of objects in each bin, but only if your distribution is symmetric and well-sampled across the range that you're looking at.

    I'll say straightaway that there's no optimal, mathematically well-defined way to choose bin sizes and locations. My default in plotting routines is to use Scott's Rule, which is a good choice for normally distributed samples of random data (so a reasonable stab at a starting point). Here the bin size is 3.5*sigma / N^(1/3), where sigma is the measured standard deviation of the data and N is the number of data points. How does that work?
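For anyone who wants to try the rule on their own numbers, here is a minimal sketch of it in Python. The function name `scott_bin_width` is made up for this example, and using the sample standard deviation (rather than the population one) for "measured standard deviation" is an assumption.

```python
import statistics

def scott_bin_width(values):
    """Bin width from Scott's Rule: 3.5 * sigma / N^(1/3)."""
    sigma = statistics.stdev(values)  # sample standard deviation (assumed)
    n = len(values)
    return 3.5 * sigma / n ** (1.0 / 3.0)

# Tiny made-up data set, just to show the call.
example = [1, 2, 3, 4, 5, 6, 7, 8]
print(scott_bin_width(example))
```

Because the width shrinks only as N^(1/3), going from 1000 to 8000 points halves the bin width rather than dividing it by eight.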


  • JeanTate, in response to KWillett's comment

    Thanks Kyle.

    "I usually don't consider equal numbers the optimal way to bin data; you might end up with equal numbers of objects in each bin, but only if your distribution is symmetric and well-sampled across the range that you're looking at."

    I'm not sure what this means (you use "equal numbers" twice, to me in apparent contradiction).

    "Here the bin size is 3.5*sigma / N^(1/3), where sigma is the measured standard deviation of the data and N is the number of data points."

    If I have 1083 objects, and am looking to bin by log_mass, I first find the value of sigma for the 1083 log_mass values: it's 0.3327. The bin size is thus 0.113 (to three significant figures). Depending somewhat on how I define the first (and/or last) bin, I need ~18 bins for all 1083 objects (the full log_mass range is ~2.1).
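The arithmetic above can be checked with a few lines of Python. The inputs (sigma = 0.3327, N = 1083, range of about 2.1) are the values quoted in this post; rounding the bin count up with `ceil` is an assumption about how partial bins at the ends are handled.

```python
import math

sigma = 0.3327     # standard deviation of the 1083 log_mass values (from the post)
n_objects = 1083   # number of objects (from the post)
full_range = 2.1   # approximate full log_mass range (from the post)

bin_width = 3.5 * sigma / n_objects ** (1.0 / 3.0)
n_bins = math.ceil(full_range / bin_width)  # round up so the whole range is covered

print(round(bin_width, 3), n_bins)
```

The width comes out at 0.113, and the range divided by the width is about 18.5, so the count lands on 18 or 19 depending on where the first bin starts.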

    "How does that work?"

    I'll have a go, and let you know! 😃


  • JeanTate, in response to JeanTate's comment

    "I'll have a go, and let you know! 😃"

    [Image: merger fraction vs log_mass, QS and QC, using Scott's Rule bins]

    There are 19 QS bins, and 17 QC ones; the lowest mass QC object is waaay below the lowest mass QS one, but there are no QC objects in the lowest mass QS bin (nor the highest). Error bars are "Bayes", and merger fractions are "Beta means" with a (0.5,0.5) prior. You can barely make it out, but there's a cyan line in the plot; that's QS, with merger fraction as N(merger)/N(total) for each bin.
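To show how a "Beta mean" with a (0.5,0.5) prior differs from the plain N(merger)/N(total) fraction, here is a minimal sketch. The counts are hypothetical (not taken from the QS or QC samples), the function names are made up for this example, and the posterior standard deviation is only a stand-in for a spread measure, since how the "Bayes" error bars were actually computed is not spelled out here.

```python
import math

def raw_fraction(k, n):
    """Plain merger fraction: N(merger) / N(total)."""
    return k / n

def beta_mean_fraction(k, n, a=0.5, b=0.5):
    """Posterior mean of the fraction under a Beta(a, b) prior."""
    return (k + a) / (n + a + b)

def beta_posterior_sd(k, n, a=0.5, b=0.5):
    """Posterior standard deviation of the fraction under a Beta(a, b) prior."""
    a_post = k + a
    b_post = n - k + b
    total = a_post + b_post
    return math.sqrt(a_post * b_post / (total ** 2 * (total + 1)))

# Hypothetical (mergers, total) counts: two sparse bins and one well-filled bin.
for k, n in [(0, 3), (3, 3), (30, 120)]:
    print(k, n,
          round(raw_fraction(k, n), 3),
          round(beta_mean_fraction(k, n), 3),
          round(beta_posterior_sd(k, n), 3))
```

In sparse bins the Beta mean is pulled away from 0 or 1 toward the middle, and the spread is large, which is one reason the points at the two mass extremes jump around; in well-filled bins the two estimates nearly coincide.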

    So, does this work?

    Myself, I can't see how it does ... at the two extremes it just looks like noise (I know it's not, but it looks like it). In between I'm not sure the increased 'x-axis resolution' adds anything much to what's already clearly shown in the '6 bins' plot above. And perhaps it suggests a 'rabbit hunt', "oh look! there are some values of mass for which the two curves make odd excursions; let's take a deeper look!" (and so we disappear down a rabbit hole, never to be seen again).

    What do you, reader, think?
