Galaxy Zoo Starburst Talk

Quench: Sample vs Control, what's the same, what's different

  • JeanTate

    Per the main Quench project page, the Control catalog matches the Sample in that for every Sample galaxy there is a Control galaxy with a very similar redshift and Log_mass:

    There are 3002 post-quenched galaxies in our sample. For each galaxy, we identified a non-post-quenched galaxy with a similar total stellar mass and redshift. These additional 3002 galaxies will serve as our control sample.

    My first attempt at verifying this involves a) removing outliers, b) grouping the remaining objects into 12 redshift bins, with the same number in each, c) plotting the bin means against each other. Here's what I got; the straight line is equality1:

    [plot: QS vs QC redshift bin means; the line shows equality]

    That's a simple consistency check; I think you'll agree it passes.
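    The binning recipe in a), b), c) is easy to reproduce outside Tools. Here's a minimal Python sketch (using NumPy; the toy data below are made up, not the actual Quench catalogs):

    ```python
    import numpy as np

    def equal_count_bin_means(values, sort_key, n_bins=12):
        """Rank objects by sort_key, split them into n_bins with (nearly)
        equal membership, and return the mean of `values` in each bin."""
        order = np.argsort(sort_key)
        # np.array_split lets bin sizes differ by at most one member
        bins = np.array_split(np.asarray(values)[order], n_bins)
        return np.array([b.mean() for b in bins])

    # toy example: bin a catalog's redshifts by redshift itself
    z = np.linspace(0.02, 0.26, 240)
    means = equal_count_bin_means(z, z, n_bins=12)
    ```

    Doing this for the QS and QC catalogs separately, then plotting one set of bin means against the other, gives plots like the one above.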

    There are - as of now - very few other fields (parameters) one can check; for example, none of the QC fluxes have been uploaded yet.

    One is Petro_R50, a measure of size. Here's the equivalent plot for that (note that it goes in the opposite direction; nearby galaxies - lower redshift - are at the top right):

    [plot: QS vs QC Petro_R50 bin means]

    UPDATE: Here's a version of the same data, but with the direction of the axes reversed, so 'the bins go in the same direction' as in the other two plots:

    [plot: Petro_R50 bin means, axes reversed]

    Except in the first redshift bin, QS galaxies are smaller than the corresponding QC ones; in fact, from a redshift of ~0.095, QC galaxies are all of about the same size, while the QS galaxies keep getting smaller.

    Interesting, eh?

    [plot: QS vs QC r-band magnitude bin means]

    That's the r-band magnitude. Because they're grouped into redshift bins, this is a proxy for absolute magnitude (modulo k-correction, galactic extinction, and maybe more). Like the size plot, the direction is the reverse of that in the redshift plot (bottom left is bright; top right is faint).

    Here the trend is different: in ~four bins, the QS galaxies have about the same luminosity as the QC ones; for the rest, the QS galaxies are brighter. As redshift increases (as we look back further into the past), the QS galaxies become ever brighter (cf their QC counterparts).

    Even more interesting, eh? For constant stellar mass, QS galaxies are smaller and more luminous than QC galaxies, from z~0.095 out (cosmologically younger, more distant).

    Of course, this is very preliminary ... much checking (etc) to do yet ...


    Details of what outliers I removed? You have only to ask! The number of galaxies in the plots is 2943 (QS) and 2937 (QC). Log_mass = -1 galaxies NOT excluded. There are 245±1 galaxies in each bin.

    1 Anyone know how to do this, or something similar, with Tools?

  • JeanTate

    One more for this series (12 redshift same-number-of-members bins), the colors; in this case, (g-r):

    [plot: QS vs QC (g-r) bin means]

    With the exception of the first three bins, QS galaxies are bluer than their QC counterparts. There may be a weak trend for this color difference to become more pronounced as redshift increases.

  • JeanTate

    Instead of binning by redshift, what about binning by Log_mass?

    There's a slight complication: 112 QS objects have Log_mass = -1, but only five QC ones do. Excluding these (and the same outliers excluded in the 'redshift bins' analysis) means that the number of objects per bin won't be the same, QS vs QC: only 237 (or 238) vs 244 (or 245). Does this matter? At some level, probably; however, I'll leave working that out for later (there are plenty of other refinements to be made anyway).

    So, the same four plots, only this time with Log_mass bins.

    First, redshift:

    [plot: QS vs QC redshift means, Log_mass bins]

    Wow! Although the binning was done by Log_mass, not redshift, the two catalogs match, almost perfectly! Further, the bin means are very nicely distinct. This isn't a consistency check; can you see why?

    Next, size:

    [plot: QS vs QC size (Petro_R50) means, Log_mass bins]

    Similar trend as above; QS galaxies tend to be smaller than their corresponding QC ones, for the same Log_mass. And, for the biggest eight Log_mass bins, the QS galaxies have about the same size, while the QC ones get smaller (keep in mind that 'size' here is apparent size, 'on the sky'; while it is related to physical size, there's some work to do to convert 'apparent size' to 'physical size').

    I'm going to change the order here; next is the (g-r) color plot:

    [plot: QS vs QC (g-r) means, Log_mass bins]

    Yikes! In the three lowest-mass bins, the galaxies - as represented by the bin means - have the same color, QS vs QC; but from then on, QS galaxies become increasingly blue (compared with their QC counterparts).

    Saving the best for last, r-band magnitude:

    [plot: QS vs QC r-band magnitude means, Log_mass bins]

    In all but two (four) Log_mass bins, QS galaxies are brighter ... but there's no other (apparent) pattern!

  • JeanTate

    Here are the Log_mass bin means, QS vs QC, for binning done by Log_mass:

    [plot: QS vs QC Log_mass bin means, binned by Log_mass]

    That's the equivalent consistency check to the first plot above (redshifts, by redshift).

    And here is the same plot, but with the binning done by redshift:

    [plot: QS vs QC Log_mass bin means, binned by redshift]

    That's the equivalent of the first plot in the third post in this thread.

    Consistency? Check! 😃

  • JeanTate

    In at least one post - which I can't find! - a zooite commented that, after an initial drop-off of lower-mass galaxies, the redshift-log_mass scatterplot flattens, becoming essentially horizontal (i.e. log_mass constant, with no - empirical! - dependence on redshift).

    If you bin the data, as I have done above, you find that this apparent trend is just that, apparent.

    Here's a redshift-log_mass plot, with both the redshift and log_mass bins. By design, the redshift range is greater for the redshift bins, and the log_mass range greater for the log_mass bins. However, it is obvious - in both sets of bins - that the mean log_mass continues to increase as redshift does ... over the entire range of both redshift and log_mass. Keep in mind, though, that a) at least some outliers have been excluded, and b) the number of 'missing mass' galaxies is far greater in QS (112) than in QC (5):

    [plot: redshift vs log_mass, with both bin sets]

  • lpspieler (moderator), in response to JeanTate's comment.

    I wrote about the redshift-logmass relationship here.

    I didn't call it flattening, though; I just described the sharp concentration of log mass between about 10 and 11, which increases towards higher redshifts.

    It wouldn't be a surprise, however, if I had said something about flattening: because the lower-value filters in scatterplots don't work, I wasn't able to remove striking outliers, which changed the plot scale significantly.

  • JeanTate, in response to lpspieler's comment.

    Thanks! 😃

    I read that post of yours (but forgot about it, and couldn't find it again easily); however, it isn't the one I remembered ... that was one with - I think! - a Tools scatterplot. Now that Tools is working again, I may try to find it.

    Removing outliers and binning makes mean trends much more obvious. I wonder if it's possible to produce a plot like the last one of mine, using Tools? In its current version, I very much doubt it (but would love to be proven wrong).

  • lpspieler (moderator)

    Hi Jean,

    I now got around to creating tables filtered for the log mass outliers.
    Take a look at my dashboard:
    http://tools.zooniverse.org/#/dashboards/galaxy_zoo_starburst/52126ef4e847263f86000072

    Now at least the trend becomes more obvious.
    Will update the tables in that dashboard with further filters for outliers.

  • JeanTate, in response to lpspieler's comment.

    Hmm, I could at least load your Dashboard, lpspieler, but the relevant (top) scatterplot was blank!

  • lpspieler (moderator), in response to JeanTate's comment.

    I have the same problem too. With most of my dashboards I have to rebuild them by hand: trigger the re-build of the tables that depend directly on the data by changing some settings back and forth, then re-build the tables depending on those first-line tables, and so on...

    I first thought that maybe my computer was too slow. But apparently I'm not the only one having that problem. I already informed the team about it.

  • jules (moderator), in response to lpspieler's comment.

    All your dashboard plots are blank for me - and I waited the customary 5 minutes for things to (not) load. 😦

  • JeanTate

    In the Characterising classification biases thread (link takes you to my last post there), I wrote posts on two charts which show the way the 'spiral fraction' and 'roundness' vary with redshift, for both QS and QC catalogs.

    [plot: 'spiral fraction' vs redshift, QS and QC]

    Duplicates, objects classified as Star/artifact, and redshift outliers (two) were removed, except for the first plot. I then ranked them by redshift, and divided them into 24 equal bins. The mean redshift ranges from 0.019 (bin #1) to 0.255 (bin #24).

    Within each bin, I calculated the 'spiral fraction': the number of objects classified as 'Features or disk' divided by the total number in the bin. And then plotted them. The black line is the combined fraction.

    Does anyone know of a good statistical test to apply, to tell if the distributions are, statistically speaking, different? Particularly for bins #18 to #24.
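    As a crude first pass, the counts in a single bin can be compared with a two-proportion z-test (pooled variance). A sketch, with hypothetical counts standing in for the real bin tallies - though note that testing 24 bins one at a time invites multiple-comparison trouble:

    ```python
    import math

    def two_proportion_z(k1, n1, k2, n2):
        """z statistic for the difference between two sample proportions,
        using the pooled estimate of the common proportion."""
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)                    # pooled proportion
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # hypothetical tallies: 40 of 125 QS vs 25 of 125 QC classified
    # 'Features or disk' in one redshift bin
    z_stat = two_proportion_z(40, 125, 25, 125)      # |z| > ~2 hints at a real difference
    ```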

    [plot: mean 'roundness' vs redshift, QS and QC]

    The y-axis here is something which may be close to the mean ellipticity, defined as 10*(1-b/a), where a and b are the major and minor axis lengths, respectively. I equated "Completely round" to 0, "In between" to 3, and "Cigar shaped" to 6. I have downloaded the various AB parameters for a sample of objects, and so can calibrate zooite classifications. I'll modify the transformation accordingly, later.

    Zooites found that the control galaxies are more rounded than the corresponding QS ones1, in almost all redshift bins. I do not know what statistical tests to apply, to see how significant this is (I guess it almost certainly isn't, for bins #17 to #24).

    1 The QC catalog was chosen so that every QS object is paired with a control one, with the same stellar mass and a redshift that differs by less than 0.02. However, I did not look into the effect of the loss of some pairings, due to excluding duplicates, outliers, and objects classified as Star/artifact.

  • mlpeck, in response to JeanTate's comment.

    Does anyone know of a good statistical test to apply, to tell if the distributions are, statistically speaking, different? Particularly for bins #18 to #24.

    Logistic regression would provide an appropriate statistical framework. You have one continuous predictor (redshift) and one categorical one (membership in QS/QC).

    Alternately I suppose you could view this as an exercise in contingency table analysis with the redshift bins treated as a categorical variable. However the bin selection might affect the results!

  • JeanTate, in response to mlpeck's comment.

    Thanks! 😃

    I had not heard of logistic regression before; time to go learn something new.

    One downside to a contingency table analysis is the temptation to select and redefine bins (even with several iterations), searching for some reasonable-seeming selection that results in a conclusion of "Yay, statistically significant!" 😮 I'm sure there's a term for this, and - very likely - strong words of caution about not going down that path ...

  • JeanTate, in response to mlpeck's comment.

    [plot: 'not symmetric' fraction vs redshift, QS and QC]

    Unlike the previous two plots, this one is pretty clear: QS objects were classified as 'not symmetric' ("Does the galaxy appear symmetrical?" "No") more often than were their control counterparts (except in the first four redshift vicesimoquartiles/bins). Also, from a minimum at around z~0.06 (bin #5), the asymmetry fraction rises. This trend may also occur in the controls, but perhaps not until z~0.12 (bin #15).

    At first glance, these trends seem inconsistent with the gradually declining 'roundness', with redshift, in the plot in my last post: how can galaxies appear both more round - with increasing redshift - and also more asymmetric?

  • JeanTate, in response to mlpeck's comment.

    [plot: edge-on fraction vs redshift, QS and QC]

    This is, in some way, the 'spiral' counterpart to the roundness plot (above).

    What's plotted is the fraction of galaxies classified as edge-on1 ("Could this be a disk viewed edge on?" "Yes"), of those classified as having features or a disk ("Is the galaxy simply smooth and rounded, with no sign of a disk?" "Feature or disk").

    It's a very noisy signal, so compressing it into six bins (sextiles?) or twelve (bi-sextiles?!? 😮) seems like a good idea.

    However, for the first few bins, it seems a greater fraction of the 'disk' QS galaxies are edge-on than of the controls; for the rest, the fractions are much the same. Also, the Eos fraction decreases with increasing redshift.

    Perhaps the high fraction at low redshift, among QS galaxies, is partially due to them being clumps in edge-on disks (and not the main part of the galaxy, i.e. the bulge and central parts of the disk)?

    1 "Eos" = "Edge-on spiral"

  • JeanTate, in response to mlpeck's comment.

    [plot: mean 'Merger score' vs redshift, QS and QC]

    Very much a WIP (work in progress).

    If the classification was "Neither" ("Is the galaxy merging, or is there any sign of tidal debris?"), the 'Merger score' is zero. If "Both", 3; if "Merging", 2; and if "Tidal debris", 1. That's for each classification (this question is in both main branches of the decision tree; only if the first answer is "Star/artifact" is this question not asked). What's calculated is the mean, in each redshift vicesimoquartile/bin, for QS and control objects separately.
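    The scoring is just a lookup from answer to number, averaged per bin. A sketch (the answer strings are paraphrased from the decision tree, not exact database values):

    ```python
    # 'Merger score' per classification, as described above
    MERGER_SCORE = {"Neither": 0, "Tidal debris": 1, "Merging": 2, "Both": 3}

    def mean_merger_score(answers):
        """Mean 'Merger score' over a bin's classifications; answers not in
        the lookup (e.g. from Star/artifact branches) are skipped."""
        scores = [MERGER_SCORE[a] for a in answers if a in MERGER_SCORE]
        return sum(scores) / len(scores) if scores else float("nan")
    ```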

    Except for the lowest redshift bins (and the last one), there seems to be more merger activity among the QS galaxies than among their control counterparts. And that activity increases with redshift for QS galaxies (albeit with some wild excursions), but not for the controls (except, perhaps, in bins #18 to #24).

    But those trends may be artifacts of the scoring system ... stay tuned! 😃

  • JeanTate, in response to JeanTate's comment.

    I have downloaded the various AB parameters for a sample of objects, and so can calibrate zooite classifications.

    The details are in another thread, here (link takes you directly to the post with the details 😉). The sample is the tenth redshift vicesimoquartile (a.k.a. 'bin'); all told there are 177 'smooth' galaxies (both QS and QC), of which 51 are 'Completely round', 112 'In between', and 14 'Cigar shaped'. The mean AB values for each are, respectively, 0.84, 0.54, and 0.23 ... which suggests that a) 'cigar shaped' galaxies are, in fact, Eos (edge-on disk galaxies), and b) the (0, 3, 6) values I assumed are too low.

    But before revising the transformation so drastically, maybe it's worth checking AB values for a wider range of redshifts?
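    Applying the same 10*(1-b/a) transformation to the quoted mean AB values makes the mismatch with the assumed (0, 3, 6) scores explicit:

    ```python
    # mean AB (axis ratio b/a) per roundness class, from the sampled bin
    mean_ab = {"Completely round": 0.84, "In between": 0.54, "Cigar shaped": 0.23}

    # the assumed scores were (0, 3, 6); 10*(1 - b/a) implies instead:
    implied = {k: round(10 * (1 - v), 1) for k, v in mean_ab.items()}
    # implied: 1.6 ('Completely round'), 4.6 ('In between'), 7.7 ('Cigar shaped')
    ```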

  • JeanTate, in response to mlpeck's comment.

    Does anyone know of a good statistical test to apply, to tell if the distributions are, statistically speaking, different? Particularly for bins #18 to #24.

    Logistic regression would provide an appropriate statistical framework. You have one continuous predictor (redshift) and one categorical one (membership in QS/QC).

    I had this on the backburner, but yesterday looked further into it (with the help of a friend who does logistic regression in her sleep!).

    [plot: logit of 'spiral fraction' vs redshift, with linear fits]

    The data are as above, for 'Spiral fraction' (sorry I didn't label the plot); the x-axis is redshift (the means of each of the bins), the y-axis is the 'logit' - if I understand the term correctly, the log (LN) of the 'odds' - and the lines are linear trends (least-squares fits). Yes, the logit regression parameters should be estimated using a 'maximum likelihood estimator' (I think), but I read that, for data like this, the difference is far too small to worry about.

    [plot: fitted models with the data]

    And that's the models plotted with the data.

    Clearly there is a difference at high redshift ... but how to work out whether it's significant or not? I read that there's a chi-square statistic which is relevant, but how to calculate it?

    It certainly looks like a really cool technique - and it's much easier to do than I had imagined (logistic regression that is) - but in addition to 'how do you test for statistical significance?', I am curious how to deal with datapoints like the QS bin #18 and QC bin #24:

    [plot: QS bin #18 and QC bin #24 datapoints]

    If the 'odds' are zero, the log is negative infinity, not a 'value' you can easily use to estimate the alpha and beta parameters! 😛
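    One standard dodge for the zero-count problem is the 'empirical logit' (the Haldane-Anscombe correction): add 1/2 to both counts before taking the log of the odds, which keeps bins with zero (or all) 'successes' finite. A sketch:

    ```python
    import math

    def empirical_logit(k, n):
        """Log-odds of k 'successes' in n trials, with the +1/2 correction
        so that k = 0 or k = n still gives a finite value."""
        return math.log((k + 0.5) / (n - k + 0.5))

    # a bin of 245 galaxies with zero 'Features or disk' classifications
    # now maps to roughly -6.2 instead of negative infinity
    finite = empirical_logit(0, 245)
    ```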

  • mlpeck

    You actually don't need to bin data to do logistic regression - the dependent variable can be binary valued with continuous or categorical predictors. Here's the R version of a logistic regression of spiral fraction on redshift and sample:

    summary(logit.spiral.all)
    
    Call:
    glm(formula = spiral.all ~ logz.all + samp * logz.all, family = binomial("logit"), 
        na.action = na.exclude)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.7097  -0.7985  -0.7122   1.3098   2.0373  
    
    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)     -1.9904     0.1652 -12.047  < 2e-16 ***
    logz.all        -0.9881     0.1528  -6.465 1.01e-10 ***
    samp1           -0.7011     0.2349  -2.984  0.00284 ** 
    logz.all:samp1  -0.5463     0.2148  -2.544  0.01097 *  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 6929.1  on 6000  degrees of freedom
    Residual deviance: 6776.9  on 5997  degrees of freedom
    AIC: 6784.9
    
    Number of Fisher Scoring iterations: 4
    

    The model here says the log-odds is a linear function of log(z) with a different intercept and slope allowed for the quench (vs. control) sample. The intercept and slope differences for the quench sample are significantly different from 0, but not by an impressive amount.

    Here is a graph of the model. The broken lines are the predicted probabilities (not log odds) against redshift - red is the quench sample. Actual proportions in binned intervals are also plotted with uncertainties estimated from binomial statistics. I just used deciles here.

    [plot: predicted probabilities and binned proportions vs redshift]

    I also tried redshift instead of log-redshift as a predictor and got similar numerical results to JeanTate's above. Either way the qualitative behavior is similar.
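    To turn point estimates like those into predicted probabilities by hand, invert the logit. A sketch using the coefficients from the output above - note I'm assuming logz.all is simply the natural log of redshift, which may not match the actual variable definition used in the fit:

    ```python
    import math

    def inv_logit(x):
        """Inverse of the logit: maps log-odds back to a probability."""
        return 1.0 / (1.0 + math.exp(-x))

    def p_spiral(z, quench):
        """Predicted 'Features or disk' probability from the glm point
        estimates (ASSUMED predictor: natural log of redshift)."""
        logz = math.log(z)
        eta = -1.9904 - 0.9881 * logz        # control-sample intercept and slope
        if quench:                           # extra terms for the quench sample
            eta += -0.7011 - 0.5463 * logz
        return inv_logit(eta)
    ```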

  • JeanTate, in response to mlpeck's comment.

    Wow, thank you very much, mlpeck! 😄

    So far in Quench I've been using a spreadsheet to do my analyses (other than when I've used Zoo Tools); would you say it would be extremely difficult to do the logistic regression (etc) analysis you just posted, using nothing more than a basic spreadsheet (i.e. one that does not have this sort of statistical tool built-in)?

    But you've now given me a mighty powerful reason to explore R (not that what you've earlier posted wasn't enough)!

    Turning to Eos fraction: unlike 'spiral fraction', there's at least a soft, physically-motivated limit, as long as you assume disk galaxies have - in a big enough sample, plus a few other caveats - randomly distributed inclinations. "Soft"? Yes, because disks are not infinitely thin, nor are all disks the same thickness (however defined). So the model would be more constrained than just 'linear function with z' (or log(z)); there would be a constraint on the intercept, a narrow range of values (to accommodate varying disk thicknesses).
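    That soft limit can be made quantitative. For randomly oriented disks, cos i is uniform on [0, 1], and an oblate disk of intrinsic (edge-on) thickness q0 has apparent axis ratio q = sqrt(cos²i (1 - q0²) + q0²); the expected edge-on fraction for a classification cut at apparent axis ratio q_cut is then (a sketch - the q0 and q_cut values are illustrative assumptions, not measurements):

    ```python
    import math

    def edge_on_fraction(q_cut, q0):
        """Expected fraction of randomly inclined oblate disks seen with
        apparent axis ratio b/a < q_cut, for intrinsic thickness q0."""
        if q_cut <= q0:
            return 0.0      # nothing can look thinner than its intrinsic q0
        # P(q < q_cut) = P(cos^2 i < (q_cut^2 - q0^2) / (1 - q0^2)),
        # and P(cos i < x) = x for random orientations
        return math.sqrt((q_cut**2 - q0**2) / (1 - q0**2))

    # infinitely thin disks recover the simple answer: fraction = q_cut
    thin = edge_on_fraction(0.25, 0.0)       # 0.25
    # a finite thickness q0 = 0.2 lowers the expected fraction
    thick = edge_on_fraction(0.25, 0.2)      # ~0.15
    ```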

    Extending this: assuming that ellipticals cannot be more elongated than E7 (or E6; source is Buta's fairly recent, huge paper on Galaxy Morphology), two-component (two independent variables) models with 'In between' and 'Cigar-shaped' as dependent (categorical) variables might be very interesting to look at.

    Using R, not a spreadsheet, of course. 😉

  • mlpeck, in response to JeanTate's comment.

    If I were starting from scratch, learning a system that's inevitably going to have a steep learning curve, I would seriously consider learning python with scipy and matplotlib. But R is pretty much the de facto standard among statisticians, and there's a vast array of packages for doing obscure statistical stuff.

    I'm also playing with a system for Bayesian modelling called stan that's pretty neat. Here's the stan code for logistic regression with multiple predictors:

    // multi logistic regression

    data {
        int<lower=0> N;
        int<lower=0> M;
        matrix[N,M] x;
        int<lower=0,upper=1> y[N];
    }
    parameters {
        real alpha;
        vector[M] beta;
    }
    model {
        alpha ~ cauchy(0., 5.);
        beta ~ cauchy(0., 5.);

        for (n in 1:N)
            y[n] ~ bernoulli(inv_logit(alpha + x[n]*beta));
    }
    

    This model says the categorical variable y has a Bernoulli distribution with parameter p equal to the inverse logit of a linear function. Diffuse priors are placed on the parameters -- those are the lines alpha ~ cauchy(0., 5.), etc.

    I mention this because you might want to consider such a model with more informative priors if you think you know something about the behavior of your predictors. Also, this can be extended to ordinal variables (ones where the ordering is meaningful) and to multiple level models.

    Here by the way is the result for the same model I ran yesterday:

    print(stanfit.spiral.all,digits=2)
    Inference for Stan model: multilogistic.
    4 chains, each with iter=2000; warmup=1000; thin=1; 
    post-warmup draws per chain=1000, total post-warmup draws=4000.
    
                mean se_mean   sd     2.5%      25%      50%      75%    97.5%
    alpha      -1.98    0.01 0.16    -2.30    -2.09    -1.98    -1.88    -1.68
    beta[1]    -0.98    0.00 0.14    -1.27    -1.08    -0.98    -0.88    -0.70
    beta[2]    -0.70    0.01 0.22    -1.15    -0.85    -0.70    -0.55    -0.25
    beta[3]    -0.55    0.01 0.20    -0.95    -0.69    -0.55    -0.41    -0.13
    lp__    -3390.60    0.04 1.37 -3394.11 -3391.29 -3390.24 -3389.59 -3388.94
            n_eff Rhat
    alpha     900    1
    beta[1]   903    1
    beta[2]   988    1
    beta[3]  1004    1
    lp__     1239    1
    
    Samples were drawn using NUTS(diag_e) at Sun Dec  1 19:35:05 2013.
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at 
    convergence, Rhat=1).
    

    This is very close to the maximum likelihood solution that R's glm function produces, although the interpretation is different.

  • JeanTate, in response to mlpeck's comment.

    Again, thank you very much!

    I'm trying to learn python (etc), but as it also involves learning Linux, it's slow. And I put it all on the backburner when I joined Quench in earnest. In the spirit of Quench's stated aims, I have tried to do stuff that does not require anything beyond a simple spreadsheet/database (though I quickly realized that ZooTools - in its present form - is inadequate), and which I think I could explain to any of my fellow zooites. Oh, and CasJobs too; querying SDSS is not that hard to do or learn.

    So, I'll tuck R (etc) away for a while, now that Laura is back. One last thing: I see that R runs on Windows - which would make my learning it much faster! - do you mind if I ask if you run it under Windows?
