Galaxy Zoo Starburst Talk

Control Sample

  • trouille by trouille scientist, moderator, admin

    We created the control sample by identifying a non-post-quenched galaxy with a similar mass and redshift to each post-quenched galaxy. Mass-matched here means a galaxy with a total stellar mass within a factor of a few of the post-quenched galaxy's mass. Redshift-matched here means a galaxy whose redshift is within 0.02 of the post-quenched galaxy's redshift.

    There are 112 post-quenched galaxies for which we weren't able to determine reliable masses. For these galaxies, the control galaxies are just matched in redshift.
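
    A minimal sketch of the matching rule described above, not the actual code used for the project. The field names (`id`, `z`, `mass`), the choice of 3.0 for "a factor of a few", and the use of a non-positive mass to mean "no reliable mass" are all assumptions made for illustration.

```python
import random


def find_candidates(target, pool, mass_factor=3.0, dz=0.02):
    """Return pool galaxies matched to `target` in redshift and, when possible, in mass.

    Assumptions (hypothetical, for illustration only):
      - each galaxy is a dict with keys "id", "z" and "mass" (linear units)
      - a non-positive mass means no reliable mass estimate
      - "a factor of a few" is taken to be `mass_factor` = 3.0
    """
    matches = []
    for gal in pool:
        if abs(gal["z"] - target["z"]) > dz:
            continue  # outside the redshift window
        if target["mass"] > 0:
            # Target has a reliable mass, so the candidate needs one too,
            # and it must lie within a factor of `mass_factor`.
            if gal["mass"] <= 0:
                continue
            ratio = gal["mass"] / target["mass"]
            if not (1.0 / mass_factor <= ratio <= mass_factor):
                continue
        # Targets without a reliable mass are matched in redshift only.
        matches.append(gal)
    return matches


def pick_control(target, pool, rng=random):
    """Pick one matched galaxy at random, or None if nothing matches."""
    candidates = find_candidates(target, pool)
    return rng.choice(candidates) if candidates else None
```

    Note that this sketch, like the selection as described, does nothing to stop the same galaxy being picked twice.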

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    If that is so, how did it happen that 29 'control' - by definition, 'non-post-quenched galaxies' - were matched with 29 'post-quenched galaxies' (details here)?

    Also, how were the five 'control' galaxies without estimates of mass 'matched' with corresponding QS catalog ones?

    That there were shortcomings in the selection process(es) seems obvious; what's not so obvious - from the outside - is:

    • a) what led to those shortcomings? and
    • b) what confidence can we (those of us with no insight into the selection
      processes) have that the shortcomings have been appropriately
      addressed?

    Posted

  • trouille by trouille scientist, moderator, admin

    Yes, the devil is in the details.

    #1) In the original code to create the control sample, I didn't think to make sure that no post-quenched galaxies were chosen or that there were no repeated control galaxies. In retrospect, I should have. But I figured -- I'm choosing randomly among many possible galaxies with similar mass and redshift. The likelihood that I'll pick one that already exists in either relatively small sample (just 3000 galaxies) must be next to nothing. That's because I was thinking in the context of 1 million galaxies. To pick two of the same balls in a lottery with 1 million different numbers seemed very unlikely. But instead I should have been thinking about how small my mass and redshift bins were (factor of a few in mass and redshift difference of less than 0.02). It turns out that there's a few percent chance of picking the same object -- hence the repeats in the samples.
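
    One way the two missing checks could be added is sketched below. This is an illustration under assumptions, not the code used for the project: the `id` field, the `is_match` callback and the `build_control_sample` name are all hypothetical.

```python
import random


def build_control_sample(targets, pool, is_match, seed=0):
    """Assign each target one control, with no repeats and no targets as controls.

    `targets` and `pool` are lists of dicts with an "id" key (hypothetical layout).
    `is_match(target, galaxy)` is a caller-supplied test, e.g. the mass and
    redshift window. Returns {target_id: control_id or None}.
    """
    rng = random.Random(seed)
    target_ids = {t["id"] for t in targets}  # never pick a post-quenched galaxy
    used = set()                             # never pick the same control twice
    controls = {}
    for t in targets:
        candidates = [
            g for g in pool
            if g["id"] not in target_ids
            and g["id"] not in used
            and is_match(t, g)
        ]
        if not candidates:
            controls[t["id"]] = None  # no unused match left for this target
            continue
        choice = rng.choice(candidates)
        used.add(choice["id"])
        controls[t["id"]] = choice["id"]
    return controls
```

    The trade-off is that later targets see a slightly smaller pool, so in very narrow bins a target can end up with no control at all.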

    Posted

  • trouille by trouille scientist, moderator, admin

    #2) In the DR7 database, most galaxies have mass estimates. See http://www.mpa-garching.mpg.de/SDSS/DR7/ for the files with these values.

    But ~16% don't have reliable mass estimates, so they're given a place-holder value of '-1' for their mass.

    The five control galaxies with mass = -1 are control galaxies for post-quenched galaxies that also have mass = -1. It probably would have been better to specifically choose only control galaxies with mass estimates, but c'est la vie. It won't affect our results, since we'll only do comparisons that are mass-dependent for post-quenched galaxies that have mass estimates (and hence control galaxies that have mass estimates).
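
    A minimal sketch of restricting a mass-dependent comparison to pairs where both galaxies have a mass estimate. The -1 placeholder comes from the description above; the pair layout and the `mass` field name are assumptions for illustration.

```python
MASS_PLACEHOLDER = -1  # value used when no reliable mass estimate exists


def pairs_with_masses(pairs):
    """Keep only (post_quenched, control) pairs where both masses are real estimates.

    `pairs` is a list of (dict, dict) tuples, each dict having a "mass" key
    (hypothetical layout).
    """
    return [
        (pq, ctrl)
        for pq, ctrl in pairs
        if pq["mass"] != MASS_PLACEHOLDER and ctrl["mass"] != MASS_PLACEHOLDER
    ]
```

    Checking both members of the pair also guards against the case where a control without a mass was matched to a post-quenched galaxy that has one.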

    Posted

  • trouille by trouille scientist, moderator, admin

    The question of confidence is a really interesting and important one. Part of what's been awesome to see is that the people in this forum find errors and mistakes and we (those of us who can make changes in Tools) follow up and fix things. That's exactly what a research team does. There are checks and balances and everyone's double checking to make sure things are right. And of course there will be mistakes -- we're human! We (as a research team) just do our best to make sure we find the mistakes before the article is published and others start building their ideas off of what we've found.

    If you look through the science journals, there are a good number of examples of people finding mistakes in their work even after the article is published. Again, we're human, and errors happen. When the mistake is found, scientists post addenda or updates. I'm sure there are better examples, but here's one I found with a quick Google search: http://iopscience.iop.org/0953-8984/24/20/209401/article

    Although I would have preferred no mistakes in Quench, I do think it's a good thing that we have made mistakes -- it's part of the normal process of science and another window into seeing that science isn't a linear process, that scientists are fallible, etc.

    Posted

  • trouille by trouille scientist, moderator, admin

    So to actually answer your question: "what confidence can we (those of us with no insight into the selection processes) have that the shortcomings have been appropriately addressed?"

    You can have confidence that we're all trying to do things correctly and make no mistakes, but that we're human and so we will make mistakes. You should also have confidence that as a team we'll do our best to find those mistakes. And you can be 100% certain that if we do find a mistake, we'll fix it. That's the one thing about the ethical practice of science -- no sweeping things under the rug!

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    One thought that has occurred to me, more than once, when looking through the QC objects is this:

    "if these objects were chosen at random, from among the ~million or so
    DR7 MGS (Main Galaxy Sample) objects, with the only constraints being
    something like (log_mass AND redshift in (~3k small regions)),
    what does the number of 'outliers' produced by this selection tell
    us?"

    For example, at least two QC objects are actually stars, at least two more are overlaps (background galaxy seen through a - much larger, on the sky - foreground one), and at least five are clumps/SFRs well away from the photocenter/nucleus. For our Quench Project research that may not matter much - this is a 'contamination' level only of order ~1% (maybe) - but perhaps there's more to it? For example, perhaps the selection algorithms you used - to find the QC counterparts - preferentially pick out unusual objects? perhaps objects in the DR7 MGS have a strange, not-yet-identified distribution, in (log_mass, redshift) space?

    At the moment, I'm (re-)reading Ann Finkbeiner's "A Grand and Bold Thing", about how the SDSS came into being, and am up to "Spectroscopic War". The text doesn't exactly say so, but I get the impression that identifying a star as a galaxy (in the Spectro pipeline) is something that should be exceedingly rare [1] (if you cut on zconf > ~0.95), certainly well below 1% ... yet we have at least two such failures! 😮

    [1] Yes, oddities are flagged by the spectroscopic pipeline, at the ~1% level, and a great many have turned out to be strange and wondrous things.

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    This is related to the 'birthday problem' (see Wikipedia), isn't it?

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    "In the original code to create the control sample, I didn't think to make sure that no post-quenched galaxies were chosen or that there were no repeated control galaxies. In retrospect, I should have. But I figured -- I'm choosing randomly among many possible galaxies with similar mass and redshift. The likelihood that I'll pick one that already exists in either relatively small sample (just 3000 galaxies) must be next to nothing. That's because I was thinking in the context of 1 million galaxies. To pick two of the same balls in a lottery with 1 million different numbers seemed very unlikely."

    Given that there are 3000 objects, and you're selecting with replacement, the probability of selecting k 'matches' follows the Poisson distribution [1], doesn't it? In this case, λ will be 9 (= 3000 * 3000/1 million), which is also the expectation value. For k = 9, the probability is ~0.13; for k = 0 (i.e. no matches) the probability is ~0.0001. The probability of 9 or fewer matches is ~0.59, so the probability of more than 9 matches is ~0.41.

    If the population you selected the controls from is the DR7 SDSS MGS (Main Galaxy Sample), which seems to be the case, then the expectation value would be considerably greater: there are only ~700k galaxies in the DR7 MGS, and quite a few have redshifts outside the range of the QS objects. Conservatively, assume 700k; the expectation value is ~13. 29 is certainly far more than you'd expect if the QS galaxies were distributed as randomly in (mass, redshift)-space as the MGS galaxies; however, as you found, this is clearly not the case.

    [1] f(k; λ) = Pr(X = k) = λ^k e^(-λ) / k!
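
    The Poisson figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this post; the function names are made up for illustration, and it assumes, as the post does, that the overlaps are well described by a Poisson distribution with λ = (sample size)² / (population size).

```python
import math


def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with mean `lam`."""
    return lam**k * math.exp(-lam) / math.factorial(k)


def poisson_cdf(k, lam):
    """Pr(X <= k) for a Poisson distribution with mean `lam`."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def expected_overlaps(n_sample, n_controls, population):
    """Expected number of matches when drawing at random from `population`."""
    return n_sample * n_controls / population


lam_1m = expected_overlaps(3000, 3000, 1_000_000)  # population of 1 million
lam_mgs = expected_overlaps(3000, 3000, 700_000)   # ~700k, the assumed MGS size

print("lambda (1 million):", lam_1m)
print("Pr(X = 9):", poisson_pmf(9, lam_1m))
print("Pr(X = 0):", poisson_pmf(0, lam_1m))
print("Pr(X <= 9):", poisson_cdf(9, lam_1m))
print("Pr(X > 9):", 1.0 - poisson_cdf(9, lam_1m))
print("lambda (700k):", lam_mgs)
print("Pr(X >= 29) for 700k:", 1.0 - poisson_cdf(28, lam_mgs))
```

    With λ = 9 this gives Pr(X = 9) ≈ 0.132 and Pr(X ≤ 9) ≈ 0.587; the last line puts a number on how unlikely 29 or more overlaps would be under purely random selection.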

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    Answering my own question, no.

    Posted