Galaxy Zoo Starburst Talk

What is needed to get a clean QS and a clean QC?

by JeanTate

As I and mlpeck have documented - at least partially - in several threads¹, both the Quench Sample (QS) catalog and the Quench Control (QC) catalog contain outliers, or anomalous objects.

As I understand it, to do analyses whose results are good enough to be published in a peer-reviewed astronomy/astrophysics journal (e.g. MNRAS), we need to either work with 'clean' datasets, or clearly identify and characterize the extent to which the datasets we use are not 'clean' (or perhaps both).

But what does 'clean' mean, in the Quench project?

I don't really know, but I'd sure like to have an active discussion on the topic!

For example, I think 'clean' means the objects in each of QS and QC should be galaxies, and not stars. But what if an object is a foreground (in our own galaxy) star and a background galaxy?

For another example, I think 'clean' means the data we analyze, for each object, should be that of a whole galaxy, and not a clump (perhaps a star-forming region) in the arm of a spiral. But how then do we treat mergers, in a consistent way?

And what do we do if the spectroscopic data are clean, but the photometric data are not (as seems to be the case with many objects which are near - on the sky - to very bright stars)? Do we run our analyses on separate datasets?

What do you think?

¹ A partial list, in no particular order:
Posted September 3, 2013 3:40 PM
by jules moderator

Well it doesn't make any sense to me to use datasets containing anything other than post quench galaxies and their controls. Explaining and justifying an "unclean" sample somehow seems sloppy and I think it's worth going that extra mile to find truly clean samples. But I could be wrong! ;D

"For example, I think 'clean' means the objects in each of QS and QC should be galaxies, and not stars. But what if an object is a foreground (in our own galaxy) star and a background galaxy?"

These examples are particularly worrying as both the galaxy and the star are likely to have received classifications effectively rendering that object unusable.

Hopefully someone from the team will chip in.

Posted September 3, 2013 6:09 PM
by mlpeck

Random thoughts in random order:
1. Maybe we should forget about the control sample for a while, like until someone uploads revised data with new classifications.
2. If I understand how the quench sample was selected it was done entirely based on spectroscopic properties. The spectroscopic sample looks quite clean and relatively homogeneous to me, with fewer than a handful of really conspicuously mistaken inclusions.
3. On the other hand, if the selection was based entirely on the population models of Chen et al. only the stellar continuum and absorption features played a role in the selection. This raises the possibility that some of the quenched sample is really non-quenched if we look at the entire spectrum (emission as well as absorption features). There's also of course the possibility that some spectra are completely unrepresentative of the galaxies from which they came. A lower bound on redshifts might have helped here.
4. It's worth taking a close look at Goto's (2007) paper on an "E+A" galaxy catalog drawn from DR5. E+A galaxies should be a strict subset of the ones we're looking at. His selection criteria were based on the strength of H-delta plus emission line strengths at H alpha and [O II] -- the first had to be strong in absorption and the latter two weak. He mentions examining images of all the low redshift (z < .05) objects in his sample and has some comments about possible differences outside the fiber coverage.
He also mentions that the median Petro_R50 in his sample is 1.44 arcsec, so small size by itself should not be cause for suspicion. It seems likely to me that compact size may be a defining characteristic of at least some of our sample.
5. In a recent paper with (presumably) similar science goals to this project McIntosh et al. (2013) used a mix of automated selection criteria and visual inspection to get a homogeneous sample of blue elliptical galaxies. Jean Tate has started doing this, and it's probably a good idea just to look at all of them.
Posted September 4, 2013 3:48 PM
by mlpeck

OK, I took my own suggestion and looked at all 3000 quench sample galaxies using the DR9 image list tool. I did not try to classify them. I did not try to make judgment calls about whether multiple nearby objects were merging or overlaps. I was mostly looking for cases where the galaxy is hugely larger than the spectroscopic fiber.

I found 5 of those, with perhaps another 10-12 borderline cases.

There were also some cases where the galaxy filled, or overfilled, the postage stamp images but where there were no obvious color gradients. I consider those to be proper members of the sample.
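
A rough automated pre-filter could narrow down which objects deserve this kind of eyeball check. The sketch below is only an illustration, not the procedure used above (which was purely visual). It assumes the spectra were taken through the 3 arcsec diameter fibers of the original SDSS spectrograph, and the two ratio thresholds are made-up values that would need tuning against the visually identified cases.

```python
# Assumption: SDSS-I/II fibers are 3 arcsec in diameter, i.e. 1.5 arcsec radius.
FIBER_RADIUS_ARCSEC = 1.5


def size_flag(petro_r50, too_big=10.0, borderline=5.0):
    """Compare a galaxy's Petro_R50 (arcsec) with the fiber radius.

    The thresholds `too_big` and `borderline` are hypothetical ratios,
    chosen only to illustrate the idea.
    """
    ratio = petro_r50 / FIBER_RADIUS_ARCSEC
    if ratio >= too_big:
        return "too big"
    if ratio >= borderline:
        return "borderline"
    return "ok"


# Toy Petro_R50 values in arcsec (1.44 is the median quoted from Goto 2007;
# the other two are invented).
for r50 in (1.44, 9.0, 30.0):
    print(r50, size_flag(r50))
```

A flag like this says nothing about color gradients, so the large but uniform galaxies mentioned above would still need to be looked at by eye.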

Some of the too large or borderline galaxies are really interesting. NGC 3320 looks to me like it is in the process of being turned into an S0 or elliptical, with visible star-forming regions only in the outer parts:

NGC 4330 is a Virgo cluster galaxy that has been extensively studied as an example of ram pressure stripping:

NGC 1268 is near the center of the Perseus cluster (Abell 426) and if it's as close to the giant ellipticals as it appears to be (its redshift is a little lower at z=0.011 vs. z=0.017 for the ellipticals) it is no doubt losing its gas.

Posted September 4, 2013 9:13 PM
by JeanTate in response to mlpeck's comment.

Very cool! 😃

Not least because you - independently - found much the same thing that I did (see the Galaxies which are too big (Petro_R50 >> fiber aperture) thread).

I wonder if AGS00000iz (587731187819937915, a.k.a. UGC 00507) is one of the two, of five, (which you didn't show), or perhaps one of the borderline 10-12?

Posted September 4, 2013 9:40 PM
by mlpeck

That was actually second on my list of way too big objects.

Posted September 4, 2013 10:33 PM
by mlpeck

A simple redshift cut would go a long way towards solving the problem of galaxies that are too big. This is standard practice anyway and wouldn't raise eyebrows among peer reviewers -- I'd guess that not having redshift cuts would be more eyebrow-raising.

A subsection discussing the nearby objects that fit the spectroscopic criteria for being quenched and highlighting some especially interesting individual objects might be useful.

A cut at z=0.01 would eliminate around 14 objects and probably all the ones I found troublesome. A cut at z=0.02 would still leave 97.5% of the original sample, but I think some interesting objects would be tossed. Any higher lower limit would be too conservative, IMO.
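
To compare candidate cuts, something like the following would do. It is a minimal sketch that assumes the catalog is available as a list of records with a redshift field named `z` (a hypothetical layout); the toy redshifts are invented and are not QS data.

```python
def apply_redshift_cut(objects, z_min):
    """Keep only the objects at or above the lower redshift limit."""
    return [obj for obj in objects if obj["z"] >= z_min]


def retained_fraction(objects, z_min):
    """Fraction of the sample that survives a lower redshift cut."""
    if not objects:
        return 0.0
    return len(apply_redshift_cut(objects, z_min)) / len(objects)


# Toy catalog with made-up redshifts, for illustration only.
toy_catalog = [
    {"id": "toy1", "z": 0.004},
    {"id": "toy2", "z": 0.015},
    {"id": "toy3", "z": 0.080},
    {"id": "toy4", "z": 0.210},
]

for z_min in (0.01, 0.02):
    kept = apply_redshift_cut(toy_catalog, z_min)
    print(z_min, len(kept), retained_fraction(toy_catalog, z_min))
```

Run on the real catalog, the same two calls would show how many objects each cut removes and which ones they are.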

I suppose one could make an argument for imposing an upper redshift limit as well, but I'm not going to go there.

Posted September 5, 2013 12:28 AM
by JeanTate in response to jules's comment.

Well it doesn't make any sense to me to use datasets containing anything other than post quench galaxies and their controls.

Expressed that way, everyone would surely agree. 😉

One of the issues is the extent to which you should go, the effort you should put in, to trying to ensure this. For example, consider AGS00003jv, a QC/control object:

In the gri bands, it's indistinguishable from a point source. The DR9 spectroscopic pipeline classifies it as a galaxy, redshift 0.15864±0.00002 ... but "Warnings: SMALL_DELTA_CHI2" (meaning that there is at least one other template - presumably with a different redshift - that matches the spectrum almost as well). What sort of checking is it worthwhile doing, first to discover that this QC object is at least problematic, and second to decide what it actually is¹?

Perhaps a more important issue is: What is a post-quenched galaxy anyway?

To some extent, this question is answered by the way in which the QS catalog was created: a post-quenched galaxy is an object in the SDSS DR7 (DR9?) database which has spectral features that Chen's algorithm selects. True, we don't know much about that (and mlpeck has been doing some work to put limits on what it might be, given the current literature), but we can still identify objects with very small fiber coverings, or dramatic changes in (morphology) structure, or which are likely overlaps. And then decide if they should be included or not.

Then there's the quality question: 'clean' makes sense only in terms of 'fitness for purpose', and 'purpose' for us is analysis.

Consider AGS00000f1, a QS object:

This has a DR7 u-band model mag of 24.68±2.45 (in DR9 it's 25.13±1.90), making its (u-r) color (6.84) an extreme outlier. Yet the spectrum seems perfectly fine (DR9 interactive), as does just about everything else we have on it. So, unless there is some analysis we want to do which requires a reliable u-band magnitude, this object is 'clean'; however, if we do want to include analyses which use the u-band magnitude², what do we do?
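
One possible way to handle this is to carry the magnitude errors through to the color and to flag colors whose uncertainty is too large to be useful. The sketch below assumes independent errors added in quadrature. The r-band magnitude is simply what the quoted u and (u-r) values imply, while the r-band error and the 0.5 mag threshold are invented for illustration.

```python
import math


def color_with_error(mag_a, err_a, mag_b, err_b):
    """Return the color (mag_a - mag_b) and its uncertainty.

    Assumes the two magnitude errors are independent, so they add
    in quadrature.
    """
    return mag_a - mag_b, math.hypot(err_a, err_b)


def is_reliable(color_err, max_err=0.5):
    """Hypothetical quality cut: is the color error small enough to use?"""
    return color_err <= max_err


u, u_err = 24.68, 2.45  # DR7 u-band model mag quoted above
r, r_err = 17.84, 0.01  # r implied by u-r = 6.84; r_err is a made-up value

color, err = color_with_error(u, u_err, r, r_err)
print(round(color, 2), round(err, 2), is_reliable(err))
```

With a flag like this, the object could stay in analyses that rely only on the spectrum and be dropped automatically from those that need the u-band magnitude.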

Hopefully someone from the team will chip in.

Yes, that would be nice.

¹ In this particular case, it's surely worth the effort to discover that it's problematic, an outlier, and possibly anomalous (the answer to the first question), but not worth any effort to work out what it is ... as it's a control object, simply drop it and find another with a redshift and log_mass close to whatever QS object it is paired with.

² And apparently such analyses are quite desirable

Posted September 5, 2013 2:32 PM
by JeanTate in response to mlpeck's comment.

Maybe we should forget about the control sample for a while, like until someone uploads revised data with new classifications.

I think this is very sensible, not least because it's the ('post-quenched') QS galaxies that we're trying to understand, and which are hard to find. And it should be pretty straight-forward to find many 'alternates' for the 'same z, same mass' control for all the true post-quenched galaxies.
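
Finding an alternate control could look something like the sketch below. It is not how the QC catalog was actually built (we don't know the details of that). The field names `z` and `log_mass`, the tolerances, and the toy values are all assumptions made for illustration.

```python
def find_replacement_control(target, candidates, max_dz=0.01,
                             max_dlogm=0.1, exclude=()):
    """Pick the candidate closest to the target in redshift and log mass.

    Candidates outside either tolerance, or whose id is in `exclude`,
    are skipped. Returns None if nothing qualifies.
    """
    best, best_score = None, None
    for cand in candidates:
        if cand["id"] in exclude:
            continue
        dz = abs(cand["z"] - target["z"])
        dlogm = abs(cand["log_mass"] - target["log_mass"])
        if dz > max_dz or dlogm > max_dlogm:
            continue
        # Distance in units of the tolerances, so both terms are comparable.
        score = (dz / max_dz) ** 2 + (dlogm / max_dlogm) ** 2
        if best_score is None or score < best_score:
            best, best_score = cand, score
    return best


# Toy values, invented for illustration.
qs_object = {"id": "qs_toy", "z": 0.158, "log_mass": 10.50}
pool = [
    {"id": "c1", "z": 0.160, "log_mass": 10.52},
    {"id": "c2", "z": 0.159, "log_mass": 10.90},  # mass too different
    {"id": "c3", "z": 0.150, "log_mass": 10.50},
]

print(find_replacement_control(qs_object, pool)["id"])
print(find_replacement_control(qs_object, pool, exclude={"c1"})["id"])
```

The `exclude` argument is where the known-problematic control objects would go, so that a dropped control cannot be picked again.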

If I understand how the quench sample was selected it was done entirely based on spectroscopic properties. The spectroscopic sample looks quite clean and relatively homogeneous to me, with fewer than a handful of really conspicuously mistaken inclusions.

But how did you decide that? What, specifically, for you constitutes "quite clean and relatively homogeneous"?

For example, I don't think it's all that obvious that QS objects AGS00001ka and AGS00001dn are mistaken inclusions, yet both spectra are heavily 'contaminated' by the light of a foreground (Milky Way) star. Much the same is true concerning non-interacting overlapping galaxies. As the SDSS spectroscopic pipeline cannot give anything quantitative about the likelihood of there being two (or more) redshift systems in a spectrum, how could we even estimate the likely 'foreground star contamination' fraction (other than saying that it's at least ~0.2%)?

On the other hand, if the selection was based entirely on the population models of Chen et al. only the stellar continuum and absorption features played a role in the selection. This raises the possibility that some of the quenched sample is really non-quenched if we look at the entire spectrum (emission as well as absorption features). There's also of course the possibility that some spectra are completely unrepresentative of the galaxies from which they came. A lower bound on redshifts might have helped here.

This is pretty key, eh? Because we do not know - in detail - how the 3002 QS objects were selected, we really can't say much about how many may be anomalous (other than those heavily contaminated by a foreground star, or non-interacting overlapping galaxy), can we?

Posted September 5, 2013 8:05 PM
by mlpeck in response to JeanTate's comment.

But how did you decide that? What, specifically, for you constitutes "quite clean and relatively homogeneous"?

I've been looking almost exclusively at spectroscopic properties. The control sample, whatever its flaws may be, looks very much like a randomly drawn selection from the SDSS population of spectra. The quench sample doesn't. See here and the second post here.

I'd really like to get on with more substantive analysis, and I thought the graphs I posted in those two monologues suggested some interesting avenues to investigate. Unfortunately, given the complete absence of anyone from the science team for over a week now I've reached the conclusion that this project has failed, so for now at least I can't justify spending the time exploring those avenues more thoroughly. On the positive side I was inspired by this project to learn enough SQL to feel confident wading into the CasJobs website, and that alone was worth the price of admission.

Posted September 6, 2013 3:27 PM
by JeanTate

It's nearly three months since the last post in this thread; that's the bad news. The good news is - from my own data analysis perspectives - that I think I'm getting close to being able to characterize the sorts of QS objects I think we should at least consider excluding. QC objects are a whole different ball of wax, but also much easier to address, as has already been discussed in this thread.

Just yesterday I finished collecting most of the posted outliers into a single category, and discovered - much to my delight and some surprise - that there are only ~100 of them! 😮 Still some work to do - and some possible classes of outliers to consider - and I'll post details later.

Posted November 23, 2013 3:12 PM
by JeanTate in response to JeanTate's comment.

Not surprisingly, what to exclude depends on what we want to study!

The easiest cases are those where the object is not a galaxy, nor even a part of a galaxy ... at least not within the redshift range of this project (i.e. ~0 < z < 0.35). It gets slightly more complicated when, for example, the spectrum is of a foreground star but the photometry - and zooite classifications - are of a large (on the sky) galaxy.

More generally, as the QS objects were selected by features in their spectra, and as we're studying galaxies, it should be pretty uncontroversial to exclude (in addition to 'pure stars'):
- all objects in which the spectrum is a blend, either of a foreground star and a galaxy, or of overlapping galaxies with very different redshifts
- all objects for which the photometry is a blend, either of one or more foreground stars and a galaxy, or of overlapping galaxies with very different redshifts (or both).
Next, 'bad spectra' and 'bad photometry'.

I don't think there are any 'catastrophically bad spectra' QS objects. We had one in the v1 catalog - a high redshift object misclassified as a low redshift one - but it was removed. There are several 'catastrophically bad photometry' QS objects: tiny regions of huge Eos treated as stand-alone galaxies, and galaxies near very bright stars.

There are ~~~ten~~ 20 QS objects which, I claim, meet at least one of the above criteria; I propose they all be excluded from all Quench project analyses. I've posted the ones I've found so far at the end of this post (DR10 images).

More challenging is how to decide which 'bad spectra' and 'bad photometry' objects to exclude from what analyses; particularly, how to do so objectively and consistently. And I'll discuss this further in later posts. For now, the classes of object I'm still looking into (but haven't yet written about): the 'bad image' objects zutopian posted to the Outliers - collect them here! thread; objects with 'holes' in their spectra, holes which just happen to be where one or more of the 'BPT diagram emission lines' would be (if they were, in fact, in emission); the extent to which Eos with nuclear spectra might have 'hidden' AGN; and misleading Petro_R50 values for objects with very bright, PSF-like nuclei.

Top to bottom, left to right:
- AGS00001ka: foreground star
- AGS00001z3: star
- AGS00002ak: overlap (star/quasar?)
- AGS00001dn: foreground star
- AGS0000182: overlap (z=0.019 and 0.095)
- AGS00000ds: overlap
- AGS00000hm: foreground star
- AGS00000wp: overlap (z=0.117 and 0.024)
- AGS000026g: foreground star
- AGS00001se: foreground star
UPDATE: I omitted to post the 'near bright star' catastrophic 'bad photometry' QS objects:
- AGS000004v: photometry is also thrown off by a diffraction spike
- AGS00000a2: zoomed out a bit; the 'bright sky' makes the u-band estimates nonsense^
- AGS000005o: photometry is also thrown off by a diffraction spike
- AGS00000s1:
- AGS00000uh: photometry is also thrown off by a diffraction spike
- AGS0000265: photometry is also thrown off by a diffraction spike
- AGS00000l1: being this close to two bright stars seems to have quite messed up the pipeline's estimates of 'the sky'
^ Here's an even more zoomed out image:
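
For anyone wanting to apply these exclusions in their own analysis, the two lists above can be kept as a simple lookup. This is only a sketch: the IDs are the ones posted above, while the variable names and the idea of holding the catalog as a list of IDs are assumptions.

```python
# Stars, foreground-star blends and overlaps, as listed above.
STAR_OR_OVERLAP = {
    "AGS00001ka", "AGS00001z3", "AGS00002ak", "AGS00001dn", "AGS0000182",
    "AGS00000ds", "AGS00000hm", "AGS00000wp", "AGS000026g", "AGS00001se",
}

# 'Near bright star' objects with catastrophically bad photometry.
NEAR_BRIGHT_STAR = {
    "AGS000004v", "AGS00000a2", "AGS000005o", "AGS00000s1",
    "AGS00000uh", "AGS0000265", "AGS00000l1",
}

EXCLUDE = STAR_OR_OVERLAP | NEAR_BRIGHT_STAR


def split_catalog(object_ids, exclude=EXCLUDE):
    """Split a list of object IDs into (kept, dropped), preserving order."""
    kept = [oid for oid in object_ids if oid not in exclude]
    dropped = [oid for oid in object_ids if oid in exclude]
    return kept, dropped


kept, dropped = split_catalog(["AGS00001ka", "AGS00000f1", "AGS00000l1"])
print(len(EXCLUDE), kept, dropped)
```

Keeping the two groups separate makes it easy to drop the bad-photometry objects only from analyses that actually use the photometry.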

Posted November 25, 2013 5:35 PM
by jules moderator

Sterling work as ever Jean.

Posted November 25, 2013 9:58 PM
by JeanTate in response to jules's comment.

Thanks jules.

Posted November 26, 2013 8:16 PM