Galaxy Zoo Starburst Talk

How to decide which objects' MPA-JHU derived parameters are unreliable because ...

  • JeanTate by JeanTate

    As I understand it, we are using values for several key parameters derived from MPA-JHU data products. These include various line fluxes (and their errors), the 4000 Å break strength ("D4000", and its error), velocity dispersion ("V_DISP", and its error), and the star-formation rate ("SFR").

    The methods used to produce these estimates are described in a number of papers, principally Brinchmann+ 2004, Kauffmann+ 2003, and Tremonti+ 2004 (see above link for details).

    I have downloaded and read each of these, looking specifically to understand how the data products address:

    • spectra contaminated by overlapping/foreground Galactic stars (including the diffraction spikes - "diffspikes" - of very bright stars, which may be several arcminutes from the galaxy)

    • spectra contaminated by overlapping - either foreground or background - galaxies, whether at similar or quite different redshifts

    • SDSS spectra with some regions masked out (these regions may be as small as a few pixels, or as large as ~1000+ Å)

    • SDSS spectra whose redshift estimate carries the warning "SMALL_DELTA_CHI2".

    I found nothing; no mention in any of the papers of any of these. Although none seem to come out and say it openly, they all seem to very firmly assume that each and every spectrum they let their pipeline(s) loose on is 'good' (no masked regions, no poorly subtracted sky lines, etc) and is of a system of gravitationally bound stars, gas, dust, and perhaps an AGN.

    Yet there are hundreds of objects in the Quench project* which have either 'bad' spectra, spectra containing more than one redshift system, or both! 😮

    How do we decide which among these many hundred we should exclude from our analyses, as potentially problematic objects?

    Specifically, how can we make such decisions in an objective, quantitative, reproducible, scientifically-acceptable manner?

    *and among "the 1149" QS objects and "the 1196" QC ones that meet our redshift and estimated z-band absolute magnitude cuts.

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    Some examples (DR9 images):

    AGS00002c6: the blue 'star' is only ~0.02' from the galaxy's nucleus; it's not a separate photometric object:

    [DR9 image]

    AGS00002p5: the red 'star' is 0.043' from the nucleus, so is not within the fiber aperture (but how much of its light was scattered into the aperture? Note: many SDSS spectra were taken on nights that were too poor for photometry):

    [DR9 image]

    AGS00002o8: this is an overlap; the big galaxy has the same redshift (0.08059±0.00002), and some of the light from its stars (etc.) certainly contaminates the target's:

    [DR9 image]

    AGS00002xj: nearby very bright star's light certainly scattered into spectroscopic fiber; at the time the spectrum was taken, fiber may also have been in a diffspike (for part of the exposure):

    [DR9 image]

    AGS00002ki: spectrum red-ward of ~8000 Å is missing (not even masked); no warnings:

    [DR9 image]

    AGS00002cy: spectrum blue-ward of ~4000 Å masked (including the 4000 Å break, H&K lines, ...); no warnings:

    [DR9 image]

    Of course, most of these are extreme examples. However, there seems to be no obvious break in any continuum - distance from fiber center; overlapping star's brightness; width (in Å) of masked regions in the spectrum; prominent emission/absorption lines masked (or not); etc. - so how do we decide where to make the cut? Remember: nothing in any of the papers describing the MPA-JHU data products (that I could find, anyway) even hints at how ... 😢
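    One way to make such a cut objective and reproducible is simply to write the thresholds down as code that anyone can re-run against the catalog. A minimal sketch follows; every metric name and threshold value in it is an illustrative assumption, not project policy:

```python
# Sketch of a reproducible exclusion rule. All metric names and
# threshold values are illustrative assumptions, not agreed cuts.

def exclude_object(star_sep_arcmin, masked_width_angstrom, bpt_line_masked,
                   max_star_sep=0.05, max_masked_width=200.0):
    """Return (exclude?, reasons) for one object, given a few
    contamination metrics measured beforehand."""
    reasons = []
    if star_sep_arcmin is not None and star_sep_arcmin < max_star_sep:
        reasons.append("overlapping star within %.3f arcmin" % max_star_sep)
    if masked_width_angstrom > max_masked_width:
        reasons.append("more than %.0f A of spectrum masked" % max_masked_width)
    if bpt_line_masked:
        reasons.append("a BPT emission line is masked")
    return (len(reasons) > 0, reasons)
```

    Because the thresholds are explicit parameters, changing where the cut falls is a one-line, fully documented change rather than a subjective judgment.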


  • trouille by trouille scientist, moderator, admin

    For spectra with regions masked out, if you look at the error on the values for the flux of emission lines in those regions, are the errors large (i.e., the emission line flux is within perhaps ~2 sigma or less of the error)? For example, if Halpha is in a region of the spectrum that's masked out, does that source have a very large Halpha_err value or a flagged Halpha value?
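    That check can be scripted directly from the catalog columns. A minimal sketch, where the 2-sigma threshold follows the comment above, and treating `err <= 0` as an "unmeasured" sentinel is an assumption about how the catalog flags missing values:

```python
def line_is_unreliable(flux, flux_err, n_sigma=2.0):
    """Flag an emission line whose flux is within ~n_sigma of zero,
    or whose quoted error is missing/non-positive (assumed sentinel
    for an unmeasured line)."""
    if flux_err is None or flux_err <= 0:
        return True          # no usable error estimate
    return abs(flux) < n_sigma * flux_err

# e.g. an H-alpha flux of 3.0 with error 2.5: 3.0 < 2 * 2.5, so flagged
```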

    (The values originally posted on Tools are from DR7, though great work has been done updating values in our Talk threads using the more recent databases -- http://www.mpa-garching.mpg.de/SDSS/DR7/SDSS_line.html)

    As for the foreground/background contamination of stars/galaxies, my understanding is that the pipeline didn't do careful checks. The procedure mlpeck presented on page 2 of http://quenchtalk.galaxyzoo.org/#/boards/BGS0000008/discussions/DGS000022q is an excellent idea and I am pretty sure not done in the MPA-JHU pipeline (i.e., estimating the fractional contribution from nearby sources to SDSS's spectroscopic fiber on our source of interest).

    This is where your visual follow-up of the sources is so helpful to flag potentially contaminated sources, then followed up with mlpeck's procedure to see if there's significant contribution from the nearby source to our source-of-interest's spectrum.


  • JeanTate by JeanTate in response to trouille's comment.

    For spectra with regions masked out, if you look at the error on the values for the flux of emission lines in those regions, are the errors large (i.e., the emission line flux is within perhaps ~2 sigma or less of the error)? For example, if Halpha is in a region of the spectrum that's masked out, does that source have a very large Halpha_err value or a flagged Halpha value?

    I haven't checked, so I don't know.

    However, unless there's an exceptionally compelling case to say otherwise, I think every object which has any of the four 'BPT emission lines' masked, even partially, should be excluded from all analyses (that involve BPT class).
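    That exclusion rule is mechanical enough to script, given per-pixel wavelength and mask arrays for each spectrum. A sketch; the ±10 Å rest-frame window and the "any nonzero mask bit is bad" convention are deliberately conservative assumptions of mine:

```python
# Rest wavelengths (Angstroms) of the four 'BPT' emission lines.
BPT_LINES = {"H-beta": 4861.3, "[OIII]5007": 5006.8,
             "H-alpha": 6562.8, "[NII]6583": 6583.4}

def bpt_line_masked(wavelength, mask, z, half_width=10.0):
    """True if any pixel within +/- half_width A (rest frame) of a BPT
    line carries a nonzero mask value. `wavelength` and `mask` are
    parallel per-pixel arrays for one spectrum at redshift z."""
    for rest in BPT_LINES.values():
        lo = (rest - half_width) * (1.0 + z)
        hi = (rest + half_width) * (1.0 + z)
        for w, m in zip(wavelength, mask):
            if lo <= w <= hi and m != 0:
                return True
    return False
```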

    The more general concern remains, however: accepting the description of how the MPA-JHU parameter values were derived (per their published papers, from the SDSS spectroscopy data), what confidence can we have in any value (including its error bars) in cases where there are (large) regions of the spectrum masked (or completely missing)?

    As I said in the OP, I could find nothing, in any of the key papers I read, which even vaguely hints that the teams even considered this question, much less robustly addressed it.

    Further, as this is all new to me, I am very keen to know what professional astronomers, with decades (or more) of experience, would do in this situation! 😃

    As for the foreground/background contamination of stars/galaxies, my understanding is that the pipeline didn't do careful checks.

    Mine too.

    The procedure mlpeck presented on page 2 of http://quenchtalk.galaxyzoo.org/#/boards/BGS0000008/discussions/DGS000022q is an excellent idea and I am pretty sure not done in the MPA-JHU pipeline (i.e., estimating the fractional contribution from nearby sources to SDSS's spectroscopic fiber on our source of interest).

    It's certainly very interesting!

    However, I think it's - as yet - very far from being robust. But let's keep the discussion going!


  • mlpeck by mlpeck

    I found nothing; no mention in any of the papers of any of these.
    Although none seem to come out and say it openly, they all seem to
    very firmly assume that each and every spectrum they let their
    pipeline(s) loose on is 'good' (no masked regions, no poorly
    subtracted sky lines, etc) and is of a system of gravitationally bound
    stars, gas, dust, and perhaps an AGN.

    I understand that you're trying to get responses from some credentialed scientists, but I'm pretty sure no one would ever assume that data from a large automated survey is free from contamination, since that would be obviously contrary to reality. What they're assuming is that the level of contamination doesn't prevent them from meeting their science goals.

    As for the spectra you pointed out, the 4th line in those graphs is the translated value of the ZWarning flag. No warnings indicates that the pipeline considered the redshift measurement secure. There are also bit mask flags for each wavelength bin in the spectrum files, which along with the inverse variance estimates tell you if and sometimes why specific wavelengths should be treated as missing.
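    The point about ZWarning being a bit mask can be made concrete. The bit assignments below follow my reading of the SDSS idlspec2d convention (under which SMALL_DELTA_CHI2 is bit 2) and should be verified against the survey's bitmask documentation before being relied upon:

```python
# Decoding the ZWARNING bit mask. Bit assignments are my reading of
# the SDSS idlspec2d convention; verify against the official docs.
ZWARNING_BITS = {
    0: "SKY",
    1: "LITTLE_COVERAGE",
    2: "SMALL_DELTA_CHI2",
    3: "NEGATIVE_MODEL",
    4: "MANY_OUTLIERS",
    5: "Z_FITLIMIT",
    6: "NEGATIVE_EMISSION",
    7: "UNPLUGGED",
}

def decode_zwarning(zwarning):
    """Return the names of all warning bits set in the integer value."""
    return [name for bit, name in ZWARNING_BITS.items()
            if zwarning & (1 << bit)]

# decode_zwarning(0) -> []  (redshift considered secure)
# decode_zwarning(4) -> ["SMALL_DELTA_CHI2"]
```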


  • JeanTate by JeanTate in response to mlpeck's comment.

    Thanks! 😃

    What they're assuming is that the level of contamination doesn't prevent them from meeting their science goals.

    That's very likely true.

    However, for a complete outsider (me) it's frustrating to not find the topic discussed at all in what are obviously key papers (as measured by the widespread use of data, and number of citations).

    A possible corollary: if you wish to make use of widely-cited work like this, in your own scientific research, you have no choice but to do your own data checking ... if only because you can never be sure that your science goals align perfectly (and the real world of science being what it is, the extent to which scientists - in general - are thorough in such checking varies widely).


  • mlpeck by mlpeck in response to JeanTate's comment.

    The procedure mlpeck presented on page 2 of
    http://quenchtalk.galaxyzoo.org/#/boards/BGS0000008/discussions/DGS000022q
    is an excellent idea and I am pretty sure not done in the MPA-JHU
    pipeline (i.e., estimating the fractional contribution from nearby
    sources to SDSS's spectroscopic fiber on our source of interest).

    It's certainly very interesting!

    However, I think it's - as yet - very far from being robust. But let's
    keep the discussion going!

    Out of mild curiosity, and just to keep the discussion going, why do you think the procedure I described is "very far from being robust"? I have a suspicion that statement is based on a misconception, but since you offered no reason I didn't have anything to respond to.


  • JeanTate by JeanTate in response to mlpeck's comment.

    why do you think the procedure I described is "very far from being robust"? I have a suspicion that statement is based on a misconception, [...]

    Yes, very likely.

    If, in this case, our science goal is to estimate "the fractional contribution from nearby sources to SDSS's spectroscopic fiber on our source of interest", then I think the approach could, fairly quickly and easily, be made at least fairly robust. How? By quantifying what our goal is (integrated flux? flux in the four 'BPT diagram' emission lines? etc). By developing 'pure' stellar templates. By investigating diffspikes. Etc.
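    The first of those goals can be made quantitative with a very simple first-order model: treat the seeing-blurred star as a circular Gaussian and integrate it over the fiber aperture (SDSS fibers are 3″ in diameter). This is a sketch under my own simplifying assumptions, not mlpeck's actual procedure; real stellar profiles have broader wings, so it will underestimate contamination at large offsets:

```python
import math

def fiber_fraction(offset_arcsec, seeing_fwhm=1.4, fiber_radius=1.5, n=200):
    """Fraction of a star's light entering a circular fiber aperture,
    modeling the seeing-blurred star as a circular Gaussian centered
    offset_arcsec from the fiber center. Grid-integrates the (unit-
    normalized) Gaussian over the fiber disk. The default seeing FWHM
    is an illustrative assumption."""
    sigma = seeing_fwhm / 2.3548          # FWHM -> Gaussian sigma
    step = 2.0 * fiber_radius / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x = -fiber_radius + (i + 0.5) * step
            y = -fiber_radius + (j + 0.5) * step
            if x * x + y * y > fiber_radius ** 2:
                continue  # grid point outside the fiber aperture
            r2 = (x - offset_arcsec) ** 2 + y ** 2
            total += math.exp(-r2 / (2.0 * sigma ** 2)) * step * step
    return total / (2.0 * math.pi * sigma ** 2)
```

    Multiplying this fraction by the star's flux in each band gives a rough per-band contamination estimate to compare against the target's fiber magnitudes.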

    However, if our science goal is to produce robust estimates of the uncertainties in the MPA-JHU parameter values quoted in the QS and QC catalogs, for all classes of 'contamination' ('star in the fiber', scattered light from a diffspike, light from a physically unrelated background/foreground galaxy, masked/missing regions of the spectra, misaligned red/blue arm continua, uncertain redshift, ...), then I think we'd have a lot more work to do.


  • mlpeck by mlpeck in response to JeanTate's comment.

    I was trying to do one thing only, which is to quantify the contribution of putative foreground stars to spectra. I did it because the subject is peripherally interesting to me and I already had almost all the software infrastructure and data I needed to address the problem.

    I could comment knowledgeably on some other sources of potential problems, but I'd prefer to leave that to someone who has actually published papers based on SDSS data in the peer reviewed literature.


  • JeanTate by JeanTate in response to mlpeck's comment.

    Thanks for clearing that up.

    At the moment I can think of only one potentially difficult aspect that we'd have to investigate, to get a long way towards realizing that science goal: diffspikes. A fully robust treatment would also consider aspects such as the Airy disks of really bright nearby stars (their diameters are wavelength dependent) and differential refraction (very few SDSS objects were observed close to the zenith, but then very few were observed at low altitudes either); however, if a galaxy's spectrum is heavily contaminated by these, the galaxy would likely not be a photometric object anyway.

    Diffspikes are different: not only can they contribute to the light captured by the spectroscopic fiber much further from the bright star, but the light scattered into the fiber will not have a spectrum that is simply a fainter version of the bright star's ... a diffspike is, after all, a low-grade spectrum. Further, determining whether a fiber was in fact diffspiked during the time the object's spectrum was obtained would likely take some digging into things like the relative position of the bright star and the galaxy (i.e. the RA/Dec difference), and the altitude and azimuth of the telescope boresight at the time of the observation. On top of that, there's the 'seeing' at the time.
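    The geometric part of that digging can at least be screened for crudely: compute the position angle of the galaxy relative to the bright star and ask whether it lies close to one of the spike directions. A sketch assuming a four-spike pattern whose orientation follows a single rotator angle; both the spike pattern and the tolerance are illustrative assumptions:

```python
import math

def spike_hit(d_ra_arcsec, d_dec_arcsec, rotator_deg=0.0, tol_deg=5.0):
    """Rough screen for whether a galaxy lies along one of a bright
    star's four diffraction spikes. Assumes spikes at position angles
    rotator_deg + {0, 90, 180, 270} degrees (an illustrative model).
    d_ra_arcsec should already include the cos(dec) factor."""
    # Position angle of the galaxy w.r.t. the star, North through East.
    pa = math.degrees(math.atan2(d_ra_arcsec, d_dec_arcsec)) % 360.0
    for spike in (0.0, 90.0, 180.0, 270.0):
        spike_pa = (rotator_deg + spike) % 360.0
        # Smallest angular difference between the two position angles.
        diff = abs((pa - spike_pa + 180.0) % 360.0 - 180.0)
        if diff <= tol_deg:
            return True
    return False
```

    A real screen would still need the per-exposure rotator angle and the seeing, but a geometric pre-filter like this would at least shortlist candidates for visual inspection.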

    The other things, 'loose ends', should be fairly simple and easy to address; things like 'what quantitative measure of contamination?' and using a library of 'pure' stellar spectra.


  • JeanTate by JeanTate in response to JeanTate's comment.

    Diffspikes are different: not only can they contribute to the light captured by the spectroscopic fiber much further from the bright star, but the light scattered into the fiber will not have a spectrum that is simply a fainter version of the bright star's ... a diffspike is, after all, a low-grade spectrum.

    I've been looking for a site which explains/shows this at a level which ordinary zooites (who are also keen amateur astronomers?) can readily follow. I found this one: Diffraction Pattern of Obstructed Optical Systems. Of course, the optics of the SDSS telescope (not to mention all the extra artifacts introduced by the camera and spectrograph) do not correspond exactly to any of those modeled, but qualitatively ... 😃


  • JeanTate by JeanTate

    Not sure I mentioned this, but I posted the content of the first two posts in this thread into separate threads in the GZ forum (How to decide which objects' MPA-JHU derived parameters are unreliable because..), and in the CosmoQuest forum (How to decide which objects' MPA-JHU derived parameters are unreliable because...).

    The latter has some interesting comments (though none are direct answers to my questions).

    Part of mlpeck's post from earlier today, here, is also indirectly relevant:

    I do have one recommendation, perhaps for JeanTate but certainly for any working scientists who wander by: get the recently published book Statistics, Data Mining, and Machine Learning in Astronomy by Ivezic, Connolly, VanderPlas and Gray (2014, Princeton Univ. Press, ISBN 978-0-691-15168-7).

    Calculus and some basic matrix algebra are prerequisites to understand the text, so it's probably not for the average zoo-ite. I would guess that most working scientists would find something to learn from the book even if they are experts in some aspect of data analysis just because the authors cover a huge amount of ground (mostly superficially to be sure).

    Another thing that's useful about the book for astronomers is they make use of non-toy astronomical datasets from SDSS and other large surveys. Every chapter has some discussion of robust methods and methods for outlier detection in large datasets, and some of their ideas are certainly applicable to this project.

    Right now I am skipping my way through the text. When I have access to more computing resources than the laptop and tablet I have with me at the moment I plan to dig into some of their data and algorithms. Techniques for cross-validation are pretty new to me, and I have some data I want to try out some ideas on.

    The book uses Python code throughout, but its introduction to the language is too brief to be really useful. So, for the non-Python programmer, another resource is needed to learn Python. Something else fun to do!

    Part of my response:

    Cool! 😃

    Over the last few weeks I've been looking for books of this kind, and had found Modern Statistical Methods for Astronomy: With R Applications by Feigelson and Babu (2014, Cambridge University Press, ISBN 978-0-521-76727-9), and Statistical Data Analysis, by Cowan (1998, Oxford University Press, ISBN 978-0-198-50156-5). Are you - or any other reader - familiar with either? Would you recommend either? If your budget can stretch to just one, which would you recommend?
