Galaxy Zoo Starburst Talk

Characterising classification biases

  • lpspieler by lpspieler moderator

    Hello all,

When going from classification to data analysis, we will have to deal with how our classifications are influenced

    • by the way our brain and our perception system works
• by observational limits

    This has been the subject of very active research here on the Zooniverse.
    Well-known results are

    • a slight tendency to give a biased judgement of the handedness of spiral galaxies.
    • a correlation, with magnitude, of the probability of galaxies being classified as "smooth".

    I would like to use this thread for discussing what we should/could do to characterize these influences.
    Possible examples are

    • using the results that have already been found (collect and then use GZ papers evaluating observational biases)
    • insert control images (like the mirrored galaxies in the handedness investigation).

    Any ideas?


  • trouille by trouille scientist, moderator, admin

    Great to see this thread started.

Within our sample of 6004 galaxies, half are post-quenched galaxies and half are control galaxies. For each post-quenched galaxy, we identified a mass- and redshift-matched galaxy. By mass-matched, I mean a galaxy with a total stellar mass within a factor of a few of the post-quenched galaxy's. And by redshift-matched, I mean a galaxy whose redshift is within 0.02 of the post-quenched galaxy's.

    Question for the group: How might we use these control galaxies to address the impact of observational limits on our ability to classify?


  • JeanTate by JeanTate in response to trouille's comment.

    In addition to the obvious - basically repeats of the studies lpspieler refers to - using (original GZ and/or GZ2) objects matched to the controls (by apparent magnitude, colors (g-r, r-i), and size):

1. Get an objective measure of the seeing/image quality (something from the PhotoObj table?), sort all objects into distinct bins (a multiple of 6, to allow easy concatenation/re-binning), and analyze for systematic trends.

2. Ditto for extinction, by (gri) band (also a set of parameters with quantitative values, in PhotoObj).
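A minimal sketch of suggestion 1, on synthetic data; the seeing values and 'smooth' flags here are placeholders for whatever PhotoObj quality parameter and classification flag we adopt:

```python
# Sort objects by an image-quality measure, cut into 6 equal-sized
# bins (a multiple of 6, so bins can later be concatenated into 3 or
# 2 coarser ones), and look for trends in the 'smooth' fraction.
import random

random.seed(42)
# toy catalog: (seeing in arcsec, classified-as-smooth flag)
catalog = [(random.uniform(0.8, 2.2), random.random() < 0.5)
           for _ in range(600)]

catalog.sort(key=lambda row: row[0])   # best seeing first
n_bins = 6
bin_size = len(catalog) // n_bins

smooth_fraction = []
for i in range(n_bins):
    chunk = catalog[i * bin_size:(i + 1) * bin_size]
    frac = sum(1 for _, smooth in chunk if smooth) / len(chunk)
    smooth_fraction.append(round(frac, 3))

print(smooth_fraction)   # one 'smooth' fraction per seeing bin
```

With real classifications, a systematic rise or fall of the fraction across the bins would be the trend to report.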

As far as I know, no one has done any work on how differential reddening affects visual morphological classifications. And while there's plenty on the effects of seeing/image quality, it has not been studied by any of the GZ Science Team (as far as I know), and is not even mentioned in either Bamford+ 2009 or Land+ 2008.


  • JeanTate by JeanTate

    I've just discovered that at least one of our Q candidates is in Stripe 82, AGS000006g. This gives us a marvelous opportunity to dig deeply into classification biases!

For example, we could persuade whichever members of the GZ Science Team created the SDSS GZ4 images from scratch (from the FITS files) to do the same for all the objects among the GZQ's 6k, starting with a simple co-add. With an additional, modest, investment of time, two subsets of co-adds could be created: the ~15% with the best image quality, and the same number with the worst. Also, by upping the number of independent classifications per object to ~40 (say) - for all the Stripe 82 galaxies, both ordinary and co-adds - we could get a handle on how stable the classifications are (as N increases).

    On top of that, many - perhaps most - of the Stripe 82 objects will have been classified in one earlier GZ project or another, perhaps more than one.


  • lpspieler by lpspieler moderator in response to trouille's comment.

Well, the two most obvious ways of using control data are certainly

    • to discard any "special characteristics" of post-quench galaxies which are found to the same extent in the control sample
    • to note that, if the post-quench galaxies and the control sample become more similar w.r.t. some characteristic (say morphology) with diminishing brightness (or other features), this could be a hint that the difference between the two samples is underestimated.
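The first bullet could be made quantitative with a simple distribution comparison. Here is a hedged sketch using a hand-rolled two-sample Kolmogorov-Smirnov statistic on synthetic colors (all numbers invented for illustration):

```python
# Compare the distribution of some characteristic (e.g. u-r color)
# between the quench and control samples: a small KS statistic means
# the characteristic is NOT "special" to the post-quench sample.
import random

def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    d = 0.0
    for x in grid:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

random.seed(1)
quench  = [random.gauss(2.5, 0.4) for _ in range(300)]  # toy u-r values
control = [random.gauss(2.5, 0.4) for _ in range(300)]  # same distribution
shifted = [random.gauss(1.8, 0.4) for _ in range(300)]  # genuinely different

print(ks_statistic(quench, control))  # small: not a "special" feature
print(ks_statistic(quench, shifted))  # large: a real difference
```

Repeating the comparison within brightness bins would then probe the second bullet (whether the samples converge with diminishing brightness).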


  • JeanTate by JeanTate

Identifying the outliers in both the post-quenched and control samples is something we'll have to do, and it may be an on-going (iterative) process. For example, AGS00001dn has a foreground star overlapping the nucleus, and its spectrum is heavily contaminated by the star's.

    With a clean control sample, we can compare the distributions of classification features - such as 'smooth' (proxy for 'elliptical') vs 'features' ('spirals') - with those in Lintott+ 2011 (the 'DR1 paper', based on the method described in Bamford+ 2009), to test for consistency of bias (with that identified in the DR1 paper). If the biases are consistent (and they may not be), we can use differences in the distributions of classification features (post-Q vs control) to attempt a finer-grained analysis of bias.

This paragraph, from Lintott+ 2011, describes the method they used:

    As described in appendix A of Bamford et al. 2009, we first
    divide the sample into bins of similar luminosity, physical size and
    redshift. We then find for each point in luminosity–size space the
    lowest redshift bin containing at least 30 galaxies, and assume that
    this bin represents the ‘true’ early-type to spiral ratio. In an attempt
    to keep this baseline estimate unbiased, we only consider bins well
    away from the magnitude, size and surface brightness limits of the

    It may turn out that the post-Q sample (and perhaps control too) has, from a morphology perspective, a lot more AGN-like nuclei (even though they may not actually be AGNs) than the galaxies in Table 2 of DR1. If so, then the results of the research Brooke did - in an earlier iteration of GZ - on the influence of AGNs on morphological classifications may be valuable. Even without that, we may be able to explore classification biases due to bright, point-like nuclei, by investigating the control sample alone.


  • wassock by wassock moderator

If you are worried about how different people treat images in different ways, it may be wise to first consider whether they are all looking at the same thing, particularly if you are worried about colour perception. Everyone in the study is using different kit, with differing resolutions, graphics drivers, screen brightness, contrast, etc. Pop down to your local TV store to see the range of subtle colour variations between different screens. Does an iPad display the same image as a low-spec PC?


  • JeanTate by JeanTate in response to wassock's comment.

    One particular classification bias was studied in considerable detail, that of whether zooites favored 'clockwise' over 'anti-clockwise' winding directions for spirals (or vice versa). The results are in "Galaxy Zoo: the large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey" (Land et al. 2008).

    In this project, we are not considering 'spin' (as Land et al. described it), but the 'smooth vs features' classification (a top-level choice in our decision tree), which is known to be biased. In the earlier GZ papers, that bias was characterized in terms of luminosity, redshift, and size.

As a first pass, we could probably use the same de-biasing method; however, the images used in Stage 1 come from DR8 and are quite different from the DR6 images used in the original GZ, even though the underlying data (the FITS files) are the same. The 'experimental design' (i.e. the questions asked, and the layout) is different too, something which Land et al. report as having a clear effect (the 'spin bias' in their 'bias study' differs from that in the original GZ).

    In any case, it's something I think we will have to do, in Stage 2 of this project.


  • lpspieler by lpspieler moderator

For Zooites to play around with, I've created a dashboard with tables for the quench sample and the control sample that contain all 10 possible colors (pairwise differences between bands), pre-filtered into Smooth vs. Features:

    As an example of how to play around I added scatter plots for all 4 combinations sample/control and smooth/features of u-r vs. r-z. The difference between quench sample and control sample is remarkable. Even the two "smooth" scatter plots look quite different.


  • lpspieler by lpspieler moderator

OK, some first results. Given that it is not yet possible to use our classification results as axes in diagrams, I divided the data into four segments - QS/QC vs "smooth"/"features" - and plotted the various quantities that were available as axes against estimated redshift.

    Nothing spectacular but some interesting bits:

    No surprising results concerning PetroRad50: in both QS and QC, the objects classified as "smooth" are more sharply concentrated below 6, whereas the values for "features" objects often rise up to 10, in a few cases even above. In all four segments (QS/QC, smooth/features), higher values are found at lower redshifts.

    LogMass sharply concentrated between 10 and 11 for most redshift
    ranges, again with greater span at lower redshift. This is (I'm assuming)
    due to easier estimation for closer galaxies?

    D4000 exists only for the sample. No markedly differing distribution
    wrt redshift between "smooth" and "features".

    Interesting differences for color u-r:

    Both "smooth" segments are concentrated below 3.
    However, the sample becomes more sharply concentrated around 2.5
    with rising redshift whereas the control becomes less sharply
    concentrated at higher redshifts.

    Both "features segments are more sharply concentrated between 1 and 3.
    Also here the QS becomes more sharply concentrated around 2.5 wheres QC

    Properties of r-z:

    Most data in all four segments lie between 0 and 1 (with some striking outliers out to ±2.5).

    In all four segments data become more sharply concentrated around 0.7
    with higher redshift. That trend is stronger in both "features" segments.


  • JeanTate by JeanTate

    LogMass sharply concentrated between 10 and 11 for most redshift ranges, again with greater span at lower redshift. This is (I'm assuming) due to easier estimation for closer galaxies?

    The lack of any significant number of objects with log(mass) > ~11.x is very interesting; it means that no massive galaxy is a post-quenched one (at least in SDSS; caveats apply)! It's not terribly surprising, but - assuming further analysis checks out - a nice confirmation of some well-established (?) hypotheses concerning galaxy formation.

The 'greater span at lower redshifts' is - largely and almost certainly - a manifestation of Malmquist bias; low(er) mass galaxies can only be 'seen' in SDSS out to a redshift of 0.02 (say; turning this into a quantitative statement is fairly straightforward).
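One hedged way to make that statement quantitative, using the distance modulus and a low-redshift Hubble-law approximation (the H0 value and magnitude limit are assumptions, not project numbers):

```python
# Given a survey's apparent-magnitude limit, a galaxy of absolute
# magnitude M is only visible out to a limiting distance, hence a
# limiting redshift: the essence of Malmquist bias.
import math

H0 = 70.0       # km/s/Mpc (assumed)
C = 299792.458  # km/s

def limiting_redshift(abs_mag, mag_limit=17.77):
    """Redshift at which a galaxy of abs_mag reaches mag_limit.

    mag_limit defaults to the SDSS main-galaxy spectroscopic limit
    (r = 17.77); uses the distance modulus m - M = 5 log10(d_pc) - 5.
    """
    d_pc = 10 ** ((mag_limit - abs_mag + 5.0) / 5.0)
    d_mpc = d_pc / 1e6
    return H0 * d_mpc / C   # low-z Hubble law: z ~ H0 d / c

# a faint dwarf drops out much sooner than a luminous giant
print(round(limiting_redshift(-17.0), 4))
print(round(limiting_redshift(-21.0), 4))
```

For M = -17, this lands right around z ~ 0.02, consistent with the figure quoted above.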


  • lpspieler by lpspieler moderator

    Should have figured out the visibility connection myself.

    What's more:
    Interesting relationship between PetroRad50 and colors:

    Generally, in the two "features" segments the data points stretch out more commonly towards PetroRad50 values of 8 or even 10, whereas the "smooth" segments are more strongly concentrated below 6.

    The u-r values of the "smooth" segments span 0.5-4 (with a narrower tail around 2 at higher PetroRad50). The "features" segments are more strongly concentrated between 1 and 3.5 (the QS concentration is sharper than the QC's).

    Differences between the segments in the r-z vs. PetroRad50 distribution are less marked: all 4 segments are roughly concentrated in r-z 0.0-1.2. However, the two "features" segments become less sharply concentrated at higher PetroRad50, whereas the concentration of the "smooth" segments becomes sharper.

  • JeanTate by JeanTate in response to lpspieler's comment.

    Cool! 😃

    Can you share any Dashboards you've created which show these trends?


  • lpspieler by lpspieler moderator in response to JeanTate's comment.

    The scatterplots are PetroRad50 vs r-z. There is also a column u-r which you can assign to the Y axis.

Don't be fooled by those nasty outliers with negative values, which I can't filter out (as reported in the Tools issues thread). The positive outliers, though, can be filtered out.

    Of course, it is possible to filter out data points with unwanted values in the custom-made tables themselves, instead of in the scatterplots. This might be what I do next: I have already created another dashboard with all colors (channel magnitude differences). (Make sure to minimize everything at first, so you get a better overview of what's already there.)

I'll plot all conceivable diagrams to derive "constrained" value ranges and then apply these as filters in the tables.

    Arrgh, if only the data were downloadable already. A tool like R would be able to create scatterplots for all pairs of columns at once!


  • JeanTate by JeanTate

[plot: 'spiral fraction' per redshift bin, QS vs QC]

    If dividing into four (equal) groups gives you quartiles, then 24 gives you ... vicesimoquartiles?

    Anyway, I removed the duplicates (and two redshift outliers) from both the QS and QC catalogs, ranked them by redshift, and divided them into 24 equal bins. The mean redshift ranges from 0.019 (bin #1) to 0.255 (bin #24).

Within each bin, I calculated the 'spiral fraction': the number of objects classified as 'Features or disk' divided by the total number in the bin¹. I then plotted them; the black line is the combined fraction.
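For anyone wanting to reproduce this, a sketch of the binning and fraction calculation on synthetic stand-in data (real QS/QC redshifts and vote flags would replace the random ones):

```python
# Rank objects by redshift, cut into 24 equal-sized bins
# ("vicesimoquartiles"), and compute the 'Features or disk' fraction
# per bin, separately for each catalog.
import random

random.seed(3)

def spiral_fractions(objects, n_bins=24):
    """objects: list of (redshift, is_features) pairs."""
    objects = sorted(objects)             # rank by redshift
    size = len(objects) // n_bins
    fractions = []
    for i in range(n_bins):
        chunk = objects[i * size:(i + 1) * size]
        fractions.append(sum(f for _, f in chunk) / len(chunk))
    return fractions

qs = [(random.uniform(0.02, 0.26), random.random() < 0.5) for _ in range(2400)]
qc = [(random.uniform(0.02, 0.26), random.random() < 0.5) for _ in range(2400)]

print(spiral_fractions(qs))
print(spiral_fractions(qc))
```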

    There does not seem to be any difference - either in any individual bin, or as an overall trend - between QS and QC. Which suggests that 'spirals' are just as common among the QS objects as among their controls. Possible exception: from bin #17 on, there may be fewer QS objects with 'spiral' morphology.

    Of course, this plot, by itself, cannot be used to estimate classification bias, because the galaxies have different average stellar masses (to take just one example; this plot is from the Quench: Sample vs Control, what's the same, what's different thread):

[plot: from the Quench: Sample vs Control, what's the same, what's different thread]

¹ I also ignored the 15 objects classified as 'Star or artifact'. I have now edited the plot to include these (no surprise that 8 of them are in the five highest redshift bins).


  • JeanTate by JeanTate

[plot: mean 'roundness' of 'smooth' objects per redshift bin, QS vs QC]

    This may take a bit more explaining. What it seeks to display is how the 'average roundness' of the objects classified as 'smooth' varies with redshift. Thus the degree of tilt ('inclination' in astronomer-speak) of spiral galaxies - how close to being face-on they appear - is not in this plot, because 'Features or disk' galaxies do not have a 'Roundness' (in the classification tree).

    As in my last post, I divided the catalogs into vicesimoquartiles, by redshift (24 equal-sized bins)¹. I gave "Completely round" a score of 0, "In between" 1, and "Cigar shaped" 2, and I multiplied the means by 3.

    Why 3? Because a popular classification scheme for elliptical galaxies is to give them an "E value", with completely round ones being E0 and the most extreme (cigar-shaped) E7². So, as a rough approximation, I assumed "Cigar shaped" is, on average, E6; I also assumed that zooites' estimates scale linearly. Fortunately, the estimated axis ratios of all the galaxies are available from DR7; later I will run a CasJobs query to download them, calibrate the zooite classifications, and revise the transformation.
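As a sanity check of the arithmetic, a small sketch of the score-to-E-value mapping described above (the vote strings mirror the decision-tree options; the example votes are invented):

```python
# Vote options scored 0/1/2, means multiplied by 3 so the scale runs
# E0..E6, alongside the exact E value computed from an axis ratio b/a.
SCORE = {"Completely round": 0, "In between": 1, "Cigar shaped": 2}

def mean_e_value(votes):
    """Mean roundness score of a list of vote strings, scaled by 3."""
    scores = [SCORE[v] for v in votes]
    return 3.0 * sum(scores) / len(scores)

def e_from_axis_ratio(b_over_a):
    """Classical ellipticity class: E = 10 * (1 - b/a)."""
    return 10.0 * (1.0 - b_over_a)

votes = ["Completely round", "In between", "In between", "Cigar shaped"]
print(mean_e_value(votes))      # 3 * (0+1+1+2)/4 = 3.0
print(e_from_axis_ratio(0.3))   # an E7, on this scale
```

Calibrating against the DR7 axis ratios would amount to replacing the linear scaling assumption with a fitted one.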

    What does the plot show? For starters, if you assume that 'roundness' is just as easy to decide for nearby (and big) ellipticals (or, more likely, ETGs, early-type galaxies, which is ellipticals plus lenticulars) as it is for distant (and small) ones, the plot shows that the galaxies in each bin are different mixtures of intrinsic roundness ... those at higher redshift are rounder than those at lower redshift. And this is consistent with what is - I think! - already known; namely, that the more massive an ETG is, the rounder it is (remember that stellar mass increases - smoothly, monotonically - with redshift).

    So far, still nothing we can use - directly - to estimate classification bias.

    However, the control galaxies are, in all but three (maybe four) vicesimoquartiles, rounder than the QS ones (though beyond bin #17 this difference may be too small to be of any significance).

¹ The mean redshift ranges from 0.019 (bin #1) to 0.255 (bin #24)

² There's a quantitative basis for this: the E value is derived from the axis ratio as 10*(1 - b/a), where a is the length of the major axis and b that of the minor axis. There's more on this in the GZ forum thread What is the relationship between 'ellipticity' and 'axis ratio'?


  • zutopian by zutopian

I would like to mention the following remark by Kevin Schawinski in an old GZ blog post:

    "The conclusion – post starburst galaxies are dominated by objects who have intermediate morphology (often half of you thought they were disks and half thought they were ellipticals – telling us that they are just hard to classify!)."

PS: I cited that remark also in the following discussion:

EDIT: PPS: I found some GZQ images which are actually easy to classify, but the classification results are wrong!


  • zutopian by zutopian

Below is a new paper by non-GZ astronomers, who checked some GZ classifications using an automated galaxy-morphology analysis method. I am not sure whether it would be useful for checking the QS and QC samples because, as far as I understand, it can't distinguish S0s from ellipticals:

    Quantitative analysis of spirality in elliptical galaxies

    "The results suggest that more than a third of the galaxies that were classified manually by Galaxy Zoo participants as elliptical actually have a certain spirality. Although in most cases the spirality was low, 10% of the galaxies classified as elliptical had a slope greater than 0.5, ..."

    Levente Dojcsak, Lior Shamir

    (Submitted on 1 Oct 2013)


  • JeanTate by JeanTate

    As I continue my analyses on Eos (see the What bias does the varying fraction of Eos - in the QS catalog - introduce? thread), I find there is a real need to get a good, consistent handle on classification biases.

At one level, this should be fairly straightforward: simply repeat the analyses Willett et al. (2013) used to de-bias the standard (i.e. not Stripe 82) GZ2 galaxy classifications. Unfortunately, only those with access to the actual Quench project zooite votes could do that, and we ordinary zooites do not have such access. But even if we did, I'm not sure it would help much ... our universe is only ~2% the size of GZ2's (~6k objects vs ~300k).

    An indirect method might be as follows:

    1. extract the full data on GZ2 classifications for both the QS and QC objects
    2. 'reverse engineer' the QC data - GZ2 votes/classifications and the corresponding QC ones - to produce estimates of the C_{i,j} and as many of the {s1, s2, s3, s4, s5, s6, s7, s8, s9} and {t1, t2, t3, t4, t5} as we need (see Section 3.3 of Willett+ 2013)
    3. using these parameter values, de-bias both QC and QS.
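A very loose sketch of what step 2 might look like; this simple least-squares mapping is only a stand-in for the actual C_{i,j}/s/t machinery of Willett+ 2013 Section 3.3, and all the QC numbers below are invented:

```python
# On the control sample, fit a simple relation between Quench vote
# fractions and the corresponding (already de-biased) GZ2 fractions,
# then apply the fitted relation to de-bias QS fractions.
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum(x * y for x, y in zip(xs, ys)) - n * mx * my) / \
            (sum(x * x for x in xs) - n * mx * mx)
    return slope, my - slope * mx

# toy QC pairs: (Quench 'features' fraction, GZ2 de-biased fraction)
qc_quench = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
qc_gz2    = [0.15, 0.32, 0.45, 0.61, 0.74, 0.90]

slope, intercept = fit_line(qc_quench, qc_gz2)

def debias(quench_fraction):
    """Map a Quench vote fraction onto the GZ2 de-biased scale."""
    return min(1.0, max(0.0, slope * quench_fraction + intercept))

print(round(debias(0.50), 3))   # a QS fraction, corrected
```

In practice the fit would need to be done per question, and probably per redshift/size bin, rather than with a single global line.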

    Such a method would go some way to addressing the well-known 'experimental design' aspect of galaxy morphology classification studies (see, for example, Land et al. 2008).

    What do you think?