Galaxy Zoo Starburst Talk

Differences between v5 QS and QC catalogs and their v4 counterparts

  • JeanTate by JeanTate

    This thread is triggered by a post by edpaget, on page 7 of the Mass Dependent Merger Fraction (Control vs Post-quenched Sample) thread, timestamped December 17 2013 4:41 PM; here it is in full:

    Hi all. The absmags values should now be back in the dataset. You may need to clear your browser's cache. I have no idea how they were accessible in the first place, since they were not in the dataset available previously. I'm going to guess weird cache magic.

    I just responded:

    Really?!?!? 😮

    I have no idea how they [the absmags values] were accessible in the first place, since they were not in the dataset available previously.

    I downloaded both the QS and QC catalogs on 26 September, 2013, from Tools, as .csv files. In QS, between fields called "sfr" and "hdelta_flux", and in QC between fields called "abs_r" and "sfr", there are five fields, named:

    • u_absmag
    • g_absmag
    • r_absmag
    • i_absmag
    • z_absmag

    Later - today, with luck - I'll download the two catalogs again, from Tools. And compare them with my v4s. I'll write up what differences I find.

    Do you mind if I ask, edpaget: What quality/version control do you (and the Zooniverse Development Team) use?

    Well, I've downloaded the two v5 catalogs, and must say that I'm quite, um, gobsmacked.

    Why?

    Because there are thousands - yes, you read that correctly - thousands of differences between the v4 and v5 files!!! 😮 😮

    I'm sure many - perhaps most, perhaps almost all - of these differences are trivial, and will not affect any of our analyses. For example, in the QS catalogs, there are fields in both v4 and v5 named "log_mass"; in the v5 one there's an extra field, named "logmass" 😮 No biggie; the values for each of the 3002 objects are the same, so it's simply a duplicate field, with a different name.

    In this thread I'll post all the differences I find, between the v4 and v5 catalogs, in both QS and QC.
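
    A minimal sketch of how a field-by-field comparison of this kind could be done, assuming each catalog is a .csv file with one row per object and a "uid" column. The function names and the tiny inline tables are invented for illustration; they are not the real QS or QC data, and this is not necessarily how the counts in this thread were produced.

```python
import csv
import io
from collections import Counter


def load_catalog(csv_text, key="uid"):
    """Parse CSV text into {uid: row}; every value stays a string."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def count_changes(old, new):
    """Per field present in both versions, count objects whose value differs."""
    changed = Counter()
    for uid in old.keys() & new.keys():
        for field in old[uid].keys() & new[uid].keys():
            if old[uid][field] != new[uid][field]:
                changed[field] += 1
    return changed


def extra_fields(old, new):
    """Field names that appear only in the newer version."""
    old_fields = {f for row in old.values() for f in row}
    new_fields = {f for row in new.values() for f in row}
    return new_fields - old_fields


# Made-up stand-ins for two catalog versions (not real values).
V4_TEXT = "uid,sdss_id,log_mass\nA1,5877000,10.2\nA2,5877100,9.8\n"
V5_TEXT = "uid,sdss_id,log_mass,logmass\nA1,5877042,10.2,10.2\nA2,5877100,9.8,9.8\n"

v4 = load_catalog(V4_TEXT)
v5 = load_catalog(V5_TEXT)
print(dict(count_changes(v4, v5)))   # fields with at least one changed value
print(extra_fields(v4, v5))          # fields only in the newer file
```

    Values are compared as text, so "10.2" and "10.20" would count as a change; that is deliberate here, because reading IDs as numbers is exactly what can damage them.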

    What really knocked me for six, temporarily, is the discovery that the apparent quality control - over the data - seems appallingly bad. IRL I've worked in commercial environments in which this kind of, um, sloppiness would have resulted in dismissals. Much, much worse is the suspicion that this apparent sloppiness can be found in quality control over data in other Zooniverse projects ... how corrupt are the 'click databases'?

    Posted

  • JeanTate by JeanTate

    2984 of the changes, in the QS catalogs, are in the field "sdss_id". This is good news: the v5 catalog seems to have restored the SDSS DR7 ObjIds to their original, 18-character-long glory, reversing the frustrating error made in the v3 and v4 catalogs (i.e. truncating them, by treating them as integers not text strings perhaps).

    And the 18 objects for which there is no change? Why, the 'real/true' ObjId ends in '00', so truncation made no difference! 😄

    At least, so it seems. Given what I have called the apparent sloppiness and poor quality/version control, I'm going to check all the uid-sdss_id matches against those in the v2 catalog. Stay tuned.

    Posted

  • JeanTate by JeanTate

    I really, really hope the following change is not due to corruption of the Quench 'clicks database', that it is limited to just an error/flaw/etc in extracting and processing that data, to input it to the database used by Tools.

    The following five QS objects have had their zooite classifications changed, from the first question in the decision tree ("Is the galaxy simply smooth and rounded, with no sign of a disk?"):

    • AGS00000pd : Features or disk -> Smooth/In between
    • AGS000018s : Features or disk -> Smooth/In between
    • AGS00001dp : Features or disk -> Smooth/Cigar shaped
    • AGS00001ig : Features or disk -> Smooth/Cigar shaped
    • AGS000020j : Features or disk -> Smooth/In between

    These changed classifications are the cause of all the changes in the following fields (number of such changes in brackets):

    • smooth (5)
    • disk_edge (5)
    • center_bar (3)
    • central_bulge_prominence (3)

    In addition, they are the cause of some - but not all! - of the changes in classifications of the following fields (number of such changes/total changes):

    • symmetrical (2/4)
    • spiral_arms (3/5)
    • merging (1/104)
    • how_round (5/227)
    • clumps (5/N¹)

    ¹ I'm saving this for later; it's a real doozy 😦

    Posted

  • JeanTate by JeanTate

    More classification changes, this time to arm_tightness ("How tightly wound do the spiral arms appear?"):

    That's all the changes in the arm_tightness, and together with those in my previous post all the changes in the field spiral_arms. In addition, one more of the "N" changes in clumps is accounted for here.

    While it's not going to make much - if any - difference to the results, ChrisMolloy will, no doubt, be at least somewhat perturbed to learn that there are four QS objects with changes in the "symmetrical" ("Does the galaxy appear symmetrical?") field:

    The last two are also affected by a change from Fod to Smooth (see above). In addition, both the first two are among the "N" changes in the clumps field, and one - AGS00000kg - is one of the 104 changes in the merging field.

    Posted

  • trouille by trouille scientist, moderator, admin

    Hi Jean,

    Before you get too far into this, Ed is updating the Quench tables as we speak. I would hold off until 2 hours from now. I'll post here when I know he's done. Thanks!

    Posted

  • trouille by trouille scientist, moderator, admin

    That was fast. Ed just let me know that the Tools Quench tables are back to 3002 sources each. OK, take it away Jean!

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    Thanks.

    Ed just let me know that the Tools Quench tables are back to 3002 sources each.

    That's good to know.

    If I may ask you, Laura, are you concerned over the (apparent, to me anyway) less-than-ideal quality/versioning control?

    OK, take it away Jean!

    Will do. However, I must say it's getting me a bit down.

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    I've just finished checking: the v5 and v6 QS catalogs are identical.

    So all my previous posts - which are about the QS catalog, and differences with the v4 one - remain valid.

    Posted

  • JeanTate by JeanTate

    This post concerns the disk_edge and central_bulge fields, in the QS catalog.

    In v4, there are 759 objects with non-null values in the disk_edge field; five more than in v5. Good (there are five Fod objects 'reclassified' as Smooth in v5).

    In v4, 526 of these 759 are "No" for disk_edge ... but only 523 in v5. That too is good (the 'discrepant three' are all Fod -> Smooth; all "No" in the disk_edge department).

    In v4, 233 (=759-526) are "Yes" for disk_edge ... but only 231 in v5. Likewise, that too is good (the 'discrepant two' are both Fod -> Smooth; both "Yes" in the disk_edge department).

    So how come there are 213 mismatches - between v4 and v5 - in the central_bulge field?!?

    Well, among the "No" for disk_edge, there are zero central_bulge values. In both the v4 and v5 catalogs. Good.

    However, among the 233/231 "Yes" for disk_edge, how many are "Yes" for central_bulge? How many "No"? And how many {null/blank}?

    19/189, 10/43, and 204/0 (v4/v5), respectively! 😮
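
    A breakdown like the one above can be tallied with a short script. This is only a sketch: it assumes the catalog rows have already been read as dictionaries of strings, and that a {null/blank} value is stored as an empty string. The function name and the sample rows are made up for illustration.

```python
from collections import Counter


def bulge_tally(rows):
    """Tally central_bulge answers for objects with disk_edge == "Yes"."""
    tally = Counter()
    for row in rows:
        if row.get("disk_edge") == "Yes":
            # Assumption: a missing or empty value means {null/blank}.
            answer = row.get("central_bulge") or "{null/blank}"
            tally[answer] += 1
    return tally


# Made-up rows, for illustration only.
sample = [
    {"uid": "A1", "disk_edge": "Yes", "central_bulge": "Yes"},
    {"uid": "A2", "disk_edge": "Yes", "central_bulge": ""},
    {"uid": "A3", "disk_edge": "Yes", "central_bulge": "No"},
    {"uid": "A4", "disk_edge": "No", "central_bulge": ""},
]
print(dict(bulge_tally(sample)))
```

    Running the same tally on the v4 rows and then on the v5 rows gives the two numbers in each pair.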

    In one sense that's good news: 204 QS objects had valid zooite classifications for central_bulge which the v4 QS catalog failed to provide. In another sense it's deeply disturbing: why did it take until December 2013 for this data to appear in the catalog?!?

    Now 204 != 213±2, so there must be some changes in the central_bulge classifications, beyond merely eliminating the {null/blank} values. And indeed there are:

    Some simple arithmetic: 213 = 204+11-2 ... the numbers match. Good.

    Does this matter? Is there any ordinary zooite so keen/foolish as to want to analyze the central_bulge classifications, of the Eos (by definition, Eos = "Yes" disk_edge, at least at the highest level of analysis)? Who could possibly be so ... pedantic ... as to want to do that?!? (there are no prizes for being able to - correctly - answer this question).

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    I almost forgot: among the 213:

    • every one is among the "N" changes in clumps
    • six of the 104 changes in merging are in these 213

    Otherwise, these are completely independent of all the other v4 -> v5 changes.

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    This next change in zooite classifications is almost comical.

    277 QS objects have changes in their how_round classification ... all but five are Cigar sahped -> Cigar shaped.

    The five? They're all Fod -> Smooth.

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    I'm saving this for later; it's a real doozy 😦

    Time to talk about the clumps ("Are there any off-center bright clumps embedded within the galaxy?") field.

    There are 3002 objects in the QS catalog. Care to guess how many differences there are, in the clumps field, comparing v4 with v5? 100? 1000? 2000?!?!

    Nah, you're not bold enough; the number is 2472, fully 82% of the objects.

    If that doesn't shake your faith in the trustworthiness of the Science/Development Team's work, I can't imagine what would.

    OK, a breakdown of this - totally gobsmacking - difference:

    • it's an 'equal opportunity' difference: objects classified as both Fod and Smooth are affected (Star or artifact classifications are exempt: all clumps values are {null/blank}, in both v4 and v5)
    • the objects with no clumps classification change are either Star or artifact OR Fod; putting this another way, EVERY object classified as Smooth has the value in the clumps field changed, between v4 and v5

    In fact, expanding on that last bullet, in v5 it's as if not a single zooite who said an object is smooth supplied an answer to the clumps question ("Are there any off-center bright clumps embedded within the galaxy?"). Or: all zooite classifications of this kind were dumped into the trash bin.

    For those zooites who voted Features or disk (in the main¹), the clumps answers/classifications are a bit more nuanced:

    • 522 Fod objects have the same clumps classification in v4 and v5
    • five Fod objects with v4 classifications that changed to Smooth in v5 lost their clumps classifications (obviously)
    • 231 of the remaining v5 Fod objects 'lost' their v4 clumps classifications
    • one retained its Fod classification, but changed its clumps classification: AGS00001k4 : No -> 1

    Still to go: merging. This is - obviously - vital to the work jules has been doing. It's also a field that has rather more than a trivial number of changes, from v4 to v5, in QS (104). As she'll be offline until well after Xmas, I won't post what I found (let her have a peaceful break). Instead I'll next look at the differences between the v4 and v6 QC catalogs.

    ¹ we ordinary zooites have no choice but to trust the Science/Development Team(s) on this; as we do not have access to the classification vote distributions, we have no way to independently check

    Posted

  • jules by jules moderator in response to JeanTate's comment.

    Fire away Jean. I did spot the difference in V5 sample numbers and I should have mentioned it. However, my philosophy throughout has been to do the donkey work (set up the filtered sub-tables etc) on the basis that it's easy to change the underlying "master table" once we have the final version. Not having a huge amount of time to devote to Quench it seemed a reasonable plan. So go on - what horrors have you found amongst the mergers?

    What you have unearthed so far is just staggering. I can confirm that both samples currently total 3002 items each and the duplicate log mass columns are still there but clearly we can't take this project any further until we have datasets we can be confident in using. On a more positive note these findings should help the science team enormously. I really hope there are no wider repercussions other than the Quench project which as a pilot has certainly thrown up some wild cards of late.

    Sterling work as ever Jean - please tell me you will have some time off next week. 😉

    (I only dropped by to make sure everything was ticking along nicely...)

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    And the 18 objects for which there is no change? Why, the 'real/true' ObjId ends in '00', so truncation made no difference! 😄

    At least, so it seems. Given what I have called the apparent sloppiness and poor quality/version control, I'm going to check all the uid-sdss_id matches against those in the v2 catalog. Stay tuned.

    And it is indeed the case that all 18 QS objects whose sdss_ids match in the v2 and v4 catalogs have DR7 ObjIds which end in '00'.

    That's the good news.

    The somewhat less good news is that there are three others which end in '00', but whose v2 and v4 catalog sdss_ids do NOT match! 😮 Here they are, the v2 sdss_id, then the v4 one, then the uid:

    • 587727944570568900, 587727944570569000, AGS00000bm
    • 587735347488751800, 587735347488751700, AGS00001gu
    • 587742013297983800, 587742013297983700, AGS000029o

    It would seem that whatever truncation was done on the DR7 ObjIds - resulting in character strings ending in '00' - it was not the obvious one.

    Posted

  • JeanTate by JeanTate in response to trouille's comment.

    In an earlier post in this thread I noted that - as far as I could tell - the v5 and v6 QS catalogs are identical.

    We already know that the v5 QC one is 'short' 56 QC objects, and that the v6 QC catalog contains 3002 objects; we also know what those 56 objects are.

    How, then, do the v4 and v6 QC catalogs compare? For example, does the v6 one include the full, un-truncated SDSS DR7 ObjIds (as the v6 QS one does)? And what about AGS00004n1, the 'double entry' object which was not entirely removed from the v4 QC catalog?

    The good news: AGS00004n1 is indeed not in the v6 catalog.

    The almost good news: otherwise the v4 and v6 QC catalogs are identical, with one exception (see below). Why is this 'almost' good news? Because the truncated DR7 ObjIds are still in v6!

    The exception? The v6 catalog contains an extra field, called "logmass". Each of the 3002 values in this field is the same as the corresponding (per uid) value in the "log_mass" field. In other words, it's a redundant field.

    Posted

  • mlpeck by mlpeck

    The somewhat less good news is that there are three others which end
    in '00', but whose v2 and v4 catalog sdss_ids do NOT match! 😮 Here
    they are, the v2 sdss_id, then the v4 one, then the uid:

    I think this probably isn't too mysterious. Whatever program the data were imported into most likely tried to convert the ID strings to double precision floating point numbers rather than large integers. A 64-bit integer type such as C's long long would have enough digits to represent SDSS IDs exactly, but it's not too likely a typical spreadsheet would use one.

    A double precision number has a 53-bit significand, which works out to about 16 decimal digits of precision on Intel architecture machines. But remember the representation is binary: an 18-digit ID gets rounded to the nearest representable binary value, not cut off at a decimal digit, so the result need not match a simple decimal truncation.
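
    A small sketch of that effect, assuming IEEE 754 doubles (which is what Python's float uses). The ID below is invented for illustration and is not a real catalog entry; math.ulp needs Python 3.9 or later. This shows the general mechanism only, not what the Tools import actually did.

```python
import math

# Hypothetical 18-digit ID, invented for illustration; not a real catalog entry.
true_id = "587727944570568913"

as_double = float(true_id)    # what a numeric column import would store
round_trip = int(as_double)   # the exact integer that double represents

# With a 53-bit significand, doubles between 2**59 and 2**60 are
# 2**(59 - 52) = 128 apart, so most 18-digit integers have no exact double.
spacing = math.ulp(as_double)

print(true_id, "->", round_trip, "(spacing", spacing, ")")
print(int(true_id) == round_trip)  # False: the trailing digits were altered
```

    Reading the ID column as text, and only ever comparing it as text, avoids the problem entirely.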

    Posted

  • JeanTate by JeanTate in response to mlpeck's comment.

    Agreed, likely not too mysterious.

    Its relevance is - for now - moot ... both the QS and QC objects have robust mappings, both of the uids with DR7 ObjIds and QS-QC object pairings (redshift and log_mass), and the cross-ID file I produced (here) seems to accurately record the mappings/pairing/etc.

    Posted

  • JeanTate by JeanTate in response to jules's comment.

    Thanks jules.

    So go on - what horrors have you found amongst the mergers?

    I'll post my analysis, but likely not until tomorrow (too much it's-just-two-days-to-Xmas right now).

    What you have unearthed so far is just staggering. [...] On a more positive note these findings should help the science team enormously.

    I hope so too. What worries me - greatly - is the uncertainty of how widespread this is.

    I've already found - and written up, to varying degrees - quite a few examples of what looks (to me) like sloppiness in GZ-related data analysis etc, but this is the first case I've come across of what looks awfully like a serious data integrity problem. I've written a post/thread in the GZ forum on it - How is the data integrity of the Galaxy Zoo 'clicks databases' assured? (it's copied in GZ Talk too).

    At some point it might be worthwhile reviving an aspect of the various discussions preceding/at the Zooniverse Chicago Workshop; for example, I suspect that the way that Space Warps and Planet Hunters are set up, as projects, makes these sorts of problems much less likely to happen. But that's for another day ...

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    I hope you all had a great time over the break ...

    Still to go: merging. This is - obviously - vital to the work jules has been doing. It's also a field that has rather more than a trivial number of changes, from v4 to v5, in QS (104).

    The changes are of two kinds:

    • what was "Disturbed" in v4 is "Neither" in v5 (96)
    • The "merging" classifications for "Star or artifact" QS objects have all become {blank} in v5 (8)

    In both cases, ALL objects so classified in v4 were re-classified in v5: all v4 "Disturbed" became v5 "Neither", and no other merging classification was changed; whatever merging classification "Star or artifact" objects had in v4, they all became {blank} in v5, and those are the only QS objects for which merging became {blank}.
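
    One way to check a claim of this kind is to count every (v4 value, v5 value) pair for a single field. The sketch below assumes each catalog version has been loaded as a dictionary keyed by uid, with {blank} stored as an empty string; the function name and the sample values (including "Merger") are made up for illustration.

```python
from collections import Counter


def transitions(old, new, field):
    """Count (old value, new value) pairs for one field, changed values only."""
    pairs = Counter()
    for uid in old.keys() & new.keys():
        before = old[uid].get(field, "")
        after = new[uid].get(field, "")
        if before != after:
            pairs[(before, after)] += 1
    return pairs


# Made-up stand-ins for two catalog versions (not real values).
old = {
    "A1": {"merging": "Disturbed"},
    "A2": {"merging": "Neither"},
    "A3": {"merging": "Merger"},
}
new = {
    "A1": {"merging": "Neither"},
    "A2": {"merging": "Neither"},
    "A3": {"merging": ""},
}
for (before, after), n in transitions(old, new, "merging").items():
    print(repr(before), "->", repr(after), n)
```

    If every changed object falls into just one or two pairs, as with "Disturbed" -> "Neither" here, the change is systematic rather than scattered.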

    Posted