Talk:What does a GID represent
From ICISWiki
Ruaraidh Sackville Hamilton:
Summary of issues, 28 April 2010
- Establishing the correspondence between a germplasm record (identified by GID) and the germplasm sample(s) it represents – can we resolve the difficulty of doing this reliably just by adding some more carefully defined methods of creating germplasm?
- Ontology of methods – we need a review and better ontology of methods; can this be fed into the GCP ontology process? How and what info should we feed into it?
- Ditto for NTYPE and NSTAT: we need a review and better ontology of the type and status of names; can this be fed into the GCP ontology process? How and what info should we feed into it?
- Scope of the germplasm represented by a GID – does it represent a single packet of seed, the entire harvest of a plot even though split into multiple packets, a whole management neighbourhood, multiple management neighbourhoods, multiple maintenance neighbourhoods, multiple derivative neighbourhoods, or multiple generative neighbourhoods? (It could represent them explicitly because it is the founding sample of such neighbourhoods so that we can trace other GIDs representing elements within the neighbourhood(s); or implicitly because it is the only GID representing the neighbourhood(s).) Do we need to do anything to document this scope? I think there is already some wiki discussion on this.
- Biology vs management. Remember we had to introduce MGID to embody the management concept, separately from GPID1. MGID points to the root of a management neighbourhood (a management concept). GPID1 points to the root of a derivative neighbourhood (a biological concept). (We don’t have any pointer to the root of a maintenance neighbourhood, except in warehouse tables). Is it semantically OK to have a single field METHN to encompass both concepts?
Ian DeLacy:
The base understanding is that a GID represents a packet of seeds (in the generic sense as it may represent a clone) that exists now or did exist sometime in the past. If you wouldn’t mix the seeds from different packets then each should get a new GID. If this is the case, then each of your accessions will, over time, acquire many (in the end 100s) GIDs. This is the base concept in ICIS and means absolutely that Accession and GID cannot have a 1:1 relationship. What is a must is that GIDs have a many (GIDs) to one (Accession ID) relationship, implying that a GID cannot refer back to more than one Accession ID. This is what Graham refers to as a management neighbourhood. This is the same concept as each (derivative) GID being able to refer back to only one generative GID – a generative neighbourhood.
Ruaraidh:
Absolutely agreed. What you describe is the ideal we aim for, and the ideal we achieve for our genebank. I didn’t go over this issue because I think we’ve got that one under control, at least to the extent that is under our control, i.e. for our own accessions; but you are right it is part of the concept of what a GID represents – specifically how broad is the range of genetic entities that are referenced by one GID.
In our genebank, every cycle of seed increase generates a new GID, so yes an accession is indeed a family of GIDs within one management neighbourhood. However, based on discussion with Graham, when we split a single harvest into multiple seedlots e.g. for long-term vs short-term storage, we use different lotids, attached to a common GID, to distinguish the different packets used for that harvest. That means our 1:1 correspondence is GID:packet of seeds harvested, not GID:packet of seeds stored. At the root of that management neighbourhood is the GID that represents the seed packet that arrived at IRRI; that’s the GID that should have the proposed new method “accessed into genebank collection”, and that’s the GID on which we hang all the passport data, and that’s the GID for which I have difficulty identifying ownership. All the other GIDs have other methods such as seed increase, and we use the new field MGID to point back to the root GID representing the incoming sample, so having identified ownership of the root GID there is no problem identifying the other GIDs within the same management neighbourhood. By this means, we can use this root GID to refer specifically to the incoming sample, or to the entire family of GIDs that constitute the accession (two types of entity that cannot be distinguished for accessions from other genebanks represented by only one GID)
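A minimal sketch of the arrangement described above, in Python. The field names GID, METHN and MGID come from the discussion; the method labels, the lot identifiers, and the convention that the root record carries no MGID of its own are assumptions made only for illustration, not the actual ICIS schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical method labels; the real METHODS table is not reproduced here.
ACCESSED = "accessed into genebank collection"
SEED_INCREASE = "seed increase"


@dataclass
class Germplasm:
    gid: int
    methn: str
    # MGID points back to the root of the management neighbourhood.
    # Assumption: the root record itself stores no MGID (None).
    mgid: Optional[int] = None
    # One harvest (one GID) may be split into several stored lots.
    lot_ids: List[str] = field(default_factory=list)


def management_root(records: Dict[int, Germplasm], gid: int) -> int:
    """Return the GID of the incoming sample that roots this accession."""
    rec = records[gid]
    return rec.gid if rec.mgid is None else rec.mgid


def accession_family(records: Dict[int, Germplasm], root_gid: int) -> List[int]:
    """Return every GID in the management neighbourhood of root_gid."""
    return sorted(g for g in records if management_root(records, g) == root_gid)


# Hypothetical data: one accession with two seed increases, plus an unrelated one.
records = {
    1: Germplasm(1, ACCESSED),
    2: Germplasm(2, SEED_INCREASE, mgid=1, lot_ids=["long-term", "short-term"]),
    3: Germplasm(3, SEED_INCREASE, mgid=1, lot_ids=["long-term"]),
    4: Germplasm(4, ACCESSED),
}

family = accession_family(records, 1)
```

Here GID 1 can stand either for the incoming sample alone or, through `accession_family`, for the whole accession. An accession represented by a single GID (GID 4 in the sketch) offers no such distinction.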
So our genebank procedure follows the ICIS ideal closely. The problem is you can’t force that on others, and as far as I can see literally no one else follows the same standards that we do, no one else does what you recommend.
• In all other cases, there is at best a 1:1 relationship GID:management neighbourhood
o The breeders choose not to track their seed increases using GIDs, so they have one GID referring to any number of cycles of seed increase (they use their own seed lot management systems outside ICIS for managing seed increases)
o For data about germplasm received from others, data are almost never available on management of the material, so we have no choice but to document 1:1 for GID:management neighbourhood. This applies equally to breeding lines, genetic stocks, released varieties, and accessions at other genebanks.
• Worse, often a single GID has been used to refer at least to multiple management neighbourhoods if not even more broadly
o Very often breeders and other researchers refer to their samples by variety name, which is far more problematic than referencing the accession ID, especially for landraces. At least when referring to an accession ID you know that you are referring to one management neighbourhood. With a variety name you’ve no idea what has happened genetically since the variety was released or how many different management neighbourhoods it’s been through – so the same GID represents a family of samples that has at the least been through multiple management neighbourhoods. They could even have come from different maintenance or derivative neighbourhoods (like the Californian rice variety Calrose – two very different genetic entities released under the same name in the same state)
o Often they don’t change the name or GID even if they’ve knowingly put it through cycles of selection to change it or to create a pure line – so the same GID may refer to a family of samples from different maintenance neighbourhoods.
o Often a single GID is used to refer to an entire landrace, even though it may have been put through many centuries or millennia of various unknown generative and derivative and maintenance methods – so here 1 GID refers to multiple derivative neighbourhoods and even to multiple generative neighbourhoods.
o When seed stocks run out, some breeders even replenish stocks from an independent source without changing GID, even though that independent source is from a different management neighbourhood and may be from a different maintenance neighbourhood or even derivative neighbourhood.
So the problem that, for all genebanks other than ours, we have 1:1 for GID:management neighbourhood is actually rather trivial compared with the problems caused by the lax way breeders and many geneticists choose to manage their germplasm. The follow-up problem is that your ideal scenario is only ideal, and the practical reality is that it’s impossible to implement: inevitably a GID may represent anything from a single packet of seed to an entire family of genotypes that are unrelated except by virtue of having the same landrace name. ICIS has no way to record this information.
But that is beside the point of the current exchange. The point is that I must be able to identify which GID represents the root of my accessions, and I can’t (and, if I obtained the accession from USDA, I must also be able to identify which GID, if any, represents the USDA’s sample, so that I can set that GID to be the GPID2 of my sample, and if there isn’t one I have to create a new GID to represent their sample; and I can’t).
Ian:
Because I realised the problems that would arise if users used GIDs as names, I argued strongly in the early days of ICIS that the GIDs not be shown on any output from ICIS. This was not accepted, and I am glad about that, because it eventuated that the GID is very useful to all users as well as the curators of ICIS. It does however have downsides, some of which are severe. Many people will try to use them for germplasm names. A GID should only be unique for a ‘packet’ of seeds. As an example, I just requested some seed of Agatha (a spring bread wheat genetic line which is the source of the leaf rust resistance gene Lr19) from the Australian Winter Cereal Collection. This line was made in Canada as a cross between the original source of Lr19 (the Greek line Agra) and the famous Canadian cultivar Thatcher. The reason for doing this is that the DNA extracted from Agatha at CIMMYT does not carry some markers that we thought marked Lr19. We want to see if the seed in the Australian collection carries these markers or not. What is clear is that all the samples of Agatha in all the germplasm banks in the world should not carry the same GID. It is also clear that they will carry different accession IDs, both for all the different times they are accessed in the same collection (different from imported, as not all imports should be accessed) and for all the different banks in which they are accessed. It is also a requirement that the packet of seed (with its own GID and its own Accession ID – not the same) should be traceable through all accession IDs in all places from which the Agatha in AWCC got its accession Agatha. It would also be desirable if it could be traced through all other management processes, export, import, seed increase, long term and/or short term store (short term being different from normal year to year storage) that occurred in its journey.
Ruaraidh:
Absolutely agreed. And I think on the whole we do have this in IRIS – as in the case of the Azucena spreadsheet we sent you, you can see we have many GIDs corresponding to the same variety. But this is precisely the cause of the problems we face. Once you have different GIDs to represent all the different samples of Agatha, how do you determine which one represents the sample you want to know about? You can’t.
I agree that ideally GIDs shouldn’t be used as names – it’s been pragmatically necessary only because we haven’t got a naming system that functions as a useful germplasm ID system.
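The difficulty raised in this exchange, that a variety name alone cannot pick out one sample once every sample has its own GID, can be sketched as below. This is illustrative Python only: the GID values and the flat list of name rows are hypothetical, and the layout is not the actual ICIS names table.

```python
from typing import List, Tuple

# Hypothetical name rows as (gid, name value) pairs; all GIDs are invented.
NAMES: List[Tuple[int, str]] = [
    (101, "Agatha"),    # e.g. a sample held in one genebank
    (102, "Agatha"),    # e.g. a sample held in another genebank
    (103, "Agatha"),    # e.g. a breeder's working stock
    (201, "Thatcher"),
]


def gids_for_name(names: List[Tuple[int, str]], value: str) -> List[int]:
    """Return every GID that carries the given name (case-insensitive)."""
    wanted = value.lower()
    return sorted(gid for gid, nval in names if nval.lower() == wanted)


def resolve_unique(names: List[Tuple[int, str]], value: str) -> int:
    """Return the single GID for a name, or fail if the name is ambiguous."""
    matches = gids_for_name(names, value)
    if len(matches) != 1:
        raise LookupError(f"{value!r} matches {len(matches)} GIDs: {matches}")
    return matches[0]
```

With these rows, `resolve_unique(NAMES, "Thatcher")` succeeds, while `resolve_unique(NAMES, "Agatha")` raises `LookupError`. The name narrows the search to a family of GIDs, but something beyond the name is needed to say which sample each GID represents.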
Ian:
This problem is endemic to all plant breeding programs, not just germplasm banks. Clearly, germplasm banks have a legal requirement for ‘accessions’ and plant breeding programs have a legal requirement for their cultivars: the seed in the bag must possess the performance properties claimed for it when it is sold – it must contain the reproducible genetic constitution on the label. In addition the PBPs must (should) be able to trace the management history of all packets of seed in their possession. In my experience the chief breeder wouldn’t be able to find the packet of seed that is required. So I ask the breeder (organisation) if I can have it and then go to the main technician responsible, who will get it. They, rightly, say what do you want it for, as they will get more ‘pure’ seed if you want it for crossing than for an experiment. In most traditional programs the records are in spreadsheets devised by the technician, and enormous problems occur when this technician retires. The mind boggles at what happens with Monsanto and Pioneer Hi-Bred, which produce millions of packets of seed a year, let alone all their tonnes of bags of seed for seed increase and sale. But at least the system must know in this case. As an aside, the problems created at large breeding companies when they acquire other breeding companies would be interesting to deal with. Monsanto acquired about 10 long-standing companies in the 1990s.
Ruaraidh:
Yes.
Ian:
Together with the original developers of ICIS, we devised the management methods in anticipation of developing methods for dealing with these problems. (Comment on import methods later.)
A comment on methods.
There are three types: Generative, Derivative, Management. As a first cut at understanding them, the first two reflect genetics and reproductive biology but the last does not. In all cases, because it was always open for the methods (the ones we suggested in the first effort) to be modified by attributes and/or by adding new methods, it is totally appropriate to add new ones. My first reading of your emails leads me to think that adding your new import management methods is highly desirable and will, as you have suggested, solve many (if not all) of your problems. I will need to examine and think about all the detail of your emails to get a full understanding of whether it will solve them all.
Ruaraidh:
“Management” or “Maintenance”? I always find these terms confusing. I thought it’s Generative, Derivative and Maintenance, all three reflecting genetics, in the last case following methods that seek to maintain genetic composition unchanged. Management implies more than just the biology of seeking to maintain genetic composition unchanged – it also implies doing this under a single germplasm management regime. IRGC 328 here in IRRI and IRGC 328 in USDA are in different management neighbourhoods because they are managed by different genebanks, but the same maintenance neighbourhood because they are linked only by methods that seek to maintain genetic composition unchanged.
Ian:
I came into the methods discussion at the beginning of the change from IWIS to ICIS, because the original developers (CIMMYT and IRRI; rice and wheat are inbreeders) were having trouble talking to maize people (outbreeder) and catering for the (population) crossing methods (many parents on one or both sides of the cross). What was required was a lecturer in plant breeding with experience in on the ground breeding, seed management and experience in dealing with data, and in this case the pedigrees arising from breeding (and germplasm bank) programs. By default this turned out to be me and it took me (and the team) 6 months to come up with what I did (in which time I pestered a lot of people). In doing so I made a lot of compromises and decisions. As it turns out, one of my bad decisions was the import protocol. My first bet was to break this up into the different types of material and/or organisation from which the material was imported. You can see that we stuck to the idea of different methods for collections. If you look into those there are a number of methods for collection for the three different breeding methods: inbreeders, outbreeders and clonally propagated. I guess that having no knowledge of running germplasm banks, only of using them, made me recommend that we compromise with only one generic import method. Looks like a bad decision in retrospect.
Ruaraidh:
On the collecting methods: you may have noticed that these are among the ones I question. From the descriptions of these in the METHODS table, it sounds as though they are intended to be used as methods of genebank accessions that originated in samples collected from a field or market place. I think that is not compatible with the concept you outlined above of having a different GID for each sample. The sample collected from the field or market place should have its own GID, and it would have method=collected with GLOCN=collecting location, because that is how it was created. The genebank accession was not itself created by being collected – it merely originates in a collected sample. The GID representing the (root of the management neighbourhood of the) genebank accession must always have the method “accessed into genebank” and GLOCN=genebank, regardless of the method used to create the original sample from which the genebank accession was obtained. GPID1 is used to document the origin of a sample. So in the case of accessions originating in collected samples, the GID representing the accession has a GPID1 pointing back to the GID representing the collected sample. That’s how we are implementing it, and it seems to fit the ICIS concept perfectly.
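The two-record arrangement described above can be sketched as follows. This is a hedged illustration in Python: METHN, GLOCN and GPID1 are the field names used in the discussion, while the method labels, the way new GIDs are allocated, and the helper name `register_collected_accession` are assumptions rather than ICIS behaviour.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Germplasm:
    gid: int
    methn: str                   # how this sample was created
    glocn: str                   # where it was created
    gpid1: Optional[int] = None  # origin of the sample, if recorded


def register_collected_accession(
    records: Dict[int, Germplasm], collecting_location: str, genebank: str
) -> Tuple[int, int]:
    """Create one GID for the collected sample and one for the accession."""
    # Assumption: new GIDs are simply the next unused integers.
    sample_gid = max(records, default=0) + 1
    records[sample_gid] = Germplasm(sample_gid, "collected", collecting_location)

    # The accession is not itself "collected"; it is accessed into the
    # genebank, and GPID1 points back to the collected sample.
    accession_gid = sample_gid + 1
    records[accession_gid] = Germplasm(
        accession_gid, "accessed into genebank", genebank, gpid1=sample_gid
    )
    return sample_gid, accession_gid


records: Dict[int, Germplasm] = {}
sample, accession = register_collected_accession(
    records, "farmer's field (hypothetical)", "genebank"
)
```

Keeping the collected sample and the accession as separate records means the accession always carries the same method and location, whatever the origin of the material; only the record that GPID1 points to changes.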
Ian:
My original idea was to define crossing (now generative methods to account for mutations and GMOs) and selection (now derivative to accommodate haploidy and polyploidy) methods (which are both inherited from the old IWIS) without reference to reproductive system. It turns out that all of the methods (which will occur in different frequencies, or some not at all, in other reproductive systems) will have different genetic consequences in the different reproductive systems. In consequence it was decided to classify the methods into the three different classes. I think experience has shown this to be wise even though it makes providing an ontology a difficult business.
Ruaraidh:
I am still not sure about this. Yes the genetic consequences are different in different reproductive systems, but that difference arises because of differences in the properties of the parents, not because of differences in the method. Both data documentation and data analysis should take this into account. And then what do you do about crosses between outbreeders and inbreeders? And what about parents with intermediate outcrossing rates? And what about unknown outcrossing rates? And what about quantitative differences between parents in their outcrossing rates? And see the problems you got into with methods for interspecific crosses.... All problems arising because of attempting to incorporate properties of the parents into specification of the method.
Ian:
We devised the management methods as a device to accommodate all the problems of seed increase, naming cultivars and lines, seed transfer, seed storage and the like described above. We realised, of course, that these methods do (or could) have genetic consequences and so we decided to define these methods as ones which are designed to maintain genetic structure of the population being manipulated as compared to increasing genetic variability (generative methods) or decreasing it (derivative methods). At the time that the concepts behind ICIS were being developed Paul Fox was developing the concept of ‘data islands’. Because data was attached to different names all over the world for the ‘same germplasm’, data could not be integrated. While the GID is the ultimate in unique identifiers, it turns out that data needs to be integrated at the level of germplasm ‘name’: your accession or collection, the breeder’s cultivar or breeding line or population, or the geneticist’s genetic stock.
Ruaraidh:
“The GID is the ultimate in unique identifiers”? Well ...
a) It’s unique only locally within the context of the ICIS implementation, not beyond, and for what GIDs are intended to represent. That’s the same with an accession ID – it’s unique within the context of our genebank database, but not beyond; and it uniquely identifies what the accession is intended to represent, not different packets of seed of an accession. Once you go beyond the context of the ICIS implementation, the GID is itself just another name that has been added to the germplasm, just like the accession ID.
b) It is a unique ID for a record in a table. If you can’t tell which germplasm sample the record refers to, it is not a germplasm ID at all – and that is the fundamental issue we need to resolve.
Richard Bruskiewich:
My short assessment is that, in principle, as was suggested, the Germplasm method controlled vocabulary, if adequately reviewed and enhanced, ought to adequately codify “GIDrepresents”, at least, from the fundamental perspective of semantic classification of germplasm.
And, as you suggest, each “type” of germplasm (as codified by an enhanced Method CV) ought to have its documentation requirements well described, to adequately guide data encoders in curating germplasm records of each specific type. I also suspect that the movement of ICIS toward a more formal ontology management subsystem, with a properly curated ontology, should also help.
I think, though, that the enormity of the data curation and data quality challenge for germplasm in ICIS is clear. How we meet this challenge, sociologically, is the key question to be answered (e.g. perhaps within GRiSP and perhaps, the GCP cross-cutting scientific resource effort).
Shawn Yates:
Refining the methods over creating a new field in the GERMPLSM table is a great idea. It will hopefully save major revisions to the coding in the ICIS32.DLL. I am CCing Bibiana at CIMMYT, Casper at Nunhems, and Ian DeLacy for their input as well. My notion is that all methods and name types were added by someone for some reason, so I would like to get as much input as we can before making universal changes to those tables.

