What does a GID represent

From ICISWiki


The problem

A GID represents a sample of germplasm – but which one? It should be obvious that this is a really fundamental issue: we absolutely must be able to know unambiguously what germplasm corresponds to what record in the database. Otherwise it’s pointless keeping the database. Yet this is often difficult or even impossible in ICIS databases.


Consider the name “IRGC 328”. It is an accession ID, assigned by the rice genebank at IRRI to identify one of its accessions. But a search in IRIS for “IRGC 328” returns many GIDs with the same name. Which one represents the IRRI accession, and which ones represent copies of that accession held by others? Even the holder of the germplasm has difficulty knowing which GIDs represent its own samples. In fact, at the moment the only way a user can unambiguously identify which GIDs represent his/her own germplasm is to keep a look-up table (LUT) in an independent database: a list of the user’s germplasm samples in one column with the corresponding GID in a separate column. This constitutes a serious problem with ICIS.

The difficulty is even greater when a user needs to know about GIDs representing samples that are not under that user’s management – an independent LUT managed by the user is clearly impossible. This is one of the major reasons for the presence of so many errors - users choose the wrong GID (e.g. assigning incorrect values to GPID1 and GPID2 to their own GIDs) because they can do little more than guess at what those GIDs represent.

The semantics of determining what a GID represents

“IRGC 328” would be a unique ID in a database containing only data on IRRI’s genebank accessions, but not in a database containing data on other germplasm as well. It needs something else to identify it as an accession from IRRI’s genebank. Since ICIS databases also contain data on many other types of germplasm, not only genebank accessions, it also needs something else to identify it as an accession (not, e.g. a sample collected from a farmer’s field, an entry in an international nursery, etc).


This leads to a three-level identification:

Level 1
Data: GIDRepresents = accession -> This is an accession
Notes: Currently there is no way to record this information in ICIS, so we cannot know which GIDs represent accessions and which represent other types of germplasm.

Level 2
Data: Add GLOCN=9016=GRC -> This is an accession from IRRI’s genebank
Notes: Progress was made by tightening the definition of GLOCN, to be the location of the germplasm sample at the moment of its conceptual birth as a new unit of management requiring a new GID. For a genebank accession, it is the location of the genebank that manages the accession.
(However, it remains problematic for historical data, because the previous ambiguous definition of GLOCN resulted in inconsistent usage: sometimes the location of the genebank, sometimes the location of the donor of the accession, sometimes the location at which it was originally collected. Historical data in IRIS therefore contain many errors in GLOCN, so we cannot use it reliably to determine the holder/owner of a germplasm sample.)

Level 3
Data: Add a name with NVAL=IRGC 328 -> This is the accession identified by the IRRI genebank as IRGC 328
Notes: Progress was made by establishing NSTAT=8=preferred ID as a flag indicating that the name was assigned by the germplasm holder to identify the sample uniquely within the holder’s collection, distinguishing it from all other samples held by the holder.

Note that NSTAT=8 doesn’t work for breeders’ lines which are named only with a single unique ID assigned by the breeder. In these cases, to follow ICIS rules we have to assign NSTAT=1=preferred name, even though the name also serves as a unique ID.

Note also that not all collection holders assign their own unique sample IDs to their own germplasm.
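
The three levels above can be read as a composite key. The following is a minimal sketch in Python, assuming a hypothetical in-memory record type: the field `represents` stands in for the GIDRepresents concept that ICIS cannot currently record, and the GIDs and the location ID 4321 are invented for illustration. Only GLOCN=9016 and the name IRGC 328 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    gid: int         # invented GID, for illustration only
    represents: str  # level 1: type of germplasm (hypothetical field)
    glocn: int       # level 2: location of the holder
    nval: str        # level 3: ID assigned by the holder


SAMPLES = [
    Sample(1001, "accession", 9016, "IRGC 328"),  # the IRRI genebank accession
    Sample(1002, "copy", 9016, "IRGC 328"),       # a copy recorded with the same GLOCN
    Sample(1003, "copy", 4321, "IRGC 328"),       # a copy held at an invented location
]


def gids_by_name(nval):
    """Search on the name alone, as a user can do today."""
    return [s.gid for s in SAMPLES if s.nval == nval]


def gids_by_key(represents, glocn, nval):
    """Search on the full three-level key."""
    key = (represents, glocn, nval)
    return [s.gid for s in SAMPLES if (s.represents, s.glocn, s.nval) == key]


print(gids_by_name("IRGC 328"))                    # three candidate GIDs
print(gids_by_key("accession", 9016, "IRGC 328"))  # only the accession itself
```

The name alone matches every copy, while the three-level key singles out one GID. This is the property the identification scheme is meant to provide.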

The current situation


The major extant difficulty is clearly in categorizing the type of germplasm – the top level in the 3-level identification above. To get round the problem, we may use a range of clues in the data. Here are some examples:

GID represents: Criteria to recognize

Collected sample
  Criterion: (METHN=69) and (GLOCN=LOCID in LOCATION where LTYPE ≤ 409) (currently 0 GIDs in IRIS!!!)
  Inconsistent: (METHN≠69) and (LTYPE=409) (=Germplasm collection site) (currently 29479 GIDs in IRIS)
  Inconsistent: (METHN=69) and (LTYPE>409) (currently 13741 GIDs in IRIS)
Accession in genebank
  Criterion: (METHN=62) and (GLOCN=LOCID in INSTITUT) and (GID=GID in NAMES where NTYPE=1 and NSTAT=8 and NLOCN=GLOCN)
  Inconsistent: (GID=GID in NAMES where NTYPE=1 and (NSTAT≠8 or NLOCN≠GLOCN))
  Inconsistent: (METHN≠62) and (GID=GID in NAMES where NTYPE=1 and NSTAT=8 and NLOCN=GLOCN)
  Inconsistent: (GLOCN≠LOCID in INSTITUT, i.e. GLOCN points to a LOCID that does not have a record in INSTITUT) and (GID=GID in NAMES where NTYPE=1)
Generation of accession
  Criterion: (GID=GID in NAMES where NTYPE=21) and (MGID=GID in GIDREP where REPRESENTS=“ACC”) and (METHN is MID in METHODS where MNAME like “*increase*”)
  Inconsistent: (GID=GID in NAMES where NTYPE=21) and (MGID=0)
  Inconsistent: (GID=GID in NAMES where NTYPE=21) and (MGID=GID in GIDREP where REPRESENTS≠“ACC”)
  Inconsistent: (GID=GID in NAMES where NTYPE=21) and (METHN is MID in METHODS where MNAME not like “*increase*”)
Cross
  Criterion: (METHN=MID in METHODS where (MTYPE=“GEN” and (NGPGS>1 or (MNAME like “*cross*” or MDESC like “*cross*”)))) and (GLOCN=LOCID in LOCATION where LTYPE in [405..408, 410..412]) and
  Inconsistent: (METHN=MID in METHODS where (MTYPE=“GEN” and (NGPGS>1 or (MNAME like “*cross*” or MDESC like “*cross*”)))) and (GLOCN=LOCID in LOCATION where LTYPE not in [405..408, 410..412])
Breeder's own sample
  Criterion: (METHN=MID in METHODS where MTYPE≠“GEN”) and (GLOCN=LOCID in INSTITUT) and (GID=GID in NAMES where NTYPE=5 and NSTAT=1 and NLOCN=GLOCN) (Currently necessarily 0 GIDs, because INSTITUT does not have a LOCID field)
Test sample
  GID represents a sample maintained at GLOCN for testing in nurseries.
Official release of a cultivar
  GID=GID in NAMES where NTYPE=4 and NSTAT=1
Copy of a sample
  Copy of an accession, breeder's material, variety, test line etc. being managed informally outside of a genebank or test system, away from the original developer
Inconsistent data
  (METHN=MID in METHODS where MTYPE≠“GEN”) and (GNPGS>0)
  Other inconsistent cases listed above
Sample of uncertain status
  The status of the sample is uncertain: GLOCN does not point to a collecting location or to an organization, so it is none of the other categories
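
As an illustration of how such clues could be applied, here is a minimal sketch in Python covering only the three “Collected sample” rows. It assumes that the LTYPE of the location referenced by GLOCN has already been looked up; the function name and the sample rows are invented, and the other categories would need similar functions.

```python
COLLECTED_METHN = 69  # METHN used for collected samples in the table above
SITE_LTYPE = 409      # LTYPE of a germplasm collection site


def classify_collected(methn, ltype):
    """Classify one GID against the 'Collected sample' rows.

    `ltype` is the LTYPE of the LOCATION record that the GID's GLOCN points to.
    Returns "collected sample", "inconsistent", or None when no row applies.
    """
    if methn == COLLECTED_METHN and ltype <= SITE_LTYPE:
        return "collected sample"
    if methn != COLLECTED_METHN and ltype == SITE_LTYPE:
        return "inconsistent"
    if methn == COLLECTED_METHN and ltype > SITE_LTYPE:
        return "inconsistent"
    return None


# Invented (GID, METHN, LTYPE) rows, for illustration only.
ROWS = [(1, 69, 409), (2, 62, 409), (3, 69, 410), (4, 62, 410)]
RESULT = {gid: classify_collected(methn, ltype) for gid, methn, ltype in ROWS}
print(RESULT)
```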

 

The process of guessing what a GID represents is further formalised in SetInc.doc, a proposal written after the 2010 ICIS developers’ workshop to help users identify GIDs correctly in their search for candidate source GIDs for new germplasm.
Note that many of the clues suggesting what a GID represents come from NAMES data – and yet what a GID represents is a property of the GID, not of its names. The direction of relationship should be the other way round – knowing what a GID represents, we can define naming rules. But because we don’t record what a GID represents we have to work backwards, using NAMES data to try to work out what a GID represents.


A further complication arises because of the large number of errors in the NAMES data that need to be used as clues. These errors occur because of the absence of adequate data quality controls in ICIS, and because of inadequate definitions of associated NAMES data. Recent improvements in the definitions of NLOCN and NDATE have helped, but the definitions of NTYPE and NSTAT remain problematic.

The way forward – refining methods?

As noted by Shawn Yates, the “GIDRepresents” concept above is more or less embedded in methods. Perhaps if we refine methods appropriately we can eliminate the need for an independent typology of what the GID represents. This pre-supposes that describing how a sample was created adequately describes what type of germplasm it is. For example:


Method of creating a germplasm sample | What type of germplasm the sample represents | Unique ID
Collect from field or market place | A sample collected from field or market place | Should have a collector’s sample ID
Access into a genebank | An accession in a genebank | Should have an accession ID, e.g. IRGC 328
Entry in international nursery | An entry in an international nursery | Should have an entry ID, e.g. IRTP 5555
Copy in working collection | A sample in a working collection | If in an informal working collection, typically the holder does not assign a unique ID


With methods defined this way, it becomes easy to run a quick search of only genebank accessions or only international nursery entries.
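
A minimal sketch of such a search, assuming the germplasm records are available as simple Python dictionaries. The method IDs 70, 72 and 74 follow the refined methods proposed later on this page; the GIDs and the function name are invented.

```python
ACCESSION_INTO_GENEBANK = 70
ENTRY_IN_INTERNATIONAL_NURSERY = 72

# Invented germplasm rows, for illustration only.
GERMPLASM = [
    {"gid": 1, "methn": 70},
    {"gid": 2, "methn": 72},
    {"gid": 3, "methn": 74},
    {"gid": 4, "methn": 70},
]


def gids_created_by(rows, methn):
    """Return the GIDs whose METHN equals the given method ID."""
    return [row["gid"] for row in rows if row["methn"] == methn]


print(gids_created_by(GERMPLASM, ACCESSION_INTO_GENEBANK))         # genebank accessions only
print(gids_created_by(GERMPLASM, ENTRY_IN_INTERNATIONAL_NURSERY))  # nursery entries only
```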


The methods also have other problems that need cleaning up:

  • Methods mixes up method with breeding system; while it is true that breeding and propagation methods do depend on the breeding system, where the same method is used for both it shouldn’t be artificially divided into two methods – for example there shouldn’t be three methods for “unknown generative method” (MID 1, 2, 3), which differ only by the nature of the germplasm samples, not the method.
  • Methods mixes up method with species. There shouldn’t be a method for interspecific cross (MID 109). The fact that it is interspecific should come from the species attributes of the male and female parents
  • Methods mixes up method with intended future methods. “Recessive backcross” differs from “backcross” only in the intent to follow up the backcross with selfing – that’s not a different method.
  • Methods mixes up method with project. IRIS has a set of methods for the Upland Perennial Rice Project. This association with project should be through GLOCN of the GIDs involved, not the method.

The way forward – refining NTYPE and NSTAT definitions?

During a teleconference call, Shawn Yates observed that we don’t have NTYPE and NSTAT well defined/documented. True and a problem. A large proportion of the errors we encounter arise because of inadequate definitions, so that different users of the same database have different perceptions of what is right or wrong and we end up with conflicting data entry from different people.

NTYPE ontology is defined in UDFLDS. Based on definitions in IRIS, Ruaraidh Sackville Hamilton proposed refinements in NTYPE-IRIS.xls. Note that the table includes a column on correct usage. This is not available in UDFLDS but it should form part of the definition, clarifying to users how to use the NTYPE. It should at least be documented in the TDM.

Shawn Yates provided the corresponding UDFLDS fields from IWIS3 in NTYPE-IWIS3.xls


NSTAT ontology is briefly specified in the ICIS TDM. Additional proposed definitions are in NSTAT-IRIS.xls.

Why is NSTAT not documented in UDFLDS?

NSTAT appears not well conceptualised. It encompasses an unsatisfactory mix of concepts such as language and usage of the name. Hopefully ICIS 6 will handle it better!

The way forward – data validation rules?


Method-specific rules are needed for names / NTYPE / NSTAT / NLOCN / NDATE / GLOCN / GDATE.
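
A minimal sketch of what one such rule could look like, written in Python. It encodes only the two NTYPE/METHN pairs stated in the data curation section of this page (an IRTP number, NTYPE=11, requires METHN=72; a variety release name, NTYPE=4, requires METHN=326). The function name is invented, and a real rule set would also have to cover NSTAT, NLOCN, NDATE, GLOCN and GDATE.

```python
# NTYPE -> METHN that the GID must have before a name of that type is added.
REQUIRED_METHN = {
    11: 72,   # IRTP number requires METHN=72 (entry in international nursery)
    4: 326,   # variety release name requires METHN=326
}


def name_type_allowed(ntype, methn):
    """Return True if a name of type `ntype` may be added to a GID with `methn`.

    Name types without a rule in REQUIRED_METHN are always allowed here.
    """
    required = REQUIRED_METHN.get(ntype)
    return required is None or required == methn


print(name_type_allowed(11, 72))   # IRTP number on a nursery entry
print(name_type_allowed(11, 62))   # IRTP number on a GID with another method
```

The rule for NTYPE=4 also requires GLOCN to be the country of release, which this sketch leaves out.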

The scope of a GID

From one perspective, ideally a GID represents a packet of seeds (in the generic sense, as it may represent a clone) that exists now or did exist sometime in the past. If you wouldn’t mix the seeds from different packets then each should get a new GID. If this is the case, then each of your accessions will, over time, acquire many (in the end 100s of) GIDs. This is the base concept in ICIS and means absolutely that Accession and GID cannot have a 1:1 relationship. What is essential is that GIDs have a many-to-one relationship with Accession IDs: many GIDs may refer back to one Accession ID, but a GID cannot refer back to more than one Accession ID. This is what Graham refers to as a management neighbourhood. This is the same concept that each (derivative) GID can only refer back to one generative GID (a generative neighbourhood).
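
The many-to-one constraint can be checked mechanically. The following Python sketch assumes the GID-to-accession links are available as simple pairs; the GIDs, the accession ID “IRGC 329” and the function name are invented for illustration.

```python
from collections import defaultdict


def gids_with_several_accessions(links):
    """Return GIDs that refer back to more than one accession ID.

    `links` is an iterable of (gid, accession_id) pairs.
    """
    seen = defaultdict(set)
    for gid, accession_id in links:
        seen[gid].add(accession_id)
    return sorted(gid for gid, accessions in seen.items() if len(accessions) > 1)


# Invented data: GIDs 1-3 all belong to IRGC 328, GID 4 breaks the rule.
LINKS = [(1, "IRGC 328"), (2, "IRGC 328"), (3, "IRGC 328"),
         (4, "IRGC 328"), (4, "IRGC 329")]
print(gids_with_several_accessions(LINKS))
```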

However, often this may be impossible. Information available on accessions at other genebanks is almost invariably restricted to accession-level information, with no data on how the accession had been managed at the other genebank: in these cases inevitably Accession and GID have to have a 1:1 relationship.

Even if possible, from other perspectives it may be considered undesirable. Breeders and users interested in pedigrees typically don't want to see information on the management of breeding lines after a few generations of selection, and may want to manage their breeding lines by some other system than ICIS.

Worse, often a single GID may be used to refer at least to multiple management neighbourhoods if not even more broadly: 

  • Very often breeders and other researchers refer to their samples by variety name, which is far more problematic than referencing the accession ID, especially for landraces. At least when referring to an accession ID you know that you are referring to one management neighbourhood. With a variety name you’ve no idea what has happened genetically since the variety was released or how many different management neighbourhoods it’s been through – so the same GID represents a family of samples that has at the least been through multiple management neighbourhoods. They could even have come from different maintenance or derivative neighbourhoods (like the Californian rice variety Calrose – two very different genetic entities released under the same name in the same state)
  • Often they don’t change the name or GID even if they’ve knowingly put it through cycles of selection to change it or to create a pure line – so the same GID may refer to a family of samples from different maintenance neighbourhoods.
  • Often a single GID is used to refer to an entire landrace, even though it may have been put through many centuries or millennia of various unknown generative, derivative and maintenance methods. Here one GID refers to multiple derivative neighbourhoods and even to multiple generative neighbourhoods.
  • When seed stocks run out, some breeders even replenish stocks from an independent source without changing GID, even though that independent source is from a different management neighbourhood and may be from a different maintenance neighbourhood or even derivative neighbourhood.


Thus inevitably a GID may represent anything from a single packet of seed to an entire family of genotypes that are unrelated except by virtue of having the same landrace name. ICIS currently has no way to record this information.

A solution for ICIS 5.5

Identifying what a GID represents

A three-level identification will be adopted to identify what each GID represents, i.e.

  • Level 1: the method of germplasm creation (METHN of the GID) defines what type of germplasm sample is represented by the GID. This means methods must be defined so precisely that defining how a germplasm sample was created is sufficient to identify what type of germplasm it is.
  • Level 2: the location where the germplasm was created (GLOCN of the GID) defines who has/had responsibility for the sample. As previously established, this means we must not confuse the location of a GID with the location of its source or origin. Where it was created must be the same as where it was/is managed. It also means we have to define locations in as much detail as we need to distinguish between different responsible organizations. For example, to distinguish between samples under the management of different teams within IRRI, we must have a different location defined for each team.
  • Level 3: an ID that identifies the germplasm uniquely at least within the scope of the given method and location. This may be specified either (in the case of teams that assign their own IDs) through a name with NSTAT=8 and NVAL assigned by the holder or (in the case of teams that don’t do that) by the unique ICIS identifier Database-user-lgid. The implication of this is that a GID must have no more than 1 name with NSTAT=8, and the combination METHN-GLOCN-NVAL must be unique for a name with NSTAT=8.
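
The two implications of level 3 can be expressed as a simple check. This is a minimal sketch in Python, assuming germplasm and names are held in plain in-memory structures. Only NSTAT=8, GLOCN=9016 and the name IRGC 328 come from the text; the GIDs, the other name values and the function name are invented.

```python
from collections import Counter

PREFERRED_ID = 8  # NSTAT value flagging the holder's unique ID


def check_preferred_ids(germplasm, names):
    """Check the two level-3 constraints.

    `germplasm` maps gid -> (methn, glocn); `names` is a list of
    (gid, nstat, nval) tuples. Returns a pair:
    (GIDs with more than one NSTAT=8 name,
     (METHN, GLOCN, NVAL) keys shared by more than one NSTAT=8 name).
    """
    ids = [(gid, nval) for gid, nstat, nval in names if nstat == PREFERRED_ID]
    per_gid = Counter(gid for gid, _ in ids)
    multiple = sorted(gid for gid, n in per_gid.items() if n > 1)
    keys = Counter(germplasm[gid] + (nval,) for gid, nval in ids)
    duplicated = sorted(key for key, n in keys.items() if n > 1)
    return multiple, duplicated


# Invented data, for illustration only.
GERMPLASM = {1: (70, 9016), 2: (70, 9016), 3: (74, 9016)}
NAMES = [
    (1, 8, "IRGC 328"),
    (2, 8, "IRGC 328"),   # same METHN-GLOCN-NVAL as GID 1: a duplicate key
    (3, 8, "IRGC 328"),   # different METHN, so not a duplicate key
    (3, 8, "X 1"),        # second NSTAT=8 name on GID 3
    (3, 1, "Some name"),  # NSTAT=1 names are ignored by the check
]
MULTIPLE, DUPLICATED = check_preferred_ids(GERMPLASM, NAMES)
print(MULTIPLE, DUPLICATED)
```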

Refining methods

The following management methods are edited or introduced:


MID | MNAME | MDESC
69 | Collected sample | GID represents a sample collected from in situ conditions (field, market).
70 | Accession into genebank | GID represents a sample accessed into a formal genebank collection
71 | Accession into historical genebank | GID represents a sample that was accessed into a genebank that has since been closed
72 | Entry in international nursery | GID represents a sample that was acquired to be systematically maintained for distribution and evaluation in international nurseries
73 | Entry in national nursery | GID represents a sample that was acquired to be systematically maintained for distribution and evaluation in national nurseries
74 | Copy in working collection | GID represents a sample that was included in a working collection. It may be a formal or informal collection, maintained or not maintained, available or not available for distribution, with or without a name assigned by the collection manager to serve as a unique ID for the sample
80 | Unknown | Nothing is known about the ancestry of this GID. It represents the earliest documented ancestor of other GIDs, an historical sample used only to document ancestry of other GIDs
81 | Indirect | The link from this sample to its parent may be indirect: there may be a number of intermediate samples between them, and the true link may be through other GIDs in the neighbourhood, but there is no specific evidence to link them
260 | Component of mixture | The sample is one component selected from a landrace or other sample that is a mixture of components



Methods 70-74 represent refinements of the uninformative method 62 “IMPORT”

Method 80 is to replace 31 for GIDs with GPID1=GPID2=0, distinguishing a sample whose prior history is completely unknown from a sample (still method 31) that is known to have been selected by some unknown method from a known or unknown GPID1.
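
The replacement rule for method 80 can be sketched as a small Python function. This is an illustration of the rule as stated, not an existing ICIS routine; the function name is invented.

```python
METHOD_31 = 31  # existing method currently used for both cases
METHOD_80 = 80  # proposed method: nothing known about the ancestry


def refine_method_31(methn, gpid1, gpid2):
    """Return the METHN a GID should carry after the refinement.

    Only GIDs with METHN=31 and GPID1=GPID2=0 are moved to method 80;
    every other GID keeps its current method.
    """
    if methn == METHOD_31 and gpid1 == 0 and gpid2 == 0:
        return METHOD_80
    return methn


print(refine_method_31(31, 0, 0))    # no known history: becomes 80
print(refine_method_31(31, 555, 0))  # selected from a known GPID1: stays 31
```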

Method 81 is typically used to connect a GID into a maintenance neighbourhood when its precise source is unknown but the data indicate it is part of the neighbourhood. (If GPID1>0, GPID2=0, the GID will be treated as the root of a separate maintenance neighbourhood within the derivative neighbourhood of GPID1)

Method 260 is needed for landraces, which are commonly variable, sometimes a mixture of distinct components. One common method of managing them in genebanks is to divide the original landrace into its components and maintain them as separate accessions. Splitting into components may be done at any time, on the original collected sample or subsequently during management of an accession

Data curation issues

There are many GIDs in IRIS that don’t meet these conditions – there is a need to correct their data now that we have better agreement on what is right and what is wrong.

Never add a name to a GID expecting the name type to provide the level 1 identification of the type of germplasm. For example,

  • Don’t add an IRTP number (NTYPE=11) to represent an INGER entry unless the GID has METHN=72
  • Don’t add a variety release name (NTYPE=4) unless the GID has METHN=326 and GLOCN=the country of release.

This approach mixes up the biology and management of germplasm. This means additional GIDs may need to be created to record both the biology and the management. For example:

  • IRTP 22270, an entry in the INGER international rice nurseries, is a selection IR 31375-3-2-2-3 from IR 31375-3-2-2. Currently its source is shown as IR 31375-3-2-2. This is wrong.
    • IRTP 22270 should have METHN=72 and GLOCN=INGER to show that it represents an entry added to INGER. Its source should be a new GID, IR 31375-3-2-2-3
    • The new GID for IR 31375-3-2-2-3 should have a derivative method to show it is a selection from IR 31375-3-2-2, with GLOCN=the location of the breeder.
  • PSB RC 54, a variety released in the Philippines, is a selection IR 60819-34-2-1 from IR 60819-34-2, which is currently shown as its source. This is wrong.
    • PSB RC 54 should have METHN=326 and GLOCN=Philippines to show it represents a sample officially released as a variety in the Philippines. Its source should be a new GID IR 60819-34-2-1.
    • The new GID should have a derivative method to show it is a selection from IR 60819-34-2, with GLOCN=the location of the breeder.
  • CIOR 12042, an accession in the US National Small Grains Collection (NSGC), is a selection CI 1962-2 from CI 1962, also an accession in the NSGC, which is currently shown as its source. This is wrong.
    • CIOR 12042 should have METHN=70 and GLOCN=NSGC to show it represents an accession at NSGC. Its source should be a new GID for CI 1962-2.
    • The new GID should have a derivative method to show it is a selection from CI 1962, with GLOCN=the location of the breeder (NSGC).
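
The correction pattern shared by these three examples (insert a new GID for the biological selection, and leave the existing GID to record the management event) can be sketched in Python as follows. This is an illustration under stated assumptions: the single `source` field is a simplification of the GPID fields, and all numeric IDs are placeholders except METHN=72, which is the method proposed on this page.

```python
ENTRY_IN_INTERNATIONAL_NURSERY = 72
DERIVATIVE_METHN = 9001   # placeholder for a derivative (selection) method
INGER_GLOCN = 9002        # placeholder location ID for INGER
BREEDER_GLOCN = 9003      # placeholder location ID for the breeder


def split_management_from_biology(germplasm, gid, new_gid, mgmt_methn, mgmt_glocn,
                                  deriv_methn, breeder_glocn):
    """Insert `new_gid` between `gid` and its current source.

    The new GID records the biology (a selection from the old source); the
    existing GID is left recording only the management event.
    """
    if new_gid in germplasm:
        raise ValueError("new_gid already exists")
    old_source = germplasm[gid]["source"]
    germplasm[new_gid] = {"methn": deriv_methn, "glocn": breeder_glocn,
                          "source": old_source}
    germplasm[gid] = {"methn": mgmt_methn, "glocn": mgmt_glocn, "source": new_gid}


# GID 10 stands for IR 31375-3-2-2 and GID 20 for IRTP 22270 (invented GIDs).
GERMPLASM = {
    10: {"methn": DERIVATIVE_METHN, "glocn": BREEDER_GLOCN, "source": 0},
    20: {"methn": DERIVATIVE_METHN, "glocn": BREEDER_GLOCN, "source": 10},
}
# GID 30 becomes the new record for IR 31375-3-2-2-3.
split_management_from_biology(GERMPLASM, 20, 30, ENTRY_IN_INTERNATIONAL_NURSERY,
                              INGER_GLOCN, DERIVATIVE_METHN, BREEDER_GLOCN)
print(GERMPLASM[20], GERMPLASM[30])
```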



Conclusion

If we clean up methods so that they document what each GID represents based on how/why the GID was created, and clean up definitions of NTYPE and NSTAT, perhaps we don’t need any restructuring of ICIS data design.
