Loading Isozyme Data into ICIS
Introduction
The example dataset is a large set of isozyme data. There are 30967 samples covering 27653 distinct germplasm entities (GIDs), which were tested at 22 isozyme loci represented by a total of 85 alleles. The data are compiled from a number of experiments (suppliers).
Raw results are given in table ISO. Field ISO.order-merge contains the sample number. Many columns of passport data are hidden, but two important fields are ORIGIN, which contains location IDs for the origin of the germplasm, and SUPPLIER, which indicates which experiment produced the data.
For each locus there is a column named with the locus name, containing either a two-digit genotype 'nm' or one of the text values ND, NV, MX, HT. 'nm' indicates that the genotype has allele n and allele m at the locus; ND means that the locus was not tested in that sample; NV means that the results could not be read; MX indicates a mixed response (possibly heterozygote); HT indicates two alleles (probably heterozygote).
Then, for each locus, there is one column for each allele, named with the locus name followed by '_n' where n = 0, 1, 2, ... These columns contain 1 or 0 indicating presence or absence of the allele, -1 for NV loci, and -99 for ND loci.
Essentially there are two effects (data sets indexed by different factors) in this table: the sample by allele effect and the sample by locus effect.
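The coding of the locus columns described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_locus_value` and the `STATUS_CODES` table are hypothetical and are not part of ICIS, and the sketch assumes each digit of a genotype 'nm' names a single allele.

```python
# Hypothetical helper for illustration; these names are not part of ICIS.
STATUS_CODES = {
    "ND": "locus not tested in this sample",
    "NV": "result could not be read",
    "MX": "mixed response (possibly heterozygote)",
    "HT": "two alleles (probably heterozygote)",
}


def decode_locus_value(value):
    """Classify one locus-column value.

    Returns ("status", code) for ND, NV, MX or HT, and
    ("genotype", (n, m)) for a two-digit genotype 'nm'.
    """
    text = str(value).strip().upper()
    if text in STATUS_CODES:
        return ("status", text)
    if len(text) == 2 and text.isascii() and text.isdigit():
        # Assumption: each digit is one allele number at the locus.
        return ("genotype", (int(text[0]), int(text[1])))
    raise ValueError(f"unrecognised locus value: {value!r}")
```

Anything that is neither a known code nor a two-digit string raises an error, so unexpected values surface at load time instead of being stored silently.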
Since this data can be stored in many ways, some technical decisions have to be made based on biology and use cases. The two effects in ISO contain equivalent information, so which way should we load it? Some information, such as the genotype 'nm' in the locus fields, is probably better managed as allele frequencies (0, 0.5, 1) in the allele fields. Other information, such as ND or NV, is better managed in the locus fields. Then there are the passport data fields. Some of these, such as germplasm name, are better managed in the GMS. Even the GID could be managed in a germplasm list. Alternatively, these can be labels of the primary factor, sample number.
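To make the idea of allele frequencies (0, 0.5, 1) concrete, here is a minimal sketch of how a two-digit genotype could be turned into per-allele frequencies. It assumes a diploid sample in which each digit of 'nm' contributes half of the genotype; the function name `allele_frequencies` and the `n_alleles` parameter are hypothetical.

```python
def allele_frequencies(genotype, n_alleles):
    """Convert a two-digit genotype 'nm' into frequencies for alleles 0..n_alleles-1.

    Assumption: each of the two digits contributes 0.5, so a homozygote
    such as '11' gives frequency 1 and a heterozygote such as '12' gives
    0.5 for each of its two alleles.
    """
    if len(genotype) != 2 or not (genotype.isascii() and genotype.isdigit()):
        raise ValueError(f"not a two-digit genotype: {genotype!r}")
    freqs = [0.0] * n_alleles
    for digit in genotype:
        allele = int(digit)
        if allele >= n_alleles:
            raise ValueError(f"allele {allele} is out of range for this locus")
        freqs[allele] += 0.5
    return freqs


# Example for a locus with four alleles (numbered 0 to 3).
print(allele_frequencies("11", 4))  # [0.0, 1.0, 0.0, 0.0]
print(allele_frequencies("12", 4))  # [0.0, 0.5, 0.5, 0.0]
```

Codes such as ND or NV carry no genotype, which is one reason they fit the locus fields better than the allele fields.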
For this exercise we will load positive allele frequencies into a sample no by allele effect, and all the genotypes and codes into the sample no by locus effect (even though the genotypes will be redundant). We will load GID, supplier (SOURCE) and ORIGIN as labels of sample no.
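The loading plan can be pictured as splitting each ISO row into three parts. The sketch below is not the ICIS loading procedure; it is a plain-Python illustration under stated assumptions: a row is a dictionary keyed by column name, the locus column is named exactly after the locus (for example `adh1`), allele values are numeric, and the names `split_row` and `LABEL_FIELDS` are hypothetical.

```python
# Hypothetical illustration of the split; not an ICIS API.
LABEL_FIELDS = ("GID", "supplier", "origin")


def split_row(row, loci):
    """Split one ISO row into labels and the two effects.

    Returns (labels, allele_effect, locus_effect), where the effects are
    lists of (sample number, column name, value) tuples.
    """
    sample = row["sampleno"]
    labels = {field: row[field] for field in LABEL_FIELDS}
    allele_effect = []
    locus_effect = []
    for name, value in row.items():
        locus, sep, allele = name.rpartition("_")
        if sep and locus in loci and allele.isdigit():
            # Keep positive values only; this drops 0, -1 (NV) and -99 (ND).
            if value > 0:
                allele_effect.append((sample, name, value))
        elif name in loci:
            # Genotypes and the codes ND, NV, MX, HT all go to the locus effect.
            locus_effect.append((sample, name, value))
    return labels, allele_effect, locus_effect
```

For example, a row for sample 1 with `adh1_1` equal to 1 and the other `adh1` allele columns equal to 0 contributes a single record to the sample by allele effect, while its `adh1` value contributes one record to the sample by locus effect.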
Sample ISO table format
| sampleno | supplier | origin | GID | varname | adh1_0 | adh1_1 | adh1_2 | adh1_3 | ... | sdh1_1 | ... | sdh1_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | DSB/PSV | 235 | 324993 | MO NHAC | 0 | 1 | 0 | 0 | ... | 0 | ... | 0 |
| 2 | DSB/PSV | 235 | 96271 | DOC PHUNG | 0 | 1 | 0 | 0 | ... | 1 | ... | 0 |
| 3 | DSB/PSV | 235 | 168528 | VE VANG | 0 | 1 | 0 | 0 | ... | 1 | ... | 0 |
| : | : | : | : | : | : | : | : | : | : | : | : | : |
| 30967 | BC | 171 | 353242 | ZIONGZO | 0 | 1 | 0 | 0 | ... | 1 | ... | 0 |
The sample ISO table above has 30967 samples and 86 alleles (adh1_0, adh1_1, adh1_2, ..., sdh1_5).
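Because allele columns follow the 'locus_n' naming pattern, the number of loci and alleles can be checked directly from the table header. The following is a minimal sketch assuming the header is available as a list of column names; `alleles_by_locus` is a hypothetical helper, and the short header used in the example is an abbreviated illustration, not the full ISO header.

```python
import re
from collections import defaultdict

# Matches names such as 'adh1_0': a locus name, an underscore, an allele number.
ALLELE_COLUMN = re.compile(r"^(?P<locus>.+)_(?P<allele>\d+)$")


def alleles_by_locus(columns):
    """Group allele numbers by locus name, ignoring non-allele columns."""
    grouped = defaultdict(list)
    for name in columns:
        match = ALLELE_COLUMN.match(name)
        if match:
            grouped[match.group("locus")].append(int(match.group("allele")))
    return dict(grouped)


# Abbreviated, illustrative header.
header = ["sampleno", "supplier", "origin", "GID", "varname",
          "adh1_0", "adh1_1", "adh1_2", "adh1_3", "sdh1_1", "sdh1_5"]
grouped = alleles_by_locus(header)
print(len(grouped))                               # number of loci in this header
print(sum(len(v) for v in grouped.values()))      # number of allele columns
```

Running the same count on the full header gives the totals to compare with the figures quoted in the text.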

