Problems with Uploading non-sensitive data to Central Database

From ICISWiki

Jump to: navigation, search

A major obstacle in uploading pedigree information from the local database to the central database is separating the non-sensitive (public) data from the sensitive (private) data. As all user pedigree entries and changes are kept in the local database, it does not take long for a large volume of both types of data to be stored within. It is very important that some protocols or software is developed to help deal with this issue.

The following is a email thread between Shawn Yates (SPARC), Sandra Micallef (University of Queensland) and Graham McLaren (IRRI) discussing this topic.

Please feel free to contribute any experiences/thoughts below.


Hi Graham and Sandra,

Hope all is well. It sure sounds like everyone involved with ICIS (including me) is very busy these days!

Fran and I were discussing the sharing of Canadian lines with Sandra this morning and there are some obstacles I am bumping into that I would appreciate your ideas on:

1. Going through the local databases and trying to delete out information that the breeders deem sensitive has been quite a daunting and time-consuming task. It also seems to be creating more errors as I go along. Sandra have you worked with anyone else who has run into this problem?

(Graham's Response) I have not worked with this problem, but my take would be as follows: climb into the central and mark all records as private by entering ‘1’ in LGID for example. Then I would look through and identify all the terminal lines I wanted to make public and I would set LGID to ‘0’ for these lines. (You may be able to identify these by name types (eg all realeases), or by location (eq anything not from here) or in the worst case by simply going through the names table sorted in various ways.). I would then set up a pair of queries which located GERMPLASM records for any GPID1 or GPID2 of any public (LGID=0) record and made those records public (set LGID=0). I would re-run these queries until no further records were made public. I would then ‘delete’ all the remaining private records (by setting GRPLCE=GID) and upload the local.
2. Would it be faster to output the SETGEN lists to an Excel spreadsheet and then batch load them to the Central? (Graham's Response) Don’t think so. You would have similar problems.

3. Each year we receive introduction nurseries from CIMMYT, ICARDA and other countries. I enter these into the GMS as they arrive so our breeders can look up the parentage, if available. However, during the last update we received from Sandra, many of the lines previously unavailable were in the new Central, which caused duplicates in the system. Is there a way of finding which lines are duplicates in the system so they can be overwritten?
(Graham's Response) If the material comes from CIMMYT and has SID,SID attached you can link directly to the CID,SID fields in central and set up a batch file of .REP records for processing by BROWSE. Otherwise you have to find the duplicated by name matching, either with BROWSE or by direct name matching queries.

I am hoping to find a quick solution to sharing this information because I can’t spend months working on it. Our main focus right now is to get our molecular data entered into the ICIS system and re-design the Data Comparison Tool to handle it.
Best Regards,
Shawn


Hi Shawn,

Yes we definitely have considered this scenario before and it's probably something that has kept ICIS 'unpopular', so to speak, among some of the breeders/researchers here. They want to have a facility which allows them to select which germplasm is "private" and others that are public, such that the system can integrate their own data on the local with the central without difficulties and minimal effort.

I have to admit that I haven't given it much thought lately because we have been concentrating more on the contents of the Central Wheat database, but we will run into this problem very soon when we start entering the pedigrees from Leslie Research Centre.

The solution that Graham has given below seems to be the most logical one, however, it also seems to be quite lengthy and time consuming which is a problem for all of us I guess. I think that eventually ICIS will need to cater more appropriately for this scenario, for example, having a seperate field in the local which defines whether the germplasm is private or public and when uploading to the Central, it will remain in the local or uploaded to the central with GPID1 and GPID2 updated accordingly.

I am not aware of any previous discussions on this topic - has it ever come up during any ICIS workshop before?

It should make a good topic for ICIS2006.

Cheers,
Sandra


There are two pints to the solution I propose.

1. Who declares what is public and what is private? This can be facilitated in several ways by tools in SetGen or elsewhere, and should obviously be done by the ‘owner’.
2. Do we want to allow situations where progenitors of a public germplasm are private? I suppose commercially we do for hybrids, but I ignored this possibility in Shawn’s problem because I think his arises by mixing up corrections and updates to central with breeding lines. We should try to keep these activities separate (as with Ian’s update database), but that requires quick turn-around in central updates.
3. The actual processing of the update is not much more arduous than the normal update of the whole local once item 1 is taken care of.

Graham


Comments/Suggestions

Shawn Yates (SPARC):
I think Sandra and Graham's solution above would work just fine:

1. Create a new field in the GERMPLASM table, called PRV_STAT (Private Status)
2. If a cultivar's PRV_STAT = 1 then it is considered private and will be ignored during an upload. This also makes it easy for the administrator to identify private lines.
3. Create a function in SetGen where PRV_STAT can be set when the cultivar is entered.
4. Create a function in SetGen where an entire list of lines can be batch-set to private or public (PRV_STAT=1 or 0). This way entire nurseries can be set without having to go in and manually set each cultivar's status.

In the situation where progenitors of a public germplasm are private, the entry in the Central would just not have any information until such a time as the progenitors become public. I think in most cases, the opposite would be true (private lines having public progenitors). By setting the private status in the GERMPLASM table, you are only making the cultivar itself private, not its parents.

Sandra Micallef (University of Queensland):
We have run into this problem here as well.

The first level is the one you are describing – you have hundreds of crosses and lines which are not public yet, so you need to keep them ‘private’- how do you select the data which you want to upload, without spending hours doing it manually? I still think we need another field in the GERMPLSM table to define whether the GID is “hidden” or “public”. This will make it easier to identify the subset of data which needs to be uploaded. Once you have this list, it should not be too difficult to omit the ‘hidden’ records from the upload process.

Another option could be that the ‘hidden’ records are uploaded to the central anyway, but GPID1 and GPID2 are set to 0 (in the Central), with an attribute of “data not yet published” (or something similar). A problem would then arise of how to keep the ‘real’ record of GPID1 and GPID2 in the local database? You can do this in two ways:

1) either leave the record in the local as is, such that you now have two records for the particular GERMPLSM, the original one in the local and the one in the central with GPID1=GPID2=0. When the data becomes public, the record in the local can be uploaded to the central and replace the one with no parents.
2) Or define GPID1 and GPID2 in the CHANGES table in the LOCAL database, and use the CSTATUS field to define whether the changes are uploaded in the next upload or not.

Another point that I think goes well with this problem is that we NEED URGENTLY a user interface for the upload process. I know that William Eusebio was working on automating the upload process – but I think the administrator still needs to have an interface, especially if we need to manage effectively the situations mentioned above.

Personal tools