Talk:GFSVersion2Spec

From Glabwiki

Revision as of 02:40, 2 March 2007

Simpler parallelization

So I got thinking about a simpler approach to parallelizing GFS, for use with Xgrid, or SGE, or whatever. I think the deficiencies of the current approach are: 1) it requires extensive setup/configuration, so it's not automatic, and 2) it uses a complicated communication scheme that is presently not robust. (I can say bad things about it because I wrote the code! Morgan)

So, I think the "digest_and_match" mode has opened up a new possibility here: a lightweight client that can be (more) easily distributed to a very large number of nodes. In this mode, a master process, let's still call it the client, is started up by the user, who specifies some genome, say HG, and a mass list.

The client then divides up the sequence into a series of "sectors", i.e. short regions defined by a start and stop position. It doesn't actually divide them; it just decides where they will be divided (say, every 1 MB). The sequence is stored in a common location, accessible by all compute nodes. So is the mass list. Then this "client" process uses the grid interface to launch a whole set of client processes, each of which uses digest and match to do just one of these pieces. Each would take input parameters that specify the sequence coordinates to start and stop at, plus the mass list. So each of these would go and do its little piece, then dump a small xml output file (the equivalent of the internal plist). As each one finishes, the master just picks up its xml output and gathers it into the same master plist we've always used for output. It then waits until all the mini processes have completed, and assembles everything into a single output, as usual.
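To make the "decide where to divide" step concrete, here is a rough sketch in Python (not the GFS code itself). The names `Sector` and `plan_sectors` are made up for illustration, and the coordinate convention (0-based start, exclusive stop) is an assumption; the proposal above doesn't fix one.

```python
from typing import Dict, List, NamedTuple


class Sector(NamedTuple):
    sequence: str  # name of the sequence file or chromosome (hypothetical label)
    start: int     # 0-based start position, inclusive (assumed convention)
    stop: int      # stop position, exclusive (assumed convention)


def plan_sectors(lengths: Dict[str, int], sector_size: int = 1_000_000) -> List[Sector]:
    """Decide where each sequence would be cut, without touching the sequence data."""
    if sector_size <= 0:
        raise ValueError("sector_size must be positive")
    sectors = []
    for name, length in lengths.items():
        # Walk the sequence in fixed-size steps; the last sector may be shorter.
        for start in range(0, length, sector_size):
            sectors.append(Sector(name, start, min(start + sector_size, length)))
    return sectors
```

Each resulting `Sector` would become the start/stop parameters of one grid job. How those jobs get submitted (Xgrid, SGE, ...) is deliberately left out here.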

The benefits:
- Automatic distribution, regardless of sequence size or genome. Just take a list of sequence files and divvy it up into 1 MB chunks.
- This approach would be amenable to very large spectrum files, because each job would only work on a small piece of genome.
- This approach would work for performing much greater distribution of the jobs to all available nodes, especially if we're running Xgrid. Then we can take advantage of other idle lab machines, on top of the cluster.
- This might make it feasible to use Jainab's HMM on the whole human genome, by harnessing so much compute power.
- If one node/job fails, only a small piece is missing, and the rest of the results should not be affected. (In the present scheme, if say one chromosome of the HG fails, it can't be ignored.)
- No distributed objects needed. Complete output of all nodes preserved, if something fails. Simpler.
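The point about a failed job only costing one small piece could look something like this on the gathering side. This is only a sketch under assumptions: the per-sector xml layout (a root element whose children are the individual results) and the names `gather_outputs` and `results` are hypothetical, since the real plist-equivalent format isn't specified here.

```python
import os
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


def gather_outputs(paths: Iterable[str]) -> Tuple[ET.Element, List[str]]:
    """Merge per-sector xml files; report the ones that are missing or unreadable."""
    merged = ET.Element("results")  # hypothetical root element name
    missing = []
    for path in paths:
        if not os.path.exists(path):
            missing.append(path)  # job never produced output
            continue
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            missing.append(path)  # job died mid-write; treat as failed
            continue
        # Copy every child of the sector's root into the merged tree.
        merged.extend(list(root))
    return merged, missing
```

The list of missing paths tells the master exactly which sectors to resubmit, while everything that did finish is kept.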

I don't know why I didn't think of this before. Anyone else have thoughts? Morgan