GFS Development and Management:2006-January

2006-01-09 (Mon)

  • Modify HMM code to train the new model
  • Finished modifying the training code, but in the results about 50% of the states were the high-intensity state, which needed to be reexamined. The problem was caused by using the same name for an array and an integer; I had thought Perl could tell them apart by type, but it cannot.
    • I'm glad you found the problem! Morgan

2006-01-17 (Tue)

  • The training results still gave about twice as many state 3 (0.32667) as state 2 (0.16565); state 1 has 0.21359 and state 0 has 0.29410. The reason is that we always choose the highest-intensity peptide whenever peptides overlap in sequence. Suggestions to solve this:
    • Define states based on the hit regions. There are several issues with this design: 1. How to decide the highest intensity. One solution is to use the original cutoff, then derive the new cutoff. 2. It loses the whole-peptide information.
    • Shrink the range of the high-intensity state: instead of using 1/3 of the overall intensity, choose the top 15% or 20%, so that the states are evenly distributed. This solution makes sense.
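
A minimal sketch of why keeping only the highest-intensity peptide per overlap group inflates the high-intensity state. The intensities, the grouping, and the cutoff of 60 are hypothetical values chosen for illustration; they are not taken from the data set.

```python
def fraction_high(intensities, cutoff):
    """Fraction of values at or above the high-intensity cutoff."""
    return sum(1 for x in intensities if x >= cutoff) / len(intensities)


def keep_highest_per_group(groups):
    """For each group of sequence-overlapping peptides, keep only the most intense one."""
    return [max(group) for group in groups]


# Hypothetical intensities, grouped by sequence overlap (illustrative only).
overlap_groups = [
    [10, 80],
    [20, 30, 90],
    [40],
    [15, 70],
    [25, 60, 85],
    [35],
]
cutoff = 60  # hypothetical high-intensity cutoff

all_peptides = [x for group in overlap_groups for x in group]
selected = keep_highest_per_group(overlap_groups)

print("all peptides:     ", round(fraction_high(all_peptides, cutoff), 3))
print("highest per group:", round(fraction_high(selected, cutoff), 3))
```

With the same cutoff, the selected peptides fall in the high state more often than the full set does, which is consistent with lowering the share assigned to the high state (top 15% or 20%) to rebalance.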

2006-01-18 (Wed)

  • Remove duplicates (same genes) from the random hit regions
  • Training statistics on random gene windows:
 C terminal          R       K      Ratio
 High Intensity      663     447    1.4832
 Mid Intensity       647     512    1.2636
 Low Intensity       1014    773    1.3118

The ratio does NOT always increase as the intensity increases, which differs from the real gene windows. However, the cutoff used here comes from the real gene windows, so we need a separate intensity cutoff for the random hits. But if we use different cutoffs for the real data and the random data, will that overtrain the model?
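
The check above can be reproduced with a short sketch. The R and K counts are copied from the random-gene-window table; the function names are made up for this example.

```python
# (R, K) counts per intensity state, from the random-gene-window table above.
counts = {
    "High": (663, 447),
    "Mid": (647, 512),
    "Low": (1014, 773),
}


def rk_ratio(r_count, k_count):
    """Ratio of R-terminated to K-terminated peptides."""
    return r_count / k_count


def increases_with_intensity(ratios, order=("Low", "Mid", "High")):
    """True only if the ratio rises strictly from each state to the next."""
    values = [ratios[state] for state in order]
    return all(a < b for a, b in zip(values, values[1:]))


ratios = {state: rk_ratio(r, k) for state, (r, k) in counts.items()}

for state in ("High", "Mid", "Low"):
    print(f"{state:5s} {ratios[state]:.4f}")
print("monotonic increase:", increases_with_intensity(ratios))
```

The Mid ratio falls below the Low ratio, so the check reports no monotonic increase for the random windows.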

2006-01-20 (Fri)

  • After trying different cutoffs to get the three intensity states evenly distributed, the final choice is: the top 20% of total intensity from each mass range goes to the High Intensity state, the middle 40% to the Mid Intensity state, and the remaining 40% to the Low Intensity state.
  • The statistics on random genes and real genes are:
                          Rand genes                   Real
 C terminal          R       K      Ratio        R      K      Ratio
 High Intensity      379     256    1.4805       519    314    1.6529
 Mid Intensity       843     631    1.336        832    737    1.1289
 Low Intensity       1102    845    1.304        916    772    1.1865
  • The state distribution is:
 Intensity State     0 (none)    1 (low)     2 (mid)     3 (high)
 Real Data Set       0.29410     0.23265     0.23691     0.23635
 Random Genes        0.29425     0.23469     0.24195     0.22911
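
A rough sketch of the 20/40/40 split per mass range described above. It assumes the split is made by ranking peptides by intensity within each mass range, which is one reading of "top 20% of total intensity"; the mass-range labels and intensities are hypothetical, and this is not the Perl training code itself.

```python
from collections import Counter, defaultdict


def assign_states(peptides, high_frac=0.20, mid_frac=0.40):
    """Assign intensity states 3 (high), 2 (mid), 1 (low) within each mass range.

    peptides: list of (mass_range_label, intensity) tuples.
    Returns a list of states aligned with the input order.
    """
    by_range = defaultdict(list)
    for idx, (mass_range, intensity) in enumerate(peptides):
        by_range[mass_range].append((intensity, idx))

    states = [None] * len(peptides)
    for members in by_range.values():
        members.sort(reverse=True)  # most intense first
        n_high = round(len(members) * high_frac)
        n_mid = round(len(members) * mid_frac)
        for rank, (_, idx) in enumerate(members):
            if rank < n_high:
                states[idx] = 3
            elif rank < n_high + n_mid:
                states[idx] = 2
            else:
                states[idx] = 1
    return states


# Hypothetical peptides: two mass ranges with five peptides each.
peptides = [("800-1200", v) for v in (5, 1, 4, 2, 3)]
peptides += [("1200-1600", v) for v in (30, 50, 10, 40, 20)]

states = assign_states(peptides)
print(states)
print(Counter(states))
```

Ranking within each mass range keeps the cutoff relative to that range, so one range with generally stronger signals does not take all of the high-state slots.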

2006-01-24 (Tue)

  • Finished training on both real genes and random genes; organized the statistical results.

2006-01-31 (Tue)

  • Finished testing on the genes. Tried about 20 different combinations of all the factors, and the best result still comes from the composition model. The results do not show any improvement from the new model design. Neither the C terminus nor the N terminus helps the model. This might be caused by weighting each factor equally.
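
One way to test the equal-weight explanation is to combine the per-factor scores with adjustable weights. The sketch below assumes each factor contributes a log score that is summed; the factor names, scores, and weights are hypothetical and are not results from the tests above.

```python
def combined_score(factor_scores, weights=None):
    """Weighted sum of per-factor log scores; equal weights when none are given."""
    if weights is None:
        weights = {name: 1.0 for name in factor_scores}
    return sum(weights[name] * score for name, score in factor_scores.items())


# Hypothetical log scores for one candidate (illustrative values only).
scores = {"composition": -2.0, "c_terminus": -5.0, "n_terminus": -4.5}

equal = combined_score(scores)
downweighted = combined_score(
    scores, {"composition": 1.0, "c_terminus": 0.2, "n_terminus": 0.2}
)

print("equal weights:  ", equal)
print("terminus at 0.2:", round(downweighted, 3))
```

If the terminus factors are noisier than composition, giving them the same weight could swamp the composition signal; tuning the weights on held-out genes would show whether that is the case.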