GFS Development and Management:2006-January
From Glabwiki
2006-01-09 (Mon)
- Modify HMM code to train the new model
- Finished modifying the training code, but the results put about 50% of the states in the high-intensity state. This should be reexamined. Update: the problem was caused by using the same name for an array and an integer. I had thought Perl could tell them apart by type, but it does not.
- I'm glad you found the problem! Morgan
2006-01-17 (Tue)
- The training results still give about twice as much of state 3 (0.32667) as of state 2 (0.16565); state 1 has 0.21359 and state 0 has 0.29410. The reason is that we always choose the highest-intensity peptide whenever peptides overlap in sequence. Suggestions to solve this:
- Define states based on the hit regions. This design has several issues: 1. How to decide the highest intensity. One solution is to use the original cutoff, then derive the new cutoff from it. 2. The whole-peptide information is lost.
- Shrink the range of the high-intensity state: instead of using 1/3 of the overall intensity, choose the top 15% or 20%, so that the states are more evenly distributed. This solution makes sense.
2006-01-18 (Wed)
- Remove duplicates (same genes) from Random hit regions
- Training Statistics ON RANDOM GENE WINDOWS:
C terminal        R       K       Ratio
High Intensity    663     447     1.4832
Mid Intensity     647     512     1.2636
Low Intensity     1014    773     1.3118
The ratio does NOT always increase as the intensity increases, which differs from the real gene windows. However, the cutoff used here comes from the real gene windows, so we need a separate intensity cutoff for the random hits. But if we use different cutoffs for the real data and the random data, will it overtrain the model?
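The check described above can be sketched in a few lines of Python. This is only an illustration: the counts are the R and K counts from the random gene window table, while the helper names `rk_ratio` and `is_strictly_increasing` are made up for this sketch and are not part of the training code.

```python
def rk_ratio(r_count, k_count):
    """Return the R/K count ratio for one intensity state."""
    if k_count == 0:
        raise ValueError("K count must be non-zero")
    return r_count / k_count


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# (R, K) counts at the C terminal on random gene windows, from the table above.
random_counts = {"low": (1014, 773), "mid": (647, 512), "high": (663, 447)}

ratios = {state: rk_ratio(r, k) for state, (r, k) in random_counts.items()}

# Order the ratios from low to high intensity and test the trend.
ordered = [ratios["low"], ratios["mid"], ratios["high"]]
trend_holds = is_strictly_increasing(ordered)

for state in ("low", "mid", "high"):
    print(f"{state:>4}: {ratios[state]:.4f}")
print("ratio increases with intensity:", trend_holds)
```

On these counts the mid-intensity ratio is lower than the low-intensity one, so the trend check fails, matching the observation above.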
2006-01-20 (Fri)
- After trying different cutoffs to get the three intensity states evenly distributed, the final choice is: the top 20% of total intensity from each mass range goes to the High Intensity state, the middle 40% goes to the Mid Intensity state, and the remaining 40% goes to the Low Intensity state.
- The statistics on random genes and real genes are:
                  Rand genes                Real
C terminal        R       K       Ratio     R       K       Ratio
High Intensity    379     256     1.4805    519     314     1.6529
Mid Intensity     843     631     1.336     832     737     1.1289
Low Intensity     1102    845     1.304     916     772     1.1865
- The state distribution is:
Intensity State   0 (none)   1 (low)    2 (mid)    3 (high)
Real Data Set     0.29410    0.23265    0.23691    0.23635
Random Genes      0.29425    0.23469    0.24195    0.22911
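The 20% / 40% / 40% split can be sketched as a rank-based assignment. This is a minimal sketch under stated assumptions, not the actual training code: it assumes the split is made by ranking the peptides of one mass range by intensity, and the function name `assign_states` and the example intensities are hypothetical.

```python
def assign_states(intensities, high_frac=0.20, mid_frac=0.40):
    """Assign an intensity state to each peptide in one mass range.

    Peptides are ranked by intensity (assumption: the split is by rank).
    The top `high_frac` get state 3 (high), the next `mid_frac` get
    state 2 (mid), and the rest get state 1 (low).
    """
    n = len(intensities)
    # Indices of the peptides, highest intensity first.
    order = sorted(range(n), key=lambda i: intensities[i], reverse=True)
    n_high = round(n * high_frac)
    n_mid = round(n * mid_frac)

    states = [1] * n  # default: low
    for rank, idx in enumerate(order):
        if rank < n_high:
            states[idx] = 3
        elif rank < n_high + n_mid:
            states[idx] = 2
    return states


# Hypothetical intensities for ten peptides in a single mass range.
example = [5.0, 9.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0, 10.0]
states = assign_states(example)
print(states)
```

To follow the per-mass-range rule, the function would be called once per mass range rather than on the pooled data. State 0 (none) is not handled here, since it covers positions with no observed peptide.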
2006-01-24 (Tue)
- Finish training on both real genes and random genes, organize the statistical results.
2006-01-31 (Tue)
- Finished testing on the genes. Tried about 20 different combinations of all the factors, and the best result still comes from the composition model. The results do not show any improvement from the new model design. The C terminus does not help the model, and neither does the N terminus. This might be caused by giving each factor equal weight.
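One way to probe the equal-weight hypothesis is to give each factor its own weight when combining scores. The sketch below is purely illustrative: it assumes each factor contributes an additive (for example log-scale) score, and the function `combined_score`, the score values, and the weights are all hypothetical, not taken from the model.

```python
def combined_score(factor_scores, weights=None):
    """Combine per-factor scores as a weighted sum.

    With weights=None every factor gets weight 1.0, which corresponds
    to the equal-weight combination described above.
    """
    if weights is None:
        weights = {name: 1.0 for name in factor_scores}
    return sum(weights[name] * score for name, score in factor_scores.items())


# Hypothetical additive scores for one candidate peptide.
scores = {"composition": -2.0, "c_terminus": -1.0, "n_terminus": -3.0}

equal = combined_score(scores)
weighted = combined_score(
    scores,
    {"composition": 1.0, "c_terminus": 0.5, "n_terminus": 0.0},
)
print(equal, weighted)
```

Setting a weight to 0.0 drops that factor entirely, so the composition-only model is one special case of this weighting.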
