Thursday, January 15, 2015

Worried about tight search tolerances and FDR?


Sequest and Mascot popped up when I was in high school.  They've been improved and tweaked and they are still the industry standard -- and for good reason.  However, the core algorithms were designed for a completely different generation of instruments.  I don't mean to disparage these algorithms, but we may need to exercise caution when processing data and interpreting results obtained on today's instruments.

There has been some argument about how best to set up data processing when using target-decoy based approaches.  I'm going to simplify the positions, to a ridiculous level, into two camps.

1) The pre-filterers:  In this approach, we say "my instrument is extremely accurate, so I should only allow Sequest to look at data that tightly matches my MS1 and MS2 tolerances." For example, you would set your MS1 tolerance to 5 ppm and your MS/MS tolerance to 0.02 Da.  This way you know that only good data is going into Sequest.
Detractors of this method might say that if you are doing FDR after Sequest and all of your data is good, the FDR will be too strict and will throw out good data.  Some people would argue that FDR works better the more bad data it gets.

2) The post-filterers:  In this method you ignore the high-resolution, accurate-mass capabilities of your instrument until the end.  You allow your data to be searched with wide tolerances.  You get lots of bad data out of your run, but then your target-decoy search has tons of bad data to throw out and train itself on, making your final results better.  This lets the algorithms work the statistical magic on which they were originally developed (there's a rough sketch of that idea just below).
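
To be concrete about what the target-decoy math is actually doing in camp 2, here's a minimal sketch of the usual counting approach.  The 1% cutoff and the data layout are my own assumptions for illustration, not anything from the paper or from how the Sequest/Mascot pipelines implement it:

```python
# Toy target-decoy FDR filter (my own illustration, not code from the paper).
# psms: list of (score, is_decoy) tuples, where higher score = better match
# and is_decoy is True for hits against the reversed/shuffled decoy database.

def fdr_filter(psms, max_fdr=0.01):
    """Return the target PSMs accepted at an estimated FDR <= max_fdr.

    The FDR at any score threshold is estimated the usual target-decoy way:
    (# decoy hits at or above the threshold) / (# target hits at or above it).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    decoys = targets = 0
    cutoff = -1  # index of the deepest position still under the FDR cap
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    return [(score, is_decoy) for score, is_decoy in ranked[:cutoff + 1]
            if not is_decoy]
```

In this framing, the camp 2 argument is simply that a wide-tolerance search hands that estimate a much better-populated decoy distribution to work with.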

Well, Ben, what's the best way to do it, then?

In this paper from Elena Bonzon-Kulichenko et al. (which, btw, was the paper I found and lost and was raving about a few days ago), they take this issue apart systematically using both Sequest and Mascot.  After being ruthlessly systematic, they find that number 2 is the way to go, with this important caveat:  after running the data this way you should still use your mass accuracy as a filter.  In other words, line up your peptides so that you can compare the experimental and theoretical MS1 (and/or MS/MS, though that is much harder) masses and throw out the data that doesn't match closely (or use it as a metric to understand how well your search engine-target decoy pipeline worked).  Something like the little filter sketched below.  The downside is that it's a whole lot more data intensive to search a wider mass range.
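
To make that caveat concrete, here's a minimal sketch of the kind of post-search mass-accuracy check I mean.  The 5 ppm cutoff, the proton-mass constant, and the dictionary field names are my own assumptions for illustration, not values from the paper:

```python
# Toy post-search mass-accuracy filter (again just an illustration).

PROTON = 1.007276  # mass of a proton, in Da

def precursor_ppm_error(exp_mz, charge, theo_mass):
    """Error, in ppm, between the observed precursor and the theoretical
    monoisotopic peptide mass reported by the search engine."""
    observed_mass = (exp_mz - PROTON) * charge  # neutral monoisotopic mass
    return (observed_mass - theo_mass) / theo_mass * 1e6

def mass_accuracy_filter(psms, max_ppm=5.0):
    """Keep only PSMs whose precursor mass error is within +/- max_ppm.
    Each psm is assumed to be a dict with 'exp_mz', 'charge', 'theo_mass'."""
    return [p for p in psms
            if abs(precursor_ppm_error(p['exp_mz'], p['charge'],
                                       p['theo_mass'])) <= max_ppm]
```

The same ppm errors, plotted as a histogram over the accepted PSMs, also make a nice sanity check on how well the search engine-target decoy pipeline behaved, which is the other use mentioned above.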

It is an interesting topic that has been visited a lot (the title of this paper starts with "Revisiting"), but this analysis is pretty convincing.  I'm definitely interested to hear what y'all think!

In the end, I'm on the old school side of things. FDR, of any kind, is just a shortcut so that you don't have to manually verify so many MS/MS peptide spectra.  I assess a pipeline with an FDR calculator in this way:  the better the pipeline, the fewer MS/MS spectra I need to look at.  I still haven't seen a single one that I would trust enough to send a dataset to a journal without looking through a few hundred MS/MS spectra first, but that day is coming!
