Update models to use GGI continuous-time approximators #12

ihh · 2023-05-10T21:48:41Z

The goal is to update the HMMs of Historian (ML and MCMC components) to use the systematic approximation to the indel model described here: https://academic.oup.com/genetics/article/216/4/1187/6065876

The primary difficulties in doing this are as follows:

The Pair HMM of Holmes 2020 (representing evolution along a single branch) is fully connected, whereas Historian assumes that the D->I transition probability is zero. This has knock-on effects for the composite HMMs used for parent-sibling triads (two branches) and grandparent-parent-sibling tetrads (three branches). The tetrads are never used directly, however; instead we use Redelings-Suchard kernels to propose moves.
- If (?) any parts of the code group insertions before deletions (implicitly assuming that the I->D transition is allowed but the D->I transition isn't), these may need to be updated
- Alternatively (and probably better), when calculating likelihoods of gaps that include both deletions and insertions, use combinatorial formulae to calculate the likelihood of the gap summed over all I/D orderings
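The sum over I/D orderings can be illustrated with a small dynamic program, swapped in here for closed-form combinatorial formulae. This is only a sketch under stated assumptions: the three-state M/I/D layout, the `TRANS` values, and the function name `gap_likelihood` are hypothetical (not taken from Historian or from Holmes 2020), and emission probabilities for inserted residues are ignored.

```python
# Hypothetical single-branch Pair HMM transition probabilities between
# Match (M), Insert (I) and Delete (D). Illustrative numbers only.
TRANS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.60, "I": 0.30, "D": 0.10},
    "D": {"M": 0.60, "I": 0.10, "D": 0.30},
}


def gap_likelihood(n_ins, n_del, trans=TRANS):
    """Sum the probability of every path M -> gap states -> M that visits
    exactly n_ins Insert states and n_del Delete states, in any order."""
    if n_ins == 0 and n_del == 0:
        return trans["M"]["M"]
    # fwd[i][d][s]: summed probability of partial paths that have used
    # i inserts and d deletes so far and currently sit in gap state s.
    fwd = [[{"I": 0.0, "D": 0.0} for _ in range(n_del + 1)]
           for _ in range(n_ins + 1)]
    for i in range(n_ins + 1):
        for d in range(n_del + 1):
            if i > 0:
                # The gap may open with an insert only at (i, d) == (1, 0).
                start = trans["M"]["I"] if (i == 1 and d == 0) else 0.0
                prev = fwd[i - 1][d]
                fwd[i][d]["I"] = (start
                                  + prev["I"] * trans["I"]["I"]
                                  + prev["D"] * trans["D"]["I"])
            if d > 0:
                # The gap may open with a delete only at (i, d) == (0, 1).
                start = trans["M"]["D"] if (i == 0 and d == 1) else 0.0
                prev = fwd[i][d - 1]
                fwd[i][d]["D"] = (start
                                  + prev["I"] * trans["I"]["D"]
                                  + prev["D"] * trans["D"]["D"])
    last = fwd[n_ins][n_del]
    return last["I"] * trans["I"]["M"] + last["D"] * trans["D"]["M"]
```

Setting `trans["D"]["I"]` to zero recovers the current Historian assumption, in which only orderings with no deletion directly followed by an insertion contribute.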
Historian currently assumes that the transition probabilities of the Pair HMM can be computed in closed form; instead a numerical approximation (e.g. fourth-order Runge-Kutta, RK4) is needed
The current parameter-fitting method of Historian uses a (fudged) EM method to compute the indel rates; instead we will need a gradient ascent approach based on autodiff of RK4. Methods that accumulate indel counts will also need to be adapted to tally transition counts sorted by branch length
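The last two points can be sketched together in a toy setting. Everything here is an assumption for illustration: the ODE is a one-parameter placeholder (`dp/dt = -rate * p`), not the GGI equations of Holmes 2020; the state is a scalar rather than a vector of transition probabilities; and the names `Dual`, `rk4`, `survival_prob` and `fit_rate` are hypothetical. Forward-mode dual numbers stand in for whatever autodiff tool is eventually used.

```python
class Dual:
    """Forward-mode autodiff number: a value plus its derivative
    with respect to a single parameter."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)

    __radd__ = __add__

    def __neg__(self):
        return Dual(-self.val, -self.dot)

    def __mul__(self, other):
        o = Dual.lift(other)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    __rmul__ = __mul__


def rk4(rhs, y0, t_end, n_steps=20):
    """Integrate dy/dt = rhs(t, y) from t = 0 to t_end with classic
    fixed-step RK4. Works on floats and on Dual numbers alike."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2, y + (h / 2) * k1)
        k3 = rhs(t + h / 2, y + (h / 2) * k2)
        k4 = rhs(t + h, y + h * k3)
        y = y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y


def survival_prob(rate, t):
    """Placeholder transition probability: solves dp/dt = -rate * p, p(0) = 1."""
    return rk4(lambda _t, p: -(rate * p), 1.0, t)


def fit_rate(counts, rate=0.3, step=0.1, n_iters=300):
    """Gradient ascent on a binomial log-likelihood.

    counts maps branch length -> (n_kept, n_lost), i.e. the transition
    counts are tallied separately for each branch length."""
    for _ in range(n_iters):
        grad = 0.0
        for t, (n_kept, n_lost) in counts.items():
            # p.val is the probability; p.dot is d(probability)/d(rate),
            # obtained by differentiating through the RK4 steps.
            p = survival_prob(Dual(rate, 1.0), t)
            grad += (n_kept / p.val - n_lost / (1.0 - p.val)) * p.dot
        rate += step * grad
    return rate
```

The counts are keyed by branch length because, without a closed form, the transition probability has to be re-integrated for each distinct branch length, so a single pooled count per transition type would no longer be a sufficient statistic. The fixed step size and iteration count are arbitrary choices for the toy problem.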

(A limited strategy may be to leave the current inference code in place for the ML alignment/reconstruction phase and its associated HMMs, and update only the MCMC and parameter-fitting code.
The viability of this strategy is unclear, though; it seems a little risky from a correctness standpoint.)

For a full update, at a bare minimum, the following code will need to be updated:

model.h, model.cpp
- ProbModel::transProb
- IndelCounts, EventCounts
pairhmm.h, pairhmm.cpp
- This is the triad HMM (parent + siblings)
- Needs a careful analysis of the states and transition probabilities
- There is already some approximation here, it may be acceptable to ignore some transition paths
forward.h, forward.cpp
- Dynamic programming using the triad HMM
sampler.h, sampler.cpp
- Dynamic programming using the branch & triad HMMs
refiner.h, refiner.cpp

One question is... would it be better to build a generic, parallelizable, MCMC-only system using Machine Boss?

Pros of doing it in Machine Boss:

it would be cleaner
it would be generic (multiple models), thus a better fit with the research direction of investigating richer models by transducer coarse-graining
it could be designed to be parallelizable
addressing a narrower algorithm (MCMC) avoids getting bogged down in fixing Historian bugs like "Large data set crashing" (#4)
historian does not have a big user base and the code is a little messy, so maintaining it doesn't make a lot of sense

Cons of doing it in Machine Boss:

complete rewrites take time
MCMC without summing over substitutions mixes more slowly
there may be lots of annoying edge cases for a generic version (e.g. null cycles) that can be heuristically avoided in a specific version
