Update models to use GGI continuous-time approximators #12

Open
ihh opened this issue May 10, 2023 · 0 comments

The goal is to update the HMMs of Historian (ML and MCMC components) to use the systematic approximation to the indel model described here: https://academic.oup.com/genetics/article/216/4/1187/6065876

The primary difficulties in doing this are as follows:

  1. The Pair HMM of Holmes 2020 (representing evolution along a single branch) is fully connected, whereas Historian assumes that the D->I transition probability is zero. This has knock-on ramifications for the composite HMMs used for parent-sibling triads (two branches) and grandparent-parent-sibling tetrads (three branches). The tetrads are never used directly, though; instead we use Redelings-Suchard kernels to propose moves.
    • If(?) there are any parts of the code that group insertions before deletions (implicitly assuming that the I->D transition is allowed but the D->I transition isn't), these may need to be updated
    • Alternatively (and probably better), when calculating likelihoods of gaps that include both deletions and insertions, use combinatorial formulae to calculate the likelihood of the gap summed over I/D orderings
  2. Historian currently assumes that the transition probabilities of the Pair HMM can be computed in closed form; instead, a numerical approximation (e.g. Runge-Kutta RK4) is needed
  3. The current parameter-fitting method of Historian uses a (fudged) EM method to compute the indel rates; instead, we will need a gradient ascent approach based on autodiff of RK4. Methods that accumulate indel counts will also need to be adapted to tally transition counts sorted by branch length
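
To make points 2 and 3 concrete, here is a minimal sketch of differentiating through a fixed-step RK4 solve. It is not the GGI equations from Holmes 2020 and not Historian code: the ODE is a toy two-state continuous-time Markov chain chosen only because it has a closed form to check against, and every name (`Dual`, `rk4`, `two_state`, `lam`, `mu`) is hypothetical. Autodiff is done with hand-rolled forward-mode dual numbers so the example runs on the standard library alone; a real implementation would more likely use an autodiff library.

```python
import math


class Dual:
    """Number val + der*eps with eps**2 == 0 (forward-mode autodiff)."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule: (a*b)' = a*b' + a'*b
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)

    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.val, -self.der)


def rk4(f, y0, t_end, steps):
    """Integrate dy/dt = f(y) from t=0 to t_end with classical fixed-step RK4."""
    h = t_end / steps
    y = list(y0)
    for _ in range(steps):
        k1 = f(y)
        k2 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
        k3 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
        k4 = f([yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
    return y


def two_state(lam, mu):
    """Toy stand-in ODE (an assumption, not the indel model):
    dp0/dt = -lam*p0 + mu*p1,  dp1/dt = lam*p0 - mu*p1."""
    def f(p):
        p0, p1 = p
        flow = lam * p0 - mu * p1
        return [-flow, flow]
    return f


# Seed d(lam)/d(lam) = 1 so that .der carries d/d(lam) through the whole solve.
lam, mu, t = 0.3, 0.5, 1.0
p = rk4(two_state(Dual(lam, 1.0), mu), [1.0, 0.0], t, steps=100)

# Closed form for this toy model, used only to sanity-check the sketch.
s = lam + mu
exact = lam / s * (1.0 - math.exp(-s * t))
d_exact = mu / s**2 * (1.0 - math.exp(-s * t)) + lam / s * t * math.exp(-s * t)
print(p[1].val, exact)    # RK4 value vs closed form
print(p[1].der, d_exact)  # autodiff d p1 / d lam vs closed form
```

Because `rk4` only uses `+` and `*`, the same integrator works on plain floats and on `Dual` values. A gradient-ascent step on an indel rate would then take the form `lam += step_size * d_loglik_d_lam`, with the derivative assembled from terms like `p[1].der` for each branch length.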

(A limited strategy may be to leave the current inference code in place for the ML alignment/reconstruction phase (and associated HMMs), and update only the MCMC and parameter-fitting code.
The viability of this strategy is unclear, though; it seems a little risky from a correctness standpoint.)

For a full update, at a bare minimum, the following code will need to be updated:

One question is... would it be better to build a generic, parallelizable, MCMC-only system using Machine Boss?

Pros of doing it in Machine Boss:

  • it would be cleaner
  • it would be generic (multiple models), thus a better fit with the research direction of investigating richer models by transducer coarse-graining
  • it could be designed to be parallelizable
  • addressing a narrower algorithm (MCMC) avoids getting bogged down in fixing Historian bugs like "Large data set crashing" (#4)
  • historian does not have a big user base and the code is a little messy, so maintaining it doesn't make a lot of sense

Cons of doing it in Machine Boss:

  • complete rewrites take time
  • MCMC without summing over substitutions mixes more slowly
  • there may be lots of annoying edge cases for a generic version (e.g. null cycles) that can be heuristically avoided in a specific version