An introduction to viral phylogeography

Paul Bastide

MAP5, CNRS, Université Paris Cité

2025-10-17

Program

  • Introduction to phylogeography

    • Viral phylogeny
    • Viral spread
  • Practical

    • Simulate trajectory
    • Reconstruct the spread of the West Nile Virus

West Nile Virus

West Nile Virus (WNV)

  • First detected in New York in 1999
  • Spread across all the US
  • About 1200 death between 2000 and 2010

Questions

  • Origin of the outbreak ?
  • Routes of transmissions ?
  • Migrating birds ? Population mouvements ?

Phylo-geography of the WNV

Virus evolution

Central dogma of molecular biology

Source: National Human Genome Research Institute

Test PCR

PCR = ?

Prélèvement Covid Rhino-pharyngé ?

Polymerase Chain Reaction !

Source: Wikipedia, Enzoklop, CC BY-SA 4.0

ARN virus

Source: Wikipedia

Telephone whispers

AACUUUUGCGCGCGGGGAAAAAAAGCCCCAAAAUUUU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAAAAAAGCCCCAAAAUUUU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAUAAAAGCCCCAAAAUUUU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAUAAAAGCCCCAAAAUUGU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAUAAAGGCCCCAAAAUUGU
\(\qquad\qquad\downarrow\)
\(\qquad\qquad\cdots\)

Virus evolution

  1. Replication error.
  2. Creation of a variant.
  3. Variant can be neutral, more virulent, less virulent.
  4. Variant gets propagated.
  • ARN virus: many variants, most neutral.
  • 2 different patients ~ 2 different variant.
  • Allows for a tracking of the outbreak.

Transmission tree

Sampling

Sampling

Mutations

Mutations

Observations

Phylogenetic tree

WNV phylo-geography

Viral spacial spread

“Random” moves ?

  • Virus is at point (0,0) at time t=0.
  • It moves “randomly” during a time T=1.
  • Where is it at time T ?

Random walk on a grid

  1. Choose a direction.
  2. Move one step.

\[ \frac{1}{\sqrt{2}}\mathbf{Z}_i = \begin{cases} (\;\,0, \;\,1) & p = 1/4\\ (\;\,0, -1) & p = 1/4\\ (\;\,1, \;\,0) & p = 1/4\\ (-1, \;\,0) & p = 1/4\\ \end{cases} \]

\[ \mathbf{Z}_i \text{ i.i.d. } \quad \mathbf{E}[\mathbf{Z}] = \mathbf{0}_2 \quad \mathbf{V}[\mathbf{Z}] = \mathbf{I}_2 \]

\[ \mathbf{S}_n = \sum_{i=1}^n \mathbf{Z}_i \]

Random walk on a grid

  1. Refine the grid.
  2. Make time go faster.

\[ \mathbf{S}_n = \sum_{i=1}^n \mathbf{Z}_i \]

\[ \mathbf{W}^{(n)}_t = \frac{1}{\sqrt{n}} \mathbf{S}_{\lfloor tn \rfloor} \]

To the Brownian Motion…

  1. Refine the grid.
  2. Make time go faster.

\[ \mathbf{S}_n = \sum_{i=1}^n \mathbf{Z}_i \]

\[ \mathbf{W}^{(n)}_t = \frac{1}{\sqrt{n}} \mathbf{S}_{\lfloor tn \rfloor} \implies \mathbf{W}_t \]

The random walk converges weakly to the Brownian motion as \(n\to \infty\).

Brownian Motion

  • Robert Brown (1827): empirical observation
  • Louis Bachelier (1901): use in finance
  • Albert Einstein (1905): use in physics
  • Norbert Wiener (1923): mathematical construction
  • Paul Lévy (1933): rigorous modern definition

Univariate Brownian Motion

\[ X_0 = \mu; \qquad d X_t = \sigma d B_t \]

Brownian Motion:

  • Continuous time process with continuous trajectory (a.s.)

Univariate Brownian Motion

\[ X_0 = \mu; \qquad d X_t = \sigma d W_t \]

Brownian Motion:

  • Continuous time process with continuous trajectory (a.s.)
  • Independent increments: \((X_{t_3} - X_{t_2})\) ind. \((X_{t_2} - X_{t_1})\)

Univariate Brownian Motion

\[ X_0 = \mu; \qquad d X_t = \sigma d W_t \]

Brownian Motion:

  • Continuous time process with continuous trajectory (a.s.)
  • Independent increments: \((X_{t_3} - X_{t_2})\) ind. \((X_{t_2} - X_{t_1})\)
  • Stationary increments: \((X_{t_2} - X_{t_1}) \sim X_{t_2 - t_1} - X_0\)

Univariate Brownian Motion

Brownian Motion:

  • Continuous time process with continuous trajectory (a.s.)
  • Independent increments: \((X_{t_3} - X_{t_2})\) ind. \((X_{t_2} - X_{t_1})\)
  • Stationary increments: \((X_{t_2} - X_{t_1}) \sim X_{t_2 - t_1} - X_0\)
  • Gaussian: \(X_t \sim \mathcal{N}(X_0, \sigma^2 t)\)

Bi-variate Brownian Motion

Brownian Motion:

\[ \mathbf{X}_t \sim \mathcal{N}(\mathbf{X}_0, t \mathbf{\Sigma}) \]

Brownian Motion

Random trajectories

Brownian motion

Distribution of final points.

Brownian motion

Distribution of final points.

Brownian motion

Distribution of final points.

\[ \mathbf{X}_t \sim \mathcal{N}\left( \mathbf{X}_0, \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} t\right) \]

\[ \mathbf{\Sigma} = \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]

Brownian motion

Distribution of final points.

\[ \mathbf{X}_t \sim \mathcal{N}\left( \mathbf{X}_0, \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} t\right) \]

\[ \mathbf{\Sigma} = \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} = \begin{pmatrix} 0.1 & 0 \\ 0 & 1 \end{pmatrix} \]

Brownian motion

Distribution of final points.

\[ \mathbf{X}_t \sim \mathcal{N}\left( \mathbf{X}_0, \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} t\right) \]

\[ \mathbf{\Sigma} = \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} = \begin{pmatrix} 0.55 & 0.45 \\ 0.45 & 0.55 \end{pmatrix} \]

WNV phylo-geography

Phylogeny + Geography = Phylo-geography

Phylo-geography

  • Phylo-: viral tree

  • -geography: on a map

Phylo-geography

  • Phylo-: viral tree

  • -geography: on a map

Phylo-geography

  • Phylo-: viral tree

  • -geography: on a map

Brownian Motion on a Tree

 

  • The trait evolves like a BM in time
  • Speciation \(\to\) two independent processes
  • Only tip values are measured

Brownian Motion on a Tree

 

  • SDE: \(d X_t = \sigma d W_t\)
  • Heredity: \(X_8 | X_7 \sim \mathcal{N}(X_7, \sigma^2 t_8)\)
  • Root: \(X_{\rho} = \mu\)

Variance Structure

 

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid

Variance Structure

 

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \mathbf{V}(X_8) + \sigma^2 t_9 \]

Variance Structure

 

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \mathbf{V}(X_8) + \sigma^2 t_9 = \sigma^2 t_8 + \sigma^2 t_9 \]

Variance Structure

 

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \mathbf{V}(X_8) + \sigma^2 t_9 = \sigma^2 t_8 + \sigma^2 t_9 = \sigma^2 V_{9} \]

Variance Structure

 

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \sigma^2 V_{9} \] \[ \mathbf{C}(Y_4, Y_5) = \mathbf{V}(X_9) = \sigma^2 V_{9} = \sigma^2 V_{45} \]

Variance Structure

 

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid

Covariances: \(\mathbf{C}(X_i, X_j) = \sigma^2 V_{ij}\)

Distribution: \(\mathbf{X} \sim \mathcal{N}(\mu\mathbf{1}_n, \sigma^2 \mathbf{V})\) \(\to\) Multivariate Gaussian

Maximum Likelihood

 

\[ \mathbf{Y} \sim \mathcal{N}(\mu\mathbf{1}_n, \sigma^2 \mathbf{V}) \]

\[ \hat{\mu} = (\mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{1}_n)^{-1} \mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{Y} \\ \hat{\sigma}^2 = \frac{1}{n-1} (\mathbf{Y} - \hat{\mu}\mathbf{1}_n)^T \mathbf{V}^{-1} (\mathbf{Y} - \hat{\mu}\mathbf{1}_n) \]

Ancestral State Reconstruction

 

\[ \begin{pmatrix} \mathbf{Z}\\ \mathbf{Y} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu\mathbf{1}_m\\ \mu\mathbf{1}_n \end{pmatrix} , \sigma^2 \begin{pmatrix} \mathbf{V}_{ZZ} & \mathbf{V}_{ZY}\\ \mathbf{V}_{YZ} & \mathbf{V}_{YY} \end{pmatrix} \right) \]

Ancestral State Reconstruction

\[ \begin{pmatrix} \mathbf{Z}\\ \mathbf{Y} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu\mathbf{1}_m\\ \mu\mathbf{1}_n \end{pmatrix} , \sigma^2 \begin{pmatrix} \mathbf{V}_{ZZ} & \mathbf{V}_{ZY}\\ \mathbf{V}_{YZ} & \mathbf{V}_{YY} \end{pmatrix} \right) \]

Conditional distribution:

\[ \mathbf{Z} \mid \{\mathbf{Y} = \mathbf{y}\} \sim \mathcal{N}\left( \bar{\boldsymbol{\mu}} , \sigma^2\bar{\mathbf{V}} \right) \]

\[ \bar{\boldsymbol{\mu}} = \mu\mathbf{1}_m + \mathbf{V}_{ZY} \mathbf{V}_{YY}^{-1} (\mathbf{y} - \mu\mathbf{1}_n) \quad \bar{\mathbf{V}} = \mathbf{V}_{ZZ} - \mathbf{V}_{ZY} \mathbf{V}_{YY}^{-1}\mathbf{V}_{YZ} \]

Bi-variate process

  • \(\mathbf{Y}_{i} = (Y_{i,1}, Y_{i,2})\)
    \(\hookrightarrow\) \(\mathbf{Y}\) is a \(n\times 2\) matrix.
  • \(Y_{i,k}\): trait \(k\) of of node \(i\)
  • BM: \(\mathbf{X}_t \sim \mathcal{N}(\mathbf{X}_0, t \mathbf{\Sigma})\)
  • \(\mathbf{C}[Y_{i,k}, Y_{j,l}] = \Sigma_{k,l} \times V_{ij}\)

Maximum Likelihood

\[ \mathbf{Y} \sim \mathcal{MN}_{n,2}(\mathbf{1}_n\boldsymbol{\mu}^T, \mathbf{V}, \mathbf{\Sigma}) \]

\[ \hat{\boldsymbol{\mu}}^T = (\mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{1}_n)^{-1} \mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{Y} \\ \hat{\mathbf{\Sigma}} = \frac{1}{n-1} (\mathbf{Y} - \mathbf{1}_n\hat{\boldsymbol{\mu}}^T)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{1}_n\hat{\boldsymbol{\mu}}^T) \]

Ancestral State Reconstruction

Vectorized version: \[ \begin{pmatrix} \mathbf{Y}_{\cdot,1}\\ \mathbf{Y}_{\cdot,2} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1\mathbf{1}_n\\ \mu_2\mathbf{1}_n \end{pmatrix} , \begin{pmatrix} \Sigma_{11}\mathbf{V}_{YY} & \Sigma_{12}\mathbf{V}_{YY}\\ \Sigma_{21}\mathbf{V}_{YY} & \Sigma_{22}\mathbf{V}_{YY} \end{pmatrix} \right) \]

With ancestral states: \[ \begin{pmatrix} \mathbf{Z}_{\cdot,1}\\ \mathbf{Z}_{\cdot,2}\\ \mathbf{Y}_{\cdot,1}\\ \mathbf{Y}_{\cdot,2} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1\mathbf{1}_m\\ \mu_2\mathbf{1}_m\\ \mu_1\mathbf{1}_n\\ \mu_2\mathbf{1}_n \end{pmatrix} , \begin{pmatrix} \Sigma_{11}\mathbf{V}_{ZZ} & \Sigma_{12}\mathbf{V}_{ZZ} & \Sigma_{11}\mathbf{V}_{ZY} & \Sigma_{12}\mathbf{V}_{ZY}\\ \Sigma_{21}\mathbf{V}_{ZZ} & \Sigma_{22}\mathbf{V}_{ZZ} & \Sigma_{21}\mathbf{V}_{ZY} & \Sigma_{22}\mathbf{V}_{ZY}\\ \Sigma_{11}\mathbf{V}_{YZ} & \Sigma_{12}\mathbf{V}_{YZ} & \Sigma_{11}\mathbf{V}_{YY} & \Sigma_{12}\mathbf{V}_{YY}\\ \Sigma_{21}\mathbf{V}_{YZ} & \Sigma_{22}\mathbf{V}_{YZ} & \Sigma_{21}\mathbf{V}_{YY} & \Sigma_{22}\mathbf{V}_{YY} \end{pmatrix} \right) \]

Ancestral State Reconstruction

\[ \begin{pmatrix} \mathbf{Z}\\ \mathbf{Y} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbf{m}_Z\\ \mathbf{m}_Y \end{pmatrix} , \begin{pmatrix} \mathbf{S}_{ZZ} & \mathbf{S}_{ZY}\\ \mathbf{S}_{YZ} & \mathbf{S}_{YY} \end{pmatrix} \right) \]

Conditional distribution:

\[ \mathbf{Z} \mid \{\mathbf{Y} = \mathbf{y}\} \sim \mathcal{N}\left( \bar{\mathbf{m}} , \bar{\mathbf{S}} \right) \]

\[ \bar{\mathbf{m}} = \mathbf{m}_Z + \mathbf{S}_{ZY} \mathbf{S}_{YY}^{-1} (\mathbf{y} - \mathbf{m}_Y) \quad \bar{\mathbf{S}} = \mathbf{S}_{ZZ} - \mathbf{S}_{ZY} \mathbf{S}_{YY}^{-1}\mathbf{S}_{YZ} \]

WNV Phylo-geography

WNV Phylo-geography

Beyond the Brownian Motion

Joint Bayesian inference

Joint posterior: \[ p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \mid \mathbf{Y}, \mathbf{S} \right) \propto p\left(\mathbf{Y}, \mathbf{S} \mid \boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \right) p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \right) \]

Assumption: \(\mathbf{Y}\) and \(\mathbf{S}\) independent conditionally on \(\mathcal{T}\).

\[ p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \mid \mathbf{Y}, \mathbf{S} \right) \propto p\left(\mathbf{Y} \mid \boldsymbol{\theta}, \mathcal{T} \right) p\left(\mathbf{S} \mid \mathcal{T}, \boldsymbol{\psi} \right) p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \right) \]

\(\to\) Run a MCMC, updating parameters sequentially.

Ressources