An introduction to viral phylogeography

Paul Bastide

MAP5, CNRS, Université Paris Cité

2025-10-17

Program

Introduction to phylogeography
- Viral phylogeny
- Viral spread
Practical
- Simulate trajectory
- Reconstruct the spread of the West Nile Virus

West Nile Virus

West Nile Virus (WNV)

First detected in New York in 1999
Spread across all the US
About 1200 death between 2000 and 2010

Questions

Origin of the outbreak ?
Routes of transmissions ?
Migrating birds ? Population mouvements ?

Phylo-geography of the WNV

Virus evolution

Central dogma of molecular biology

Source: National Human Genome Research Institute

Test PCR

PCR = ?

Prélèvement Covid Rhino-pharyngé ?

Polymerase Chain Reaction !

Source: Wikipedia, Enzoklop, CC BY-SA 4.0

ARN virus

Source: Wikipedia

Telephone whispers

AACUUUUGCGCGCGGGGAAAAAAAGCCCCAAAAUUUU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAAAAAAGCCCCAAAAUUUU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAUAAAAGCCCCAAAAUUUU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAUAAAAGCCCCAAAAUUGU
\(\qquad\qquad\downarrow\) AACUUUUGCGCGCGGGGAAUAAAGGCCCCAAAAUUGU
\(\qquad\qquad\downarrow\)
\(\qquad\qquad\cdots\)

Virus evolution

Replication error.
Creation of a variant.
Variant can be neutral, more virulent, less virulent.
Variant gets propagated.

ARN virus: many variants, most neutral.
2 different patients ~ 2 different variant.
Allows for a tracking of the outbreak.

Transmission tree

Sampling

Sampling

Mutations

Mutations

Observations

Phylogenetic tree

WNV phylo-geography

Viral spacial spread

“Random” moves ?

Virus is at point (0,0) at time t=0.
It moves “randomly” during a time T=1.
Where is it at time T ?

Random walk on a grid

Choose a direction.
Move one step.

\[ \frac{1}{\sqrt{2}}\mathbf{Z}_i = \begin{cases} (\;\,0, \;\,1) & p = 1/4\\ (\;\,0, -1) & p = 1/4\\ (\;\,1, \;\,0) & p = 1/4\\ (-1, \;\,0) & p = 1/4\\ \end{cases} \]

\[ \mathbf{Z}_i \text{ i.i.d. } \quad \mathbf{E}[\mathbf{Z}] = \mathbf{0}_2 \quad \mathbf{V}[\mathbf{Z}] = \mathbf{I}_2 \]

\[ \mathbf{S}_n = \sum_{i=1}^n \mathbf{Z}_i \]

Random walk on a grid

Refine the grid.
Make time go faster.

\[ \mathbf{S}_n = \sum_{i=1}^n \mathbf{Z}_i \]

\[ \mathbf{W}^{(n)}_t = \frac{1}{\sqrt{n}} \mathbf{S}_{\lfloor tn \rfloor} \]

To the Brownian Motion…

Refine the grid.
Make time go faster.

\[ \mathbf{S}_n = \sum_{i=1}^n \mathbf{Z}_i \]

\[ \mathbf{W}^{(n)}_t = \frac{1}{\sqrt{n}} \mathbf{S}_{\lfloor tn \rfloor} \implies \mathbf{W}_t \]

The random walk converges weakly to the Brownian motion as \(n\to \infty\).

Brownian Motion

Robert Brown (1827): empirical observation
Louis Bachelier (1901): use in finance
Albert Einstein (1905): use in physics
Norbert Wiener (1923): mathematical construction
Paul Lévy (1933): rigorous modern definition

Univariate Brownian Motion

\[ X_0 = \mu; \qquad d X_t = \sigma d B_t \]

Brownian Motion:

Continuous time process with continuous trajectory (a.s.)

Univariate Brownian Motion

\[ X_0 = \mu; \qquad d X_t = \sigma d W_t \]

Brownian Motion:

Continuous time process with continuous trajectory (a.s.)
Independent increments: \((X_{t_3} - X_{t_2})\) ind. \((X_{t_2} - X_{t_1})\)

Univariate Brownian Motion

\[ X_0 = \mu; \qquad d X_t = \sigma d W_t \]

Brownian Motion:

Continuous time process with continuous trajectory (a.s.)
Independent increments: \((X_{t_3} - X_{t_2})\) ind. \((X_{t_2} - X_{t_1})\)
Stationary increments: \((X_{t_2} - X_{t_1}) \sim X_{t_2 - t_1} - X_0\)

Univariate Brownian Motion

Brownian Motion:

Continuous time process with continuous trajectory (a.s.)
Independent increments: \((X_{t_3} - X_{t_2})\) ind. \((X_{t_2} - X_{t_1})\)
Stationary increments: \((X_{t_2} - X_{t_1}) \sim X_{t_2 - t_1} - X_0\)
Gaussian: \(X_t \sim \mathcal{N}(X_0, \sigma^2 t)\)

Bi-variate Brownian Motion

Brownian Motion:

\[ \mathbf{X}_t \sim \mathcal{N}(\mathbf{X}_0, t \mathbf{\Sigma}) \]

Brownian Motion

Random trajectories

Brownian motion

Distribution of final points.

Brownian motion

Distribution of final points.

Brownian motion

Distribution of final points.

\[ \mathbf{X}_t \sim \mathcal{N}\left( \mathbf{X}_0, \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} t\right) \]

\[ \mathbf{\Sigma} = \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]

Brownian motion

Distribution of final points.

\[ \mathbf{X}_t \sim \mathcal{N}\left( \mathbf{X}_0, \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} t\right) \]

\[ \mathbf{\Sigma} = \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} = \begin{pmatrix} 0.1 & 0 \\ 0 & 1 \end{pmatrix} \]

Brownian motion

Distribution of final points.

\[ \mathbf{X}_t \sim \mathcal{N}\left( \mathbf{X}_0, \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} t\right) \]

\[ \mathbf{\Sigma} = \begin{pmatrix} \sigma^2_x & \sigma^2_{xy} \\ \sigma^2_{xy} & \sigma^2_{y} \end{pmatrix} = \begin{pmatrix} 0.55 & 0.45 \\ 0.45 & 0.55 \end{pmatrix} \]

WNV phylo-geography

Phylogeny + Geography = Phylo-geography

Phylo-geography

Phylo-: viral tree

-geography: on a map

Phylo-geography

Phylo-: viral tree

-geography: on a map

Phylo-geography

Phylo-: viral tree

-geography: on a map

Brownian Motion on a Tree

The trait evolves like a BM in time
Speciation \(\to\) two independent processes
Only tip values are measured

Brownian Motion on a Tree

SDE: \(d X_t = \sigma d W_t\)
Heredity: \(X_8 | X_7 \sim \mathcal{N}(X_7, \sigma^2 t_8)\)
Root: \(X_{\rho} = \mu\)

Variance Structure

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid

Variance Structure

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \mathbf{V}(X_8) + \sigma^2 t_9 \]

Variance Structure

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \mathbf{V}(X_8) + \sigma^2 t_9 = \sigma^2 t_8 + \sigma^2 t_9 \]

Variance Structure

Variance Structure

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid \[ \mathbf{V}(X_9) = \sigma^2 V_{9} \] \[ \mathbf{C}(Y_4, Y_5) = \mathbf{V}(X_9) = \sigma^2 V_{9} = \sigma^2 V_{45} \]

Variance Structure

Structure: \(X_i = X_{\text{pa}(i)} + \sigma \sqrt{t_{i}} \times \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 1)\) iid

Covariances: \(\mathbf{C}(X_i, X_j) = \sigma^2 V_{ij}\)

Distribution: \(\mathbf{X} \sim \mathcal{N}(\mu\mathbf{1}_n, \sigma^2 \mathbf{V})\) \(\to\) Multivariate Gaussian

Maximum Likelihood

\[ \mathbf{Y} \sim \mathcal{N}(\mu\mathbf{1}_n, \sigma^2 \mathbf{V}) \]

\[ \hat{\mu} = (\mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{1}_n)^{-1} \mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{Y} \\ \hat{\sigma}^2 = \frac{1}{n-1} (\mathbf{Y} - \hat{\mu}\mathbf{1}_n)^T \mathbf{V}^{-1} (\mathbf{Y} - \hat{\mu}\mathbf{1}_n) \]

Ancestral State Reconstruction

\[ \begin{pmatrix} \mathbf{Z}\\ \mathbf{Y} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu\mathbf{1}_m\\ \mu\mathbf{1}_n \end{pmatrix} , \sigma^2 \begin{pmatrix} \mathbf{V}_{ZZ} & \mathbf{V}_{ZY}\\ \mathbf{V}_{YZ} & \mathbf{V}_{YY} \end{pmatrix} \right) \]

Ancestral State Reconstruction

Conditional distribution:

\[ \mathbf{Z} \mid \{\mathbf{Y} = \mathbf{y}\} \sim \mathcal{N}\left( \bar{\boldsymbol{\mu}} , \sigma^2\bar{\mathbf{V}} \right) \]

\[ \bar{\boldsymbol{\mu}} = \mu\mathbf{1}_m + \mathbf{V}_{ZY} \mathbf{V}_{YY}^{-1} (\mathbf{y} - \mu\mathbf{1}_n) \quad \bar{\mathbf{V}} = \mathbf{V}_{ZZ} - \mathbf{V}_{ZY} \mathbf{V}_{YY}^{-1}\mathbf{V}_{YZ} \]

Bi-variate process

\(\mathbf{Y}_{i} = (Y_{i,1}, Y_{i,2})\)
\(\hookrightarrow\) \(\mathbf{Y}\) is a \(n\times 2\) matrix.
\(Y_{i,k}\): trait \(k\) of of node \(i\)

BM: \(\mathbf{X}_t \sim \mathcal{N}(\mathbf{X}_0, t \mathbf{\Sigma})\)
\(\mathbf{C}[Y_{i,k}, Y_{j,l}] = \Sigma_{k,l} \times V_{ij}\)

Maximum Likelihood

\[ \mathbf{Y} \sim \mathcal{MN}_{n,2}(\mathbf{1}_n\boldsymbol{\mu}^T, \mathbf{V}, \mathbf{\Sigma}) \]

\[ \hat{\boldsymbol{\mu}}^T = (\mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{1}_n)^{-1} \mathbf{1}_n^T \mathbf{V}^{-1} \mathbf{Y} \\ \hat{\mathbf{\Sigma}} = \frac{1}{n-1} (\mathbf{Y} - \mathbf{1}_n\hat{\boldsymbol{\mu}}^T)^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{1}_n\hat{\boldsymbol{\mu}}^T) \]

Ancestral State Reconstruction

Vectorized version: \[ \begin{pmatrix} \mathbf{Y}_{\cdot,1}\\ \mathbf{Y}_{\cdot,2} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1\mathbf{1}_n\\ \mu_2\mathbf{1}_n \end{pmatrix} , \begin{pmatrix} \Sigma_{11}\mathbf{V}_{YY} & \Sigma_{12}\mathbf{V}_{YY}\\ \Sigma_{21}\mathbf{V}_{YY} & \Sigma_{22}\mathbf{V}_{YY} \end{pmatrix} \right) \]

With ancestral states: \[ \begin{pmatrix} \mathbf{Z}_{\cdot,1}\\ \mathbf{Z}_{\cdot,2}\\ \mathbf{Y}_{\cdot,1}\\ \mathbf{Y}_{\cdot,2} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1\mathbf{1}_m\\ \mu_2\mathbf{1}_m\\ \mu_1\mathbf{1}_n\\ \mu_2\mathbf{1}_n \end{pmatrix} , \begin{pmatrix} \Sigma_{11}\mathbf{V}_{ZZ} & \Sigma_{12}\mathbf{V}_{ZZ} & \Sigma_{11}\mathbf{V}_{ZY} & \Sigma_{12}\mathbf{V}_{ZY}\\ \Sigma_{21}\mathbf{V}_{ZZ} & \Sigma_{22}\mathbf{V}_{ZZ} & \Sigma_{21}\mathbf{V}_{ZY} & \Sigma_{22}\mathbf{V}_{ZY}\\ \Sigma_{11}\mathbf{V}_{YZ} & \Sigma_{12}\mathbf{V}_{YZ} & \Sigma_{11}\mathbf{V}_{YY} & \Sigma_{12}\mathbf{V}_{YY}\\ \Sigma_{21}\mathbf{V}_{YZ} & \Sigma_{22}\mathbf{V}_{YZ} & \Sigma_{21}\mathbf{V}_{YY} & \Sigma_{22}\mathbf{V}_{YY} \end{pmatrix} \right) \]

Ancestral State Reconstruction

\[ \begin{pmatrix} \mathbf{Z}\\ \mathbf{Y} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbf{m}_Z\\ \mathbf{m}_Y \end{pmatrix} , \begin{pmatrix} \mathbf{S}_{ZZ} & \mathbf{S}_{ZY}\\ \mathbf{S}_{YZ} & \mathbf{S}_{YY} \end{pmatrix} \right) \]

Conditional distribution:

\[ \mathbf{Z} \mid \{\mathbf{Y} = \mathbf{y}\} \sim \mathcal{N}\left( \bar{\mathbf{m}} , \bar{\mathbf{S}} \right) \]

\[ \bar{\mathbf{m}} = \mathbf{m}_Z + \mathbf{S}_{ZY} \mathbf{S}_{YY}^{-1} (\mathbf{y} - \mathbf{m}_Y) \quad \bar{\mathbf{S}} = \mathbf{S}_{ZZ} - \mathbf{S}_{ZY} \mathbf{S}_{YY}^{-1}\mathbf{S}_{YZ} \]

WNV Phylo-geography

Beyond the Brownian Motion

Joint Bayesian inference

Joint posterior: \[ p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \mid \mathbf{Y}, \mathbf{S} \right) \propto p\left(\mathbf{Y}, \mathbf{S} \mid \boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \right) p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \right) \]

Assumption: \(\mathbf{Y}\) and \(\mathbf{S}\) independent conditionally on \(\mathcal{T}\).

\[ p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \mid \mathbf{Y}, \mathbf{S} \right) \propto p\left(\mathbf{Y} \mid \boldsymbol{\theta}, \mathcal{T} \right) p\left(\mathbf{S} \mid \mathcal{T}, \boldsymbol{\psi} \right) p\left(\boldsymbol{\theta}, \mathcal{T}, \boldsymbol{\psi} \right) \]

\(\to\) Run a MCMC, updating parameters sequentially.

Ressources

Wikipedia Viral phylodynamics (en).
Lecture on viral phylodynamics by Trevor Bedford (en).
Nextstrain.
EvoLaps.
Modèles et méthodes pour l’évolution biologique, éditeurs Gilles Didier et Stéphane Guindon, ISTE éditions (2022).