PML 10. Normalizing Flows (for Simulation-Based Inference)
Probabilistic Machine Learning Reading Group
Roi Naveiro
March 25, 2026
CUNEF Universidad
The satellite would satisfy
\ddot{\mathbf r}(t) = -\frac{\mu}{\|\mathbf r(t)\|^3}\mathbf r(t).
To place a satellite in GEO, we would simply put it on the circular equatorial orbit that co-rotates with the Earth.
The ideal satellite orbit would then satisfy
\mathbf{r}^*(t)= \begin{bmatrix} a\cos(\lambda_0+\omega t)\\ a\sin(\lambda_0+\omega t)\\ 0 \end{bmatrix}, \qquad \omega=\sqrt{\frac{\mu}{a^3}}.
with corresponding target velocity \mathbf{v}^*(t).
The dynamics are affected by more than central gravity:
\mathbf a(\mathbf r(t),t) \simeq \mathbf a_{\mathrm{grav}} + \mathbf a_{J_2} + \mathbf a_{3B} + \mathbf a_{R}.
So we need a controller that applies corrective acceleration \mathbf a_C to keep the satellite near the target orbit.
\mathbf a(\mathbf s(t),t,\boldsymbol\phi) \simeq \mathbf a_{\mathrm{grav}} + \mathbf a_{J_2} + \mathbf a_{3B} + \mathbf a_{R} + \textcolor{red}{\mathbf a_{C}(\mathbf s(t),t,\boldsymbol\phi)}.
\mathbf s(t)= \begin{bmatrix} \mathbf r(t)\\ \mathbf v(t) \end{bmatrix} \in\mathbb R^6.
Let
\Delta \mathbf r_t=\mathbf r(t)-\mathbf r^*(t), \qquad \Delta \mathbf v_t=\mathbf v(t)-\mathbf v^*(t).
Then a simple proportional-derivative policy is
\mathbf a_C^{\mathrm{PD}}(\mathbf s_t,t,\boldsymbol\phi) = -K_p\,\Delta \mathbf r_t -K_d\,\Delta \mathbf v_t, \qquad \boldsymbol\phi=(K_p,K_d).
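The PD policy above can be written out in a few lines. A minimal sketch; the physical constants are rounded GEO values and the gains are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Illustrative constants (rounded GEO values; treat as assumptions)
MU = 3.986e14                    # Earth's gravitational parameter [m^3/s^2]
A_GEO = 4.2164e7                 # GEO orbit radius a [m]
OMEGA = np.sqrt(MU / A_GEO**3)   # mean motion omega = sqrt(mu / a^3) [rad/s]

def target_state(t, lam0=0.0):
    """Ideal GEO position r*(t) and velocity v*(t) in the equatorial plane."""
    ang = lam0 + OMEGA * t
    r_star = A_GEO * np.array([np.cos(ang), np.sin(ang), 0.0])
    v_star = A_GEO * OMEGA * np.array([-np.sin(ang), np.cos(ang), 0.0])
    return r_star, v_star

def pd_control(s, t, Kp, Kd):
    """PD policy a_C = -Kp*(r - r*) - Kd*(v - v*), with state s = (r, v)."""
    r, v = s[:3], s[3:]
    r_star, v_star = target_state(t)
    return -Kp * (r - r_star) - Kd * (v - v_star)
```

With a pure position offset from the target, the control acceleration points straight back toward the ideal orbit, scaled by K_p.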
Suppose an observer collects a trajectory over an observation window:
\mathcal D=\{ \mathbf s(t_j),t_j\}_{j=1}^J,
where
\mathbf s(t_j)= \big[x(t_j),y(t_j),z(t_j),\dot x(t_j),\dot y(t_j),\dot z(t_j)\big]^\top.
Can we infer the controller parameters \boldsymbol\phi from the observed orbit?
The observer models the satellite with the SDE
d\mathbf s(t) = \mathbf f(\mathbf s(t),t,\boldsymbol\phi)\,dt + \mathbf G\,d\mathbf W(t),
where
\mathbf f(\mathbf s(t),t,\boldsymbol\phi) = \begin{bmatrix} \mathbf v(t)\\ \mathbf a(\mathbf s(t),t,\boldsymbol\phi) \end{bmatrix}.
The acceleration is
\mathbf a(\mathbf s(t),t,\boldsymbol\phi) = \mathbf a_{\mathrm{grav}} + \mathbf a_{J_2} + \mathbf a_{3B} + \mathbf a_R + \mathbf a_C(\mathbf s(t),t,\boldsymbol\phi).
Once \mathcal D is observed, Bayesian inference targets
p(\boldsymbol\phi\mid \mathcal D) \propto p(\mathcal D\mid \boldsymbol\phi)\,p(\boldsymbol\phi).
We do not have an explicit formula for the likelihood of the observed path:
p\big(\mathbf s(t_1),\mathbf s(t_2),\ldots,\mathbf s(t_J)\mid \boldsymbol\phi\big).
But we can sample trajectories from the model: \mathbf s(t) \sim p(\cdot\mid \boldsymbol\phi).
Numerically solving the SDE, we can generate as many trajectories as we want for any given \boldsymbol\phi.
Keep only the observed times.
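The simulator can be sketched with a plain Euler–Maruyama scheme. For brevity the drift below keeps only central gravity plus the control term (the J_2, third-body, and radiation terms are omitted); the constant and function names are assumptions:

```python
import numpy as np

MU = 3.986e14  # Earth's gravitational parameter [m^3/s^2] (assumed value)

def drift(s, t, phi, a_control):
    """f(s, t, phi) = (v, a(s, t, phi)); only central gravity + control here."""
    r, v = s[:3], s[3:]
    a = -MU * r / np.linalg.norm(r)**3 + a_control(s, t, phi)
    return np.concatenate([v, a])

def simulate(s0, phi, a_control, t_obs, dt=1.0, sigma=1e-6, rng=None):
    """Euler-Maruyama on ds = f dt + G dW, with G acting on velocity only.
    Returns the state at the observed times t_obs (keep only observed times)."""
    rng = np.random.default_rng(rng)
    s, t = s0.copy(), 0.0
    out, k = [], 0
    while k < len(t_obs):
        if t >= t_obs[k]:
            out.append(s.copy())
            k += 1
            continue
        dW = rng.normal(0.0, np.sqrt(dt), size=3)
        s = s + drift(s, t, phi, a_control) * dt \
              + np.concatenate([np.zeros(3), sigma * dW])
        t += dt
    return np.array(out)
```

Calling `simulate` repeatedly with different \boldsymbol\phi generates as many synthetic data sets \mathcal D as we want.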
SBI notation:
\theta := \boldsymbol\phi, \qquad x := \big(\mathbf s(t_1),\ldots,\mathbf s(t_J)\big).
We can simulate from the prior and evaluate its density: \theta\sim p(\theta),
We can simulate from the likelihood but not evaluate its density: x\sim p(x\mid \theta).
p(\theta\mid x_o) \propto p(x_o\mid \theta)\,p(\theta).
\theta_i\sim p(\theta), \qquad x_i\sim p(x\mid \theta_i).
Training minimizes the expected negative log density over simulated pairs:
\hat\phi \approx \arg\min_\phi \mathbb E_{p(\theta,x)}[-\log q_\phi(\theta\mid x)].
Minimizing this loss is equivalent to minimizing
\mathbb E_{p(x)} \Big[ \mathrm{KL}\!\big(p(\theta\mid x)\,\|\,q_\phi(\theta\mid x)\big) \Big] + C,
where C does not depend on \phi.
Hence, at the optimum, q_\phi(\theta\mid x)\approx p(\theta\mid x).
Evaluating the trained model at the observed data then gives the posterior approximation q_\phi(\theta\mid x_o).
Recall that for training we need pairs (\theta_i,x_i) where \theta_i\sim p(\theta) and x_i\sim p(x\mid \theta_i).
Round 1 samples from the prior: \tilde p_1(\theta)=p(\theta).
Train a first approximation q_{\phi_1}(\theta\mid x).
Use the current approximation at the observed data as a proposal: \tilde p_r(\theta)\approx q_{\phi_{r-1}}(\theta\mid x_o).
Then simulate new pairs \theta_i\sim \tilde p_r(\theta), \qquad x_i\sim p(x\mid \theta_i), and retrain on them.
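The round structure can be sketched on a toy 1D problem. Here a Gaussian q with a linear mean stands in for the flow (least squares minimizes the same empirical -log q objective for this family), the simulator is hypothetical, and the sketch deliberately omits the proposal correction that actual sequential NPE algorithms apply once the proposal is no longer the prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta):
    """Hypothetical toy simulator: x | theta ~ N(theta, 0.5^2)."""
    return theta + 0.5 * rng.normal(size=theta.shape)

def fit_gaussian_q(theta, x):
    """q_phi(theta | x) = N(w*x + b, s^2), fit by least squares,
    which minimizes the empirical -log q for this Gaussian family."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, theta, rcond=None)
    s = (theta - (w * x + b)).std()
    return w, b, s

x_o = 1.0                              # "observed" data
prop_mu, prop_sd = 0.0, 1.0            # round 1 proposal = prior N(0, 1)
history = []
for r in range(3):
    theta = prop_mu + prop_sd * rng.normal(size=5000)  # theta_i ~ p~_r(theta)
    x = simulate(theta)                                # x_i ~ p(x | theta_i)
    w, b, s = fit_gaussian_q(theta, x)                 # "train" q_{phi_r}
    prop_mu, prop_sd = w * x_o + b, s  # next proposal: q_{phi_r}(. | x_o)
    history.append((prop_mu, prop_sd))
```

For this conjugate toy model the exact posterior after round 1 is N(0.8, 0.2), so the fitted (mean, sd) at x_o should land near (0.8, 0.447); later rounds concentrate the proposal around x_o.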
We need a model q_\phi(\theta\mid x) that lets us:
Sample easily (for later rounds): \theta \sim q_\phi(\theta\mid x)
Evaluate densities exactly (for training each round): \log q_\phi(\theta\mid x)
and that is flexible enough to learn complicated posteriors.
Say we want to model a complex density p_x(x) on \mathbb R^D.
Start from a simple base density u \sim p_u(u), e.g. a standard Gaussian.
Apply an invertible map f:\mathbb R^D\to\mathbb R^D.
The transformed variable x=f(u) now has a richer density.
u\sim p_u(u), \qquad x=f(u).
With x = f(u) and inverse g(x)=f^{-1}(x)=u, the change-of-variables formula gives
p_x(x) = p_u(u)\left|\det J(f)(u)\right|^{-1}.
Therefore
\log p_x(x)=\log p_u(u)-\log\left|\det J(f)(u)\right|.
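A quick numerical sanity check of the change-of-variables formula, using an affine map x = a\,u + b of a standard Gaussian, for which the exact transformed density N(b, a^2) is known in closed form:

```python
import numpy as np

def log_normal(x, mu=0.0, sigma=1.0):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma)**2

# Flow: x = f(u) = a*u + b, so J(f)(u) = a and log|det J(f)(u)| = log|a|.
a, b = 2.0, 1.0
u = 0.7
x = a * u + b

# Change of variables: log p_x(x) = log p_u(u) - log|det J(f)(u)|
log_px_flow = log_normal(u) - np.log(abs(a))

# Sanity check: u ~ N(0,1) and x = a*u + b imply x ~ N(b, a^2)
log_px_exact = log_normal(x, mu=b, sigma=abs(a))
```

The two log densities agree exactly, which is the content of the formula above for this one-dimensional affine flow.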
If f_1,\dots,f_N are invertible maps then f = f_N \circ \cdots \circ f_1,
is also invertible, with inverse g = f^{-1} = f_1^{-1}\circ\cdots\circ f_N^{-1}. Writing g_i = f_i^{-1} and letting u_i denote the intermediate values of the inverse pass, the log-determinants add along the chain:
\log \left|\det J(g)(x)\right| = \sum_{i=1}^N \log \left|\det J(g_i)(u_i)\right|.
Given data x_1,\dots,x_N, we fit the flow by maximum likelihood:
\max_\omega \sum_{n=1}^N \log p_\omega(x_n).
\log p_\omega(x) = \log p_u\!\big(f_\omega^{-1}(x)\big) + \log\left|\det J\!\big(f_\omega^{-1}\big)(x)\right|.
We can learn the parameters \omega by gradient ascent on the exact log likelihood, provided we can compute f_\omega^{-1}(x) and evaluate \log\left|\det J\!\big(f_\omega^{-1}\big)(x)\right| efficiently.
And everything must remain differentiable in \omega.
For sampling, in addition to the above, we also need to evaluate the forward map f_\omega(u) efficiently.
How do we propose normalizing flows that satisfy the above properties?
f = f_N \circ \cdots \circ f_1,
f(u)=\big(h_1(u_1),\dots,h_D(u_D)\big).
\log |\det J(f)(u)| = \sum_{i=1}^D \log \left|h_i'(u_i)\right|.
Partition the input as u=(u_A,u_B) and define
\begin{aligned} x_A &= \hat f\big(u_A;\eta(u_B)\big) \\ x_B &= u_B. \end{aligned}
u_B = x_B, \qquad u_A = \hat f^{-1}\big(x_A;\eta(x_B)\big).
J(f)= \begin{bmatrix} J(\hat f) & * \\ 0 & I \end{bmatrix}
so
\det J(f)=\det J(\hat f).
If C_\eta denotes the cost of one conditioner-network evaluation, and |A| is the size of the transformed block, then both directions need only a single conditioner pass: forward and inverse each cost O(C_\eta + |A|).
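A toy affine coupling layer with D = 4, where a fixed nonlinearity stands in for the conditioner network \eta (an assumption for illustration). Note that the inverse never needs to invert \eta, only \hat f:

```python
import numpy as np

def conditioner(u_B):
    """Toy conditioner eta(u_B): returns (log_scale, shift) for the A block.
    In practice this is a neural network; a fixed nonlinearity suffices here."""
    log_s = np.tanh(u_B)           # bounded log-scales keep the map invertible
    t = np.sin(u_B)
    return log_s, t

def coupling_forward(u):
    """x_A = u_A * exp(log_s) + t with (log_s, t) = eta(u_B); x_B = u_B."""
    u_A, u_B = u[:2], u[2:]
    log_s, t = conditioner(u_B)
    x_A = u_A * np.exp(log_s) + t
    log_det = np.sum(log_s)        # det J(f) = det J(f_hat) = prod_i exp(log_s_i)
    return np.concatenate([x_A, u_B]), log_det

def coupling_inverse(x):
    """u_B = x_B and u_A = (x_A - t) * exp(-log_s); eta is re-evaluated at x_B."""
    x_A, x_B = x[:2], x[2:]
    log_s, t = conditioner(x_B)
    u_A = (x_A - t) * np.exp(-log_s)
    return np.concatenate([u_A, x_B])
```

Both directions call the conditioner exactly once, which is the single-pass property claimed above.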
Define
x_i = h\big(u_i;\eta_i(x_{1:i-1})\big), \qquad i=1,\dots,D.
Then the Jacobian is triangular:
\frac{\partial x_i}{\partial u_j}=0 \qquad \text{for } j>i,
u_i = h^{-1}\big(x_i;\eta_i(x_{1:i-1})\big), \qquad i=1,\dots,D.
\log \left|\det J\!\big(f^{-1}\big)(x)\right| = \sum_{i=1}^D \log \left| \frac{\partial}{\partial x_i} h^{-1}\big(x_i;\eta_i(x_{1:i-1})\big) \right|.
For sampling, each x_i requires x_{1:i-1}, so generation is inherently sequential:
x_i = h\big(u_i;\eta_i(x_{1:i-1})\big), \qquad i=1,\dots,D.
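A toy affine autoregressive transform, with a fixed conditioner in place of a masked network (an assumption for illustration). The forward (sampling) pass is sequential, while the inverse (density) pass depends only on the observed x:

```python
import numpy as np

def eta(x_prev):
    """Toy conditioner eta_i(x_{1:i-1}) -> (log_scale, shift); a NN in practice."""
    if len(x_prev) == 0:
        return 0.0, 0.0
    return np.tanh(np.sum(x_prev)), np.mean(x_prev)

def ar_forward(u):
    """Sampling direction: sequential, since each x_i needs x_{1:i-1}."""
    x = np.zeros_like(u)
    for i in range(len(u)):
        log_s, t = eta(x[:i])
        x[i] = u[i] * np.exp(log_s) + t
    return x

def ar_inverse(x):
    """Density direction: every u_i depends only on the observed x, so all
    conditioners can be evaluated in one pass (parallel with a masked net)."""
    u = np.zeros_like(x)
    log_det_inv = 0.0
    for i in range(len(x)):
        log_s, t = eta(x[:i])
        u[i] = (x[i] - t) * np.exp(-log_s)
        log_det_inv += -log_s       # d u_i / d x_i = exp(-log_s)
    return u, log_det_inv
```

The triangular Jacobian structure shows up directly: the log-determinant is just the sum of the diagonal terms accumulated in `ar_inverse`.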
Notebook 1
Notebook 2 satellite_sbi/notebooks/minimal_sbi_satellite_joint_demo.ipynb