Primers • Flow Matching Models
- 1 Overview and Working Definition
- 1.1 Conceptual Foundations
- 1.2 Conservation of Probability (Continuity Equation)
- 1.3 Interpolation Path Between Noise and Data
- 1.4 Conditional Flow Matching (CFM) Loss
- 1.5 Rectified Flow (Straight-Line Paths)
- 1.6 Diffusion vs Flow Matching (Conceptual Contrast)
- 1.7 Working Definition
- 1.2 Probability Paths and Flow Parameterization
- 1.3 Conditional Flow Matching and Noise-Parameterized Loss
- 2. Flow Trajectories
- 2.7 Tailored SNR Samplers for Rectified Flow Models
- 3. Architecture – MM-DiT Backbone and Training System
- 4. High-Resolution Finetuning and Scaling Behavior
- 5. Training and Evaluation
- 6. Practical Implementation Checklist
- References
1 Overview and Working Definition
Flow Matching (FM) is a generative-modeling framework that learns a deterministic, continuous-time process transforming samples from a simple base distribution (such as Gaussian noise) into samples from a complex data distribution (such as natural images). Instead of learning a stochastic denoising process like diffusion models, FM learns a velocity field that defines how each point in latent space moves along a flow trajectory between noise and data. The process is deterministic and can be integrated efficiently using ordinary differential-equation (ODE) solvers.
1.1 Conceptual Foundations
Given
- a data distribution $p_0(x)$ representing real data, and
- a noise distribution $p_1(x)=\mathcal N(0,I)$,
Flow Matching defines a continuous family of intermediate distributions $\{p_t(x)\}_{t\in[0,1]}$ that interpolate between them. Each sample evolves according to an ODE driven by a learnable velocity field $v_\Theta(x,t)$:
\[\frac{dx_t}{dt}=v_\Theta(x_t,t)\]Explanation
- $x_t$ – the current sample (in latent or data space) at “time” $t$.
- $v_\Theta(x_t,t)$ – a neural network predicting the instantaneous direction and speed (velocity).
- The equation means the rate of change of the sample equals the model’s predicted velocity. Integrating this ODE backward from noise ($t=1$) to data ($t=0$) generates samples.
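To make the sampling mechanics concrete, here is a minimal sketch of Euler integration of this ODE. The function name `velocity_model` is a placeholder for any trained $v_\Theta$; nothing here is tied to a specific codebase beyond standard PyTorch calls.

```python
import torch

@torch.no_grad()
def sample_euler(velocity_model, shape, num_steps=10, device="cpu"):
    """Integrate dx/dt = v_theta(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = torch.randn(shape, device=device)                 # start from Gaussian noise at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), float(t), device=device)
        v = velocity_model(x, t_batch)                    # predicted velocity at the current time
        x = x + (t_next - t) * v                          # Euler step (dt is negative going 1 -> 0)
    return x
```

Higher-order solvers (Heun, Runge–Kutta) follow the same pattern with extra velocity evaluations per step; a Heun variant is sketched in Section 6.6.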
1.2 Conservation of Probability (Continuity Equation)
To ensure total probability is conserved as points move through space, the learned flow must satisfy the continuity equation:
\[\frac{\partial p_t(x)}{\partial t}+\nabla_x\cdot\left(p_t(x)\,v_\Theta(x,t)\right)=0\]Explanation
- The first term $\tfrac{\partial p_t(x)}{\partial t}$ measures how the density at $x$ changes with time.
- The divergence $\nabla_x\cdot\left(p_t(x)\,v_\Theta(x,t)\right)$ measures how probability mass flows in or out of $x$.
- The sum being zero enforces conservation: probability only moves—it is neither created nor destroyed.
This property guarantees that integrating the ODE transports the full distribution $p_t$ correctly from noise to data.
1.3 Interpolation Path Between Noise and Data
FM defines a smooth interpolation path parameterized by scalar coefficients $a_t$ and $b_t$:
\[z_t=a_t x_0+b_t\varepsilon,\qquad x_0\sim p_0,\,\varepsilon\sim\mathcal N(0,I)\]Explanation
- $z_t$ – intermediate latent between pure data ($x_0$) and pure noise ($\varepsilon$).
- $a_t,b_t$ – time-dependent scalars controlling how much data vs. noise is mixed.
- $a_0{=}1,b_0{=}0\Rightarrow z_0{=}x_0$; $a_1{=}0,b_1{=}1\Rightarrow z_1{=}\varepsilon$.
Thus, as $t$ increases from 0 to 1, the sample transitions smoothly from data to noise.
1.4 Conditional Flow Matching (CFM) Loss
Directly solving the ODE during training would be computationally expensive. Instead, we regress the predicted velocity field against an analytically known conditional velocity. The Conditional Flow Matching objective from Lipman et al., 2023 is:
\[\mathcal L_{\text{CFM}}=\mathbb E_{t,x_0,\varepsilon}\left[\|v_\Theta(z_t,t)-u_t(z_t\mid\varepsilon)\|_2^2\right]\]Explanation
- $v_\Theta(z_t,t)$: the model’s predicted velocity.
- $u_t(z_t\mid\varepsilon)$: the exact conditional velocity computed from the interpolation path.
- The expectation is taken over random timesteps $t$, data samples $x_0$, and noise $\varepsilon$. Minimizing this loss trains the network so that its predicted velocity field reproduces the true transport dynamics.
1.5 Rectified Flow (Straight-Line Paths)
Rectified Flow (RF) simplifies the interpolation by using straight-line paths in latent space:
\[z_t=(1-t)x_0+t\varepsilon\]Explanation
- (t=0): pure data; (t=1): pure noise.
- Samples move linearly between (x_0) and (\varepsilon).
The corresponding target velocity is constant:
\[\frac{dz_t}{dt}=\varepsilon-x_0\]Explanation
- Each pair ((x_0,\varepsilon)) defines a fixed direction in latent space.
- During training, the model learns to reproduce this constant velocity. This straight-line formulation enables simple supervision, fast convergence, and efficient sampling.
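As a concrete illustration of this supervision, here is a minimal PyTorch sketch of one Rectified Flow loss evaluation. `model` is a placeholder for any network that takes $(z_t, t)$ and returns a velocity with the same shape as $z_t$; timesteps are drawn uniformly here only for simplicity (Section 2.7 discusses better choices).

```python
import torch

def rectified_flow_loss(model, x0):
    """Conditional Flow Matching loss with the straight-line (Rectified Flow) path."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)            # uniform timesteps; see Sec. 2.7 for biased samplers
    eps = torch.randn_like(x0)                     # Gaussian noise
    t_exp = t.view(b, *([1] * (x0.dim() - 1)))     # broadcast t over non-batch dimensions
    z_t = (1 - t_exp) * x0 + t_exp * eps           # straight-line interpolant
    target = eps - x0                              # constant conditional velocity along the line
    return ((model(z_t, t) - target) ** 2).mean()  # mean-squared regression onto the target
```

A standard training step then simply calls `rectified_flow_loss(model, batch).backward()` followed by an optimizer update.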
1.6 Diffusion vs Flow Matching (Conceptual Contrast)
| Aspect | Diffusion Models | Flow Matching Models |
|---|---|---|
| Dynamics | Reverse-time stochastic differential equation (SDE) | Deterministic ODE |
| Training target | Predict noise ($\varepsilon$) | Predict velocity ($v_\Theta$) |
| Sampling | Hundreds of small denoising steps | Few adaptive ODE steps ($\approx 5$–$15$) |
| Randomness | Stochastic integration | Deterministic integration |
| Speed | Slow | Fast |
| Theory | Score matching | Continuity equation |
Flow Matching thus provides a deterministic counterpart to diffusion models—retaining their strong supervision while drastically reducing sampling cost.
1.7 Working Definition
A Flow Matching model consists of:
- Base distribution $p_1(x)=\mathcal N(0,I)$.
- Target data distribution $p_0(x)$.
- Interpolation path $z_t=a_t x_0+b_t\varepsilon$.
- Neural velocity field $v_\Theta(x,t)$ trained with $\mathcal L_{\text{CFM}}$.
- An ODE solver (Euler, Runge–Kutta, etc.) that integrates from $t=1$ to $t=0$ to generate data samples.
At inference, one simply samples Gaussian noise and integrates backward through the learned ODE to obtain realistic outputs efficiently and deterministically.
Flow Matching grew out of three earlier modeling ideas.
- Normalizing flows used exactly invertible mappings between noise and data but were limited by the need for invertibility. Flow Matching removed that restriction by using a continuous-time ODE to describe how samples move instead of a fixed invertible network.
- Diffusion models used noisy stochastic equations (SDEs) to turn data into noise and then learned to reverse that process by predicting scores (gradients of log-densities). Flow Matching replaced those noisy, many-step simulations with a single deterministic ODE that achieves the same effect.
- Stochastic interpolants and optimal-transport theory provided the mathematical bridge: they showed that moving probability mass smoothly under a learned velocity field satisfies the same continuity equation that governs diffusion, but without stochastic noise.
The result is a deterministic, probability-preserving generative process. Instead of “denoising” in hundreds of small random steps, Flow Matching integrates a smooth velocity field over time to transport Gaussian noise into realistic data efficiently (usually in 5–15 steps).
1.2 Probability Paths and Flow Parameterization
Flow Matching defines a probability path — a smooth sequence of distributions (p_t) connecting the data distribution (p_0) and a noise distribution (p_1). Each point on this path represents a “partially corrupted” version of the data, and the model learns a velocity field that describes how to move along that path over time.
1.2.1 Defining the Path
The path between data and noise is written as:
\[z_t = a_t x_0 + b_t \varepsilon\]Components explained:
- (z_t): a sample at an intermediate time (t \in [0,1]).
- (x_0): a clean data point (e.g., a real image) drawn from the data distribution (p_0).
- (\varepsilon): Gaussian noise sampled from (\mathcal N(0, I)).
- (a_t): a time-dependent scalar determining how much of the data contributes at time (t).
- (b_t): a time-dependent scalar determining how much noise contributes.
What it does: This equation describes a smooth interpolation between pure data and pure noise:
- At (t=0): (a_0 = 1, b_0 = 0 \Rightarrow z_0 = x_0) (the original data).
- At (t=1): (a_1 = 0, b_1 = 1 \Rightarrow z_1 = \varepsilon) (pure Gaussian noise).
So as (t) goes from 0 to 1, you move gradually from the data to the noise distribution.
1.2.2 Conditional Map
We can define a conditional map (a function that generates (z_t) given data and noise):
\[\psi_t(x_0|\varepsilon) = a_t x_0 + b_t \varepsilon\]Components explained:
- (\psi_t): the mapping function at time (t).
- $(x_0\mid\varepsilon)$: the conditioning bar indicates that the map depends on a specific noise sample $\varepsilon$.
What it does: (\psi_t) defines exactly how data (x_0) and noise (\varepsilon) are mixed at each time step. You can think of it as a “recipe” for producing intermediate samples between data and noise.
1.2.3 Conditional Velocity
We then define a velocity field, which tells us how fast and in what direction points move as (t) changes:
\[\frac{dz_t}{dt} = u_t(z_t|\varepsilon)\]Components explained:
- (\frac{dz_t}{dt}): the instantaneous rate of change of (z_t) with respect to time.
- $u_t(z_t\mid\varepsilon)$: the true (analytical) velocity of the sample $z_t$, given the fixed noise $\varepsilon$.
What it does: This is an ODE (ordinary differential equation) describing motion through latent space. It says that the sample (z_t) changes over time according to the velocity (u_t). Following this ODE from (t=1) (noise) back to (t=0) (data) reconstructs real samples.
1.2.4 Conditional Flow Matching Objective
Because the true velocity (u_t(z_t|\varepsilon)) can be computed analytically, we can train a neural network to predict it. The Conditional Flow Matching loss is:
\[\mathcal L_{\text{CFM}} = \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left[\|v_\Theta(z,t) - u_t(z|\varepsilon)\|_2^2\right]\]Components explained:
- (\mathbb E[\cdot]): expectation, i.e., average over time (t), intermediate samples (z_t), and noise (\varepsilon).
- (v_\Theta(z,t)): the velocity predicted by the neural network (parameterized by (\Theta)).
- $u_t(z\mid\varepsilon)$: the true conditional velocity derived analytically.
- $\|\cdot\|_2^2$: squared L2 norm (mean-squared error).
What it does: The loss trains the neural network so that its predicted velocity (v_\Theta) matches the known conditional velocity (u_t). This is equivalent to teaching the network to reproduce the correct physical motion of probability mass between noise and data.
1.2.5 Deriving the Conditional Velocity
We can compute the analytical form of (u_t(z_t|\varepsilon)). Differentiate the path (z_t = a_t x_0 + b_t \varepsilon) with respect to (t):
\[\frac{dz_t}{dt} = a_t' x_0 + b_t' \varepsilon\]To express this in terms of (z_t) and (\varepsilon), note that (x_0 = (z_t - b_t \varepsilon)/a_t). Substitute that back:
\[u_t(z_t|\varepsilon) = \frac{a_t'}{a_t} z_t + \varepsilon\, b_t \left(\frac{b_t'}{b_t} - \frac{a_t'}{a_t}\right)\]Components explained:
- $a_t'$, $b_t'$: time derivatives of the path coefficients $a_t$ and $b_t$.
- The first term $(a_t'/a_t)z_t$: scales the current position $z_t$ — a self-scaling term.
- The second term: adjusts direction based on how the noise and data contributions change with time.
What it does: This expression tells us exactly how the sample $z_t$ moves under the true probability flow defined by $a_t,b_t$. It combines “where we are” ($z_t$) and “how the mixing changes” ($a_t',b_t'$).
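The derivation above is easy to sanity-check numerically. The sketch below (plain NumPy, with the Rectified Flow coefficients $a_t = 1-t$, $b_t = t$ chosen purely as an example) verifies that the closed-form $u_t$ matches the direct derivative $a_t' x_0 + b_t'\varepsilon$, and that for this path it reduces to the constant velocity $\varepsilon - x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=4), rng.normal(size=4)   # toy "data" and noise vectors

t = 0.3
a, b, da, db = 1 - t, t, -1.0, 1.0                 # RF path: a_t = 1 - t, b_t = t

z_t = a * x0 + b * eps
u_direct = da * x0 + db * eps                               # dz_t/dt computed directly
u_closed = (da / a) * z_t + b * (db / b - da / a) * eps     # closed form in terms of z_t and eps

print(np.allclose(u_direct, u_closed))   # True
print(np.allclose(u_direct, eps - x0))   # True: the RF constant velocity
```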
1.2.6 Introducing the Log-SNR
To make training more stable, Flow Matching reparameterizes everything in terms of the log signal-to-noise ratio (SNR):
\[\lambda_t = \log \frac{a_t^2}{b_t^2}\]Components explained:
- (\lambda_t): log of the squared ratio of the signal weight ((a_t)) to the noise weight ((b_t)).
- If (a_t) is large and (b_t) small, the SNR is high (mostly data).
- If (a_t) is small and (b_t) large, the SNR is low (mostly noise).
What it does: This provides a convenient, schedule-independent way to express how much information (signal) vs. randomness (noise) is present at each timestep.
Taking the derivative:
\[\lambda_t' = 2\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\]Explanation: This describes how quickly the signal-to-noise balance changes with time. It links the evolution of (a_t) and (b_t) directly to the rate of change in SNR.
1.2.7 Simplified Velocity in SNR Form
Substituting $\lambda_t'$ into the earlier equation simplifies it to:
\[u_t(z_t|\varepsilon) = \frac{a_t'}{a_t} z_t - \frac{b_t}{2}\lambda_t' \varepsilon\]Components explained:
- First term: scaling of (z_t), showing how samples stretch or contract as time evolves.
- Second term: a direction vector proportional to noise (\varepsilon), controlling how the model injects or removes randomness as it moves toward data.
What it does: This compact form defines the exact target velocity the network should learn — one part controlling scale, one part controlling noise removal.
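The same numeric check works for the log-SNR form: with $\lambda_t' = 2(a_t'/a_t - b_t'/b_t)$, the expression above should coincide with the direct derivative. A short NumPy sketch, again using the Rectified Flow coefficients only as an example:

```python
import numpy as np

rng = np.random.default_rng(1)
x0, eps = rng.normal(size=4), rng.normal(size=4)

t = 0.3
a, b, da, db = 1 - t, t, -1.0, 1.0          # example coefficients (Rectified Flow)

z_t = a * x0 + b * eps
dlam = 2 * (da / a - db / b)                # derivative of lambda_t = log(a_t^2 / b_t^2)

u_snr = (da / a) * z_t - 0.5 * b * dlam * eps   # log-SNR form of the conditional velocity
u_direct = da * x0 + db * eps

print(np.allclose(u_snr, u_direct))         # True
```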
1.2.8 Noise-Prediction Equivalent
You can also express the same training objective as a noise-prediction task, similar to diffusion models:
\[\mathcal L_{\text{CFM}} = \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left[ \left(-\frac{b_t}{2}\lambda_t'\right)^{2} \|\varepsilon_\Theta(z,t)-\varepsilon\|_2^2 \right]\]Components explained:
- $\varepsilon_\Theta(z,t)$: noise predicted by the neural network.
- $\varepsilon$: true Gaussian noise.
- The weight $\left(-\frac{b_t}{2}\lambda_t'\right)^2$: adjusts loss importance across time.
What it does: This formulation shows that Flow Matching and diffusion models share a common core — both can be viewed as predicting noise. The difference is that Flow Matching’s dynamics are deterministic, and it doesn’t rely on stochastic sampling during inference.
1.2.9 Summary
- (a_t,b_t) define how data and noise mix over time.
- (\lambda_t) quantifies the signal-to-noise balance and its rate of change.
- (u_t) is the true conditional velocity field that ensures proper probability transport.
- (v_\Theta) is the learned velocity trained to approximate (u_t).
- The Conditional Flow Matching loss aligns them efficiently, yielding a deterministic ODE that can replace diffusion’s stochastic reverse process.
1.3 Conditional Flow Matching and Noise-Parameterized Loss
Flow Matching can be trained through an equivalent noise-prediction loss, which connects it directly to diffusion models but keeps the dynamics deterministic.
Step 1: Starting from the Conditional Flow Matching Loss
The base objective is
\[\mathcal L_{\text{CFM}} = \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left[\|v_\Theta(z,t) - u_t(z|\varepsilon)\|_2^2\right]\]Components explained:
- $\mathbb E_{t, p_t(z\mid\varepsilon), p(\varepsilon)}$: average over time $t$, latent sample $z_t$, and Gaussian noise $\varepsilon$.
- $v_\Theta(z,t)$: neural velocity field to be learned.
- $u_t(z\mid\varepsilon)$: analytically derived conditional velocity (the “true” one).
- The L2 distance measures how close the neural prediction is to the target velocity.
Step 2: Substitute the True Conditional Velocity
From the earlier derivation:
\[u_t(z_t|\varepsilon) = \frac{a_t'}{a_t} z_t - \frac{b_t}{2} \lambda_t' \varepsilon\]Components explained:
- $a_t'$: time derivative of $a_t$, showing how the data contribution changes over time.
- $b_t$: current noise weight.
- $\lambda_t'$: derivative of the log-SNR, controlling how the signal-to-noise balance evolves.
- The first term rescales the sample; the second subtracts noise at a rate governed by $\lambda_t'$.
What it does: This gives the ground-truth flow direction at every point in time.
Step 3: Plugging it into the Loss
Plug that into the previous loss:
\[\mathcal L_{\text{CFM}} = \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left\| v_\Theta(z,t) - \frac{a_t'}{a_t} z + \frac{b_t}{2}\lambda_t' \varepsilon \right\|_2^2\]Interpretation: The neural velocity field should predict a vector close to $\frac{a_t'}{a_t} z - \frac{b_t}{2}\lambda_t' \varepsilon$. That means the model must learn to rescale the latent vector and remove noise correctly.
Step 4: Define a Noise Predictor
To make this look more like a diffusion objective, define a new variable for predicted noise:
\[\varepsilon_\Theta(z,t) = -\frac{2}{\lambda_t' b_t} \left(v_\Theta(z,t) - \frac{a_t'}{a_t} z\right)\]Components explained:
- $\varepsilon_\Theta(z,t)$: network-predicted noise.
- $v_\Theta - \frac{a_t'}{a_t}z$: the “residual” part of the velocity after removing the self-scaling term.
- $-\frac{2}{\lambda_t' b_t}$: rescales this residual according to how strongly the noise evolves over time.
What it does: This defines a mapping between the neural velocity prediction and an equivalent noise prediction network. Thus, the same model can be interpreted either as predicting velocity or predicting noise.
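A tiny sketch of this change of variables, again using NumPy and the Rectified Flow coefficients as an illustrative path: plugging the exact conditional velocity into the mapping recovers the noise $\varepsilon$ exactly, which is what makes the two parameterizations interchangeable.

```python
import numpy as np

def eps_from_v(v, z, a, da, b, dlam):
    """Convert a velocity prediction into the equivalent noise prediction (Step 4)."""
    return -2.0 / (dlam * b) * (v - (da / a) * z)

rng = np.random.default_rng(2)
x0, eps = rng.normal(size=4), rng.normal(size=4)
t = 0.4
a, b, da, db = 1 - t, t, -1.0, 1.0
z = a * x0 + b * eps
dlam = 2 * (da / a - db / b)
u = (da / a) * z - 0.5 * b * dlam * eps          # exact conditional velocity

print(np.allclose(eps_from_v(u, z, a, da, b, dlam), eps))   # True
```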
Step 5: Reformulate the Loss
Substituting (\varepsilon_\Theta) into the objective gives
\[\mathcal L_{\text{CFM}} \propto \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left[ \left(-\frac{b_t}{2}\lambda_t'\right)^2 \|\varepsilon_\Theta(z,t) - \varepsilon\|_2^2 \right]\]Components explained:
- $\left(-\frac{b_t}{2}\lambda_t'\right)^2$: a weighting term adjusting the relative importance of different time steps.
- (\varepsilon_\Theta(z,t)): predicted noise at time (t).
- (\varepsilon): true Gaussian noise used to corrupt the sample.
What it does: Now the loss looks exactly like the standard diffusion noise-prediction objective — the network learns to recover the noise that was added, but through a deterministic flow framework.
Step 6: General Weighted Form
Following Kingma & Gao (2023), the generalized weighted version is written as:
\[\mathcal L_w(x_0) = -\frac{1}{2} \mathbb E_{t\sim U(0,1), \varepsilon\sim \mathcal N(0,I)} \left[ w_t \lambda_t' \|\varepsilon_\Theta(z_t, t) - \varepsilon\|_2^2 \right]\]Components explained:
- (w_t): time-dependent weight; controls how much each timestep contributes to training.
- (U(0,1)): uniform distribution over timesteps.
- The leading minus sign compensates for $\lambda_t'$ being negative (the SNR decreases as $t$ increases), keeping the loss positive and consistent with other likelihood-based formulations.
What it does: This equation unifies multiple generative modeling frameworks — diffusion, rectified flow, and EDM — under a single weighted noise-prediction loss form.
For the default Flow Matching case, (w_t = -\frac{1}{2}\lambda_t’ b_t^2).
Step 7: Conceptual Summary
- Loss meaning: Train the model so its velocity (or noise) predictions reproduce how samples should move from noise to data.
- Mathematical equivalence: Flow Matching and diffusion both learn noise removal, but Flow Matching uses deterministic ODEs rather than random SDEs.
- Weighting flexibility: By changing (a_t,b_t,\lambda_t’,w_t), you can reproduce known models like Rectified Flow, EDM, or DDPM as special cases.
2. Flow Trajectories
The goal of this section is to describe different forward processes ( z_t ) — the ways we interpolate between data and noise — used in various Flow Matching formulations. Each trajectory defines a particular path, weighting, and loss structure that influence how models learn and how efficiently they sample images.
2.1 Rectified Flow (RF)
Rectified Flow (RF), introduced by Liu et al. (2022) and further analyzed in Lipman et al. (2023), adopts a simple straight-line interpolation between the data and noise:
\[z_t = (1 - t)x_0 + t\varepsilon\]Explanation of terms:
- $z_t$: interpolated sample at time $t$.
- $x_0$: data sample.
- $\varepsilon$: Gaussian noise.
- $t$: interpolation variable between 0 and 1.
What it does: This creates a linear trajectory from the data point to the noise sample. The velocity along this path is constant:
\[\frac{dz_t}{dt} = \varepsilon - x_0\]So the target velocity is simply the difference between the noise and the data. The model directly learns this velocity by minimizing the Conditional Flow Matching loss.
The corresponding weighting function for this path is:
\[w_t^{RF} = \frac{t}{1 - t}\]Meaning: written as a noise-prediction loss, the straight-line path induces this time-dependent weighting (growing toward the noisier end of the path) rather than a uniform one; Section 2.7 shows how biasing the timestep distribution reshapes it to emphasize the harder middle region.
2.2 EDM (Elucidated Diffusion Models)
The EDM schedule, from Karras et al. (2022), is another popular trajectory:
\[z_t = x_0 + b_t \varepsilon\]Components explained:
- $b_t$: determines how much noise is added at time $t$.
- The data term $x_0$ remains fixed, and only noise amplitude $b_t$ varies.
The functional form of $b_t$ is chosen using the inverse CDF of a normal distribution:
\[b_t = \exp\left(F_N^{-1}(t; P_m, P_s^2)\right)\]where:
- $F_N^{-1}$ is the normal quantile function (inverse CDF).
- $P_m, P_s$ are mean and standard deviation parameters controlling the log-SNR schedule.
This design leads to a log-SNR distribution:
\[\lambda_t \sim \mathcal N(-2P_m, (2P_s)^2)\]Interpretation: This ensures that the log-SNR follows a normal distribution — a convenient way to control the rate of signal decay.
The corresponding weighting for EDM-style loss is:
\[w_t^{EDM} = \mathcal N(\lambda_t \mid -2P_m, (2P_s)^2) \cdot (e^{-\lambda_t} + 0.5^2)\]What it does: This weighting gives the model more emphasis on mid-range noise levels, improving quality for few-step samplers.
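A small sketch of this schedule, assuming SciPy is available for the normal quantile function; the values $P_m = -1.2$, $P_s = 1.2$ are illustrative defaults, not values prescribed by this primer.

```python
import numpy as np
from scipy.stats import norm

def edm_bt(t, p_mean=-1.2, p_std=1.2):
    """b_t = exp(F_N^{-1}(t; P_m, P_s^2)) for the EDM-style path (a_t = 1)."""
    t = np.clip(t, 1e-5, 1 - 1e-5)                 # stay away from the infinite tails of the quantile
    return np.exp(norm.ppf(t, loc=p_mean, scale=p_std))

t = np.linspace(0.05, 0.95, 5)
b_t = edm_bt(t)
lam = -2 * np.log(b_t)                             # log-SNR, since lambda_t = log(1 / b_t^2)
print(np.round(b_t, 3), np.round(lam, 3))
```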
2.3 Cosine Schedule
Inspired by Nichol & Dhariwal (2021), the cosine schedule defines a smooth sinusoidal transition between data and noise:
\[z_t = \cos\left(\frac{\pi t}{2}\right)x_0 + \sin\left(\frac{\pi t}{2}\right)\varepsilon\]Components explained:
- The cosine and sine terms ensure that the coefficients satisfy $a_t^2 + b_t^2 = 1$, so total energy is preserved.
- When $t=0$, $z_0 = x_0$; when $t=1$, $z_1 = \varepsilon$.
The cosine path is widely used in diffusion because it provides smooth noise transitions and balanced gradients.
Weightings:
- For epsilon-prediction (noise form): $w_t = \text{sech}(\lambda_t / 2)$.
- For velocity-prediction (Flow Matching form): $w_t = e^{-\lambda_t/2}$.
Interpretation: These exponential and hyperbolic-secant weightings control how much attention is given to early versus late timesteps, balancing training across noise levels.
2.4 Linear (DDPM / LDM) Schedule
The linear schedule (used in DDPM and LDM; Ho et al., 2020, Rombach et al., 2022) defines:
\[b_t = \sqrt{1 - a_t^2}\]with
\[a_t = \prod_{s=0}^t (1 - \beta_s)^{1/2}\]Parameters explained:
- $\beta_s$: discrete noise coefficients defining how much variance is added at each step.
- The linear DDPM schedule uses $\beta_t = \beta_0 + t \cdot \frac{\beta_{T-1} - \beta_0}{T-1}$.
- The LDM variant uses a square-root interpolation between $\beta_0$ and $\beta_{T-1}$, improving stability at high resolutions.
2.5 Summary and Comparison
| Schedule | Path Equation | Weighting (w_t) | Key Property |
|---|---|---|---|
| Rectified Flow | $z_t = (1 - t)x_0 + t\varepsilon$ | $t/(1 - t)$ | Straight-line trajectory, efficient training |
| EDM | $z_t = x_0 + b_t\varepsilon$ | Normal-weighted $\lambda_t$ | Smooth control of SNR; stable few-step sampling |
| Cosine | $z_t = \cos(\pi t/2)x_0 + \sin(\pi t/2)\varepsilon$ | $\text{sech}(\lambda_t/2)$ or $e^{-\lambda_t/2}$ | Smooth energy-preserving interpolation |
| Linear (DDPM/LDM) | $b_t = \sqrt{1 - a_t^2}$ | schedule-defined | Classic diffusion schedule, baseline reference |
2.6 Conceptual Takeaway
All these trajectories are special cases of the same Flow Matching framework — each defines a different probability path between data and noise. Rectified Flow’s straight-line path has the smallest curvature and hence the most efficient ODE trajectories, leading to faster and more stable image synthesis. Other schedules (EDM, cosine, linear) adjust weighting and curvature for specific trade-offs between sample quality, stability, and computational cost.
2.7 Tailored SNR Samplers for Rectified Flow Models
Although Rectified Flow (RF) models are trained with straight-line trajectories $z_t = (1 - t)x_0 + t\varepsilon$, the distribution over timesteps $t$ has a major impact on performance. This section introduces how modifying the sampling distribution of $t$ — instead of sampling uniformly — leads to better learning in the crucial middle region between noise and data.
Why Non-Uniform Timestep Sampling Helps
The velocity target for Rectified Flow is simple:
\[v_{\text{target}} = \varepsilon - x_0\]However, the difficulty of this regression task depends on the timestep $t$:
- At $t=0$: $z_t \approx$ data ($x_0$); the model easily predicts the mean of $p_1$ (noise distribution).
- At $t=1$: $z_t \approx$ noise ($\varepsilon$); the model easily predicts the mean of $p_0$ (data distribution).
- At midpoints ($t\approx0.5$): $z_t$ is a 50/50 blend of data and noise — the most ambiguous and hardest region to learn.
Uniform sampling over $t\in[0,1]$ underweights these middle regions. To address this, we bias the sampling density to focus more on mid-path points.
Mathematical View: Weighting Equivalence
If we replace uniform sampling (U(t)) with a custom distribution having density (\pi(t)), the expected loss changes equivalently to applying a weighted loss:
\[w_t^{(\pi)} = \frac{t}{1 - t} \pi(t)\]Explanation of terms:
- (w_t^{(\pi)}): effective weight applied to loss terms at timestep (t).
- (\pi(t)): the chosen probability density function for (t).
- (t/(1-t)): original Rectified Flow weighting (from its (w_t^{RF}) formulation).
What it does: This relationship means that instead of manually setting weights, we can simply sample t more often in important regions — the effect on the objective is equivalent.
Practical Sampling Distributions
Three main timestep samplers have been proposed to bias training toward the most informative regions:
1. Logit-Normal Distribution
\[\pi_{\text{ln}}(t; m, s) = \frac{1}{s\sqrt{2\pi}} \cdot \frac{1}{t(1 - t)} \exp\left(-\frac{(\operatorname{logit}(t) - m)^2}{2s^2}\right)\]Components explained:
- $\operatorname{logit}(t) = \log\frac{t}{1 - t}$: converts $t$ from $(0,1)$ to the full real line.
- (m): location parameter that shifts the distribution left/right.
- (s): scale parameter that controls spread (how strongly it emphasizes the middle).
What it does: This distribution increases density around intermediate (t)-values (e.g., 0.3–0.7), depending on (m,s). Typical parameters (m=0, s=1) focus strongly on the mid-region, leading to better coverage of ambiguous mixed samples.
2. Mode-Heavy Distribution (π_mode)
Alternative families of distributions are designed to keep nonzero density at the endpoints but still emphasize the center. These are obtained via mode-preserving mappings (f_{\text{mode}}) that control how sharply the midpoint is weighted.
The induced density is written as:
\[\pi_{\text{mode}}(t; s) = \left|\frac{d}{dt} f_{\text{mode}}^{-1}(t)\right|\]where (s) tunes how heavily probability mass is concentrated around the midpoint. When (s=0), the distribution reduces to uniform; larger (s) creates stronger central emphasis.
3. CosMap (Cosine Mapping) Distribution
Derived from cosine log-SNR schedules (used in diffusion models), this sampler defines:
\[t = f(u) = 1 - \frac{1}{\tan(\pi u/2) + 1}\]and its induced density matches the cosine SNR profile. It smoothly biases sampling toward the high-information mid-SNR region, mimicking the successful cosine schedule used in Nichol & Dhariwal (2021).
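Minimal sketches of two of these samplers (the logit-normal of item 1 and the CosMap of item 3); the function names are placeholders, and the defaults $m=0$, $s=1$ mirror the parameters quoted above.

```python
import math
import torch

def sample_t_logit_normal(batch_size, m=0.0, s=1.0):
    """Draw t in (0, 1) with logit(t) ~ N(m, s^2); mass concentrates mid-path for m = 0."""
    return torch.sigmoid(m + s * torch.randn(batch_size))

def sample_t_cosmap(batch_size):
    """CosMap sampler: t = 1 - 1 / (tan(pi * u / 2) + 1) with u ~ U(0, 1)."""
    u = torch.rand(batch_size)
    return 1 - 1 / (torch.tan(math.pi * u / 2) + 1)

t = sample_t_logit_normal(1024)     # a histogram of these values peaks near t = 0.5
```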
Empirical Effects
Extensive experiments in Esser et al. (2024) show that:
- Non-uniform samplers (especially logit-normal with (m=0, s=1)) consistently outperform uniform sampling.
- Improvements appear across CLIP score, FID, and human preference metrics.
- The gains are strongest when using fewer ODE solver steps (e.g., 5–15 evaluations).
In short, focusing more training effort on intermediate (t) values teaches the model better velocity fields for hard-to-learn mixed states — improving sample quality and efficiency at all budgets.
Conceptual Summary
- The RF model’s challenge lies in the midpath region between pure data and pure noise.
- Sampling (t) non-uniformly (via distributions like logit-normal) effectively reweights the training loss to emphasize this region.
- This improves generalization and leads to sharper, more consistent image generation, especially when using small inference step counts.
3. Architecture – MM-DiT Backbone and Training System
Flow Matching models operate in latent space, using a multimodal Transformer backbone (MM-DiT) designed for efficient, stable text-to-image generation. This section breaks down each architectural component — encoder, tokenization, multimodal attention, text conditioning, and stability features. It is based on Esser et al. (2024) and builds on the same architectural lineage as Peebles & Xie (2023) and Rombach et al. (2022).
3.1 Latent Autoencoder
Images (X \in \mathbb R^{H \times W \times 3}) are first encoded into lower-dimensional latent representations:
\[x = E(X) \in \mathbb R^{h \times w \times d}\]Explanation of terms:
- (E): encoder network (usually a variational autoencoder).
- (h, w): spatial resolution of latent space (typically (H/8, W/8)).
- (d): latent channel dimension (commonly 8, 16, or 32).
Purpose: This reduces the computational load while preserving high-frequency image structure. Increasing (d) improves reconstruction accuracy and sets a higher ceiling for generation quality.
In practice: High-performing Flow Matching models use (d \ge 16) (more expressive latents than standard diffusion VAEs).
3.2 Patch Tokenization
The latent tensor (x) is divided into 2×2 patches and flattened:
\[N = \frac{h \times w}{p^2}\]where (p=2) is the patch size and (N) is the number of image tokens.
Each patch is linearly projected to a common embedding dimension (D):
\[z_{\text{img}} = W_p \cdot \text{flatten}(x)\]Explanation:
- (W_p): learned projection matrix.
- (\text{flatten}(x)): vectorized latent patch. This transforms visual features into a token sequence suitable for Transformer processing.
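A minimal patchify sketch (pure tensor reshaping with no learned parameters; the projection $W_p$ would be an ordinary linear layer applied afterwards):

```python
import torch

def patchify(x, p=2):
    """Turn a latent (B, d, h, w) into a token sequence (B, N, p*p*d) with N = (h*w)/p^2."""
    b, d, h, w = x.shape
    x = x.reshape(b, d, h // p, p, w // p, p)            # split each spatial dim into (cells, p)
    x = x.permute(0, 2, 4, 3, 5, 1)                      # (B, h/p, w/p, p, p, d)
    return x.reshape(b, (h // p) * (w // p), p * p * d)  # flatten cells into tokens

tokens = patchify(torch.randn(1, 16, 64, 64))            # -> shape (1, 1024, 64)
# A learned nn.Linear(p*p*d, D) then maps each token to the transformer width D.
```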
3.3 Text Embedding and Token Concatenation
Text conditioning uses one or more frozen encoders:
- CLIP (for prompt understanding and semantic grounding).
- T5-XXL (for complex compositional or descriptive captions).
Each encoder produces a sequence of embeddings (z_{\text{text}}), which are linearly mapped into the same embedding width (D) as the image tokens.
The final Transformer input is a concatenation:
\[[z_{\text{text}}; z_{\text{img}}]\]This joint sequence allows direct attention interactions between modalities.
Benefit: The model learns cross-modal alignment implicitly — no explicit cross-attention blocks are required.
3.4 MM-DiT Block Design
Each MM-DiT block contains:
- Two parameter streams — one for the text tokens and one for the image tokens.
- Joint self-attention — computed over the concatenated text + image sequence.
- Modulation layers — conditioning the block on timestep and pooled text embeddings.
Mathematical structure:
\[h' = h + \text{MHA}(\text{Norm}(h), \tau(t), \phi(\text{text}))\] \[h'' = h' + \text{MLP}(\text{Norm}(h'))\]Components:
- MHA: multi-head attention.
- Norm: LayerNorm or RMSNorm.
- (\tau(t)): timestep embedding.
- (\phi(\text{text})): pooled text embedding used for FiLM-style modulation.
What it does: Each block updates both modalities in a synchronized way, allowing image and text tokens to influence each other directly through joint attention.
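The sketch below is an illustrative, single-head simplification of such a block (it is not the exact MM-DiT code): separate projections per modality, one joint attention over the concatenated sequence, and FiLM-style scale/shift modulation from a conditioning vector `c` standing in for $\tau(t) + \phi(\text{text})$.

```python
import torch
import torch.nn as nn

class JointBlockSketch(nn.Module):
    """Simplified MM-DiT-style block: two parameter streams, one joint attention
    over [text; image], and modulation by a conditioning vector c. Illustrative only."""

    def __init__(self, dim):
        super().__init__()
        self.mod_txt = nn.Linear(dim, 2 * dim)   # per-stream scale and shift from c
        self.mod_img = nn.Linear(dim, 2 * dim)
        self.norm_txt = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_img = nn.LayerNorm(dim, elementwise_affine=False)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp_txt = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_img = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h_txt, h_img, c):
        # FiLM-style modulation: scale and shift each stream using functions of c.
        scale_t, shift_t = self.mod_txt(c).unsqueeze(1).chunk(2, dim=-1)
        scale_i, shift_i = self.mod_img(c).unsqueeze(1).chunk(2, dim=-1)
        n_txt = self.norm_txt(h_txt) * (1 + scale_t) + shift_t
        n_img = self.norm_img(h_img) * (1 + scale_i) + shift_i

        # Separate q/k/v projections per modality, then one joint attention over
        # the concatenated sequence (single-head here for brevity).
        q_t, k_t, v_t = self.qkv_txt(n_txt).chunk(3, dim=-1)
        q_i, k_i, v_i = self.qkv_img(n_img).chunk(3, dim=-1)
        q = torch.cat([q_t, q_i], dim=1)
        k = torch.cat([k_t, k_i], dim=1)
        v = torch.cat([v_t, v_i], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        out = self.proj(attn @ v)

        n = h_txt.shape[1]
        h_txt = h_txt + out[:, :n]
        h_img = h_img + out[:, n:]
        return h_txt + self.mlp_txt(h_txt), h_img + self.mlp_img(h_img)
```

For example, `JointBlockSketch(256)(torch.randn(2, 77, 256), torch.randn(2, 1024, 256), torch.randn(2, 256))` returns updated text and image token streams of the same shapes.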
3.5 Parameterization and Scaling
Empirically, the following scaling laws hold for MM-DiT:
- Hidden width: (64d) (where (d) is block depth index).
- MLP expansion ratio: ×4.
- Number of attention heads: (=d).
This configuration matches or exceeds the performance of CrossDiT, U-ViT, and standard DiT baselines at equal compute.
Training tip: Parameter scaling in MM-DiT is smooth — doubling hidden size or depth gives predictable validation-loss improvements, indicating efficient scaling behavior.
3.6 Text Encoders and Caption Augmentation
Multiple frozen encoders improve prompt following and visual alignment:
- CLIP provides semantic robustness and contrastive grounding.
- T5 adds syntactic understanding and long-form reasoning.
Training uses a 50/50 mixture of original human captions and synthetic captions generated from large language models. This strategy enhances compositional generalization and improves text rendering accuracy (e.g., for typography).
Inference optimization: T5 can be dropped at inference to reduce VRAM usage, with minimal degradation on short prompts.
3.7 Stability Enhancements for High Resolution
1. QK RMS Normalization: Applying RMSNorm to the Query and Key matrices inside attention:
\[Q' = \text{RMSNorm}(Q), \quad K' = \text{RMSNorm}(K)\]Purpose: Prevents exploding attention logits during high-resolution finetuning and maintains numerical stability under mixed-precision (bf16) training.
2. Aspect-Ratio Bucketing: Training data is grouped by aspect ratio; each bucket uses its own positional embedding grid. At runtime, the relevant patch of the grid is center-cropped to fit the image’s resolution.
Benefit: This enables flexible, resolution-agnostic inference while preserving spatial consistency.
3.8 Summary of MM-DiT Architecture
| Component | Function | Key Benefit |
|---|---|---|
| Latent Autoencoder | Compresses RGB images to latents | Reduces compute, improves reconstruction |
| Patch Tokenizer | Converts latent maps to sequences | Enables Transformer processing |
| Multimodal Joint Attention | Mixes text and image tokens | Direct semantic-visual interaction |
| Timestep/Text Modulation | Conditions on (t) and caption embeddings | Smooth time-dependent generation |
| QK RMSNorm | Stabilizes high-res finetuning | Prevents attention blow-up |
| Caption Mixing | Improves compositional generalization | Enhances typography and layout fidelity |
Conceptual Takeaway
The MM-DiT architecture merges the simplicity of DiT with explicit multimodal conditioning and high-resolution stability techniques. When combined with Flow Matching, it provides:
- A smooth deterministic trajectory from noise to image,
- A well-structured cross-modal representation space, and
- Scalable, few-step sampling suitable for large-scale text-to-image generation.
4. High-Resolution Finetuning and Scaling Behavior
Flow Matching models can be trained at a base resolution (e.g., 256² or 512²) and later finetuned at higher resolutions (e.g., 1024² or beyond). Unlike stochastic diffusion models, which require precise variance rebalancing during upscaling, deterministic Flow Matching ODEs permit direct resolution scaling — provided certain normalization and timestep corrections are applied.
4.1 The Stability Problem at High Resolution
At large resolutions, the Transformer’s attention logits ((QK^T / \sqrt{d_k})) can explode in magnitude, particularly under mixed-precision training (fp16 or bf16). This leads to:
- Gradient instability,
- Attention saturation (collapsed entropy),
- And slower convergence due to precision loss.
Flow Matching’s deterministic dynamics amplify these issues because the model depends heavily on fine-grained velocity predictions at every pixel.
4.2 QK RMS Normalization
To stabilize high-resolution training, Esser et al. (2024) apply RMS normalization to the Query (Q) and Key (K) matrices before computing the attention logits:
\[Q' = \text{RMSNorm}(Q)\] \[K' = \text{RMSNorm}(K)\] \[A = \frac{Q' K'^T}{\sqrt{d_k}}\]Explanation of terms:
- (Q, K): query and key projections from input embeddings.
- (d_k): key dimension.
- RMSNorm: root-mean-square normalization without learned scale parameters.
What it does: This normalization ensures that both Q and K have unit variance per dimension, capping the magnitude of the attention logits regardless of input amplitude. This keeps gradients well-behaved and prevents attention entropy collapse.
Outcome:
- Stable mixed-precision training at 1024² and above.
- No need for slower full-precision fallback.
- Consistent attention entropy across layers and resolutions.
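A minimal sketch of the idea as plain functions (not tied to a particular attention implementation; the RMSNorm here has no learned scale, which is one common variant):

```python
import torch

def rms_norm(x, eps=1e-6):
    """RMS-normalize the last dimension so each vector has roughly unit RMS."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def qk_norm_attention(q, k, v):
    """Attention with RMS-normalized queries and keys, bounding the logit magnitude."""
    q, k = rms_norm(q), rms_norm(k)
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return logits.softmax(dim=-1) @ v

out = qk_norm_attention(torch.randn(2, 64, 32), torch.randn(2, 64, 32), torch.randn(2, 64, 32))
```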
4.3 Aspect-Ratio Bucketing and Positional Embeddings
High-resolution training must handle varying aspect ratios (e.g., landscape vs portrait). Instead of resizing all images to a square shape, training uses bucketed aspect ratios.
Each bucket defines its own positional embedding grid (\Pi_{H \times W}), built once for a large “master” grid and then center-cropped to fit the image’s aspect ratio.
Mathematical view:
\[\Pi_{h \times w} = \text{CenterCrop}(\Pi_{H \times W}, h, w)\]Explanation:
- (\Pi_{H \times W}): base positional embedding grid.
- (h, w): dimensions for the current training batch.
- (\text{CenterCrop}): extracts a region of the grid centered on the image dimensions.
What it does: Allows training and inference on diverse aspect ratios without retraining embeddings — improving generalization across image sizes and compositions.
4.4 Timestep Shifting Between Resolutions
When moving from base resolution (n) to higher resolution (m), the effective signal-to-noise ratio (SNR) changes. Higher resolutions have more detailed structure and thus require stronger noise to maintain the same corruption level.
To compensate, Flow Matching models apply a timestep shift:
\[t_m = \frac{\sqrt{m/n}\, t_n}{1 + (\sqrt{m/n} - 1)t_n}\]Explanation of terms:
- (t_n): original timestep used during training at resolution (n).
- (t_m): remapped timestep for higher resolution (m).
- (\sqrt{m/n}): scale ratio adjusting the log-SNR.
This shift implies a log-SNR correction:
\[\lambda_{t_m} = \lambda_{t_n} - \log(m/n)\]What it does: It keeps the effective corruption strength consistent across resolutions. This allows the model to reuse its velocity field and ODE integration behavior without retraining the full dynamics.
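A short sketch of the remapping, plus a numeric check of the implied log-SNR correction for the Rectified Flow path (here $n$ and $m$ are interpreted as pixel or token counts; only their ratio matters):

```python
import math

def shift_timestep(t_n, m, n):
    """Remap a timestep trained at resolution n to the equivalent timestep at resolution m."""
    alpha = math.sqrt(m / n)
    return alpha * t_n / (1 + (alpha - 1) * t_n)

# Check lambda_{t_m} = lambda_{t_n} - log(m/n) for the RF path, where lambda_t = 2*log((1-t)/t)
n, m, t_n = 256 ** 2, 1024 ** 2, 0.5
t_m = shift_timestep(t_n, m, n)
log_snr = lambda t: 2 * math.log((1 - t) / t)
print(math.isclose(log_snr(t_m), log_snr(t_n) - math.log(m / n)))   # True
```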
4.5 Empirical Guidelines for Resolution Shifting
Human preference studies and quantitative benchmarks (FID, CLIP, T2I-CompBench) show:
- Moderate-to-strong shifts ((\alpha = \sqrt{m/n} \in [1.5, 3])) yield the best visual results at 1024².
- Overly strong shifts can oversmooth fine textures, while too-weak shifts leave high-resolution samples insufficiently corrupted at a given timestep (causing artifacts).
- The ideal shift depends on model capacity and data diversity.
4.6 Scaling Trends and Behavior
Across model and data scales, Flow Matching and Rectified Flow exhibit smooth power-law scaling in validation loss and perceptual quality metrics:
\[\text{Loss}(N) \propto N^{-\beta}\]where (N) is model size and (\beta \approx 0.2{-}0.25).
Observed trends:
- Larger models need fewer ODE steps to reach optimal quality.
- There is no saturation observed up to multi-billion-parameter scales.
- Validation loss correlates strongly with human preference and compositional generalization.
4.7 Conceptual Takeaway
High-resolution scaling in Flow Matching relies on:
- QK RMSNorm — keeps attention stable and precise.
- Aspect-ratio bucketing — enables flexible aspect generalization.
- Timestep shifting — preserves effective noise levels across resolutions.
- Smooth scaling laws — larger models both perform better and sample faster.
In combination, these techniques make Flow Matching models robust to upscaling and efficient in both training and inference, setting the foundation for high-resolution, few-step text-to-image synthesis.
5. Training and Evaluation
Training a Flow Matching model follows a structured pipeline that combines latent-space encoding, timestep sampling, multimodal conditioning, and efficient optimization. Evaluation uses both quantitative metrics (e.g., FID, CLIP) and human preference scores to measure fidelity, diversity, and alignment.
5.1 Training Objective Recap
The Flow Matching model learns a velocity field (v_\Theta(z, t)) using the Conditional Flow Matching loss (from Section 1.3):
\[\mathcal L_{\text{CFM}} = \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left[\|v_\Theta(z,t) - u_t(z|\varepsilon)\|_2^2\right]\]where $u_t(z\mid\varepsilon)$ is the analytical conditional velocity derived in Section 1.2.
This can equivalently be written as a noise-prediction loss:
\[\mathcal L_{\text{CFM}} \propto \mathbb E_{t, p_t(z|\varepsilon), p(\varepsilon)} \left[ \left(-\frac{b_t}{2}\lambda_t'\right)^2 \|\varepsilon_\Theta(z,t) - \varepsilon\|_2^2 \right]\]Plain English summary:
- The model predicts either the velocity of samples or the noise that must be removed.
- The objective ensures smooth, deterministic motion from noise to data along the ODE trajectory.
- The weighting term $\left(-\frac{b_t}{2}\lambda_t'\right)^2$ controls how much each timestep contributes.
5.2 Training Setup and Data Pipeline
1. Latent-Space Encoding: Each image is encoded into a latent representation (x = E(X)) using a pre-trained autoencoder (usually trained jointly or frozen). Training occurs entirely in latent space for efficiency.
2. Timestep Sampling: Timesteps (t) are drawn from a biased distribution, typically logit-normal (\pi_{\text{ln}}(t; m=0, s=1)), to emphasize mid-path learning. This improves gradient flow and image fidelity.
3. Noise Injection: For each (x_0), sample (\varepsilon \sim \mathcal N(0, I)) and compute interpolants (z_t = (1 - t)x_0 + t\varepsilon). The target velocity (\varepsilon - x_0) is then used in the regression loss.
4. Caption Processing: Text captions are embedded using frozen encoders (CLIP, T5). Synthetic caption augmentation (≈50%) improves compositional generalization and style fidelity.
5. Optimization:
- Optimizer: AdamW or Lion with cosine decay.
- Batch size: typically 4k–16k latent samples.
- Precision: mixed bf16 for efficiency.
- Training length: 300k–800k steps, depending on model scale.
- Learning rate: $1\times10^{-4}$ peak, with warmup followed by a decay schedule.
5.3 Validation Metrics
Evaluation uses both automated and human-aligned metrics.
1. CLIP Score: Measures text–image alignment using CLIP cosine similarity.
2. FID (Fréchet Inception Distance): Measures image quality and realism by comparing distributions of generated and real images in feature space.
3. T2I-CompBench: Evaluates compositional generalization — how well the model combines objects, colors, and relations described in text.
4. GenEval: A benchmark that compares generated and reference images on diversity, structure, and global coherence.
5. Human Preference Studies: Human raters compare pairs of generated images for prompt adherence and visual quality.
5.4 Few-Step Sampling Evaluation
A key advantage of Flow Matching is efficient sampling. Unlike diffusion, which needs 50–250 denoising steps, Flow Matching achieves competitive quality with 5–15 ODE evaluations.
Observations from Esser et al. (2024):
- RF models retain sharpness and semantic fidelity even with 8-step sampling.
- Adaptive ODE solvers (e.g., Runge–Kutta or Heun methods) further reduce required evaluations.
- Quality degrades smoothly with fewer steps — not abruptly as in diffusion.
5.5 Scaling Behavior
Across model and data scales, validation loss and FID/CLIP metrics improve predictably, following a power-law relationship:
\[\text{Metric} \propto N^{-\beta}\]with exponent (\beta \approx 0.2{-}0.25), consistent with scaling trends in large Transformers. This smooth scaling indicates no saturation within observed parameter ranges (up to billions of parameters).
Interpretation: Larger models both:
- Learn more accurate flow fields, and
- Require fewer ODE evaluations to reach convergence-quality images.
5.6 Generalization and Robustness
Flow Matching models demonstrate:
- Strong zero-shot generalization across text prompts and domains.
- Stable quality degradation under step truncation or latent perturbations.
- Robust scaling of loss-to-quality correlations, meaning validation loss is a reliable proxy for perceptual metrics.
Empirically, Flow Matching’s deterministic structure leads to higher consistency across seeds and batch sizes compared to stochastic diffusion.
5.7 Conceptual Summary
- The training process combines latent-space learning, mid-path timestep sampling, and multimodal conditioning for efficient supervision.
- Evaluation benchmarks confirm Flow Matching’s sample efficiency and scalability.
- Larger models yield faster and smoother sampling trajectories, requiring fewer integration steps for the same quality.
- These properties make Flow Matching a strong replacement for diffusion in large-scale, text-conditioned generative systems.
Section 6 condenses everything covered so far into a concrete, step-by-step guide for implementing high-quality Flow Matching (FM) and Rectified Flow (RF) text-to-image models. It is derived from the design details in Esser et al. (2024), Lipman et al. (2023), and best practices across large-scale Transformer-based generative systems.
6. Practical Implementation Checklist
This section serves as a “recipe card” for reproducing performant Flow Matching models from scratch — covering path setup, model architecture, training configurations, and inference tricks.
6.1 Path and Probability Setup
Forward Path (Rectified Flow):
\[z_t = (1 - t)x_0 + t\varepsilon\]Timestep Sampler: Use a logit-normal distribution with parameters (m=0, s=1):
\[\pi_{\text{ln}}(t; m, s) = \frac{1}{s\sqrt{2\pi}} \frac{1}{t(1-t)} \exp\left(-\frac{(\operatorname{logit}(t) - m)^2}{2s^2}\right)\]Purpose: This focuses training on the ambiguous midpoints between data and noise, improving stability and quality.
6.2 Loss Function
Train with the Conditional Flow Matching loss (velocity or noise-prediction form):
\[\mathcal L_{\text{CFM}} = \mathbb E_{t, p_t(z|\varepsilon), \varepsilon} \left[\|v_\Theta(z, t) - u_t(z|\varepsilon)\|_2^2\right]\]or equivalently,
\[\mathcal L_{\text{CFM}} \propto \mathbb E_{t, p_t(z|\varepsilon), \varepsilon} \left[ \left(-\frac{b_t}{2}\lambda_t'\right)^2 \|\varepsilon_\Theta(z, t) - \varepsilon\|_2^2 \right]\]Implementation tip: Monitor validation loss at fixed (t)-values to detect underfitting or timestep imbalance.
6.3 Architecture Configuration
Backbone: Multimodal DiT (MM-DiT) — a Transformer with dual parameter streams for text and image tokens.
| Parameter | Recommendation | Notes |
|---|---|---|
| Hidden width | (64d) | scales with depth |
| Heads | (=d) | proportional to layer depth |
| MLP expansion | ×4 | matches DiT convention |
| Latent channels | (d \ge 16) | improves reconstruction and detail |
| Patch size | (2 \times 2) | balances memory and context length |
Stability: Use QK RMSNorm on attention projections to prevent logit explosion at high resolutions.
Text Conditioning: Combine frozen encoders (e.g., CLIP + T5). Use 50/50 original and synthetic captions during training. Drop T5 during inference for efficiency.
6.4 High-Resolution Finetuning
Aspect-Ratio Bucketing: Use multiple resolution buckets with center-cropped positional embeddings to support flexible aspect ratios.
Timestep Shift:
\[t_m = \frac{\sqrt{m/n}\,t_n}{1 + (\sqrt{m/n} - 1)t_n}\]where (n) is base resolution, (m) is target resolution. Empirically, a shift factor (\alpha = \sqrt{m/n} \in [1.5, 3]) works best at 1024².
6.5 Training Configuration
| Setting | Recommended Value | Description |
|---|---|---|
| Optimizer | AdamW or Lion | robust for Transformer training |
| Precision | bf16 | stable mixed-precision |
| Learning Rate | 1e-4 cosine decay | with warmup |
| Batch Size | 4k–16k | depending on compute |
| Timestep Distribution | logit-normal | focus mid-path learning |
| Caption Mix | 50% synthetic | improves generalization |
| Training Steps | 300k–800k | typical for large models |
6.6 Evaluation and Sampling
Metrics:
- CLIP score — text-image alignment
- FID — realism and diversity
- GenEval / T2I-CompBench — compositional reasoning
- Human preference — final quality validation
Sampler: Use adaptive ODE solvers (Runge–Kutta, Heun, or DPM-Solver++). Flow Matching models converge with 5–15 steps, even at high resolution.
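A minimal Heun (second-order) sampler sketch in the spirit described here; `velocity_model` is a placeholder for the trained network, and step counts around 5–15 match the regime discussed above.

```python
import torch

@torch.no_grad()
def sample_heun(velocity_model, shape, num_steps=8, device="cpu"):
    """Few-step Heun integration of dx/dt = v_theta(x, t) from t = 1 down to t = 0."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = float(ts[i]), float(ts[i + 1])
        dt = t_next - t
        tb = torch.full((shape[0],), t, device=device)
        tb_next = torch.full((shape[0],), t_next, device=device)
        v1 = velocity_model(x, tb)                 # slope at the current point (Euler predictor)
        v2 = velocity_model(x + dt * v1, tb_next)  # slope at the predicted endpoint
        x = x + dt * 0.5 * (v1 + v2)               # trapezoidal (Heun) corrector
    return x
```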
Few-Step Strategy:
- Large models allow smaller step counts (8 or fewer) with minimal loss in quality.
- Quality degrades smoothly — not catastrophically as in diffusion.
6.7 Debugging and Validation Checklist
- Validate reconstruction fidelity of autoencoder before training the flow.
- Verify velocity predictions at multiple (t)-values (e.g., visualize (v_\Theta(z,t))).
- Ensure stable loss magnitude across timesteps (no exploding weights near endpoints).
- Track validation CLIP/FID every 10k–20k steps.
- If mid-path performance lags, increase logit-normal scale (s) slightly (>1).
- Re-check timestep shift when finetuning to higher resolutions.
6.8 Conceptual Summary
A high-quality Flow Matching model emerges from a simple but disciplined design recipe:
- Straight-line probability paths (Rectified Flow).
- Mid-path–focused timestep sampling (logit-normal).
- Multimodal DiT backbone with joint attention and stability normalization.
- Timestep remapping for consistent cross-resolution dynamics.
- Few-step adaptive solvers for rapid, deterministic sampling.
When combined, these choices produce text-to-image models that:
- Retain diffusion’s strong conditioning and latent supervision,
- But replace noisy denoising with a smooth, deterministic ODE trajectory,
- Achieving high-quality generation at a fraction of the inference cost.
References
- Lipman, Yaron, et al. “Flow Matching for Generative Modeling.” ICLR 2023. Original Flow Matching paper introducing the framework and the conditional flow matching objective.
- Albergo, Michael S., and Eric Vanden-Eijnden. “Building Normalizing Flows with Stochastic Interpolants.” ICLR 2023. Stochastic interpolants framework, closely related to Flow Matching.
- Esser, Patrick, et al. “Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.” ICML 2024. Rectified-flow text-to-image paper introducing the MM-DiT backbone, tailored timestep samplers, and the high-resolution training recipe summarized above.