Vision Language Models (VLMs) encode multimodal inputs with large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose MIMIC (Multimodal Inversion for Model Interpretation and Conceptualization), a framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length, reporting both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach to address visual interpretations of VLM concepts.
Examples of inverted concepts: tiger shark, castle, cassette player, abacus, car wheel, bald eagle, bakery, academic gown, accordion, carousel, ambulance, and leatherback turtle. A noisy input is optimized using an aggregated loss that combines adapted cross-entropy \( \mathcal{L}_{\mathrm{SCE}} \), base feature alignment \( \mathcal{L}_{\mathrm{base}} \), and regularizers \( \mathcal{R} \) to ensure semantic fidelity and visual coherence.
Method. We initialize an updatable input \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \in \mathbb{R}^{C \times H \times W} \) with \(C\) channels, \(H\) height, and \(W\) width. As VLMs can respond to queries given a multimodal context window, we include a text prompt template \( \mathbf{t} \) alongside our updatable input, e.g.: "What is shown in the picture: a. [target] concept, or b. [negative] concept". The text is tokenized by \( \mathcal{G}(\mathbf{t}) \) into a sequence of embeddings. Similarly, the vision input \( \color{#B49}\widehat{\mathbf{v}} \) is encoded by \( \mathcal{E} \) into embeddings \( \mathcal{E}(\color{#B49}\widehat{\mathbf{v}}\color{#000};\theta_e) \in \mathbb{R}^{D \times \Omega} \), with \( D \) image tokens and \(\Omega\) channels. Both are combined into a concatenated input: \( \color{#b49} \widehat{\mathbf{x}}\color{#000} = [ \mathcal{G}(\mathbf{t}), \mathcal{E}(\color{#B49}\widehat{\mathbf{v}}\color{#000};\theta_e) ] \).
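The text-plus-vision concatenation can be sketched as follows. This is a minimal numpy stand-in: the embedding functions, token counts, and dimensions are illustrative assumptions, not the actual \( \mathcal{G} \) and \( \mathcal{E} \) of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): T text tokens, D image tokens,
# shared channel width Omega.
T, D, Omega = 5, 4, 8

def tokenize_text(num_tokens, dim):
    # Stand-in for G(t): prompt -> sequence of token embeddings.
    return rng.normal(size=(num_tokens, dim))

def encode_image(v, num_tokens, dim):
    # Stand-in for E(v; theta_e): image -> D x Omega embeddings
    # (here, a fixed random projection of the flattened image).
    proj = rng.normal(size=(num_tokens * dim, v.size)) / np.sqrt(v.size)
    return (proj @ v.ravel()).reshape(num_tokens, dim)

v_hat = rng.normal(size=(3, 16, 16))   # updatable input v_hat
x_hat = np.concatenate([tokenize_text(T, Omega),
                        encode_image(v_hat, D, Omega)], axis=0)
print(x_hat.shape)  # (T + D, Omega) = (9, 8)
```

The only property that matters here is the ordering: text embeddings first, then the \( D \times \Omega \) image embeddings, matching \( \widehat{\mathbf{x}} = [\mathcal{G}(\mathbf{t}), \mathcal{E}(\widehat{\mathbf{v}};\theta_e)] \).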
VLM Inversion. The backbone LLM \( \Phi(\cdot;{\theta_\phi}) \), with frozen parameters \(\theta_\phi\), processes \( \color{#b49} \widehat{\mathbf{x}}\color{#000} \) and autoregressively returns a probability distribution of token logits over a fixed-length dictionary: \( \color{#b49} \widehat{\mathbf{y}}\color{#000}_i = \Phi(\color{#b49} \widehat{\mathbf{x}}\color{#000},\color{#b49} \widehat{\mathbf{y}}_{< i}\color{#000};\theta_\phi) \), where \( \color{#b49} \widehat{\mathbf{y}}_{< i}\color{#000} \) are the previously generated \(i-1\) logits.
We define an adapted cross-entropy loss \( \mathcal{L}_{SCE} \) given the token index with the highest logit for [target] as: $$ \mathcal{L}_{SCE}(\color{#b49} \widehat{\mathbf{y}}\color{#000}) = - \sum_i \mathbf{1}(\text{sg}(\widehat{\mathbf{y}}),i, \texttt{[target]}) \log(\color{#b49} \widehat{\mathbf{y}}\color{#000}_i), $$ where \( \mathbf{1}(\text{sg}(\widehat{\mathbf{y}}),i,\texttt{[target]}) \) is the indicator function and \( \text{sg}(\cdot) \) denotes stop-gradient.
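One plausible numeric reading of \( \mathcal{L}_{SCE} \) is sketched below (an assumption, not necessarily MIMIC's exact indicator semantics): the indicator is computed on stop-gradient logits, so it carries no gradient, and selects the [target] vocabulary entry at each generated position; the loss is the summed negative log-probability there.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sce_loss(logits, target_id):
    """Adapted cross-entropy sketch: a one-hot indicator at the
    [target] token (built from stop-gradient values, so only the
    log-probabilities receive gradient) weights -log p."""
    probs = softmax(logits)            # (num_steps, vocab)
    labels = np.zeros_like(probs)
    labels[:, target_id] = 1.0         # 1(sg(y_hat), i, [target])
    return float(-(labels * np.log(probs)).sum())

logits = np.array([[2.0, 0.5, -1.0],
                   [1.5, 0.2,  0.0]])
print(sce_loss(logits, target_id=0))
```

Minimizing this pushes probability mass toward the [target] token at every generated position, which is what drives the inversion toward the queried concept.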
Base Feature Loss. To align synthesized images with the internal representations of the vision encoder, we extract per-layer features \( \color{#B49} \widehat{\mathbf{z}}\color{#000}_l = \mathcal{E}(\color{#B49}\widehat{\mathbf{v}}\color{#000};\theta_e,< l) \) and approximate the feature manifold's mean \( \mu(\mathcal{Z}_l) \) and variance \( \sigma(\mathcal{Z}_l) \) from \( \texttt{[target]} \) images given \( \theta_{\mathcal{E},l} \) weights, across layers \( l \in \Lambda = \{1,\dots,L\} \).
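The text does not write out the functional form of \( \mathcal{L}_{\text{base}} \); a common choice in inversion work, used here purely as a hedged sketch, penalizes the squared distance between the synthesized features' per-layer moments and the precomputed [target] statistics \( \mu(\mathcal{Z}_l) \), \( \sigma(\mathcal{Z}_l) \):

```python
import numpy as np

def base_feature_loss(feats, target_mu, target_var):
    """Sketch of L_base (moment-matching assumption): per layer l,
    compare synthesized feature statistics (tokens x channels) with
    the [target] manifold statistics mu(Z_l), sigma(Z_l)."""
    loss = 0.0
    for z, mu, var in zip(feats, target_mu, target_var):
        loss += np.sum((z.mean(axis=0) - mu) ** 2)   # match mu(Z_l)
        loss += np.sum((z.var(axis=0) - var) ** 2)   # match sigma(Z_l)
    return float(loss)

rng = np.random.default_rng(1)
feats = [rng.normal(size=(16, 4)) for _ in range(3)]   # L = 3 layers
mu = [z.mean(axis=0) for z in feats]
var = [z.var(axis=0) for z in feats]
print(base_feature_loss(feats, mu, var))  # matching statistics -> 0.0
```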
Regularizers. We enhance optimization with three regularizers inspired by vision-only inversion methods: a patch regularizer \( \mathcal{R}_{\text{patch}} \) smooths transitions across patch boundaries; a composite image prior \( \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) \) includes total variation (TV) and \( \ell_2 \)-norm penalties for smoothness and range control; and a feature distribution regularizer \( \mathcal{R}_V \) encourages alignment with BN feature statistics from a verifier network \( \mathcal{F} \). The aggregated regularization objective becomes: $$ \mathcal{R}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) = \beta_1 \mathcal{R}_V(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \beta_2 \mathcal{R}_{\text{patch}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}), $$ where \( \beta_1, \beta_2 \) and the \( \alpha_1, \alpha_2, \alpha_3 \) terms within \( \mathcal{R}_{\text{prior}} \) are scaling hyperparameters.
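The composite image prior can be sketched as a weighted TV plus \( \ell_2 \) penalty (the weights and the anisotropic TV variant are assumptions for illustration, not the paper's exact hyperparameters):

```python
import numpy as np

def prior_reg(v, a_tv=1.0, a_l2=1e-3):
    """R_prior sketch: anisotropic total variation for smoothness
    plus an l2-norm penalty for range control."""
    tv = np.abs(np.diff(v, axis=-2)).sum() + np.abs(np.diff(v, axis=-1)).sum()
    return a_tv * tv + a_l2 * np.sum(v ** 2)

flat = np.ones((3, 8, 8))   # constant image: the TV term vanishes
print(prior_reg(flat))      # only the l2 range penalty remains
```

A constant image makes the TV term zero, illustrating that TV penalizes only spatial variation, while the \( \ell_2 \) term alone keeps pixel magnitudes bounded.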
Aggregated Optimization Objective. We iteratively update \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \) over steps \( s \rightarrow s+1 \) by minimizing the total objective: $$ \color{#B49}\widehat{\mathbf{v}}\color{#000}^{s+1} = \arg\min_{\widehat{\mathbf{v}}^s} \gamma_1 \mathcal{L}_{SCE} \left(\Phi([\mathcal{G}(\mathbf{t}), \mathcal{E}(\color{#b49} \widehat{\mathbf{v}}\color{#000};\theta_e)];\theta_\phi)\right) + \gamma_2 \mathcal{L}_{\text{base}} + \mathcal{R}(\color{#b49} \widehat{\mathbf{v}}\color{#000}), $$ where \( \gamma_1, \gamma_2 \) are loss scaling factors. This objective guides reconstructions that reveal internal VLM encodings, making their learned concepts visually interpretable.
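A toy version of the update \( \widehat{\mathbf{v}}^{s} \rightarrow \widehat{\mathbf{v}}^{s+1} \) is sketched below, with a finite-difference gradient and a quadratic surrogate standing in for backpropagation through the frozen \( \Phi \) and \( \mathcal{E} \); everything here is illustrative, not the actual optimizer or objective.

```python
import numpy as np

def inversion_step(v, objective, lr=0.1, eps=1e-5):
    """One gradient-descent step on the aggregated objective, using a
    central finite-difference gradient as a toy stand-in for autograd."""
    grad = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        vp, vm = v.copy(), v.copy()
        vp[idx] += eps
        vm[idx] -= eps
        grad[idx] = (objective(vp) - objective(vm)) / (2 * eps)
    return v - lr * grad

# Toy surrogate objective: drives v toward a fixed "concept" template.
target = np.full((2, 3), 0.5)
objective = lambda v: np.sum((v - target) ** 2)

v = np.zeros((2, 3))        # Gaussian/zero init stand-in for v_hat
for _ in range(20):
    v = inversion_step(v, objective)
print(objective(v) < 1e-3)  # loss shrinks toward zero
```

The structure mirrors the paper's loop: the input is the only updatable quantity, all network weights stay frozen, and each step descends the combined objective.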
We invert visual-instruction-tuned LLaMA3-8B, Mistral-7B, and Vicuna-7/13B. All models are frozen and run inference only; the updatable input \( \hat{\mathbf{v}} \in \mathbb{R}^{3 \times 448 \times 448} \) is initialized from a Gaussian \( \hat{\mathbf{v}} \sim \mathcal{N}(0,1) \).
MIMIC synthesizes coherent features across models with various target tokens.
Descriptive VLM features learned for target semantics are often based on
distinct shapes such as the examples for [airliner] and
[offshore rig]. Positive correlations between materials and
colors are also learned for instances such as [school bus],
[dome], and [minivan].
Ablation studies. We further ablate the text prompt template \( \mathbf{t} \) used alongside the vision tokens \( \hat{\mathbf{v}} \).
MIMIC robustly visualizes the main learned features, such as the water reflections in [dock] and the dial plate in [magnetic compass], remaining consistent across prompt variations.
Prompt ablations for dock, magnetic compass, and obelisk, using:
\( \mathbf{t}_1 \): What is shown in the image? a.[target] or b.[negative],
\( \mathbf{t}_2 \): Does the image show an instance of [target] or [negative]?, and
\( \mathbf{t}_3 \): The image depicts a scene that corresponds to [target] or [negative]?
@article{jain2025mimic,
title = {MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization},
author = {Jain, Animesh and Stergiou, Alexandros},
year = {2025},
journal = {arXiv}
}