Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
Figure: MIMIC inversions of the concepts goldfish, golden retriever, and corn. A noisy input is optimized using an aggregated loss that combines the adapted cross-entropy \( \mathcal{L}_{\mathrm{SCE}} \), base feature alignment \( \mathcal{L}_{\mathrm{base}} \), and regularizers \( \mathcal{R} \) to ensure semantic fidelity and visual coherence.
Method. We initialize an updatable input \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \in \mathbb{R}^{C \times H \times W} \) with \(C\) channels, \(H\) height, and \(W\) width. As VLMs can respond to queries given a multimodal context window, we create a text prompt template \( \mathbf{t} \) for the concept we want to visualize. For example: "What is shown in the picture: a. [target] concept, or b. [negative] concept", where [target] and [negative] can be, e.g., a. tiger and b. dog. The text is tokenized by \( \mathcal{G}(\mathbf{t}) \) into a sequence of embeddings. Similarly, the vision input \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \) is encoded by \( \mathcal{E}_{\theta_e} \) into embeddings \( \mathcal{E}_{\theta_e}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) \in \mathbb{R}^{D \times \Omega} \), with \( D \) image tokens and \(\Omega\) channels. Both are combined into a concatenated input \( \color{#b49} \widehat{\mathbf{x}}\color{#000} = [ \mathcal{G}(\mathbf{t}), \mathcal{E}_{\theta_e}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) ] \).
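For concreteness, the following is a minimal PyTorch-style sketch of how the concatenated multimodal input could be assembled. The tokenizer, vision encoder, and all dimensions are hypothetical placeholders for illustration, not the actual LLaVA-1.5 components.

import torch
import torch.nn as nn

# Notation follows the text: v_hat in R^{C x H x W}, D image tokens, Omega channels.
# All sizes below are assumptions for illustration only.
C, H, W = 3, 336, 336
D, OMEGA = 576, 4096

# Updatable (learnable) noisy input v_hat.
v_hat = torch.randn(1, C, H, W, requires_grad=True)

def tokenize_prompt(prompt: str, length: int = 32) -> torch.Tensor:
    """Stand-in for G(t): returns a sequence of text token embeddings."""
    return torch.randn(1, length, OMEGA)

class VisionEncoder(nn.Module):
    """Stand-in for E_theta_e: maps the image to D patch embeddings of width Omega."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(C * 14 * 14, OMEGA)   # patchify + project, ViT-style

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        patches = nn.functional.unfold(v, kernel_size=14, stride=14)   # (1, C*14*14, D)
        return self.proj(patches.transpose(1, 2))                      # (1, D, OMEGA)

encoder = VisionEncoder()
prompt_t = "What is shown in the picture: a. tiger, or b. dog?"
# Concatenated input x_hat = [G(t), E(v_hat)] that is fed to the LLM backbone.
x_hat = torch.cat([tokenize_prompt(prompt_t), encoder(v_hat)], dim=1)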
VLM Inversion. The backbone LLM \( \Phi_{\theta_\phi} \) attends to the multimodal prompt \( \color{#b49} \widehat{\mathbf{x}}\color{#000} \) and maps it to a probability distribution over a fixed-length dictionary: \( \color{#b49} \widehat{\mathbf{y}}\color{#000} = s(\Phi_{\theta_\phi}(\color{#b49} \widehat{\mathbf{x}}\color{#000})) \).
We define an adapted cross-entropy loss \( \mathcal{L}_{SCE} \), conditioned on the token index with the highest logit for [target], as: $$ \mathcal{L}_{SCE}(\color{#b49} \widehat{\mathbf{y}}\color{#000}) = - \sum \mathbf{1}_{[\text{sg}(\widehat{\mathbf{y}}) = \texttt{[target]}]} \log(\color{#b49} \widehat{\mathbf{y}}\color{#000}), $$ where \( \mathbf{1}_{[\cdot]} \) is the indicator function and \( \text{sg}(\cdot) \) denotes the stop-gradient operator.
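Below is a minimal sketch of how \( \mathcal{L}_{SCE} \) could be computed for a single output sequence: the prediction is detached (stop-gradient) to build the indicator over positions whose argmax equals the [target] token id, and the negative log-probability of [target] is summed only at those positions. The tensor layout and the helper name sce_loss are assumptions.

import torch

def sce_loss(y_hat: torch.Tensor, target_id: int) -> torch.Tensor:
    """Adapted cross-entropy L_SCE.

    y_hat: (T, V) softmax distribution over a fixed-length dictionary of V tokens
           for T output positions.
    target_id: vocabulary index of the [target] concept token.
    """
    # sg(y_hat): stop-gradient before the argmax so the indicator carries no gradient.
    indicator = (y_hat.detach().argmax(dim=-1) == target_id).float()   # (T,)
    # -log p([target]) at each position, accumulated only where the indicator is 1.
    log_p_target = torch.log(y_hat[:, target_id] + 1e-12)              # (T,)
    return -(indicator * log_p_target).sum()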
Base Feature Loss. To align synthesized images with the internal representations of the vision encoder, we extract per-layer features \( \widehat{\mathbf{z}}_l = \mathcal{E}_{\theta_e}(\color{#B49}\widehat{\mathbf{v}}\color{#000}, l) \) for layers \( l \in \Lambda = \{1,\dots,L\} \), compute their mean and variance, and match them against pre-computed target statistics \( \bar{\mathbf{Z}}_l \).
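A possible realization of \( \mathcal{L}_{\text{base}} \) is sketched below: channel-wise means and variances of the per-layer features are matched to the pre-computed target statistics with an \( \ell_2 \) distance. The choice of distance and the layout of \( \bar{\mathbf{Z}}_l \) as (mean, variance) pairs are assumptions.

import torch

def base_feature_loss(feats, target_stats):
    """L_base: align per-layer feature statistics with pre-computed targets.

    feats: list of tensors z_hat_l of shape (D, Omega), one per layer l in Lambda.
    target_stats: list of (mean, var) pairs Z_bar_l, each of shape (Omega,).
    """
    loss = feats[0].new_zeros(())
    for z_l, (mu_t, var_t) in zip(feats, target_stats):
        mu = z_l.mean(dim=0)                   # channel-wise mean over image tokens
        var = z_l.var(dim=0, unbiased=False)   # channel-wise variance
        loss = loss + torch.norm(mu - mu_t) + torch.norm(var - var_t)
    return loss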
Triplet Feature Regularizer. We enhance optimization with three regularizers inspired by vision-only inversion methods: a patch regularizer \( \mathcal{R}_{\text{patch}} \) that smooths transitions across patch boundaries; a composite image prior \( \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) \) that combines total variation (TV) and \( \ell_2 \)-norm penalties for smoothness and range control; and a feature distribution regularizer \( \mathcal{R}_V \) that encourages alignment with BN feature statistics from a verifier network \( \mathcal{F} \). The aggregated regularization objective becomes: $$ \mathcal{R}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) = \beta_1 \mathcal{R}_V(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \beta_2 \mathcal{R}_{\text{patch}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}), $$ where \( \beta_1, \beta_2, \alpha_1, \alpha_2, \alpha_3 \) are scaling hyperparameters.
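The sketch below illustrates the composite image prior (TV plus \( \ell_2 \)) and the aggregation of the three regularizers. The anisotropic TV form, the placement of the \( \alpha \) weights, and the default values are assumptions; \( \mathcal{R}_V \) and \( \mathcal{R}_{\text{patch}} \) are passed in as callables.

import torch

def prior_regularizer(v_hat: torch.Tensor, a1: float = 1e-4, a2: float = 1e-4) -> torch.Tensor:
    """R_prior: total-variation and l2-norm penalties on v_hat of shape (1, C, H, W)."""
    tv_h = (v_hat[..., 1:, :] - v_hat[..., :-1, :]).abs().mean()   # vertical differences
    tv_w = (v_hat[..., :, 1:] - v_hat[..., :, :-1]).abs().mean()   # horizontal differences
    l2 = v_hat.pow(2).mean()                                       # range control
    return a1 * (tv_h + tv_w) + a2 * l2

def aggregate_regularizer(v_hat, r_v, r_patch, b1: float = 1e-2, b2: float = 1e-2):
    """R(v_hat) = b1 * R_V(v_hat) + b2 * R_patch(v_hat) + R_prior(v_hat)."""
    return b1 * r_v(v_hat) + b2 * r_patch(v_hat) + prior_regularizer(v_hat)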
Aggregated Optimization Objective. We iteratively update \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \) over steps \( i \rightarrow i+1 \) by minimizing the total objective: $$ \color{#B49}\widehat{\mathbf{v}}\color{#000}_{i+1} = \arg\min_{\widehat{\mathbf{v}}_i} \; \gamma_1 \mathcal{L}_{SCE} \left(s(\Phi_{\theta_\phi}([\mathcal{G}(\mathbf{t}), \mathcal{E}_{\theta_e}(\color{#b49} \widehat{\mathbf{v}}\color{#000}_i)]))\right) + \gamma_2 \mathcal{L}_{\text{base}} + \mathcal{R}(\color{#b49} \widehat{\mathbf{v}}\color{#000}_i), $$ where \( \gamma_1, \gamma_2 \) are loss scaling factors. This objective guides reconstructions that reveal internal VLM encodings, making their learned concepts visually interpretable.
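Putting the pieces together, a minimal optimization loop could look as follows, reusing the sce_loss and base_feature_loss helpers sketched above. The optimizer choice (Adam), step count, learning rate, and loss weights are assumptions, and model_forward stands for the full pass \( s(\Phi_{\theta_\phi}([\mathcal{G}(\mathbf{t}), \mathcal{E}_{\theta_e}(\widehat{\mathbf{v}})])) \) together with per-layer feature extraction.

import torch

def invert_concept(model_forward, regularizer, v_hat, target_id, target_stats,
                   steps: int = 2000, lr: float = 0.1, g1: float = 1.0, g2: float = 0.05):
    """Iteratively update v_hat by minimizing g1 * L_SCE + g2 * L_base + R(v_hat)."""
    opt = torch.optim.Adam([v_hat], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # model_forward returns the softmaxed token distribution y_hat and per-layer features.
        y_hat, feats = model_forward(v_hat)
        loss = (g1 * sce_loss(y_hat, target_id)
                + g2 * base_feature_loss(feats, target_stats)
                + regularizer(v_hat))
        loss.backward()
        opt.step()
    return v_hat.detach()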
We invert LLaVA-1.5, which consists of a CLIP ViT-L/14 vision encoder and a LLaMA-3-8B-Instruct language model.
Baseline results across configurations show that the aggregated objective achieves a balance across metrics, with 41.25% Top-1 accuracy and a CLIPScore of 27.42.
Results improve across semantic, perceptual, and distributional metrics as each objective component is added. Additional experiments on varying output lengths \( |\color{#b49} \widehat{\mathbf{y}}\color{#000}| \) demonstrate that image quality remains visually consistent, confirming the robustness of our approach.
@article{jain2025mimic,
  title   = {MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization},
  author  = {Jain, Animesh and Stergiou, Alexandros},
  journal = {arXiv},
  year    = {2025}
}