MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain, Alexandros Stergiou

University of Twente, NL



Abstract


Vision Language Models (VLMs) encode multimodal inputs with large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.



Overview


Fig. 1: MIMIC inversion pipeline and synthesized outputs for goldfish, golden retriever, and corn. A noisy input is optimized using an aggregated loss that combines adapted cross-entropy \( \mathcal{L}_{\mathrm{SCE}} \), base feature alignment \( \mathcal{L}_{\mathrm{base}} \), and regularizers \( \mathcal{R} \) to ensure semantic fidelity and visual coherence.


MIMIC Framework


Method. We initialize an updatable input \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \in \mathbb{R}^{C \times H \times W} \) with \(C\) channels, \(H\) height, and \(W\) width. As VLMs respond to queries given a multimodal context window, we create a text prompt template \( \mathbf{t} \) for the concept we want to visualize, for example: "What is shown in the picture: a. [target] concept, or b. [negative] concept", where the concepts can be, e.g., a. tiger and b. dog. Text is tokenized by \( \mathcal{G}(\mathbf{t}) \) into a sequence of embeddings. Similarly, the vision input \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \) is encoded by \( \mathcal{E}_{\theta_e} \) into embeddings \( \mathcal{E}_{\theta_e}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) \in \mathbb{R}^{D \times \Omega} \), with \( D \) image tokens and \( \Omega \) channels. Both are combined into a concatenated input \( \color{#b49} \widehat{\mathbf{x}}\color{#000} = [ \mathcal{G}(\mathbf{t}), \mathcal{E}_{\theta_e}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) ] \).
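As a concrete illustration, the following PyTorch sketch builds the updatable input and the concatenated multimodal prompt. The modules, vocabulary, and dimensions (text_embedder, vision_encoder, OMEGA, D, T) are illustrative placeholders rather than the actual LLaVA components, whose real dimensions are much larger.

import torch
import torch.nn as nn

# Illustrative stand-ins for the text embedder G and the vision encoder E_theta_e.
OMEGA, D, T, VOCAB = 64, 16, 8, 1000
text_embedder = nn.Embedding(VOCAB, OMEGA)                        # placeholder for G(t)
vision_encoder = nn.Sequential(                                   # placeholder for E_theta_e
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    nn.Linear(3 * 4 * 4, D * OMEGA), nn.Unflatten(1, (D, OMEGA)))

C, H, W = 3, 336, 336
v_hat = torch.randn(1, C, H, W, requires_grad=True)               # updatable (noisy) input v_hat

token_ids = torch.randint(0, VOCAB, (1, T))                       # tokenized prompt t (placeholder ids)
x_hat = torch.cat([text_embedder(token_ids), vision_encoder(v_hat)], dim=1)   # [G(t), E(v_hat)]
print(x_hat.shape)                                                # (1, T + D, OMEGA)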

VLM Inversion. The backbone LLM \( \Phi_{\theta_\phi} \) attends to the multimodal prompt \( \color{#b49} \widehat{\mathbf{x}}\color{#000} \) and maps it, through the softmax \( s(\cdot) \), to a probability distribution over a fixed-length dictionary: \( \color{#b49} \widehat{\mathbf{y}}\color{#000} = s(\Phi_{\theta_\phi}(\color{#b49} \widehat{\mathbf{x}}\color{#000})) \).

We define an adapted cross-entropy loss \( \mathcal{L}_{SCE} \), given the token index with the highest logit for [target], as: $$ \mathcal{L}_{SCE}(\color{#b49} \widehat{\mathbf{y}}\color{#000}) = - \sum \mathbf{1}_{[\text{sg}(\widehat{\mathbf{y}}) = \texttt{[target]}]} \log(\color{#b49} \widehat{\mathbf{y}}\color{#000}), $$ where \( \mathbf{1}_{[\cdot]} \) is the indicator function and \( \text{sg}(\cdot) \) denotes the stop-gradient operator.
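One plausible reading of \( \mathcal{L}_{SCE} \) is a cross-entropy that selects the [target] entry of the output distribution, i.e., the negative log-probability the VLM assigns to the [target] token at the answer position, with the indicator held constant (stop-gradient). A minimal PyTorch sketch under that assumption:

import torch
import torch.nn.functional as F

def sce_loss(logits: torch.Tensor, target_token_id: int) -> torch.Tensor:
    # logits: (B, V) logits of the answer position over a fixed-length vocabulary V.
    y_hat = logits.softmax(dim=-1)                                  # y_hat = s(Phi(x_hat))
    indicator = F.one_hot(torch.full((logits.size(0),), target_token_id),
                          num_classes=logits.size(-1)).to(y_hat.dtype)  # 1[. = [target]], constant
    return -(indicator * torch.log(y_hat + 1e-8)).sum(dim=-1).mean()

Gradients flow back through the logits into \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \), while the indicator contributes no gradient, mirroring \( \text{sg}(\cdot) \) in the definition.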

Base Feature Loss. To align synthesized images with the internal representations of the vision encoder, we extract per-layer features \( \widehat{\mathbf{z}}_l = \mathcal{E}_{\theta_e}(\color{#B49}\widehat{\mathbf{v}}\color{#000}, l) \) for layers \( l \in \Lambda = \{1,\dots,L\} \) and compute \( \mathcal{L}_{\text{base}} \) by matching their mean and variance against pre-computed reference target statistics \( \bar{\mathbf{Z}}_l \).
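A minimal sketch of such a statistics-matching loss, assuming an L2 distance between per-layer means and variances (the paper's exact distance and reduction may differ):

import torch

def base_feature_loss(feats, target_stats):
    # feats:        list of per-layer features z_hat_l = E_theta_e(v_hat, l), each (B, D, Omega)
    # target_stats: list of (mu_ref, var_ref) pairs pre-computed from reference features Z_bar_l
    loss = torch.zeros(())
    for z_hat, (mu_ref, var_ref) in zip(feats, target_stats):
        mu = z_hat.mean(dim=(0, 1))                      # channel-wise mean over batch and tokens
        var = z_hat.var(dim=(0, 1), unbiased=False)      # channel-wise variance over batch and tokens
        loss = loss + (mu - mu_ref).pow(2).mean() + (var - var_ref).pow(2).mean()
    return loss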

Triplet Feature Regularizer. We enhance the optimization with three regularizers inspired by vision-only inversion methods: a patch regularizer \( \mathcal{R}_{\text{patch}} \) that smooths transitions across patch boundaries, a composite image prior \( \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) \) that combines total variation (TV) and \( \ell_2 \)-norm penalties for smoothness and range control, and a feature distribution regularizer \( \mathcal{R}_V \) that encourages alignment with BN feature statistics from a verifier network \( \mathcal{F} \). The aggregated regularization objective becomes: $$ \mathcal{R}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) = \beta_1 \mathcal{R}_V(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \beta_2 \mathcal{R}_{\text{patch}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}), $$ where \( \beta_1, \beta_2, \alpha_1, \alpha_2, \alpha_3 \) are scaling hyperparameters.
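The sketch below illustrates the three regularizers under common choices from vision-only inversion work: DeepInversion-style BatchNorm matching for \( \mathcal{R}_V \), TV plus \( \ell_2 \) for \( \mathcal{R}_{\text{prior}} \), and a patch-boundary smoothness term for \( \mathcal{R}_{\text{patch}} \). The weights and the exact form of \( \mathcal{R}_{\text{patch}} \) are illustrative assumptions, not the paper's definitions.

import torch
import torch.nn as nn

def tv_l2_prior(v, alpha_tv=1e-4, alpha_l2=1e-5):
    # R_prior: total variation + l2-norm penalties for smoothness and range control.
    tv = (v[..., 1:, :] - v[..., :-1, :]).abs().mean() + (v[..., :, 1:] - v[..., :, :-1]).abs().mean()
    return alpha_tv * tv + alpha_l2 * v.pow(2).mean()

def patch_regularizer(v, patch=14):
    # R_patch: penalize discontinuities across ViT patch boundaries, i.e. every `patch` pixels.
    # Assumes H and W are divisible by `patch`.
    rows = v[..., patch::patch, :] - v[..., patch - 1:-1:patch, :]
    cols = v[..., :, patch::patch] - v[..., :, patch - 1:-1:patch]
    return rows.abs().mean() + cols.abs().mean()

class BNStatsHook:
    # R_V helper: hook a BatchNorm2d layer of the verifier F and record the mismatch between
    # the batch statistics of the current input and the layer's running statistics.
    def __init__(self, module: nn.BatchNorm2d):
        self.loss = torch.zeros(())
        self.hook = module.register_forward_hook(self._hook)
    def _hook(self, module, inputs, output):
        x = inputs[0]
        mu, var = x.mean(dim=(0, 2, 3)), x.var(dim=(0, 2, 3), unbiased=False)
        self.loss = (mu - module.running_mean).pow(2).mean() + (var - module.running_var).pow(2).mean()

def regularizer(v, bn_hooks, beta_1=1e-2, beta_2=1e-3):
    # Aggregated R(v_hat) = beta_1 * R_V + beta_2 * R_patch + R_prior.
    # The verifier F must be run on v before this call so that each hook holds current statistics.
    r_v = sum(h.loss for h in bn_hooks)
    return beta_1 * r_v + beta_2 * patch_regularizer(v) + tv_l2_prior(v)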

Aggregated Optimization Objective. We iteratively update \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \) over steps \( i \rightarrow i+1 \) by minimizing the total objective: $$ \color{#B49}\widehat{\mathbf{v}}\color{#000}_{i+1} = \arg\min_{\widehat{\mathbf{v}}_i} \gamma_1 \mathcal{L}_{SCE} \left(s(\Phi_{\theta_\phi}([\mathcal{G}(\mathbf{t}), \mathcal{E}_{\theta_e}(\color{#b49} \widehat{\mathbf{v}}\color{#000})]))\right) + \gamma_2 \mathcal{L}_{\text{base}} + \mathcal{R}(\color{#b49} \widehat{\mathbf{v}}\color{#000}), $$ where \( \gamma_1, \gamma_2 \) are loss scaling factors. This objective guides reconstructions that reveal the internal VLM encodings, making their learned concepts visually interpretable.
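Putting the pieces together, a minimal optimization loop might look as follows. It reuses the hypothetical handles from the sketches above (v_hat, token_ids, sce_loss, base_feature_loss, regularizer, bn_hooks) and assumes functions vlm_forward and vision_features that run the frozen VLM and return answer-position logits and per-layer encoder features, plus a target_token_id and pre-computed target_stats; the gamma weights, learning rate, and step count are illustrative.

import torch

gamma_1, gamma_2 = 1.0, 0.1                                # illustrative loss scaling factors
optimizer = torch.optim.Adam([v_hat], lr=0.05)             # only the input image v_hat is updated

for step in range(2000):                                   # iterative updates i -> i+1
    optimizer.zero_grad()
    logits = vlm_forward(token_ids, v_hat)                 # logits of Phi([G(t), E(v_hat)])
    loss = (gamma_1 * sce_loss(logits[:, -1, :], target_token_id)
            + gamma_2 * base_feature_loss(vision_features(v_hat), target_stats)
            + regularizer(v_hat, bn_hooks))
    loss.backward()
    optimizer.step()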



Results


Fig. 2: Image and semantic quality across accuracy, perceptual (IS, LPIPS, FID), and semantic (CLIPScore) metrics. The aggregated objective performs best overall, with each component contributing to improved alignment or image quality. Bold and underlined indicate top and second-best results.


We invert LLaVA-1.5, which consists of a CLIP ViT-L/14 vision encoder and a LLaMA-3-8B-Instruct language model.
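For reference, LLaVA-1.5 weights can be loaded and frozen through the HuggingFace transformers interface as below. Note that the public llava-hf/llava-1.5-7b-hf checkpoint pairs CLIP ViT-L/14 with Vicuna-7B; the LLaMA-3-8B-Instruct pairing used here would require the corresponding custom checkpoint.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"                       # public LLaVA-1.5 checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()
for p in model.parameters():                                # freeze the VLM; only v_hat is optimized
    p.requires_grad_(False)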

Baseline results across configurations show that the aggregated objective achieves the best balance across metrics, with 41.25% Top-1 accuracy and a CLIPScore of 27.42.

Results improve across semantic, perceptual, and distributional metrics with each added objective component. Additional experiments with varying output lengths \( |\color{#b49} \widehat{\mathbf{y}}\color{#000}| \) demonstrate that image quality remains visually consistent, confirming the robustness of our approach.




Citation



@article{jain2025mimic,
  title     = {MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization},
  author    = {Jain, Animesh and Stergiou, Alexandros},
  year      = {2025},
  journal   = {arXiv}
}


Contact


For questions about MIMIC, send an email to:

 animesh.jain1203@gmail.com