MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain, Alexandros Stergiou

University of Twente, NL

Paper ArXiv GitHub


Abstract


Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.



Overview


VLM Inversion Overview
Fig. 1: MIMIC inversion pipeline and synthesized outputs for tiger shark, castle, cassette player, abacus, car wheel, bald eagle, bakery, academic gown, accordion, carousel, ambulance, and leatherback turtle. A noisy input is optimized using an aggregated loss that combines adapted cross-entropy \( \mathcal{L}_{\mathrm{SCE}} \), base feature alignment \( \mathcal{L}_{\mathrm{base}} \), and regularizers \( \mathcal{R} \) to ensure semantic fidelity and visual coherence.


MIMIC Framework


Method. We initialize an updatable input \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \in \mathbb{R}^{C \times H \times W} \) with \(C\) channels, \(H\) height, and \(W\) width. As VLMs can respond to queries given a multimodal context window, we include a text prompt template \( \mathbf{t} \) alongside our updatable input. For example: What is shown in the picture: a. [target] concept, or b. [negative] concept. Text is tokenized by \( \mathcal{G}(\mathbf{t}) \) into a sequence of embeddings. Similarly, vision input \( \color{#B49}\widehat{\mathbf{v}} \) is encoded by \( \mathcal{E} \) to embeddings \( \mathcal{E}(\color{#B49}\widehat{\mathbf{v}}\color{#000};\theta_e) \in \mathbb{R}^{D \times \Omega} \), with \( D \) image tokens and \(\Omega\) channels. Both are combined into a concatenated input: \( \color{#b49} \widehat{\mathbf{x}}\color{#000} = [ \mathcal{G}(\mathbf{t}), \mathcal{E}(\color{#B49}\widehat{\mathbf{v}}\color{#000};\theta_e) ] \)
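The construction of the concatenated input \( \widehat{\mathbf{x}} = [\mathcal{G}(\mathbf{t}), \mathcal{E}(\widehat{\mathbf{v}};\theta_e)] \) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `tokenize_text` and `encode_image` are random stand-ins for the real tokenizer \( \mathcal{G} \) and vision encoder \( \mathcal{E} \), and the dimensions `T`, `D`, `OMEGA` are illustrative.

```python
import numpy as np

# Illustrative dimensions: T text tokens, D image tokens, Omega channels.
T, D, OMEGA = 12, 16, 64

def tokenize_text(prompt, t=T, omega=OMEGA):
    """Stand-in for G(t): map a prompt to a sequence of t text embeddings."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal((t, omega))

def encode_image(v, d=D, omega=OMEGA):
    """Stand-in for E(v; theta_e): project pixels to D x Omega image tokens."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((v.size, d * omega)) / np.sqrt(v.size)
    return (v.reshape(-1) @ w).reshape(d, omega)

# Updatable input v_hat in R^{C x H x W}, here a small 3 x 8 x 8 image.
v_hat = np.random.default_rng(1).standard_normal((3, 8, 8))
prompt = ("What is shown in the picture: "
          "a. [target] concept, or b. [negative] concept.")

# x_hat = [G(t), E(v_hat; theta_e)] -- concatenated along the token axis.
x_hat = np.concatenate([tokenize_text(prompt), encode_image(v_hat)], axis=0)
```

Only \( \widehat{\mathbf{v}} \) (and therefore its image tokens) receives gradient updates; the text embeddings stay fixed throughout optimization.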

VLM Inversion. The backbone LLM \( \Phi(\cdot;{\theta_\phi}) \), with frozen parameters \(\theta_\phi\), infers \( \color{#b49} \widehat{\mathbf{x}}\color{#000} \) and returns a probability distribution of token logits. Each logit is defined over a fixed-length dictionary: \( \color{#b49} \widehat{\mathbf{y}}\color{#000}_i = \Phi(\color{#b49} \widehat{\mathbf{x}}\color{#000},\color{#b49} \widehat{\mathbf{y}}_{< i}\color{#000};\theta_\phi) \), where \( \color{#b49} \widehat{\mathbf{y}}_{< i}\color{#000} \) are the previously generated \(i-1\) logits.

We define an adapted cross-entropy loss \( \mathcal{L}_{SCE} \), given the token index with the highest logit for [target], as: $$ \mathcal{L}_{SCE}(\color{#b49} \widehat{\mathbf{y}}\color{#000}) = - \sum_i \mathbf{1}(\text{sg}(\widehat{\mathbf{y}}),i, \texttt{[target]}) \log(\color{#b49} \widehat{\mathbf{y}}_i\color{#000}), $$ where \( \mathbf{1}(\text{sg}(\widehat{\mathbf{y}}),i,\texttt{[target]}) \) is the indicator function and \( \text{sg}(\cdot) \) denotes stop-gradient.
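A single-step sketch of this loss, under simplifying assumptions (one generation position, the indicator reduced to selecting the [target] vocabulary index, which is chosen on stop-gradient logits), could look like:

```python
import numpy as np

def sce_loss(logits, target_id):
    """Simplified single-position sketch of L_SCE: the loss is the negative
    log-probability the model assigns to the [target] token. In autograd
    frameworks the index selection would be done on detached (stop-gradient)
    logits so no gradient flows through it."""
    z = logits - logits.max()               # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_id]

logits = np.array([0.5, 2.0, -1.0, 0.1])    # toy vocabulary of 4 tokens
loss = sce_loss(logits, target_id=1)
```

Minimizing this quantity with respect to the input pixels pushes the VLM toward answering with the [target] concept.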

Base Feature Loss. To align synthesized images with the internal representations of the vision encoder, we extract per-layer features \( \color{#B49} \widehat{\mathbf{z}}\color{#000}_l = \mathcal{E}(\color{#B49}\widehat{\mathbf{v}}\color{#000};\theta_e,< l) \). We approximate the feature manifold's mean \( \mu(\mathcal{Z}_l) \) and variance \( \sigma(\mathcal{Z}_l) \) from \( \texttt{[target]} \) images given \( \theta_{\mathcal{E},l} \) weights, across layers \( l \in \Lambda = \{1,\dots,L\} \). The base feature loss \( \mathcal{L}_{\text{base}} \) then penalizes deviations of the statistics of \( \color{#B49} \widehat{\mathbf{z}}\color{#000}_l \) from these estimates.
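A statistics-matching loss of this kind can be sketched as below. This is an assumed form (squared deviation of per-layer mean and variance), not the exact loss from the paper; the layer features and target statistics are random placeholders.

```python
import numpy as np

def base_feature_loss(z_hat_layers, target_stats):
    """Sketch of L_base: penalize deviation of per-layer feature statistics
    of the synthesized input from (mu, sigma^2) estimated on [target] images."""
    loss = 0.0
    for z, (mu, var) in zip(z_hat_layers, target_stats):
        loss += np.sum((z.mean(axis=0) - mu) ** 2)   # mean alignment
        loss += np.sum((z.var(axis=0) - var) ** 2)   # variance alignment
    return loss

rng = np.random.default_rng(0)
# 3 layers of toy features: 32 tokens x 8 channels each.
z_layers = [rng.standard_normal((32, 8)) for _ in range(3)]
# Target statistics that match exactly -> loss is zero.
stats = [(z.mean(axis=0), z.var(axis=0)) for z in z_layers]
```

When the synthesized image's features drift away from the [target] manifold, both terms grow, pulling the optimization back toward realistic encoder activations.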

Regularizers. We enhance optimization with three regularizers inspired by vision-only inversion methods. A patch regularizer \( \mathcal{R}_{\text{patch}} \) smooths transitions across patch boundaries; a composite image prior \( \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) \) includes total variation (TV) and \( \ell_2 \)-norm penalties for smoothness and range control; and a feature distribution regularizer \( \mathcal{R}_V \) encourages alignment with BN feature statistics from a verifier network \( \mathcal{F} \). The aggregated regularization objective becomes: $$ \mathcal{R}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) = \beta_1 \mathcal{R}_V(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \beta_2 \mathcal{R}_{\text{patch}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}) + \mathcal{R}_{\text{prior}}(\color{#B49}\widehat{\mathbf{v}}\color{#000}), $$ where \( \beta_1, \beta_2 \) and the \( \alpha_1, \alpha_2, \alpha_3 \) within \( \mathcal{R}_{\text{prior}} \) are scaling hyperparameters.
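The TV-plus-\( \ell_2 \) image prior is standard in inversion work and can be sketched as follows; the `alpha` coefficients are illustrative placeholders, not the paper's values.

```python
import numpy as np

def r_prior(v, alpha_tv=1e-4, alpha_l2=1e-6):
    """Sketch of R_prior: total variation (TV) for smoothness plus an
    l2-norm penalty for range control. v has shape (C, H, W)."""
    tv = (np.abs(np.diff(v, axis=-2)).sum()    # vertical neighbor differences
          + np.abs(np.diff(v, axis=-1)).sum()) # horizontal neighbor differences
    return alpha_tv * tv + alpha_l2 * np.square(v).sum()

rng = np.random.default_rng(0)
noisy = rng.standard_normal((3, 16, 16))   # pure noise: high TV
smooth = np.ones((3, 16, 16))              # constant image: zero TV
```

The TV term penalizes high-frequency noise (large neighbor differences), while the \( \ell_2 \) term keeps pixel magnitudes in a bounded range.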

Aggregated Optimization Objective. We iteratively update \( \color{#B49}\widehat{\mathbf{v}}\color{#000} \) over steps \( s \rightarrow s+1 \) by minimizing the total objective: $$ \color{#B49}\widehat{\mathbf{v}}\color{#000}^{s+1} = \arg\min_{\widehat{\mathbf{v}}^s} \gamma_1 \mathcal{L}_{SCE} \left(\Phi([\mathcal{G}(\mathbf{t}), \mathcal{E}(\color{#b49} \widehat{\mathbf{v}}\color{#000};\theta_e)];\theta_\phi)\right) + \gamma_2 \mathcal{L}_{\text{base}} + \mathcal{R}(\color{#b49} \widehat{\mathbf{v}}\color{#000}), $$ where \( \gamma_1, \gamma_2 \) are loss scaling factors. This objective guides reconstructions that reveal internal VLM encodings, making their learned concepts visually interpretable.
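The iterative update \( s \rightarrow s+1 \) is plain gradient descent on the input. As a toy sketch (the quadratic objective and learning rate below are stand-ins; in practice `grad_fn` would be autograd through the frozen VLM, losses, and regularizers):

```python
import numpy as np

def inversion_step(v, grad_fn, lr=0.1):
    """One update s -> s+1: descend the gradient of the total objective
    with respect to the updatable input v, keeping all model weights frozen."""
    return v - lr * grad_fn(v)

# Toy stand-in objective ||v - c||^2, pulling v toward a 'concept' template c.
c = np.full((3, 4, 4), 0.5)
grad_fn = lambda v: 2.0 * (v - c)

v = np.zeros_like(c)              # blank initialization of v_hat
for _ in range(100):              # iterate s = 0, ..., 99
    v = inversion_step(v, grad_fn)
```

Only the input pixels change across steps; the tokenizer, vision encoder, and LLM parameters all stay fixed, so the converged image reflects what the frozen model already encodes.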



Results


We invert visual-instruct-tuned LLaMA3-8B, Mistral-7B, and Vicuna-7/13B. All models are frozen and run inference only; the updatable input \( \hat{\mathbf{v}} \in \mathbb{R}^{3 \times 448 \times 448} \) is initialized from a Gaussian \( \hat{\mathbf{v}} \sim \mathcal{N}(0,1) \).


Additional qualitative examples of synthesized features across target tokens
Qualitative examples of synthesized features across [target] tokens. MIMIC synthesizes coherent features across models with various target tokens. Descriptive VLM features learned for target semantics are often based on distinct shapes such as the examples for [airliner] and [offshore rig]. Positive correlations between materials and colors are also learned for instances such as [school bus], [dome], and [minivan].

Ablation studies. We further ablate the text prompt template \( \mathbf{t} \) used to optimize the vision tokens \( \hat{\mathbf{v}} \). MIMIC robustly visualizes the main learned features, such as water reflections in [dock] and the dial plate in [magnetic compass], and remains consistent across prompt variations.

Synthesized images over varying text prompts
Synthesized images over varying text prompts for dock, magnetic compass, and obelisk. \( \mathbf{t}_1 \): What is shown in the image? a.[target] or b.[negative], \( \mathbf{t}_2 \): Does the image show an instance of [target] or [negative]?, and \( \mathbf{t}_3 \): The image depicts a scene that corresponds to [target] or [negative]?



Citation



@article{jain2025mimic,
  title     = {MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization},
  author    = {Jain, Animesh and Stergiou, Alexandros},
  year      = {2025},
  journal   = {arXiv}
}


Contact


For questions about MIMIC, send an email to:

 animesh.jain1203@gmail.com