Asymmetric Idiosyncrasies in Multimodal Models

1University of Southern California, 2New York University
{muzitao, xuezhema}@usc.edu

Abstract

Synthetic captions are widely used to train and scale multimodal systems, yet different captioning models may embed distinct stylistic signatures into their outputs. Do these idiosyncrasies propagate into the images produced by downstream text-to-image (T2I) models? We investigate with a simple "name-that-model" attribution framework.

We find a striking asymmetry:

  • In text: caption source is identifiable at 99.70% accuracy, and fingerprints survive paraphrasing (over 95%) and T2I text encoders (T5: 99.74%, CLIP: 94.14%).
  • In images: the best attribution drops to only 49.85% with Flux-schnell, only modestly above the 33.3% chance level, while the same classifier reaches 76.7% on natural images.
  • Root cause: T2I models fail to faithfully realize fine-grained caption differences in detail level, color/texture vocabulary, and scene composition.
Attribution pipeline overview
Multiple MLLMs caption the same image; a text classifier reliably attributes each caption to its source. Those captions then drive a T2I generator, but an image classifier fails the same task, revealing that caption-space signatures do not survive image generation.

Key Results

We collect 30k captions per model from Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o, and Qwen3-VL across images sampled from CC3M, COCO, ImageNet, and MNIST. A fine-tuned BERT classifier performs text attribution; a ResNet-18 classifier (trained for 300 epochs with Mixup + CutMix) performs image attribution on outputs from SD 1.5, SD 2.1, SDXL, and Flux-schnell.

  • Text Attribution: 99.70%. Near-perfect across all four captioning models.
  • Image Attribution (Flux): 49.85%. Best among all T2I models tested.
  • 3-way Chance: 33.3%. Image attribution is only modestly above random.
  • Natural Images: 76.7%. The same classifier on real images; the gap is generation-specific.

Ablation studies confirm the finding is robust: adding the original images as a fourth class yields 82.1% accuracy on natural images but only around 50% on generated ones, and switching to frozen CLIP features with a linear probe gives similarly low results (41–46%), ruling out classifier architecture as the explanation.
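The linear-probe ablation amounts to freezing a feature extractor and fitting a linear classifier on top. A minimal sketch with scikit-learn, where random vectors stand in for precomputed CLIP image embeddings (the 512-dimensional feature size and sample counts are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins for precomputed CLIP embeddings of generated images.
features = rng.standard_normal((600, 512)).astype(np.float32)
labels = rng.integers(0, 3, size=600)  # caption-source label per image

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)  # near chance on these random features
```

With real CLIP features, the probe's held-out accuracy is the number reported in the ablation; a low value here indicates the signal is absent from the features rather than lost by the classifier.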

Classification performance on generated images
Image attribution accuracy across four T2I models. Even the best (Flux-schnell) only reaches around 50%.

Why The Gap Exists

T2I text encoders preserve nearly all stylistic signal (T5: 99.74%, CLIP: 94.14%), so the bottleneck lies in the generation process itself. We trace the loss to three dimensions where captions diverge but images converge:

Detail Level

Captions show a clear detail hierarchy: Gemini is ranked most detailed in 84% of cases and GPT least detailed in 72%. But after generation the hierarchy collapses: all three models produce images with similar detail, and GPT's images are even judged slightly richer.

Color & Texture

Gemini uses 5.18 color terms per caption while GPT uses only 2.09, with similar gaps in nuanced texture vocabulary. Yet these pronounced lexical differences do not yield proportionate visual separability. T2I models normalize fine-grained color and texture instructions.
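The color-vocabulary statistic can be reproduced with a simple lexicon count. The lexicon below is a small illustrative subset, not the list used in the paper:

```python
import re

# Illustrative color lexicon (an assumed subset, not the paper's list).
COLOR_TERMS = {
    "red", "blue", "green", "yellow", "orange", "purple", "teal",
    "crimson", "azure", "amber", "beige", "turquoise", "ochre",
}

def count_color_terms(caption: str) -> int:
    """Count lexicon hits among the lowercase word tokens of a caption."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return sum(tok in COLOR_TERMS for tok in tokens)

def mean_color_terms(captions: list[str]) -> float:
    """Average color-term count per caption across a model's outputs."""
    return sum(map(count_color_terms, captions)) / len(captions)
```

Averaging over each model's 30k captions yields the per-caption figures quoted above (e.g. 5.18 vs. 2.09); the same counting scheme extends to texture and composition vocabularies.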

Composition

Claude leads in compositional cues: 93% of its captions mention spatial layers, 87% reference guiding elements. Gemini and GPT score lower. But per-class image accuracy remains uniformly low, implying composition instructions are partially lost by the generator.

Detail-level rankings of captions vs generated images
Detail-level ranking on captions (left) vs. generated images (right). Captions exhibit a clear model hierarchy (Gemini dominates "most detailed"), but images show near-uniform distributions across all three models.
Per-class classification accuracy for generated images
Per-class image attribution accuracy across four T2I models. Despite large differences in color, texture, and composition vocabulary, all caption sources remain hard to distinguish in the image domain.

Qualitative Examples

Each row below shows an original image, the three captions generated by Claude, Gemini, and GPT, and the corresponding synthesized images. Despite substantial differences in the captions, the generated images reveal several systematic failures:

  • Color: a simple term like "blue" without texture specification leads all models to produce similar color effects, none matching the true darker tone in the original.
  • Viewpoint: explicit view descriptions are inconsistently realized. A "high-angle" caption may yield an eye-level rendering, and vice versa.
  • Appearance: different color terms for the same object still produce visually similar images that diverge from the original.
Qualitative comparison of captions and generated images across models
Captions differ substantially across models in color, viewpoint, and detail, yet generated images look unexpectedly similar.

BibTeX

BibTeX coming soon.