Synthetic captions are widely used to train and scale multimodal systems, yet different captioning models may embed distinct stylistic signatures into their outputs. Do these idiosyncrasies propagate into the images produced by downstream text-to-image (T2I) models? We investigate this question with a simple "name-that-model" attribution framework.
We collect 30k captions per model from Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o, and Qwen3-VL across images sampled from CC3M, COCO, ImageNet, and MNIST. A fine-tuned BERT classifier performs text attribution; a ResNet-18 classifier (trained for 300 epochs with Mixup + CutMix) performs image attribution on outputs from SD 1.5, SD 2.1, SDXL, and Flux-schnell. We find a striking asymmetry:
- Text Attribution: 99.70%. Near-perfect across all four captioning models.
- Image Attribution (Flux): 49.85%. The best among all T2I models tested.
- 3-way Chance: 33.3%. Image attribution is only modestly above random.
- Natural Images: 76.7%. The same classifier applied to real images; the gap is generation-specific.
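To make the text-attribution setup concrete, here is a lightweight stand-in for the fine-tuned BERT classifier: a nearest-centroid model over character trigram profiles. It is a sketch only, far weaker than BERT, but it shows the "name-that-model" pipeline end to end (the toy model names and captions below are illustrative, not from the study).

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram counts as a cheap stylistic fingerprint."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def centroid(captions):
    """Average, normalized n-gram profile for one captioning model."""
    total = Counter()
    for c in captions:
        total.update(ngrams(c))
    norm = sum(total.values()) or 1
    return {g: v / norm for g, v in total.items()}

def cosine(a, b):
    dot = sum(a[g] * b.get(g, 0.0) for g in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def attribute(caption, centroids):
    """Name-that-model: pick the stored profile closest to the caption."""
    profile = centroid([caption])
    return max(centroids, key=lambda m: cosine(profile, centroids[m]))
```

A real implementation would replace the trigram profiles with BERT fine-tuning, but the interface, captions in, model name out, is the same.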
Ablation studies confirm the finding is robust: adding the original images as a fourth class yields 82.1% accuracy on natural images but only around 50% on generated ones, and switching to CLIP features with a linear probe gives similarly low results (41–46%), ruling out classifier architecture as the explanation.
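The linear-probe ablation can be sketched schematically. The snippet below trains a simple multi-class perceptron as a stand-in for the probe; the frozen features are assumed to be precomputed elsewhere (e.g. CLIP image embeddings), and the extraction step is outside this sketch.

```python
import random

def train_linear_probe(data, epochs=20, lr=0.1, seed=0):
    """Multi-class perceptron over frozen feature vectors.
    `data` is a list of (feature_vector, label) pairs; the features are
    assumed to be precomputed embeddings."""
    rng = random.Random(seed)
    samples = list(data)
    labels = sorted({y for _, y in samples})
    dim = len(samples[0][0])
    w = {y: [0.0] * dim for y in labels}
    b = {y: 0.0 for y in labels}

    def score(c, x):
        return sum(wi * xi for wi, xi in zip(w[c], x)) + b[c]

    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            pred = max(labels, key=lambda c: score(c, x))
            if pred != y:  # perceptron rule: update only on mistakes
                for i, xi in enumerate(x):
                    w[y][i] += lr * xi
                    w[pred][i] -= lr * xi
                b[y] += lr
                b[pred] -= lr
    return lambda x: max(labels, key=lambda c: score(c, x))
```

A probe this simple can only read out whatever linear structure the frozen features already contain, which is exactly why its low accuracy points at the features (and hence the generation process) rather than the classifier.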
T2I text encoders preserve nearly all stylistic signal (T5: 99.74%, CLIP: 94.14%), so the bottleneck lies in the generation process itself. We trace the loss to three dimensions where captions diverge but images converge:
Detail. Captions show a clear hierarchy: Gemini is ranked most detailed in 84% of cases and GPT least detailed in 72%. After generation, however, the ranking nearly reverses: all three models produce images with similar levels of detail, and GPT's images are even judged slightly richer.
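One way to turn pairwise "which caption is more detailed?" judgments into rankings like those above is a simple win-rate tally. This is a sketch of the aggregation step only; the judge producing the (winner, loser) pairs, e.g. an LLM comparator, is outside the snippet.

```python
from collections import Counter

def detail_win_rates(judgments):
    """Aggregate pairwise judgments into per-model win rates.
    `judgments` is a list of (winner, loser) model-name pairs,
    one per comparison."""
    wins, comparisons = Counter(), Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    return {m: wins[m] / comparisons[m] for m in comparisons}
```

Running the same tally once on captions and once on the generated images is what exposes the reversal.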
Color and texture. Gemini uses 5.18 color terms per caption versus GPT's 2.09, with similar gaps in nuanced texture vocabulary. Yet these pronounced lexical differences do not translate into proportionate visual separability: T2I models normalize fine-grained color and texture instructions.
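A color-terms-per-caption statistic can be approximated with a simple lexicon count. The lexicon below is an illustrative subset, not the vocabulary used in the study.

```python
# Illustrative color lexicon (a small subset for demonstration).
COLOR_TERMS = {
    "red", "crimson", "scarlet", "blue", "azure", "teal", "green",
    "emerald", "olive", "yellow", "golden", "amber", "purple", "violet",
    "orange", "brown", "beige", "gray", "grey", "black", "white", "ivory",
}

def color_term_count(caption):
    """Count color words in a caption (simple whitespace tokenization)."""
    tokens = caption.lower().replace(",", " ").replace(".", " ").split()
    return sum(t in COLOR_TERMS for t in tokens)
```

Averaging `color_term_count` over each model's captions yields the per-caption figures compared above.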
Composition. Claude leads in compositional cues: 93% of its captions mention spatial layers and 87% reference guiding elements, while Gemini and GPT score lower. Yet per-class image accuracy remains uniformly low, implying that composition instructions are partially lost by the generator.
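Compositional-cue statistics of this kind can likewise be approximated with keyword matching. The cue lists here are illustrative guesses, not the study's annotation scheme.

```python
# Illustrative cue phrases (hypothetical, for demonstration only).
SPATIAL_LAYER_CUES = ("foreground", "middle ground", "background",
                      "behind", "in front of")
GUIDING_ELEMENT_CUES = ("leads the eye", "draws the eye",
                        "leading line", "focal point")

def mentions_any(caption, cues):
    """True if the caption contains at least one cue phrase."""
    text = caption.lower()
    return any(cue in text for cue in cues)

def cue_share(captions, cues):
    """Fraction of captions mentioning at least one cue."""
    return sum(mentions_any(c, cues) for c in captions) / len(captions)
```

Computing `cue_share` per model gives the kind of percentages quoted above (e.g. the share of captions mentioning spatial layers).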
Each row below shows an original image; the three captions generated by Claude, Gemini, and GPT; and the corresponding synthesized images. Despite substantial differences in the captions, the generated images converge, revealing several systematic failure modes:
BibTeX coming soon.