An image is just a sequence of patches
Vision Transformers don't process images differently from how text models process tokens: they cut each image into fixed-size patches, flatten each patch, and feed the result in as a token sequence. Same transformer block, same self-attention; only the input embedding differs.
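A minimal sketch of the patchify step in NumPy (the function name `patchify` and the 224×224, 16-pixel-patch sizes are illustrative ViT-style choices, not from the note):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened patches of shape
    (num_patches, patch_size * patch_size * C) -- the ViT token sequence."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # Carve the grid of patches, then flatten each patch into one vector.
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))   # a standard ViT input resolution
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

After this step, each 768-dim patch vector is linearly projected to the model width and given a position embedding, and from there the sequence is indistinguishable in shape from a batch of text tokens.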
Anchored to 12-multimodal/vision-language-models.