demo

An image is just a sequence of patches

Vision Transformers don't see images differently from text models: they cut an image into fixed-size patches, flatten each patch, and feed the resulting sequence as tokens (after a linear projection into the embedding dimension, plus positional embeddings). Same transformer block, same self-attention, different input shape.
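A minimal sketch of the patchify step, using NumPy reshapes (the function name and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Cut an (H, W, C) image into a sequence of flattened patches.

    Returns shape (num_patches, patch_size * patch_size * C):
    one "token" per patch, ready for a linear projection into
    the transformer's embedding dimension.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    # (h, p, w, p, C) -> (h, w, p, p, C) -> (h*w, p*p*C)
    patches = image.reshape(h, patch_size, w, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(h * w, patch_size * patch_size * C)

# A 224x224 RGB image with 16x16 patches becomes 196 tokens of length 768.
img = np.zeros((224, 224, 3))
print(patchify(img, 16).shape)  # (196, 768)
```

The 196x768 result matches the standard ViT-Base setup: each 16x16x3 patch is one token, and the sequence length is just (224/16)^2.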

Anchored to 12-multimodal/vision-language-models.