CLIP

Contrastive Language-Image Pre-training

🎨 What is CLIP?

CLIP (Contrastive Language-Image Pre-training) is a multimodal model from OpenAI that learns to connect images and text in a shared embedding space.

Key Innovation:

CLIP is trained on 400 million (image, text) pairs from the internet using contrastive learning. The result: a model that can perform zero-shot classification without any additional training!

💡 Why is CLIP important?

  • 🎯 Zero-shot transfer: classify images without training examples
  • 🌍 Multimodal understanding: bridges vision & language
  • ⚡ Flexible: text prompts act as classifiers
  • 🚀 Foundation model: the basis for DALL-E, Stable Diffusion

🎯 What You Will Learn

  • 🔗 Contrastive Learning: InfoNCE loss and pairing
  • 🏗️ Dual Encoders: image & text encoders
  • 📊 Training: similarity matrix & loss
  • 🎯 Zero-Shot: classification without examples

🌈 Multimodal Duality

๐Ÿ–ผ๏ธ Vision

Images processed dengan Vision Transformer (ViT) atau ResNet

Output: 512-dim embedding vector

๐Ÿ“ Language

Text processed dengan Transformer encoder

Output: 512-dim embedding vector

Both encoders are aligned in a shared embedding space, so an image and its matching text have similar embeddings!
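
Concretely, "similar" means high cosine similarity between the two 512-dim vectors. A minimal sketch with random stand-in tensors (the real vectors would come from the encoders):

import torch
import torch.nn.functional as F

# Stand-in embeddings; in CLIP these come from the image and text encoders
image_embed = F.normalize(torch.randn(512), dim=-1)
text_embed = F.normalize(torch.randn(512), dim=-1)

# After L2 normalization, cosine similarity is just a dot product
similarity = image_embed @ text_embed  # scalar in [-1, 1]
print(similarity.item())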

Contrastive Learning

Learning by Comparison

🔗 The Principle of Contrastive Learning

Contrastive learning trains the model to distinguish:

  • ✅ Positive pairs: (image, matching caption) - should be close
  • ❌ Negative pairs: (image, non-matching caption) - should be far apart

๐Ÿ“ InfoNCE Loss

CLIP menggunakan InfoNCE (Noise Contrastive Estimation) loss:

L = -log(exp(sim(i,t)/ฯ„) / ฮฃj exp(sim(i,tj)/ฯ„))
sim(i,t) = (IยทT) / (||I|| ||T||) (cosine similarity)
ฯ„: temperature parameter (learnable)

Maximize similarity untuk positive pair, minimize untuk negatives
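
Here is that loss written out for the image-to-text direction, assuming a batch of already-normalized embeddings (all names here are illustrative):

import torch
import torch.nn.functional as F

N, D = 4, 512  # batch size, embedding dim
image_embeds = F.normalize(torch.randn(N, D), dim=-1)
text_embeds = F.normalize(torch.randn(N, D), dim=-1)
tau = 0.07  # temperature

# logits[i, j] = sim(image i, text j) / τ
logits = image_embeds @ text_embeds.T / tau

# The matching text for image i is text i, so targets are the diagonal;
# cross_entropy(logits, targets) computes exactly the InfoNCE term above
targets = torch.arange(N)
loss_i2t = F.cross_entropy(logits, targets)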

🎨 Example: batch of N=2 pairs (4 image-text combinations)

๐Ÿ• [Image: Golden retriever]
"a golden retriever dog"
โœ“ Positive pair (diagonal)
๐Ÿ• [Image: Golden retriever]
"a red sports car"
โœ— Negative pair (off-diagonal)
๐Ÿš— [Image: Sports car]
"a golden retriever dog"
โœ— Negative pair (off-diagonal)
๐Ÿš— [Image: Sports car]
"a red sports car"
โœ“ Positive pair (diagonal)


Dual Encoders

Image & Text Encoders

๐Ÿ—๏ธ CLIP Architecture

CLIP consists of two separate encoders that are trained jointly:

๐Ÿ–ผ๏ธ Image Encoder
Input: Image (224ร—224)
โ†“
Vision Transformer (ViT) or ResNet-50
โ†“
Linear projection
โ†“
L2 normalize
โ†“
Output: 512-dim embedding
๐Ÿ“ Text Encoder
Input: Text (max 77 tokens)
โ†“
Tokenization + embedding
โ†“
Transformer (12 layers)
โ†“
Linear projection
โ†“
L2 normalize
โ†“
Output: 512-dim embedding

โš™๏ธ Key Components

  • ๐Ÿ–ผ๏ธ Vision Transformer (ViT): Patch-based image processing
  • ๐Ÿ“ Text Transformer: Masked self-attention untuk text
  • ๐ŸŽฏ Projection heads: Map ke shared embedding space
  • ๐Ÿ“ L2 normalization: Ensure embeddings pada unit sphere


💡 Embedding Space

Both encoders produce embeddings in the same 512-dimensional space. Contrastive training ensures that:

  • ✅ Matched image-text pairs have high cosine similarity
  • ❌ Unmatched pairs have low similarity

Training Process

Symmetric Contrastive Loss

📊 Training Batch

CLIP is trained with very large batches (e.g., N=32,768):

  1. Sample a batch of N (image, text) pairs from the dataset
  2. Encode the images → N image embeddings (I₁, I₂, ..., I_N)
  3. Encode the texts → N text embeddings (T₁, T₂, ..., T_N)
  4. Compute the N×N similarity matrix
  5. Calculate the symmetric loss (sketched below)
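
The five steps, written as a minimal sketch with random stand-in embeddings (a small N=4 for readability):

import torch
import torch.nn.functional as F

N, D = 4, 512
image_embeds = F.normalize(torch.randn(N, D), dim=-1)  # steps 1-2 (stand-ins)
text_embeds = F.normalize(torch.randn(N, D), dim=-1)   # step 3 (stand-ins)

sim = image_embeds @ text_embeds.T    # step 4: N×N cosine similarity matrix
logits = sim / 0.07                   # divide by temperature τ

targets = torch.arange(N)             # positives sit on the diagonal
loss_i2t = F.cross_entropy(logits, targets)    # row-wise softmax
loss_t2i = F.cross_entropy(logits.T, targets)  # column-wise softmax
loss = (loss_i2t + loss_t2i) / 2      # step 5: symmetric loss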

🔥 Similarity Matrix

For a batch with N=4, the similarity matrix S shows the cosine similarity between all pairs:

       T₁      T₂      T₃      T₄
I₁    0.89    0.12    0.05    0.18
I₂    0.15    0.91    0.08    0.11
I₃    0.09    0.14    0.87    0.10
I₄    0.13    0.07    0.16    0.93

โ— Diagonal = positive pairs (high similarity)
โ— Off-diagonal = negative pairs (low similarity)

๐Ÿ“ Symmetric Loss

L = (L_Iโ†’T + L_Tโ†’I) / 2
L_Iโ†’T: image-to-text (row-wise softmax)
L_Tโ†’I: text-to-image (column-wise softmax)

Symmetric loss ensures bidirectional alignment!


⚡ Training Details

  • 📊 Dataset: 400M (image, text) pairs from the internet
  • 🎯 Batch size: 32,768 (very large!)
  • ⏱️ Training time: ~12 days on 256 V100 GPUs for the largest ViT (18 days on 592 V100s for the largest ResNet)
  • 🔧 Optimizer: AdamW with a cosine learning-rate schedule

Zero-Shot Classification

Classification Without Training Examples

🎯 What is Zero-Shot?

Zero-shot classification: the model can classify into classes it never saw during training!

How?

Use text prompts as classifiers. To classify an image into {dog, cat, car}:

  • Generate prompts: "a photo of a dog", "a photo of a cat", "a photo of a car"
  • Encode all prompts → text embeddings
  • Encode the image → image embedding
  • Compute the similarity with every class prompt
  • Argmax → predicted class! (full code in the Implementation section below)

📊 Example: Image Classification

Task: classify an image into 3 classes

Input Image
🐕
Golden Retriever
Class Prompts & Scores
"a photo of a dog"  92.5% ✓
"a photo of a cat"   4.8%
"a photo of a car"   2.7%


💡 Prompt Engineering

Prompt design strongly affects accuracy! Tips:

  • 🎯 Template: "a photo of a {class}" works well
  • 📝 Ensemble: use multiple prompts per class and average them (sketched below)
  • 🔍 Context: add context, e.g. "a photo of a {class}, a type of pet"
  • 🌍 Domain: adjust prompts for specific domains (medical, satellite)

Applications

CLIP Use Cases

🚀 CLIP Applications

CLIP has become a foundation model for a wide range of multimodal applications:

๐Ÿ” Image-Text Retrieval

Search images dengan query text, atau sebaliknya

Example: "Find sunrise beach photos" โ†’ retrieves matching images

๐ŸŽจ Text-to-Image Generation

CLIP guides generation models (DALL-E, Stable Diffusion)

Example: CLIP loss steers diffusion process to match prompt

๐Ÿ“Š Zero-Shot Classification

Classify tanpa training examples untuk new classes

Example: Classify medical images ke rare diseases

โ“ Visual Question Answering

Answer questions tentang image content

Example: "What color is the car?" โ†’ "Red"

๐Ÿท๏ธ Image Captioning

Generate descriptive captions untuk images

Example: Image โ†’ "A golden retriever playing in the park"

๐ŸŽฏ Object Detection

Open-vocabulary object detection dengan text queries

Example: Detect "person wearing red hat" tanpa training
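
A minimal retrieval sketch using the pre-trained clip package introduced in the Implementation section (the file names and the query are hypothetical):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical image gallery
paths = ["beach1.jpg", "beach2.jpg", "city.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    image_embeds = model.encode_image(images)
    image_embeds /= image_embeds.norm(dim=-1, keepdim=True)

    query = clip.tokenize(["a sunrise over the beach"]).to(device)
    text_embed = model.encode_text(query)
    text_embed /= text_embed.norm(dim=-1, keepdim=True)

# Rank gallery images by cosine similarity to the query
scores = (image_embeds @ text_embed.T).squeeze(1)
ranking = scores.argsort(descending=True)
print([paths[i] for i in ranking.tolist()])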

🌟 Notable Projects Using CLIP

  • 🎨 DALL-E 2: text-to-image generation with CLIP-guided diffusion
  • 🖼️ Stable Diffusion: open-source generation model using CLIP's text encoder
  • 🔍 OpenCLIP: open reproduction of CLIP with larger datasets
  • 🎬 Video understanding: extensions of CLIP to the video domain (CLIP4Clip)
  • 🏥 Medical imaging: zero-shot diagnosis with domain-specific prompts

💡 Why CLIP is Powerful

Key advantages:

  • ✅ No labeled data needed for new tasks
  • ✅ Flexible via text - just change prompts
  • ✅ Generalizes well across domains
  • ✅ Composable - combine with other models

Implementation

PyTorch Code

💻 CLIP Model Implementation

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIP(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder  # ViT or ResNet
        self.text_encoder = text_encoder    # Transformer
        
        # Projection heads
        self.image_proj = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.output_dim, embed_dim)
        
        # Learnable temperature
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1/0.07))
    
    def encode_image(self, images):
        # images: (batch, 3, 224, 224)
        image_features = self.image_encoder(images)
        image_embeds = self.image_proj(image_features)
        image_embeds = F.normalize(image_embeds, dim=-1)
        return image_embeds
    
    def encode_text(self, text):
        # text: (batch, max_length) token ids
        text_features = self.text_encoder(text)
        text_embeds = self.text_proj(text_features)
        text_embeds = F.normalize(text_embeds, dim=-1)
        return text_embeds
    
    def forward(self, images, texts):
        image_embeds = self.encode_image(images)  # (N, embed_dim)
        text_embeds = self.encode_text(texts)      # (N, embed_dim)
        
        # Scaled cosine similarity
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_embeds @ text_embeds.T  # (N, N)
        logits_per_text = logits_per_image.T
        
        return logits_per_image, logits_per_text

🎓 Training Loop

def train_clip(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0.0
    
    for images, texts in dataloader:
        images = images.to(device)
        texts = texts.to(device)
        
        # Forward pass
        logits_per_image, logits_per_text = model(images, texts)
        
        # Ground truth: diagonal matrix (positive pairs)
        batch_size = images.shape[0]
        labels = torch.arange(batch_size, device=device)
        
        # Symmetric loss
        loss_img = F.cross_entropy(logits_per_image, labels)
        loss_txt = F.cross_entropy(logits_per_text, labels)
        loss = (loss_img + loss_txt) / 2
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(dataloader)

🎯 Zero-Shot Inference

def zero_shot_classify(model, image, class_names, device):
    """
    Classify image to one of class_names without training.
    """
    model.eval()
    
    # Prepare image; `preprocess` and `tokenize` are assumed helpers
    # (e.g. the transforms and tokenizer shipped with the clip package)
    image = preprocess(image).unsqueeze(0).to(device)
    
    # Generate text prompts
    prompts = [f"a photo of a {name}" for name in class_names]
    text_tokens = tokenize(prompts).to(device)
    
    with torch.no_grad():
        # Encode
        image_embed = model.encode_image(image)      # (1, 512)
        text_embeds = model.encode_text(text_tokens)  # (num_classes, 512)
        
        # Compute similarities
        logit_scale = model.logit_scale.exp()
        similarities = logit_scale * image_embed @ text_embeds.T  # (1, num_classes)
        
        # Softmax to get probabilities
        probs = F.softmax(similarities, dim=-1).squeeze(0)
        
        # Prediction
        pred_idx = probs.argmax().item()
        pred_class = class_names[pred_idx]
        pred_conf = probs[pred_idx].item()
        
    return pred_class, pred_conf, probs

# Example usage
class_names = ["dog", "cat", "car", "airplane"]
pred, conf, all_probs = zero_shot_classify(model, image, class_names, device)
print(f"Prediction: {pred} ({conf*100:.1f}% confidence)")

🔧 Using Pre-trained CLIP

import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

# Load pre-trained CLIP
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess image
from PIL import Image
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)

# Prepare text
text = clip.tokenize(["a dog", "a cat", "a car"]).to(device)

# Get predictions
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # [[0.92, 0.06, 0.02]]

Advanced Topics

Beyond Basic CLIP

🚀 CLIP Variants

  • 🌍 OpenCLIP: open-source reproduction with larger datasets (LAION-2B)
  • 📊 MetaCLIP: curated training data for better quality
  • 🎬 CLIP4Clip: extends CLIP to video understanding
  • 🔊 AudioCLIP: adds an audio modality
  • 🏥 MedCLIP: domain-specific variant for medical imaging

🔧 Fine-tuning Strategies

When to fine-tune:

  • ✅ The domain has highly specific visual concepts
  • ✅ Labeled data is available for the target task
  • ✅ Zero-shot performance is not good enough

Fine-tuning approaches:

  • 🎯 Full fine-tuning: update all parameters
  • ⚡ Adapter layers: add small trainable modules
  • 🔒 Prompt tuning: learn continuous prompts
  • 📊 Linear probe: only train a classifier head (sketched below)
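
A minimal linear-probe sketch: CLIP stays frozen and only a linear classifier is trained on its image embeddings (model.encode_image as defined earlier; labeled_loader is a hypothetical dataloader):

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10
probe = nn.Linear(512, num_classes)  # the only trainable weights
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

for images, labels in labeled_loader:  # hypothetical labeled dataset
    with torch.no_grad():              # keep CLIP frozen
        feats = model.encode_image(images)
    logits = probe(feats)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()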

💡 Tips & Best Practices

  • 📝 Prompt engineering: experiment with different templates
  • 🎯 Ensemble: average predictions across multiple prompts
  • 🔍 Image preprocessing: follow CLIP's normalization
  • ⚖️ Scaling: larger models (ViT-L/14) perform better but are slower
  • 🌍 Multilingual: use M-CLIP for non-English text

🔮 Future Directions

  • 🎨 Generative models: tighter integration with diffusion models
  • 🎬 Video understanding: temporal consistency
  • 🧠 3D vision: extension to 3D scenes
  • 🌍 Multilingual & multicultural: better global coverage
  • ⚡ Efficiency: smaller models for edge deployment

✅ Congratulations!

🎉 Tutorial Complete!

You have learned:

  • ✅ Contrastive learning with the InfoNCE loss
  • ✅ The dual-encoder architecture (image + text)
  • ✅ The training process and the similarity matrix
  • ✅ Zero-shot classification without examples
  • ✅ Applications & a PyTorch implementation
  • ✅ Advanced variants & fine-tuning strategies

🚀 Next Steps

• Try CLIP on your own images and texts

• Explore OpenCLIP for larger models

• Read the paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)

• Build applications: retrieval, generation, VQA