CLIP
Contrastive Language-Image Pre-training
What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a multimodal model from OpenAI that learns to connect images and text in the same embedding space.
Key Innovation:
CLIP is trained on 400 million (image, text) pairs from the internet using contrastive learning. The result: a model that can perform zero-shot classification without any additional training!
Why is CLIP important?
- Zero-shot transfer: classify images without training examples
- Multimodal understanding: bridges vision & language
- Flexible: text prompts act as the classifier
- Foundation model: the base for DALL-E and Stable Diffusion
What You Will Learn
- Contrastive Learning: InfoNCE loss and pairing
- Dual Encoders: image & text encoders
- Training: similarity matrix & loss
- Zero-Shot: classification without examples
Multimodal Duality
Images are processed with a Vision Transformer (ViT) or ResNet
Output: 512-dim embedding vector
Text is processed with a Transformer encoder
Output: 512-dim embedding vector
Both encoders are aligned in a shared embedding space, so an image and its matching text have similar embeddings!
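A minimal sketch of this shared space, with random tensors as hypothetical stand-ins for the projected outputs of the two encoders:

import torch
import torch.nn.functional as F

# Stand-ins for projected encoder outputs (1 image, 1 caption)
image_features = torch.randn(1, 512)  # from the image encoder
text_features = torch.randn(1, 512)   # from the text encoder

# L2-normalize so both embeddings lie on the unit sphere
image_embed = F.normalize(image_features, dim=-1)
text_embed = F.normalize(text_features, dim=-1)

# For unit vectors, the dot product is the cosine similarity in [-1, 1]
similarity = (image_embed @ text_embed.T).item()
print(f"cosine similarity: {similarity:.3f}")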
Contrastive Learning
Learning by Comparison
The Principle of Contrastive Learning
Contrastive learning trains the model to distinguish:
- Positive pairs: (image, matching caption) should be pulled close together
- Negative pairs: (image, non-matching caption) should be pushed far apart
InfoNCE Loss
CLIP uses the InfoNCE (noise-contrastive estimation) loss:
Maximize similarity for the positive pair, minimize it for the negatives.
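In its standard form, the image-to-text term for pair i in a batch of N is (with cosine similarity sim and temperature τ):

$$\mathcal{L}_i^{(I \to T)} = -\log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j)/\tau)}$$

The text-to-image term swaps the roles of I and T, and CLIP averages the two directions (see the Training Process section).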
Example: A Batch with N=4 Pairs
[Animation: contrastive pairing]
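A minimal runnable sketch of this N=4 batch, using random embeddings as hypothetical stand-ins for real encoder outputs (0.07 is the temperature initialization from the CLIP paper):

import torch
import torch.nn.functional as F

N, dim = 4, 512
image_embeds = F.normalize(torch.randn(N, dim), dim=-1)
text_embeds = F.normalize(torch.randn(N, dim), dim=-1)

# Temperature-scaled cosine similarities: row i vs. all 4 captions
logits = image_embeds @ text_embeds.T / 0.07  # (4, 4)

# Row i should predict column i (its matching caption)
targets = torch.arange(N)
loss = F.cross_entropy(logits, targets)  # InfoNCE, image-to-text direction
print(f"InfoNCE loss: {loss.item():.3f}")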
Dual Encoders
Image & Text Encoders
CLIP Architecture
CLIP consists of two separate encoders that are trained jointly:
Key Components
- Vision Transformer (ViT): patch-based image processing
- Text Transformer: masked self-attention over text
- Projection heads: map both outputs into the shared embedding space
- L2 normalization: ensures embeddings lie on the unit sphere
[Animation: encoder architecture]
Embedding Space
Both encoders produce embeddings in the same 512-dimensional space. Contrastive training ensures that:
- Matched image-text pairs have high cosine similarity
- Unmatched pairs have low similarity
Training Process
Symmetric Contrastive Loss
Training Batch
CLIP is trained with very large batches (e.g., N = 32,768):
- Take a batch of N (image, text) pairs from the dataset
- Encode the images → N image embeddings (I₁, I₂, ..., I_N)
- Encode the texts → N text embeddings (T₁, T₂, ..., T_N)
- Compute the N×N similarity matrix
- Calculate the symmetric loss
Similarity Matrix
For a batch of N=4, the similarity matrix S shows the cosine similarity between all pairs:
- Diagonal = positive pairs (high similarity)
- Off-diagonal = negative pairs (low similarity)
Symmetric Loss
The loss is computed in both directions (image→text over rows, text→image over columns) and averaged; this symmetric loss ensures bidirectional alignment!
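A minimal sketch of the symmetric loss on a toy 4×4 similarity matrix (random embeddings again stand in for encoder outputs; the full training loop appears in the Implementation section):

import torch
import torch.nn.functional as F

N = 4
image_embeds = F.normalize(torch.randn(N, 512), dim=-1)
text_embeds = F.normalize(torch.randn(N, 512), dim=-1)

S = image_embeds @ text_embeds.T / 0.07  # (4, 4) scaled similarity matrix
labels = torch.arange(N)                 # correct matches sit on the diagonal
loss_i2t = F.cross_entropy(S, labels)    # rows: image -> text
loss_t2i = F.cross_entropy(S.T, labels)  # columns: text -> image
loss = (loss_i2t + loss_t2i) / 2         # symmetric loss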
[Animation: training batch]
Training Details
- Dataset: 400M (image, text) pairs from the internet
- Batch size: 32,768 (very large!)
- Training time: ~12 days on 592 V100 GPUs
- Optimizer: AdamW with a cosine learning rate schedule
Zero-Shot Classification
Classification Without Training Examples
What is Zero-Shot?
Zero-shot classification: the model can classify into classes it never saw during training!
How?
Use text prompts as classifiers. To classify an image into {dog, cat, car}:
- Generate prompts: "a photo of a dog", "a photo of a cat", "a photo of a car"
- Encode all prompts → text embeddings
- Encode the image → image embedding
- Compute the similarity with every class prompt
- Argmax → predicted class! (see zero_shot_classify in the Implementation section)
Example: Image Classification
Task: classify an image into 3 classes
[Animation: zero-shot demo]
Prompt Engineering
Prompt design strongly affects accuracy! Tips:
- Template: "a photo of a {class}" works well
- Ensemble: use multiple prompts per class (see the sketch after this list)
- Context: add context, e.g. "a photo of a {class}, a type of pet"
- Domain: adjust for domain-specific data (medical, satellite)
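A minimal sketch of prompt ensembling, assuming a trained CLIP model plus a tokenize helper as in the Implementation section (the template list here is illustrative, not the paper's full ensemble):

import torch
import torch.nn.functional as F

templates = [
    "a photo of a {}",
    "a close-up photo of a {}",
    "a low-resolution photo of a {}",
]

def ensemble_text_embedding(model, class_name, tokenize, device):
    # Encode every template for this class, then average in embedding space
    prompts = [t.format(class_name) for t in templates]
    tokens = tokenize(prompts).to(device)
    with torch.no_grad():
        embeds = model.encode_text(tokens)  # (num_templates, 512)
    mean_embed = embeds.mean(dim=0)
    return F.normalize(mean_embed, dim=-1)  # re-normalize after averaging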
Applications
CLIP Use Cases
CLIP Applications
CLIP has become a foundation model for a wide range of multimodal applications:
- Image-Text Retrieval: search images with a text query, or the reverse
- Text-to-Image Generation: CLIP guides generation models (DALL-E, Stable Diffusion)
- Zero-Shot Classification: classify new classes without training examples
- Visual Question Answering: answer questions about image content
- Image Captioning: generate descriptive captions for images
- Object Detection: open-vocabulary object detection with text queries
Notable Projects Using CLIP
- DALL-E 2: text-to-image generation with CLIP-guided diffusion
- Stable Diffusion: open-source generation model using the CLIP text encoder
- OpenCLIP: open reproduction of CLIP with larger datasets
- Video understanding: extends CLIP to the video domain (CLIP4Clip)
- Medical imaging: zero-shot diagnosis with domain-specific prompts
Why CLIP is Powerful
Key advantages:
- No labeled data needed for new tasks
- Flexible via text: just change the prompts
- Generalizes well across domains
- Composable: combines with other models
Implementation
PyTorch Code
CLIP Model Implementation
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIP(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder  # ViT or ResNet
        self.text_encoder = text_encoder    # Transformer
        # Projection heads map encoder outputs into the shared space
        self.image_proj = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.output_dim, embed_dim)
        # Learnable temperature, initialized to 1/0.07 as in the paper
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def encode_image(self, images):
        # images: (batch, 3, 224, 224)
        image_features = self.image_encoder(images)
        image_embeds = self.image_proj(image_features)
        image_embeds = F.normalize(image_embeds, dim=-1)
        return image_embeds

    def encode_text(self, text):
        # text: (batch, max_length) token ids
        text_features = self.text_encoder(text)
        text_embeds = self.text_proj(text_features)
        text_embeds = F.normalize(text_embeds, dim=-1)
        return text_embeds

    def forward(self, images, texts):
        image_embeds = self.encode_image(images)  # (N, embed_dim)
        text_embeds = self.encode_text(texts)     # (N, embed_dim)
        # Scaled cosine similarity
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_embeds @ text_embeds.T  # (N, N)
        logits_per_text = logits_per_image.T
        return logits_per_image, logits_per_text
Training Loop
def train_clip(model, dataloader, optimizer, device):
    model.train()
    for images, texts in dataloader:
        images = images.to(device)
        texts = texts.to(device)
        # Forward pass
        logits_per_image, logits_per_text = model(images, texts)
        # Ground truth: pair i matches pair i (the diagonal of the matrix)
        batch_size = images.shape[0]
        labels = torch.arange(batch_size, device=device)
        # Symmetric loss: average of both directions
        loss_img = F.cross_entropy(logits_per_image, labels)
        loss_txt = F.cross_entropy(logits_per_text, labels)
        loss = (loss_img + loss_txt) / 2
        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
Zero-Shot Inference
def zero_shot_classify(model, image, class_names, device):
    """
    Classify an image into one of class_names without training.
    Assumes `preprocess` and `tokenize` helpers (e.g., from the
    openai/CLIP package) are available in scope.
    """
    model.eval()
    # Prepare image
    image = preprocess(image).unsqueeze(0).to(device)
    # Generate text prompts
    prompts = [f"a photo of a {name}" for name in class_names]
    text_tokens = tokenize(prompts).to(device)
    with torch.no_grad():
        # Encode
        image_embed = model.encode_image(image)       # (1, 512)
        text_embeds = model.encode_text(text_tokens)  # (num_classes, 512)
        # Compute similarities
        logit_scale = model.logit_scale.exp()
        similarities = logit_scale * image_embed @ text_embeds.T  # (1, num_classes)
        # Softmax to get probabilities
        probs = F.softmax(similarities, dim=-1).squeeze(0)
    # Prediction
    pred_idx = probs.argmax().item()
    pred_class = class_names[pred_idx]
    pred_conf = probs[pred_idx].item()
    return pred_class, pred_conf, probs

# Example usage
class_names = ["dog", "cat", "car", "airplane"]
pred, conf, all_probs = zero_shot_classify(model, image, class_names, device)
print(f"Prediction: {pred} ({conf*100:.1f}% confidence)")
Using Pre-trained CLIP
import torch
import clip
from PIL import Image

# Load pre-trained CLIP
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess image
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)

# Prepare text
text = clip.tokenize(["a dog", "a cat", "a car"]).to(device)

# Get predictions
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # e.g. [[0.92, 0.06, 0.02]]
Advanced Topics
Beyond Basic CLIP
CLIP Variants
- OpenCLIP: open-source reproduction with larger datasets (LAION-2B)
- MetaCLIP: curated training data for better quality
- CLIP4Clip: extends CLIP to video understanding
- AudioCLIP: adds an audio modality
- MedCLIP: domain-specific CLIP for medical imaging
Fine-tuning Strategies
When to fine-tune:
- The domain has highly specific visual concepts
- Labeled data is available for the target task
- Zero-shot performance is not good enough
Fine-tuning approaches:
- Full fine-tuning: update all parameters
- Adapter layers: add small trainable modules
- Prompt tuning: learn continuous prompts
- Linear probe: only train a classifier head on frozen features (see the sketch below)
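A minimal sketch of the linear-probe approach, assuming the pre-trained CLIP model from the snippets above; train_loader is a hypothetical labeled dataloader:

import torch
import torch.nn as nn

# Freeze CLIP; only the linear head below is trained
for p in model.parameters():
    p.requires_grad = False

num_classes = 10
probe = nn.Linear(512, num_classes).to(device)  # 512 = CLIP embedding dim
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in train_loader:  # hypothetical labeled dataloader
    images, labels = images.to(device), labels.to(device)
    with torch.no_grad():
        features = model.encode_image(images)  # frozen CLIP features
    logits = probe(features.float())
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()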
Tips & Best Practices
- Prompt engineering: experiment with different templates
- Ensemble: average predictions across multiple prompts
- Image preprocessing: follow CLIP's normalization
- Scaling: larger models (ViT-L/14) perform better but run slower
- Multilingual: use M-CLIP for non-English text
Future Directions
- Generative models: better integration with diffusion models
- Video understanding: temporal consistency
- 3D vision: extending to 3D scenes
- Multilingual & multicultural: better global coverage
- Efficiency: smaller models for edge deployment
Congratulations!
Tutorial Complete!
You have learned:
- Contrastive learning with the InfoNCE loss
- The dual encoder architecture (image + text)
- The training process with the similarity matrix
- Zero-shot classification without examples
- Applications & the PyTorch implementation
- Advanced variants & fine-tuning