- Feature Alignment Stage
  - Requirement: Single GPU is sufficient for the 558K LAION-CC-SBU subset with the vision encoder and LLM frozen
  - Tradeoff: Minimal compute vs. establishing the vision-language connection
  - Optimization Level: Critical
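The projector-only setup above can be sketched in a few lines of PyTorch. This is a toy illustration, not the real LLaVA code: the tiny `nn.Linear` towers stand in for CLIP ViT-L/14 and Vicuna, and all dimensions are made up.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pretrained towers; dimensions are illustrative.
vision_encoder = nn.Linear(32, 16)   # stands in for CLIP ViT-L/14
llm = nn.Linear(8, 8)                # stands in for Vicuna

# Stage 1: freeze both pretrained towers ...
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad_(False)

# ... and train only the vision-language projector (an MLP in LLaVA-1.5).
projector = nn.Sequential(nn.Linear(16, 8), nn.GELU(), nn.Linear(8, 8))
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

image_features = vision_encoder(torch.randn(4, 32))
visual_tokens = projector(image_features)
loss = llm(visual_tokens).pow(2).mean()   # placeholder training objective
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
frozen = sum(p.numel() for m in (vision_encoder, llm) for p in m.parameters())
```

Only the projector receives gradients and optimizer state, which is why this stage fits on a single GPU.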
- Visual Instruction Tuning
  - Requirement: Multiple A100 GPUs recommended for the 665K multimodal instruction mixture (150K GPT-generated instructions + 515K VQA samples)
  - Tradeoff: Quality gains vs. compute scaling
  - Optimization Level: Critical
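A minimal sketch of assembling that instruction mixture, assuming the two sources are simply shuffled together so every batch mixes conversational and short-answer styles (record counts are scaled down 1000x; the field names are invented for illustration):

```python
import random

random.seed(0)
# Toy-scale stand-ins for the 150K GPT-generated + 515K academic VQA records.
gpt_generated = [{"source": "gpt", "id": i} for i in range(150)]
academic_vqa = [{"source": "vqa", "id": i} for i in range(515)]

# Train on the shuffled union rather than one source after the other.
mixture = gpt_generated + academic_vqa
random.shuffle(mixture)

def batches(data, batch_size):
    """Yield fixed-size batches; the last partial batch is dropped."""
    for start in range(0, len(data) - batch_size + 1, batch_size):
        yield data[start:start + batch_size]

first = next(batches(mixture, 32))
```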
- Memory Optimization
  - Requirement: Frozen pretrained encoders reduce the number of actively trained parameters; stage 1 updates only the projector
  - Tradeoff: Lower memory footprint vs. full model fine-tuning
  - Optimization Level: Critical
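Back-of-the-envelope arithmetic shows why freezing helps: gradients and the two AdamW moment buffers are only allocated for trainable parameters. The sketch below assumes fp16 weights, fp16 gradients, fp32 AdamW states, and illustrative parameter counts (7B LLM, 0.3B vision encoder, 20M projector); the real numbers depend on the exact checkpoints and optimizer settings.

```python
# Rough training-state memory estimate under the assumptions stated above.
def train_state_gib(trainable, frozen, weight_bytes=2, grad_bytes=2,
                    adam_state_bytes=8):
    """Weights for all params; gradients + two AdamW moments only for trainable ones."""
    total = (trainable + frozen) * weight_bytes
    total += trainable * (grad_bytes + adam_state_bytes)
    return total / 2**30

LLM, VIT, PROJ = 7_000_000_000, 300_000_000, 20_000_000

stage1_gib = train_state_gib(trainable=PROJ, frozen=LLM + VIT)  # projector-only
full_gib = train_state_gib(trainable=LLM + VIT + PROJ, frozen=0)  # full fine-tune
```

Under these assumptions, projector-only training needs several times less memory than full fine-tuning, which is what makes the single-GPU stage-1 recipe possible.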
- Inference Efficiency
  - Requirement: The CLIP ViT-L/14 + Vicuna stack is lightweight compared to proprietary multimodal models
  - Tradeoff: Open weights enable quantization and edge deployment
  - Optimization Level: Important
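To make the quantization point concrete, here is a minimal symmetric absmax int8 quantization sketch, a deliberately simplified stand-in for what production libraries (e.g. bitsandbytes) do per channel rather than per tensor:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Per-tensor symmetric absmax quantization to int8."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(64, 64)               # a fp32 weight tile
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = (w - w_hat).abs().max().item()
```

Storage drops 4x (int8 vs. fp32) at the cost of a bounded rounding error of at most half a quantization step.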
- LLaVA-GM Sparsification
  - Requirement: Multi-stage MoE training: MLP adaptation → full Gemma → MoE-only for lightweight deployment
  - Tradeoff: Reduced inference compute while maintaining performance
  - Optimization Level: Important
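The compute saving comes from sparse routing: each token activates only its routed expert(s), not every FFN. A toy top-1 mixture-of-experts layer sketches the idea; the class, routing scheme, and dimensions are illustrative and do not reproduce any released checkpoint's architecture.

```python
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    """Toy top-1 MoE FFN: each token runs through exactly one expert,
    so inference touches roughly 1/num_experts of the FFN weights."""

    def __init__(self, dim=16, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)   # (tokens, num_experts)
        weight, index = gate.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = index == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

moe = TopOneMoE()
y = moe(torch.randn(10, 16))
```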
- Batch Processing
  - Requirement: Heterogeneous batching for mixed text-image inputs; vision tokenization precedes language processing
  - Tradeoff: Throughput optimization vs. latency consistency
  - Optimization Level: Optional
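A sketch of that heterogeneous batching, under simplifying assumptions: image-bearing samples are expanded with their visual tokens first, then all embedding sequences are right-padded together with an attention mask. Dimensions and the 3-token visual placeholder are illustrative (LLaVA-1.5 actually inserts 576 visual tokens per image).

```python
import torch

DIM, NUM_VISUAL = 8, 3  # toy embedding dim and visual-token count

def build_batch(samples):
    """samples: list of {'text': (len, DIM) tensor, 'image': bool}.
    Returns padded embeddings (B, L, DIM) and a boolean attention mask (B, L)."""
    seqs = []
    for s in samples:
        parts = []
        if s["image"]:
            # Placeholder for projector output; real code would run the
            # vision encoder + projector here, before language processing.
            parts.append(torch.zeros(NUM_VISUAL, DIM))
        parts.append(s["text"])
        seqs.append(torch.cat(parts))
    max_len = max(seq.shape[0] for seq in seqs)
    batch = torch.zeros(len(seqs), max_len, DIM)
    mask = torch.zeros(len(seqs), max_len, dtype=torch.bool)
    for i, seq in enumerate(seqs):
        batch[i, :seq.shape[0]] = seq
        mask[i, :seq.shape[0]] = True
    return batch, mask

samples = [
    {"text": torch.randn(5, DIM), "image": True},   # 3 visual + 5 text = 8
    {"text": torch.randn(4, DIM), "image": False},  # 4 text, padded to 8
]
batch, mask = build_batch(samples)
```

Padding mixed-length sequences to the batch maximum is what trades latency consistency for throughput: short text-only requests wait on, and pad up to, the longest image-bearing sequence.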