AdaSVD (Adaptive SVD-based Merging)
AdaSVD is an adaptive model merging algorithm designed for CLIP vision encoders. It combines ideas from Sparse Mixture of Low-Rank Experts (SMILE) with data-driven routing weight computation to produce a single merged model that adaptively weights multiple fine-tuned experts. The algorithm first upscales the pretrained model into a Mixture-of-Experts (MoE) architecture, then reduces the MoE back to dense linear layers by averaging the routing weights over a budget of calibration samples.
Algorithm Overview
SMILE Upscaling
The algorithm begins by replacing each linear layer in the CLIP vision encoder with a SmileMoELinear module. Each MoE module consists of:
- The pretrained linear layer: Serves as the base/expert 0.
- Expert linear layers: One per fine-tuned model, each containing the corresponding layer's parameters from a fine-tuned model.
- A gating network: Routes input tokens to a subset of experts based on SVD-derived routing scores.
The gating uses the difference between expert weights and the pretrained weight (\(\Delta W = W_{\text{expert}} - W_{\text{pretrained}}\)) to compute routing scores. The gate_k parameter sets how many leading singular directions of each \(\Delta W\) the gate keeps when computing these scores.
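The following is a minimal NumPy sketch of SVD-derived routing, assuming the SMILE-style score (the norm of the input projected onto the leading right singular vectors of each \(\Delta W\)). The function name `svd_gate_scores`, the `rank` argument, and the toy matrices are illustrative and are not part of fusion_bench.

```python
import numpy as np

def svd_gate_scores(x, w_pretrained, w_experts, rank):
    """Softmax routing weights over experts for a single input vector x.

    Illustrative only: each expert is scored by how strongly x aligns with
    the top `rank` right singular vectors of its weight difference.
    """
    logits = []
    for w in w_experts:
        delta = w - w_pretrained                       # Delta W, shape (out, in)
        _, _, vt = np.linalg.svd(delta, full_matrices=False)
        logits.append(np.linalg.norm(vt[:rank] @ x))   # norm of the projection
    logits = np.asarray(logits)
    exp = np.exp(logits - logits.max())                # numerically stable softmax
    return exp / exp.sum()

# Toy example: two experts whose updates respond to different input axes.
w0 = np.zeros((3, 3))
expert_a = w0.copy()
expert_a[0, 0] = 1.0    # Delta W only sees input axis 0
expert_b = w0.copy()
expert_b[1, 1] = 1.0    # Delta W only sees input axis 1
x = np.array([1.0, 0.0, 0.0])
weights = svd_gate_scores(x, w0, [expert_a, expert_b], rank=1)
```

Because the input lies along the direction that expert A's update acts on, expert A receives the larger routing weight.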
Routing Weight Accumulation
After upscaling, the algorithm runs a forward pass through the model on a set of calibration samples (controlled by num_samples). For each layer, hooks accumulate the routing weights produced by the gating network and average them over the \(N\) samples:
\[
\bar{w}_i = \frac{1}{N} \sum_{n=1}^{N} g_i(x_n)
\]
where \(g_i(x_n)\) is the gate's output for expert \(i\) on sample \(x_n\).
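MoE Reduction
Once routing weights are accumulated, each MoE linear layer is reduced back to a standard linear layer via weighted averaging. Assuming, as in SMILE, that each expert contributes its weight difference \(\Delta W_i = W_i - W_{\text{pretrained}}\), the merged weight is:
\[
W_{\text{merged}} = W_{\text{pretrained}} + \sum_{i} \bar{w}_i \, \Delta W_i
\]
This is equivalent to a weighted average where the pretrained model has weight 1 and each expert \(i\) has weight \(\bar{w}_i\) (optionally scaled by scaling_factor).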
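As a rough illustration of the reduction, the sketch below folds fixed routing weights into a single dense matrix and checks that it reproduces the MoE output. It assumes each expert contributes its difference from the pretrained weight, which is an assumption about the expert representation; `reduce_moe_to_dense` is a hypothetical helper, not a fusion_bench function.

```python
import numpy as np

def reduce_moe_to_dense(w_pretrained, w_experts, avg_weights, scaling_factor=1.0):
    """Fold averaged routing weights into one dense weight matrix."""
    merged = w_pretrained.copy()                      # pretrained model has weight 1
    for w, w_bar in zip(w_experts, avg_weights):
        # Assumption: the expert contributes its difference from the pretrained weight.
        merged += scaling_factor * w_bar * (w - w_pretrained)
    return merged

rng = np.random.default_rng(0)
w0 = rng.normal(size=(2, 3))
experts = [w0 + rng.normal(size=(2, 3)) for _ in range(2)]
w_bar = [0.25, 0.75]                                  # averaged routing weights
merged = reduce_moe_to_dense(w0, experts, w_bar)

# With the routing weights held fixed, the dense layer matches the MoE output.
x = rng.normal(size=3)
moe_out = w0 @ x + sum(g * ((e - w0) @ x) for g, e in zip(w_bar, experts))
dense_out = merged @ x
```

The equality holds only because the routing weights are constants after averaging; the original MoE layer routes each input differently.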
Non-Linear Module Handling
For modules that are not linear layers (e.g., layer norms), the average_experts flag controls behavior:
- true: Average the parameters of all expert models.
- false: Keep the pretrained model's parameters.
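The two branches of the flag can be sketched as follows. This is a minimal illustration using plain dictionaries of NumPy arrays in place of module parameters; `merge_leaf_params` and the parameter names are hypothetical.

```python
import numpy as np

def merge_leaf_params(pretrained, experts, average_experts):
    """Return parameters for a module that is not a linear layer."""
    if average_experts:
        # Average each parameter across the expert (fine-tuned) models.
        return {name: np.mean([e[name] for e in experts], axis=0) for name in pretrained}
    # Otherwise keep the pretrained parameters unchanged.
    return {name: value.copy() for name, value in pretrained.items()}

pretrained = {"weight": np.array([1.0, 1.0]), "bias": np.array([0.0, 0.0])}
experts = [
    {"weight": np.array([1.2, 0.8]), "bias": np.array([0.1, 0.0])},
    {"weight": np.array([0.8, 1.0]), "bias": np.array([0.3, 0.2])},
]
averaged = merge_leaf_params(pretrained, experts, average_experts=True)
kept = merge_leaf_params(pretrained, experts, average_experts=False)
```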
Mathematical Formulation
Upscaling
For each linear layer \(l\) with pretrained weight \(W_0^{(l)}\) and fine-tuned weights \(\{W_i^{(l)}\}_{i=1}^{M}\):
- Construct a `SmileMoELinear` module with experts \(\{W_0^{(l)}, W_1^{(l)}, ..., W_M^{(l)}\}\).
- The gate computes routing scores using the SVD of the weight differences.
- The `top_k` parameter is set to \(M\) (all experts), with dense routing.
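Routing Weight Computation
For input hidden states \(h\) passed through layer \(l\), the gate produces a softmax-normalized weight for each expert:
\[
g^{(l)}(h) = \operatorname{softmax}\big(r^{(l)}(h)\big) \in \mathbb{R}^{M}
\]
where \(r^{(l)}(h)\) denotes the gate's raw routing scores. Over \(N\) calibration samples:
\[
\bar{w}_i^{(l)} = \frac{1}{N} \sum_{n=1}^{N} g_i^{(l)}(h_n)
\]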
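Reduction (MoE to Dense)
The final merged weight for layer \(l\) is computed as a weighted combination. Assuming, as in SMILE, that each expert contributes its difference from the pretrained weight:
\[
W_{\text{merged}}^{(l)} = W_0^{(l)} + \alpha \sum_{i=1}^{M} \bar{w}_i^{(l)} \left( W_i^{(l)} - W_0^{(l)} \right)
\]
where \(\alpha\) is the optional scaling_factor. When scaling_factor is a list, it provides per-expert scaling (\(\alpha_i\) in place of \(\alpha\)). The WeightedAverageAlgorithm with normalize=False is used, meaning the weights are not renormalized to sum to 1.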
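To make the scalar and per-expert cases concrete, here is a small sketch of how the unnormalized weight vector passed to the weighted average might be assembled. The helper `reduction_weights` is hypothetical, and treating a `None` scaling factor as "no scaling" is an assumption.

```python
def reduction_weights(avg_routing_weights, scaling_factor=None):
    """Build [1, a_1 * w_1, ..., a_M * w_M] without renormalizing."""
    num_experts = len(avg_routing_weights)
    if scaling_factor is None:
        scales = [1.0] * num_experts                 # assumption: None means no scaling
    elif isinstance(scaling_factor, (int, float)):
        scales = [float(scaling_factor)] * num_experts
    else:
        scales = list(scaling_factor)                # per-expert scaling
        if len(scales) != num_experts:
            raise ValueError("need one scaling factor per expert")
    # The pretrained model always keeps weight 1; the result need not sum to 1.
    return [1.0] + [s * w for s, w in zip(scales, avg_routing_weights)]

unscaled = reduction_weights([0.25, 0.75])
scalar = reduction_weights([0.25, 0.75], scaling_factor=2.0)
per_expert = reduction_weights([0.25, 0.75], scaling_factor=[1.0, 0.0])
```

With a per-expert list, setting one entry to zero removes that expert's contribution entirely while leaving the others untouched.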
Hidden State Propagation
The algorithm processes the CLIP transformer layer by layer:
- Extract hidden states from the input embeddings through the pre-layernorm.
- For each encoder layer, propagate hidden states through all MoE modules.
- Accumulate routing weights at each layer.
- Use the output hidden states as input to the next layer.
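The steps above can be sketched as a loop over toy linear "layers". This is a simplified illustration, not the fusion_bench implementation: each layer is a single weight matrix, the gate is passed in as a function, experts are assumed to contribute weight differences, and all names are hypothetical.

```python
import numpy as np

def collect_layerwise_routing(hidden, layers, gate_fn):
    """hidden: (N, d) calibration inputs; layers: list of (w0, [w_1, ..., w_M])."""
    avg_weights = []
    for w0, experts in layers:
        # Routing weights for every sample at this layer, shape (N, M).
        g = np.stack([gate_fn(h, w0, experts) for h in hidden])
        avg_weights.append(g.mean(axis=0))
        # MoE forward: pretrained output plus gated expert updates.
        deltas = np.stack([w - w0 for w in experts])            # (M, out, in)
        expert_out = np.einsum("moi,ni->nmo", deltas, hidden)   # (N, M, out)
        hidden = hidden @ w0.T + np.einsum("nm,nmo->no", g, expert_out)
    return avg_weights, hidden

def uniform_gate(h, w0, experts):
    # Placeholder gate that weights all experts equally.
    return np.full(len(experts), 1.0 / len(experts))

rng = np.random.default_rng(0)
layers = []
for _ in range(2):
    w0 = rng.normal(size=(3, 3))
    layers.append((w0, [w0 + 0.1 * rng.normal(size=(3, 3)) for _ in range(2)]))
calib = rng.normal(size=(4, 3))
avg_weights, final_hidden = collect_layerwise_routing(calib, layers, uniform_gate)
```

Each layer's output becomes the next layer's input, so only one layer's routing statistics are computed at a time.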
Configuration
_target_: fusion_bench.method.AdaSVDMergingForCLIPVisionModel
scaling_factor: null
num_samples: 256
gate_k: 16
average_experts: false
device: cuda
upscaling_accelerator: null
seed: 0
Key configuration parameters:
| Parameter | Description | Default |
|---|---|---|
| `scaling_factor` | Scaling for expert weights (float or list) | `null` |
| `num_samples` | Number of calibration samples per dataset | `256` |
| `gate_k` | Number of singular directions of each weight difference kept by the gate | `16` |
| `average_experts` | Average non-linear modules across experts | `false` |
| `device` | Device for computation: `cuda` or `cpu` | `cuda` |
| `upscaling_accelerator` | Accelerator for SMILE upscaling | `null` |
| `seed` | Random seed for reproducibility | `0` |
Examples
CLI Usage
fusion_bench \
method=ada_svd/clip_vision \
method.num_samples=256 \
method.gate_k=16 \
method.scaling_factor=1.0 \
method.average_experts=false \
method.device=cuda \
modelpool=CLIPVisionModelPool/clip-vit-base-patch32_TA8 \
taskpool=CLIPVisionModelTaskPool/clip-vit-classification_TA8
API Usage
from fusion_bench.method.ada_svd.clip_vision import AdaSVDMergingForCLIPVisionModel

algorithm = AdaSVDMergingForCLIPVisionModel(
    scaling_factor=None,
    num_samples=256,
    gate_k=16,
    average_experts=False,
    device="cuda",
    upscaling_accelerator=None,
    seed=0,
)
# modelpool: a CLIP vision model pool holding the pretrained and fine-tuned models
merged_model = algorithm.run(modelpool)
Implementation Details
Data Preparation
The prepare_data method samples num_samples examples from each training dataset in the model pool. It uses random_split to select the subset, then wraps it in a CLIPDataset with the model pool's processor. All per-task subsets are concatenated into a single ConcatDataset.
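The sampling pattern can be illustrated without PyTorch. The sketch below is a dependency-free analogue that uses the standard library's `random` module in place of `random_split` and a flat list in place of `ConcatDataset`; the function name and the toy datasets are hypothetical.

```python
import random

def sample_calibration_set(datasets, num_samples, seed=0):
    """Draw up to num_samples items from each dataset and concatenate them."""
    rng = random.Random(seed)                      # seeded for reproducibility
    combined = []
    for name, data in datasets.items():
        k = min(num_samples, len(data))            # small datasets are used in full
        indices = rng.sample(range(len(data)), k)  # sampling without replacement
        combined.extend((name, data[i]) for i in indices)
    return combined

datasets = {"task_a": list(range(10)), "task_b": list(range(3))}
subset = sample_calibration_set(datasets, num_samples=4, seed=0)
```

Sampling per dataset rather than from the pooled data keeps every task represented in the calibration set, regardless of dataset size.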
Model Preparation
The prepare_model method loads the pretrained model and all fine-tuned models, then calls merge to perform SMILE upscaling. Models are moved to the specified device (GPU if cuda and available).
Upscaling Process
- Linear layers: Each `nn.Linear` module is replaced with a `SmileMoELinear` using the pretrained and expert weights. The `k=-1` setting enables dense experts, and `routing_use_diff=True` uses the weight difference for routing.
- Leaf modules: When `average_experts=True`, leaf modules (modules with no sub-modules) are averaged across experts using `simple_average`.
Routing Weight Collection
The AvgRoutingWeightsMetric class registers as a forward hook on each SmileMoELinear. It computes the softmax routing weights for each forward pass and accumulates them. After processing all samples, compute() returns the average routing weights.
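The accumulate-then-average behavior can be sketched with a small stand-in class. This is a plain NumPy illustration of the update/compute pattern, not the actual `AvgRoutingWeightsMetric`; the class name and its interface are assumptions, and hook registration is omitted.

```python
import numpy as np

class RunningRoutingAverage:
    """Accumulates routing weights and returns their mean per expert."""

    def __init__(self, num_experts):
        self.total = np.zeros(num_experts)
        self.count = 0

    def update(self, routing_weights):
        # Accept a single weight vector or a batch of shape (n, num_experts).
        rw = np.asarray(routing_weights, dtype=float).reshape(-1, self.total.size)
        self.total += rw.sum(axis=0)
        self.count += rw.shape[0]

    def compute(self):
        if self.count == 0:
            raise ValueError("no routing weights have been recorded")
        return self.total / self.count

metric = RunningRoutingAverage(num_experts=2)
metric.update([[0.2, 0.8], [0.4, 0.6]])   # one batch of two samples
metric.update([0.6, 0.4])                 # a single sample
average = metric.compute()
```

Storing only a running sum and a count keeps memory use constant in the number of calibration samples.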
Layer-by-Layer Processing
The algorithm processes transformer layers sequentially. Hidden states are propagated layer by layer, with routing weights accumulated at each layer. This is more memory-efficient than processing the entire network at once.
Memory Efficiency
After upscaling a linear layer, the original modules in the fine-tuned models are set to None to free memory. The algorithm also clears the fine-tuned model references progressively.
References
- (arXiv 2024) SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models. http://arxiv.org/abs/2408.10174. Introduces the SMILE upscaling technique that AdaSVD builds upon.