Weight-Ensembling Mixture of Experts (Data-Adaptive Model Merging)¶
This method is designed to handle a wide range of tasks by segregating shared information and task-specific knowledge. It dynamically combines these elements based on the input samples.
The Weight-Ensembling MoE module consists of three main components: the router, the pre-trained MLP weights, and a collection of task vectors. The router, itself a small MLP, processes the input tokens and produces routing weights that determine how knowledge from the different tasks is combined. The pre-trained MLP weights serve as the shared backbone, having been trained to recognize a wide range of data patterns. The task vectors are the differences between the MLPs fine-tuned for specific tasks and the pre-trained MLP, capturing the task-specific adjustments. At inference time, the routing weights are averaged across the input tokens and used to weight the task vectors in the dictionary matrix; the weighted sum is then added to the pre-trained MLP weights to produce input-conditioned weights, as sketched below.
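To make the data flow concrete, here is a minimal, illustrative PyTorch sketch of such an up-scaled MLP module. Only the first linear layer is shown and biases are omitted; the class and attribute names are hypothetical and do not correspond to the actual fusion_bench implementation.

```python
import torch
import torch.nn as nn


class WeightEnsemblingMLPSketch(nn.Module):
    """Illustrative sketch of an up-scaled MLP (not the fusion_bench implementation).

    The pre-trained weights stay fixed, the task vectors store the differences
    between each fine-tuned MLP and the pre-trained one, and a small router
    produces per-sample combination weights.
    """

    def __init__(self, d_model: int, d_hidden: int, num_tasks: int):
        super().__init__()
        # router: here a 2-layer MLP mapping token features to one weight per task
        self.router = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_tasks)
        )
        # pre-trained weight of the first linear layer (bias omitted for brevity)
        self.w0 = nn.Parameter(torch.empty(d_hidden, d_model), requires_grad=False)
        # task vectors: per-task differences W_ft - W_0, stacked into a dictionary matrix
        self.task_vectors = nn.Parameter(
            torch.empty(num_tasks, d_hidden, d_model), requires_grad=False
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model)
        routing_weights = self.router(hidden_states)   # (batch, seq_len, num_tasks)
        routing_weights = routing_weights.mean(dim=1)  # average over the input tokens
        # weight the task vectors and add them to the pre-trained weights
        # -> input-conditioned weights, one weight matrix per sample
        delta = torch.einsum("bt,thd->bhd", routing_weights, self.task_vectors)
        weights = self.w0.unsqueeze(0) + delta         # (batch, d_hidden, d_model)
        return torch.einsum("bsd,bhd->bsh", hidden_states, weights)
```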
Algorithm Requirements:
Method | Access to labeled task data | Access to validation data (labeled) | Test-time adaptation |
---|---|---|---|
Fisher Merging | Yes (Estimate Fisher information matrix) | No | No |
RegMean | Yes (compute Gram Matrix) | No | No |
Task Arithmetic | No | Yes (select scaling factor) | No |
Ties-Merging | No | Yes (select scaling factor) | No |
AdaMerging | No | No | Yes |
Ours | No | No | Yes |
WEMoE V2: E-WEMoE¶
L. Shen, A. Tang, E. Yang et al. Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging. Oct, 2024.3
Parameters Comparison¶
Tip for reducing the parameter count
Here we report the parameter counts for the method as described in the original paper1. An effective strategy for reducing the number of parameters is to compress the task vectors with Singular Value Decomposition (SVD). This significantly cuts the parameter count while only marginally affecting performance. For details, see the Twin-Merging paper2, which not only reduces the number of parameters but also presents extensive experiments demonstrating the effectiveness of data-adaptive merging in the language domain. A rough sketch of such compression is shown below.
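As an illustration of the SVD-based compression idea (a minimal sketch, not the Twin-Merging implementation), each 2-D task-vector matrix can be replaced by a low-rank factorization:

```python
import torch


def compress_task_vector(delta: torch.Tensor, rank: int):
    """Approximate a 2-D task vector (W_ft - W_0) by a rank-`rank` factorization.

    Storing the two factors needs rank * (m + n) numbers instead of m * n.
    """
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]  # absorb the singular values into the left factor
    V_r = Vh[:rank, :]
    return U_r, V_r               # reconstruct the approximation with U_r @ V_r


# example: a 3072 x 768 task vector compressed to rank 16
delta = torch.randn(3072, 768)
U_r, V_r = compress_task_vector(delta, rank=16)
print(delta.numel(), U_r.numel() + V_r.numel())  # 2359296 vs. 61440 stored numbers
```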
Here is the number of parameters compared to a single pre-trained model (OpenCLIP CLIP-ViT-B/32):
Method | Trainable Parameters | Total Parameters | Parameters Reduced by Merging |
---|---|---|---|
Single Pre-trained | 113.45M (100%) | 113.45M | - |
WEMoE (2-layer, 1 task) | 7.10M (4.00%) | 177.21M | - |
WEMoE (2-layer, 2 tasks) | 7.11M (3.04%) | 233.89M | 2*113.45-233.89=-6.99M |
WEMoE (2-layer, 3 tasks) | 7.11M (2.45%) | 290.57M | 3*113.45-290.57=49.78M |
WEMoE (2-layer, 4 tasks) | 7.12M (2.02%) | 347.25M | 4*113.45-347.25=106.55M |
WEMoE (2-layer, 5 tasks) | 7.13M (1.77%) | 403.93M | 5*113.45-403.93=163.32M |
WEMoE (2-layer, 6 tasks) | 7.14M (1.55%) | 460.61M | 6*113.45-460.61=220.09M |
WEMoE (2-layer, 7 tasks) | 7.15M (1.38%) | 517.28M | 7*113.45-517.28=276.87M |
WEMoE (2-layer, 8 tasks) | 7.16M (1.25%) | 573.96M | 8*113.45-573.96=333.64M |
The parameter counts of the HuggingFace CLIP vision models (of type `transformers.models.clip.modeling_clip.CLIPVisionModel`) differ from those of the OpenCLIP models downloaded from the task arithmetic repo, because the OpenCLIP models (of type `src.modeling.ImageEncoder`) include the embedding layer for text tokens, while the HuggingFace CLIP vision models do not.
Consequently, the relative parameter overhead of the up-scaled model is larger for the Transformers CLIP vision models than for the OpenCLIP models.
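As a quick sanity check, the HuggingFace count can be reproduced by summing `numel()` over all parameters; this snippet assumes the public `openai/clip-vit-base-patch32` checkpoint and should print roughly 87.85M:

```python
from transformers import CLIPVisionModel

# the HuggingFace CLIP vision tower (no text token embedding)
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
num_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {num_params / 1e6:.2f}M")
```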
ImageEncoder( # (1)
(model): CLIP(
(visual): VisualTransformer( # (2)
(conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
(ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(transformer): Transformer(
(resblocks): ModuleList(
(0-11): 12 x ResidualAttentionBlock(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(ln_attn): Identity()
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): Sequential(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(ln): Identity()
(gelu): QuickGELU()
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
)
(ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(token_embedding): Embedding(49408, 512) # (3)
(ln_final): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
- trainable params: 113.45M || all params: 113.45M || trainable%: 100.0000
- trainable params: 87.85M || all params: 87.85M || trainable%: 100.0000
- trainable params: 25.30M || all params: 25.30M || trainable%: 100.0000
CLIPVisionModel( # (1)
(vision_model): CLIPVisionTransformer(
(embeddings): CLIPVisionEmbeddings(
(patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
(position_embedding): Embedding(50, 768)
)
(pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder): CLIPEncoder(
(layers): ModuleList(
(0-11): 12 x CLIPEncoderLayer(
(self_attn): CLIPAttention(
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): CLIPMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
(layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
)
(post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
- trainable params: 87.85M || all params: 87.85M || trainable%: 100.0000
Loss Landscape Visualization¶
Hyperparameter Tuning¶
The figure below shows the performance of the merged models as a function of the number of training steps. In Figure (a), we merge CLIP-ViT-B/32 models under different learning rate configurations; Figure (b) shows the performance of the merged WEMoE models as the number of steps varies. Performance trends upward as training progresses and converges rapidly, reaching high accuracy within about 200 steps. Moreover, the effect of the learning rate is minor, suggesting that the method is insensitive to this hyperparameter, a desirable property that reduces the need for tuning.
Ablations of Router Depth¶
Table: Parameter comparison of WEMoE (1-layer) and WEMoE (2-layer) on CLIP-ViT-B/32 models (OpenCLIP).
Method | Number of Trainable Parameters |
---|---|
AdaMerging (layer-wise) | 1.3K |
WEMoE (1-layer) | 73.8K (0.01%) |
WEMoE (2-layer) | 7.16M (1.25%) |
Table: Ablation study of the router depth on the performance of the up-scaled CLIP-ViT-B/32 models (OpenCLIP).
Method | SUN397 | CARS | RESISC45 | EuroSAT | SVHN | GTSRB | MNIST | DTD | Avg. |
---|---|---|---|---|---|---|---|---|---|
AdaMerging (layer-wise) | 66.6 | 68.3 | 82.4 | 92.5 | 86.5 | 93.7 | 97.7 | 61.1 | 80.9 |
WEMoE (1-layer) | 73.2 | 76.7 | 93.8 | 98.6 | 95.7 | 98.6 | 99.5 | 74.5 | 88.3 |
WEMoE (2-layer) | 74.1 | 77.4 | 93.7 | 99.1 | 96.2 | 98.9 | 99.6 | 76.4 | 89.4 |
To explore the influence of router depth on the performance of the up-scaled model, we perform an ablation study in which the router depth is varied. In WEMoE modules, the router is implemented as a multi-layer perceptron (MLP); the three variants compared here are sketched in code after the list below.
- WEMoE (0-layer) is a bias-only model, a special case of an MLP with no hidden layers. It produces a constant routing weight for all inputs, i.e. \(r(h) = b_0\), so it does not adapt to the input. When only the MLP modules of the vision Transformer are up-scaled to MoE modules, WEMoE (0-layer) can be viewed as a partial implementation of AdaMerging; and when the vision Transformer is up-scaled layer-wise, WEMoE (0-layer) is equivalent to AdaMerging. For WEMoE (0-layer), the MoE modules can be unloaded, so no additional parameters or inference cost are introduced.
- For WEMoE (1-layer), each router is a one-layer MLP that takes the input sample \(h\) and outputs the routing weight \(r(h)\), which is adaptive to the input. The routing weight is calculated as \(r(h) = W_1 h + b_1\).
- For WEMoE (2-layer), each router is a two-layer MLP and the routing weight is calculated as \(r(h) = W_2 \,\mathrm{ReLU}(W_1 h + b_1) + b_2\).
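The three router variants can be summarized by the following PyTorch sketch. The function and class names are illustrative, not the fusion_bench API; with `d_model = 768` and 8 tasks, a hidden width equal to `d_model` is consistent with the 73.8K (1-layer) and 7.16M (2-layer) figures in the table above.

```python
import torch
import torch.nn as nn


class Router0Layer(nn.Module):
    """WEMoE (0-layer): r(h) = b_0, a constant routing weight independent of the input."""

    def __init__(self, num_tasks: int):
        super().__init__()
        self.b0 = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.b0.expand(*h.shape[:-1], -1)


def make_router(depth: int, d_model: int, num_tasks: int) -> nn.Module:
    """Illustrative construction of the routers compared in the ablation."""
    if depth == 0:
        # equivalent to AdaMerging when the Transformer is up-scaled layer-wise
        return Router0Layer(num_tasks)
    if depth == 1:
        # WEMoE (1-layer): r(h) = W_1 h + b_1
        return nn.Linear(d_model, num_tasks)
    if depth == 2:
        # WEMoE (2-layer): r(h) = W_2 ReLU(W_1 h + b_1) + b_2
        return nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_tasks)
        )
    raise ValueError("depth must be 0, 1, or 2")
```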
In the two tables above, we present additional findings to support this argument, comparing the number of trainable parameters and the performance of WEMoE (1-layer) and WEMoE (2-layer). WEMoE (1-layer) has 73.8K trainable parameters, only 0.01% of the total parameters in the merged model. Notably, its performance is significantly better than AdaMerging and nearly matches that of WEMoE (2-layer) across all tasks. This supports our claim that the MoE design is crucial for the performance gains.
Code Integration¶
Multi-task model fusion experiment on eight image classification tasks:
```bash
# merge eight CLIP-ViT-B/32 models using WE MoE
fusion_bench \
  method=weight_ensembling_moe \
    method.name=clip_weight_ensembling_moe \
    method.use_grad_accumulate=false \
    method.save_checkpoint=outputs/clip-vit-base-patch32_TA8_weight_ensembling_moe_checkpoint.ckpt \
  modelpool=clip-vit-base-patch32_TA8 \
  taskpool=clip-vit-classification_TA8
```
Merge eight CLIP-ViT-L/14 models:
```bash
# merge eight CLIP-ViT-L/14 models using WE MoE, fine-tune the routers
fusion_bench print_config=false \
  method=weight_ensembling_moe \
    method.name=clip_weight_ensembling_moe \
    method.use_grad_accumulate=true \
    method.save_checkpoint=outputs/clip-vit-large-patch14_TA8_weight_ensembling_moe_checkpoint.ckpt \
    method.batch_size=4 method.devices=4 \
  modelpool=clip-vit-large-patch14_TA8 \
  taskpool=dummy &&
# load the checkpoint and evaluate the model
fusion_bench \
  method=weight_ensembling_moe \
    method.name=clip_weight_ensembling_moe \
    method.checkpoint=outputs/clip-vit-large-patch14_TA8_weight_ensembling_moe_checkpoint.ckpt \
  modelpool=clip-vit-large-patch14_TA8 \
  taskpool=clip-vit-classification_TA8 \
  taskpool.clip_model=openai/clip-vit-large-patch14
```
Reference¶
we_moe¶

WeightEnsemblingMoEAlgorithm¶
Bases: ModelFusionAlgorithm
Algorithm for fusing models using Weight Ensembling Mixture of Experts (MoE).
This class provides methods for constructing the MoE model, performing test-time adaptation, and running the fusion process.
Attributes:

- `_fabric` (`Fabric`) – The fabric for distributed training.
- `modelpool` (`ModelPool`) – The pool of models to be fused.
- `profiler` (`SimpleProfiler`) – The profiler for measuring performance.
Source code in fusion_bench/method/we_moe/we_moe.py
__init__(algorithm_config)¶
Initialize the WeightEnsemblingMoEAlgorithm with the given configuration.
Parameters:

- `algorithm_config` (`DictConfig`) – The configuration for the algorithm.
Source code in fusion_bench/method/we_moe/we_moe.py
compute_logits(module, batch, task) (abstractmethod)¶
Compute the logits for a given batch and task.
Parameters:

- `module` – The model module to use for computing logits.
- `batch` – The batch of data.
- `task` – The task for which to compute logits.

Returns:

- `Tensor` – The computed logits.
Source code in fusion_bench/method/we_moe/we_moe.py
construct_moe_model() (abstractmethod)¶
Construct the Mixture of Experts model using the models in the model pool.
Returns:

- `WeightEnsemblingMoE` – The constructed MoE model.
get_shuffled_test_loader_iter(task) (abstractmethod)¶
Get an iterator for the shuffled test data loader for a specific task.
Parameters:

- `task` (`str`) – The task for which to get the test data loader.

Returns:

- `DataLoader` – The shuffled test data loader iterator.
Source code in fusion_bench/method/we_moe/we_moe.py
load_checkpoint(model, checkpoint) (abstractmethod)¶
Load the checkpoint file.
on_test_time_adaptation_start()¶
run(modelpool)¶
Run the WeightEnsemblingMoEAlgorithm to fuse models using Weight Ensembling Mixture of Experts.
Parameters:

- `modelpool` (`ModelPool`) – The pool of models to be fused.

Returns:

- `WeightEnsemblingMoE` – The fused MoE model.
Source code in fusion_bench/method/we_moe/we_moe.py
save_checkpoint(model, checkpoint) (abstractmethod)¶
Save the checkpoint file.
test_time_adaptation(module)¶
Perform test-time adaptation for the given module.
Parameters:

- `module` (`WeightEnsemblingMoE`) – The MoE module to adapt.

Returns:

- `WeightEnsemblingMoE` – The adapted MoE module.
Source code in fusion_bench/method/we_moe/we_moe.py
entropy_loss(logits)¶
Compute the entropy loss of a set of logits.
Parameters:

- `logits` (`Tensor`) – The logits to compute the entropy loss of.

Returns:

- `Tensor` – The entropy loss of the logits.
Source code in fusion_bench/method/we_moe/we_moe.py
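For reference, the entropy objective commonly used for test-time adaptation of the routers takes the following form; this is a minimal sketch and may differ in detail from the implementation referenced above.

```python
import torch
import torch.nn.functional as F


def entropy_loss(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean Shannon entropy of the predicted class distribution.

    Minimizing this on unlabeled test data encourages confident predictions,
    the usual objective for adapting routing weights at test time.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + eps)).sum(dim=-1)
    return entropy.mean()
```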
clip_we_moe¶

CLIPWeightEnsemblingMoEAlgorithm¶
Bases: WeightEnsemblingMoEAlgorithm, CLIPClassificationMixin
CLIPWeightEnsemblingMoEAlgorithm is a class that implements the WeightEnsemblingMoEAlgorithm for CLIP models. It extends the WeightEnsemblingMoEAlgorithm and CLIPClassificationMixin classes.
Attributes:

- `modelpool` (`CLIPVisionModelPool`) – The model pool containing the CLIP models.
Source code in fusion_bench/method/we_moe/clip_we_moe.py
compute_logits(module, batch, task)¶
Compute the logits for the given batch and task.
Returns:

- `Tensor` – The computed logits.
Source code in fusion_bench/method/we_moe/clip_we_moe.py
construct_moe_model()¶
Construct the Mixture of Experts (MoE) model using the models in the model pool.
Returns:

- `WeightEnsemblingMoE` – The constructed MoE model.
Source code in fusion_bench/method/we_moe/clip_we_moe.py
get_shuffled_test_loader_iter(tta_dataset) (cached)¶
Get an iterator for the shuffled test data loader.
Parameters:

- `tta_dataset` (`str`) – The name of the test-time adaptation dataset.

Returns:

- `Iterator` – An iterator for the shuffled test data loader.
Source code in fusion_bench/method/we_moe/clip_we_moe.py
load_checkpoint(model, checkpoint)¶
Load the checkpoint file.
Source code in fusion_bench/method/we_moe/clip_we_moe.py
on_test_time_adaptation_start()¶
Load the CLIP processor and construct the zero-shot classification head for each task.
save_checkpoint(model, checkpoint)¶
Save the checkpoint file.
Source code in fusion_bench/method/we_moe/clip_we_moe.py
1. A. Tang et al. "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts." ICML 2024. http://arxiv.org/abs/2402.00433 ↩
2. Z. Lu, C. Fan, W. Wei, X. Qu, D. Chen, and Y. Cheng. "Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging." NeurIPS 2024. doi: 10.48550/arXiv.2406.15479 ↩
3. L. Shen, A. Tang, E. Yang et al. "Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging." Oct. 2024. ↩