MoE-based Model Upscaling (Sparse Upcycling)¶
Sparse upcycling is a technique used to initialize a sparsely activated Mixture-of-Experts (MoE) model from a dense checkpoint. This approach leverages previously incurred training costs to improve the performance of large models while reducing computational expense. In the process, dense Transformer blocks are partially replaced with MoE blocks, in which the MLP of the Transformer block is replaced by multiple experts; the experts applied to each token are chosen based on routing probabilities produced by a router. The initialized MoE model is then further trained to recover performance. This method yields improved performance for both language and vision models while using only a fraction of the original dense pretraining cost [1].
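The initialization step can be illustrated with a short PyTorch sketch. This is only an illustration of the idea, not the fusion_bench implementation, and the module and parameter names are hypothetical: every expert starts as a copy of the dense block's MLP, while the router is new and trained from scratch.
import copy

import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    """Sketch of sparse upcycling applied to one Transformer block's MLP."""

    def __init__(self, dense_mlp: nn.Module, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        # Each expert is initialized as an identical copy of the dense MLP.
        self.experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
        # The router is randomly initialized and learned during further training.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        probs = self.router(x).softmax(dim=-1)                 # routing probabilities
        weights, indices = probs.topk(self.top_k, dim=-1)      # experts chosen per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for expert_idx, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = indices[:, k] == expert_idx
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: a dense MLP upcycled into 4 experts with 2 active per token.
dense_mlp = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
moe_block = UpcycledMoE(dense_mlp, hidden_size=16, num_experts=4, top_k=2)
print(moe_block(torch.randn(8, 16)).shape)  # torch.Size([8, 16])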
Examples¶
Here’s an example demonstrating how to upscale a pre-trained Mistral model to a Mixtral model:
import os

from omegaconf import DictConfig
from transformers import MistralForCausalLM

from fusion_bench.method.mixture_of_experts.mixtral_upcycling import (
    MixtralForCausalLMUpscalingAlgorithm,
)
from fusion_bench.utils import print_parameters

# Load a pre-trained Mistral model
pretrained_model = MistralForCausalLM.from_pretrained(
    os.path.expanduser("path_to_mistral_model")
)
print("Pretrained model:")
print_parameters(pretrained_model)
# Output:
# Pretrained model:
# trainable params: 7.24B || all params: 7.24B || trainable%: 100.0000

# Define the configuration for Mixtral
config = {
    "num_experts": 4,  # Number of experts
    "experts_per_token": 2,  # Experts to choose per token
}

# Initialize the upscaling algorithm
upscaling_for_causal_lm_algorithm = MixtralForCausalLMUpscalingAlgorithm(
    DictConfig(config)
)

# Run the upscaling process to get a Mixtral model
mixtral_for_causal_lm_model = upscaling_for_causal_lm_algorithm.run(pretrained_model)
print("Mixtral model:")
print_parameters(mixtral_for_causal_lm_model)
# Output:
# Mixtral model:
# trainable params: 24.15B || all params: 24.15B || trainable%: 100.0000

# Save the upscaled Mixtral model
mixtral_for_causal_lm_model.save_pretrained("path_to_save_mixtral_model")
A Jupyter notebook example is also available at our repo.
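If you later need the saved checkpoint, it can be reloaded with the standard transformers API (a minimal sketch, reusing the placeholder path from the example above):
from transformers import MixtralForCausalLM

# Reload the upscaled checkpoint saved by the example above.
mixtral_model = MixtralForCausalLM.from_pretrained("path_to_save_mixtral_model")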
Code Integration¶
This is a guide on how to use the fusion_bench command-line interface to upscale a Mistral model to a Mixtral model.
The first YAML configuration file below configures the upscaling method. The name field specifies the name of the upscaling method. The num_experts field specifies the number of experts to use in the upscaling process. The experts_per_token field specifies the number of experts to use per token. The save_checkpoint field specifies the path where the upscaled model will be saved, if provided.
name: mixtral_for_causal_lm_moe_upscaling # or "mixtral_moe_upscaling"
num_experts: 4
experts_per_token: 2
# path to save the upscaled model
save_checkpoint: null
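For reference, this method configuration corresponds roughly to constructing the algorithm programmatically, as in the Python example above (a sketch; save_checkpoint: null is read here as "do not save"):
from omegaconf import DictConfig
from fusion_bench.method.mixture_of_experts.mixtral_upcycling import (
    MixtralForCausalLMUpscalingAlgorithm,
)

# The YAML fields map onto the algorithm's options.
algorithm = MixtralForCausalLMUpscalingAlgorithm(
    DictConfig(
        {
            "num_experts": 4,
            "experts_per_token": 2,
            "save_checkpoint": None,  # mirrors `save_checkpoint: null` above
        }
    )
)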
The second YAML configuration file configures the model pool. The type field specifies the type of model pool to use. The models field is a list of models to include in the pool; each model should have a name and a path, and the model is loaded from the path.
type: AutoModelForCausalLMPool
# each model should have a name and a path, and the model is loaded from the path
# this is equivalent to `AutoModelForCausalLM.from_pretrained(path)`
models:
  - name: _pretrained_
    path: path_to_your_pretrained_model
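As the comments indicate, loading a model from this pool is conceptually equivalent to the following (placeholder path as above):
from transformers import AutoModelForCausalLM

# What the `_pretrained_` entry resolves to, per the comment in the YAML above.
pretrained_model = AutoModelForCausalLM.from_pretrained("path_to_your_pretrained_model")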
Finally, the fusion_bench command-line interface is run with the specified method, model pool, and task pool. The method argument specifies the upscaling method to use. The modelpool argument specifies the model pool to use. The modelpool.models.0.path argument specifies the path to the pretrained model to use. The taskpool argument specifies the task pool to use. In this case, a dummy task pool is used that does nothing but print the parameter counts of the merged model.
fusion_bench \
method=mixtral_moe_upscaling \
modelpool=mixtral_moe_upscaling \
modelpool.models.0.path=path_to_your_pretrained_model \
taskpool=dummy # this is a dummy taskpool that does nothing but print the parameter counts of the merged model
References¶
mixtral_upcycling¶

MixtralForCausalLMUpscalingAlgorithm¶
Bases: BaseAlgorithm
This class is responsible for upscaling a model to a MixtralForCausalLM.
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
__init__(num_experts, experts_per_token, save_checkpoint, **kwargs)¶
Initialize the MixtralForCausalLMUpscalingAlgorithm.
Parameters:
- num_experts (int) – The number of experts in the Mixtral model.
- experts_per_token (int) – The number of experts per token.
- save_checkpoint (str) – The path to save the checkpoint.
- **kwargs – Additional keyword arguments.
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
run(modelpool)¶
Runs the upscaling process.
Parameters:
- modelpool (ModelPool | LlamaForCausalLM | MistralForCausalLM) – The model to be upscaled.
Returns:
- MixtralForCausalLM – The upscaled model.
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
MixtralUpscalingAlgorithm¶
Bases: BaseAlgorithm
This class is responsible for upscaling a model to a MixtralModel.
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
__init__(num_experts, experts_per_token, save_checkpoint, **kwargs)¶
Initialize the MixtralUpscalingAlgorithm.
Parameters:
- num_experts (int) – The number of experts in the Mixtral model.
- experts_per_token (int) – The number of experts per token.
- save_checkpoint (str) – The path to save the checkpoint.
- **kwargs – Additional keyword arguments.
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
run(modelpool)¶
Runs the upscaling process.
Parameters:
- modelpool (ModelPool | LlamaModel | MistralModel) – The model to be upscaled.
Returns:
- MixtralModel – The upscaled model.
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
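A brief usage sketch for this backbone (non-causal-LM) variant, mirroring the causal-LM example at the top of this page (paths are placeholders):
from omegaconf import DictConfig
from transformers import MistralModel
from fusion_bench.method.mixture_of_experts.mixtral_upcycling import (
    MixtralUpscalingAlgorithm,
)

# Upscale a bare Mistral backbone (no LM head) to a MixtralModel.
backbone = MistralModel.from_pretrained("path_to_mistral_model")
algorithm = MixtralUpscalingAlgorithm(
    DictConfig({"num_experts": 4, "experts_per_token": 2})
)
mixtral_backbone = algorithm.run(backbone)  # returns a MixtralModel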
upscale_to_mixtral_for_causal_lm(input_model, output_model)¶
A helper function that upscales a LlamaForCausalLM or MistralForCausalLM to a MixtralForCausalLM.
Parameters:
- input_model (LlamaForCausalLM | MistralForCausalLM) – The input model to be upscaled.
- output_model (MixtralForCausalLM) – The output model where the upscaled weights will be loaded.
Returns:
- None
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
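For completeness, here is a hedged sketch of calling this helper directly. The upscaling algorithms above build the target model for you; the Mixtral config construction below is an assumption for illustration and carries over only the most common fields (a real config may also need rope_theta, sliding_window, token ids, etc.):
from transformers import MistralForCausalLM, MixtralConfig, MixtralForCausalLM
from fusion_bench.method.mixture_of_experts.mixtral_upcycling import (
    upscale_to_mixtral_for_causal_lm,
)

# Load the dense model to be upscaled (placeholder path).
dense_model = MistralForCausalLM.from_pretrained("path_to_mistral_model")
dense_config = dense_model.config

# Mirror the dense config and add the MoE settings (illustrative field mapping).
mixtral_config = MixtralConfig(
    vocab_size=dense_config.vocab_size,
    hidden_size=dense_config.hidden_size,
    intermediate_size=dense_config.intermediate_size,
    num_hidden_layers=dense_config.num_hidden_layers,
    num_attention_heads=dense_config.num_attention_heads,
    num_key_value_heads=dense_config.num_key_value_heads,
    max_position_embeddings=dense_config.max_position_embeddings,
    rms_norm_eps=dense_config.rms_norm_eps,
    num_local_experts=4,
    num_experts_per_tok=2,
)

# Instantiate an empty Mixtral model, then copy the dense weights into it,
# replicating the dense MLP into every expert.
mixtral_model = MixtralForCausalLM(mixtral_config)
upscale_to_mixtral_for_causal_lm(dense_model, mixtral_model)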
upscale_to_mixtral_model(input_model, output_model)¶
A helper function that upscales a LlamaModel or MistralModel to a MixtralModel.
Parameters:
- input_model (LlamaModel | MistralModel) – The input model to be upscaled.
- output_model (MixtralModel) – The output model where the upscaled weights will be loaded.
Returns:
- None
Source code in fusion_bench/method/mixture_of_experts/mixtral_upcycling.py
1. Komatsuzaki et al. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. http://arxiv.org/abs/2212.05055