
LM Fine-tuning

FusionBench provides three fine-tuning methods for language models: full-parameter Supervised Fine-Tuning (SFT), PEFT (LoRA) SFT, and Bradley-Terry reward modeling. All three are built on PyTorch Lightning Fabric for distributed training and support FSDP, gradient accumulation, and configurable checkpointing.

Full Fine-tuning SFT

The FullFinetuneSFT algorithm performs full-parameter fine-tuning of a causal language model on supervised instruction datasets. All parameters of the model are updated (optionally excluding token embeddings via fix_token_embedding).

Training Loop. For each batch, the model computes the autoregressive language modeling loss:

\[\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \log p(y_{i,t} \mid y_{i,<t}, x_i; \theta)\]

where \(N\) is the number of samples and the inner sum runs over the token positions of the \(i\)-th response. The loss is computed via the model's built-in cross-entropy, which shifts the labels one position to the right internally.
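Concretely, this is the loss that transformers returns when labels are passed to a causal LM. A minimal sketch (the tiny public checkpoint is illustrative; any causal LM from the modelpool behaves the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; a tiny public checkpoint keeps the sketch cheap to run.
name = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

batch = tokenizer("Instruction: say hi.\nResponse: hi!", return_tensors="pt")
# Passing labels makes the model shift them one position to the right
# internally and average the cross-entropy over the valid token positions.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
print(outputs.loss.item())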

Key Features (a minimal Fabric sketch follows this list):

- Gradient accumulation (accumulate_grad_batches).
- Gradient clipping by value or norm (gradient_clip_val, gradient_clip_algorithm).
- Configurable checkpointing (epoch or step interval, Lightning or HuggingFace format).
- FSDP compatibility with gradient checkpointing.
- Optional token embedding freezing (fix_token_embedding=true).
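These features map onto standard Lightning Fabric calls. A self-contained sketch of the accumulation-and-clipping pattern, with a tiny linear model and a dummy loss standing in for the LM (FusionBench configures FSDP and devices through its own fabric config):

import torch
import lightning as L

fabric = L.Fabric(devices=1)
fabric.launch()

model = torch.nn.Linear(16, 16)  # stand-in for the causal LM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = fabric.setup(model, optimizer)

accumulate_grad_batches = 4
for step in range(100):
    batch = torch.randn(2, 16)
    is_accumulating = (step + 1) % accumulate_grad_batches != 0
    # Skip gradient synchronization on the accumulation micro-steps.
    with fabric.no_backward_sync(model, enabled=is_accumulating):
        loss = model(batch).pow(2).mean()  # stand-in for the LM loss
        fabric.backward(loss / accumulate_grad_batches)
    if not is_accumulating:
        # Corresponds to gradient_clip_val / gradient_clip_algorithm=norm.
        fabric.clip_gradients(model, optimizer, max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()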

CLI Usage

config/method/lm_finetune/fullfinetune_sft.yaml
_target_: fusion_bench.method.FullFinetuneSFT
_recursive_: False
optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-5
  weight_decay: 0.01
  fused: null
lr_scheduler:
  _target_: fusion_bench.optim.lr_scheduler.CosineDecayWithWarmup
  T_max: _T_max_ # this will be replaced by the expected number of training steps
  init_lr: 0
  warmup_steps: 100
  max_lr: ${..optimizer.lr}
  min_lr: 1e-6
dataloader_kwargs:
  # per-gpu batch size
  batch_size: 1
  num_workers: 0
  pin_memory: True
# Training hyperparameters
# if max_epochs=-1, max_steps will be used to determine the number of training steps
max_epochs: 3
max_steps: -1
max_steps_per_epoch: -1
accumulate_grad_batches: 1
lr_scheduler_interval: step
lr_scheduler_frequency: 1
# Checkpointing may be done by epoch or step, and at the end of training
# `checkpoint_save_interval` can be 'epoch' or 'step'
checkpoint_save_interval: epoch
checkpoint_save_frequency: 1
# Whether to use gradient clipping, and if so, the value and algorithm
gradient_clip_val: null
gradient_clip_algorithm: norm
save_optimizer_state: false
# save_full_model must be true when using sharded strategies such as FSDP
save_full_model: true
# save_ckpt_type can be 'hf' or 'lightning'
save_ckpt_type: lightning
# Path to checkpoint to load from, used for resuming training
ckpt_path: null
max_length: 4096
fix_token_embedding: true
Example command:

fusion_bench \
  method=lm_finetune/fullfinetune_sft \
  method.optimizer.lr=1e-5 \
  method.max_epochs=3 \
  method.dataloader_kwargs.batch_size=1 \
  method.max_length=4096 \
  method.fix_token_embedding=true \
  modelpool=CausalLMPool/meta-llama/Llama-2-7b-hf \
  taskpool=dummy

PEFT Fine-tuning SFT

The PeftFinetuneSFT algorithm applies Parameter-Efficient Fine-Tuning (PEFT) using LoRA adapters. Only the LoRA parameters are updated, keeping the base model frozen.

LoRA Configuration. Low-rank adapters are applied to the specified target modules:

\[h = W_0 x + \frac{\alpha}{r} B A x\]

where \(h\) is the layer output, \(W_0\) is the frozen original weight, \(A \in \mathbb{R}^{r \times d_{\text{in}}}\) and \(B \in \mathbb{R}^{d_{\text{out}} \times r}\) are the trainable low-rank matrices, \(r\) is the LoRA rank, and \(\alpha\) is the scaling factor (lora_alpha), so the adapter's contribution is scaled by \(\alpha / r\).
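Numerically, the adapter is just two small matrices added around the frozen weight. A minimal sketch of the forward rule above (dimensions are illustrative):

import torch

d_in, d_out, r, lora_alpha = 4096, 4096, 64, 16

W0 = torch.randn(d_out, d_in)    # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01  # trainable, small random init
B = torch.zeros(d_out, r)        # trainable, zero init so h = W0 x at the start

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x
    return x @ W0.T + (lora_alpha / r) * (x @ A.T @ B.T)

x = torch.randn(2, d_in)
print(lora_forward(x).shape)  # torch.Size([2, 4096])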

Key Features:

- Default target modules: q_proj, v_proj (attention) and gate_proj, down_proj, up_proj (MLP).
- LoRA rank r=64, lora_alpha=16, lora_dropout=0 (all configurable).
- Optional post-training merge and unload (merge_and_unload=true).
- Supports both Lightning and PEFT checkpoint formats.
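For reference, the peft_config below corresponds to building a peft.LoraConfig and wrapping the base model with it. A hedged sketch using the public peft API (the model name is taken from the CLI example and requires access to the gated checkpoint):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj", "gate_proj", "down_proj", "up_proj"],
    r=64, lora_alpha=16, lora_dropout=0.0, bias="none",
)
model = get_peft_model(model, config, adapter_name="default")
model.print_trainable_parameters()  # only the LoRA matrices are trainable

# With merge_and_unload=true, the adapter is folded back into W0 after
# training, returning a plain transformers model:
# model = model.merge_and_unload()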

CLI Usage

config/method/lm_finetune/peftfinetune_sft.yaml
_target_: fusion_bench.method.PeftFinetuneSFT
_recursive_: False
optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-4
  weight_decay: 0.01
  fused: null
lr_scheduler:
  _target_: torch.optim.lr_scheduler.CosineAnnealingLR
  T_max: _T_max_ # this will be replaced by the expected number of training steps
  eta_min: 1e-6
dataloader_kwargs:
  # per-gpu batch size
  batch_size: 1
  num_workers: 0
  pin_memory: True
peft_config:
  _target_: peft.LoraConfig
  task_type: peft.TaskType.CAUSAL_LM
  target_modules:
    # lora attention modules
    - q_proj
    - v_proj
    # lora mlp modules
    - gate_proj
    - down_proj
    - up_proj
  r: 64
  lora_alpha: 16
  lora_dropout: 0
  bias: none
adapter_name: default
# whether to merge and unload the adapter after training
merge_and_unload: false
# Training hyperparameters
# if max_epochs=-1, max_steps will be used to determine the number of training steps
max_epochs: 3
max_steps: -1
max_steps_per_epoch: -1
accumulate_grad_batches: 1
lr_scheduler_interval: step
lr_scheduler_frequency: 1
# Checkpointing may be done by epoch or step, and at the end of training
# `checkpoint_save_interval` can be 'epoch' or 'step'
checkpoint_save_interval: epoch
checkpoint_save_frequency: 1
# Whether to use gradient clipping, and if so, the value and algorithm
gradient_clip_val: null
gradient_clip_algorithm: norm
save_optimizer_state: false
# save_full_model must be true when using sharded strategies such as FSDP
save_full_model: false
# save_ckpt_type can be 'peft' or 'lightning'
save_ckpt_type: lightning
# Path to checkpoint to load from, used for resuming training
ckpt_path: null
max_length: 4096
Example command:

fusion_bench \
  method=lm_finetune/peftfinetune_sft \
  method.peft_config.r=64 \
  method.peft_config.lora_alpha=16 \
  method.optimizer.lr=1e-4 \
  method.max_epochs=3 \
  method.merge_and_unload=false \
  modelpool=CausalLMPool/meta-llama/Llama-2-7b-hf \
  taskpool=dummy

Bradley-Terry Reward Modeling

The BradleyTerryRewardModeling algorithm trains a reward model using the Bradley-Terry pairwise preference model. Given pairs of (chosen, rejected) responses, it learns to assign higher rewards to the chosen response.

The Bradley-Terry Loss:

\[\mathcal{L} = -\mathbb{E} \left[ \log \sigma(r_{\theta}(x, y_{\text{chosen}}) - r_{\theta}(x, y_{\text{rejected}})) \right]\]

where \(r_{\theta}\) is the reward model (a sequence classification head on top of the LLM), \(\sigma\) is the sigmoid function, and \((x, y_{\text{chosen}})\) and \((x, y_{\text{rejected}})\) form a preference pair.
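A minimal sketch of this loss, assuming the forward batch is laid out as all chosen rewards followed by all rejected rewards (see the collate description below):

import torch
import torch.nn.functional as F

def bradley_terry_loss(rewards: torch.Tensor) -> torch.Tensor:
    # rewards has shape (2B,): the first B entries score the chosen
    # responses, the last B the rejected ones.
    chosen, rejected = rewards.chunk(2, dim=0)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen - rejected).mean()

rewards = torch.tensor([1.3, 0.9, 0.2, 1.1])  # two preference pairs
print(bradley_terry_loss(rewards))  # larger chosen-rejected gaps -> lower loss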

Dataset Format. Each sample contains:

- chosen_input_ids, chosen_attention_mask: token IDs and attention mask for the preferred response.
- rejected_input_ids, rejected_attention_mask: token IDs and attention mask for the rejected response.

The collate function stacks all chosen samples followed by all rejected samples into a single forward batch, so the effective batch size is twice the number of preference pairs and therefore always even.
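A sketch of such a collate function (field names follow the dataset format above; padding details are simplified relative to the actual implementation):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_preference_batch(samples, pad_token_id=0):
    # Stack all chosen examples first, then all rejected ones, so the
    # model sees one forward batch of size 2B.
    input_ids = [torch.tensor(s["chosen_input_ids"]) for s in samples] + \
                [torch.tensor(s["rejected_input_ids"]) for s in samples]
    attention_mask = [torch.tensor(s["chosen_attention_mask"]) for s in samples] + \
                     [torch.tensor(s["rejected_attention_mask"]) for s in samples]
    return {
        "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=pad_token_id),
        "attention_mask": pad_sequence(attention_mask, batch_first=True, padding_value=0),
    }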

CLI Usage

config/method/lm_finetune/bradley_terry_rm.yaml
_target_: fusion_bench.method.BradleyTerryRewardModeling
_recursive_: False
optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-5
  weight_decay: 0.01
  fused: null
lr_scheduler:
  _target_: fusion_bench.optim.lr_scheduler.CosineDecayWithWarmup
  T_max: _T_max_ # this will be replaced by the expected number of training steps
  init_lr: 0
  warmup_steps: 100
  max_lr: ${..optimizer.lr}
  min_lr: 1e-6
dataloader_kwargs:
  # per-gpu batch size
  batch_size: 1
  num_workers: 0
  pin_memory: True
# Training hyperparameters
# if max_epochs=-1, max_steps will be used to determine the number of training steps
max_epochs: 3
max_steps: -1
max_steps_per_epoch: -1
accumulate_grad_batches: 1
lr_scheduler_interval: step
lr_scheduler_frequency: 1
# Checkpointing may be done by epoch or step, and at the end of training
# `checkpoint_save_interval` can be 'epoch' or 'step'
checkpoint_save_interval: epoch
checkpoint_save_frequency: 1
# Whether to use gradient clipping, and if so, the value and algorithm
gradient_clip_val: null
gradient_clip_algorithm: norm
save_optimizer_state: false
# save_full_model must be true when using sharded strategies such as FSDP
save_full_model: true
# save_ckpt_type can be 'hf' or 'lightning'
save_ckpt_type: lightning
# Path to checkpoint to load from, used for resuming training
ckpt_path: null
max_length: 4096
fix_token_embedding: true
Example command:

fusion_bench \
  method=lm_finetune/bradley_terry_rm \
  method.optimizer.lr=1e-5 \
  method.max_epochs=3 \
  method.dataloader_kwargs.batch_size=2 \
  method.max_length=4096 \
  modelpool=SequenceClassificationModelPool/reward_model \
  taskpool=dummy

Common Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_epochs | int | 3 | Max training epochs (-1 = use max_steps). |
| max_steps | int | -1 | Max training steps (-1 = use max_epochs). |
| max_steps_per_epoch | int | -1 | Max steps per epoch. |
| accumulate_grad_batches | int | 1 | Gradient accumulation factor. |
| gradient_clip_val | float | null | Gradient clipping threshold. |
| gradient_clip_algorithm | str | "norm" | Clipping algorithm: "value" or "norm". |
| checkpoint_save_interval | str | "epoch" | Checkpoint interval: "epoch" or "step". |
| checkpoint_save_frequency | int | 1 | Checkpoint frequency per interval. |
| save_ckpt_type | str | "lightning" | Checkpoint format: "lightning", "hf", or "peft". |
| save_full_model | bool | true | Save the full model or only trainable params. |
| save_optimizer_state | bool | false | Save the optimizer state in the checkpoint. |
| ckpt_path | str | null | Checkpoint path to resume training from. |
| max_length | int | 4096 | Max sequence length. |
| fix_token_embedding | bool | true | Freeze token embeddings (SFT/RM only). |

LR Scheduler Configuration

The _T_max_ placeholder in LR scheduler configs is automatically replaced with the computed total number of training steps. This allows the scheduler to be configured without knowing the dataset size in advance.
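Conceptually, the replacement computes the expected number of optimizer steps from the dataloader length, accumulation factor, and epoch count, then substitutes it for _T_max_ before the scheduler is instantiated. A hedged sketch (the actual helper in fusion_bench may differ in name and details):

import math

# Hypothetical helper; illustrates the step-count arithmetic only.
def expected_total_steps(num_batches: int, max_epochs: int, accumulate_grad_batches: int) -> int:
    steps_per_epoch = math.ceil(num_batches / accumulate_grad_batches)
    return steps_per_epoch * max_epochs

T_max = expected_total_steps(num_batches=10_000, max_epochs=3, accumulate_grad_batches=8)
print(T_max)  # 3750; this value replaces the _T_max_ placeholder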

Implementation Details

1. The Bradley-Terry model for reward modeling follows the approach used in InstructGPT and RLHF pipelines.