
# GPT-2 Models for Text Classification

Here we provide a series of GPT-2 models fine-tuned for text classification tasks.

## The Seven Tasks from the GLUE Benchmark

We provide seven GPT-2 models fine-tuned on the following tasks from the GLUE benchmark: CoLA, SST-2, MRPC, QQP, MNLI, RTE, and QNLI. Each model was fine-tuned with a learning rate of 5e-5 for 3 epochs. The models are available on HuggingFace as PyTorch models.
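
For example, each checkpoint can be loaded directly with the `transformers` library. A minimal sketch (the input sentence is illustrative; GPT-2 needs a pad token assigned before batched inference):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the CoLA checkpoint from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("tanganke/gpt2_cola")
model = AutoModelForSequenceClassification.from_pretrained("tanganke/gpt2_cola")

# GPT-2 has no pad token by default; reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer("The book was written by John.", return_tensors="pt")
prediction = model(**inputs).logits.argmax(dim=-1)
print(prediction)  # predicted class index
```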

Evaluation results of these single-task models on the GLUE benchmark are shown below; each row is a model fine-tuned on one task, and each column is the task it is evaluated on:

| Model | CoLA | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2 | Avg. |
|-------|------|------|------|------|------|------|-------|------|
| CoLA  | 76.8 | 32.8 | 68.4 | 50.4 | 39.2 | 48.0 | 51.0  | 52.4 |
| MNLI  | 59.5 | 82.1 | 33.8 | 46.5 | 24.9 | 57.4 | 40.5  | 49.2 |
| MRPC  | 30.8 | 25.9 | 80.4 | 47.1 | 65.9 | 49.1 | 49.1  | 49.8 |
| QNLI  | 58.7 | 38.9 | 30.6 | 88.3 | 39.9 | 48.7 | 47.0  | 50.3 |
| QQP   | 31.4 | 25.7 | 62.3 | 45.0 | 89.6 | 49.1 | 49.1  | 50.3 |
| RTE   | 52.8 | 47.7 | 37.5 | 53.5 | 33.7 | 65.3 | 54.9  | 49.3 |
| SST-2 | 51.8 | 32.9 | 40.2 | 49.8 | 56.8 | 44.4 | 91.2  | 52.4 |

## Model Pool Configuration

To use these models from our FusionBench library, you can specify the modelpool configuration file as follows:

`config/modelpool/gpt-2_glue.yaml`:

```yaml
type: HF_GPT2ForSequenceClassification
models:
  - name: _pretrained_
    path: gpt2
  - name: cola
    path: tanganke/gpt2_cola
  - name: mnli
    path: tanganke/gpt2_mnli
  - name: mrpc
    path: tanganke/gpt2_mrpc
  - name: qnli
    path: tanganke/gpt2_qnli
  - name: qqp
    path: tanganke/gpt2_qqp
  - name: rte
    path: tanganke/gpt2_rte
  - name: sst2
    path: tanganke/gpt2_sst2
```
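
The `_pretrained_` entry points at the base `gpt2` checkpoint, and each task entry points at a fine-tuned checkpoint on the Hub. FusionBench parses this file itself; the sketch below only illustrates what the pool describes by loading every entry with `transformers` (note the pretrained entry gets a freshly initialized classification head):

```python
import yaml
from transformers import AutoModelForSequenceClassification

with open("config/modelpool/gpt-2_glue.yaml") as f:
    pool = yaml.safe_load(f)

# Map each pool entry's name to a loaded model
models = {
    entry["name"]: AutoModelForSequenceClassification.from_pretrained(entry["path"])
    for entry in pool["models"]
}
```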

## Basic Examples

Here are some basic examples of using our CLI tool `fusion_bench` to merge the GPT-2 models.

### Simple Ensemble

Construct an ensemble of the GPT-2 models using a simple ensemble and evaluate it on the seven tasks:

```bash
fusion_bench method=simple_ensemble \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
```
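
Conceptually, a simple ensemble keeps all models and averages their predictions at inference time. A minimal sketch of that idea (not FusionBench's internal implementation; it assumes the ensembled models share a label space, so only two binary tasks are shown):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Two binary-classification checkpoints, for brevity
paths = ["tanganke/gpt2_cola", "tanganke/gpt2_sst2"]
models = [AutoModelForSequenceClassification.from_pretrained(p) for p in paths]

@torch.no_grad()
def ensemble_logits(**inputs):
    # Average the class logits of all ensemble members
    return torch.stack([m(**inputs).logits for m in models]).mean(dim=0)
```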

### Simple Average

Merge the GPT-2 models using simple averaging and evaluate the merged model on the seven tasks:

```bash
fusion_bench method=simple_average \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
```
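
Simple averaging merges in weight space: the merged parameters are the element-wise mean of the fine-tuned models' parameters. A minimal sketch over PyTorch state dicts, assuming all models share the same keys and shapes (the task-specific classification heads have to be handled separately):

```python
import torch

def simple_average(state_dicts):
    """Element-wise mean of a list of state dicts with identical keys."""
    return {
        key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```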

### Fisher Merging

Merge the GPT-2 models using Fisher merging and evaluate the merged model:

```bash
fusion_bench \
  method=gpt2_fisher_merging \
    method.batch_size=8 method.num_fisher_examples=512 \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
```
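
Fisher merging replaces the uniform average with a per-parameter weighted average, where each model's weight is an estimate of its diagonal Fisher information (`method.num_fisher_examples` controls how many examples that estimate uses). A minimal sketch of the merging step itself, assuming the Fisher diagonals have already been estimated from squared gradients:

```python
import torch

def fisher_merging(state_dicts, fishers, eps=1e-8):
    """Fisher-weighted average: theta* = sum_i F_i * theta_i / sum_i F_i."""
    merged = {}
    for key in state_dicts[0]:
        weighted_sum = sum(f[key] * sd[key] for sd, f in zip(state_dicts, fishers))
        fisher_sum = sum(f[key] for f in fishers)
        merged[key] = weighted_sum / (fisher_sum + eps)
    return merged
```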

### RegMean

Merge the GPT-2 models using RegMean and evaluate the merged model:

```bash
fusion_bench \
  method=gpt2_regmean \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
```
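
RegMean merges each linear layer in closed form: with \(X_i\) the inputs that task \(i\) feeds into the layer, it picks the weight that best reproduces every model's outputs in the least-squares sense, \(W^* = (\sum_i X_i^\top X_i)^{-1} \sum_i X_i^\top X_i W_i\). A per-layer sketch, assuming the Gram matrices \(G_i = X_i^\top X_i\) were precomputed from a few forward passes:

```python
import torch

def regmean_linear(weights, grams):
    """Merge linear-layer weights W_i (out x in) given Gram matrices G_i."""
    lhs = sum(grams)                                    # (in, in)
    rhs = sum(G @ W.T for G, W in zip(grams, weights))  # (in, out)
    return torch.linalg.solve(lhs, rhs).T               # merged weight, (out, in)
```

In practice, RegMean also shrinks the off-diagonal entries of each Gram matrix for stability and falls back to simple averaging for non-linear parameters.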

### Task Arithmetic

Merge using Task Arithmetic on the seven tasks:

```bash
# set the scaling factor to 0.3
fusion_bench \
  method=task_arithmetic \
    method.scaling_factor=0.3 \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
```
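
Task Arithmetic first extracts a task vector \(\tau_i = \theta_i - \theta_{\text{pre}}\) from every fine-tuned model, then adds the scaled sum of the task vectors back to the pretrained weights: \(\theta^* = \theta_{\text{pre}} + \lambda \sum_i \tau_i\), where \(\lambda\) is `method.scaling_factor`. A minimal sketch of this rule over state dicts:

```python
import torch

def task_arithmetic(pretrained, finetuned, scaling_factor=0.3):
    """theta* = theta_pre + lambda * sum_i (theta_i - theta_pre)."""
    merged = {}
    for key, base in pretrained.items():
        task_vector_sum = sum(sd[key] - base for sd in finetuned)
        merged[key] = base + scaling_factor * task_vector_sum
    return merged
```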

Alternatively, run the following loop to evaluate the merged model with different scaling factors and save each report to a separate file (the explicit list is equivalent to `$(seq 0 0.1 1.0)`, but easier to read):

```bash
for scaling_factor in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
do
fusion_bench save_report=outputs/gpt2_glue_task_arithmetic_scaling_factor_${scaling_factor}.json \
  method=task_arithmetic \
    method.scaling_factor=${scaling_factor} \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
done
```

After running the above commands, you will get the following results:

Table: Task Arithmetic with different scaling factors

| scaling_coef | cola | mnli | mrpc | qnli | qqp | rte | sst2 | Avg. |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.308725 | 0.330107 | 0.313725 | 0.491671 | 0.63166 | 0.527076 | 0.509174 | 0.444591 |
| 0.1 | 0.426654 | 0.501375 | 0.367647 | 0.556654 | 0.739105 | 0.494585 | 0.509174 | 0.513599 |
| 0.2 | 0.658677 | 0.585532 | 0.698529 | 0.602599 | 0.785258 | 0.472924 | 0.669725 | 0.639035 |
| 0.3 | 0.682646 | 0.639837 | 0.718137 | 0.669046 | 0.807915 | 0.462094 | 0.792431 | 0.68173 |
| 0.4 | 0.690316 | 0.673867 | 0.70098 | 0.702178 | 0.817067 | 0.472924 | 0.819954 | 0.696755 |
| 0.5 | 0.68744 | 0.685583 | 0.696078 | 0.704924 | 0.81818 | 0.472924 | 0.836009 | 0.700163 |
| 0.6 | 0.688399 | 0.680998 | 0.678922 | 0.700531 | 0.808978 | 0.472924 | 0.850917 | 0.697381 |
| 0.7 | 0.684564 | 0.665003 | 0.669118 | 0.702361 | 0.789612 | 0.480144 | 0.853211 | 0.692002 |
| 0.8 | 0.677852 | 0.619154 | 0.659314 | 0.673989 | 0.748776 | 0.501805 | 0.819954 | 0.671549 |
| 0.9 | 0.644295 | 0.503515 | 0.654412 | 0.540912 | 0.637942 | 0.487365 | 0.78555 | 0.607713 |
| 1.0 | 0.627996 | 0.411004 | 0.54902 | 0.496614 | 0.478234 | 0.530686 | 0.71445 | 0.544 |

### Ties-Merging

Merge using Ties-Merging on the seven tasks:

```bash
fusion_bench \
  method=ties_merging \
    method.scaling_factor=0.3 \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
```
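
Ties-Merging refines Task Arithmetic by resolving sign conflicts between task vectors before they are summed: trim each task vector to its largest-magnitude entries, elect a sign per parameter, and average only the entries that agree with the elected sign. A rough per-tensor sketch of those three steps (the 20% trim density is illustrative):

```python
import torch

def ties_merge(task_vectors, density=0.2):
    """Trim, elect sign, and disjoint-merge a list of task-vector tensors."""
    trimmed = []
    for tv in task_vectors:
        # Keep only the top-`density` fraction of entries by magnitude
        k = max(1, int(density * tv.numel()))
        threshold = tv.abs().flatten().kthvalue(tv.numel() - k + 1).values
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # Elect the dominant sign for each parameter
    sign = torch.sign(stacked.sum(dim=0))
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    # Average only the entries whose sign agrees with the elected one
    return (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
```

The merged model is then \(\theta_{\text{pre}} + \lambda\) times the merged task vector, with \(\lambda\) again set by `method.scaling_factor`.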

Alternatively, run the following loop to evaluate the merged model with different scaling factors and save each report to a separate file:

```bash
for scaling_factor in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
do
fusion_bench save_report=outputs/gpt2_glue_ties_merging_scaling_factor_${scaling_factor}.json \
  method=ties_merging \
    method.scaling_factor=${scaling_factor} \
  modelpool=gpt-2_glue \
  taskpool=gpt-2_glue
done
```

Table: Ties-Merging with different scaling factors

| scaling_coef | cola | mnli | mrpc | qnli | qqp | rte | sst2 | Avg. |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.308725 | 0.330107 | 0.313725 | 0.491671 | 0.63166 | 0.527076 | 0.509174 | 0.444591 |
| 0.1 | 0.348035 | 0.45624 | 0.328431 | 0.542559 | 0.70554 | 0.523466 | 0.509174 | 0.487635 |
| 0.2 | 0.489933 | 0.589913 | 0.416667 | 0.596559 | 0.788647 | 0.501805 | 0.510321 | 0.556264 |
| 0.3 | 0.646213 | 0.648497 | 0.632353 | 0.641406 | 0.810611 | 0.516245 | 0.618119 | 0.644778 |
| 0.4 | 0.670182 | 0.691594 | 0.669118 | 0.683141 | 0.821815 | 0.490975 | 0.736239 | 0.680438 |
| 0.5 | 0.681687 | 0.710036 | 0.678922 | 0.696504 | 0.82466 | 0.476534 | 0.77867 | 0.69243 |
| 0.6 | 0.683605 | 0.713805 | 0.683824 | 0.695589 | 0.823967 | 0.476534 | 0.817661 | 0.699284 |
| 0.7 | 0.685523 | 0.700968 | 0.64951 | 0.689365 | 0.816893 | 0.487365 | 0.829128 | 0.694107 |
| 0.8 | 0.686481 | 0.68538 | 0.64951 | 0.693209 | 0.801608 | 0.483755 | 0.837156 | 0.691014 |
| 0.9 | 0.684564 | 0.650229 | 0.671569 | 0.69687 | 0.775587 | 0.516245 | 0.833716 | 0.689826 |
| 1.0 | 0.667306 | 0.576566 | 0.661765 | 0.645616 | 0.72372 | 0.490975 | 0.822248 | 0.655456 |

## Experimental Results

Table: Multi-task model merging methods using GPT-2 models

| Method | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Fine-tuned (STL) | 76.8 | 82.1 | 80.4 | 88.3 | 89.6 | 65.3 | 91.2 | 82.0 |
| **Model Merging** | | | | | | | | |
| Simple Average | 55.0 | 55.1 | 51.0 | 57.6 | 76.7 | 44.8 | 52.5 | 56.1 |
| Fisher Merging | 54.8 | 58.0 | 39.5 | 63.3 | 81.5 | 49.1 | 64.7 | 58.7 |
| RegMean | 61.7 | 70.4 | 65.4 | 69.7 | 78.8 | 56.0 | 79.7 | 68.8 |
| Task Arithmetic (\(\lambda=0.5\)) | 68.7 | 68.6 | 69.6 | 70.5 | 81.8 | 47.3 | 83.6 | 70.0 |
| Ties-Merging (\(\lambda=0.6\)) | 68.4 | 71.4 | 68.4 | 69.6 | 82.4 | 47.7 | 81.8 | 70.0 |