GPT-2 Models for Text Classification¶
Here we provide a series of GPT-2 models fine-tuned for text classification tasks.
The Seven Tasks from the GLUE Benchmark¶
We provide seven GPT-2 models fine-tuned on the following tasks from the GLUE benchmark: CoLA, SST-2, MRPC, QQP, MNLI, RTE, and QNLI. Each model is fine-tuned with a learning rate of 5e-5 for 3 epochs. The models are available on HuggingFace as PyTorch models.
Evaluation results of these single-task models on the GLUE benchmark are shown below; each row is a model fine-tuned on one task, and each column is the task it is evaluated on:
Model | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | Avg. |
---|---|---|---|---|---|---|---|---|
CoLA | 76.8 | 32.8 | 68.4 | 50.4 | 39.2 | 48.0 | 51.0 | 52.4 |
MNLI | 59.5 | 82.1 | 33.8 | 46.5 | 24.9 | 57.4 | 40.5 | 49.2 |
MRPC | 30.8 | 25.9 | 80.4 | 47.1 | 65.9 | 49.1 | 49.1 | 49.8 |
QNLI | 58.7 | 38.9 | 30.6 | 88.3 | 39.9 | 48.7 | 47.0 | 50.3 |
QQP | 31.4 | 25.7 | 62.3 | 45.0 | 89.6 | 49.1 | 49.1 | 50.3 |
RTE | 52.8 | 47.7 | 37.5 | 53.5 | 33.7 | 65.3 | 54.9 | 49.3 |
SST-2 | 51.8 | 32.9 | 40.2 | 49.8 | 56.8 | 44.4 | 91.2 | 52.4 |
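Note that the diagonal entries, where each model is evaluated on its own fine-tuning task, correspond to the single-task fine-tuned (STL) results reported in the summary table at the end of this page, while off-diagonal transfer performance is much lower.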
Model Pool Configuration¶
To use these models with our FusionBench library, you can specify the modelpool configuration file as follows:
```yaml
type: HF_GPT2ForSequenceClassification
models:
  - name: _pretrained_
    path: gpt2
  - name: cola
    path: tanganke/gpt2_cola
  - name: mnli
    path: tanganke/gpt2_mnli
  - name: mrpc
    path: tanganke/gpt2_mrpc
  - name: qnli
    path: tanganke/gpt2_qnli
  - name: qqp
    path: tanganke/gpt2_qqp
  - name: rte
    path: tanganke/gpt2_rte
  - name: sst2
    path: tanganke/gpt2_sst2
```
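Assuming FusionBench follows the usual Hydra configuration layout, saving this file as, e.g., `config/modelpool/gpt-2_glue.yaml` lets it be referenced as `modelpool=gpt-2_glue` in the commands below.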
Basic Examples¶
Here are some basic examples of using our CLI tool `fusion_bench` to merge the GPT-2 models.
Simple Ensemble¶
Construct an ensemble of the GPT-2 models using a simple ensemble and evaluate it on the seven tasks.
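A minimal sketch of the command, assuming the method configuration is named `simple_ensemble` (the modelpool and taskpool names follow the other examples on this page):

```bash
fusion_bench \
    method=simple_ensemble \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```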
Simple Average¶
Merge the GPT-2 models using simple averaging and evaluate on the seven tasks.
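A minimal sketch, assuming the method configuration is named `simple_average`:

```bash
fusion_bench \
    method=simple_average \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```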
Fisher Merging¶
Merge the GPT-2 models using Fisher merging and evaluate the merged model:
```bash
fusion_bench \
    method=fisher_merging/gpt2_fisher_merging \
    method.batch_size=8 method.num_fisher_examples=512 \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```
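Here `method.num_fisher_examples` sets how many examples are drawn from each task to estimate the Fisher information, and `method.batch_size` is the batch size used during that estimation pass.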
RegMean¶
Merge the GPT-2 models using RegMean and evaluate the merged model.
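A minimal sketch, assuming a GPT-2-specific RegMean configuration named `regmean/gpt2_regmean`, analogous to the Fisher merging configuration above:

```bash
fusion_bench \
    method=regmean/gpt2_regmean \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```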
Task Arithmetic¶
Merge using Task Arithmetic on the seven tasks:
```bash
# set the scaling factor to 0.3
fusion_bench \
    method=task_arithmetic \
    method.scaling_factor=0.3 \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```
```bash
# Alternatively, run the following script to evaluate the model with different
# scaling factors and save the results to separate files.
# The explicit list below is equivalent to `for scaling_factor in $(seq 0 0.1 1.0)`;
# it is written out for readers who are less familiar with bash.
for scaling_factor in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
do
    fusion_bench report_save_path=outputs/gpt2_glue_task_arithmetic_scaling_factor_${scaling_factor}.json \
        method=task_arithmetic \
        method.scaling_factor=${scaling_factor} \
        modelpool=gpt-2_glue \
        taskpool=gpt-2_glue
done
```
After running the above commands, you will get the following results:
Table: Task Arithmetic with different scaling factors
scaling_coef | cola | mnli | mrpc | qnli | qqp | rte | sst2 | Avg. |
---|---|---|---|---|---|---|---|---|
0.0 | 0.308725 | 0.330107 | 0.313725 | 0.491671 | 0.63166 | 0.527076 | 0.509174 | 0.444591 |
0.1 | 0.426654 | 0.501375 | 0.367647 | 0.556654 | 0.739105 | 0.494585 | 0.509174 | 0.513599 |
0.2 | 0.658677 | 0.585532 | 0.698529 | 0.602599 | 0.785258 | 0.472924 | 0.669725 | 0.639035 |
0.3 | 0.682646 | 0.639837 | 0.718137 | 0.669046 | 0.807915 | 0.462094 | 0.792431 | 0.68173 |
0.4 | 0.690316 | 0.673867 | 0.70098 | 0.702178 | 0.817067 | 0.472924 | 0.819954 | 0.696755 |
0.5 | 0.68744 | 0.685583 | 0.696078 | 0.704924 | 0.81818 | 0.472924 | 0.836009 | 0.700163 |
0.6 | 0.688399 | 0.680998 | 0.678922 | 0.700531 | 0.808978 | 0.472924 | 0.850917 | 0.697381 |
0.7 | 0.684564 | 0.665003 | 0.669118 | 0.702361 | 0.789612 | 0.480144 | 0.853211 | 0.692002 |
0.8 | 0.677852 | 0.619154 | 0.659314 | 0.673989 | 0.748776 | 0.501805 | 0.819954 | 0.671549 |
0.9 | 0.644295 | 0.503515 | 0.654412 | 0.540912 | 0.637942 | 0.487365 | 0.78555 | 0.607713 |
1.0 | 0.627996 | 0.411004 | 0.54902 | 0.496614 | 0.478234 | 0.530686 | 0.71445 | 0.544 |
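The average accuracy peaks at a scaling factor of 0.5 (70.0%), which is the value reported for Task Arithmetic in the summary table below.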
Ties-Merging¶
Merge using Ties-Merging on the seven tasks:
```bash
fusion_bench \
    method=ties_merging \
    method.scaling_factor=0.3 \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```
```bash
# Alternatively, run the following script to evaluate the model with different
# scaling factors and save the results to separate files.
for scaling_factor in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
do
    fusion_bench report_save_path=outputs/gpt2_glue_ties_merging_scaling_factor_${scaling_factor}.json \
        method=ties_merging \
        method.scaling_factor=${scaling_factor} \
        modelpool=gpt-2_glue \
        taskpool=gpt-2_glue
done
```
After running the above commands, you will get the following results:
Table: Ties-Merging with different scaling factors
scaling_coef | cola | mnli | mrpc | qnli | qqp | rte | sst2 | Avg. |
---|---|---|---|---|---|---|---|---|
0.0 | 0.308725 | 0.330107 | 0.313725 | 0.491671 | 0.63166 | 0.527076 | 0.509174 | 0.444591 |
0.1 | 0.348035 | 0.45624 | 0.328431 | 0.542559 | 0.70554 | 0.523466 | 0.509174 | 0.487635 |
0.2 | 0.489933 | 0.589913 | 0.416667 | 0.596559 | 0.788647 | 0.501805 | 0.510321 | 0.556264 |
0.3 | 0.646213 | 0.648497 | 0.632353 | 0.641406 | 0.810611 | 0.516245 | 0.618119 | 0.644778 |
0.4 | 0.670182 | 0.691594 | 0.669118 | 0.683141 | 0.821815 | 0.490975 | 0.736239 | 0.680438 |
0.5 | 0.681687 | 0.710036 | 0.678922 | 0.696504 | 0.82466 | 0.476534 | 0.77867 | 0.69243 |
0.6 | 0.683605 | 0.713805 | 0.683824 | 0.695589 | 0.823967 | 0.476534 | 0.817661 | 0.699284 |
0.7 | 0.685523 | 0.700968 | 0.64951 | 0.689365 | 0.816893 | 0.487365 | 0.829128 | 0.694107 |
0.8 | 0.686481 | 0.68538 | 0.64951 | 0.693209 | 0.801608 | 0.483755 | 0.837156 | 0.691014 |
0.9 | 0.684564 | 0.650229 | 0.671569 | 0.69687 | 0.775587 | 0.516245 | 0.833716 | 0.689826 |
1.0 | 0.667306 | 0.576566 | 0.661765 | 0.645616 | 0.72372 | 0.490975 | 0.822248 | 0.655456 |
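Here the average accuracy peaks at a scaling factor of 0.6 (69.9%), which is the value reported for Ties-Merging in the summary table below.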
Experimental Results¶
Table: Multi-task model merging methods using GPT-2 models
Method | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | Avg. |
---|---|---|---|---|---|---|---|---|
Fine-tuned (STL) | 76.8 | 82.1 | 80.4 | 88.3 | 89.6 | 65.3 | 91.2 | 82.0 |
Model Merging | | | | | | | | |
Simple Average | 55.0 | 55.1 | 51.0 | 57.6 | 76.7 | 44.8 | 52.5 | 56.1 |
Fisher Merging | 54.8 | 58.0 | 39.5 | 63.3 | 81.5 | 49.1 | 64.7 | 58.7 |
RegMean | 61.7 | 70.4 | 65.4 | 69.7 | 78.8 | 56.0 | 79.7 | 68.8 |
Task Arithmetic (\(\lambda=0.5\)) | 68.7 | 68.6 | 69.6 | 70.5 | 81.8 | 47.3 | 83.6 | 70.0 |
Ties-Merging (\(\lambda=0.6\)) | 68.4 | 71.4 | 68.4 | 69.6 | 82.4 | 47.7 | 81.8 | 70.0 |