Language Model Evaluation Harness Task Pool

The LMEvalHarnessTaskPool is a task pool implementation that integrates EleutherAI's LM Evaluation Harness (lm-eval) library into the FusionBench framework. It allows you to evaluate language models on a wide range of standardized benchmarks and tasks.

Usage

Basic Usage

fusion_bench \
    method=dummy \
    modelpool=CausalLMPool/single_llama_model \
    taskpool=LMEvalHarnessTaskPool/lm_eval \
    taskpool.tasks="hellaswag,truthfulqa"

Configuration Options

The LMEvalHarnessTaskPool supports the following configuration options:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| tasks | Union[str, List[str]] | Required | Comma-separated string of task names, or a list of task names |
| apply_chat_template | bool | False | Whether to apply a chat template to the prompts |
| include_path | Optional[str] | None | Additional path to include for external tasks |
| batch_size | int | 1 | Batch size for model evaluation |
| metadata | Optional[DictConfig] | None | Additional metadata to pass to task configs |
| verbosity | Optional[Literal["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]] | None | Logging verbosity level |
| output_path | Optional[str] | None | Path to save evaluation results. If not specified, results are saved to log_dir/lm_eval_results, where log_dir is the directory managed by Lightning Fabric |
| log_samples | bool | False | Whether to log individual samples |
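
Any of these parameters can be overridden from the command line in the same way as taskpool.tasks. The sketch below uses illustrative values for the batch size, output path, verbosity, and sample logging; adjust them to your setup:

    fusion_bench \
        method=dummy \
        modelpool=CausalLMPool/single_llama_model \
        taskpool=LMEvalHarnessTaskPool/lm_eval \
        taskpool.tasks="hellaswag,truthfulqa" \
        taskpool.batch_size=8 \
        taskpool.apply_chat_template=true \
        taskpool.verbosity=INFO \
        taskpool.output_path=outputs/lm_eval_results \
        taskpool.log_samples=true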

Example Configurations

Basic evaluation with multiple tasks:

fusion_bench \
    method=dummy \
    modelpool=CausalLMPool/single_llama_model \
    taskpool=LMEvalHarnessTaskPool/lm_eval \
    taskpool.tasks="hellaswag,truthfulqa"

Here, the dummy method simply loads the pre-trained model (or the first model in the model pool) and does nothing else.
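
The same task pool can also evaluate a merged model by swapping in a merging method and a multi-model pool. The sketch below assumes config names such as simple_average and CausalLMPool/llama_models_to_merge, which are placeholders and may differ in your FusionBench installation:

    fusion_bench \
        method=simple_average \
        modelpool=CausalLMPool/llama_models_to_merge \
        taskpool=LMEvalHarnessTaskPool/lm_eval \
        taskpool.tasks="hellaswag,truthfulqa"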

Available Tasks

To see a complete list of available tasks, you can use:

lm-eval --tasks list
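
To evaluate on a custom task definition that is not bundled with lm-eval, point include_path at the directory containing your task YAML files. In the sketch below, my_custom_task and the directory path are placeholders:

    fusion_bench \
        method=dummy \
        modelpool=CausalLMPool/single_llama_model \
        taskpool=LMEvalHarnessTaskPool/lm_eval \
        taskpool.tasks="my_custom_task" \
        taskpool.include_path="/path/to/custom/tasks"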