Language Model Evaluation Harness Task Pool

The LMEvalHarnessTaskPool is a task pool implementation that integrates EleutherAI's LM Evaluation Harness (lm-eval) library into the FusionBench framework. It allows you to evaluate language models on a wide range of standardized benchmarks and tasks.

Usage

Basic Usage

fusion_bench \
    method=dummy \
    modelpool=CausalLMPool/single_llama_model \
    taskpool=LMEvalHarnessTaskPool/lm_eval \
    taskpool.tasks="hellaswag,truthfulqa"

Configuration Options

The LMEvalHarnessTaskPool supports the following configuration options:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| tasks | Union[str, List[str]] | Required | Comma-separated string of task names, or a list of task names |
| apply_chat_template | bool | False | Whether to apply a chat template to the prompts |
| include_path | Optional[str] | None | Additional path to include for external tasks |
| batch_size | int | 1 | Batch size for model evaluation |
| metadata | Optional[DictConfig] | None | Additional metadata to pass to task configs |
| verbosity | Optional[Literal["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]] | None | Logging verbosity level |
| output_path | Optional[str] | None | Path to save evaluation results. If not specified, results are saved to log_dir/lm_eval_results, where log_dir is the directory managed by Lightning Fabric |
| log_samples | bool | False | Whether to log individual samples |
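
Any of these parameters can be overridden from the command line in the same way as taskpool.tasks. The sketch below uses illustrative values for the batch size, output path, verbosity, and sample logging; adjust them to your setup:

    fusion_bench \
        method=dummy \
        modelpool=CausalLMPool/single_llama_model \
        taskpool=LMEvalHarnessTaskPool/lm_eval \
        taskpool.tasks="hellaswag,truthfulqa" \
        taskpool.batch_size=8 \
        taskpool.apply_chat_template=true \
        taskpool.verbosity=INFO \
        taskpool.output_path=outputs/lm_eval_results \
        taskpool.log_samples=true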

Example Configurations

Basic evaluation with multiple tasks:

fusion_bench \
    method=dummy \
    modelpool=CausalLMPool/single_llama_model \
    taskpool=LMEvalHarnessTaskPool/lm_eval \
    taskpool.tasks="hellaswag,truthfulqa"

Here, the dummy method simply loads the pre-trained model (or the first model in the model pool) and does nothing else.
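
The same task pool can also evaluate a merged model by swapping in a merging method and a multi-model pool. The sketch below assumes config names such as simple_average and CausalLMPool/llama_models_to_merge, which are placeholders and may differ in your FusionBench installation:

    fusion_bench \
        method=simple_average \
        modelpool=CausalLMPool/llama_models_to_merge \
        taskpool=LMEvalHarnessTaskPool/lm_eval \
        taskpool.tasks="hellaswag,truthfulqa"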

Available Tasks

To see a complete list of available tasks, you can use:

lm-eval --tasks list
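
To evaluate on a custom task definition that is not bundled with lm-eval, point include_path at the directory containing your task YAML files. In the sketch below, my_custom_task and the directory path are placeholders:

    fusion_bench \
        method=dummy \
        modelpool=CausalLMPool/single_llama_model \
        taskpool=LMEvalHarnessTaskPool/lm_eval \
        taskpool.tasks="my_custom_task" \
        taskpool.include_path="/path/to/custom/tasks"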