LM Evaluation Harness CLI Tool

Overview

The scripts/lm_eval/evaluate_task.sh script is a command-line tool for evaluating language models on various tasks using the LM Evaluation Harness framework. It wraps the lm_eval library and adds conveniences such as GPU detection, automated result organization, and support for multiple inference backends.
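
The GPU detection mentioned above happens inside the script; as a rough illustration of the kind of check such a wrapper can perform (assuming nvidia-smi is available; the real script's logic may differ), it amounts to counting the visible devices:

# Illustrative sketch only; the actual detection logic in evaluate_task.sh may differ
if command -v nvidia-smi >/dev/null 2>&1; then
    NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
else
    NUM_GPUS=0
fi
echo "Detected ${NUM_GPUS} GPU(s)"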

Installation Requirements

Before using the script, ensure you have the following dependencies installed:

# Install LM Evaluation Harness
pip install -e '.[lm-eval-harness]'
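
After installation, you can confirm that the lm_eval CLI is available and can enumerate its tasks:

# Verify the installation by listing available tasks
lm_eval --tasks list | head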

Basic Usage

Syntax

./scripts/lm_eval/evaluate_task.sh MODEL --tasks TASK --output_path OUTPUT_DIR [OPTIONS...]

Required Arguments

  • MODEL: The model path or name to evaluate (positional argument)
  • --tasks TASK: The task(s) to evaluate on (single task or comma-separated list)
  • --output_path OUTPUT_DIR: Directory to save evaluation results

Optional Arguments

  • --batch_size BATCH_SIZE: Batch size for evaluation (default: auto)
  • --use_vllm: Enable vLLM for optimized inference (default: false)
  • --help or -h: Display help information
  • Additional options such as --num_fewshot and --model_args (see the examples below) are accepted and passed through to lm_eval
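
The script also prints its own usage summary, which is the authoritative reference for the options it accepts:

# Show the script's built-in help
./scripts/lm_eval/evaluate_task.sh --help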

Examples

Single Task Evaluation

Evaluate a model on a single task with standard inference:

./scripts/lm_eval/evaluate_task.sh 'meta-llama/Llama-2-7b-hf' \
    --tasks 'hellaswag' \
    --output_path './outputs/lm_eval' \
    --batch_size 8 \
    --num_fewshot 5

Multiple Tasks Evaluation

Evaluate a model on multiple tasks in a single run:

./scripts/lm_eval/evaluate_task.sh 'meta-llama/Llama-2-7b-hf' \
    --tasks 'gsm8k,gsm8k_cot,hellaswag' \
    --output_path './outputs/lm_eval' \
    --batch_size 8

Using vLLM for Optimized Inference

For faster inference with large models, use vLLM:

./scripts/lm_eval/evaluate_task.sh 'meta-llama/Llama-2-7b-hf' \
    --tasks 'lambada_openai' \
    --output_path './outputs/lm_eval' \
    --use_vllm \
    --batch_size auto

Custom vLLM Configuration

Override default vLLM parameters:

./scripts/lm_eval/evaluate_task.sh 'meta-llama/Llama-2-7b-hf' \
    --tasks 'lambada_openai' \
    --output_path './outputs/lm_eval' \
    --use_vllm \
    --model_args 'pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2,gpu_memory_utilization=0.9'

Configuration Options

Inference Backends

Standard Inference (Default)

When using standard inference, the script automatically configures:

  • Model type: hf (Hugging Face transformers)
  • Default model arguments: pretrained=$MODEL,dtype=bfloat16,parallelize=True
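
For reference, the command the wrapper assembles in this mode is roughly equivalent to the direct lm_eval invocation below. This is a sketch based on the defaults listed above, not a verbatim copy of the script's internals; the model, task, and paths are illustrative.

# Approximate hf-backend equivalent of what the wrapper runs (illustrative)
lm_eval --model hf \
    --model_args 'pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16,parallelize=True' \
    --tasks hellaswag \
    --batch_size auto \
    --output_path './outputs/lm_eval'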

vLLM Inference

When --use_vllm is enabled, the script configures:

  • Model type: vllm
  • Default parameters:
      • tensor_parallel_size=1
      • dtype=auto
      • gpu_memory_utilization=0.8
      • data_parallel_size=1
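
The corresponding direct lm_eval command in vLLM mode is roughly the following sketch, again based only on the defaults listed above; the model, task, and paths are illustrative.

# Approximate vllm-backend equivalent of what the wrapper runs (illustrative)
lm_eval --model vllm \
    --model_args 'pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1' \
    --tasks lambada_openai \
    --batch_size auto \
    --output_path './outputs/lm_eval'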

Output Structure

Results are organized in the following directory structure:

OUTPUT_DIR/
├── TASK_NAME_1/
│   └── MODEL_NAME__SANITIZED/
│       ├── results.json
│       └── samples/
└── TASK_NAME_2/
    └── MODEL_NAME__SANITIZED/
        ├── results.json
        └── samples/

Where:

  • TASK_NAME is the name of the evaluation task
  • MODEL_NAME__SANITIZED is the model name with slashes replaced by double underscores
  • results.json contains the evaluation metrics
  • samples/ contains detailed sample-level results
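
To inspect the metrics after a run, pretty-print the results file for the task and model you evaluated. The path below is illustrative, and the exact JSON layout depends on the lm_eval version in use:

# Pretty-print the aggregated metrics for one run (path is illustrative)
python -m json.tool './outputs/lm_eval/hellaswag/meta-llama__Llama-2-7b-hf/results.json'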

Common Tasks

Here are some commonly used evaluation tasks:

Language Understanding

  • hellaswag: Commonsense reasoning
  • piqa: Physical interaction reasoning
  • winogrande: Winograd schema challenge

Mathematical Reasoning

  • gsm8k: Grade school math problems
  • gsm8k_cot: GSM8K with chain-of-thought
  • math: Mathematical problem solving

Language Modeling and Question Answering

  • lambada_openai: Language modeling evaluation
  • arc_easy: Science questions (easy)
  • arc_challenge: Science questions (challenging)

Code Generation

  • humaneval: Code generation evaluation
  • mbpp: Python programming problems
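
To benchmark several models on the same task suite, a simple shell loop around the wrapper is enough; the model names below are illustrative.

# Compare several models on the same tasks (model names are illustrative)
for MODEL in 'meta-llama/Llama-2-7b-hf' 'mistralai/Mistral-7B-v0.1'; do
    ./scripts/lm_eval/evaluate_task.sh "$MODEL" \
        --tasks 'hellaswag,arc_challenge' \
        --output_path './outputs/lm_eval' \
        --batch_size 8
done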