Running Benchmark Using genai-bench Container¶
Using Pre-built Docker Image¶
Pull the latest Docker image:
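For example (a sketch; the registry path and tag here are assumptions, so check the project README or releases page for the published image name):

docker pull ghcr.io/sgl-project/genai-bench:latest   # hypothetical image path; substitute the published one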
Building from Source¶
Alternatively, you can build the image locally from the Dockerfile:
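For example, from the repository root (assuming the Dockerfile lives there), producing the genai-bench:dev tag used in the commands below:

docker build -t genai-bench:dev .   # builds a local image tagged genai-bench:dev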
Run Benchmark with a Docker Container¶
Create a Docker Network (Optional)¶
To avoid internet disruptions and network latency, it's recommended to run the benchmark on the same Docker network as the target inference server. You can always choose to use --network host instead if you prefer.
To create a bridge network in docker:
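For example, to create a user-defined bridge network named benchmark-network (the name referenced by the commands below):

docker network create benchmark-network   # bridge is the default driver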
Start an Inference Server¶
Then, start the inference server using the standard Docker command plus the additional flag --network benchmark-network.
Example:
docker run -itd \
--gpus \"device=0,1,2,3\" \
--shm-size 10g -v /raid/models:/models \
--ulimit nofile=65535:65535 --network benchmark-network \
--name sglang-v0.4.7.post1-llama4-scout-tp4 \
lmsysorg/sglang:v0.4.7.post1-cu124 \
python3 -m sglang.launch_server \
--model-path /models/meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tp 4 \
--port 8080 \
--host 0.0.0.0 \
--context-length 131072
Start a Benchmark Container¶
Next, start the genai-bench container with the same network flag.
Example:
First, create a dataset configuration file that specifies the dataset split explicitly:
llava-config.json:
{
"source": {
"type": "huggingface",
"path": "lmms-lab/llava-bench-in-the-wild",
"huggingface_kwargs": {
"split": "train"
}
},
"prompt_column": "question",
"image_column": "image"
}
Then run the benchmark with the configuration file:
docker run \
-tid \
--shm-size 5g \
--ulimit nofile=65535:65535 \
--env HF_TOKEN="<your_HF_TOKEN>" \
--network benchmark-network \
-v <path-to-your-local-model>:/models \
-v $(pwd)/llava-config.json:/genai-bench/llava-config.json \
--name llama-4-scout-benchmark \
genai-bench:dev \
benchmark \
--api-backend openai \
--api-base http://sglang-v0.4.7.post1-llama4-scout-tp4:8080 \
--api-key your_api_key \
--api-model-name /models/meta-llama/Llama-4-Scout-17B-16E-Instruct \
--model-tokenizer /models/meta-llama/Llama-4-Scout-17B-16E-Instruct \
--task image-text-to-text \
--max-time-per-run 10 \
--max-requests-per-run 100 \
--server-engine "SGLang" \
--server-gpu-type "H100" \
--server-version "v0.4.7.post1" \
--server-gpu-count 4 \
--traffic-scenario "I(512,512)" \
--traffic-scenario "I(2048,2048)" \
--num-concurrency 1 \
--num-concurrency 2 \
--num-concurrency 4 \
--dataset-config /genai-bench/llava-config.json
Note that genai-bench is already the container's entrypoint, so you only need to provide the command arguments after the image name.

A note on addressing: on a user-defined bridge network, containers reach each other by container name, which is why --api-base points at the sglang container's name rather than localhost. If you run both containers with --network host instead, use http://localhost:8080.

The genai-bench runtime UI renders in the container's terminal. One way to view it is to attach to the running container:
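docker attach llama-4-scout-benchmark   # press Ctrl-P then Ctrl-Q to detach without stopping the run

Attaching works here because the container was started with -tid, which allocates a TTY and leaves the container running in the background.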
You can also use tmux for additional parallelism and session control.
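A minimal sketch, assuming tmux is installed on the host (the session name is arbitrary):

tmux new-session -s genai-bench 'docker attach llama-4-scout-benchmark'   # watch the dashboard inside a named session
# detach from tmux with Ctrl-B then D; reattach later with: tmux attach -t genai-bench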
Monitor Benchmark Using Volume Mount¶
To monitor interim benchmark results from outside the container, mount a host directory into the container with a volume and point the --experiment-base-dir option at the mounted path. For example:
HOST_OUTPUT_DIR=$HOME/benchmark_results
CONTAINER_OUTPUT_DIR=/genai-bench/benchmark_results
docker run \
-tid \
--shm-size 5g \
--ulimit nofile=65535:65535 \
--env HF_TOKEN="<your_HF_TOKEN>" \
--network benchmark-network \
-v <path-to-your-local-model>:/models \
-v $HOST_OUTPUT_DIR:$CONTAINER_OUTPUT_DIR \
-v $(pwd)/llava-config.json:/genai-bench/llava-config.json \
--name llama-4-scout-benchmark-monitored \
genai-bench:dev \
benchmark \
--api-backend openai \
--api-base http://sglang-v0.4.7.post1-llama4-scout-tp4:8080 \
--api-key your_api_key \
--api-model-name /models/meta-llama/Llama-4-Scout-17B-16E-Instruct \
--model-tokenizer /models/meta-llama/Llama-4-Scout-17B-16E-Instruct \
--task image-text-to-text \
--max-time-per-run 10 \
--max-requests-per-run 100 \
--server-engine "SGLang" \
--server-gpu-type "H100" \
--server-version "v0.4.7.post1" \
--server-gpu-count 4 \
--traffic-scenario "I(512,512)" \
--traffic-scenario "I(2048,2048)" \
--num-concurrency 1 \
--num-concurrency 2 \
--num-concurrency 4 \
--dataset-config /genai-bench/llava-config.json \
--experiment-base-dir $CONTAINER_OUTPUT_DIR
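With the mount in place, interim results appear under the host directory as the run progresses. For example, using the variables defined above:

ls -R "$HOST_OUTPUT_DIR"                # inspect experiment folders written so far
watch -n 30 ls -lt "$HOST_OUTPUT_DIR"   # refresh the listing every 30 seconds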