Analyze Results

After running a benchmark, GuideLLM provides comprehensive results that help you understand your LLM deployment's performance. This guide explains how to interpret both console output and file-based results.

Understanding Console Output

Upon benchmark completion, GuideLLM automatically displays results in the console, divided into three main sections:

1. Benchmarks Metadata

This section provides a high-level summary of the benchmark run, including:

  • Server configuration: Target URL, model name, and backend details
  • Data configuration: Data source, token counts, and dataset properties
  • Profile arguments: Rate type, maximum duration, request limits, etc.
  • Extras: Any additional metadata provided via the --output-extras argument

Example:

Benchmarks Metadata
------------------
Args:        {"backend_type": "openai", "target": "http://localhost:8000", "model": "Meta-Llama-3.1-8B-Instruct-quantized", ...}
Worker:      {"type_": "generative", "backend_type": "openai", "backend_args": {"timeout": 120.0, ...}, ...}
Request Loader: {"type_": "generative", "data_args": {"prompt_tokens": 256, "output_tokens": 128, ...}, ...}
Extras:      {}

2. Benchmarks Info

This section summarizes the key information about each benchmark run, presented as a table with columns:

  • Type: The benchmark type (e.g., synchronous, constant, poisson)
  • Start/End Time: When the benchmark started and ended
  • Duration: Total duration of the benchmark in seconds
  • Requests: Count of successful, incomplete, and errored requests
  • Token Stats: Average token counts and totals for prompts and outputs

This section helps you understand what was executed and provides a quick overview of the results.

3. Benchmarks Stats

This is the most critical section for performance analysis. It displays detailed statistics for each metric:

  • Throughput Metrics:
    • Requests per second (RPS)
    • Request concurrency
    • Output tokens per second
    • Total tokens per second
  • Latency Metrics:
    • Request latency (mean, median, p99)
    • Time to first token (TTFT) (mean, median, p99)
    • Inter-token latency (ITL) (mean, median, p99)
    • Time per output token (mean, median, p99)

The p99 (99th percentile) values are particularly important for SLO analysis: 99% of requests perform at or better than this value, so it captures near-worst-case behavior while excluding only the slowest 1% of requests.

Analyzing the Results File

For deeper analysis, GuideLLM saves detailed results to a file (default: benchmarks.json). This file contains all metrics with more comprehensive statistics and individual request data.

File Formats

GuideLLM supports multiple output formats:

  • JSON: Complete benchmark data in JSON format (default)
  • YAML: Complete benchmark data in human-readable YAML format
  • CSV: Summary of key metrics in CSV format

To specify the format, use the --output-path argument with the appropriate extension:

guidellm benchmark --target "http://localhost:8000" --output-path results.yaml
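
Similarly, a .csv extension writes the summary CSV, and .yaml or .json select those formats:

guidellm benchmark --target "http://localhost:8000" --output-path results.csv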

Programmatic Analysis

For custom analysis, you can reload the results into Python:

from guidellm.benchmark import GenerativeBenchmarksReport

# Load results from file
report = GenerativeBenchmarksReport.load_file("benchmarks.json")

# Access individual benchmarks
for benchmark in report.benchmarks:
    # Print basic info
    print(f"Benchmark: {benchmark.id_}")
    print(f"Type: {benchmark.type_}")

    # Access metrics
    print(f"Avg RPS: {benchmark.metrics.requests_per_second.successful.mean}")
    print(f"p99 latency: {benchmark.metrics.request_latency.successful.percentiles.p99}")
    print(f"TTFT (p99): {benchmark.metrics.time_to_first_token_ms.successful.percentiles.p99}")

Key Performance Indicators

When analyzing your results, focus on these key indicators:

1. Throughput and Capacity

  • Maximum RPS: What's the highest request rate your server can handle?
  • Concurrency: How many concurrent requests can your server process?
  • Token Throughput: How many tokens per second can your server generate?
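
For example, if your run swept multiple request rates, a quick capacity estimate is the highest mean RPS observed across the benchmarks. The sketch below assumes report has already been loaded as in the Programmatic Analysis example above:

# Highest mean RPS observed across all benchmarks in the report
max_rps = max(
    benchmark.metrics.requests_per_second.successful.mean
    for benchmark in report.benchmarks
)
print(f"Peak observed throughput: {max_rps:.2f} requests/sec")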

2. Latency and Responsiveness

  • Time to First Token (TTFT): How quickly does the model start generating output?
  • Inter-Token Latency (ITL): How smoothly does the model generate subsequent tokens?
  • Total Request Latency: How long do complete requests take end-to-end?
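
These three metrics are related: for a streamed response, total request latency is approximately TTFT plus ITL times the number of additional output tokens. A quick back-of-the-envelope check with placeholder numbers (not values from any report) can confirm your measurements are self-consistent:

# Placeholder values; substitute your own measured numbers
ttft_ms = 180.0        # time to first token
itl_ms = 25.0          # average gap between subsequent tokens
output_tokens = 128    # output tokens generated per request

# Remaining tokens arrive roughly one ITL apart after the first token
approx_latency_ms = ttft_ms + (output_tokens - 1) * itl_ms
print(f"Expected request latency: ~{approx_latency_ms:.0f} ms")  # -> ~3355 ms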

3. Reliability and Error Rates

  • Success Rate: What percentage of requests complete successfully?
  • Error Distribution: What types of errors occur and at what rates?
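
If the report exposes the per-benchmark request counts shown in the Benchmarks Info section, the success rate is a one-line calculation. The attribute names below (request_totals.successful and request_totals.errored) are hypothetical placeholders used only to show the shape of the calculation; check your benchmarks.json for the actual field names:

for benchmark in report.benchmarks:
    # NOTE: hypothetical field names; adjust to match your report's schema
    successful = benchmark.request_totals.successful
    errored = benchmark.request_totals.errored
    total = successful + errored
    success_rate = successful / total if total else 0.0
    print(f"{benchmark.id_}: {success_rate:.1%} success rate, {errored} errors")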

Additional Analysis Techniques

Comparing Different Models or Hardware

Run benchmarks with different models or hardware configurations, then compare:

guidellm benchmark --target "http://server1:8000" --output-path model1.json
guidellm benchmark --target "http://server2:8000" --output-path model2.json

Cost Optimization

Calculate cost-effectiveness by analyzing:

  • Tokens per second per dollar of hardware cost
  • Maximum throughput for different hardware configurations
  • Optimal batch size vs. latency tradeoffs
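
The first of these is simple arithmetic: throughput divided by hardware cost. The figures below are placeholders, not measurements; substitute the output token throughput from your own results and your actual pricing:

# Placeholder figures for illustration only
configs = {
    "config_a": {"output_tokens_per_sec": 2400.0, "cost_per_hour": 4.10},
    "config_b": {"output_tokens_per_sec": 1900.0, "cost_per_hour": 2.30},
}

for name, cfg in configs.items():
    tokens_per_dollar = cfg["output_tokens_per_sec"] * 3600 / cfg["cost_per_hour"]
    print(f"{name}: {tokens_per_dollar:,.0f} output tokens per dollar")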

Determining Scaling Requirements

Use your benchmark results to plan:

  • How many servers you need to handle your expected load
  • When to automatically scale up or down based on demand
  • What hardware provides the best price/performance for your workload
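
As a rough sizing sketch for the first point, divide your expected peak request rate by the per-server RPS your benchmarks sustained while still meeting your latency SLO, and add headroom for spikes and failover. The inputs below are placeholders:

import math

# Placeholder inputs; take per-server RPS from your benchmark results
expected_peak_rps = 45.0   # anticipated peak production load
rps_per_server = 6.5       # sustainable RPS per server at acceptable p99 latency
headroom = 0.2             # 20% spare capacity

servers_needed = math.ceil(expected_peak_rps * (1 + headroom) / rps_per_server)
print(f"Provision at least {servers_needed} servers")  # -> 9 with these inputs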