# Handling High Loads in AI DIAL
This document highlights the results of tests we conducted to measure error rates and response speed in AI DIAL under various scenarios, with a focus on high loads involving large prompts and completions.
## Preconditions

We ran a series of tests covering four scenarios: small prompt with small completion, small prompt with large completion, large prompt with small completion, and large prompt with large completion. In each case, an AI DIAL setup with multiple endpoints was compared to a single-endpoint OpenAI setup to demonstrate the advantages of load balancing over a single endpoint.
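To make the setup concrete, the sketch below shows the general shape of such a test in Python: issue chat-completion requests at a fixed rate and record latencies and HTTP 429 responses. The URL, header names, and payload are placeholders rather than our actual test harness, which issued requests concurrently.

```python
import time
import requests  # third-party HTTP client: pip install requests

# Placeholder endpoint and credentials -- substitute your own deployment.
URL = "https://dial.example.com/openai/deployments/gpt-35-turbo-16k/chat/completions"
HEADERS = {"Api-Key": "YOUR_API_KEY", "Content-Type": "application/json"}
BODY = {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 1}

RATE = 10        # target load, requests per second
DURATION = 60    # test length, seconds

latencies_ms, errors_429 = [], 0
for _ in range(RATE * DURATION):
    start = time.monotonic()
    resp = requests.post(URL, headers=HEADERS, json=BODY, timeout=300)
    latencies_ms.append((time.monotonic() - start) * 1000)
    if resp.status_code == 429:
        errors_429 += 1
    # Naive pacing: a real load test issues requests concurrently so that
    # slow responses do not drag the request rate below the target.
    time.sleep(max(0.0, 1 / RATE - (time.monotonic() - start)))

print(f"avg response time: {sum(latencies_ms) / len(latencies_ms):.0f} ms")
print(f"HTTP 429 errors: {errors_429} of {len(latencies_ms)} requests")
```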
## Response Speed

In our tests of average response time, AI DIAL consistently delivered better results than single OpenAI instances.

### Moderate Load
Number of tokens: completion=1, prompt=30, total=31

| System | Model | Endpoints count | Load | Avg response time, ms |
|---|---|---|---|---|
| DIAL Core | gpt-35-turbo-16k | 9 | 10 requests per sec | 542 |

The following chart illustrates that DIAL Core shows a relatively stable and consistent response rate.

| System | Model | Endpoints count | Load | Avg response time, ms |
|---|---|---|---|---|
| OpenAI | gpt-35-turbo-16k | 1 | 10 requests per sec | 799 |

The following chart illustrates that OpenAI, unlike DIAL Core, shows a slower and less consistent response rate even with fewer active users.
### High Load

When we conducted the same tests under a much heavier load (far more tokens per request), AI DIAL again clearly performed better, further confirming its effectiveness.
Number of tokens: completion=2189, prompt=2204, total=4393

| System | Model | Endpoints count | Load | Avg response time, ms |
|---|---|---|---|---|
| DIAL Core | gpt-4-0613 | 6 | 0.5 requests per sec | 121350 |

The following chart illustrates the test case with a significantly higher number of tokens. DIAL Core shows a relatively stable and consistent response rate.

| System | Model | Endpoints count | Load | Avg response time, ms |
|---|---|---|---|---|
| OpenAI | gpt-4-0613 | 1 | 0.5 requests per sec | 177370 |

The following chart illustrates that OpenAI, unlike DIAL Core, shows a slower and less consistent response rate.
## Error Rate

We also ran tests to measure the number of successful completions and the occurrence of errors, specifically HTTP 429 (Too Many Requests). These tests showed that users are far less likely to receive an error response when using AI DIAL.
### Moderate Load
Number of tokens: completion=473, prompt=31, total=504

| System | Model | Endpoints count | Load | Errors |
|---|---|---|---|---|
| DIAL Core | gpt-35-turbo-1106 | 5 | 3 requests per sec | 0% |

The following chart illustrates the test case with a moderate number of tokens. DIAL Core shows a stable and consistent response rate with 0% failed requests.

| System | Model | Endpoints count | Load | Errors |
|---|---|---|---|---|
| OpenAI | gpt-35-turbo-1106 | 1 | 3 requests per sec | 1% |

The following chart illustrates the test case with a moderate number of tokens. OpenAI shows a less stable and less consistent response rate, with 1% of requests failing.
### High Load
Number of tokens: completion=2189, prompt=2204, total=4393

| System | Model | Endpoints count | Load | Errors |
|---|---|---|---|---|
| DIAL Core | gpt-4-1106-preview | 3 | 1 request per sec | 0% |

The following chart illustrates that even under high load, DIAL Core shows a stable and consistent response rate with 0% failed requests.

| System | Model | Endpoints count | Load | Errors |
|---|---|---|---|---|
| OpenAI | gpt-4-1106-preview | 1 | 1 request per sec | 57% |

The following chart illustrates that under high load, OpenAI shows a significantly lower response rate and a very high rate of failed requests (57%).
## Findings

### Efficient Distribution of Quota
AI DIAL allows you to split Azure OpenAI service quotas, which can be allocated to a single deployment or divided among multiple deployments. This feature enables controlled RPM (Requests Per Minute) or TPM (Tokens Per Minute) for applications, optimizing resource allocation and maximizing quota usage.
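For illustration, a DIAL Core configuration along these lines maps one logical model to several Azure OpenAI deployments. The endpoints and keys below are placeholders, and the exact schema should be verified against the DIAL Core configuration reference.

```json
{
  "models": {
    "gpt-35-turbo-16k": {
      "type": "chat",
      "endpoint": "http://adapter-openai/openai/deployments/gpt-35-turbo-16k/chat/completions",
      "upstreams": [
        {"endpoint": "https://resource-1.openai.azure.com/openai/deployments/gpt-35-turbo-16k/chat/completions", "key": "AZURE_KEY_1"},
        {"endpoint": "https://resource-2.openai.azure.com/openai/deployments/gpt-35-turbo-16k/chat/completions", "key": "AZURE_KEY_2"}
      ]
    }
  }
}
```

Each upstream draws on its own deployment's quota, so the RPM/TPM effectively available to the application is the sum across deployments.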
### Load Balancing

AI DIAL's proprietary load balancer efficiently spreads requests across several deployments, ensuring that no single deployment becomes overwhelmed. This strategy delivers consistent performance and avoids bottlenecks, especially during times of peak demand. In our tests, AI DIAL reliably delivered faster average response times and handled more requests per second. While single instances often suffer from rapidly declining throughput and unpredictable response times under heavy loads, AI DIAL sustained a steady and reliable performance level.
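The balancer itself is proprietary, so the following Python snippet is only a conceptual sketch of the underlying idea: rotate requests across upstream deployments so that no single one takes all the traffic. The upstream URLs are hypothetical.

```python
from itertools import cycle

# Hypothetical upstream deployments; DIAL's actual balancer is proprietary
# and also reacts to per-upstream quota and error feedback.
UPSTREAMS = cycle([
    "https://resource-1.openai.azure.com",
    "https://resource-2.openai.azure.com",
    "https://resource-3.openai.azure.com",
])

def next_upstream() -> str:
    """Return the next deployment in round-robin order, so that no single
    deployment receives all the traffic."""
    return next(UPSTREAMS)
```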
### Fewer Errors and Retry Mechanism

AI DIAL's multiple-deployment strategy significantly reduces the likelihood of encountering errors, a common issue with single OpenAI instances during periods of high demand. Additionally, AI DIAL's ability to automatically retry failed requests boosts overall reliability, ensuring consistent performance and a better user experience.
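Conceptually, this retry behavior resembles the client-side pattern sketched below. This is an illustrative assumption, not DIAL's actual implementation, which can also fail over to a different upstream instead of retrying the same one.

```python
import time
import requests  # pip install requests

def post_with_retry(url: str, headers: dict, body: dict, attempts: int = 3):
    """POST with retries on transient failures (HTTP 429 and 5xx)."""
    resp = None
    for attempt in range(attempts):
        resp = requests.post(url, headers=headers, json=body, timeout=300)
        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp  # success or a non-retryable error
        # Exponential backoff before the next attempt; DIAL would instead
        # route the retry to another upstream deployment.
        time.sleep(2 ** attempt)
    return resp  # last response after exhausting retries
```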