LLM Throttling
If you're seeing very slow label generation in Brim, it may be due to LLM throttling. This article explains what throttling is, why it happens, and how to manage it effectively.
What is LLM Throttling?
Large Language Models (LLMs) have limited throughput, which providers enforce as rate limits with two main quotas:
- Tokens per minute (TPM): the total amount of text you can send and receive.
- Requests per minute (RPM): the number of individual API calls allowed.
When you exceed either limit, your requests are throttled: the provider rejects them, and you must wait before retrying. Brim handles some retries automatically, but sustained throttling slows down processing.
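Retrying a throttled request typically follows an exponential-backoff pattern: wait, then retry with growing delays. The sketch below is illustrative only, not Brim's actual implementation; `send_request` stands in for any LLM call, and the `"throttled"` error string is a placeholder for a provider's real rate-limit (HTTP 429) response.

```python
import time

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a throttled request with exponential backoff.

    `send_request` is a placeholder for any function that raises
    RuntimeError("throttled") on a rate-limit error; real LLM clients
    raise provider-specific exceptions instead.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except RuntimeError as err:
            # Re-raise anything that isn't throttling, or the final failure.
            if "throttled" not in str(err) or attempt == max_retries - 1:
                raise
            # Wait base_delay, 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** attempt)
```

Backoff spaces retries out so a temporarily saturated quota has time to recover, instead of hammering the provider with immediate re-sends.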
Why Throttling Happens in Brim
Brim parallelizes work across multiple workers and threads to speed up processing. This allows it to make full use of your LLM provider’s available bandwidth.
More workers means faster processing overall, but too many can saturate your LLM's limits and trigger throttling errors.
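The worker tradeoff can be illustrated with a small sketch using Python's standard thread pool. Here `label_row` is a hypothetical stand-in for a single LLM labeling call, not a real Brim API; the point is that `max_workers` caps how many requests are in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor

def label_rows(rows, label_row, max_workers=4):
    """Label rows in parallel with at most `max_workers` concurrent calls.

    Lowering max_workers trades raw speed for a lower request rate
    against the provider's TPM/RPM limits.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order while fanning calls out
        # across the worker threads.
        return list(pool.map(label_row, rows))
```

With this structure, dialing worker count down is a one-parameter change, which is essentially the tradeoff the Brim setting described below controls.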
How to Manage Throttling
Here are two ways to reduce or eliminate throttling issues:
1. Increase LLM Limits
If you're frequently hitting your limits, the best long-term solution is to request higher rate limits from your LLM provider. Many providers (e.g., Microsoft for Azure OpenAI, Amazon for Bedrock) allow you to request higher token and request-per-minute limits based on your usage needs.
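Before requesting a higher limit, it helps to estimate the token throughput your workload actually needs. A back-of-envelope sketch, where every number is an assumption you'd replace with measurements from your own workload:

```python
def tokens_per_minute_needed(workers, tokens_per_call, calls_per_worker_per_min):
    """Rough estimate of the TPM limit needed to run without throttling.

    All three inputs are workload-specific assumptions: worker count,
    average prompt + completion tokens per call, and per-worker call rate.
    """
    return workers * tokens_per_call * calls_per_worker_per_min

# e.g. 8 workers, ~1,500 tokens per call, ~10 calls per worker per minute
# → roughly a 120,000 TPM limit
```

If the estimate sits well above your current quota, a limit increase is the right fix; if it sits below, throttling is more likely caused by bursts, and reducing workers (next section) may help instead.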
2. Reduce Brim Worker Count
If increasing limits isn’t feasible, you can reduce the number of parallel workers Brim uses. This slows processing but helps avoid triggering throttling errors.
⚠️ This change must be made by an IT administrator. It involves modifying a configuration setting and restarting the Brim service.
Need help adjusting this setting? Contact your IT team or reach out to Brim support for guidance.