Variable Best Practices

Well-designed variables are the foundation of accurate, scalable chart abstraction. When a variable is clearly defined and thoughtfully scoped, it leads to better AI performance, easier human review, and cleaner data for your team.

Use the checklist below to evaluate and improve your variables.


The checklist covers four categories:

1) Simplicity: Is this variable answering the simplest question possible? Note: Simple variables can be layered to achieve very complex reasoning, with better accuracy and traceability than a single complex variable.

2) Semantics: Are the desired results in various situations clearly described?

3) References: If this variable references others, are those references clearly laid out and contextualized?

4) Completeness: Are the fields of this variable complete and consistent?


1) Simplicity

✅ Single Concept

Each variable should capture one idea only. Avoid combining multiple questions into one (e.g., “Does the patient have diabetes or hypertension?”). Your instruction should not require conditional logic like "If X, then Y, unless Z."

Weak on Single Concept:

left_handed_tennis_injury: Is this patient left handed and do they have a tennis injury?

Strong on Single Concept:

left_handed: Is the patient left handed (boolean)

tennis_injury: Do they have an injury from tennis? (boolean)

left_handed_tennis_injury: Are left_handed and tennis_injury both true for this patient?
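
The layering of simple variables can be sketched in code. This is a minimal Python illustration, not Brim's implementation: the PatientValues container and the function below are hypothetical names, and the sketch assumes the two boolean variables have already been abstracted.

```python
from dataclasses import dataclass


@dataclass
class PatientValues:
    """Hypothetical container for two already-abstracted boolean variables."""
    left_handed: bool
    tennis_injury: bool


def left_handed_tennis_injury(values: PatientValues) -> bool:
    # The composite variable only combines the two simple answers;
    # it never re-reads the chart itself.
    return values.left_handed and values.tennis_injury


result = left_handed_tennis_injury(PatientValues(left_handed=True, tennis_injury=False))
print(result)  # False
```

Because each input is a single yes/no question, a reviewer can see which of the two simple variables caused a given composite result.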


✅ Limited Categories (Bonus)

Whenever possible, keep your variable response options simple. Booleans (yes/no) or limited categories (e.g., stage 1–4) are easier for both humans and AI to interpret consistently.


Not Limited Categories:

tennis_injury_type: Free text description of tennis injury type.

Limited Categories:

tennis_injury_type: ankle sprain, torn ACL, broken leg, broken wrist, other
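
One practical benefit of a fixed category list is that results can be checked mechanically. The sketch below is a hypothetical Python helper (not part of Brim) that maps a raw answer onto the categories from the example. Falling back to "other" for anything unrecognized is an assumption made for illustration.

```python
# Categories taken from the tennis_injury_type example above.
ALLOWED_INJURY_TYPES = ("ankle sprain", "torn ACL", "broken leg", "broken wrist", "other")

# Lookup table keyed by lowercase text so matching ignores capitalization.
_BY_LOWERCASE = {category.lower(): category for category in ALLOWED_INJURY_TYPES}


def normalize_injury_type(raw: str) -> str:
    """Return the matching allowed category, or 'other' if none matches."""
    return _BY_LOWERCASE.get(raw.strip().lower(), "other")


print(normalize_injury_type(" Torn acl "))    # torn ACL
print(normalize_injury_type("tennis elbow"))  # other
```

A free-text variable offers no equivalent check, which is one reason limited categories tend to be easier to review.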

2) Semantics

✅ Insufficient Evidence

Make it clear what to do if the desired information isn’t found. For example:

“If the diagnosis is not mentioned, return ‘unknown.’”

Weak Guidance for Insufficient Evidence:

tennis_injury_type: should be ankle sprain, torn ACL, broken leg, broken wrist, or other

Strong Guidance for Insufficient Evidence:

tennis_injury_type: should be ankle sprain, torn ACL, broken leg, broken wrist, or other

In the prompt:

“If there is insufficient direct evidence of a tennis injury, return ‘None.’”

✅ Multiple Values

Specify how to handle cases where the value appears more than once. Common patterns include:

  • Use the most recent value
  • Use the earliest documented value
  • Choose the highest grade or most severe classification

Weak Guidance for Multiple Values:

tennis_injury_type: should be ankle sprain, torn ACL, broken leg, broken wrist, or other

Strong Guidance for Multiple Values:

tennis_injury_type: should be ankle sprain, torn ACL, broken leg, broken wrist, or other

In the aggregation instructions:

“If there are multiple tennis injuries, return the most recent.” Alternatively, choose a standard aggregation instruction.

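The three common patterns for multiple values (most recent, earliest, most severe) can be sketched as small Python functions. Everything here is hypothetical and for illustration only: the Finding structure, the function names, and especially the severity ranking, which is an arbitrary ordering and not clinical guidance. It does not describe how Brim aggregates values internally.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Finding:
    """One documented value for a variable (hypothetical structure)."""
    value: str
    documented_on: date


def most_recent(findings: list[Finding]) -> str:
    return max(findings, key=lambda f: f.documented_on).value


def earliest(findings: list[Finding]) -> str:
    return min(findings, key=lambda f: f.documented_on).value


def most_severe(findings: list[Finding], ranking: list[str]) -> str:
    # `ranking` lists values from least to most severe.
    return max(findings, key=lambda f: ranking.index(f.value)).value


findings = [
    Finding("ankle sprain", date(2023, 3, 1)),
    Finding("broken wrist", date(2024, 6, 15)),
]
# Arbitrary ordering chosen for the example only.
ranking = ["ankle sprain", "broken wrist", "broken leg", "torn ACL"]

print(most_recent(findings))           # broken wrist
print(earliest(findings))              # ankle sprain
print(most_severe(findings, ranking))  # broken wrist
```

The same two findings can yield different answers depending on the rule, which is why the aggregation instruction should name one rule explicitly.
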
✅ Expected Format

Clarify the format of the output (e.g., MM/YYYY, whole number, free text). This improves consistency and downstream usability.

Weak Guidance for Expected Format:

tennis_injury_type: return the type of tennis injury

Strong Guidance for Expected Format:

tennis_injury_type: Return the injury location, followed by a colon and then the injury, e.g., "wrist: sprain", "wrist: break".
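
A precisely stated format can also be verified downstream. As a rough sketch, the "location: injury" format could be checked with a regular expression. The pattern and function name below are assumptions made for this example, not something Brim provides.

```python
import re

# One or more words, a colon and a single space, then one or more words.
LOCATION_INJURY = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*: [A-Za-z]+(?: [A-Za-z]+)*")


def is_expected_format(value: str) -> bool:
    """Return True if value looks like 'location: injury'."""
    return LOCATION_INJURY.fullmatch(value) is not None


print(is_expected_format("wrist: sprain"))   # True
print(is_expected_format("sprained wrist"))  # False
```

A check like this is only possible when the instruction pins the format down. "Return the type of tennis injury" gives nothing to validate against.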

✅ Temporal Context

Define the relevant timeframe. Is the variable referring to:

  • The current status?
  • The initial diagnosis?
  • A specific encounter?

If the answer depends on time, say so explicitly.

Weak Guidance for Temporal Context:

injured_this_year: return whether the patient was injured this year.

Strong Guidance for Temporal Context:

injured_this_year: return whether the patient was injured this year, defined as after 1/1/2025. Injuries before this date should be ignored.
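
An explicit cutoff turns "this year" into a rule that can be applied the same way every time. The Python sketch below is hypothetical and assumes injury dates have already been extracted. The instruction above does not say whether an injury dated exactly 1/1/2025 counts; this sketch assumes it does.

```python
from datetime import date

# Cutoff from the example instruction (January 1, 2025).
CUTOFF = date(2025, 1, 1)


def injured_this_year(injury_dates: list[date]) -> bool:
    # Assumption: an injury on the cutoff date itself counts.
    # Injuries before the cutoff are ignored.
    return any(d >= CUTOFF for d in injury_dates)


print(injured_this_year([date(2024, 11, 3)]))                    # False
print(injured_this_year([date(2024, 11, 3), date(2025, 2, 9)]))  # True
```

Without a stated date, the answer would depend on when the chart happens to be reviewed.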

3) References (Only if Applicable)

✅ Variables Connected

If this variable references others (e.g., using "Date of Birth" to calculate age), make sure those are listed in the variables field.

✅ References in Prompt

Reference those connected variables directly in your variable instructions so Brim can use them correctly.

Example: “Use the Date of Birth variable to calculate the patient’s age.”
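
To make the dependency concrete, the age calculation that such an instruction asks for might look like the following. This is a plain Python sketch with a hypothetical function name. It assumes the Date of Birth value is available as a date and that a reference date is supplied, and it is not a description of how Brim computes age.

```python
from datetime import date


def age_on(date_of_birth: date, as_of: date) -> int:
    """Age in whole years on the reference date."""
    years = as_of.year - date_of_birth.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


print(age_on(date(1980, 6, 15), date(2025, 6, 14)))  # 44
print(age_on(date(1980, 6, 15), date(2025, 6, 15)))  # 45
```

The result depends entirely on the referenced variable, so the instruction should name that variable and say which reference date to use.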


4) Completeness

✅ Consistency

The variable name should match its data type, scope, and intent. For instance, don’t name something has_cancer if it returns a free-text tumor description.

Weak Consistency:

age: return whether the patient is older than 45 years old.

Strong Consistency:

older_than_45: return whether the patient is older than 45 years old.

✅ Examples

Include at least one example or anti-example. Show a snippet of clinical text and what value should be abstracted.

“If the note says ‘The patient was diagnosed with prostate cancer in 2018,’ return ‘01/01/2018’.”

No Examples:

tennis_injury_type: return the type of injury sustained from tennis.

Includes Examples:

tennis_injury_type: return the type of injury sustained from tennis.

Examples:

  • "Patient had a broken wrist from playing tennis". Expected value: "Wrist: broken"
  • "Patient broke their leg." Expected value: "None", because no tennis is mentioned.

✅ Optimization

Optimize your variable using Brim's variable optimizer. The optimizer also incorporates any examples with labels you've already reviewed. For more information, see Brim's documentation on optimizing variables.


By following these best practices, you’ll create variables that are easier to maintain, easier to scale, and more likely to produce accurate results.
