Hard Boundaries: Structuring MBSS Reports Without Clinical Overreach
An acute care SLP develops a workflow with explicit constraints after an LLM repeatedly generates unsolicited diet-level recommendations in instrumental swallowing evaluation reports.
An SLP performing 6-8 modified barium swallow studies per week uses AI to structure MBSS reports, discovering the model defaults to generating diet recommendations that constitute clinical decisions.
Structuring observational findings from instrumental swallowing evaluations into a standardized report format
Consistent, well-organized MBSS reports with clear separation between observational findings and clinical recommendations
Clinical Context
A speech-language pathologist in the acute care unit of a 450-bed urban medical center performed an average of six to eight modified barium swallow studies (MBSS) per week. Each study generated a detailed report documenting bolus trials, penetration-aspiration events, pharyngeal residue patterns, and physiological observations. These reports were read by referring physicians, hospitalists, nursing staff, and dietitians, requiring a format that was both clinically precise and accessible.
The clinician’s MBSS reports followed a standardized template but required substantial narrative writing for the findings and interpretation sections. Each report took 25-35 minutes to draft, typically completed between patients or after her shift. With concurrent bedside evaluations, treatment sessions, and team rounds, documentation consistently extended her workday.
She began experimenting with an LLM to convert her shorthand procedural notes into structured report drafts, aiming to reduce per-report writing time without compromising the clinical detail that referring teams depended on.
The Challenge
During her first week of using the AI tool, the clinician identified a significant problem. When provided with findings such as “thin liquid, penetration to level of vocal folds, no clearance on cue,” the model consistently generated diet-level recommendations in its output. Phrases such as “nectar-thick liquids are recommended” and “the patient should be advanced to an IDDSI Level 4 diet” appeared in drafts despite no such instruction in the prompt.
Diet-level recommendations are clinical decisions that require integration of the instrumental findings with the patient’s medical status, respiratory function, cognitive profile, and goals of care. An AI tool generating these recommendations from swallow-study data alone posed a patient safety risk if a draft were inadvertently signed without thorough review. The model was performing clinical reasoning it was not qualified to perform.
Compounding the issue, the recommendations the model generated were often plausible. They were not random; they reflected common clinical correlations between Penetration-Aspiration Scale (PAS) scores and diet modifications. A clinician reviewing the output quickly could easily miss that the recommendation had been generated by the model rather than written by a colleague or extracted from a prior report. The plausibility of the error made it harder to detect.
AI-Assisted Approach
The clinician developed a constrained workflow over several iterations.
Step 1: Structured input. She created a shorthand notation system for her procedural notes (bolus type, volume, PAS score, residue location and severity, and observed compensatory strategies). She entered these as a bulleted list.
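A minimal sketch of what such structured input might look like if captured in code, with hypothetical field names standing in for the clinician's actual notation; the trial values are drawn from examples elsewhere in this case study:

```python
# Hypothetical structured representation of MBSS bolus trials.
# Field names are illustrative, not the clinician's actual shorthand system.
trials = [
    {
        "consistency": "thin liquid",
        "volume": "5 mL via spoon",
        "pas": 5,  # Penetration-Aspiration Scale score
        "residue": "bilateral valleculae, mild",
        "strategies": "no clearance on cue",
    },
    {
        "consistency": "puree",
        "volume": "5 mL via spoon",
        "pas": 1,
        "residue": "trace, left pyriform sinus, cleared with dry swallow",
        "strategies": "none",
    },
]

def to_bullets(trials):
    """Render the trials as the bulleted list pasted into the prompt."""
    lines = []
    for t in trials:
        lines.append(
            f"- {t['consistency']} ({t['volume']}): PAS {t['pas']}; "
            f"residue: {t['residue']}; strategies: {t['strategies']}"
        )
    return "\n".join(lines)

print(to_bullets(trials))
```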
Step 2: Bounded prompt. The revised prompt included explicit restrictions: “Draft the findings section of an MBSS report based on the following observations. Describe what was observed for each bolus trial. Do NOT include diet recommendations, clinical impressions, or management suggestions. Report only what was seen during the study. Use Penetration-Aspiration Scale scores as provided. Do not interpret them as safe or unsafe.”
Step 3: Separation of sections. She prompted the AI to produce only the findings section, then wrote the clinical impressions and recommendations herself. This hard boundary ensured that all clinical judgment remained in her own writing.
Step 4: Verification pass. Before finalizing each report, she searched the AI-generated text for directive language (“recommend,” “should,” “appropriate,” “suggest”) as a quality check.
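A minimal sketch of this verification pass, assuming the draft is available as plain text; the function name and the sentence-splitting heuristic are illustrative, not part of the clinician's actual checklist:

```python
import re

# Directive terms taken from the clinician's quality check.
DIRECTIVE_TERMS = ["recommend", "should", "appropriate", "suggest"]

def flag_directive_language(draft: str) -> list[str]:
    """Return sentences from the AI draft that contain directive language."""
    flagged = []
    sentences = re.split(r"(?<=[.!?])\s+", draft)
    for sentence in sentences:
        for term in DIRECTIVE_TERMS:
            # Match the stem plus any suffix, e.g. "recommend", "recommended".
            if re.search(rf"\b{term}\w*\b", sentence, flags=re.IGNORECASE):
                flagged.append(sentence.strip())
                break
    return flagged

draft = (
    "Penetration to the level of the vocal folds was observed. "
    "Nectar-thick liquids are recommended."
)
for hit in flag_directive_language(draft):
    print("REVIEW:", hit)
```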
What the AI Produced
Thin Liquid (5 mL via spoon): Bolus was propelled posteriorly with mildly delayed pharyngeal swallow initiation (approximately 1.5 seconds after the bolus head reached the pyriform sinuses). Penetration to the level of the vocal folds was observed without subsequent clearance. Penetration-Aspiration Scale score: 5. Pharyngeal residue was noted in the bilateral valleculae, graded as mild.
Puree (5 mL via spoon): Oral transit was functional with adequate lingual propulsion. Pharyngeal swallow was initiated within normal limits. No penetration or aspiration was observed. Penetration-Aspiration Scale score: 1. Trace residue in the left pyriform sinus cleared with a subsequent dry swallow.
Thin Liquid (sequential sip from cup): Aspiration of material below the level of the true vocal folds was observed during the swallow. No spontaneous cough response was elicited. Penetration-Aspiration Scale score: 8. Moderate residue in bilateral pyriform sinuses persisted after two consecutive swallows.
Clinical Review & Modifications
The constrained prompt significantly reduced but did not eliminate clinical overreach. In approximately one out of every five reports, the clinician found residual evaluative language. Common examples included “the patient demonstrated difficulty with” (implying a deficit rather than describing an observation) and “thin liquids posed a challenge” (attributing causality). She systematically replaced these with observational equivalents: “aspiration was observed on thin liquid trials” and “PAS scores of 5-8 were recorded across thin liquid boluses.”
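The same keyword pass can be extended to catch evaluative phrasing of this kind. The sketch below is illustrative only; the final pattern ("had trouble with") is a hypothetical addition, not a phrase reported in the source reports:

```python
import re

# Evaluative phrases the clinician found recurring in drafts, plus one
# hypothetical addition; the list would grow as new patterns are noticed.
EVALUATIVE_PATTERNS = [
    r"demonstrated difficulty with",
    r"posed a challenge",
    r"had trouble with",  # hypothetical example pattern
]

def flag_evaluative_language(draft: str) -> list[str]:
    """Return the evaluative patterns that appear anywhere in the draft."""
    return [p for p in EVALUATIVE_PATTERNS
            if re.search(p, draft, flags=re.IGNORECASE)]
```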
The clinician also found that the AI occasionally reordered bolus trials in a way that implied a progression narrative, presenting thicker consistencies first and thinner last, even when the study protocol had followed a different sequence. She added a line to her prompt: “Present trials in the order listed below. Do not reorder.”
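As an illustration of how that ordering could be double-checked programmatically, the sketch below assumes each trial's consistency-and-volume label (e.g., "Thin Liquid (5 mL via spoon)") appears verbatim in the draft; the function name is hypothetical:

```python
def trials_in_order(draft: str, trial_labels: list[str]) -> bool:
    """Check that trial labels appear in the draft in the order they were listed.

    Returns False if any label is missing or appears out of sequence.
    """
    last_index = -1
    for label in trial_labels:
        index = draft.find(label)
        if index == -1 or index < last_index:
            return False
        last_index = index
    return True
```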
The most critical modification remained the complete separation of findings from recommendations. The clinician wrote all clinical impressions, diet recommendations, and compensatory strategy suggestions herself, drawing on the full clinical picture including chart review, bedside evaluation findings, respiratory status, and interdisciplinary input.
Outcome
After refining the workflow over three weeks, the clinician reduced her per-report writing time from an average of 30 minutes to 18 minutes. The findings sections were consistently organized and used standardized language; two referring physicians gave positive feedback, noting that the reports were easier to scan during rounds. No AI-generated recommendation language appeared in any signed report after the constrained workflow was implemented.
The clinician shared her prompt template and verification checklist with two colleagues in the department, both of whom adopted modified versions for their own MBSS documentation.
She also developed a brief onboarding document for new staff explaining the workflow’s constraints: which sections were AI-assisted, which were clinician-authored, and why the boundary existed. This transparency was important for department accountability and for ensuring that future users of the template understood the clinical reasoning behind its design rather than treating it as a black-box documentation shortcut.
Reflection
The clinician observed that the AI’s tendency to recommend diet levels likely reflected its training data, in which MBSS findings and diet recommendations frequently co-occur. This made the model’s behavior predictable but also insidious: the recommendations it generated were often plausible, making them easy to overlook during cursory review. She emphasized that the verification pass for directive language was not optional but essential.
She also noted that the AI performed best with the most structured input. Free-text clinical notes produced inconsistent output; the shorthand notation system she developed gave her the most reliable results. She considered this shorthand system a clinical tool in its own right, since it forced her to standardize her own observational language during the study, which improved her procedural note-taking independent of the AI.
Looking back, she identified one risk she had not anticipated: the potential for anchoring bias. Because the AI draft was well-organized and fluently written, she found herself initially inclined to accept its framing and make only minor edits. She countered this by reviewing her own raw notes before reading the AI draft, forming her clinical interpretation first and then using the draft as a formatting tool rather than a reasoning aid. She recommended this sequence (interpret first, format second) to the colleagues who adopted her workflow.
Key Takeaways
- LLMs trained on clinical text will default to generating clinical recommendations alongside findings; explicit constraints are necessary but not sufficient, and a verification step is still required.
- Separating AI-assisted sections (observational findings) from clinician-authored sections (clinical impressions, recommendations) creates a clear boundary that reduces risk.
- A keyword search for directive language (“recommend,” “should,” “suggest,” “appropriate”) is a practical quality check for clinical documentation.
- Structured shorthand input produces more consistent and reliable AI output than free-text clinical notes.
- Clinicians should form their clinical interpretation before reading the AI draft to avoid anchoring bias from well-written but potentially misdirected output.