De-identification
Removing all identifying information from clinical data before it enters any AI tool. This is not optional. Under HIPAA, 18 categories of identifiers must be stripped: names, dates, locations, medical record numbers, and more. If you paste a client's eval into ChatGPT with their name attached, you have likely created a reportable breach.
The process of removing or transforming personally identifiable information (PII) and protected health information (PHI) from datasets, typically following the HIPAA Safe Harbor method (removal of 18 identifier types) or the Expert Determination method. Automated de-identification tools use named entity recognition (NER) models to detect and redact identifying elements.
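As a rough illustration of that NER approach, here is a minimal Python sketch using spaCy's general-purpose English model. The label-to-placeholder mapping and the redact helper are our own illustration, not a vetted clinical pipeline; purpose-built PHI models and human review are still needed in practice.

```python
# Minimal NER-based redaction sketch (assumes: pip install spacy, then
# python -m spacy download en_core_web_sm). A general-purpose model will
# miss many clinical identifiers; this only shows the mechanics.
import spacy

nlp = spacy.load("en_core_web_sm")

# Which entity labels get redacted, and with what placeholder, is our choice here.
REDACT_LABELS = {
    "PERSON": "[CLIENT]",
    "DATE": "[DATE]",
    "GPE": "[LOCATION]",
    "ORG": "[ORGANIZATION]",
    "FAC": "[FACILITY]",
}

def redact(text: str) -> str:
    doc = nlp(text)
    out = text
    # Replace entities from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        placeholder = REDACT_LABELS.get(ent.label_)
        if placeholder:
            out = out[:ent.start_char] + placeholder + out[ent.end_char:]
    return out

print(redact("Maria was evaluated on 3/14/2024 at Lincoln Elementary."))
# Names, dates, and school names should come back as placeholders
# (exact behavior depends on the model).
```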
Why SLPs Need to Know This
Every major AI tool (ChatGPT, Claude, Gemini) processes your input on external servers. Unless you have a signed Business Associate Agreement (BAA) and a HIPAA-compliant enterprise agreement, any client data you enter is potentially exposed. De-identification is the non-negotiable first step before clinical data touches any AI tool.
The 18 HIPAA Identifiers
Strip all of these before entering data into any AI system (a pattern-based sketch follows the list):
- Names
- Geographic subdivisions smaller than a state (street address, city, county, ZIP code)
- All dates (except year) related to an individual, and ages over 89
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers, including finger and voice prints
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
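Pattern matching can catch the structured items on this list (phone numbers, emails, SSNs, dates, record numbers). The sketch below is illustrative only: the regex patterns, placeholder names, and sample note are assumptions, and real-world formats vary enough that patterns alone are never sufficient.

```python
import re

# Illustrative patterns only; real phone, date, and record-number formats vary
# widely, so pattern matching must be paired with NER and a human read-through.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\bMRN[:#\s]*\d+\b", re.IGNORECASE), "[MRN]"),
]

def strip_identifiers(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

note = "Contact parent at 555-123-4567 or jdoe@example.com; MRN 884321; seen 3/14/2024."
print(strip_identifiers(note))
# -> Contact parent at [PHONE] or [EMAIL]; [MRN]; seen [DATE].
```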
Practical Guide
- Replace, don’t just delete. Use placeholders like [CLIENT] or [DOB] so the text remains usable (see the sketch after this list)
- Watch for indirect identifiers. A rare diagnosis plus an age plus a school district can identify a child even without a name
- Automate where possible. Manual de-identification is error-prone under time pressure
- Check your output too. If the model’s response echoes identifying information you provided, that output carries the same risk
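One simple way to follow the first and last points is a local placeholder map: swap identifiers for placeholders before anything leaves your machine, keep the mapping locally, and screen the model's response against it. The helper names, mapping, and sample text below are illustrative assumptions, not a prescribed workflow.

```python
# Sketch of a local placeholder map. Identifiers are swapped out before the
# text leaves your machine; the mapping itself stays local, so it can restore
# names in your own draft and screen the model's output for leaks.
replacements = {
    "Maria Lopez": "[CLIENT]",
    "2017-03-14": "[DOB]",
    "Lincoln Elementary": "[SCHOOL]",
}

def apply_placeholders(text: str, mapping: dict[str, str]) -> str:
    for original, placeholder in mapping.items():
        text = text.replace(original, placeholder)
    return text

def leaked_identifiers(model_output: str, mapping: dict[str, str]) -> list[str]:
    # "Check your output too": flag any original identifier that shows up
    # in what the model sends back, before it goes into a report or note.
    return [original for original in mapping if original in model_output]

prompt = apply_placeholders(
    "Summarize progress for Maria Lopez (DOB 2017-03-14) at Lincoln Elementary.",
    replacements,
)
# Send `prompt` to the AI tool, then screen its response before reuse:
# assert not leaked_identifiers(response, replacements)
```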
Related Terms
- Guardrails: safety constraints that may include automatic PHI detection
- HIPAA: the federal law governing health information privacy