When AI Writes Care Notes, Gender Still Shapes the Story
A new study published in BMC Medical Informatics and Decision Making examined how fairly a range of large language models (LLMs) summarise decades of care records, and it found striking differences between models.
The study used 617 anonymised case notes from a London local authority, creating gender-swapped versions of each record to test whether summaries changed depending on whether a patient was described as male or female. These were then processed through two of the most advanced open-source models—Meta’s Llama 3 and Google’s Gemma—alongside earlier benchmarks T5 and BART. The aim was to measure “counterfactual fairness”: would the same patient, described with a different gender, be summarised in the same way?
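To make that counterfactual set-up concrete, the sketch below shows one way gender-swapped pairs could be generated and summarised. The swap map, the helper names and the choice of BART via a Hugging Face transformers pipeline are illustrative assumptions, not the authors' actual code.

```python
from transformers import pipeline

# Naive term-swap map; real counterfactual generation needs care with
# ambiguous pronouns ("her" can map to "him" or "his"). Illustrative only.
GENDER_SWAP = {
    "he": "she", "she": "he",
    "him": "her", "his": "her",
    "her": "him", "hers": "his",
    "mr": "mrs", "mrs": "mr",
    "male": "female", "female": "male",
    "man": "woman", "woman": "man",
}

def swap_gender(text: str) -> str:
    """Swap gendered tokens for their counterparts, keeping capitalisation."""
    out = []
    for token in text.split():
        core = token.strip(".,;:!?")
        repl = GENDER_SWAP.get(core.lower())
        if repl is not None:
            if core[0].isupper():
                repl = repl.capitalize()
            token = token.replace(core, repl)
        out.append(token)
    return " ".join(out)

# BART stands in here simply because it was one of the study's baseline models.
summarise = pipeline("summarization", model="facebook/bart-large-cnn")

def counterfactual_pair(case_note: str) -> tuple[str, str]:
    """Summarise the original record and its gender-swapped counterfactual."""
    original = summarise(case_note, max_length=120, min_length=30)[0]["summary_text"]
    swapped = summarise(swap_gender(case_note), max_length=120, min_length=30)[0]["summary_text"]
    return original, swapped
```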
The answer depended on the model. Llama 3 generated summaries with no significant differences across gender, suggesting a higher level of fairness. But Gemma displayed measurable bias. Male records were more often described with negative sentiment, and the model used more direct language about men’s health and disability—terms like “disabled” or “unable”—while women’s health issues were more likely to be softened or downplayed. In some cases, women’s needs were omitted altogether, while men’s were emphasised as “complex medical histories.”
Why does this matter? In long-term care, documentation is the foundation for decisions about what services and support are provided. If women’s health issues are consistently underemphasised in records, the risk is that they receive fewer resources or delayed interventions. Conversely, if men are portrayed in more negative or urgent terms, they may be prioritised for faster or more intensive care. The authors describe this as a risk of “allocational harm”—bias that can influence how services are delivered, even if unintentionally.
The findings echo wider concerns about how generative AI can amplify inequities embedded in training data. While early transformer models like BERT and GPT-2 have long been known to reflect gender stereotypes, this study highlights that not all modern LLMs behave the same way. Some, like Llama 3, may mitigate bias more effectively, while others, like Gemma, still show disparities in tone and emphasis. That variation underscores the importance of model-specific evaluation before deployment in healthcare settings.
Importantly, the study also provides a practical framework for assessing bias in AI-generated summaries, combining sentiment analysis, thematic word counts and linguistic comparisons. This reproducible methodology could help regulators, developers and healthcare providers test AI systems for fairness across gender—and, in future, across other protected characteristics such as ethnicity or disability.
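Read as a recipe, that kind of framework can be approximated in a few lines: score each summary for negative sentiment, count a small lexicon of health and disability terms, and test whether the distributions differ between the male and female versions of the records. The lexicon, the default sentiment model and the Mann-Whitney U test below are assumptions chosen for illustration; the paper's exact metrics and thresholds may differ.

```python
from transformers import pipeline
from scipy.stats import mannwhitneyu

# Default sentiment-analysis pipeline; the study's own sentiment tooling may differ.
sentiment = pipeline("sentiment-analysis")

# Small illustrative lexicon of health/disability terms noted in the findings.
HEALTH_TERMS = {"disabled", "unable", "complex", "frail", "deteriorating"}

def negative_score(summary: str) -> float:
    """Return the probability mass the classifier assigns to the NEGATIVE label."""
    result = sentiment(summary)[0]
    return result["score"] if result["label"] == "NEGATIVE" else 1.0 - result["score"]

def term_count(summary: str) -> int:
    """Count occurrences of the thematic health/disability lexicon."""
    words = [w.strip(".,;:") for w in summary.lower().split()]
    return sum(words.count(term) for term in HEALTH_TERMS)

def compare(male_summaries: list[str], female_summaries: list[str]) -> dict:
    """Test whether sentiment and term usage differ between the two groups."""
    neg_m = [negative_score(s) for s in male_summaries]
    neg_f = [negative_score(s) for s in female_summaries]
    terms_m = [term_count(s) for s in male_summaries]
    terms_f = [term_count(s) for s in female_summaries]
    return {
        "sentiment_p": mannwhitneyu(neg_m, neg_f).pvalue,
        "term_count_p": mannwhitneyu(terms_m, terms_f).pvalue,
    }
```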
The results don’t suggest abandoning AI in health documentation altogether. Used carefully, LLMs can ease administrative burden, reduce cognitive load on practitioners, and improve access to key information in sprawling case records. But as the study shows, without rigorous evaluation, these same tools risk embedding subtle inequities into the very notes that shape care decisions.
In the words of the researchers, accuracy alone is not enough. If generative AI is to play a role in healthcare, fairness must be tested, measured, and designed into the system—because the words that shape care records can ultimately shape outcomes.
Renae Beardmore
Managing Director, Evohealth