By Lavanya Sankaran
Every day, technology becomes more and more like its creators: us! In medicine, deep-learning artificial intelligence (AI) algorithms have already matched expert-level human performance on imaging tasks such as early cancer detection and eye disease diagnosis.
Medical advances like these celebrate the fact that technology can indeed mimic useful human tendencies, but what about the algorithms that pick up our more unsavory tendencies, such as racial bias? Misapplied medical algorithms have historically made technology a powerful tool for excluding racial minorities, especially African Americans, from essential healthcare services.
Now, a team of researchers from Stanford, Harvard, the University of Chicago, and the University of California, Berkeley, is proposing a way to develop medical algorithms that could remedy, rather than exacerbate, existing inequality. The key: stop training algorithms to match human expert performance.
The researchers focused on disparities in the current diagnosis and treatment of knee osteoarthritis, an illness that causes chronic pain. Radiologists typically assess the severity of that pain by reviewing an x-ray of the knee and scoring it on the Kellgren-Lawrence grade (KLG), which rates severity based on the presence of particular radiographic features. But data from a National Institutes of Health study show that radiologists using this method systematically score Black patients' pain as far less severe than what the patients themselves report experiencing. Put differently, Black patients self-report more pain than other patients whose x-rays received similar scores.
This leaves one glaring question: what are the radiologists missing?
Instead of approaching the question directly, the researchers hypothesized that algorithms designed to reveal what radiologists don't see, rather than to mimic their knowledge, could make a real difference in diagnosis. To test this possibility, the team trained a deep-learning algorithm to predict a patient's pain level from their knee x-ray. After processing thousands of x-rays, the algorithm discovered patterns of pixels that correlate with pain. When shown a brand-new x-ray, the software uses these patterns to predict the pain a patient would self-report.
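To make that training setup concrete, here is a minimal sketch of what such a model could look like, assuming a standard PyTorch/torchvision pipeline. The data loading, the backbone choice, and the 0-100 pain scale are assumptions for illustration, not the study's actual code.

```python
# A minimal, hypothetical sketch of the approach described above: fine-tune a
# standard image model to regress a patient's self-reported pain score
# directly from a knee x-ray. Dataset wiring and field names are invented.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

def build_pain_model() -> nn.Module:
    # Start from a generic ImageNet backbone and replace the classifier head
    # with a single regression output: the predicted self-reported pain score.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = nn.Linear(backbone.fc.in_features, 1)
    return backbone

def train(model: nn.Module, loader: DataLoader, epochs: int = 10, lr: float = 1e-4):
    # Supervise with the patient's own pain report, not a radiologist's grade.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for xray, reported_pain in loader:   # xray: (B, 3, H, W); pain: (B, 1) float
            optimizer.zero_grad()
            predicted_pain = model(xray)
            loss = loss_fn(predicted_pain, reported_pain)
            loss.backward()
            optimizer.step()
    return model
```

The only design choice that matters for the article's argument is the label: the target the model is optimized toward is the patient's reported pain, not the KLG score.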
The results revealed that the researchers' new algorithm was much more accurate than the KLG at predicting self-reported pain levels for both White and Black patients, but especially for Black patients. Notably, the algorithm reduced the racial disparity by nearly 50% at each pain level. The study concluded that the algorithm's predictions correlated more closely with patients' pain than the scores radiologists assigned to knee x-rays. This suggests that the algorithm was learning to see things over and above what the radiologists were seeing: things that were common causes of pain in Black patients.
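One way to read that "nearly 50%" figure is as a shrinking gap in unexplained pain: how much more pain Black patients report than other patients who received the same severity score. The pandas sketch below is a simplified illustration with hypothetical column names, not the study's actual analysis.

```python
# Illustrative (not the study's) measure of the racial gap in "unexplained"
# pain: within each severity bin, how much more pain do Black patients report
# than others with the same score? Column names are hypothetical.
import pandas as pd

def pain_gap(df: pd.DataFrame, score_col: str) -> float:
    gaps = []
    for _, bin_df in df.groupby(score_col):
        black = bin_df.loc[bin_df["race"] == "Black", "reported_pain"].mean()
        other = bin_df.loc[bin_df["race"] != "Black", "reported_pain"].mean()
        if pd.notna(black) and pd.notna(other):
            gaps.append(black - other)
    return sum(gaps) / len(gaps) if gaps else float("nan")

# If the algorithm explains more of Black patients' pain, the gap measured
# against its (binned) predictions should shrink relative to the KLG gap:
#   pain_gap(df, "klg_grade")   vs.   pain_gap(df, "predicted_pain_bin")
```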
What’s the secret behind this new algorithm? Ziad Obermeyer, a lead researcher on the study and a professor at the University of California, Berkeley’s School of Public Health, believes that “the algorithm explicitly removes emotional, cultural, and psychological factors and creates a knee-based measure of disease severity: that’s its strength.” By grounding the patient’s report of pain in the structural realities of the knee, the algorithm sidesteps human bias.
Let’s return more directly to the question of what radiologists are missing. Radiologists’ difficulty in assessing knee pain in Black patients can be traced back through history. The KLG methodology originated in a small study conducted in Lancashire, England, in the 1950s. Tellingly, the original manuscripts did not even mention the sex or race of the cohort, presumably because it was 100% White and male. Some experts therefore argue that the radiographic markers the KLG directs clinicians to look for may not capture all possible sources of pain in more diverse populations. That is to say, there may be patterned radiographic indicators of pain that are common in Black patients but simply aren’t part of the KLG rubric.
This study not only shows what happens when AI is trained on patient feedback instead of expert opinion, but also why medical algorithms have previously been seen as a cause of bias rather than a cure. Earlier research by Obermeyer, published in 2019, showed that an institutional algorithm guiding care for millions of US patients gave priority to patients identifying as White over those identifying as Black when it came to assistance with complex illnesses such as diabetes. But why did this happen?
Obermeyer explains: “These algorithms are trained to predict health costs, as a proxy for health needs. That’s not an unreasonable choice: costs do increase with health needs. But the problem is that, because of barriers to access and discrimination, Black patients generate fewer costs than White patients, on average. The algorithm sees that, and de-prioritizes Black patients relative to their needs, because they do cost less.”
Therefore, the bias lies in the target we trained the algorithm to predict: costs, not health. We often blame algorithms when we should really be blaming ourselves for pointing them in the wrong direction.
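A tiny synthetic simulation can make this point tangible. Every number below is invented; the only thing the sketch demonstrates is that when the training target is cost, a group that incurs lower costs at the same level of need ends up with lower priority scores.

```python
# Toy illustration of the label-choice problem Obermeyer describes: train a
# model to predict cost (as a proxy for need) and a lower-cost group is
# systematically de-prioritized at equal need. All data here are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
need = rng.normal(size=n)                       # latent health need (what we actually care about)
group_b = rng.random(n) < 0.5                   # disadvantaged group indicator
# Access barriers: at the same level of need, group B uses, and costs, less care.
prior_cost = need - 0.5 * group_b + rng.normal(scale=0.3, size=n)
future_cost = need - 0.5 * group_b + rng.normal(scale=0.3, size=n)

X = prior_cost.reshape(-1, 1)                   # utilization-style features
model = LinearRegression().fit(X, future_cost)  # target: cost, used as a proxy for need
risk_score = model.predict(X)

# Among patients with comparably high need, group B receives lower scores,
# and therefore less priority, purely because it costs less.
high_need = need > 1.0
print("group B mean score:", risk_score[group_b & high_need].mean())
print("others  mean score:", risk_score[~group_b & high_need].mean())
```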
However, this seemingly magical algorithm comes with a pseudo-magical catch: the researchers built it using a type of AI that is exceedingly hard to reverse-engineer. That is to say, the researchers were able to create an efficient, accurately functioning algorithm, but they don’t know exactly how it works. They used artificial neural networks, a technology that makes many AI applications more practical. Ironically, however, these networks come with a major practical downside: they are so tricky to reverse-engineer that many experts call them “black boxes.”
Despite the magnitude of the task, experts are eager to understand what this knee algorithm knows. One such expert is Judy Gichoya, a radiologist and assistant professor at Emory University, who plans to push the algorithm’s performance to its limits. By comparing radiologists’ detailed x-ray notes with the pain-prediction algorithm’s results, Gichoya’s team hopes to uncover clues about the overlooked patterns it detects.
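One common way to start prying open a black box like this, complementary to comparing predictions against radiologists' notes, is saliency mapping: asking which pixels most influence the predicted pain score. The gradient-based sketch below is purely illustrative and is not a description of Gichoya's actual methods; `model` refers to the hypothetical pain-prediction network sketched earlier.

```python
# Illustrative gradient saliency for a black-box image regressor: which pixels
# most change the predicted pain score? Not the team's actual workflow.
import torch

def saliency_map(model: torch.nn.Module, xray: torch.Tensor) -> torch.Tensor:
    # xray: (1, 3, H, W) tensor; returns an (H, W) map of per-pixel influence.
    model.eval()
    xray = xray.clone().requires_grad_(True)
    predicted_pain = model(xray)
    predicted_pain.sum().backward()              # d(prediction) / d(pixels)
    return xray.grad.abs().max(dim=1).values.squeeze(0)

# High-saliency regions can then be held up against radiologists' notes to see
# whether the model is reacting to features the KLG rubric ignores.
```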
Her team’s potential success is ultimately a test of human capability. She is optimistic that the answers are within grasp; it may merely be a matter of changing our perspective and stepping out of our comfort zone. In that sense, technology has impressively mimicked a fundamental aspect of the human experience: breaking boundaries. So, although this research shows that humans must take a hard look at the alarming racial disparities that persist at institutional levels, it also imparts hope that change is possible, with AI serving as a fresh set of eyes in clinical settings.
Gichoya’s collaborative research also sheds light on the future of radiology and medicine as a profession. Notice that all the algorithm can do is link pain to features of the x-ray. It can’t tell us anything else: why that part of the knee hurts, what to do about it, and so on. These are questions that human researchers and clinicians are trained to answer, because answering them requires integrating data from the lab bench, other studies, treatment options, and personal experience. So although algorithms will be a powerful new tool for scientific discovery and medicine, they certainly won’t replace all the things that human clinicians and researchers do.