By Niki Ebrahimnejad
AI. AI. AI. This is not your average buzzword: artificial intelligence has drawn enormous attention given its emergence and success in the tech industry. One medical application of AI that has gained praise is image reconstruction. Building images from complex data such as magnetic resonance imaging (MRI) measurements is a tricky task that AI-based algorithms can greatly ease. A lot of valuable work has gone into developing what are known as deep learning (DL) algorithms: programs that, given swaths of data samples, can learn and recognize patterns on par with human assessment. That is, assuming they are taught well in the first place.
In a 2022 study, UCSF researcher Efrat Shimron and her team questioned how these MRI algorithms are currently trained, especially since AI systems in other fields have produced misleading results when built on poorly constructed data cohorts. The researchers evaluated several well-established algorithms against gold-standard MRI reconstructions at various points in the production pipeline and found that the algorithms were producing errors that went unreported. These shortcomings are part of what the team has termed “data crimes.”
Before diving into Shimron’s work, it is worth reviewing what MRI data actually look like. MRI measurements are not stored in a simple spreadsheet: the scanner does not capture a picture directly but instead samples a mathematical space called k-space, which encodes the spatial-frequency content of the image, and a reconstruction algorithm turns those samples into a picture. Fully sampling k-space makes scans slow, which is why modern pipelines acquire less data and lean on reconstruction algorithms to fill in the rest. The real issue lies in the next step: training. Just like with an Olympic athlete, the key to highly finessed and precise performance is practice, practice, practice. DL algorithms need a lot of k-space data to work with, and unfortunately there are few open-source k-space databases available. Now, this may raise a question: why bother with open source? Open-source, or publicly available, data have been of tremendous importance to the advancement of machine learning because institutional barriers to access would otherwise make it difficult for developers and researchers to obtain the training data they need. Shimron and her team looked closer at what little has been open sourced and hypothesized that these data were “spoiled.”
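To make k-space a bit more concrete, the sketch below (a minimal illustration in Python with NumPy, not code from the study; the toy square “image” and all variable names are invented here) shows that an MR image and its k-space are related by a two-dimensional Fourier transform, and that skipping k-space samples to save scan time degrades the picture unless an algorithm fills in the missing information.

```python
# A minimal sketch (illustrative only; the toy "image" and names are invented)
# of how k-space relates to an MR image via the 2D Fourier transform.
import numpy as np

# Toy image: a bright square standing in for anatomy.
image = np.zeros((128, 128))
image[48:80, 48:80] = 1.0

# The scanner samples k-space, i.e., the image's spatial-frequency content.
kspace = np.fft.fftshift(np.fft.fft2(image))

# Reconstruction from fully sampled k-space is just an inverse Fourier transform.
recon = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))
print(np.allclose(recon, image))  # True: full k-space recovers the image

# Fully sampling k-space is what makes scans slow. Skipping samples speeds up
# acquisition, but the naive reconstruction degrades; this is the gap that
# learned reconstruction algorithms are trained to close.
undersampled = kspace.copy()
undersampled[:, ::2] = 0                        # crudely drop every other column
recon_fast = np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))
print(np.abs(recon_fast - image).max() > 0.1)   # True: visible degradation
```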
In fact, there are two classes of data crimes believed to be pervasive and underreported: training on zero-padded k-space data and training on JPEG-compressed data. In the first case, the commercial scanner pipeline uses a method called zero padding to produce a convenient, interpolated image that is later used to synthesize a copy of k-space. The flaw here is that, because of this processing, the synthesized k-space does not mirror the k-space the scanner originally measured. In the second case, JPEG compression is convenient for data storage, but it is lossy: it throws away part of the image information and, potentially, key relationships among data points in the process. To demonstrate these data crimes, the team trained well-established algorithms on raw k-space MRI knee data with and without these processing pipelines applied and compared the resulting image reconstructions. Shimron and her team were ultimately proven right: the processed data yielded poorer and even misleading performance from all three algorithms compared with raw k-space data as the training set.
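As a rough illustration of why JPEG storage matters, the snippet below (a hypothetical example that assumes the Pillow imaging library; the random array is only a stand-in for an MRI slice) shows that a save-and-reload round trip through JPEG does not return the original pixel values, so any k-space later synthesized from the reloaded image cannot match what the scanner measured.

```python
# A rough, hypothetical illustration (assumes the Pillow library; the random
# array is only a stand-in for an MRI slice) of JPEG's lossiness.
import io
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
slice_8bit = (rng.random((128, 128)) * 255).astype(np.uint8)  # toy "MRI slice"

# Save to JPEG in memory and read it back.
buffer = io.BytesIO()
Image.fromarray(slice_8bit).save(buffer, format="JPEG", quality=85)
buffer.seek(0)
reloaded = np.asarray(Image.open(buffer))

# The round trip is not exact: some pixel values, and with them relationships
# among data points, are permanently lost.
print(np.array_equal(slice_8bit, reloaded))                           # False
print(np.abs(slice_8bit.astype(int) - reloaded.astype(int)).mean())   # nonzero
```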
To simulate the first data crime, all three algorithms were trained separately on two versions of the same knee MRI dataset, identical except for whether zero padding had been applied. Each algorithm was then tested on its ability to reconstruct an image matching the original, which served as the control, or gold standard. What the team found, even across different variants of the same processing scenario, was that the processed data led the algorithms to produce images enhanced with detail that was never there to begin with. This so-called improvement is artificial, much like a photo filter adding contrast and detail not originally present, and conclusions cannot safely be drawn from it about what is actually occurring in vivo.
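The sketch below gives a simplified flavor of this first data crime (it uses invented toy data and variable names and is not the authors’ actual pipeline): k-space synthesized from a zero-padded, magnitude-only image acquires properties, such as corresponding to a purely real image, that raw scanner measurements do not have, which makes the reconstruction problem look artificially easy during training.

```python
# A simplified illustration (invented toy data, not the authors' pipeline) of the
# zero-padding data crime: synthesizing training k-space from a processed image.
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for raw scanner data: a complex-valued image (real scans carry phase)
# and the k-space the scanner would measure from it.
magnitude = rng.random((64, 64))
phase = np.exp(1j * 2 * np.pi * rng.random((64, 64)))
raw_kspace = np.fft.fft2(magnitude * phase)

# Scanner-style preprocessing: zero-pad k-space to enlarge the image grid,
# then keep only the magnitude image, as an exported scan typically would.
padded = np.pad(np.fft.fftshift(raw_kspace), 32)            # 64x64 -> 128x128
processed_image = np.abs(np.fft.ifft2(np.fft.ifftshift(padded)))

# The "data crime": treating k-space synthesized from that processed image
# as if it were raw measurement data for training.
synthetic_kspace = np.fft.fft2(processed_image)

# The synthesized data correspond to a purely real image, a symmetry the raw
# measurements never had, so algorithms trained on them face an easier problem.
print(np.allclose(np.fft.ifft2(synthetic_kspace).imag, 0))   # True
print(np.allclose(np.fft.ifft2(raw_kspace).imag, 0))         # False
```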
For the second data crime, Shimron and her team found that standard error metrics are effectively blind to JPEG compression. These metrics are meant to reveal how accurate a reconstructed image is by comparing it against the gold standard. Worryingly, JPEG compression is not factored into this assessment: existing error metrics do not account for the use of processed data and instead draw conclusions as if the data were unprocessed. This paradigm of ignoring the process and judging only the output is exactly what Shimron warns against. Even when error metrics are used to screen the output, a spoiled image can still pass as authentic when it is not.

All in all, the concerns and caution of Shimron and her team can be extended to machine learning algorithms in medicine as a whole. Because the consequences of defective or biased medical tools fall on patients, it is crucial that whatever medical innovations are made with machine learning be heavily screened and evaluated for potential flaws such as the data crimes Shimron has identified. Faulty conclusions used for important patient diagnoses will only compound over time, with devastating effects on human health and the effectiveness of medical practice. Additionally, Shimron points to the politics of publication as a force that keeps data crimes going: overly enhanced images hurt researchers who cannot replicate the same results with unprocessed data and therefore find it harder to publish, because their image reconstruction quality appears poorer than what is already out there. With data science and machine learning not yet deeply entrenched in the way research is done, it is not too late to begin a more conscientious practice of reporting data processing and its consequences on outputs.