Neural Networks and the Black Box

The excitement over deep learning and neural networks is well deserved, but the notion that artificial intelligence (AI) in the form of “deep learning” will replace the need for physician expertise is, I would suggest, a bit premature. Siddhartha Mukherjee, M.D., in “A.I. Versus M.D.” does an excellent job of illustrating the differences between man and machine when it comes to AI.

In his study, Mukherjee asked a radiology trainee how he had identified a stroke on a CT scan despite all the confusing elements on the image. The student could verbalize the rules he used to sort through the image details, but in the end:

How had he narrowed his focus to that one area? He paused as the thought pedaled forward and gathered speed in his mind. “I don’t know—it was partly subconscious,” he said, finally.

The Black Box

The black box is the gatekeeper between what we know we know and that part of our mind which operates without words. The student in Mukherjee’s example called it his subconscious. We can document the information that goes into a system and what comes out, but we don’t know how the black box processes the input to create the output. Turns out this is true for humans and for AI.

Years ago when I was at Carnegie Mellon University, Drs. John Hayes and Linda Flower were studying writers and how they composed. They developed an approach called protocol analysis in which the writers were given an assignment and asked to verbalize their thinking as they revised the documents. At the time there was much controversy over how verbalizing thinking changed the actual writing process itself, but the approach did allow researchers to track the differences in the thinking of novice and expert writers on how they approach a problem.

In one of these studies, I was a subject. I was given a one-page piece of correspondence and asked to revise it. I dutifully vocalized what I was thinking as I worked at the computer, cutting and pasting parts of the document or adding new content. At one point I looked at an overly long paragraph (a large portion of the letter, actually) and said “This is garbage” as I highlighted and deleted it. Afterwards, one of my fellow classmates involved in the study told me that my comment was frustrating because I did not verbalize why or how I came to the conclusion that this large text block was garbage. In other words, my comment didn’t give researchers a clue as to how I recognized this block of text as garbage.

As I discussed in “Novice vs Expert,” early AI concentrated on gathering data from experts, believing all “knowledge” could be programmed into an “expert system.” The topic of discussion was whether one should build an expert system from the top down (Herbert Simon) or bottom up (Hubert Dreyfus). The problem with this approach is that experts aren’t experts because of the rules they know. They are experts because they can recall a single instance of something observed perhaps 30 years ago.

AI has come to recognize that the human mind is especially suited to recognizing patterns, even a pattern of one. Mukherjee describes in his article how thousands of images of a variety of lesions were programmed into the expert system being built. He was told the deep learning process of neural networks was able to teach the computer how to differentiate between the various lesions. But Mukherjee admits researchers do not know how neural networks do this. The black box raises its gate once again.

AI Researcher Interviews

Mukherjee talked to a number of researchers in AI who made much of the testing of the machine on its ability to recognize kinds of lesions against the identification of lesions by expert dermatologists:

The system got the answer right seventy-two per cent of the time. (The actual output of the algorithm is not “yes” or “no” but a probability that a given lesion belongs to a category of interest.) Two board-certified dermatologists who were tested alongside did worse: they got the answer correct sixty-six per cent of the time.

The AI model “outperformed” the experts, but only in coming up with a PROBABILITY that a lesion would become malignant. The machine still has only a choice of yes or no, probability or otherwise. To my knowledge, no one bothered to ask the dermatologists why they might have refrained from identifying some of the lesions as ones which would probably become malignant. After all, the dermatologists weren’t limited to yes and no answers in their observation of these tumors. It cannot be assumed that the dermatologists erred simply because they did not identify a lesion as one which would probably become malignant.
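
To make the gap between a probability and a yes-or-no answer concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the scores, the true outcomes, the cutoffs, and the function names are invented and are not taken from Mukherjee's article or the research he describes. It simply shows that a probability becomes a "yes" or "no" only after someone picks a cutoff, and that the percentage "correct" shifts with that choice.

```python
def to_yes_no(probabilities, threshold):
    """Turn each probability into a yes/no call using a chosen cutoff."""
    return [p >= threshold for p in probabilities]


def accuracy(calls, truth):
    """Fraction of yes/no calls that match the known outcome."""
    matches = sum(1 for c, t in zip(calls, truth) if c == t)
    return matches / len(truth)


def sensitivity(calls, truth):
    """Fraction of truly positive cases that were called 'yes'."""
    positives = [c for c, t in zip(calls, truth) if t]
    return sum(positives) / len(positives)


# Invented example: hypothetical model scores for eight lesions
# and whether each lesion really was malignant.
scores = [0.92, 0.71, 0.55, 0.48, 0.40, 0.33, 0.20, 0.05]
truth = [True, True, False, True, False, True, False, False]

# The same scores yield different "report cards" at different cutoffs.
for cutoff in (0.3, 0.5, 0.7):
    calls = to_yes_no(scores, cutoff)
    print(cutoff, accuracy(calls, truth), sensitivity(calls, truth))
```

In this toy set, the lowest cutoff catches every malignant lesion but also flags benign ones, while the higher cutoffs miss half of the malignant cases. The scores never change; only the decision about where to draw the line does.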

Hype over this kind of result has led some AI enthusiasts to talk in terms of diagnosis, not probabilities. It’s an easy jump then to the prognostication that medicine will no longer need radiologists or dermatologists.

In a second test, the results were stated as:

In almost every test, the machine was more sensitive than doctors…. “In every test, the network outperformed expert dermatologists,” the team concluded….

Note that somehow “almost every test” has become “every test.”

As Mukherjee says, “It’s hard not to be seduced by this vision.” I would proffer that this is not really vision, but a bias driven by the very human need to name things. Naming turns probabilities into rules by giving them a place card in the scheme of things.


We as humans hunger after absolutes as we struggle to deal with the constant change of living. AI and “deep learning” have come a long way since Herbert Simon and his researchers were working to create expert systems from the top down by feeding as many rules as possible into the computer. As a culture, we don’t want gray, we want black and white, and we can get downright hateful about demanding what we need to feel comfortable. In The Checklist Manifesto, Atul Gawande wrote about how checklists have made the practice of medical procedures safer when it comes to preventive techniques such as hand washing. These are rule-based procedures, not diagnoses.

I find the progress made in differentiating various lesions amazing. But the information supplied by AI is in terms of probabilities, not diagnoses. The danger, and it is a serious one, is that the rule hungry will treat the probabilities as diagnoses, much in the manner above where “almost” becomes “every” in the short span of a sentence.

CMS is particularly prone to use checklists to reflect what it has failed to define in the first place. Real quality in patient care can’t be captured by a MACRA checklist, no matter how complex the series of check boxes.

Our hunger for absolutes can be seen in what has transpired with Standard of Care (SOC) guidelines. Initially, they were developed to be just that: guidelines. But in medicine they have become the brickbat of malpractice. Some patients do more poorly on a standard of care regimen than on alternative treatments. In court, physicians risk losing malpractice cases if they step one iota outside the standard of care box, even if the patient thrives on care outside the rules.

Rules rule. Not common sense.

The Black Box Conundrum

The AI work with deep learning is remarkable. But whatever the results, AI won’t replace radiologists. That anyone in AI would even consider this notion simply demonstrates the danger of the rule-hungering culture:

“That’s the bizarre thing about neural networks,” Thrun said. “You cannot tell what they are picking up. They are like black boxes whose inner workings are mysterious.”

After all, no one knows if the physician’s black box operates the same way as the neural network’s black box. This isn’t a game of who can get more home runs, neural networks or physicians. It’s about using research to build better and better adjuncts to physician diagnosis.

Mukherjee spent some time shadowing Dr. Lindsey Bordone, a well-known dermatologist, observing the way she interacted with patients and arrived at her diagnoses:

…you could almost see the pyramid of neurons in the lower posterior of her brain spark as she recognized the pattern. But the visit did not end there. In almost every case, Bordone spent the bulk of her time investigating causes. Why had the symptoms appeared? Was it stress? A new shampoo? Had someone changed the chlorine in the pool? Why now?

Mukherjee reports that he realized that the most powerful element in the patient encounter with Bordone was not the how of the condition, but the why. Geoffrey Hinton, one of the computer scientists Mukherjee interviewed, believes there’s no question deep learning will replace radiologists. Yet Hinton readily admits that “A deep-learning system doesn’t have any explanatory power.” Let’s face it. There’s a big gap between the two, yet some AI researchers are ready to sweep radiologists out the door. Wishful thinking at its best.

In the final analysis, it’s the expertise of the physician, a physician who has taken time to listen to the patient’s story, that makes the diagnosis. As David Bickers, chair of dermatology at Columbia University, told Mukherjee: “I also know that medical knowledge emerges from diagnosis.” A machine which cannot explain, only point and click, is, well, a lot like the hated electronic medical record system.

The danger is in allowing the point and click to replace the art of diagnosis the physician develops over time.

Perhaps, if researchers and AI experts ever crack the black box, we will better understand the difference between man and machine. For now, we shouldn’t be in such a hurry to sweep physicians out the door because in some narrowly circumscribed ways “my toy is better than yours.”
