Gary Marcus has emerged as one of deep learning’s chief skeptics. In a recent interview, and a slightly less recent Medium post, he discusses his feud with deep learning pioneer Yann LeCun and some of his views on how deep learning is overhyped.
I find the whole thing entertaining, but much of the time LeCun and Marcus are talking past each other rather than with each other. Marcus seems to me to be either unaware of or ignoring certain truths about machine learning, and LeCun seems to basically agree with Marcus’ ideas in a way that’s unsatisfying for Marcus.
The temptation for me to brush 10 years of dust off of my professor hat is too strong to resist. Outside observers could benefit greatly from some additional context in this discussion, and in this series of posts I’ll be happy to provide some. Most important here, in my opinion, is to understand where the variety of perspectives comes from, and where deep learning sits relative to the rest of machine learning. Deep learning is both an incremental advance and a revolutionary one. It’s the same old stuff and something entirely new. Which one you see depends on how you choose to look at it.
The Usual Awfulness of Machine Learning
Marcus’ post, The Deepest Problem with Deep Learning, was written partly in response to Yoshua Bengio’s recent-ish interview with Technology Review. In the post, Marcus comes off as a bit surprised that Bengio is circumspect about deep learning’s long-term prospects, and he goes on to reiterate some of his own long-held criticisms of the field.
Most of Marcus’ core arguments about deep learning’s weaknesses are valid, and perhaps less controversial than he thinks: all of the problems with deep learning that he mentions are commonly encountered by practitioners in the wild. His post doesn’t belabor these arguments. Instead, he spends a good deal of it suggesting that the field is either in denial or deliberately misleading the public about the strengths and weaknesses of deep learning.
Not only is this incorrect, but it also unnecessarily weakens his top-line arguments. In short, the problems with deep learning are worse than Marcus’ post suggests, and they are problems that infect all of machine learning. Alas, “confronting” academics with these realities is going to be met with a sigh and a shrug, because we’ve known about and documented all of these things for decades. However, it’s more than possible that, with the increased publicity around machine learning in the last few years, there are people out there who are informed about the field at a high level while only tangentially aware of its well-known limitations. Let’s review those now.
What Machine Learning Still Can’t Do
By now, examples of CNN-based image recognition being “defeated” by various unusual or manipulated input data should be old news. While the composition of these examples is an interesting curiosity to those in the field, it’s important to understand why they are surprising to almost no one with a background in machine learning.
Consider the following fake but realistic dataset of eight people, in which we know each person’s height, weight, and number of pregnancies, and we want to predict their sex based on those variables:
| Height (in.) | Weight (lbs.) | Pregnancies | Sex |
|---|---|---|---|
| 72 | 170 | 0 | M |
| 71 | 140 | 0 | M |
| 74 | 250 | 0 | M |
| 76 | 300 | 0 | M |
| 69 | 160 | 0 | F |
| 65 | 140 | 2 | F |
| 60 | 100 | 1 | F |
| 63 | 150 | 0 | F |
Any reasonable decision tree induction algorithm will find a concise classifier (Height > 70 = Male, else Female) that classifies the data perfectly. The model is certainly not perfect, but it’s also not a terrible one by ML standards, considering the amount of data we have. It will almost certainly perform much better than chance at predicting people’s sex in the real world. And yet, any adult human will do better with the same input data. The model has an obvious (to us) blind spot: it doesn’t know that people over 5’10” who have been pregnant at least once are overwhelmingly likely to be female.
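To make that concrete, here’s a minimal sketch of the same experiment using scikit-learn (my choice of library for illustration; nothing about the argument depends on it). The exact threshold the tree picks may vary, but on this data it learns a single height split, and a 5’11” person who has been pregnant twice gets classified as male:

```python
# A minimal sketch, assuming scikit-learn is installed. The eight rows below
# are the fake-but-realistic dataset from the table above.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: height (in.), weight (lbs.), pregnancies
X = [
    [72, 170, 0], [71, 140, 0], [74, 250, 0], [76, 300, 0],  # the four men
    [69, 160, 0], [65, 140, 2], [60, 100, 1], [63, 150, 0],  # the four women
]
y = ["M", "M", "M", "M", "F", "F", "F", "F"]

tree = DecisionTreeClassifier().fit(X, y)

# The learned tree is essentially "height > 70 => M, else F".
print(export_text(tree, feature_names=["height", "weight", "pregnancies"]))

# A 5'11" person who has been pregnant twice: the tree answers "M", because
# nothing in the training data ties pregnancy to sex.
print(tree.predict([[71, 150, 2]]))  # -> ['M']
```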
This can easily be phrased in a more accusatory way: Even when given training data about men and women and the number of pregnancies each person has had, the model fails to encode any information at all about which sex is more likely to get pregnant!
Put in those words, it sounds pretty damning; the model’s “knowledge” turns out to be incredibly shallow. But this is not a surprise to people in the field. Machine learning algorithms are by design parsimonious, myopic, and at the mercy of the amount and type of training data that you have. More problems are exposed when we allow adversarially selected examples, where an adversary gets to present examples constructed or chosen to “fool” the model. I’ll leave it as an exercise for the reader to calculate how well the classifier would do on a dataset of WNBA players and Kentucky Derby jockeys.
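If you’d rather not do the arithmetic, extending the sketch above with a couple of invented, purely hypothetical rows in that spirit makes the point:

```python
# Spoiler: two invented, hypothetical rows in the spirit of that exercise,
# fed to the tree fitted in the sketch above.
adversarial_X = [
    [73, 175, 1],  # a 6'1" WNBA player who has been pregnant
    [64, 112, 0],  # a 5'4" male jockey
]
print(tree.predict(adversarial_X))  # -> ['M' 'F']; the height-only rule gets both wrong
```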
Enter Deep Learning, To No Fanfare At All
Deep learning is not different (at least in this way) from the rest of statistical learning: All of the adversarial examples presented in the image recognition literature are more or less the same as the 5’11” person who’s been pregnant; there was nothing like that in the dataset, so there’s no good reason to expect the model would get it right, despite the “obviousness” of the right answer to us.
There are various machine learning techniques for addressing bits and pieces of this problem, but in general, it’s not something easily solvable within the confines of the algorithm. This isn’t a “flaw” in the algorithm per se; the algorithm is doing what it should with the data that it has. Marcus is right when he says that machine-learned models will fail to generalize to out-of-distribution inputs, but, I mean, come on. That’s the i.i.d. assumption! It’s been printed right there on the tin for decades!
Marcus’ assertion that “In a healthy field, everything would stop when a systematic class of errors that surprising and illuminating was discovered” presupposes that researchers in the field were surprised by, or found anything illuminated by, that particular class of errors. I certainly wasn’t, and my intuition is that few in the field would be. On the contrary, if you show me those images without telling me the classifier’s performance, I’m going to say something like “that’s going to be a tough one for it to get right”.
In the back-and-forth on Twitter, Marcus seems stung that the community is “dismissive” of this type of error, and scandalized that the possibility of such errors isn’t mentioned in the landmark Nature paper on deep learning, and herein, I think, lies the disconnect. For the academic literature, this is too mundane and well-known a limit to bother stating. Marcus wants a field-wide and very public mea culpa for a precondition of machine learning that was trotted out repeatedly during our classes in grad school. He will probably remain disappointed. Few in the community will see the need to restate that limitation every time there’s a new advance in machine learning; the existence of that limit is a part of the context of every advance, as much as the existence of computers themselves.
For communications with the public at large outside of the field, though, perhaps Marcus is right that such limits could take center stage a bit more often (as Bengio rightly does in his interview). Yes, it’s true! You can almost always find a way to break a machine learning model by fussing with the input data, and it’s often not even very hard! One more time for the people in the back:
People who think deep learning is immune to the usual problems associated with statistical machine learning are wrong, and those problems mean that many machine learning models can be broken by a weak adversary or even subtle, non-adversarial changes in the input data.
This makes machine learning sound pretty crummy, and it again elicits quite a bit of hand-wringing from the uninitiated. There are breathless diatribes about how machine learning systems can be, horror of horrors, fooled into making incorrect predictions! Those diatribes aren’t wrong; if you’re in a situation where you think such trickery might be afoot, that absolutely has to be dealt with somewhere in your technology stack. Then again, this is so even if you’re not using machine learning.
Fortunately, there are many, many cases where this sort of brittleness is just not that much of a problem. In speech recognition, for example, there’s no one trying to “fool” the model and languages don’t typically undergo massive distributional changes or have particularly strange and crucial corner cases. Hence, all speech recognition systems use machine learning and the models do well enough to be worth billions of dollars.
Yes, all machine-learned models will fail somehow. But don’t conflate this failure with a lack of usability.
Not Even Close
I won’t go into Marcus’ other points (such as the limits on the type of reasoning deep learning can do or its ability to understand natural language) in as much detail, but I found it interesting how closely those points coincide with someone else’s arguments about why “strong AI” probably won’t happen soon. That was written before I’d even heard of Gary Marcus, and the relevant section is composed mostly of ideas that I heard many times over the course of my grad school education (which is now far – disturbingly far – in the past). Yes, these points are again valid, but among people in the field, they again have little novelty.
By and large, Marcus is right about the limitations of statistical machine learning, and anyone suggesting that deep learning is spectacularly different on these particular axes is at least a little bit misinformed (okay, maybe it’s a little bit different). For the most part, though, I don’t see experts in the field suggesting this, certainly not to the pathological levels hinted at by Marcus’ Medium post. I do readily admit the possibility that, amid the glow of high-profile successes and the public spotlight, all of the academic theory and empirical results showing exactly how and when machine learning fails may get lost in the noise, and hopefully I’ve done a little to clarify and contextualize some of those failures.
So is that it, then? Is deep learning really nothing more than another thing drawn from the same bag of (somewhat fragile) tricks as the rest of machine learning? As I’ve already said, that depends on how you look at it. If you look at it as we have in this post, yes, it’s not so very different. In my next post, however, we’ll take a look from another angle and see if we can spot some differences between deep learning and the rest of machine learning.