Hybrid Systems Aren't a Promising Path to AGI
Plus: Why Kahneman's "System 1 and System 2" is artificial, and the easy and hard problems of intelligence.
Say Hello to my System One….
Hi everyone,
My jumping-off point for this post is an interesting piece Gary Marcus wrote last week, making a good case against Gen AI as a path to AGI, and also making a case for neurosymbolic approaches to AI that marry neural networks with symbolic or rule-based methods. Marcus—not alone—references the late Daniel Kahneman’s 2011 bestseller Thinking, Fast and Slow. In it, Kahneman draws a distinction—though he makes clear it’s not intended to be an accurate neuroscientific account—between a “System 1” that’s fast and reflexive but possibly wrong, and a “System 2” that’s slower, more deliberative, and sensitive to accuracy and truth:

“The idea is to try to take the best of two worlds, combining (akin to Kahneman’s System I and System II), neural networks, which are good at kind of quick intuition from familiar examples (a la Kahneman’s System I) with explicit symbolic systems that use formal logic and other reasoning tools (a la Kahneman’s System II).”
In this post I’ll explain why I think Kahneman’s distinction is artificial and doesn’t provide a particularly useful framework for thinking about our inferences. Marcus seems to take the “System 2” idea as an explicitly computational and (more to the point) deductive type of inference mechanism, the stuff of Classical AI. Ultimately this leads to the idea of hybrid “neuro” and “symbolic” systems that may go further together toward AGI than either does separately. I’m skeptical of the “hybrid” claim, and I’ll explain why. This is the first of a two-part series.
The Reliability Problem and Truthier AI
Marcus is worried about Gen AI and reliability, which is what I’m worried about (and you should be too). It makes sense to consider logic-based approaches to AI when confronted with reliability issues, because unlike neural networks they offer the promise of certainty: in deduction if the premises are true, the conclusion has to be true. Knowing whether your AI is telling it to you straight would be nice. It would certainly address reliability. But there are, as you might expect, some pretty hairy problems wrapped up with all of this. There’s a reason these approaches are now mostly obsolete in serious work on AI.
To date, all the deductive approaches to achieving AI have roundly failed. It’s a shame, because the theoretical framework for deduction is beautiful, and offers a “roadmap” for establishing the veracity of a system’s conclusions.1 Deduction and its variants were tried for decades, though, to the tune of hundreds of millions of dollars of (mostly) government R&D, and in the end it all… failed. No one takes seriously anymore the possibility of “scaling up” symbolic systems to get to AGI, let alone systems using only deduction (a small but well-developed subset of “symbolic”). Younger readers may wonder what everyone was smoking, but rewind a few decades and you’ll discover that yes, the field took the symbolic, and in particular the deductive, approach seriously. What happened?
Man Takes Birth Control Pills, Doesn’t Get Pregnant
In a word, relevance. It proved impossible to use deductive approaches in dynamically changing domains requiring commonsense and other types of real-world knowledge, like an appreciation of causes and effects. In other words, it proved impossible wherever inferences are context-dependent and what’s true depends on other relevant considerations and can’t simply be deduced. A deductive system might happily conclude that Joe didn’t get pregnant because he took his wife’s birth control pills—birth control pills, after all, prevent pregnancy. It’s a silly conclusion because it’s entirely irrelevant: men don’t get pregnant anyway. “Relevance” is one of those seemingly tractable issues that end up being the thread that, when pulled, unravels the whole sweater. Engineers grappled with it for decades before essentially walking away from it, or at least quietly retiring the feverishly hyped announcements.
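To make the silliness concrete, here is a deliberately crude sketch of monotonic forward chaining (my own toy example in Python, not any historical system): both rules are true, the deduction is valid, and the conclusion is still useless.

```python
# Toy forward-chaining deduction (illustrative only; not any historical AI system).
# Each rule maps a set of premise facts to a concluded fact.
RULES = [
    ({"takes_birth_control(joe)"}, "not_pregnant(joe)"),  # pills prevent pregnancy
    ({"male(joe)"}, "not_pregnant(joe)"),                 # the *relevant* reason
]

FACTS = {"male(joe)", "takes_birth_control(joe)"}

def derivations(goal, facts, rules):
    """Return every rule whose premises hold and whose conclusion is the goal."""
    return [premises for premises, conclusion in rules
            if conclusion == goal and premises <= facts]

# Both derivations are equally valid deductions; the engine has no notion of which
# premise is the relevant explanation of Joe's non-pregnancy.
print(derivations("not_pregnant(joe)", FACTS, RULES))
# [{'takes_birth_control(joe)'}, {'male(joe)'}]
```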
Relevance-based problems closed off any route to general intelligence, or AGI, for Classical AI, which ended up showing promise only on problems that could avoid the curse of the real world, like the formal verification of computer chip designs (still a killer app). The “relevance problem,” in retrospect, seems quite obvious, but true believers still seek ways to inject old methods into the newer paradigm (understandably, since hallucinating neural networks set the truth and reliability bar pretty low). For now I’ll say this: if anything has been tried exhaustively and in every conceivable combination and permutation in AI, it’s the symbolic approach, whose summum bonum is deduction. The problems were never resolved. The approach was simply abandoned.
Marcus’s post snapped me out of my slumber, as it were (and to paraphrase Kant), and reminded me that discussion of neurosymbolic systems is, inter alia, a suggestion that we reopen closed cases, and try to resuscitate some part of Classical AI2. I think this is a profound mistake, and I’ll try to explain why. In truth, I might be better served by not explaining why and leaving all this alone, as someone is bound to take umbrage. AI promotes itself as a practical field driven by effectiveness, but deep divisions persist among its various factions. Fervent and dogmatic but essentially pointless disputes simmer in the background; in the past they raged on openly in university labs, think tanks, and (to a lesser extent) corporations. I don’t want to get tangled in that religious-minded mess. Still, to understand what AI is and where it’s going, I think we need to go for it. So, let us go for it.
Turn now to the hybrid question, to the question of combining deduction (or symbolic AI) with induction (or “neuro” AI aka machine learning aka empirical AI). For this new discussion, nothing much turns on whether we’re talking about deduction proper or just any symbolic approach.
The Easy Problem of Intelligence (Or: what works in AI is already hybrid)
“Neurosymbolic” is self-explanatory. It involves the fusion of two distinct approaches to AI: neural networks and symbolic systems. Neurosymbolic systems are, in other words, hybrid systems. The claim that getting to AGI requires building hybrid systems rather than just, say, a super-scaled-up LLM is just the claim that AI, to succeed, must use more than one approach—it must be hybrid.
And here I make my first Big Claim: all successful AI—with rare exception—is hybrid.
AI has never succeeded at solving an interesting problem by applying one method to the exclusion of the others, and systems that do cool things, like drive cars or play Jeopardy!, are invariably engineered with all sorts of tricks and algorithms and approaches, from the ever-growing grab bag available to AI practitioners.
Let me give a few examples, and then I’ll explain what I mean by “the easy problem of intelligence.” First, Marcus himself makes the point about the ubiquity of hybridization in AI in his book Rebooting AI, written with Ernest Davis (it’s a good book—I recommend it). I’ll quote from my own book here:
When the developers of DeepMind claimed, in a much-read article in the prestigious journal Nature, that it had mastered Go “without human knowledge,” they misunderstood the nature of inference, mechanical or otherwise. The article clearly “overstated the case,” as Marcus and Davis put it. In fact, DeepMind’s scientists engineered into AlphaGo a rich model of the game of Go, and went to the trouble of finding the best algorithms to solve various aspects of the game—all before the system ever played in a real competition. As Marcus and Davis explain, “the system relied heavily on things that human researchers had discovered over the last few decades about how to get machines to play games like Go, most notably Monte Carlo Tree Search . . . random sampling from a tree of different game possibilities, which has nothing intrinsic to do with deep learning. DeepMind also (unlike [the Atari system]) built in rules and some other detailed knowledge about the game. The claim that human knowledge wasn’t involved simply wasn’t factually accurate.”
It’s a powerful critique. It’s also a great example of how AI at the level of systems development invariably becomes hybrid.
An even better example is IBM’s development of Watson, the system that played the popular long-running game show Jeopardy! Watson beat Ken Jennings, the “GOAT” of Jeopardy! (at least as of 2011; I don’t follow Jeopardy!). At the time, it seemed fantastical, a bit like when ChatGPT started chatbotting us in 2022. I was hooked. When IBM “open sourced” the detailed technical papers on how Watson worked, I printed out hundreds of pages and read about the system basically end to end. I found that Watson was a brilliantly designed kluge of clever tricks, using methods from machine learning—which today is essentially neural networks—as well as symbolic or classical AI.
Jeopardy! is played by contestants selecting a category, like “Starts with ‘W’”. Watson sported special-purpose code modules for analyzing particular types of categories. If the category was “four letter words,” Watson would look for candidate responses that have four letters, and so on. Modules were organized in a huge pipeline that followed parallel tracks down to a statistical guess about the strongest answer (technically: “question,” if you know Jeopardy!, as contestants supply the question to the offered answer). Scores of these one-off code modules narrowed down the possibilities for particular categories and responses. The IBM team discovered that most Jeopardy! questions were titles of Wikipedia pages. Fantastic. A subsystem indexed all of Wikipedia and the titles of its pages, and did a quick search early in the pipeline—why keep processing when the answer can be retrieved from a hash map?
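To give a flavor of what such a pipeline looks like, here is a much-simplified sketch (my own toy reconstruction in Python, with invented handlers and data, not IBM’s implementation): category-specific modules propose candidates, an early Wikipedia-title lookup handles the cheap cases, and everything funnels into a ranked guess.

```python
# Toy sketch of a Watson-style answer pipeline (illustrative only; not IBM's code).

# Stand-in for an index of Wikipedia page titles.
WIKI_TITLES = {"echo", "chicago", "halloween"}

def four_letter_handler(category, clue):
    """Category-specific module: only propose candidates with four letters."""
    if "four letter" not in category.lower():
        return []
    return [w for w in clue.lower().split() if len(w) == 4]

def wiki_title_handler(category, clue):
    """Early lookup: if a clue word is itself a page title, surface it cheaply."""
    return [w for w in clue.lower().split() if w in WIKI_TITLES]

HANDLERS = [four_letter_handler, wiki_title_handler]

def best_response(category, clue):
    # Run every handler ("parallel tracks"), pool the candidates, then rank them.
    candidates = [c for handler in HANDLERS for c in handler(category, clue)]
    # Stand-in for the statistical scoring stage: rank by how much support a
    # candidate gets across all the handlers.
    scores = {c: candidates.count(c) for c in candidates}
    return max(scores, key=scores.get) if scores else None

print(best_response("FOUR LETTER WORDS", "a reflected sound: echo echo echo"))
# Both modules propose "echo", so it wins the toy ranking.
```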
Of course—it’s 2011—machine learning was used. Watson debuted just before convolutional neural networks wowed everyone in 2012 at the ImageNet competition. Strangely, then: no neural networks. But ML, yes. Logistic regression was used (if I recall correctly) in scoring candidate responses, and Monte Carlo methods were used when deciding whether to “buzz in” and attempt a response (a toy sketch of this scoring-and-buzzing setup follows the two definitions below). A powerful system? Yes. But no path to AGI, because the IBM team wasn’t even trying to make a general intelligence. This is a hallmark of successful (hybrid) systems. They’ve been engineered to solve the easy problem of intelligence. They make no headway on the harder part, which I’ll try to explain later. For now:
(1) The Easy Problem of Intelligence: Find some task/problem that if performed by a human would require intelligence. Engineer the hell out of the problem using any available approach, making a perfectly narrow system capable of performing wondrously (and superhumanly) on the problem.
(2) The Hard Problem of Intelligence: Ignore one-off problems and ask big questions about what would amount to generality—what use cases? Make flexible handling of relevance/context/causality/hypothesis/understanding a precondition of success, and strive to build something that can be applied to lots of different tasks/problems, rather than high visibility but narrow one-offs. (Hint: ChatGPT isn’t it.)
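Here, as promised, is the toy gesture at Watson’s ML side: a logistic-regression scorer turns a candidate’s features into a confidence, and a simple threshold stands in for the buzz-in decision. The features and numbers are invented; this is a sketch of the pattern, not IBM’s code.

```python
# Toy sketch: candidate confidence scoring and a buzz-in decision (not IBM's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: features of past candidates (say, handler agreement and a
# retrieval score) and whether the candidate turned out to be correct.
X = np.array([[0.9, 0.8], [0.2, 0.1], [0.7, 0.9], [0.1, 0.3], [0.8, 0.6], [0.3, 0.2]])
y = np.array([1, 0, 1, 0, 1, 0])

scorer = LogisticRegression().fit(X, y)

def should_buzz(features, threshold=0.65):
    """Buzz in only when the estimated confidence clears a risk threshold."""
    confidence = scorer.predict_proba([features])[0, 1]
    return round(confidence, 2), confidence >= threshold

print(should_buzz([0.85, 0.7]))  # high-agreement candidate
print(should_buzz([0.25, 0.2]))  # weak candidate
```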
More hybrids. One of my favorites: self-driving cars. [I’ll omit discussion in the interests of cracking on here.] The lesson is: no one algorithm or approach can handle a big complicated problem. To put it another way: no one algorithm or approach can solve the easy problem of intelligence! Ergo, since we’ve been engineering hybrid solutions all along, invoking “hybrid” doesn’t seem particularly groundbreaking. Something different would seem, well, better.
In other words, AI systems tend to be hybrid because we don’t have any computational theory of general intelligence yet. We’ve got the variegated beasts, the Watsons and so on. To extend my definition a bit, AI has succeeded by zeroing in on problems that (a) have a chance of getting solved by computers, (b) are interesting in either the sense of “we need that” (like solving protein folding) or “that would be cool,” like playing Jeopardy! or chess or Go or Atari games. The “need” or “cool” factor typically means that there will be follow-on funding, stock prices will rise, media and the public will take note, and so on. Generality is engineered out of the hybridized systems. That’s how they solve the problem we’ve selected. It won’t do to turn around and ask why AI isn’t general. No one has a clue, and big engineered hybrid programming projects are the only stuff that works.
To recap: we first identify a problem that’s interesting, and then we go about engineering the hell out of it, to get a computational solution.3 This is why we don’t see engineers working on a giant Thinking Brain that will solve world hunger, by the way. We see engineers working on something that’s always some combination of doable and useful (or cool). Let’s just stop for a moment and think about what I just said. It seems we’re not really making any progress at all toward AGI. We seem to be picking problems and piling up big hybrid solutions. Watson, as we all now know, didn’t do so well when Big Blue ported it over to health care. That’s not IBM’s fault. It’s just, the field. It’s just AI.
I’ve made the point I want to make, but this occurred to me: ChatGPT itself is a hybrid system, and not in a trivial way. After the unsupervised “pre-training” (that is: the expensive, I-need-millions-of-dollars training), the model is fine-tuned using a dataset of human-annotated responses. That’s supervised learning; the pre-training is unsupervised learning. Reinforcement learning comes next, to further tweak parameters and skew toward desired output (Reinforcement Learning from Human Feedback, RLHF). And when you ask ChatGPT a math question, or a question requiring math, it might farm the calculation out to a Python code snippet. That’s not a neural network “black box”; it’s rule execution using traditional programming.
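The math hand-off is easy to caricature in code. The sketch below is my own illustration of the general pattern, not how OpenAI or anyone else actually implements tool use: a rule decides that the query looks like arithmetic and routes it to deterministic code; everything else falls through to the model.

```python
# Toy sketch of routing a math query to ordinary code (not any vendor's actual stack).
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def eval_arithmetic(expr):
    """Evaluate a small arithmetic expression with plain rule execution."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not simple arithmetic")
    return walk(ast.parse(expr, mode="eval").body)

def respond(query, llm=lambda q: "<free-form model output>"):
    # Crude router: arithmetic goes to deterministic code, everything else to the model.
    stripped = query.replace("?", "").strip()
    try:
        return str(eval_arithmetic(stripped))
    except (ValueError, SyntaxError):
        return llm(query)

print(respond("2 + 2"))                 # handled by rule execution, not the network
print(respond("Who invented floss?"))   # falls through to the model
```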
The easy problem therefore sounds a bit tautological: it’s using whatever you have that’ll work to solve a problem that can be solved one way or another. Yes, there are surprise performances, and ChatGPT certainly qualifies. But it’s also another example of solving the Easy Problem of Intelligence. Hybrids. No fundamental insights.
The Hard Problem of Intelligence
The Hard Problem is somewhat tricky to unpack but can be glossed as “and the system has to know something about the problem it’s solving” (see above). This is the problem of actually understanding “2 + 2 = 4” versus printing this statement out when asked what “2 + 2” equals. AI folks perennially harp on this problem of the “missing understanding,” critics and proponents alike. I harp on it. Let’s turn to one reason this pesky understanding requirement won’t just go away.
Mistakes Are Only Human. And Machine.
It comes down to error—and error rates. Every cognitive system we know of makes mistakes. We make mistakes. AI makes mistakes. This is such a truism that if a system doesn’t make any mistakes, we typically don’t think it’s intelligent (consider: a calculator). Now, what Marcus is getting at with his suggestion that we use some symbolic system or other (his examples, DeepMind’s AlphaProof and AlphaGeometry 2, use theorem provers, in other words deductive logic) is that the two systems combined may show promise for getting to AGI, where any one approach probably won’t. That’s a nice suggestion, but it seems to support the Easy Problem, not the Hard. AlphaProof is not more general because it’s hybrid; it’s just more effective—at solving the identified problem. That’s the Easy Problem. Progress on the Hard Problem would mean that somehow the hybridization of a system extends its generality. We don’t see that. It’s not at all clear, given what we know about how easy problems go, that it’s even a good hypothesis. We see the DeepMind International Math Olympiad systems perform better than (say) an LLM on a selected problem by doing what AI researchers always do: stare at the problem and engineer some workable solution to it. Deduction turns out to be pretty useful for doing math and geometry proofs.
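To give a flavor of what a theorem prover buys you (in kind, not in scale; AlphaProof’s setting is vastly more elaborate), here are two trivial machine-checked deductions in Lean 4. If they compile, the conclusions follow; there is no error rate to tune.

```lean
-- Trivial machine-checked deductions (Lean 4); illustrative only.
example : 2 + 2 = 4 := rfl

-- A general claim, discharged by an existing lemma rather than by checking cases.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```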
Back to errors. In the Easy Problem, a system is deemed a success, typically, when the errors fall below a certain level; this is minimizing the loss function in machine learning research (think deep neural networks). Note that “error” is highly dependent on context: 1 error in 10 might be okay for an image recognition contest but terrifying for a passenger in a self-driving car. Successful systems typically best humans at some task or other, as we see with games like chess and Go. If the problem is really hairy, like engaging in commonsense dialogue with a person (Conversational AI), we might put up with an occasional error, but here the type of error will matter a great deal. If the system occasionally assures the human that “A” and “not-A” are both true, that’s a teeth-shattering error. It means “whoops, we have no understanding in this system.” If it gets the name of the mother of the guy who invented floss mixed up with the toothpaste guy’s mom, we might not care. (Then again, depending on the situation, we really might.) I think Marcus is spot-on in fixing on reliability as the key obstacle with today’s AI. Thinking about errors helps make this even clearer.
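In the textbook picture, a system counts as a success when its average loss is driven low enough, and the acceptable error level is supplied by the deployment context rather than by the math. In standard notation (nothing here is specific to any one system):

$$
\hat{f} \;=\; \arg\min_{f \in \mathcal{F}} \; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i),\, y_i\big),
\qquad \text{deploy only if } \; \operatorname{err}(\hat{f}) \le \epsilon_{\text{context}},
$$

where $\epsilon_{\text{context}}$ might be one error in ten for a benchmark and several orders of magnitude smaller for a passenger-carrying car.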
To put it another way, LLMs don’t make THAT many errors, but the errors they do make are all too often really boneheaded, stupid, subhuman, and possibly plain dangerous. For some theoretical purposes, LLMs “solved” the Conversational AI problem. For practical purposes, whether the low error rate is acceptable is hyper-sensitive to context, and so a true evaluation would have a very high bar. The Hard Problem has reared its ugly head.
This takes me to my last points regarding Kahneman’s System 1 and System 2.
Kahneman’s Mistake
Kahneman introduced two intuitive concepts that apply to cognition: we sometimes make hasty but often correct inferences, and we sometimes “think through” a problem to get to an answer. The former he called “System 1.” The latter, “System 2.” AI researchers like Marcus (he references Kahneman’s “System 1 and 2” in his post4) seem fond of Kahneman’s mistake—err, distinction—I guess because it’s a handy heuristic for thinking about fast but possibly bullshit black box inferences with neural networks, and slower, more deliberate and transparent reasoning typical of the older failed approaches. If only we could combine these, it might extend generality—and generality is the Holy Grail for AI. But there’s a problem with Kahneman’s distinction, and it unfortunately translates to an irritating blind spot for researchers in AI. This last bit about Kahneman I hope will tie back into the larger discussion about hybrid systems and easy and hard problems of intelligence. Best here if I quote (at length) from my book, The Myth of Artificial Intelligence:
The idea that part of our thinking is driven by hardwired instincts has a long pedigree, and it appears in a modern guise with the work of, for instance, Nobel laureate Daniel Kahneman. In his 2011 best seller, Thinking, Fast and Slow, Kahneman hypothesized that our thinking minds consist of two primary systems, which he labeled Type 1 and Type 2. Type 1 thinking is fast and reflexive, while Type 2 thinking involves more time-consuming and deliberate computations. The perception of a threat, like a man approaching with a knife on a shadowy street, is a case of Type 1 thinking. Reflexive, instinctual thinking takes over in such situations because (presumably) our Type 2 faculties for careful and deliberate reasoning are too slow to save us. We can’t start doing math problems—we need a snap judgment to keep us alive. Type 2 thinking involves tasks like adding numbers, or deciding on a wine to pair with dinner for guests. In cases of potential threat, such Type 2 thinking isn’t quickly available, or helpful.
Kahneman argued in Thinking, Fast and Slow that many of our mistakes in thinking stem from allowing Type 1 inferences to infect situations where we should be more mindful, cautious, and questioning thinkers. Type 1 has a way of crowding out Type 2, which often leads us into fallacies and biases.
This is all good and true, as far as it goes. But the distinction between Type 1 and Type 2 systems perpetuates an error made by researchers in AI, that conscious intelligent thinking is a kind of deliberate calculation. In fact, considerations of relevance, the selection problem, and the entire apparatus of knowledge-base inference are implicit in Type 1 and Type 2 thinking. Kahneman’s distinction is artificial. If I spot a man walking toward me on a shadowy street in Chicago, I might quickly infer a threat. But the inference (ostensibly a Type 2 concern) happens so quickly that in language, we typically say we perceive a threat, or make a snap judgment of a threat. We saw the threat, we say. And indeed, it will kick off a fight-or-flight response, as Kahneman noted. But it’s not literally true that we perceive a threat without thinking. Perceived threats are quick inferences, to be sure, but they’re still inferences. They’re not just reflexes. (Recall Peirce’s azalea.)
Abduction [I don’t discuss abduction in this post but readers familiar with my “inference framework” will connect the dots here] again plays a central role in fast thinking: it’s Halloween, say, and we understand that the man who approaches wears a costume and brandishes a fake knife. Or it’s Frank, the electrician, walking up the street with his tools (which include a knife), in the shadows because of the power outage. These are abductions, but they happen so quickly that we don’t notice that background knowledge comes into play. Our expectations will shape what we believe to be threat or harmless, even when we’re thinking fast. We’re guessing explanations, in other words, which guides Type 2 thinking, as well. Our brains—our minds, that is—are inference generators.
In other words, all inference (fast or slow) is noetic, or knowledge-based. Our inferential capabilities are enmeshed somehow in relevant facts and bits of knowledge. The question is: How is all this programmed in a machine? As Levesque points out, some field, like classic AI’s knowledge representation and reasoning, seems necessary to make progress toward artificial general intelligence. Currently, we know only this: we need a way to perform abductive inference, which requires vast repositories of commonsense knowledge. We don’t yet know how to imbue machines with such knowledge, and even if we figure this out someday, we won’t know how to implement an abductive inference engine to make use of all the knowledge in real time, in the real world—not, that is, without a major conceptual breakthrough in AI.
Whew. So much for Kahneman, at least by my lights. The broader point here is the question about the Hard Problem. Again, nearly every interesting challenge in AI is already tackled with a hybrid approach, and what we’ve learned isn’t promising: hybrid systems are good for the Easy Problem, but it’s a flat mystery—and not, I think, a particularly promising one—how combining failed approaches with not-yet-failed approaches (or, more charitably, failed approaches with currently dominant approaches) gets us out of easy problems and into progress on hard ones. Kahneman’s distinction plays into the idea of building out modules for big hybrid systems that solve problems by breaking them down into smaller parts: this part goes to the fast system, this part we had better think about more.
I argued in the Myth that we don’t get generality from AI by combining inferences already known to be inadequate individually (induction and deduction). You can read my case for abduction as tantamount to looking for a solution to the Hard Problem by avoiding the hybridization of AI and its allocation of parts of a problem to specific modules or subsystems. Systems have subsystems, yes. But the point is that we can’t divide the subsystems in terms of inference. The fundamental theory we’re in search of would allow us to reason flexibly, from particular observation to particular observation via rules. We can formalize this type of inference as “abductive inference” but we don’t really understand how to implement it computationally, or for that matter, using anything at all. Brains do it constantly—it’s like a condition of being awake. So for now anyway, we’re stuck. We’d be better off, in my view, reverse-engineering the brain and looking for a powerful algorithm in the neocortex than continuing to make hybrid systems tailored to specific problems. At the very least, we need some reason to think more hybridization won’t give us more of the same—better performance on more and more easy problems, but no hope of traction on the hard problem5.
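To make the shape of that claim a bit more concrete before closing, here is a toy sketch of abduction as best-explanation selection. Everything in it is invented for illustration, and the hard part, which it entirely dodges, is where the background knowledge and the context-sensitivity would come from in the first place.

```python
# Toy sketch of abductive inference as best-explanation selection (illustrative only).
# Real abduction would need vast commonsense knowledge; here the "knowledge" is a
# handful of hand-coded priors and context adjustments.

OBSERVATION = "man approaching with a knife"

# Candidate explanations with rough prior plausibilities (invented numbers).
EXPLANATIONS = {
    "threat": 0.2,
    "halloween costume": 0.05,
    "electrician carrying tools": 0.1,
}

# Context cues shift how plausible each explanation is (hand-coded stand-ins for
# the background knowledge a real system would need).
CONTEXT_BOOSTS = {
    "it is halloween": {"halloween costume": 10.0},
    "power outage on the street": {"electrician carrying tools": 6.0},
    "dark alley, late at night": {"threat": 3.0},
}

def best_explanation(observation, context):
    """Score each candidate explanation of the observation against the context."""
    scores = dict(EXPLANATIONS)
    for cue in context:
        for explanation, boost in CONTEXT_BOOSTS.get(cue, {}).items():
            scores[explanation] *= boost
    best = max(scores, key=scores.get)
    return f"best explanation of {observation!r}: {best}"

print(best_explanation(OBSERVATION, ["it is halloween"]))
print(best_explanation(OBSERVATION, ["dark alley, late at night"]))
```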
I think I’ve run out of steam on this post, so for now I’ll leave it here. I trust anything that’s left dangling or unresolved can be taken up in comments. Thanks everyone!
Erik J. Larson
I’m trying to keep this informal, but I’ll have to bring up some relevant concepts. “Proof theory” is syntactic and is concerned with how a system draws a valid conclusion. “Truth theory” is semantic, or meaning based, and is concerned with how valid conclusions can be considered true, or in other words how the inference can be sound. These concepts are all part of the study of mathematical logic, and found their way into AI early on, as the pioneers (like McCarthy, Minsky, and others) were mathematicians and realized that logical deduction could (to some extent) be automated.
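In standard textbook notation: write $\Gamma \vdash \varphi$ for “the system can derive $\varphi$ from the premises $\Gamma$” (proof theory), and $\Gamma \models \varphi$ for “$\varphi$ is true whenever everything in $\Gamma$ is true” (truth theory). Soundness is the bridge between the two, and it is the guarantee that neural networks lack:

$$
\text{Soundness:}\qquad \Gamma \vdash \varphi \;\Longrightarrow\; \Gamma \models \varphi.
$$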
Deductive systems are technically a subset of “symbolic” systems. There were plenty of ad hoc symbolic systems approaches, but since they didn’t have a semantics, a way of knowing what’s true and what’s a contradiction and so on, they typically ended in various degrees of spaghetti code. “Spaghetti code” isn’t a bad rendition of what happened to symbolic systems everywhere, as we kept trying to extend them to cover new knowledge, new cases, and especially to incorporate some notion of relevance. I’ll try to explain this as we go.
Credit for the wonderful phrase “engineering the hell out of it,” goes to Gerben Wierda. See https://ea.rna.nl/2024/02/07/the-department-of-engineering-the-hell-out-of-ai/.
Kahneman can be read as giving a “rough and ready” distinction for cognition, and a charitable way to take Marcus—I take him this way—is that he’s appealing to the rough and ready idea, with nothing substantive hanging on it. That’s fine as far as it goes, but I have a specific reason for rejecting K’s distinction; namely, that it tends to obscure the noetic nature of our inferences writ large.
I didn’t tease “generality” and “understanding” apart, but it’s a bit out of scope, and at any rate they seem to be very closely tied so that one is essentially a precondition of the other. We can haggle, sure. It’s for another post.