What Midas Forgot
Reward, curiosity, and the hidden philosophy of artificial intelligence
Hi Everyone,
I’m reading an excellent book, Natural General Intelligence, by Christopher Summerfield, an Oxford neuroscientist and research scientist at DeepMind in the UK. If you’re interested in how studying the brain can improve AI, this is a great choice.
I want to walk through a section of the book where Summerfield weighs the virtues and drawbacks of reinforcement learning (RL). Here goes.
Is Reward Enough?
The story of King Midas is usually told as a fable about greed, but it is also a fable about computation. Midas gets exactly what he asks for: everything he touches turns to gold. At first this sounds like a perfect objective. Gold is valuable. More gold is better than less gold. So why not maximize gold?
The problem, of course, is that Midas did not specify the objective carefully enough. He wanted the rose to become gold, perhaps, but not his dinner. He wanted treasure, not the death of his daughter. He wanted wealth, not a world in which every object lost its ordinary human use. The fable turns on a deceptively simple point: an objective that sounds clear in ordinary language becomes catastrophic when pursued with literal consistency.
This is why Midas is such a useful entry point into the problem of artificial intelligence. In reinforcement learning, the hope is that we can build agents that learn to act by maximizing reward. But before an agent can maximize reward, someone has to say what counts as reward. This is where the apparent simplicity of the framework conceals the entire philosophical problem.
The basic plumbing is this. In reinforcement learning, an agent acts in an environment. At any moment, the agent occupies some state of the world, selects an action, and receives some consequence. The mathematical framework often used to describe this setup is called a Markov decision process, or MDP. An MDP compactly specifies the relationship among states, actions, transitions, and rewards. The transition function says what is likely to happen when the agent takes a given action in a given state. The reward function says what the agent is supposed to value.
To see why the reward function matters so much, it helps to write the machinery down. An MDP is usually specified as a tuple:
M = ⟨S, A, P, R, γ⟩
Here S is the set of possible states, A is the set of possible actions, P is the transition function, R is the reward function, and γ is the discount factor, which determines how much future rewards matter relative to immediate ones.
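To make the role of γ concrete, here is one common way to write the quantity the agent is trying to maximize, usually called the discounted return. The symbols G_t (the return from time t) and r_{t+k} (the reward received k steps after t) are my notation for illustration, and indexing conventions vary between textbooks.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k}, \qquad 0 \le \gamma < 1

% Worked example with \gamma = 0.9: a reward of 1 that arrives
% three steps from now contributes \gamma^{3} \cdot 1 = 0.9^{3} = 0.729
% to the return, while the same reward received immediately contributes 1.
```

With γ close to 0 the agent cares almost only about the next reward. With γ close to 1 it weighs distant rewards nearly as heavily as immediate ones.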
The agent is in some state, chooses an action, moves to a new state according to the transition function, and receives a reward. The point of learning is to discover a policy: a rule for choosing actions that maximizes expected cumulative reward over time, with future rewards discounted by γ.
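For readers who like to see the pieces in code, here is a minimal sketch of the tuple ⟨S, A, P, R, γ⟩ and of what "finding a policy" can look like. Everything in it is an assumption made for illustration: the three-state corridor, the state and action names, the reward of 1 for reaching the exit, and γ = 0.9. It also uses value iteration, a planning method that assumes P and R are already known, so it is a stand-in for the idea of an optimal policy rather than a learning algorithm from the book.

```python
from typing import Dict, List, Tuple

State = str
Action = str

# Hypothetical three-state corridor: start -> middle -> exit.
STATES: List[State] = ["start", "middle", "exit"]   # S
ACTIONS: List[Action] = ["left", "right"]           # A
GAMMA = 0.9                                         # discount factor (assumed)

# P: maps (state, action) to a list of (probability, next_state).
# Transitions are deterministic here to keep the example small.
P: Dict[Tuple[State, Action], List[Tuple[float, State]]] = {
    ("start", "left"): [(1.0, "start")],
    ("start", "right"): [(1.0, "middle")],
    ("middle", "left"): [(1.0, "start")],
    ("middle", "right"): [(1.0, "exit")],
    ("exit", "left"): [(1.0, "exit")],    # the exit is absorbing
    ("exit", "right"): [(1.0, "exit")],
}


def reward(s: State, a: Action, s_next: State) -> float:
    """R: pay 1 for stepping into the exit, 0 for everything else."""
    return 1.0 if s != "exit" and s_next == "exit" else 0.0


def q_value(s: State, a: Action, V: Dict[State, float]) -> float:
    """Expected reward of taking a in s, plus the discounted value of what follows."""
    return sum(p * (reward(s, a, s2) + GAMMA * V[s2]) for p, s2 in P[(s, a)])


def value_iteration(tol: float = 1e-9):
    """Repeatedly back up state values until they stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            break
    # The policy picks, in each state, the action with the highest value.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
    return V, policy


V, policy = value_iteration()
print(V)
print(policy)
```

Notice that the agent's entire sense of what matters lives in the small `reward` function. Change that one line and the "best" policy changes with it, which is exactly where the Midas problem enters.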
A simple example is a maze. Imagine an agent trying to find the exit.




