Discussion about this post

User's avatar
Amarda Shehu's avatar

The contradiction separates into two distinct behaviors. Refusal declines a particular region that the system was trained to avoid; it is visible and so can be contested. Steering routes you toward a resolution while preserving the form of deliberation. This is way harder to catch for the general user.

Steering, however, is not a hidden author. It is what preference optimization produces. RLHF (you wrote a piece on RL) fits a reward model to aggregated human ratings. I think of this as aggregating to the center of mass, not your circumstances. So, when your prompt underdetermines the response, the output regresses toward that center. What you experience is the regression as an imposed value. Polanyi survives the translation: particular cases do not reduce to rules, and they do not reduce to a reward model's expected value.

Knowledge and possibility are not diverging. Expository capacity grows. What contracts instead is something the model never had to begin with, the authority to resolve one case for one person.

Contarini's avatar

Agreed absolutely. Embedded, undisclosed preference, biases, presuppositions, will all shape future actions guided by AI. This is dangerous for many reasons. Those preferences, etc. may be bad or wrong. But also dangerous, as you note, the range of possible options will be defined by the AI response as the "right" response. This will make decision-making brittle, stereotyped, uniform. Dangerous!

1 more comment...

No posts

Ready for more?