Incentivizing both Grounding and Reasoning in Large Language Models with Online Reinforcement Learning
Fine-tuned LLaMA-based LLM agents with online reinforcement learning (PPO) in a text-based, multi-step environment (BabyAI-Text) and investigated the effect of encouraging “reasoning-before-action”. In this simple setting, reasoning-before-action did not improve sample efficiency but did offer interpretability advantages. We also observed a “reasoning collapse” phenomenon.
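To make “reasoning-before-action” concrete, here is a minimal Python sketch of how an agent's text output might be split into a reasoning part and an executable action. The `Reasoning:` / `Action:` output format, the `parse_response` helper, and the action list are illustrative assumptions, not details taken from the project itself.

```python
import re

# Assumed action vocabulary in the style of BabyAI-Text; not confirmed by the source.
ACTIONS = ("turn left", "turn right", "go forward", "pick up", "drop", "toggle")

# Hypothetical output format: "Reasoning: <free text>" followed by "Action: <action>".
_PATTERN = re.compile(
    r"Reasoning:\s*(?P<reasoning>.*?)\s*Action:\s*(?P<action>.+)",
    re.DOTALL | re.IGNORECASE,
)


def parse_response(text):
    """Split a model response into (reasoning, action).

    Returns (None, None) if the response does not follow the assumed format,
    and (reasoning, None) if the action is not in the assumed vocabulary.
    """
    match = _PATTERN.search(text)
    if match is None:
        return None, None
    reasoning = match.group("reasoning").strip()
    action = match.group("action").strip().lower().rstrip(".")
    if action not in ACTIONS:
        return reasoning, None
    return reasoning, action


if __name__ == "__main__":
    reply = "Reasoning: The key is directly ahead.\nAction: go forward"
    print(parse_response(reply))
```

Under this assumed format, only the parsed action would be sent to the environment, while the reasoning text can be logged for inspection. Tracking the length of the reasoning string over training is one possible way to monitor how much reasoning the agent emits.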
