Paper title
Concurrent Credit Assignment for Data-efficient Reinforcement Learning
Paper authors
Paper abstract
The capability to widely sample the state and action spaces is a key ingredient in building effective reinforcement learning algorithms. The variational optimization principles presented in this paper emphasize the importance of an occupancy model that synthesizes the general distribution of the environmental states over which the agent can act (defining a virtual "territory"). The occupancy model is updated frequently as exploration progresses and as new states are uncovered during the course of training. Under a uniform prior assumption, the resulting objective expresses a balance between two concurrent tendencies, namely the widening of the occupancy space and the maximization of the rewards, reminiscent of the classical exploration/exploitation trade-off. Implemented on an off-policy actor-critic algorithm and evaluated on classic continuous-action benchmarks, it is shown to provide a significant increase in sampling efficacy, reflected in reduced training time and higher returns, in both the dense-reward and sparse-reward cases.
