The authors propose to reformulate the problem of anatomical landmark detection as a cognitive learning task for an artificial agent. Given a volumetric image $\boldsymbol{I}: \mathbb{Z}^3 \to \mathbb{R}$ and the location of an anatomical structure of interest $p_{GT} \in \mathbb{R}^3$ within $\boldsymbol{I}$, the task can be formulated as learning a navigation strategy towards $p_{GT}$ in the voxel grid. Equivalently, the goal is to find voxel-based navigation trajectories from any arbitrary starting point $p_0$ to a destination point $p_k$ within the image $\boldsymbol{I}$, such that $\|p_k - p_{GT}\|$ is minimized. In the domain of reinforcement learning, this problem can be modelled as a Markov Decision Process (MDP): $$ \mathcal{M} := \left(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma\right) $$ where $\mathcal{S}$ represents the finite set of states; $s_t \in \mathcal{S}$ denotes the state of the agent at time $t$ and is defined as $s_t = \boldsymbol{I}(p_t)$, i.e., the image information observed at the current position $p_t$.

$\mathcal{A}$ denotes the finite set of actions that the agent can perform, and $a_t \in \mathcal{A}$ is the action taken by the agent at time $t$.

$\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the stochastic transition function, where $\mathcal{T}^{s'}_{s,a}$ denotes the probability of arriving in state $s'$ after performing action $a$ in state $s$.

$\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the scalar, distance-based reward function that drives the behaviour of the agent. The expected reward for a state transition $s \to s'$ at time $t$, corresponding to a move $p_t \to p_{t+1}$, is $\mathcal{R}^{s'}_{s,a} = \|p_t - p_{GT}\|^2_2 - \|p_{t+1} - p_{GT}\|^2_2$; the reward is positive when the agent moves closer to the target and negative when it moves away.

$\gamma \in [0,1)$ is the discount factor controlling the importance of future versus immediate rewards.
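To make these components concrete, below is a minimal environment sketch in Python. It is not the authors' implementation: the class name `LandmarkEnv`, the patch-based interpretation of $s_t = \boldsymbol{I}(p_t)$, and the six unit-voxel actions are illustrative assumptions.

```python
import numpy as np

# Assumed action set A: six unit-voxel moves along the coordinate axes.
ACTIONS = np.array([[1, 0, 0], [-1, 0, 0],
                    [0, 1, 0], [0, -1, 0],
                    [0, 0, 1], [0, 0, -1]])

class LandmarkEnv:
    """Toy MDP for landmark navigation in a volume I (illustrative only)."""

    def __init__(self, volume, p_gt, patch_size=25):
        self.I = volume                       # I : Z^3 -> R, stored as a 3-D array
        self.p_gt = np.asarray(p_gt, float)   # ground-truth landmark location p_GT
        self.half = patch_size // 2
        self.p = None

    def _state(self, p):
        # s_t = I(p_t), interpreted here as a local patch centred at p_t
        # (boundary handling omitted for brevity).
        x, y, z = p.astype(int)
        h = self.half
        return self.I[x - h:x + h + 1, y - h:y + h + 1, z - h:z + h + 1]

    def reset(self, p0):
        self.p = np.asarray(p0, float)        # arbitrary starting point p_0
        return self._state(self.p)

    def step(self, a):
        p_next = self.p + ACTIONS[a]          # deterministic transition T
        # R^{s'}_{s,a} = ||p_t - p_GT||_2^2 - ||p_{t+1} - p_GT||_2^2
        r = (np.sum((self.p - self.p_gt) ** 2)
             - np.sum((p_next - self.p_gt) ** 2))
        self.p = p_next
        done = np.allclose(self.p, self.p_gt)
        return self._state(self.p), r, done
```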

Define the optimal action-value function $Q^*(\cdot, \cdot)$, which represents the maximum expected future discounted reward when starting in state $s$, performing action $a$, and acting optimally thereafter:

$$ Q^{*}(s,a) = \max_\pi \mathbb{E} \left[ R_t \mid s_t = s, a_t = a, \pi \right] $$

where $\pi$ denotes the action policy, and $R_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ denotes the cumulative future discounted reward associated with an arbitrary trajectory, i.e., the sum of the immediate rewards weighted by powers of $\gamma$.
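As a quick illustration of $R_t$, the snippet below computes the discounted return of a recorded trajectory of rewards; the helper function is hypothetical and not part of the paper.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k} for a recorded sequence of immediate rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: three steps that reduce the squared distance to p_GT by 5, 3, and 1.
print(discounted_return([5.0, 3.0, 1.0], gamma=0.9))  # 5 + 0.9*3 + 0.81*1 = 8.51
```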

The optimal action policy defining the optimal behavior of the agent in any state $s$ is given by

$$ \pi^*(s) = \text{argmax}_{a \in \mathcal{A}} Q^{*} (s,a) \quad \forall s \in \mathcal{S} $$
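In a small tabular setting, the optimal policy is literally an argmax over the action-value estimates. The dictionary-based `Q` below is a made-up stand-in for $Q^*$, used only to show the mechanics.

```python
import numpy as np

def greedy_policy(Q, state):
    """pi*(s) = argmax_a Q*(s, a), with Q given as a dict: state -> array of action values."""
    return int(np.argmax(Q[state]))

# Hypothetical example with two states and the six actions defined earlier.
Q = {"s0": np.array([0.1, 0.7, 0.2, 0.0, 0.0, 0.0]),
     "s1": np.array([0.0, 0.0, 0.0, 0.9, 0.1, 0.3])}
print(greedy_policy(Q, "s0"))  # -> 1
```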

The optimal action value function $Q^*$ satisfies the Bellman optimality equation

$$ Q^{*}(s,a) = \sum_{s'} \mathcal{T}^{s'}_{s,a} \left(\mathcal{R}^{s'}_{s,a} + \gamma \max_{a'} Q^{*}(s',a')\right) = \mathbb{E}_{s'} \left[ r + \gamma \max_{a'} Q^{*}(s',a') \right] $$

where $r = \mathcal{R}^{s'}_{s,a}$ is a compact notation for the current, immediate reward. The Bellman equation can therefore be viewed as a contraction mapping, whose repeated application converges to $Q^*$.
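The contraction property can be seen numerically by repeatedly applying the Bellman backup on a small tabular MDP until the values stop changing. The transition and reward tensors below are arbitrary toy values, not derived from the paper.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# Toy MDP: T[s, a, s'] is a transition probability, R[s, a, s'] a reward.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # Q(s,a) <- sum_{s'} T[s,a,s'] * (R[s,a,s'] + gamma * max_{a'} Q(s',a'))
    Q_new = np.einsum("ijk,ijk->ij", T, R + gamma * Q.max(axis=1)[None, None, :])
    if np.max(np.abs(Q_new - Q)) < 1e-8:   # contraction => geometric convergence
        Q = Q_new
        break
    Q = Q_new
print(Q)
```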

Approximating $Q^*$ with a deep neural network parameterized by $\theta$ yields the Deep Q-Network (DQN) formulation. At any training iteration $i$, the target value for the action-value function, computed using a set of reference parameters $\bar{\theta}^{(i)}$ based on a previous iteration $i' \le i$, is $$ y = r + \gamma \, \max_{a'} Q(s',a'; \bar{\theta}^{(i)}) $$

The network parameters at each training step are then obtained by minimizing the expected squared error $$ \hat{\theta}^{(i)} = \text{argmin}_{\theta^{(i)}} \mathbb{E}_{s,a,r,s'} \left[ \left( y - Q(s,a;\theta^{(i)}) \right)^2 \right] $$

This is effectively a supervised regression setup, and therefore stochastic gradient descent can be applied to optimize the network parameters.
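Putting the target and the loss together, one SGD step of a generic DQN looks roughly as follows. This is a minimal PyTorch sketch under assumed shapes (flattened patch states, six actions) and a simple fully connected network; it is not the authors' architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions, state_dim, gamma = 6, 25 ** 3, 0.9   # assumed sizes for a flattened patch

def make_qnet():
    return nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                         nn.Linear(256, n_actions))

q_net = make_qnet()                      # Q(s, a; theta^(i))
target_net = make_qnet()                 # Q(s, a; theta_bar^(i)), frozen reference copy
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def train_step(s, a, r, s_next):
    """One stochastic gradient step on E[(y - Q(s, a; theta))^2]."""
    with torch.no_grad():                # targets use the frozen reference parameters
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with a random mini-batch of 4 transitions (s, a, r, s').
s, s_next = torch.randn(4, state_dim), torch.randn(4, state_dim)
a, r = torch.randint(0, n_actions, (4,)), torch.randn(4)
print(train_step(s, a, r, s_next))
```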

This summary was written in Fall 2018 as a part of the CMPT 880 Special Topics in AI: Medical Imaging Meets Machine Learning course.