Given two shapes, $N$ samples are drawn from the edge elements of the shape. There are no specific constraints on these points - they can be either on the internal or the external contour of the object. Moreover, they also need not correspond to keypoints for the shape (such as maxima of curvature, inflection points, etc.), and although desired that the samples be uniform in spacing, this too is not a rigid criterion.
Now, for each point, vectors are generated joining it (the reference point) to all the other points. It is important to note that having a large number of vectors (a large value of $N$) would mean an exact representation of the shape (assuming the contours have a bounded curvature).
A compact descriptor for each sample point is generated by computing a coarse histogram of the relative coordinates of the other points. The reference axis can be chosen to be either absolute or relative, as has been demonstrated in the various experiments the authors do. A log-polar coordinate system is used and this histogram is called the \textit{shape context}. The log-radii and the and the angles are samples into equally spaced bins, and the number of these bins is different varies for different experiments.
Incorporating Invariances:
Translation: Since everything is measure to the points on the object, the shape context is inherently invariant to translation. If however, translation variance is needed, it can be acheieved by using a global frame of reference instead of a local one.
Scale: The authors propose to achieve scale invariance by normalizing all radial distances between $N^2$ point pairs in the shape by dividing by the mean distance $\lambda$. The median length is chosen because of its reasonable robustness to outliers.
Rotation: One of the ways to achieve rotational invariance is by using the axis of least inertia of the object to measure all angles. A more general approach is to measure all the angles relative to the tangent angle at each sample point, making it completely rotation invariant. This, however, is achieved at the expense of loss of discriminative power should rotation-variant local features are to be considered.
Occlusion: Although not completely addressed, the authors posit that using log-distances minimizes the effect of occlusion by placing more dependence on the nearby pixels, and therefore making it more possible for pixels far from the occluded edge to find correspondences.
In order to generate shape correspondences, two criteria must be satisfied - corresponding points must have very similar descriptors, and these correspondences should be unique.
The matching cost $C$ can be written as $$ C = (1-\beta) \ C_S + \beta \ C_A, $$
where $C_S$ and $C_A$ denote the shape context term and the local appearance term respectively. The shape context term $C_S$ represents the $\chi^2$ (chi-squared) distance between the histograms of the two shapes for which correspondences have to be generated, and the local appearance term $C_A$ acts as a measure of the tangent angle dissimilarity, and they are defined as
$$ C_S = \frac{1}{2}\sum_{k=1}^{K}\frac{[g(k) - h(k)]^2}{g(k) + h(k)} $$
$$ C_A = \frac{1}{2} \left\lVert{\begin{pmatrix} cos(\theta_1)\\ sin(\theta_1) \end{pmatrix} - \begin{pmatrix} cos(\theta_2)\\ sin(\theta_2) \end{pmatrix}}\right\rVert $$
The value of the weighting factor $\beta$ can be chosen for each problem, as has been shown in the paper.
The uniqueness criterion mentioned above for the correspondence determination can be addressed by treating this matching cost minimization as a weighted bipartite matching problem, and can be solved in $O(N^3)$ time by the Hungarian method [1].
References:
- Christos H. Papadimitriou and Kenneth Steiglitz. 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.