Augmented reality

Contributors: Julien Martel, siohoi ieng

Objective & Context

The objective of this sub-project, as part of the MAP group, is to provide "augmented reality" to a user equipped with a Dynamic Vision Sensor mounted on a helmet and a pair of goggles featuring overhead vision.

As a case study, we focused on developing an algorithm to localize a player around a billiard table during a pool game. Further extensions of the work would include detecting balls on the table and displaying game situations (such as game statistics or potential shots) in the user's goggles. In this project, the use of a silicon retina, the ATIS, is motivated by the low-light conditions the player has to face as well as the real-time constraints on the player's localization. Real-time operation is made possible by exploiting the sparsity of the events produced by the sensor, which trigger event-based updates of the localization.

In the following we describe the proposed solution for the localization of the player.

Solution proposed


Our solution assumes that the billiard table is a rectangle with known dimensions; in the following we refer to it as "the model". A second hypothesis is that the ATIS's intrinsic parameters are known. These parameters are determined by a prior calibration and are stored in the matrix $\mathbf{K}$. Our solution exploits the fact that there exists a unique perspective transformation (a collineation) mapping the model to the view. In other words, we can map a planar object appearing in the view (the surface of the table) to a planar model.

Each event that is produced as the player moves around the table and that belongs to one of the edges of the rectangular table can be mapped onto the model and then contributes to updating the collineation. The collineation gives the relation between the model and the view, and estimating it is a necessary step in localizing the player.

To localize the player/ATIS w.r.t. the table, we need the retina's pose w.r.t. a world coordinate frame attached to the table, i.e. the rigid transformation parameters $(\mathbf{R},\vec{T})$ that define the ATIS position and orientation w.r.t. that coordinate frame. If a 3D point on the table $\vec{X}$ is projected onto the ATIS focal plane as $\vec{x}$, then there is a 3-by-4 projection matrix $\mathbf{M}$ such that:

\vec{x} = \mathbf{M}\vec{X},

and $\mathbf{M}=\mathbf{K} (\mathbf{R}\ \vec{T})$. For each new position of the player/camera, we need to estimate the pose $(\mathbf{R},\vec{T})$. This is possible once the collineation mapping the model to the view has been estimated: if the world coordinate frame is defined such that the model lies in the plane $Z=0$, then the following relation holds up to a scale factor:

\mathbf{K}(r1 r2 \vec{T}) \propto \mathbf{P},

where $r1$ and $r2$ are respectively the first and second columns of the rotation matrix $\mathbf{R}$ in $\mathbf{M}$ (i.e. $\mathbf{P}$ is proportional to the submatrix obtained by removing the 3rd column of $\mathbf{M}$). Since $\mathbf{K}$ is known, we can reconstruct the complete $\mathbf{M}$ because we have:

\rho (r1 r2 \vec{T}) = \mathbf{K}^{-1} \mathbf{P} = \mathbf{N},

where $\rho$ is the scale factor.
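The reduction from the 3-by-4 matrix $\mathbf{M}$ to a 3-by-3 collineation can be checked numerically. The following is a minimal sketch using NumPy; the intrinsic matrix, the pose and the scale factor 1.7 are hypothetical values chosen only for illustration, not calibration data from the project.

```python
import numpy as np

# Hypothetical intrinsics and pose, chosen only for illustration.
K = np.array([[200.0, 0.0, 152.0],
              [0.0, 200.0, 120.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(0.6), np.sin(0.6)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, c, -s],
              [0.0, s, c]])            # rotation about the x axis
T = np.array([0.1, -0.2, 3.0])

M = K @ np.column_stack([R, T])        # full 3x4 projection matrix K (R T)
rho = 1.7                              # arbitrary scale: P is only known up to scale
P = rho * K @ np.column_stack([R[:, 0], R[:, 1], T])

# A point of the table plane Z = 0, in homogeneous coordinates.
X = np.array([0.7, -0.4, 0.0, 1.0])
x_from_M = M @ X
x_from_P = P @ np.array([X[0], X[1], 1.0])

# After dividing by the third coordinate both give the same pixel.
pixel_M = x_from_M[:2] / x_from_M[2]
pixel_P = x_from_P[:2] / x_from_P[2]

# Removing the intrinsics leaves N = rho (r1 r2 T).
N = np.linalg.inv(K) @ P
```

Here `pixel_M` and `pixel_P` coincide, and `N` equals `rho` times the matrix built from the first two columns of `R` and `T`.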

To recover the real $r1$, $r2$, $r3$ and $\vec{T}$, we need to estimate $\rho$. This can be done from the first two columns $n1=\rho r1$ and $n2=\rho r2$ of $\mathbf{N}$: the matrix $(n1, n2, n1 \times n2)$ equals $(\rho r1, \rho r2, \rho^2 r1 \times r2)$, where $\times$ stands for the cross product. Since $\mathbf{R}=(r1 r2 r3)$ is a rotation matrix, $r3 = r1 \times r2$ and $\det(n1, n2, n1 \times n2)=\rho^4 \det(r1,r2,r3)=\rho^4$. Hence, $\rho$ is the fourth root of this determinant.

This means that, for each incoming event, we are able to localize the player in the world coordinate frame and to determine the projection matrix $\mathbf{M}=\mathbf{K}(\mathbf{R}\ \vec{T})$. $\mathbf{M}$ is required for the "augmentation" of reality since it is used to reproject virtual structures onto the ATIS focal plane.
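The pose recovery described above can be sketched as follows. This is an illustrative NumPy sketch, not the project's implementation: the function name `pose_from_collineation` is hypothetical, $\mathbf{P}$ is taken to be the model-to-view collineation of the relation $\mathbf{K}(r1 r2 \vec{T}) \propto \mathbf{P}$, and the scale is computed from the determinant of $(n1, n2, n1 \times n2)$, which equals $\rho^4$. Since the fourth root only gives the magnitude of $\rho$, the sketch additionally assumes that the table lies in front of the sensor (positive third component of $\vec{T}$) to fix its sign.

```python
import numpy as np

def pose_from_collineation(K, P):
    """Recover (R, T, M) from a model-to-view collineation P ~ K (r1 r2 T)."""
    N = np.linalg.inv(K) @ P
    n1, n2, n3 = N[:, 0], N[:, 1], N[:, 2]
    n1xn2 = np.cross(n1, n2)
    # det(n1, n2, n1 x n2) = rho^4 when r1 and r2 are orthonormal.
    det = np.linalg.det(np.column_stack([n1, n2, n1xn2]))
    rho = max(det, 0.0) ** 0.25
    # Assumption (not in the original text): T_z > 0 fixes the sign of rho.
    if n3[2] < 0:
        rho = -rho
    R = np.column_stack([n1 / rho, n2 / rho, n1xn2 / rho**2])
    T = n3 / rho
    M = K @ np.column_stack([R, T])    # full 3x4 projection matrix
    return R, T, M

# Usage with a synthetic pose (hypothetical values).
K = np.array([[200.0, 0.0, 152.0],
              [0.0, 200.0, 120.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(0.5), np.sin(0.5)
R_true = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
T_true = np.array([0.1, -0.2, 3.0])
P = -2.5 * K @ np.column_stack([R_true[:, 0], R_true[:, 1], T_true])
R, T, M = pose_from_collineation(K, P)
```

With a noisy collineation the recovered columns are only approximately orthonormal, so a practical implementation would probably re-orthogonalize $\mathbf{R}$.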

Initialization of the algorithm:

To initialize the collineation, we ask the user to provide the intrinsic matrix $\mathbf{K}$ and to click the four corner points of the billiard table. These points are trivially paired with the four corner points of the model if the user clicks them clockwise, starting from one of the short sides of the table.

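We use a direct linear transform (DLT) to find the collineation $\bold{P}$ that optimally (in a least-squares sense) maps points in the view $\vec{x}$ to points in the model $\vec{x}'$. Note that this $\bold{P}$ maps the view to the model, whereas the pose relation $\mathbf{K}(r1 r2 \vec{T}) \propto \mathbf{P}$ given earlier uses the model-to-view direction, so the matrix has to be inverted before it is used there.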


   \vec{x}' = \bold{P} \vec{x}


This collineation $\bold{P}$ is described by a 3x3 matrix.
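A minimal DLT sketch is given here for illustration, assuming NumPy and solving the homogeneous system with a singular value decomposition. The function names, the clicked pixel coordinates and the table dimensions are hypothetical; coordinate normalization, which is usually recommended for numerical stability, is omitted for brevity.

```python
import numpy as np

def dlt_collineation(src, dst):
    """Estimate the 3x3 collineation P with dst ~ P @ src from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, from u = (P e)_0 / (P e)_2
        # and v = (P e)_1 / (P e)_2 with e = (x, y, 1).
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The least-squares solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 3)
    return P / np.linalg.norm(P)       # P is defined up to scale

def apply_collineation(P, point):
    """Map a 2D point through P and divide by the third coordinate."""
    q = P @ np.array([point[0], point[1], 1.0])
    return q[:2] / q[2]

# Hypothetical clicked corners (pixels) and model corners (metres).
view_corners = [(60, 50), (250, 70), (270, 200), (30, 180)]
model_corners = [(0.0, 0.0), (1.27, 0.0), (1.27, 2.54), (0.0, 2.54)]
P = dlt_collineation(view_corners, model_corners)
```

With exactly four correspondences the system has an exact solution; the least-squares aspect only matters when more points are supplied.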

Instead of having the user click on the corners of the table, an event-based segment detector could be used to extract the four corner points in the view and map them to the model.

Event based update:

For each incoming event in the view, we test its distance to the current estimate of the rectangular surface of the table, i.e. to the four segments forming the rectangle in the view. If the incoming event is close enough to one of the segments, it contributes to updating the collineation. Since a billiard table is not deformable, any change in the position and orientation of the table in the view has to result from a movement of the observer (the player).
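The distance test described above can be sketched as follows. The function names and the threshold `max_dist` are hypothetical; the text only states that the event must be "close enough" to one of the four segments.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    # Parameter of the orthogonal projection, clamped to stay on the segment.
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

def closest_edge(event_xy, corners, max_dist):
    """Return (edge_index, distance) for the closest edge of the quadrilateral,
    or None if the event is farther than max_dist from all four edges."""
    n = len(corners)
    dists = [point_segment_distance(event_xy, corners[i], corners[(i + 1) % n])
             for i in range(n)]
    i = int(np.argmin(dists))
    return (i, dists[i]) if dists[i] <= max_dist else None

# Usage with a hypothetical current estimate of the table outline in the view.
corners = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0), (0.0, 5.0)]
near = closest_edge((4.0, 0.5), corners, max_dist=1.0)   # close to edge 0
far = closest_edge((5.0, 2.5), corners, max_dist=1.0)    # rejected
```

Events for which `closest_edge` returns `None` are simply ignored by the update.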

Therefore we can use this event to find a collineation "close enough" to our current estimate that minimizes the orthogonal distance between the event projected through the collineation and the closest edge in the model. We apply a single gradient descent step on the squared orthogonal distance in the model to update the collineation matrix, using the current estimate of the collineation as the starting point.

More formally, for an incoming event $\vec{e}$ and a current estimate of the collineation $\bold{P}^t$, we want to find the next estimate $\bold{P}^{t+1}$ satisfying:


 \bold{P}^{t+1} = \arg\min_{\bold{P}} || \bold{P} \cdot \vec{e} - \mathcal{P}(\bold{P} \cdot \vec{e})||_2^2


where $\mathcal{P}(e)$ is a function returning the orthogonal projection of a point onto the closest edge in the model. Instead of carrying out the full minimization, we take a single gradient descent update of the form:


 \bold{P}^{t+1} = \bold{P}^t - \eta \nabla_{\bold{P}} || \bold{P}^t \cdot \vec{e} -  \mathcal{P}(\bold{P}^t \cdot \vec{e})||_2^2,


where $\eta$ is a "step-size" parameter, $\eta \in [0, 1]$, that is tuned and validated experimentally.
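One way to write this single gradient step is sketched below with NumPy. Several points are assumptions that the text does not specify: the projected event is divided by its third homogeneous coordinate before the distance is measured, the closest model edge is represented by its supporting line $n \cdot m = d$ with unit normal $n$, and the function name `gradient_step` is hypothetical. Under these assumptions the squared orthogonal distance is $(n \cdot m - d)^2$ and its gradient with respect to $\bold{P}$ follows from the chain rule.

```python
import numpy as np

def gradient_step(P, event_xy, normal, offset, eta):
    """One gradient descent step on the squared orthogonal distance between
    the event mapped through P and the model line normal . m = offset.
    Returns the updated collineation and the signed distance before the step."""
    e = np.array([event_xy[0], event_xy[1], 1.0])
    q = P @ e                              # homogeneous point in the model
    m = q[:2] / q[2]                       # Euclidean point in the model
    n = np.asarray(normal, dtype=float)
    r = float(n @ m - offset)              # signed orthogonal distance
    # Derivative of r with respect to q (chain rule through the division).
    dr_dq = np.array([n[0], n[1], -(n @ m)]) / q[2]
    # q_i = sum_j P_ij e_j, hence d(r^2)/dP_ij = 2 r (dr/dq_i) e_j.
    grad = 2.0 * r * np.outer(dr_dq, e)
    return P - eta * grad, r

# Usage: event slightly above the model edge y = 0 (hypothetical values).
P0 = np.eye(3)
P1, r_before = gradient_step(P0, (0.5, 0.1), normal=(0.0, 1.0), offset=0.0, eta=0.1)
_, r_after = gradient_step(P1, (0.5, 0.1), normal=(0.0, 1.0), offset=0.0, eta=0.0)
```

In this example the distance of the mapped event to the edge is smaller after the step than before. When events are expressed in pixel coordinates the gradient can be large, which is one reason the step size has to be tuned.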

With the estimated collineation $\mathbf{P}$, we compute $\mathbf{N}=\mathbf{K}^{-1}\mathbf{P}$ and estimate the scale factor $\rho$ from its first two columns $n1$ and $n2$:

\rho = \sqrt[4]{\det(n1, n2, n1 \times n2)}.

Then we can reconstruct the projection matrix $\mathbf{M}(\vec{e})$ for each new event $\vec{e}$:

\mathbf{M}(\vec{e})=\mathbf{K}(n1/\rho, n2/\rho, (n1 \times n2)/\rho^2, n3/\rho),

where $n1,n2,n3$ are the 1st, 2nd and 3rd columns of $\mathbf{N}$.

$\mathbf{M}(\vec{e})$ allows us to augment the scene by inserting any virtual 3D structure in a perspectively correct way for each incoming event. The event-based augmented reality algorithm is tested on a square displayed on a flat computer screen to simulate the pool table. A virtual parallelepiped is reprojected onto the image plane for each event $\vec{e}$. This result is shown in the figures below:

As shown in the example, the complete square has to be visible at initialization so that the model can be defined manually. The iterative tracking of the model can then continue as long as this plane remains sufficiently visible. This is a critical problem to solve because, in practice, the pool table may be only partially visible.
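The reprojection of a virtual parallelepiped mentioned above can be sketched as follows, assuming NumPy. The helper names, the box dimensions and the projection matrix are hypothetical; in particular, whether the box extends above or below the table depends on the orientation of the world $Z$ axis, which the text does not fix.

```python
import numpy as np

def box_vertices(x0, y0, w, d, h):
    """Eight homogeneous vertices of a box resting on the plane Z = 0."""
    return np.array([[x0 + dx, y0 + dy, dz, 1.0]
                     for dz in (0.0, h)
                     for dy in (0.0, d)
                     for dx in (0.0, w)])

def project(M, X_h):
    """Project Nx4 homogeneous 3D points with the 3x4 matrix M and
    return Nx2 coordinates in the focal plane."""
    x = (M @ X_h.T).T
    return x[:, :2] / x[:, 2:3]

# Usage with a hypothetical projection matrix M = K (R T), R = identity.
K = np.array([[200.0, 0.0, 152.0],
              [0.0, 200.0, 120.0],
              [0.0, 0.0, 1.0]])
M = K @ np.column_stack([np.eye(3), np.array([0.0, 0.0, 2.0])])
pixels = project(M, box_vertices(0.0, 0.0, 1.0, 1.0, 0.5))
```

Recomputing `pixels` with the matrix $\mathbf{M}(\vec{e})$ obtained after each event keeps the virtual object attached to the table as the player moves.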