Landmark recognition from ATIS data

Participants: Ching Teo, Cornelia Fermuller, Francisco Barranco


We propose to detect object-specific contours from DVS and intensity data obtained from the ATIS camera. A random forest object-specific boundary classifier is trained for two specific landmarks/objects: "Bottles" (indoors) and "Cars" (outdoors). Training is done offline using publicly available datasets of annotated RGB still images in which contours of the target classes are labeled. Inference is applied directly to the ATIS intensity data, modulated by the motion boundaries estimated from DVS events.


The input is an ATIS intensity image "frame" together with the motion boundaries derived from the DVS events. More details about the motion boundary estimation can be found at Motion segmentation from ATIS data. The output is a set of detections (indicated via bounding boxes) of the predicted target's location and size. To this end, a random forest object-specific boundary classifier is trained for the two landmark classes:

Positive patches of size (30x30) are sampled from annotated boundaries, while negative patches are sampled from other regions. For "Bottles", we trained a forest of 20 trees, each tree trained with 1000 positive and negative patches. For "Cars", we trained a forest of 20 trees with 800 positive and negative patches. Each tree is trained to discriminate between K=20 classes of contours, obtained beforehand via K-means clustering of the positive training patches. To preserve structure, we store structured edge information at the nodes of each trained tree, together with simple edge-based features: gradient histograms and gradient channels. See the Further work section on embedding mid-level patterns to improve the discriminative power of the features used.
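The clustering step above can be sketched as follows. This is a minimal, self-contained K-means over flattened 30x30 boundary patches; the function name and the random toy patches are hypothetical, not from the actual training code.

```python
import numpy as np

def cluster_contour_patches(patches, k=20, iters=10, seed=0):
    """Group 30x30 boundary patches into k contour classes with K-means."""
    rng = np.random.default_rng(seed)
    X = patches.reshape(len(patches), -1).astype(float)  # flatten patches to vectors
    centers = X[rng.choice(len(X), k, replace=False)]    # random initial centers
    for _ in range(iters):
        # assign each patch to its nearest cluster center (squared L2 distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute each center as the mean of its assigned patches
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

# Toy example: 200 random binary "boundary patches" grouped into 20 classes
patches = np.random.default_rng(1).random((200, 30, 30)) > 0.9
labels, centers = cluster_contour_patches(patches)
```

The resulting cluster labels define the K=20 contour classes that each tree learns to discriminate (plus a non-boundary class at inference time).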

Using the trained forest, (30x30) test patches from an ATIS intensity image are classified into the K=20+1 (non-boundary) classes. Responses over the 20 trees are averaged to derive the object-specific boundary response per pixel. We use an efficient structured inference procedure from P. Dollar's toolbox, which returns results in ~0.05s for a standard ATIS intensity image (240x304). Larger responses correspond to boundaries that are closer to the target. Motion boundaries obtained from the DVS are then used as a mask to select valid edge responses. For the indoor sequences (bottles), we consider only translation. In addition to motion boundary detection, the motion process also returns, for many of the boundaries, which side is in front and which behind (ordinal depth); we use this information to obtain object masks.
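The averaging and masking steps can be sketched as below, assuming the per-tree responses and the DVS motion-boundary mask are already available as arrays (names are hypothetical):

```python
import numpy as np

def masked_boundary_response(tree_responses, motion_mask):
    """Average per-tree boundary responses and keep only pixels on motion boundaries.

    tree_responses: (T, H, W) per-tree boundary probabilities in [0, 1]
    motion_mask:    (H, W) binary mask of motion boundaries from DVS events
    """
    avg = tree_responses.mean(axis=0)  # average over the 20 trees
    return avg * motion_mask           # suppress responses off motion boundaries

# Toy example with an ATIS-sized frame (240x304) and a 20-tree forest
rng = np.random.default_rng(0)
resp = rng.random((20, 240, 304))
mask = rng.random((240, 304)) > 0.5
E = masked_boundary_response(resp, mask)
```

Pixels outside the motion-boundary mask are zeroed, so only edges supported by DVS motion evidence survive into the detection stage.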

To estimate the location and scale of the targets, we apply max pooling of the edge responses over (15x15) patches and threshold them at 0.5. The thresholded responses are binarized and grouped to localize the target. We then rank and score each detection via the mean of its max-pooled edge responses.
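A rough sketch of this detection step: max-pool the masked edge responses over non-overlapping 15x15 blocks, threshold at 0.5, group the surviving blocks, and score each group by its mean pooled response. SciPy's connected-component labelling stands in for the grouping step here; the exact grouping used in the pipeline may differ.

```python
import numpy as np
from scipy import ndimage

def detect(edge_resp, pool=15, thresh=0.5):
    H, W = edge_resp.shape
    h, w = H // pool, W // pool
    # max pooling over non-overlapping (pool x pool) blocks
    pooled = edge_resp[:h * pool, :w * pool].reshape(h, pool, w, pool).max(axis=(1, 3))
    binary = pooled > thresh                       # binarize at the threshold
    labels, n = ndimage.label(binary)              # group adjacent blocks
    boxes = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        comp = labels[sl] == i
        score = pooled[sl][comp].mean()            # mean of max-pooled responses
        # convert block slices back to pixel coordinates (y0, x0, y1, x1)
        boxes.append((score, sl[0].start * pool, sl[1].start * pool,
                      sl[0].stop * pool, sl[1].stop * pool))
    return sorted(boxes, reverse=True)             # best-scoring detection first

# Toy example: one bright blob in an ATIS-sized response map yields one box
resp = np.zeros((240, 304))
resp[60:120, 100:180] = 0.9
dets = detect(resp)
```

The returned list is already ranked by score, matching the blue/green/red ordering used in the result figures.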


We illustrate the inference process with two figures per result:

  • Left: (Top Row): Input ATIS intensity image, Denoised ATIS intensity. (Bottom Row): Estimated motion boundaries, Predicted object-specific boundaries.
  • Right: Landmark detections, ranked by score: blue is best, followed by green, red, etc.

Bottles (indoors)

We show two results here, one without depth estimates and one with depth estimates.

  • Seq 1:
  • Without Ordinal Depth estimation

[Figures: bottle_seq1_process, bottle_seq1_detRes]

  • With Ordinal Depth estimation

  • Seq 2:
  • Without Ordinal Depth estimation

  • With Ordinal Depth estimation

Cars (outdoors)

  • Seq 1:

  • Seq 2:

  • Seq 3:

  • Seq 4:

  • Seq 5:

  • Seq 6:

Further work

The main limitation of the current approach is that decisions are made locally per patch, via local features computed at each patch. Further work aims to embed more global information per patch, via mid-level patterns related to certain Gestalt grouping principles. During the workshop, we developed Gestalt-like mid-level operators that are sensitive to the following patterns: closure, radial lines, spirals, hyperbolas, and parallel lines. It would also be interesting to see whether appropriate features can be extracted directly from DVS data for direct recognition over DVS events.