Neuromorphic Body-part Tracking

Participants: Michael Pfeiffer, Ryad Benjamin Benosman

Controlling electronic devices with arm, hand, or body gestures is becoming an increasingly important topic for consumer electronics. It became especially popular with the arrival of Microsoft's Kinect sensor, which uses a patterned-light (structured-light) approach with infrared lasers. The neuromorphic Dynamic Vision Sensor (DVS) offers a fast and energy-efficient alternative for gesture recognition. However, recognizing body parts from DVS input is still an unsolved and very challenging problem: classical computer vision approaches fail on DVS input because the sensor reports only temporal log-intensity changes at each pixel.

One of the main problems is that the DVS can only see moving parts, so static body parts (e.g. the torso or head during arm movements) become invisible. The following picture shows a humanoid robot and a snapshot of the DVS input during arm movements. Only the arms are visible; the rest of the robot body is not.

Humanoid robot and DVS picture

For this project we extended our work from Telluride 2011, in which we built a prototype of a multi-layer general-purpose tracking algorithm. The system consisted of three layers. The first layer tracks small circular regions of strong pixel activity. The second layer connects first-layer regions that are spatially close and have synchronous activity, which indicates that they belong to the same moving part. These connections are learned with a simple spike-timing based learning rule. The third layer performs a hierarchical clustering of regions that are strongly connected in the second layer. No part of the structure is pre-defined; all connections between regions or body parts are learned online.

This year we implemented several improvements to the first and second layers in order to stabilize the results. We also started to adapt the third layer of the algorithm to match regions to a simple pre-defined body model with joints. In addition, we collected a valuable dataset of arm and body gestures performed by a humanoid robot, which yields more reproducible data than gestures performed by humans.

Humanoid-robot dataset

We collected an extensive dataset of arm and body gestures performed by a humanoid robot and recorded with a single DVS (and in addition with the u-Doppler sensor from JHU, see other project description). The idea was that these gestures are quite reproducible and can therefore be used as a standard benchmark for spike-based classifiers of gestures from DVS input. We recorded 13 different gestures with 10 trials each, and in addition some random movement sequences that can serve as test examples. The following picture shows DVS snapshots of the different recorded gestures. Most of the body is invisible, and only moving parts appear in the DVS recordings.

Picture of all robot gestures

Low-level Tracking

We replaced the circular trackers with elliptical ones, which not only follow the position of moving parts but also adapt to the shape of the object they are following. Each region is modeled as a 2D Gaussian with an arbitrary covariance matrix, which represents its receptive field. Whenever an event falls into the receptive field of a tracker, the tracker's mean is shifted slightly in the direction of that event, and its covariance matrix is adapted by online learning. If two regions overlap too much, they repel each other. The degree of overlap is measured by the Mahalanobis distance between one region and the center of the other, which also accounts for the shape of the receptive field. This modification resulted in large improvements: along linear edges, for example, the regions adapt their shape, so far fewer regions are needed than with the simple circular trackers from last year. This not only improves the tracking but also reduces computation time, because the number of regions could be reduced by almost 50 percent for most problems. In addition, we added a maximum lifetime for trackers: regions that do not receive events for a long time (more than 1 second) are randomly resampled. This is particularly important in the beginning, when the regions are not yet initialized to meaningful positions.
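The tracker update described above can be illustrated with a minimal sketch. This is not the code used in the project: the class and function names, the learning rates, the gate of 3 Mahalanobis units, the variance floor `eps`, and the repulsion step size are all assumptions chosen for illustration. Only the overall scheme (Gaussian receptive field, online mean and covariance adaptation, Mahalanobis-based repulsion, resampling after 1 second without events) follows the text.

```python
import math
import random


class EllipticTracker:
    """One region modeled as a 2D Gaussian: mean [x, y] and a 2x2 covariance."""

    def __init__(self, mean, cov, lr_mean=0.05, lr_cov=0.02, gate=3.0, eps=0.05):
        self.mean = list(mean)
        self.cov = [list(cov[0]), list(cov[1])]
        self.lr_mean = lr_mean  # assumed learning rate for the mean
        self.lr_cov = lr_cov    # assumed learning rate for the covariance
        self.gate = gate        # assumed receptive-field size (Mahalanobis units)
        self.eps = eps          # variance floor, keeps the covariance invertible
        self.last_event_t = 0.0

    def mahalanobis(self, x, y):
        """Mahalanobis distance of the point (x, y) from this region's center."""
        dx, dy = x - self.mean[0], y - self.mean[1]
        (a, b), (c, d) = self.cov
        det = a * d - b * c
        # Quadratic form with the closed-form inverse of a 2x2 matrix.
        q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
        return math.sqrt(max(q, 0.0))

    def update(self, x, y, t):
        """Adapt mean and covariance if the event at (x, y) is inside the field."""
        if self.mahalanobis(x, y) > self.gate:
            return False
        dx, dy = x - self.mean[0], y - self.mean[1]
        # Shift the mean slightly toward the event.
        self.mean[0] += self.lr_mean * dx
        self.mean[1] += self.lr_mean * dy
        # Exponential moving average of the outer product of the offset.
        k = self.lr_cov
        sxx = (1 - k) * self.cov[0][0] + k * (dx * dx + self.eps)
        sxy = (1 - k) * self.cov[0][1] + k * dx * dy
        syy = (1 - k) * self.cov[1][1] + k * (dy * dy + self.eps)
        self.cov = [[sxx, sxy], [sxy, syy]]
        self.last_event_t = t
        return True


def repel(a, b, min_dist=1.5, step=0.5):
    """Push b away from a if b's center lies too deep inside a's field."""
    if a.mahalanobis(b.mean[0], b.mean[1]) >= min_dist:
        return False
    dx, dy = b.mean[0] - a.mean[0], b.mean[1] - a.mean[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        dx, norm = 1.0, 1.0  # coincident centers: pick an arbitrary direction
    b.mean[0] += step * dx / norm
    b.mean[1] += step * dy / norm
    return True


def resample_if_stale(tracker, t, width, height, max_age=1.0, rng=random):
    """Move a tracker to a random position if it saw no events for max_age s."""
    if t - tracker.last_event_t <= max_age:
        return False
    tracker.mean = [rng.uniform(0, width), rng.uniform(0, height)]
    tracker.last_event_t = t
    return True
```

Feeding such a tracker events that all lie along a horizontal edge makes the variance along x grow while the variance along y shrinks toward the floor, which is the shape adaptation that lets a single region cover a line segment.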

This tracking is already very powerful and works for a variety of tasks, as shown in the following picture. We track line features in an artificial grid, a moving hand, a moving human body, and the moving body of the humanoid robot. None of this requires a model of the stimulus; all shapes are learned online from DVS input.

Tracking results with low-level trackers

In summary, this improvement yields a much more powerful general-purpose tracking mechanism for the DVS, which can be used for a variety of tasks, from fast object tracking to object or gesture recognition.

Shape extraction

As the next step we wanted to extract shapes from the tracked low-level features in the DVS input. The idea is that nearby regions that receive input events at similar times are very likely connected and part of the same structure (e.g. one limb). The timing information distinguishes features that are spatially close but do not always move together (e.g. upper and lower arm). The connection strength between two tracked regions is determined by a spike-timing based learning rule, where a low-level tracker emits a spike whenever it is updated. If two regions are within a certain (Mahalanobis) distance of each other and receive spikes within a given time window, their connection is strengthened. When they are far apart or fire out of sync, their connection is weakened. The connection strength is used to pull connected regions together and to push regions that should not be connected away from each other.
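A minimal sketch of such a connection rule is given below. The function names, the soft-bounded update toward +1 or -1, and all parameter values (distance threshold, time window, learning rate, gain) are assumptions for illustration; the text only specifies when a connection is strengthened or weakened and that the weight drives attraction or repulsion.

```python
import math


def update_connection(w, dist, dt, max_dist=3.0, window=0.01, lr=0.1):
    """Return the new connection weight in [-1, 1].

    w     -- current weight between two regions
    dist  -- (Mahalanobis) distance between the two regions
    dt    -- time difference between their most recent spikes, in seconds
    """
    if dist <= max_dist and abs(dt) <= window:
        w += lr * (1.0 - w)    # close and in sync: strengthen toward +1
    else:
        w += lr * (-1.0 - w)   # far apart or out of sync: weaken toward -1
    return max(-1.0, min(1.0, w))


def connection_step(pos_a, pos_b, w, gain=0.1):
    """Displacement for region a: toward b if w > 0, away from b if w < 0."""
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return (0.0, 0.0)
    return (gain * w * dx / norm, gain * w * dy / norm)
```

With these assumed defaults, two regions at distance 1 that spike 1 ms apart move their weight from 0 to 0.1, while regions at distance 10 move it from 0 to -0.1 regardless of timing.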

Compared to last year, we obtained a large improvement by using the new elliptical low-level trackers, and we also added new constraints and features:

  • We added the possibility to fix connections forever if the weight stays above a threshold for a given time (typically 3 seconds).
  • We removed connections if the distance between regions exceeds a threshold.
  • We made connection strengths bistable, i.e. weights slowly converge to the -1 / +1 limit, even in the absence of further learning events. This makes connections more stable.
  • We added the possibility of having multiple blocks of regions, which are always strongly connected, and never strongly connected to other blocks. This should help future models that incorporate pre-defined fixed body-models.
  • We avoided random movements of strongly connected regions, to improve stability.
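
Some of the constraints listed above (bistable weights, fixing after 3 seconds, and distance-based removal) could be sketched as follows. The cubic drift term, the drift rate, the fixing threshold of 0.8, and the pruning distance are assumptions for illustration, as is the choice that a fixed connection is never pruned; the text only states that weights converge slowly to -1 or +1 and that connections are fixed or removed under the stated conditions.

```python
def bistable_drift(w, dt, rate=0.5):
    """Drift w toward the nearer stable point (-1 or +1) without any events.

    Uses dw/dt = rate * w * (1 - w^2), which has stable fixed points at
    -1 and +1 and an unstable one at 0 (an assumed form of bistability).
    """
    w += dt * rate * w * (1.0 - w * w)
    return max(-1.0, min(1.0, w))


class Connection:
    """Connection between two regions with fixing and pruning (a sketch)."""

    def __init__(self, w=0.0):
        self.w = w
        self.fixed = False
        self.time_above = 0.0  # how long w has stayed above the threshold

    def step(self, dt, dist, fix_threshold=0.8, fix_after=3.0, prune_dist=10.0):
        """Advance by dt seconds; return False if the connection is removed."""
        if self.fixed:
            return True            # assumed: fixed connections are never pruned
        if dist > prune_dist:
            return False           # regions too far apart: remove connection
        self.w = bistable_drift(self.w, dt)
        if self.w >= fix_threshold:
            self.time_above += dt
        else:
            self.time_above = 0.0
        if self.time_above >= fix_after:
            self.fixed = True      # weight stayed high long enough: fix forever
            self.w = 1.0
        return True
```

In this sketch a weight that starts slightly positive drifts to +1 on its own, and a weight that starts slightly negative drifts to -1, so weak evidence is eventually resolved into a clear decision.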

The performance of this two-layer algorithm approached the performance of last year's three-layer architecture, although there are still several possible improvements, which probably mostly require parameter tuning.