Event-based Computer Vision

Members: Michael Pfeiffer, Ryad Benjamin Benosman

An event-based, learning tracker of body parts


Our project was motivated in part by the body-part (skeleton) tracker supplied with the open-source libraries for the Microsoft Kinect sensor. That algorithm uses the depth map provided by the Kinect, together with prior knowledge about the human body, to map the sensor's 3D information onto a dynamic model of an observed human body. It can, for example, extract the lower and upper arms or legs and track their movement over time. This algorithm has been used successfully in the CogRob project, but our motivation was to create a neuromorphic equivalent that uses a fast, low-data-rate silicon retina instead of the Kinect's infrared laser sensor, which consumes a lot of power and produces huge amounts of data that need to be processed.

In our approach we decided to implement a small, event-based visual system with three interconnected layers. The first layer consists of multiple small circular regions that follow movements within their receptive fields. In the second layer, connections between these regions are learned based on which regions move together. In the third layer, finally, 2D regions modeled as multivariate Gaussians with arbitrary covariance matrices are tracked, and the algorithm tries to establish virtual "joints" between regions without being told about the final expected model.

The following image shows a snapshot of the events recorded by the DVS for an artificial joint that we constructed as a test case for the system. One can see that essentially only the edges of the two "arms" are detected as the object moves.

Raw input from the DVS, viewing a moving joint

Low-level Tracking

In the lowest level, we used between 50 and 100 units with small, circular Gaussian receptive fields that change their positions based on the events they receive within their receptive fields and on competition between nearby regions. Whenever a unit detects an event within its receptive field, it shifts the center of the receptive field slightly in the direction of that event. If this would bring two units closer to each other than some threshold, the receptive field centers do not move. In addition, we randomly reposition units that do not receive any input in their receptive field for more than 500 ms.
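The low-level update rule can be written compactly. The following is a minimal sketch of the mechanism just described; the learning rate, minimum separation, and sensor size are illustrative assumptions (only the 500 ms timeout comes from the text), and the class and function names are our own:

```python
import math
import random

# Illustrative parameter choices; only TIMEOUT (500 ms) is from the text.
LEARNING_RATE = 0.1    # fraction of the distance moved per event
MIN_SEPARATION = 4.0   # pixels; a unit will not move closer than this to another
TIMEOUT = 0.5          # seconds without input before random repositioning
SENSOR_SIZE = 128      # assumed DVS resolution

class LowLevelUnit:
    def __init__(self, x, y, radius=5.0):
        self.x, self.y = x, y        # receptive field center
        self.radius = radius         # circular receptive field
        self.last_event_time = 0.0

    def covers(self, ex, ey):
        return math.hypot(ex - self.x, ey - self.y) <= self.radius

def process_event(units, ex, ey, t):
    """Shift every unit that sees the event slightly toward it."""
    for u in units:
        if not u.covers(ex, ey):
            continue
        # proposed move toward the event
        nx = u.x + LEARNING_RATE * (ex - u.x)
        ny = u.y + LEARNING_RATE * (ey - u.y)
        # competition: stay put if the move would crowd a neighboring unit
        too_close = any(
            v is not u and math.hypot(nx - v.x, ny - v.y) < MIN_SEPARATION
            for v in units
        )
        if not too_close:
            u.x, u.y = nx, ny
        u.last_event_time = t

def reposition_idle(units, t):
    """Randomly reposition units that saw no events for TIMEOUT seconds."""
    for u in units:
        if t - u.last_event_time > TIMEOUT:
            u.x = random.uniform(0, SENSOR_SIZE)
            u.y = random.uniform(0, SENSOR_SIZE)
            u.last_event_time = t
```

Because each DVS event touches only the few units whose receptive fields cover it, this update is cheap enough to run per event.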

The effect of this scheme is that the units cluster around regions of high activity (i.e. a lot of change in the scene). This is typically enough to resolve individual body parts and track them very quickly. The following snapshot shows the result of the low-level tracking for the artificial joint. One can see that the small regions (each drawn in a distinct color) cluster around the edges that produce the most events.

First level of tracking for the moving joint

Second-level tracking

Whereas the Gaussian units in the lowest level track clusters of events independently of each other, in the second layer we try to establish bonds between regions, keeping regions that belong to the same body part together. For every low-level unit there is a corresponding unit in the second layer, which receives all events from that low-level cell. The second-layer units are connected by an all-to-all connectivity matrix, which is trained with a Hebbian (or symmetric STDP) rule: whenever two neurons in the second layer fire in close succession (irrespective of the order), we strengthen the synapse between them; otherwise we reduce the synaptic weight. Second-layer units with a strong weight between them move slightly towards each other. The synaptic strength can thus be thought of as a mechanical spring force between the receptive field centers, proportional to the weight.
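A minimal sketch of this second layer, assuming a fixed coincidence window and per-event update magnitudes that are illustrative rather than taken from the project:

```python
import numpy as np

# Illustrative assumptions: coincidence window, update sizes, spring rate.
COINCIDENCE_WINDOW = 0.02  # s; spikes closer in time than this are "coincident"
W_INC = 0.05               # weight increase for a coincident pair
W_DEC = 0.001              # weight decrease otherwise
SPRING_RATE = 0.01         # attraction step per unit of weight

class SecondLayer:
    def __init__(self, positions):
        self.pos = np.asarray(positions, dtype=float)  # (N, 2) unit centers
        n = len(self.pos)
        self.w = np.zeros((n, n))          # symmetric all-to-all weights
        self.last_spike = np.full(n, -np.inf)

    def on_spike(self, i, t):
        """Unit i fires at time t: symmetric Hebbian update of its weights."""
        coincident = (t - self.last_spike) < COINCIDENCE_WINDOW
        coincident[i] = False
        # potentiate coincident pairs, depress all others; keep weights in [0, 1]
        self.w[i, :] = np.clip(
            self.w[i, :] + np.where(coincident, W_INC, -W_DEC), 0.0, 1.0)
        self.w[:, i] = self.w[i, :]        # keep the matrix symmetric
        self.last_spike[i] = t

    def apply_springs(self):
        """Pull connected units together; force proportional to the weight."""
        for i in range(len(self.pos)):
            pull = (self.w[i, :, None] * (self.pos - self.pos[i])).sum(axis=0)
            self.pos[i] += SPRING_RATE * pull
```

With this rule, units that repeatedly fire together accumulate weight and are drawn toward each other, while rarely co-active pairs decay back to zero.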

Tracking in the first and second levels typically finds body regions that move in sync with each other. However, the connections move and change rapidly, so it is still tricky to find complete body regions and/or joints between two regions. The image shows the result of the second- and low-level tracker for the artificial joint after the connection weights have been learned. The olive-green circles represent the low-level trackers, and a red line is drawn between the centers of any two regions connected with a weight exceeding a threshold of 0.5. One can see that the lines connect locally and that meaningful clusters are found.

Second level of tracking for the moving joint

High-level tracking

In the (currently) highest layer, we want to extract the shape of larger regions and learn the relationships between regions that are attached to each other via a joint. We use a small number of regions (3 in the example image) that learn their positions in exactly the same manner as the lowest-level regions (except that input spikes come from the second layer, not directly from the DVS): whenever an event occurs within a region's receptive field, the center is moved slightly towards that event. In addition, each region learns a covariance matrix for a 2D multivariate Gaussian distribution, which is also updated online by each event. Furthermore, each region repels the other regions in the direction of its main axis.
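The per-event mean and covariance update can be sketched as an exponential moving average; the mixing rate below is an illustrative assumption, not the value used in the project:

```python
import numpy as np

ALPHA = 0.01  # illustrative per-event update rate

class HighLevelRegion:
    def __init__(self, mean):
        self.mean = np.asarray(mean, dtype=float)  # (2,) region center
        self.cov = np.eye(2) * 25.0                # start as a small circle

    def update(self, event_xy):
        """Move the mean toward the event and blend the event's outer
        product into the covariance (online exponential average)."""
        e = np.asarray(event_xy, dtype=float)
        d = e - self.mean
        self.mean += ALPHA * d
        self.cov = (1 - ALPHA) * self.cov + ALPHA * np.outer(d, d)

    def main_axis(self):
        """Unit vector along the region's principal axis, i.e. the
        eigenvector of the covariance with the largest eigenvalue."""
        vals, vecs = np.linalg.eigh(self.cov)
        return vecs[:, np.argmax(vals)]
```

The main axis computed this way is what the regions use to repel each other, and (below) to place joints.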

One problem with learning high-level regions this way was that, under some circumstances, regions would quickly change their position and shape completely, e.g. when a limb does not move for some time (and thus produces no events). We therefore introduced the concept of joints into the algorithm: whenever the probability mass of the intersection of two regions exceeds a threshold, we declare them connected, and the intersection of their main axes is the joint. Once two regions are connected, instead of purely translating a region when its position and covariance are updated, we also rotate it around the joint. Furthermore, whenever a region translates, it drags the regions connected to it via a joint in the same direction. This mechanism helped to stabilize the high-level region extraction, although more parameter tuning would have been required for better performance.
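Placing the joint at the intersection of the two main axes reduces to 2D line intersection: each axis is the line through a region's mean along its principal eigenvector. A sketch (the function name is ours; the degeneracy threshold is an illustrative assumption):

```python
import numpy as np

def axis_intersection(mean1, axis1, mean2, axis2):
    """Solve mean1 + s*axis1 == mean2 + t*axis2 for the joint position.
    Returns None if the axes are (nearly) parallel, in which case no
    well-defined joint exists."""
    A = np.column_stack([axis1, -axis2])
    if abs(np.linalg.det(A)) < 1e-9:  # illustrative degeneracy threshold
        return None
    s, _ = np.linalg.solve(A, np.asarray(mean2, float) - np.asarray(mean1, float))
    return np.asarray(mean1, float) + s * np.asarray(axis1, float)
```

Once the joint position is known, updates can rotate a region about this point instead of translating it freely, which is what kept the regions from drifting when a limb stopped producing events.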

The following image shows the results obtained for the artificial joint. One can see that three meaningful regions (in terms of position and shape) are extracted (two for the arms and one for the joint), and that meaningful joint connections are learned (represented by the extensions of the main axes and their intersection points).

Highest level, showing the extracted regions and identified joints

Summary and Outlook

We are very satisfied with the results we obtained during the three weeks and think that this is a promising approach for the future. Clearly, the algorithm can be improved, but we have shown as a proof of concept that a simple, very fast, neuromorphic algorithm can achieve body-part tracking with input from a silicon retina instead of an expensive 3D depth sensor. The low-level and second-level algorithms also perform very well on videos of full-body movement, provided the background is not too noisy. A nice feature of the algorithm is that it never uses prior knowledge of the human body structure, so the same algorithm can be used for tracking arbitrary body shapes (e.g. animals, robots). On the other hand, including prior knowledge could potentially improve the performance of our algorithm significantly. We will continue to explore these questions and to pursue the fruitful collaboration that started at the Telluride workshop.