Problem Description and Data Collection:

The aim of this project is to implement an event-based hand detection algorithm. This will be done using a spiking convolutional neural network trained on DVS data.

Using a DVS sensor, we recorded the actions executed by the subjects for the action prediction task. In these recordings, participants always move their hand from an initial position until they grasp an object, and then they perform some action with it.

The time at which each of these stages begins was manually labeled.

Between the initial position and the contact point, the hand is the only moving element in the scene, and it is therefore relatively easy to track. In our case, we used a set of Gaussian blob trackers to follow it. Each of these trackers tends to follow a region of the hand, so their mean provides an estimate of the position of the hand in the image.
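The tracking step can be sketched as follows. This is a minimal illustration, not the project's actual tracker: each Gaussian blob is approximated here by a hard circular region of influence, and the class name, radius, and update rate are all assumptions.

```python
class BlobTracker:
    """Simplified blob tracker: follows nearby events with an
    exponential moving average of their positions."""

    def __init__(self, x, y, radius=15.0, alpha=0.05):
        self.x, self.y = x, y
        self.radius = radius  # events farther than this are ignored (illustrative)
        self.alpha = alpha    # update rate of the running mean (illustrative)

    def update(self, ex, ey):
        # Only claim events inside this tracker's region of influence
        if (ex - self.x) ** 2 + (ey - self.y) ** 2 <= self.radius ** 2:
            self.x += self.alpha * (ex - self.x)
            self.y += self.alpha * (ey - self.y)
            return True
        return False


def hand_estimate(trackers):
    """The mean of the tracker centers estimates the hand position."""
    n = len(trackers)
    return (sum(t.x for t in trackers) / n,
            sum(t.y for t in trackers) / n)
```

Feeding the DVS event stream through `update` pulls each tracker toward the cloud of events it claims, and `hand_estimate` averages the tracker centers into a single position.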

We next take 60x60 pixel windows centered on the mean of these trackers. This gives us positive examples of hands with which to train our convolutional neural network:
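Window extraction can be sketched as below; the event tuple layout `(x, y, t, polarity)` is an assumed DVS event format, not necessarily the one used in the project.

```python
def crop_window(events, cx, cy, size=60):
    """Collect events falling inside a size x size window centered on
    (cx, cy), re-expressed in window-local coordinates.

    `events` is a list of (x, y, t, polarity) tuples (assumed format).
    """
    half = size // 2
    window = []
    for x, y, t, p in events:
        if cx - half <= x < cx + half and cy - half <= y < cy + half:
            # shift to window-local coordinates in [0, size)
            window.append((x - (cx - half), y - (cy - half), t, p))
    return window
```

Applying this at the tracker-mean position for each labeled frame yields the positive training examples.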

Taking events from outside this 60x60 window, we generate the negative examples, choosing positions in which the rate of events is higher than some threshold:
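A minimal sketch of this negative mining step follows; the stride and event-count threshold are illustrative values, not those used in the project.

```python
def mine_negatives(events, hand_cx, hand_cy, width, height,
                   size=60, stride=30, threshold=100):
    """Return window centers that (a) do not overlap the hand window and
    (b) contain at least `threshold` events.

    `events` is a list of (x, y, t, polarity) tuples (assumed format);
    `stride` and `threshold` are illustrative parameters.
    """
    half = size // 2
    negatives = []
    for cx in range(half, width - half + 1, stride):
        for cy in range(half, height - half + 1, stride):
            # skip candidate windows overlapping the positive (hand) window
            if abs(cx - hand_cx) < size and abs(cy - hand_cy) < size:
                continue
            count = sum(1 for x, y, *_ in events
                        if abs(x - cx) < half and abs(y - cy) < half)
            if count >= threshold:
                negatives.append((cx, cy))
    return negatives
```

Only busy regions pass the threshold, so the negatives are moving non-hand patches rather than empty background.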

The Method:

Using the training data obtained as explained in the previous section, we have trained a convolutional neural network.


The network was quite successful on the training data, achieving around 1% error.

The qualitative results on the testing data were mixed. The three images below show three cases: the classifier successfully detecting the hand when it was not grasping an object, the classifier being reasonably successful at detecting the hand when grasping an object, and the classifier being unsuccessful at detecting the hand when grasping an object. In each image, the left panel shows the filtered DVS data at that time (the input to the classifier), and the right panel shows the output of the classifier (where red indicates hand, green indicates not hand, and blue indicates that the classifier was not run on that region due to an insufficient number of spikes). Note that for the classifier output, the classifier was run on 60x60 pixel regions, with adjacent regions having their centers offset by three pixels. The right panel therefore indicates the classification result when the classifier was centered on the indicated pixel.

Successful hand detection in the absence of an object

Reasonably successful hand detection in the presence of an object

Unsuccessful hand detection in the presence of an object
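The sliding-window evaluation described above can be sketched as follows. The classifier interface and the minimum-spike threshold are assumptions; only the window size (60x60) and stride (3 pixels) come from the text.

```python
def classify_frame(frame, classifier, size=60, stride=3, min_spikes=50):
    """Slide a window classifier over a 2-D spike-count frame.

    `frame` is a list of rows of spike counts; `classifier` maps a
    size x size patch to True (hand) or False (not hand).  Windows with
    fewer than `min_spikes` total spikes are skipped (label None, shown
    in blue in the figures).  `min_spikes` is an illustrative threshold.
    """
    h, w = len(frame), len(frame[0])
    half = size // 2
    labels = {}
    for cy in range(half, h - half, stride):
        for cx in range(half, w - half, stride):
            patch = [row[cx - half:cx + half]
                     for row in frame[cy - half:cy + half]]
            total = sum(sum(row) for row in patch)
            if total < min_spikes:
                labels[(cx, cy)] = None        # too few spikes: not run
            else:
                labels[(cx, cy)] = classifier(patch)
    return labels
```

Adjacent window centers are three pixels apart, so each pixel in the output map reflects the classifier centered at that location, as in the right panels above.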

Overall, the classifier was reasonably successful at detecting the hand when it was not grasping an object, and much less successful when it was. There are two explanations for this. First, we simply did not train the classifier on frames where there was a moving object in the scene, since it was much easier to segment and label the hand in the absence of objects. Training with better distractors (i.e. more examples of non-hand objects) would certainly increase the accuracy. Second, detecting the hand in the absence of objects is a much easier task, since regions with high spike counts typically correspond to the hand, it being the main source of motion in the scene. Thus, the classifier can "cheat" a bit and simply look for areas with high spike rates. Again, better distractors should help address this problem.