Motion and Action Processing on Wearable Devices

The main goal of our project was to create new algorithms and tools for vision on wearable devices, e.g. data goggles, with a focus on moving event-based vision sensors like ATIS, DVS, eDVS, and DAVIS. In particular we focused on Visual Simultaneous Localization and Mapping (Visual SLAM), i.e. the generation of a map of the environment from visual cues while the user moves through it, together with the simultaneous localization of the user in that environment from the observed cues. To achieve this goal, we collected new datasets of complex outdoor and indoor scenes with moving event-based vision sensors, created new feature detectors that operate on event streams, identified landmarks (objects) in the input stream of a moving camera, and addressed the problem of segmenting moving objects. In addition, we created new algorithms for wearable computing that can enhance the user experience in data goggles. For example, we created a face detector and a gesture interface that can identify simple commands in a touch-free interface, as well as an augmented reality application for localization in a small environment. The following is a list of sub-projects achieved during the 2014 Telluride workshop. We discuss the project as a whole and its potential impact afterwards.


Overview of wearable sensors

We used a variety of event-based vision sensors for this project, illustrated in Fig. 1. Most of the work was done with the ATIS vision sensor, which provides both events that indicate local relative illumination changes and gray-level information for each firing pixel. For the SLAM scenarios we mounted the sensors on bicycle helmets and had people either walk with them during recordings or ride a bike, which produces more stable recordings. In addition we recorded from sensors mounted on robots, e.g. an ATIS attached to a wheeled Lego robot, or an eDVS sensor embedded in a Pushbot robot from TU Munich. We also performed recordings with the new DAVIS sensor, which provides both events and full gray-level frames; the sensor was attached to a mountain bike while it was ridden on an outdoor bike route. Finally, for the gesture recognition project we used recordings from the Sevilla silicon retina. This means we used data from virtually all major event-based vision sensors available to date. We also planned to incorporate data goggles like the Epson Moverio BT-100, and established a connection between the ATIS and the goggle display for simple augmented reality purposes.

Sensors used in the MAP project

Fig. 1: Wearable vision sensors: (top left) ATIS mounted on mobile Lego robot. (top middle) ATIS mounted on bicycle helmet during outdoor recordings. (top right) eDVS on Pushbot robot. (bottom left) DAVIS mounted on bicycle. (bottom middle) ATIS sensor closeup. (bottom right) Augmented reality setup: ATIS mounted on bicycle helmet and connected to Epson Moverio BT-100 data goggles.

We found that all sensors provided useful data for our project. For subprojects like motion segmentation, object recognition, and SLAM, it proved useful to have the gray-level information provided by the ATIS. For other projects like corner detection, gesture recognition, and object recognition with DBNs, we found that very good results could already be achieved using only DVS events.
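All of these sensors emit their output as an asynchronous stream of address-events rather than as frames, and this stream is what every downstream algorithm in the project consumes. As a minimal illustration, the sketch below defines a hypothetical event record and a trivial stream operation; the field names and polarity convention are illustrative, and real sensor formats (ATIS, DVS, DAVIS) differ in bit layout and metadata.

```python
from dataclasses import dataclass

# Hypothetical minimal representation of an address-event.
# Field names are illustrative; actual sensor formats differ.
@dataclass
class Event:
    x: int         # pixel column
    y: int         # pixel row
    t: float       # timestamp (e.g. microseconds)
    polarity: int  # +1 = brightening, -1 = darkening

def count_on_events(events):
    """Count ON (brightening) events -- a trivial example of
    processing an event stream without ever forming a frame."""
    return sum(1 for e in events if e.polarity > 0)

stream = [Event(10, 20, 1.0, +1), Event(11, 20, 2.5, -1), Event(12, 21, 3.0, +1)]
print(count_on_events(stream))  # → 2
```

Gray-level measurements from the ATIS or DAVIS would appear as additional per-pixel payload alongside such events.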

Project plan

In the following systems diagram we illustrate the relationships between the different subprojects.

Systems diagram of the MAP project

The collection of new benchmark datasets and the creation of new, or adaptation of existing, machine learning algorithms constitute the basis of our project. We systematically collected, for the first time, datasets for event-based vision on wearables. This forms a very useful database, not only for this project, but also for future projects on event-based vision. We also investigated new algorithms that can process events for feature extraction and classification, building on existing event-based machine learning frameworks like H-First and spike-based Deep Belief Networks.

Feature detection is a crucial task for all higher-level components. The detected features either feed directly into SLAM, where the tracking of event-based features is used to estimate the camera position, or they provide a pre-processing of the scene for object recognition and scene segmentation. Due to the high importance of this component, we tried multiple approaches for event-based features, ranging from convolution-based feature detectors and corner detectors to PCA-based detectors that exploit the timing of events.
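As one concrete illustration of how a corner detector can operate directly on events, the sketch below maintains a time surface (the latest event timestamp per pixel) and applies a Harris-style score to the local patch around each incoming event. This is a common pattern in event-based vision, offered as a sketch only; it is not the exact detector developed at the workshop, and the resolution, patch size, and constant k are arbitrary choices.

```python
import numpy as np

W, H = 32, 32
surface = np.zeros((H, W))  # time surface: last event timestamp per pixel

def update_and_score(x, y, t, patch=3, k=0.04):
    """Insert an event into the time surface and return a Harris-style
    score for the surrounding patch (larger = more corner-like)."""
    surface[y, x] = t
    y0, y1 = max(0, y - patch), min(H, y + patch + 1)
    x0, x1 = max(0, x - patch), min(W, x + patch + 1)
    p = surface[y0:y1, x0:x1]
    gy, gx = np.gradient(p)                     # spatial gradients of the surface
    a, b, c = (gx * gx).sum(), (gx * gy).sum(), (gy * gy).sum()
    det, trace = a * c - b * b, a + c
    return det - k * trace * trace              # classic Harris response

print(update_and_score(5, 5, 0.0))    # flat surface → 0.0
print(update_and_score(10, 10, 100.0) > 0)  # isolated event is locally corner-like
```

A practical detector would additionally decay or normalize the time surface so that old events stop influencing the score.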

For tasks like object recognition and face detection we adapted existing machine learning and computer vision algorithms, working mostly on the boundaries of observed objects, which are often given by the DVS events. We used both deep learning approaches like DBNs and HMAX-like models, and contour-based computer vision algorithms. Object recognition can detect landmarks in a scene, and thereby feed into the SLAM process, e.g. to establish an initial position. Face detection is a useful tool for augmented reality applications.

The gesture interface is intended to facilitate the control of a wearable augmented reality display. Instead of using spoken commands or inputs from a touchpad (the default input of the Moverio BT-100 data goggles), the idea was to use hand commands. Here we worked both on simple hand-position tracking and on the recognition of different hand poses with convolutional networks to recognize a set of commands.
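The hand-position-tracking half can be sketched very simply under the (simplifying) assumption that the hand is the dominant source of events in the field of view: track the event centroid with exponential smoothing. The function name and smoothing factor below are illustrative, not the workshop implementation; pose recognition would run a convolutional network on the activity around this tracked position.

```python
def track_centroid(events, alpha=0.1):
    """events: iterable of (x, y) event coordinates in arrival order.
    Returns the smoothed centroid trajectory as a list of (x, y) floats."""
    cx, cy = None, None
    path = []
    for x, y in events:
        if cx is None:
            cx, cy = float(x), float(y)       # initialize on first event
        else:
            cx += alpha * (x - cx)            # exponential smoothing toward
            cy += alpha * (y - cy)            # the newest event
        path.append((cx, cy))
    return path

# A hand moving right generates events with increasing x:
path = track_centroid([(10, 50), (12, 50), (14, 50), (16, 50)])
print(path[-1][0] > path[0][0])  # → True
```

Exponential smoothing is a natural fit for event streams because it updates at every event, with no notion of a frame rate.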

Movement segmentation is the process of detecting moving objects in recordings from a camera that is itself moving. Because of this ego-motion, every object in the scene generates a stream of events, so it is important to distinguish events caused by static objects from those caused by moving objects. Failing to do so can create severe artifacts in the estimation of the camera position for visual SLAM, which relies on the recognition and tracking of previously seen features that are assumed to be static.
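One simple way to operationalize this distinction, assuming the ego-motion induces a roughly uniform apparent flow across the image, is to compare each event's local flow vector against the global median flow and flag large deviations as independent motion. The sketch below is an illustrative baseline only, not the workshop's segmentation method, and it presumes the per-event flow vectors come from an upstream event-based flow estimator.

```python
import statistics

def segment_moving(flows, thresh=2.0):
    """flows: list of (vx, vy) apparent-flow vectors, one per event.
    Returns a boolean mask: True = likely independently moving object."""
    mvx = statistics.median(vx for vx, _ in flows)  # robust ego-motion estimate
    mvy = statistics.median(vy for _, vy in flows)
    return [((vx - mvx) ** 2 + (vy - mvy) ** 2) ** 0.5 > thresh
            for vx, vy in flows]

# Eight events share the ego-motion flow; one event moves differently:
flows = [(1.0, 0.0)] * 8 + [(5.0, 3.0)]
print(segment_moving(flows))  # only the last event is flagged
```

Real scenes violate the uniform-flow assumption (flow depends on depth and rotation), so a full solution would compensate for the estimated camera motion before thresholding.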

At the top level we have two target tasks, Augmented Reality and SLAM. In our project we focused on the latter, but provided new technological tools for augmented displays by developing an efficient algorithm for streaming visual events between two devices. For SLAM we used the recordings and the feature detectors to create a visual map of the environment and to localize the user wearing the vision sensor within it. This worked first for simplified visual scenes, in which dots on a plane constitute the features, and then in an extended version for indoor recordings of a user walking with an ATIS sensor mounted on a bicycle helmet.
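To illustrate the localization half in the simplified dots-on-a-plane setting, the sketch below estimates a 2D sensor position from known landmark positions and distance measurements via linearized least squares. This is a generic textbook formulation offered purely for illustration; the actual SLAM pipeline tracked visual features rather than measuring ranges, and all names and values here are hypothetical.

```python
import numpy as np

def locate(landmarks, ranges):
    """landmarks: (N, 2) known positions; ranges: (N,) measured distances.
    Returns the estimated (x, y) sensor position."""
    landmarks = np.asarray(landmarks, float)
    ranges = np.asarray(ranges, float)
    # Subtract the first equation ||p - l0||^2 = r0^2 from the others
    # to eliminate the quadratic term ||p||^2 and obtain A p = b.
    A = 2 * (landmarks[1:] - landmarks[0])
    b = (np.sum(landmarks[1:] ** 2, axis=1) - np.sum(landmarks[0] ** 2)
         - ranges[1:] ** 2 + ranges[0] ** 2)
    return np.linalg.lstsq(A, b, rcond=None)[0]

true_pos = np.array([2.0, 3.0])
lms = np.array([[0, 0], [10, 0], [0, 10]])
rs = np.linalg.norm(lms - true_pos, axis=1)   # noiseless range measurements
print(np.round(locate(lms, rs), 3))  # → [2. 3.]
```

With more than three landmarks or noisy ranges, the same least-squares system simply becomes overdetermined and averages out the noise.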

Within the 3 weeks of the Telluride workshop, we managed to complete many of the required components and to integrate some of them. The following diagram shows the completed components and the subproject leaders; arrows indicate that integration between the components was achieved.

Components achieved during MAP14


The project was highly successful, delivering many novel approaches for event-based vision on wearable devices and solving many very difficult theoretical and implementation issues along the way, which will certainly result in a number of publications. Furthermore, we achieved the goal of raising awareness of wearable computing as an important future application area, and in particular as one where neuromorphic engineering can make a big impact. Fast and low-power computing is absolutely necessary for wearable devices, which cannot integrate large batteries or powerful computers due to their size. Event-based vision can provide a solution for this, and the systems developed during Telluride 2014 provide an important first step in that direction. Unfortunately there was not enough time to integrate all the components into one big application, because every component proved to be very difficult on its own. With a bit more time this could have been resolved: the different feature detectors could have fed directly into SLAM or into the object recognition algorithms, which in turn could have served as landmark detectors to aid localization.

We think this project provides the basis for many more interesting neuromorphic engineering challenges: new vision sensors like DAVIS are becoming available, hardware platforms like SpiNNaker could implement several of the components more efficiently, new algorithmic solutions might better exploit the timing information in DVS events, and new ways to deal with dynamic objects (e.g. moving people) in the scene can be developed. This provides many fruitful challenges for future Telluride projects.