Detection of motion direction from an eDVS using Dynamic Field Theory

Mathis Richter, Yulia Sandamirskaya

Parsing

This workshop project is part of a long-term research project we have recently started to work on at the Institute for Neural Computation at the Ruhr-University Bochum. It aims at autonomously parsing a perceived continuous flow of actions into its components, something we humans can do effortlessly. For instance, when watching someone grasp an object on a table, we are able to subdivide the flow of actions into different phases, e.g., movement of the head and eyes to locate the object, movement of the arm toward the object, opening and closing of the hand, and finally grasping the object. There are two hypotheses as to what type of features this parsing is based on: first, it could be based on the change of contact points (e.g., between the hand and the object, or the object and the table); second, it could be based on estimates of movement parameters such as the velocity and direction of the arm. While parsing might use both mechanisms, in this early stage of the project we concentrate on the second hypothesis and try to determine the components of a perceived action based on the direction of movement alone.

Model of motion direction detection

We recently developed a model of human motion direction detection. It is based on the observation that humans detect motion best when the moving object disappears in one location and appears in another at the exact same time; previous models depended on the sequential occurrence of these events. While based on Dynamic Field Theory, our model relies on several conventional preprocessing steps (e.g., edge detection, transient detection), as it works with input from a regular camera. Event-based cameras such as the DVS, on the other hand, already provide edge detection of (moving) objects in the scene. Thus, the idea for this workshop project is to determine whether the preprocessing steps of the model can be discarded when working on input from an event-based camera.

Goals of this project

The primary goal of this project is to determine whether our model of motion direction detection can be simplified by using input from an event-based camera. Since we usually do not work with event-based sensors, an important subgoal of the project is to investigate the integration of such sensors into DFT architectures.

Results

Integration of event-based camera in DFT framework

We decided to use the event-based eDVS camera by Jorg Conradt because it already provides a C++ interface.



We integrated the camera into cedar, our C++-based framework for Dynamic Neural Fields. You can now drag and drop an icon representing the eDVS into an architecture and connect its output to various processing steps and to Dynamic Neural Fields.



Approach for the detection of motion direction

The eDVS camera produces events based on log-intensity changes perceived on its silicon retina. In other words, whenever the light intensity at a pixel increases or decreases by a certain amount, the camera produces a positive or negative event, respectively.

We collect the incoming events over a small time window, split them into positive and negative streams, and integrate them into two 128x128 pixel event matrices. Both matrices contain binary values at each pixel, where 1 denotes an event and 0 denotes no event in the current time window.
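To make this step concrete, the following C++ sketch shows one way such an accumulation could look. The `Event` struct, the function name, and the binary (non-counting) accumulation are assumptions made for illustration; this is not the actual cedar implementation or the eDVS driver interface.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical event record; the real eDVS interface may differ.
struct Event
{
  std::uint8_t x;   // pixel column, 0..127
  std::uint8_t y;   // pixel row, 0..127
  bool positive;    // true: intensity increased, false: intensity decreased
};

constexpr int kRetinaSize = 128;
using EventMatrix = std::array<std::array<std::uint8_t, kRetinaSize>, kRetinaSize>;

// Splits all events of one time window into two binary 128x128 matrices.
void accumulateWindow(const std::vector<Event>& window,
                      EventMatrix& positive,
                      EventMatrix& negative)
{
  // Clear both matrices (0 = no event in this window).
  for (auto& row : positive) row.fill(0);
  for (auto& row : negative) row.fill(0);

  // Mark each pixel that received at least one event (binary, not counting).
  for (const Event& e : window)
  {
    if (e.positive)
      positive[e.y][e.x] = 1;
    else
      negative[e.y][e.x] = 1;
  }
}
```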



As you can see in the above screenshot, a single object moving through the scene produces two edges of events, where one edge consists mainly of positive and the other mainly of negative events. You could think of these two edges as two separate objects moving in space. Viewed this way, the problem translates into determining on which side the edges lie relative to one another (e.g., the positive edge is to the left of the negative edge). To solve this problem, we employ a Dynamic Field Theory architecture for "spatial language", which is able to determine the relative spatial position of objects in a scene.
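To illustrate the underlying geometric idea only (not the spatial-language architecture itself, which is implemented with Dynamic Neural Fields in cedar), the following sketch compares the centroids of the positive and negative event matrices from the previous example. All names and the winner-take-all choice between axes are assumptions for this illustration, and the mapping from the sign of the offset to a direction depends on the object/background contrast (see caveat 1 below).

```cpp
#include <cmath>
#include <optional>
#include <utility>

// Reuses EventMatrix and kRetinaSize from the sketch above.

// Centroid (x, y) of all active pixels in a binary event matrix,
// or no value if the matrix contains no events.
std::optional<std::pair<double, double>> centroid(const EventMatrix& m)
{
  double sumX = 0.0, sumY = 0.0;
  int count = 0;
  for (int y = 0; y < kRetinaSize; ++y)
    for (int x = 0; x < kRetinaSize; ++x)
      if (m[y][x] != 0)
      {
        sumX += x;
        sumY += y;
        ++count;
      }
  if (count == 0)
    return std::nullopt;
  return std::make_pair(sumX / count, sumY / count);
}

enum class Direction { None, Left, Right, Up, Down };

// Crude motion-direction estimate from the relative position of the two
// edges: whichever axis shows the larger centroid offset determines the axis
// of motion. Which of the two directions on that axis is reported depends on
// the contrast polarity of the object against the background.
Direction estimateDirection(const EventMatrix& positive, const EventMatrix& negative)
{
  auto p = centroid(positive);
  auto n = centroid(negative);
  if (!p || !n)
    return Direction::None;

  double dx = p->first - n->first;    // > 0: positive edge right of negative edge
  double dy = p->second - n->second;  // > 0: positive edge below negative edge

  if (std::abs(dx) >= std::abs(dy))
    return dx >= 0 ? Direction::Right : Direction::Left;
  else
    return dy >= 0 ? Direction::Down : Direction::Up;
}
```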

We built a complete DFT-based architecture in cedar and connected it to the sensory input from the eDVS. The following screenshot shows an overview of the architecture. It runs in real time, processing the event stream of the eDVS as it comes in. At the far end (right side) of the architecture, four dynamical nodes (zero-dimensional Dynamic Neural Fields) represent the detected motion direction for leftward, rightward, upward, and downward motion, respectively.



The following diagram shows the same architecture, overlaid with descriptions of functional blocks within the architecture.



The architecture is able to detect four different motion directions. As shown in the next four screenshots, for each motion direction only the corresponding node becomes active (output of approximately 1), while the output of the others stays close to 0. The plots in each screenshot show (from left to right) the camera output, the modules doing the actual detection of the direction of motion (a sum of two streams, where one is convolved with a sigmoid), and the output of the motion direction nodes. The object used is a whiteboard marker. In the four screenshots, it moves left, right, up, and down, respectively.
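For readers unfamiliar with zero-dimensional Dynamic Neural Fields: each motion direction node can be thought of as governed by Amari-style dynamics of the form tau * du/dt = -u + h + s(t) + c * sigma(u), whose sigmoided output approaches 1 when the node is active and stays near 0 otherwise. The following sketch shows a generic Euler-integrated node of this kind; the parameter values are illustrative and not taken from the actual architecture.

```cpp
#include <cmath>

// Generic Amari-style dynamic neural node (zero-dimensional field):
//   tau * du/dt = -u + h + s(t) + c * sigma(u)
class DynamicNode
{
public:
  // Logistic output nonlinearity; ~0 when the node is inactive, ~1 when active.
  double output() const { return 1.0 / (1.0 + std::exp(-beta_ * u_)); }

  // One Euler step of the node dynamics with external input s and step size dt.
  void step(double s, double dt)
  {
    double du = (-u_ + h_ + s + selfExcitation_ * output()) / tau_;
    u_ += dt * du;
  }

private:
  double u_ = -5.0;              // activation, starts at the resting level
  double h_ = -5.0;              // negative resting level (node is off without input)
  double tau_ = 0.1;             // time constant in seconds
  double selfExcitation_ = 6.0;  // self-excitation stabilizes the "on" state
  double beta_ = 4.0;            // steepness of the sigmoid
};
```

In the architecture, each of the four nodes receives input from its corresponding detection module; whether additional mutual inhibition between the nodes is used is not shown in this sketch.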

leftward motion



rightward motion



upward motion



downward motion



In addition to the screenshots shown above, the following video shows the same detection process.

[motion_direction_detection.ogg Download]

In future research, the output of the motion direction nodes can be used to detect changes in the movement parameters of a perceived action. It is thus a first step toward parsing a continuous flow of actions into its components.

Caveats and future research

Due to the limited time, the architecture is still rather basic and has some (big) caveats that need to be addressed in future research.

1. Since the eDVS produces events based on log-intensity change alone, the polarity of the events depends on the intensity difference between foreground (i.e., the moving object) and background. In other words, if the architecture produces correct responses for the direction of motion of an object that is darker than the background, it will report the exact opposite when the object moves in front of a background that is darker than the object itself. This issue could be addressed by introducing a top-down signal from a module that is able to detect the intensity of the background.

2. The architecture currently only produces correct output when a single, uniform object (i.e., an object without patterns or internal edges) moves through the scene. These limitations could also be addressed by top-down signals that focus the attention of the model on a single object and detect structured objects as a whole.

3. At the moment, the architecture can only deal with the input of a static camera (one that does not move), since a moving camera produces many events unrelated to the object moving in the scene. Humans use their vestibular system to stabilize the perceived signal of the environment when moving. Additionally, input may be suppressed during saccades or other fast movements. In future research we could use similar techniques to deal with moving cameras.
