Each neuron in the network is uniquely assigned a horizontal and vertical cyclopean coordinate x and y, as well as a disparity coordinate d. With the definition of $D \subset \mathbb{Z}$ as the one-dimensional disparity space, the three coordinates represent a point $\mathbf{d} = (x, y, d)$ in the 3D disparity space $X \times Y \times D$, which corresponds to the neuron's cognitive representation of a location in 3D space. We can thus define the map $M$, which transforms retinal image coordinates to disparity space, as

$$ M : (x_L, x_R, y) \mapsto (x, y, d) = \left( \frac{x_L + x_R}{2},\; y,\; x_L - x_R \right), $$

where $x_L$ and $x_R$ are rectified pixel coordinates of the retinal input neurons.
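As a minimal illustration of this mapping, the following C sketch converts a pair of rectified retinal coordinates into a disparity-space point, assuming the cyclopean convention given above; the type and function names are our own.

```c
#include <stdio.h>

/* Sketch of the map M from rectified retinal coordinates (x_L, x_R, y)
 * to disparity space (x, y, d), assuming x = (x_L + x_R)/2 and
 * d = x_L - x_R as defined above. */
typedef struct {
    double x;  /* horizontal cyclopean position (may be half-integer) */
    int    y;  /* vertical cyclopean position (epipolar line)         */
    int    d;  /* disparity                                           */
} DisparityPoint;

static DisparityPoint map_to_disparity_space(int x_left, int x_right, int y)
{
    DisparityPoint p;
    p.x = 0.5 * (x_left + x_right);  /* midpoint between the two retinal columns */
    p.y = y;                         /* rectified inputs share the same row      */
    p.d = x_left - x_right;          /* horizontal offset encodes depth          */
    return p;
}

int main(void)
{
    /* A pixel seen at column 97 in the left image and 90 in the right
     * maps to cyclopean position 93.5 with disparity 7. */
    DisparityPoint p = map_to_disparity_space(97, 90, 42);
    printf("x = %.1f, y = %d, d = %d\n", p.x, p.y, p.d);
    return 0;
}
```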
In order to implement a neural coincidence detection mechanism, the proposed network uses neurons with leaky-integrate-and-fire (LIF) dynamics36. The membrane potential $v_c(t)$ of a LIF coincidence neuron is described by the following equation

$$ \tau_c \frac{dv_c(t)}{dt} = -v_c(t) + i_c(t), $$

where the time constant $\tau_c$ determines the neuron's leak and $\theta_c$ the threshold at which the neuron fires. A coincidence neuron receives input from a pair of epipolar retinal cells, which can be described as a sum of spikes

$$ i_c(t) = \sum_i \delta(t - t_i^L) + \sum_j \delta(t - t_j^R), $$

where the indices i and j indicate the spike times of the left and right retinal cells, respectively. The proposed disparity detectors aggregate evidence from the responses of simple coincidence detectors. The disparity detectors are also modeled using LIF neuron dynamics, but with a distinct time constant $\tau_d$ and a firing threshold $\theta_d$:

$$ \tau_d \frac{dv_d(t)}{dt} = -v_d(t) + i_d(t). $$
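Because the network is simulated in an event-based fashion (see below), these LIF dynamics admit a closed-form update between spikes. The following C sketch shows one such event-driven coincidence update; the time constant, threshold, and weight values are placeholders, not the parameters used in the paper.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative event-driven update of a LIF coincidence neuron: the
 * membrane potential decays in closed form between input spikes, so it
 * only needs to be evaluated when an event arrives. */
typedef struct {
    double v;       /* membrane potential at time t_last */
    double t_last;  /* time of the most recent update (s) */
} LifState;

static const double TAU_C   = 1e-3;  /* assumed time constant (s) */
static const double THETA_C = 1.5;   /* assumed firing threshold  */

/* Returns 1 if the neuron fires on this input spike, 0 otherwise. */
static int lif_on_spike(LifState *n, double t, double weight)
{
    n->v *= exp(-(t - n->t_last) / TAU_C);  /* leak since the last event   */
    n->v += weight;                         /* integrate the incoming spike */
    n->t_last = t;
    if (n->v >= THETA_C) {  /* two near-coincident retinal spikes cross threshold */
        n->v = 0.0;         /* reset after firing */
        return 1;
    }
    return 0;
}

int main(void)
{
    LifState n = {0.0, 0.0};
    /* A left and a right retinal spike 0.2 ms apart trigger a coincidence. */
    printf("left  spike fires: %d\n", lif_on_spike(&n, 1.0000, 1.0));
    printf("right spike fires: %d\n", lif_on_spike(&n, 1.0002, 1.0));
    return 0;
}
```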
The input of the disparity detector at $\mathbf{d} = (x_d, y_d, d_d)$ combines the outputs from coincidence detectors within bounded planar excitatory and inhibitory regions in disparity space, $C^+$ and $C^-$ respectively, and outputs from disparity detectors in the mutually inhibitory region $D^-$:

$$ i_d(t) = w_{exc} \sum_{c \in C^+} \sum_k \delta(t - t_k^c) \;-\; w_{inh} \sum_{c \in C^-} \sum_k \delta(t - t_k^c) \;-\; w_{rec} \sum_{d' \in D^-} \sum_n \delta(t - t_n^{d'}), $$

where k represents the index of the spike times of coincidence neuron c, while $w_{exc}$ and $w_{inh}$ are constant excitatory and inhibitory weights. The regions $C^+$ and $C^-$ are square windows in the plane of constant disparity $E_d$ and the plane of constant horizontal cyclopean position $E_x$, which are defined relative to the disparity detector's spatial representation $\mathbf{d}$. The index n represents the spike times of disparity neuron $d'$ and $w_{rec}$ is a constant inhibitory weight (mutual inhibition of disparity detectors). The region $D^-$ is defined by the two lines of sight (one from each retina):

$$ D^- = \left\{ (x, y_d, d) \;:\; x + \tfrac{d}{2} = x_d + \tfrac{d_d}{2} \;\lor\; x - \tfrac{d}{2} = x_d - \tfrac{d_d}{2} \right\} \setminus \{\mathbf{d}\}. $$
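The line-of-sight condition above can be tested with integer arithmetic by working with doubled cyclopean coordinates, since $x_L = x + d/2$ and $x_R = x - d/2$. A small C sketch of this membership test (with illustrative names) follows.

```c
#include <stdio.h>

/* Sketch of the mutual-inhibition test: two disparity detectors on the
 * same epipolar line inhibit each other when they lie on a common line
 * of sight, i.e. when they are driven by the same left or the same
 * right retinal column. Doubled coordinates keep everything integer. */
static int in_mutual_inhibition_region(int x2_a, int d_a,  /* 2*x and d of detector a */
                                       int x2_b, int d_b)  /* 2*x and d of detector b */
{
    int same_detector = (x2_a == x2_b) && (d_a == d_b);
    int same_left     = (x2_a + d_a) == (x2_b + d_b);  /* shared left line of sight  */
    int same_right    = (x2_a - d_a) == (x2_b - d_b);  /* shared right line of sight */
    return !same_detector && (same_left || same_right);
}

int main(void)
{
    /* Detectors at (x=10, d=4) and (x=11, d=2) share the left column
     * x_L = 12, so they compete; (x=10, d=4) and (x=20, d=4) do not. */
    printf("%d\n", in_mutual_inhibition_region(20, 4, 22, 2));  /* prints 1 */
    printf("%d\n", in_mutual_inhibition_region(20, 4, 40, 4));  /* prints 0 */
    return 0;
}
```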
The final output of the network is then obtained from a combination of coincidence and disparity responses, such that a disparity spike is only considered valid when immediately preceded by a coincidence spike representing the same location in disparity space (see Supplementary Material for more details).
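A minimal sketch of this gating rule in C follows, assuming a fixed time window DELTA as the reading of "immediately preceded" (the precise mechanism is given in the Supplementary Material).

```c
#include <stdio.h>

/* Illustrative gating of the final output: a disparity spike is kept
 * only if a coincidence spike for the same disparity-space location
 * arrived within a short preceding window. DELTA is an assumption of
 * this sketch, not a value from the paper. */
#define N_LOCATIONS 64
#define DELTA 1e-3  /* assumed gating window (s) */

static double last_coincidence[N_LOCATIONS];  /* last coincidence spike time per location */

static void on_coincidence_spike(int loc, double t) { last_coincidence[loc] = t; }

static int valid_disparity_spike(int loc, double t)
{
    return (t - last_coincidence[loc]) <= DELTA;
}

int main(void)
{
    for (int i = 0; i < N_LOCATIONS; i++) last_coincidence[i] = -1e9;
    on_coincidence_spike(7, 2.0000);
    printf("gated output: %d\n", valid_disparity_spike(7, 2.0005)); /* 1: preceded by coincidence */
    printf("gated output: %d\n", valid_disparity_spike(8, 2.0005)); /* 0: no coincidence there    */
    return 0;
}
```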
Network simulation

For our simulations we used the output of 180 × 180 pixels from each sensor. This led to a population of $2 \times 180^3$ coincidence detectors (the two polarities were treated with separate detectors). With another $180^3$ disparity detectors, the total network initially comprised more than 17 million neurons. However, for a natural scene within 5 m of the camera, the disparity is limited to the range (0, 40), so the total number of required neurons was reduced to less than 4 million. The simulation code was written entirely in C and in a completely event-based fashion, meaning that the membrane potential of a specific neuron was only updated when it received an input, which leads to a very efficient implementation. For each neuron, the current membrane potential and the time of the last update needed to be stored in memory, resulting in an occupancy of about 30 megabytes. In the 4-second walking scene, roughly 1.2 million input events were processed with a total simulation run time of about 30 seconds on a single core of an i7 CPU running at 3.40 GHz.
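As a back-of-the-envelope check of the stated figures, the following C snippet computes the neuron count and memory occupancy, assuming 41 disparity levels and 4 bytes each for the membrane potential and the last-update timestamp (our reading of the figures above).

```c
#include <stdio.h>

/* Rough check of the stated memory occupancy: each neuron stores a
 * membrane potential and the time of its last update, assumed here to
 * be a 4-byte float and a 4-byte timestamp. */
int main(void)
{
    long coincidence = 2L * 180 * 180 * 41;  /* two polarities, disparities 0..40 */
    long disparity   = 1L * 180 * 180 * 41;
    long neurons     = coincidence + disparity;  /* ~3.99 million          */
    long bytes       = neurons * (4 + 4);        /* 8 bytes of state each  */
    printf("neurons: %ld, memory: %.1f MB\n", neurons, bytes / 1e6);  /* ~31.9 MB */
    return 0;
}
```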
Comparing stereo correspondence performance of traditional and neuromorphic hardware

Conventional SAD algorithms for stereo matching typically involve many processing steps; however, the step that dominates the algorithm's complexity is the calculation of the sum of absolute differences. Typically, a three-dimensional map of disparity space is calculated, in which each entry contains the absolute difference in intensity of the corresponding image pixels at a given disparity. The sum can then easily be computed across constant disparities. This three-dimensional map is updated for each pair of images, with every new frame. As a consequence, the algorithm's complexity is proportional to $n^2 D f_s$, where $n^2$ is the dimension of the disparity image, $D$ the depth resolution, and $f_s$ the sampling rate. Typically, with this method, a subtraction, a rectification, and an addition need to be performed at each step. For the sake of simplicity, we do not consider in the comparison the extra steps required to identify the highest SAD entry in the map. In our stereo-correspondence model the disparity map is represented by the population of coincidence neurons, which are updated only when the afferent pixels produce address-events. This update rate determines the network complexity and is proportional to $n^2 d^2 D r$, where $d$ represents the edge density of the visual input and $r$ is the rate at which the edges change. This is a highly simplified expression and assumes that intensity changes are caused by moving edges. It becomes immediately evident that, for a rate of change $r$ that matches the sampling rate $f_s$, our stereo network requires a factor of $d^2$ fewer operations. In the context of the stereo network, the generation of a coincidence spike and its propagation to the disparity detectors is considered an operation. This is approximately equivalent to the SAD operation (subtraction, rectification, addition), as concluded from our network simulations and detailed analysis.
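The factor-$d^2$ saving can be made concrete with the following C snippet, which evaluates both complexity expressions; the edge density d = 0.1 is an illustrative assumption, not a measurement from the paper.

```c
#include <stdio.h>

/* Comparison of the two complexity expressions: n^2*D*fs for frame-based
 * SAD versus n^2*d^2*D*r for the event-based network, with r matched to
 * fs as in the text. */
int main(void)
{
    double n  = 180.0;  /* image edge length in pixels  */
    double D  = 40.0;   /* depth resolution (levels)    */
    double fs = 151.0;  /* sampling rate matched to r   */
    double d  = 0.1;    /* assumed edge density         */
    double r  = 151.0;  /* rate of change of edges (Hz) */

    double sad_ops  = n * n * D * fs;
    double ssnn_ops = n * n * d * d * D * r;
    printf("SAD:  %.3g ops/s\nSSNN: %.3g ops/s\nratio: %.0fx (= 1/d^2)\n",
           sad_ops, ssnn_ops, sad_ops / ssnn_ops);
    return 0;
}
```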
For the estimation of the power consumption of the SAD algorithm we considered a state-of-the-art low-power microcontroller, such as the STM32F7 series, which has power consumption figures rated as low as 0.33 mW/DMIPS. In this calculation we assumed that a SAD operation is implemented with 3 DMIPS (the bare minimum, not considering any parallel processing). We scaled the number of operations to match the described scenario, considering a depth resolution of 30 levels, computed within a field of view of 21 × 21 pixels.
In the spiking stereo neural network (SSNN) the number of operations corresponds to the number of input coincidence spikes. This number was calculated from the input spikes produced by the DVS and the estimated edge density of the visual scene. From this estimate we could calculate the rate of change of edges (r = 151 Hz), which defines the temporal resolution at which the two approaches should be compared.
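The following C snippet reproduces the SAD power estimate from the figures quoted above, reading "3 DMIPS per SAD operation" as three instructions per operation (our interpretation).

```c
#include <stdio.h>

/* SAD power estimate from the stated figures: 21x21 window, 30 depth
 * levels, updates at r = 151 Hz, 3 instructions per SAD operation,
 * 0.33 mW per DMIPS on an STM32F7-class microcontroller. */
int main(void)
{
    double ops_per_s   = 21.0 * 21.0 * 30.0 * 151.0;  /* ~2.0e6 SAD ops/s     */
    double instr_dmips = ops_per_s * 3.0 / 1e6;       /* ~6 DMIPS of workload */
    double power_mw    = instr_dmips * 0.33;          /* ~2 mW                */
    printf("%.2e SAD ops/s -> %.1f DMIPS -> %.1f mW\n",
           ops_per_s, instr_dmips, power_mw);
    return 0;
}
```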
For the estimation of the power consumption of a neuromorphic hardware implementation, we considered the energy used for both the generation and the routing of address-events. These figures were derived from measurements taken from a more recent design of a neuromorphic processor comprising the same circuits as the ROLLS chip and using the same 180 nm CMOS technology: the energy cost of a single spike generation is 883 pJ and the routing of a spike is 360 pJ, resulting in a total of 1.243 nJ per primitive operation. Using more advanced technology nodes, such as the 28 nm one used for recently proposed neuromorphic processors11, these figures would be even smaller. We are currently developing a neuromorphic processor chip that can support the network proposed in this paper, using a 28 nm process, and have estimated energy figures of 50 pJ for the spike generation and 147 pJ for routing the event to its destination20, resulting in a dramatically smaller total cost of 197 pJ per primitive operation. Table 1 summarizes this analysis.
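For reference, the per-operation energies follow directly from the quoted figures.

```c
#include <stdio.h>

/* Per-operation energy of the neuromorphic implementations, from the
 * figures quoted above (spike generation + event routing). */
int main(void)
{
    double e_180nm = 883e-12 + 360e-12;  /* ROLLS-style circuits, 180 nm */
    double e_28nm  = 50e-12 + 147e-12;   /* projected 28 nm processor    */
    printf("180 nm: %.3f nJ/op\n28 nm:  %.3f nJ/op\n",
           e_180nm * 1e9, e_28nm * 1e9);  /* 1.243 nJ vs 0.197 nJ */
    return 0;
}
```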
In both the SAD and SSNN calculations, we considered only the cost of the core operations, neglecting the costs of peripheral functionality such as I/O operations.

Additional Information
How to cite this article: Osswald, M. et al. A spiking neural network model of 3D perception for event-based neuromorphic stereo vision systems. Sci. Rep. 7, 40703; doi: 10.1038/srep40703 (2017).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.