With the large volume of posts these last few days I thought this deserved its own thread.
TATA has just had an SNN patent application published. Very interesting given it talks about Convolutional Spiking Neural Networks (MetaTF allows users to run these with no special setup). It also talks about the STDP learning rule.
For those who are new, Brainchip and TATA did a demonstration where they controlled a digital robot figure with hand gestures using Brainchip's software at the end of 2019.
The abstract indicates the patent covers a very similar application (however, note this patent mentions radar while the demonstration used a DVS camera). I've linked the video at the bottom of this post for those who haven't seen it before.
This is one of the biggest indicators TATA is still working with Brainchip.
https://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=1&f=G&l=50&d=PG01&p=1&S1=stdp&OS=stdp&RS=stdp
United States Patent Application | 20210365778 |
Kind Code | A1 |
Dey; Sounak ; et al. | November 25, 2021 |
SYSTEM AND METHOD FOR REAL-TIME RADAR-BASED ACTION RECOGNITION USING SPIKING NEURAL NETWORK (SNN)
Abstract
This disclosure relates generally to action recognition and more particularly to system and method for real-time radar-based action recognition. The classical machine learning techniques used for learning and inferring human actions from radar images are compute intensive, and require volumes of training data, making them unsuitable for deployment on network edge. The disclosed system utilizes neuromorphic computing and Spiking Neural Networks (SNN) to learn human actions from radar data captured by radar sensor(s). In an embodiment, the disclosed system includes a SNN model having a data pre-processing layer, Convolutional SNN layers and a Classifier layer. The preprocessing layer receives radar data including doppler frequencies reflected from the target and determines a binarized matrix. The CSNN layers extracts features (spatial and temporal) associated with the target's actions based on the binarized matrix. The classifier layer identifies a type of the action performed by the target based on the features.
Inventors: | Dey; Sounak; (Kolkata, IN); Mukherjee; Arijit; (Kolkata, IN); Banerjee; Dighanchal; (Kolkata, IN); Rani; Smriti; (Kolkata, IN); George; Arun; (Bangalore, IN); Chakravarty; Tapas; (Kolkata, IN); Chowdhury; Arijit; (Kolkata, IN); Pal; Arpan; (Kolkata, IN) |
Applicant: | Tata Consultancy Services Limited, Mumbai, IN |
Assignee: | Tata Consultancy Services Limited, Mumbai, IN |
Family ID: | 1000005290548 |
Appl. No.: | 17/122041 |
Filed: | December 15, 2020 |
CLAIMS
1. A processor implemented method comprising: employing, via one or more hardware processors, a spiking neural network (SNN) model for recognition of an action performed by a target, the SNN model comprising a data pre-processing layer, a plurality of Convolutional Spiking neural network (CSNN) layers and a classifier layer, wherein the SNN model for action recognition comprising: receiving, by the data preprocessing layer, a radar data acquired by one or more radar sensors, wherein the radar data indicative of one or more actions performed by the target, wherein the radar data comprises a plurality of Doppler frequencies reflected from the target upon motion of the target with respect to the one or more radar sensors; determining, by the data preprocessing layer, a first binarized matrix associated with the radar data; extracting, by the plurality of CSNN layers pre-trained on a training data, a set of features associated with the one or more actions of the target based on the first binarized matrix, the set of features comprising a first set of spatial features and a first set of temporal features; and identifying, by the classifier layer, a type of the action from amongst the one or more actions performed by the target based on the set of features.
2. The method of claim 1, wherein determining the first binarized matrix comprises: transforming the radar data to obtain an input spectrogram using a Short-time Fourier Transform (STFT) model; and performing a greyscale conversion and binarization of the input spectrogram to obtain the first binarized matrix.
3. The method of claim 1, wherein extracting the first set of spatial feature hierarchically by convoluting over the plurality of CSNN layers, wherein convoluting over the plurality of CSNN layers increases complexity in the first set of spatial features from an initial CSNN layer to a last CSNN layer of the plurality of CSNN layers.
4. The method of claim 1, further comprising training the plurality of CSNN layers using an unsupervised training technique for identifying a plurality of actions using the training data, wherein training the plurality of CSNN layers comprises: receiving, by the data preprocessing layer, the training data acquired by the one or more radar sensors, the training data indicative of the plurality of actions performed by one or more targets, wherein the training data comprises a time series data comprising a plurality of Doppler frequencies reflected from the plurality of targets upon motion of the plurality of targets with respect to the one or more radar sensors; determining, a plurality of second binarized matrices associated with each of the plurality of actions, by the preprocessing layer, wherein determining the plurality of second binarized matrices comprises: computing a plurality of spectrograms for the plurality of actions as a time-frequency domain representation of the time series by using a STFT model; performing a modulus operation on the STFT model to obtain a real valued matrix; consecutively converting the real matrix into a grayscale image and into a second binary matrix of the plurality of second binary matrices using a threshold; extracting, by the plurality CSNN layers, a set of training features associated with the plurality of actions of the target based on the second plurality of binarized matrices, the set of training features comprising a second set of spatial features and a second set of temporal features; and identifying, by the classifier layer, a type of the action from amongst the plurality of actions performed by the target based on the second set of features.
5. The method of claim 4, wherein each CSNN layer of the plurality of CSNN layers comprises a set of class-wise filter blocks connected via an inhibition mechanism, each class-wise filter block comprising a set of filters controlled by a switcher node to enable the CSNN layer to capture spatially collocated patterns within a spike frame of a single action class associated with the action, and wherein the each CSNN layer of the plurality of CSNN layers captures spatially collocated patterns within a spike frame of the single action class, and wherein each filter-block of the set of class-wise filter blocks comprises a set of filters connected via a switcher node, and wherein the switcher node applies inhibition to force all but one filter in the each filter-block to inactive state for a predetermined inactivity duration.
6. The method of claim 5, further comprising enabling spatially collocated but temporally separable features appear on distinct filters from the set of filters, wherein said enabling comprises iteratively selecting a filter from amongst the set of filters to cause a spike upon elapse of the predetermined inactivity duration, wherein a filter from amongst the set of filters causing a maximum spike is selected, and wherein iteratively selecting the filter of the set of filters comprises: activating a filter block from amongst the set of class-wise filter blocks at a time for a frame sequence associated with the action; and applying a long-term inhibition to disable more than one class-wise filter blocks from amongst the set of class-wise filter blocks to learn a redundant pattern and enable lateral inhibition among the set of class-wise filter blocks and allowing the set of class-wise filter blocks to compete for a plurality of action classes.
7. The method of claim 6, wherein applying the long-term inhibition comprises: initializing weights associated with the filters in the set of filter blocks randomly; determining a filter-block from amongst the plurality of filter-blocks selected for a distinct action class from amongst the plurality of action classes; and sending an inhibition signal to remaining filter blocks from amongst the plurality of filter-blocks to prevent the remaining filter blocks from being activated.
8. The method of claim 1, wherein a radar sensor from amongst the one or more radar sensors comprises a Continuous Wave (CW) type radar sensor.
9. The method of claim 1, wherein a radar sensor from amongst the one or more radar sensors comprises a frequency modulated continuous wave (FMCW) type radar sensor.
10. A system (900), comprising: one or more memories (915); and one or more hardware processors (902), the one or more memories (915) coupled to the one or more hardware processors (902), wherein the one or more hardware processors (902) are configured to execute programmed instructions stored in the one or more memories (915), to: employ a spiking neural network (SNN) model for recognition of an action performed by a target, wherein the SNN model comprising a data pre-processing layer, a plurality of Convolutional Spiking neural network (CSNN) layers and a classifier layer, wherein to employ the SNN model for action recognition, the one or more hardware processors are further configured by the instructions to: receive, by the data preprocessing layer, a radar data acquired by one or more radar sensors, the radar data indicative of one or more actions performed by the target, wherein the radar data comprises a plurality of Doppler frequencies reflected from the target upon motion of the target with respect to the one or more radar sensors; determine, by the data preprocessing layer, a first binarized matrix associated with the radar data; extract, by the plurality of CSNN layers pre-trained on a training data, a set of features associated with the one or more actions of the target based on the first binarized matrix, the set of features comprising a first set of spatial features and a first set of temporal features; and identify, by the classifier layer, a type of the action from amongst the one or more actions performed by the target based on the set of features.
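For anyone wondering what the preprocessing in claims 2 and 4 might look like in practice, here is a rough Python sketch of the chain STFT, modulus, greyscale conversion, binarization. It is my own illustration and not TATA's code: the window length, hop size, Hann window, threshold of 128 and the synthetic test tone are all assumptions, and it uses a slow naive DFT from the standard library instead of a proper FFT.

```python
import cmath
import math


def stft_magnitude(signal, win=32, hop=16):
    """Return |STFT| as a list of frames, each holding win // 2 + 1 magnitudes."""
    # Hann window (assumed choice) reduces leakage between neighbouring bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win)]
        # Naive DFT over non-negative frequencies; abs() is the modulus step
        # that turns the complex STFT into a real-valued matrix.
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win) for n in range(win)))
            for k in range(win // 2 + 1)
        ])
    return frames


def to_greyscale(matrix):
    """Linearly rescale a real matrix to integer grey levels 0..255."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1.0
    return [[round(255 * (v - lo) / span) for v in row] for row in matrix]


def binarize(grey, threshold=128):
    """Map grey levels to 0/1 with a fixed threshold (value is an assumption)."""
    return [[1 if v >= threshold else 0 for v in row] for row in grey]


# Synthetic stand-in for a radar return: a pure tone sitting on DFT bin 4.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(256)]
binary = binarize(to_greyscale(stft_magnitude(signal)))
```

Each row of `binary` is one time frame and each column one frequency bin, so the tone shows up as a vertical stripe of ones around bin 4. A real micro-Doppler signature would trace a time-varying pattern instead.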
DETAILED DESCRIPTION
[0018] Radar-based human sensing is being utilized for unobtrusive detection of human physiology as well as recognizing gestures/activities. These radars are compact in size, affordable and can be easily integrated to remote monitoring systems. Using radar for human sensing has certain advantages over vision technologies in that this is privacy preserving, independent of lighting conditions, usually does not require background subtraction (static background is defined as `Clutter` in radar terminology) and can be used for through-the-wall imaging. Amongst the radar technologies, `Continuous Wave` (CW) or `Frequency Modulated Continuous Wave` (FMCW) radars are preferred for short range (up to 15 meters) and indoor applications like elderly care. For CW radar, one measures motion directly in terms of Doppler frequency while for FMCW or Pulse radars, Doppler frequency is derived through pre-processing. It is to be noted that human movements constitute articulated motion vide linkages through flexible joints. When a rigid body moves in the field of view of the radar sensor, the return signal displays doppler frequency by taking the difference between transmitted and received frequencies. But when a person performs some action, even while standing still, the radar return displays time varying doppler frequency patterns. This is known as micro-Doppler effect. Thus, the radar signal is usually processed by generating spectrogram or joint time-frequency plots. The classification and detection of different actions from such radar signals is complex.
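To put a number on the Doppler frequency mentioned in [0018], here is a small Python sketch using the standard monostatic radar relation f_d = 2 v f_c / c, which holds for speeds far below the speed of light. The 24 GHz carrier and the 1 m/s limb speed are hypothetical values I picked for illustration; the patent text quoted here does not state them.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def doppler_shift_hz(radial_velocity_mps, carrier_hz):
    """Doppler shift seen by a monostatic radar for a slowly moving target."""
    # Factor 2 because the wave travels to the target and back.
    return 2.0 * radial_velocity_mps * carrier_hz / SPEED_OF_LIGHT


# Hypothetical example: 24 GHz sensor, limb moving at 1 m/s towards it.
shift = doppler_shift_hz(1.0, 24e9)
print(f"Doppler shift: {shift:.1f} Hz")
```

Different body parts move at different speeds at the same moment, so each contributes its own shift. That spread of shifts over time is the micro-Doppler pattern the paragraph describes.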
[0019] The signatures due to human motion displays complex pattern. In certain scenarios, machine learning techniques are applied for action detection. However, new insights are required for accurate and reliable detection of human gestures from radar returns. Conventionally, classical machine learning techniques, including Artificial Neural Networks (ANN) and Deep Learning models have been used on data from vision sensors for identifying actions. However, apart from the privacy concern, the classical approaches suffer from another great disadvantage in that the methods are not tailored for end-to-end execution on edge devices. In various industrial domains, such as Internet of Things (IoT), robotics, healthcare, retail etc., an abundance of low powered devices exists at the edge of the network and there is a drive to utilize the available compute cycles on such devices. The advantage this approach has over the prevailing methodology is that the data need not be sent upstream to the computing infrastructure over the network, thereby reducing the latency and communication cost. However, the classical approaches mentioned above require a large volume of data for training and are highly compute/memory intensive making them too heavy-weight for edge devices. Pre-trained compressed models can however be deployed on constrained devices--but that does not avoid the cost incurred during training, the requirement of a large volume of training data, and being compressed, they often sacrifice accuracy.
[0044] To achieve stability, the membrane potential always tends to evolve towards the resting potential, V_rest. Hence, in the absence of any stimulus from pre-synaptic neurons, the membrane potential of a particular neuron remains at V_rest. Similarly, the equilibrium potentials of the excitatory and inhibitory synapses are represented by E_exc and E_inh. Synapses are modelled as conductance values, namely, g_e, the excitatory conductance, and g_i, the inhibitory conductance. Excitatory pre-synaptic neurons increase the membrane potential, whereas, inhibitory pre-synaptic neurons tend to decrease it. As mentioned before, a spike is generated when the membrane potential breaches a threshold (V_thresh). A spike in the presynaptic neuron increases the conductance of the synapse in magnitude. The dynamics of excitatory and inhibitory conductance are modelled as per equations (4) and (5) respectively:
$$\tau_e \frac{dg_e}{dt} = -g_e \quad (4)$$
$$\tau_i \frac{dg_i}{dt} = -g_i \quad (5)$$
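Equations (4) and (5) simply say that each conductance decays exponentially with its own time constant. Below is a minimal Python sketch of that behaviour using forward-Euler integration. The time constant, time step, and the idea that each presynaptic spike adds a fixed `weight` to the conductance are my assumptions for illustration; the quoted text only says that a spike increases the conductance.

```python
import math


def simulate_conductance(spike_steps, tau=5.0, weight=1.0, dt=0.1, steps=200):
    """Forward-Euler integration of tau * dg/dt = -g with spike-driven jumps."""
    g = 0.0
    trace = []
    spikes = set(spike_steps)
    for step in range(steps):
        if step in spikes:
            g += weight        # assumed: each presynaptic spike adds `weight`
        trace.append(g)
        g += dt * (-g / tau)   # decay step taken from equation (4) / (5)
    return trace


trace = simulate_conductance(spike_steps=[0])
# One time constant later (tau / dt = 50 steps) the value should sit close
# to the analytic solution weight * exp(-1).
analytic = math.exp(-1.0)
print(trace[50], analytic)
```

The small gap between the simulated and analytic values comes from the Euler step size and shrinks as `dt` gets smaller.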
[0045] In an embodiment, a mathematical function used to model a spike may be Dirac Delta function. As this model is non-differentiable (which is logical for a spike which occurs at a time instance only), the gradient based learning algorithms popular in ANNs, cannot be applied in case of the SNN. Learning and memory in SNNs are thus modelled using Spike Time Dependent Plasticity (STDP) which takes into account the strengthening of synaptic bonds due to positive temporal correlation between pre- and post-synaptic spiking neurons. The STDP protocol modifies classical Hebbian learning rule by improving it with temporal asymmetry. A spiking neuron with STDP can learn a linear dynamical system with minimum least square error. A pre-synaptic trace, x_pre, for each synapse keeps track of the activity of the pre-synaptic neuron, and likewise a postsynaptic trace x_post, keeps track of the activity of the postsynaptic neuron. Each trace decays exponentially with time as shown in the equations (6) and (7) with synaptic trace decay constants τ_pre and τ_post.
$$\tau_{pre} \frac{dx_{pre}}{dt} = -x_{pre} \quad (6)$$
$$\tau_{post} \frac{dx_{post}}{dt} = -x_{post} \quad (7)$$
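To make the trace idea a bit more concrete, here is a rough Python sketch of trace-based STDP on a single synapse. Only the exponential trace decay comes from equations (6) and (7). The weight update itself is not included in the excerpt above, so the rule below (a common pair-based form where the trace jumps by 1 on a spike and the weight changes by `eta` times the opposite trace) is purely my assumption, as are all the parameter values.

```python
def run_stdp(pre_spikes, post_spikes, steps, tau_pre=20.0, tau_post=20.0,
             eta=0.01, dt=1.0, w=0.5):
    """Trace-based STDP for one synapse; returns the final weight."""
    x_pre = x_post = 0.0
    pre, post = set(pre_spikes), set(post_spikes)
    for t in range(steps):
        # Exponential decay from equations (6) and (7), forward Euler.
        x_pre += dt * (-x_pre / tau_pre)
        x_post += dt * (-x_post / tau_post)
        if t in pre:
            x_pre += 1.0          # assumed trace increment on a pre spike
            w -= eta * x_post     # assumed: pre after post weakens the synapse
        if t in post:
            x_post += 1.0         # assumed trace increment on a post spike
            w += eta * x_pre      # assumed: post after pre strengthens it
    return w


causal = run_stdp(pre_spikes=[10], post_spikes=[15], steps=30)
anti_causal = run_stdp(pre_spikes=[15], post_spikes=[10], steps=30)
print(causal, anti_causal)
```

When the presynaptic spike comes first the weight ends above its starting value of 0.5, and when the order is reversed it ends below. That asymmetry is the temporal asymmetry [0045] refers to.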
[0049] A biologically plausible approach adopted in neural networks in order to enhance competition between neurons is called Lateral Inhibition or Winner-Take-All. The first excited neuron to produce a spike attempts to stimulate other neurons or directly inhibits one or more of them. In a learning scenario, a pattern to be learnt excites one or more neurons, which in turn try to deactivate other neurons with the help of lateral inhibition, preventing them from learning the same pattern. In SNN networks, this mechanism helps multiple neurons to compete and learn different patterns. In the disclosed SNN model, a softer form of Lateral Inhibition like that of k-WTA is used, which is proven to be computationally less power intensive than a hard Lateral Inhibition and leads to better shared feature selectivity in cortical pyramidal cells.
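The difference between hard Lateral Inhibition and the softer k-WTA form in [0049] can be sketched in a few lines of Python. This is a generic k-winners-take-all selection on a list of membrane potentials, not the patent's switcher-node mechanism. The function name, the tie-breaking rule and the example values are my own.

```python
def k_winners_take_all(potentials, k):
    """Keep the k largest potentials and inhibit (zero) all the others."""
    if k <= 0:
        return [0.0] * len(potentials)
    # Indices of the k strongest neurons; ties go to the lower index.
    ranked = sorted(range(len(potentials)), key=lambda i: (-potentials[i], i))
    winners = set(ranked[:k])
    return [p if i in winners else 0.0 for i, p in enumerate(potentials)]


potentials = [0.2, 0.9, 0.4, 0.7, 0.1]
soft = k_winners_take_all(potentials, k=2)  # two neurons stay active
hard = k_winners_take_all(potentials, k=1)  # classic winner-take-all
print(soft)
print(hard)
```

With `k=1` only the strongest neuron survives. With a larger `k` several neurons can respond to the same input, which is the shared feature selectivity the paragraph refers to.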
Tata and Brainchip demonstration: