Building Temporal Kernels with Orthogonal Polynomials
Abstract
We introduce a neural network named PLEIADES (PoLynomial Expansion In Adaptive Distributed Event-based Systems), belonging to the TENNs (Temporal Event-based Neural Networks) family of architectures. We focus on interfacing these networks with event-based data to perform online spatiotemporal classification and detection with low latency. By virtue of using structured temporal kernels and event-based data, we have the freedom to vary the sample rate of the data along with the discretization step-size of the network without additional fine-tuning. We experimented with three event-based benchmarks and obtained state-of-the-art results on all three by large margins with significantly smaller memory and compute costs. We achieved: 1) 99.59% accuracy with 192K parameters on the DVS128 hand gesture recognition dataset, and 100% with a small additional output filter; 2) 99.58% test accuracy with 277K parameters on the AIS 2024 eye tracking challenge; and 3) 0.556 mAP with 576K parameters on the PROPHESEE 1 Megapixel Automotive Detection Dataset.
1 Introduction
Temporal convolutional networks (TCNs) [18] have been a staple for processing time series data, from speech enhancement [22] to action segmentation [17]. However, in most cases, the temporal kernel is very short (usually of size 3), making it difficult for the network to capture long-range temporal correlations. The temporal kernels are intentionally kept short, because keeping a long temporal kernel with a large number of trainable kernel values usually leads to unstable training, and it requires a large amount of memory for storing the weights during inference. One popular solution has been to parameterize the temporal kernel function with a simple multilayer perceptron (MLP), which promotes stability [28] and yields more compressed parameters, but it often increases the computational load considerably.
Here, we introduce a method of parameterizing temporal kernels, named PLEIADES (PoLynomial Expansion In Adaptive Distributed Event-based Systems), that can in many cases reduce the memory and computational costs compared to explicit convolutions. The design is fairly modular and can be used as a drop-in replacement for any 1D-like convolutional layer, allowing it to perform long temporal convolutions effectively. In fact, we augment a previously proposed (1+2)D causal spatiotemporal network [23] by replacing its temporal kernels with this new polynomial parameterization. This new network architecture serves as the backbone for a wide range of online spatiotemporal tasks, ranging from action recognition to object detection. This network belongs to a broader class of networks named Temporal Event-based Neural Networks (TENNs).
Even though our network can be used for any spatiotemporal data (e.g. videos captured with conventional cameras), in this work we mainly investigate the performance of our network on event-based data (e.g. data captured by an event camera). Event cameras are sensors that generate output events {−1, +1} in response to optical changes in the scene's luminance [4], and can generate sparse data on an incredibly short time scale, usually on the order of 1 microsecond. Event cameras can produce very rich temporal features capturing subtle motion patterns, and allow for flexible adjustment of the effective frame rate fed into our network by varying the binning window size. This makes them suitable for testing networks with long temporal kernels sampled at different step sizes.
In Section 2, we discuss earlier and concurrent works that are related to our polynomial parameterization of temporal kernels, including the modern variants of deep state-space models. In Section 3, we discuss how the generation and discretization of temporal kernels are performed, and additionally how the einsum notation can be used to choose the optimal order of operations to reduce memory and computational loads. In Section 4, we briefly describe the network backbone architecture, including some architectural novelties besides our polynomial temporal kernels. In Section 5, we run three event-based benchmarks: 1) the IBM DVS128 hand gesture recognition dataset, 2) the CVPR 2024 AIS event-based eye tracking challenge, and 3) the PROPHESEE 1 megapixel automotive detection dataset (Prophesee GEN4 dataset). We achieved SOTA results on all three benchmarks with significantly fewer parameters.
The code for building the structured temporal kernels, along with a pre-trained PLEIADES network for evaluation on the DVS128 dataset, is available here: https://github.com/PeaBrane/Pleiades.
2 Related Work
2.1 Long Temporal Convolutions and Parameterization of Kernels
When training a neural network containing convolutions with long (temporal) kernels, it is usually not desirable to explicitly parameterize the kernel values for each time step. First, the input data may not be uniformly sampled, meaning that the kernels need to be continuous in nature, making explicit parameterization impossible. Moreover, as noted in the introduction, explicitly storing one trainable value per time step inflates the memory cost and tends to destabilize training.
The seminal work proposing a memory encoding using orthogonal polynomials in a recurrent state-space model is the Legendre Memory Unit (LMU) [33], where Legendre polynomials (a special case of Jacobi polynomials) are used. The HiPPO formalism [11] then generalized this to other orthogonal functions, including Chebyshev polynomials, Laguerre polynomials, and Fourier modes. Later, this sparked a cornucopia of works on deep state-space models including S4 [12], H3 [2], and Mamba [10], achieving impressive results on a wide range of tasks from audio generation to language modeling. There are several common themes among these networks that PLEIADES differs from. First, these models typically only interface with 1D temporal data, and usually flatten high-dimensional data into 1D before processing [12,37], with some exceptions [21]. Second, instead of explicitly performing finite-window temporal convolutions, a running approximation of the effects of such convolutions is performed, essentially yielding a system with infinite impulse responses where the effective polynomial structures are distorted [31,11]. In the more recent works, the polynomial structures are tenuously used only for initialization, but then made fully trainable. Finally, these networks mostly use an underlying depthwise structure [14] for long convolutions, which may limit the network capacity, albeit reducing the compute requirement of the network.
2.2 Spatiotemporal Networks
There are several classes of neural networks that can process spatiotemporal data (i.e. videos and event frames). For example, one class of networks combines components from spatial convolutional networks and recurrent networks, with the most prominent example being ConvLSTM [29]. These models interface well with streaming spatiotemporal data, but are oftentimes difficult to train (as with recurrent networks in general). On the other hand, we have a class of easily trainable networks that perform (separable) spatiotemporal convolutions, such as the R(2+1)D and P3D networks [32,27], but they were originally difficult to configure for online inference as they do not assume causality. However, it is easy to configure the temporal convolutional layers as causal during training, such that the network can perform efficient online inference with streaming data via circular buffering [23] or by incorporating spike-based [30] or event-based [15] processing.
2.3 Event-based Data and Networks
An event can be succinctly represented as a tuple E = (p, x, y, t), where p denotes the polarity, x and y are the horizontal and vertical pixel coordinates, and t is the time. A collection of events can then be expressed as ℰ = {E1, E2, …}, which is commonly binned into a dense tensor of shape (2, H, W, T) [19]. Other methods include replacing each event with a fixed or trainable kernel [36,5] before evaluating the contribution of that kernel to a given bin. Here, we only use the direct binning and event-volume binning methods, yielding a 4D tensor of shape (2, H, W, T) [23], noting that we retain the polarity channel unlike previous works [36].
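As a concrete illustration of the direct binning just described, the sketch below accumulates a list of events into a (2, H, W, T) tensor. The event format, function name, and bin count are illustrative assumptions, not the exact preprocessing pipeline used in this work.

```python
import numpy as np

def bin_events(events, height, width, num_bins, duration_us):
    """Directly bin events (p, x, y, t) into a dense (2, H, W, T) tensor.

    events: iterable of (p, x, y, t) with p in {-1, +1} and t in microseconds;
    duration_us: total duration of the recording segment being binned.
    """
    frames = np.zeros((2, height, width, num_bins), dtype=np.float32)
    for p, x, y, t in events:
        b = min(int(t / duration_us * num_bins), num_bins - 1)  # time-bin index
        frames[0 if p < 0 else 1, int(y), int(x), b] += 1.0     # separate polarity channels
    return frames

# e.g. a 100 ms segment from a 128x128 sensor binned at 10 ms steps:
# frames = bin_events(events, 128, 128, num_bins=10, duration_us=100_000)
```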
The most popular class of event-based networks is spiking neural networks, which propagate event or spike signals with continuous timestamps throughout the network, usually under the assumption of some fixed internal dynamics for the neurons [6,7]. These networks can be efficient during inference, as typically they only need to propagate 1-bit signals, but they are also incredibly difficult to train without specialized techniques to efficiently simulate the neural dynamics and ameliorate the spiking behaviors. The SLAYER model [30] computes the neuron response using the spike-response model (SRM), which can then be used to convolve temporally with the input signal. This impulse-response kernel is typically limited to a non-adaptive shape, such as an exponential decay (leaky integrate-and-fire neuron) or an alpha function. SLAYER then uses customized CUDA kernels to implement the delayed responses. PLEIADES generalizes the impulse-response kernel of SLAYER by making the kernel a trainable convolution kernel. It is a kernel that can take on any shape, thus providing almost any dynamics to the neuron, not limited by a priori assumptions. The kernel is fully adaptive to the input data, network connectivity, and cost function used to train the network.
Other works have proposed ameliorating the spiking behaviors with differentiable functions that interface more easily with backpropagation (surrogate gradients) [20], so the spiking network can be trained like a standard neural network. Note that a network or hardware can be fully event-based without necessarily using spike-based processing, with a prominent example being Brainchip's Akida hardware [15]. Such hardware can generally support most modern neural networks, with sparsity-aware processing where only nonzero input features are processed at each layer. The network we propose in this work is a standard neural network (though configurable as an SNN), and can efficiently leverage event-based processing given sufficiently high sparsity.
3 Temporal Convolutions with Polynomials
In this section, we discuss: 1) how the temporal kernels are generated from weighted sums of polynomial basis functions; 2) how the temporal kernels are discretized to be used in a neural network; and 3) how the convolution with the input feature tensor can be optimized with respect to the order of operations. From here on, we index the input channels with c, the output channels with d, the polynomial degrees with n, the spatial coordinates with x and y, and times with t, t′, or τ.
3.1 Building temporal kernels from orthogonal polynomials
Jacobi polynomials $P_n^{(\alpha,\beta)}(\tau)$ are a class of polynomials that are orthogonal in the following sense:

$$\int_{-1}^{1} P_n^{(\alpha,\beta)}(\tau)\, P_m^{(\alpha,\beta)}(\tau)\, (1-\tau)^{\alpha} (1+\tau)^{\beta}\, d\tau = \delta_{nm}\, h_n^{(\alpha,\beta)}, \tag{1}$$
where $\delta_{nm}$ is the Kronecker delta, equal to 1 if $n = m$ and 0 if $n \neq m$, and $h_n^{(\alpha,\beta)}$ is some normalization constant that is not important for our discussion. A continuous function can be parameterized by taking a weighted sum of these polynomials up to a given degree $N$, where the coefficients $\{\gamma_0, \gamma_1, \ldots, \gamma_N\}$ are trainable.
When this parameterization is used in a 1D convolutional layer, which typically involves multiple input and output channels, we naturally require a set of coefficients for each pairing of input and output channels. More formally, if we index the input channels with $c$ and the output channels with $d$, the temporal kernel connecting input channel $c$ to output channel $d$ can be expressed as
$$k_{cd}(\tau) = \sum_{n=0}^{N} \gamma_{cd,n}\, P_n^{(\alpha,\beta)}(\tau). \tag{2}$$
We note that training with such structured temporal kernels strictly loses expressivity compared to explicitly parameterized temporal kernels. However, there are several key advantages of using structured kernels that more than compensate for this loss. First, using this form of implicit parameterization allows natural resampling of the kernels during discretization, meaning that the network can interface with data sampled at different rates without additional fine-tuning (see Section 3.2). Second, having a functional basis provides an intermediate subspace to store feature projections, which can sometimes improve memory and computational efficiency (see Section 3.3). Finally, since a Jacobi polynomial basis is associated with an underlying Sturm–Liouville equation, it injects good "physical" inductive biases into our network, making the training more stable and guiding it to a better optimum (see Section 5.1 for empirical support).
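For concreteness, a minimal sketch of Eq. (2) follows, evaluating one temporal kernel on a dense grid. The basis parameters α = β = −0.25 and degree N = 4 match the choices stated in Section 4; the coefficient values are placeholders standing in for trained weights.

```python
import numpy as np
from scipy.special import eval_jacobi

def continuous_kernel(gammas, taus, alpha=-0.25, beta=-0.25):
    """Eq. (2): k(tau) = sum_n gamma_n * P_n^{(alpha, beta)}(tau), for tau in [-1, 1]."""
    degrees = np.arange(len(gammas))                                   # n = 0 .. N
    basis = eval_jacobi(degrees[:, None], alpha, beta, taus[None, :])  # (N+1, num_taus)
    return gammas @ basis                                              # (num_taus,)

taus = np.linspace(-1.0, 1.0, 200)
gammas = np.random.randn(5)               # placeholder coefficients gamma_0 .. gamma_4
kernel = continuous_kernel(gammas, taus)  # one continuous temporal kernel, sampled densely
```

In a full layer, the coefficients carry channel indices, $\gamma_{cd,n}$, exactly as in Eq. (2).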
3.2 Discretization of kernels
In the current implementation of our network, which interfaces with inputs that are binned in time, we need to perform discretization of the temporal kernels accordingly. One method is to take the integral of the temporal kernels over the time bins of interest.
We start by defining the antiderivative of the temporal kernels as
$$K_{cd}(\tau) = \int_{-1}^{\tau} k_{cd}(\tau')\, d\tau' = \int_{-1}^{\tau} \sum_{n=0}^{N} \gamma_{cd,n}\, P_n^{(\alpha,\beta)}(\tau')\, d\tau' = \sum_{n=0}^{N} \frac{\gamma_{cd,n}}{(n+1)!}\, P_{n+1}^{(\alpha,\beta)}(\tau) - \mathrm{const}, \tag{3}$$
where the constant term does not depend on $\tau$ and can be ignored. If we now need to evaluate the integral of $k_{cd}(\tau)$ over the time bin $[\tau_0, \tau_0 + \Delta\tau]$, the discretized kernel value $\bar{k}_{cd}[\tau_0]$ is
$$\bar{k}_{cd}[\tau_0] = K_{cd}(\tau_0 + \Delta\tau) - K_{cd}(\tau_0) = \sum_{n=0}^{N} \gamma_{cd,n}\, \frac{P_{n+1}^{(\alpha,\beta)}(\tau_0 + \Delta\tau) - P_{n+1}^{(\alpha,\beta)}(\tau_0)}{(n+1)!} = \sum_{n=0}^{N} \gamma_{cd,n}\, \bar{P}_n^{(\alpha,\beta)}[\tau_0], \tag{4}$$
where $\bar{P}$ denotes the appropriately defined discrete polynomials, recovering the same form as Eq. 2. Note that Eq. 4 can be considered a generalized matrix multiplication in which the dimension $n$ (the polynomial basis dimension) is contracted, as discussed further in Section 3.3. See Fig. 1 for a schematic representation of the operations for generating temporal kernels for multiple channels.
Under this discretization scheme, it is very easy to resample the temporal kernels (either downsampling or upsampling) to interface with data sampled at arbitrary rates, i.e. arbitrary bin-sizes for event-based data. This means that the network can be trained at a given step-size $\Delta\tau$ and then run at a different step-size simply by re-discretizing the basis polynomials, without retraining.
Figure 1: An example of generating discrete temporal kernels for multiple channels, based on trainable coefficients and fixed orthogonal basis polynomials. Here, we consider a (depthwise) convolution with 3 channels, 4 basis polynomials, and a kernel size of 5. The shaded areas can be interpreted as discretized values. The coefficients can be organized as a 3×4 matrix, and the discretized basis polynomials as a 4×5 matrix. The matrix multiplication of the two then yields the final discretized kernels for the channels as a 3×5 matrix.
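The sketch below mirrors the toy example in Fig. 1: the discrete basis is obtained by integrating each Jacobi polynomial over equal-width bins of [−1, 1] (here by simple numerical quadrature rather than the closed-form antiderivative of Eq. (3)), and the discretized kernels are then a single matrix multiplication with the coefficient matrix. The shapes follow the figure and are illustrative only.

```python
import numpy as np
from scipy.special import eval_jacobi

def discrete_basis(num_basis, kernel_size, alpha=-0.25, beta=-0.25, samples_per_bin=64):
    """P_bar[n, t]: integral of P_n^{(alpha, beta)} over each of `kernel_size`
    equal-width bins of [-1, 1], approximated here by trapezoidal quadrature."""
    edges = np.linspace(-1.0, 1.0, kernel_size + 1)
    degrees = np.arange(num_basis)
    basis = np.empty((num_basis, kernel_size))
    for t in range(kernel_size):
        taus = np.linspace(edges[t], edges[t + 1], samples_per_bin)
        values = eval_jacobi(degrees[:, None], alpha, beta, taus[None, :])
        basis[:, t] = np.trapz(values, taus, axis=1)
    return basis

coeffs = np.random.randn(3, 4)                     # 3 channels x 4 basis polynomials (Fig. 1)
P_bar = discrete_basis(num_basis=4, kernel_size=5)
kernels = coeffs @ P_bar                           # (3, 5) discretized depthwise kernels
# Resampling (Section 3.2): the same coeffs can be contracted with a basis discretized
# at a different kernel size / step size, e.g. coeffs @ discrete_basis(4, 10).
```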
3.3 Optimal order of operations
Now that the kernels are discretized, to lessen the burden of notation we can employ the Einstein summation notation, or einsum, to reduce the above equation to

$$\bar{k}_{cd\tau} = \gamma_{cdn}\, \bar{P}_{n\tau}, \tag{5}$$
where the repeating index $n$ is assumed to be summed over (or contracted) on the right-hand side, corresponding to summing over the polynomial basis. See Appendix A.1 for a detailed description of the contraction rules. If we now wish to convolve the temporal kernel $\bar{k}_{cd\tau}$ with a spatiotemporal input feature tensor $u$, giving the output $y$, we can write
$$y_{dxyt'} = u_{cxyt}\, M^{(\bar{P})}_{ntt'}\, \gamma_{cdn}, \tag{6}$$
where $M^{(\bar{P})}$ is the convolution operator matrix, a sparse Toeplitz matrix generated from $\bar{P}$ (see Appendix A.3). If a depthwise convolution is performed [14], then the equation simplifies to
$$y_{cxyt'} = u_{cxyt}\, M^{(\bar{P})}_{ntt'}\, \gamma_{cn}, \tag{7}$$
as we only have parallel connections between input and output channels (both denoted by $c$), while the spatial coordinates $x$ and $y$ simply act as batch dimensions.
All the einsum operations are associative and commutative, so the contractions can be performed in any order without changing the result; different orders, however, can incur very different memory and computational costs.
In practice, we select the contraction path by optimizing for memory or computational usage [9], depending on the training hardware and cost limitations. This is possible because the memory and computational costs can be calculated for any contraction path, given the dimensions of the contraction indices (tensor shapes). See Appendix A.2 for how these costs can be calculated. The optimal contraction path can be selected automatically using the opt_einsum library.
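As an illustration (with toy shapes, and a dense stand-in for the sparse Toeplitz operator), the snippet below lets opt_einsum search for a contraction path for Eq. (6) and report the estimated cost before executing it.

```python
import numpy as np
import opt_einsum as oe

C, D, N, H, W, T = 16, 32, 5, 24, 24, 50
u = np.random.randn(C, H, W, T)       # input feature tensor u_{cxyt}
M = np.random.randn(N, T, T)          # dense stand-in for the Toeplitz operator M_{ntt'}
gamma = np.random.randn(C, D, N)      # kernel coefficients gamma_{cdn}

# Search for a contraction path; the index s plays the role of t' in Eq. (6)
path, info = oe.contract_path('cxyt,nts,cdn->dxys', u, M, gamma, optimize='optimal')
print(info)                           # chosen pairwise contractions, FLOP and memory estimates

y = oe.contract('cxyt,nts,cdn->dxys', u, M, gamma, optimize=path)  # output y_{dxyt'}
```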
4 Network Architecture
Figure 2: A representative network used for eye tracking. The backbone consists of 5 spatiotemporal blocks. "Full conv" is short for full convolution and is denoted by darker blocks, and "DWS conv" is short for depthwise-separable convolution and is denoted by lighter blocks. The detection head is inspired by CenterNet, with the modification that the 3×3 convolution is made depthwise-separable and a temporal layer is prepended to it.
The main network block is a spatiotemporal convolution block, factorized as a (1+2)D convolution. In other words, we perform a temporal convolution on each spatial pixel followed by a spatial convolution on each temporal frame, similar in form to the R(2+1)D convolution block [32], or a previously proposed (1+2)D network for online eye tracking [23]. Furthermore, each of the temporal and spatial convolutional layers can be additionally factorized as a depthwise-separable (DWS) layer [14] to further reduce the computational costs. For every temporal kernel (for every channel pairing, every layer, and every network variant), we use α = −0.25 and β = −0.25 for the Jacobi polynomial basis functions, with degrees up to 4.
Several other minor design choices we made for our networks (besides polynomial kernels) include the following (a sketch of the resulting spatiotemporal block is given after the list):

- We keep every operation in our network fully causal, such that the network can be easily adapted for online inference with minimal latency. Importantly, we perform only causal temporal convolutions.
- After every temporal convolution, we perform a causal Group Normalization [23] with groups=4 [8].
- We apply a ReLU activation after every convolution layer, and also within every DWS layer. The activation function is intentionally kept simple, for ease of implementation on mobile or edge devices, and to promote activation sparsity in the network.
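A minimal PyTorch sketch of one such causal (1+2)D block follows. The layer sizes are placeholders, the temporal weights here are plain trainable parameters rather than the polynomial-generated kernels of Section 3, and standard GroupNorm stands in for the causal variant used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalBlock(nn.Module):
    """Causal (1+2)D block: temporal conv per pixel, then spatial conv per frame."""

    def __init__(self, in_ch, out_ch, kernel_t=10, gn_groups=4):
        super().__init__()  # out_ch should be divisible by gn_groups
        self.kernel_t = kernel_t
        self.temporal = nn.Conv1d(in_ch, out_ch, kernel_t)     # made causal by left padding
        self.norm = nn.GroupNorm(gn_groups, out_ch)             # stand-in for causal GroupNorm
        self.spatial = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):                                        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # temporal convolution applied independently at every pixel, padded on the past only
        z = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        z = F.relu(self.norm(self.temporal(F.pad(z, (self.kernel_t - 1, 0)))))
        c_out = z.shape[1]
        # spatial convolution applied independently at every frame
        z = z.reshape(b, h, w, c_out, t).permute(0, 4, 3, 1, 2).reshape(b * t, c_out, h, w)
        z = F.relu(self.spatial(z))
        return z.reshape(b, t, c_out, h, w).permute(0, 2, 1, 3, 4)   # (B, C_out, T, H, W)
```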
For tasks requiring object tracking or detection (see Sections 5.2 and 5.3), we attach a temporally smoothed CenterNet detection head to the backbone (see Fig. 2), consisting of a DWS temporal layer followed by the CenterNet convolutions, with the 3×3 convolution made depthwise-separable [35] and ReLU activations in between. Since our backbone is already spatiotemporal in nature and capable of capturing long-range temporal correlations, we do not use any additional recurrent heads (e.g. ConvLSTMs) or temporal-based loss functions [24].
5 Experiments
We conduct experiments on standard computer vision tasks with event-based datasets. For all baseline experiments, we preprocess the event data into 4D tensors of shape (2, H, W, T) (see Appendix B).
5.1 DVS128 Hand Gesture Recognition
The DVS128 dataset contains recordings of 10 hand gesture classes performed by different subjects [1], recorded with a 128×128 dynamic vision sensor (DVS) camera. We use a simple backbone consisting of 5 spatiotemporal blocks. The network architecture is almost the same as that shown in Fig. 2, with the exception that the detection head is replaced by a spatial global-average pooling layer followed by a simple 2-layer MLP to produce classification logits (technically a pointwise Conv1D layer during training). This means that the output produces raw predictions at 10 ms intervals, which are already surprisingly high-quality by themselves. With additional output filtering on the network predictions, the test accuracy can be pushed to 100% (see Table 1). In addition, we compare the PLEIADES network with a counterpart that uses unstructured temporal kernels, or simply a Conv(1+2)D network [23], and find that PLEIADES has better performance with a smaller number of parameters (due to the polynomial compression).
Table 1: The raw 10-class test accuracy of several networks on the DVS128 dataset. With the exception of models marked with an asterisk, no output filtering is performed on the networks. PLEIADES is evaluated on output predictions where all temporal layers process nonzero valid frames, which incurs a natural warm-up latency of 0.44 seconds (see Section 5.1). Additionally, a majority filter with a window of 0.15 seconds can be applied to the raw PLEIADES predictions to reach 100% accuracy.
| Model | Accuracy (%) | Parameters |
|---|---|---|
| PLEIADES + filtering* | 100.00 | 192K |
| PLEIADES | 99.59 | 192K |
| Conv(1+2)D | 99.17 | 196K |
| ANN-Rollouts [16] | 97.16 | 500K |
| TrueNorth CNN* [1] | 96.59 | 18M |
| SLAYER [30] | 93.64 | - |
Unfortunately, previous studies lacked a unified standard for performing evaluations on this dataset, so it is not entirely clear what metrics are being reported. In particular, some networks perform online inference, while others process entire recording segments before producing a prediction. Here, we produce a more general accuracy vs. latency relationship for our network variants, so as to establish multiple Pareto frontiers for comparison. In the context of this experiment, the latency is simply the number of event frames (from the beginning of the recording) that the network has "seen" before making a prediction, multiplied by the bin size.
Note that if we wish to guarantee that every temporal layer is working with nonzero valid input features, then the network will have a natural latency equal to (number of temporal layers) × (temporal kernel size − 1) × (step size). Additionally, output filtering can be applied, as in [1], to boost performance at the cost of higher system latencies. See Appendix B.1 for details on how the network can be configured to run at different latencies.
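As a simple illustration of the kind of output filtering referred to above, the sketch below applies a causal majority vote over the most recent per-frame class predictions; a window of 15 frames corresponds to 0.15 s at the 10 ms step size of Table 1. The function name and interface are ours, not the exact filtering implementation used in the experiments.

```python
import numpy as np
from collections import Counter

def majority_filter(frame_preds, window=15):
    """Causal sliding majority vote over per-frame class predictions."""
    filtered = np.empty_like(frame_preds)
    for i in range(len(frame_preds)):
        start = max(0, i - window + 1)                 # only past and current frames
        filtered[i] = Counter(frame_preds[start:i + 1]).most_common(1)[0][0]
    return filtered

# e.g. raw per-frame predictions at 10 ms intervals -> smoothed predictions:
# smoothed = majority_filter(raw_preds, window=15)    # 15 frames ~ 0.15 s
```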
Figure 3: (Left) The accuracy vs. latency curves for different PLEIADES variants with a kernel size of 10 but different step sizes on the DVS128 dataset. A masking augmentation is optionally used to randomly mask out the starting frames of dataset segments during training, in order to stimulate faster responses in the network. (Right) The accuracy vs. latency curves for different PLEIADES variants with an effective temporal window of 100 ms for each temporal layer, but having different step sizes. The benchmark network is trained with a kernel size of 10 and a step size of 10 ms, and the other variants are resampled without additional fine-tuning. A network variant trained without any structured temporal kernel is also displayed as a baseline reference.
There are several ways to "force" the network to respond faster. The natural way is to simply use a smaller kernel size or binning window (temporal step size). Here, we test two model variants with binning windows of 5 ms and 10 ms, keeping the temporal kernel size fixed at 10 (meaning that the 5 ms variant has a shorter effective temporal window). Another way is to randomly mask out frames starting from the beginning of the input segment, to force the network to "respond" only to the more recent input frames, such that the effective response window of the network is shorter.
The benchmark network is trained with a step-size Δτ of 10 ms, which is also the event data bin-size. Here, we change the step-sizes to 5 ms and 20 ms (upsampling and downsampling) without fine-tuning the network, and simply re-discretize the basis polynomials under the new step-sizes (see Section 3.2) and re-bin the event data. To keep the effective temporal window (and thus the network behavior) the same, the 5 ms step-size network has a kernel size of 20, and the 20 ms step-size network has a kernel size of 5. We see from the right plot of Fig. 3 that the accuracy vs. latency curve does not vary much under time-step resampling. In the same plot, we also show the performance of the Conv(1+2)D baseline network with unstructured kernels, denoted as "free kernels", for comparison against the PLEIADES variants.
5.2 AIS 2024 Event-based Eye Tracking
We use the same backbone as the network for DVS128 hand gesture recognition, but with a temporal step-size of 5 ms. We simply replace the 2-layer MLP classification head with the CenterNet detection head and loss adapted from [23]. Note that we omit predictions of the bounding box sizes, and only predict the center points of pupils for this challenge. See Fig. 2 for a drawing of the network architecture.
Table 2: The 10-pixel, 5-pixel, and 3-pixel tolerance accuracies for the CVPR 2024 AIS eye tracking challenge. The performances of other models are extracted from [34].
| Model | p10 | p5 | p3 | Parameters |
|---|---|---|---|---|
| PLEIADES + CenterNet | 99.58 | 97.95 | 94.94 | 277K |
| MambaPupil | 99.42 | 97.05 | 90.73 | - |
| CETM | 99.26 | 96.31 | 83.83 | 7.1M |
| Conv(1+2)D | 99.00 | 97.97 | 94.58 | 1.1M |
| ERVT | 98.21 | 94.94 | 87.26 | 150K |
| PEPNet | 97.95 | 80.67 | 49.08 | 640K |
5.3 Prophesee GEN4 Road Scene Object Detection
The Prophesee GEN4 dataset is a road-scene object detection dataset collected with a megapixel event camera [24]. The dataset contains around 14 hours of recordings covering both daytime and nighttime, and both rural and urban driving scenarios. It contains 7 classes, but we evaluate the mAP on only 2 classes, cars and pedestrians, to be consistent with the original evaluation pipeline [24]. The backbone network is an hourglass network built from a stack of spatiotemporal blocks with a temporal step size of 10 ms, and the detection head is again the CenterNet head described in Section 4. We do not use any non-max suppression on the bounding box outputs, as the original CenterNet pipeline is suggested to be robust against spurious bounding box predictions, a robustness further augmented by the implicit temporal consistency of our network. See Appendix B.2 for details of the network architecture and the training pipeline.
Table 3: The performance of PLEIADES with a CenterNet detector compared to the models introduced in the original benchmark paper.
| Model | mAP | Parameters |
|---|---|---|
| PLEIADES + CenterNet | 0.556 | 0.576 M |
| RED | 0.43 | 24.1 M |
| Gray-RetinaNet | 0.43 | 32.8 M |
6 Limitations
Since the temporal layers of our network act like finite-window filters with circular buffers, a practical limitation may be the high memory cost of explicitly buffering the moving window of recent input features, which worsens if the temporal kernel size or spatial dimensions are large. However, by virtue of the polynomial structure of our temporal kernels, we can derive estimates of an online running projection of the past input features onto the fixed polynomial basis functions [31,11], an idea also briefly discussed in Section 3.3.
These compressed coefficients are analogous to the internal states of recurrent networks; a pointwise operation between them and our pre-trained kernel coefficients then estimates the would-be output of the original finite-window convolution.
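A minimal sketch of this factored view follows (toy shapes, depthwise case). With an explicit buffer, the two-step computation below is exact; the idea discussed above is to replace the explicit projection with a running, buffer-free estimate, which is not shown here.

```python
import numpy as np

T, N, C = 10, 5, 16                      # window length, basis size, channels (toy values)
P_bar = np.random.randn(N, T)            # discrete polynomial basis (as in Section 3.2)
gamma = np.random.randn(C, N)            # pre-trained depthwise kernel coefficients
window = np.random.randn(C, T)           # buffered last T input frames for each channel

state = window @ P_bar.T                 # (C, N): projection of the window onto the basis
y = np.einsum('cn,cn->c', gamma, state)  # pointwise combination with the trained coefficients
# y matches np.einsum('ct,ct->c', gamma @ P_bar, window), the explicit windowed convolution
```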
7 Conclusion
We introduced PLEIADES, a spatiotemporal network with temporal kernels built from orthogonal polynomials. The network achieved state-of-the-art results on all the event-based benchmarks we tested, and its performance is shown to be stable under temporal resampling without additional fine-tuning. Currently, the network is configured as a standard neural network, which by itself is already ultra-light in memory and computational costs. To truly leverage the full advantage of event-based processing, we can consider using intermediate loss functions to promote activation sparsity[23]. Another direction is to adapt/convert this architecture into a spiking system by leveraging the structure of the polynomial kernels to provide richer dynamics to neurons beyond the typical and arbitrary leaky integrate-and-fire neurons, without adding computational complexity in inference or training. It offers the possibility for such spiking systems to be trained as a convolutional network without having to simulate any differential equation of the internal neural dynamics.
8 Acknowledgement
We would like to acknowledge Nolan Ardolino, Kristofor Carlson, M. Anthony Lewis, and Anup Varase (listed in alphabetical order) for discussing ideas and offering insights for this project. We would also like to thank Daniel Endraws for performing quantization studies on the PLEIADES network, and Sasskia Brüers for help with producing the figures.