## ESS 265 Chapter 9. Linear Prediction Filters and Neural Networks by Paul O'Brien

In space physics, or its nascent sibling, space weather, it is often necessary to extract knowledge of a physical system from a set of input-output data. This presentation describes two methods for doing just that: linear prediction filtering and neural networks. In both cases, we describe the method of construction, which would be sufficient for the space weather forecaster. But, we also include some of the theoretical bases of these two techniques, and we demonstrate how they can be used to gain physical understanding. In the end, however, these methods can merely tell us what the system does and we must use theoretical space physics to understand how the system does it.

9.1 Linear Prediction Filters

We will begin our discussion with linear prediction filters (LPFs). Linear prediction filters are used to describe the relation of one output, or system variable, to one or more inputs. They are limited to describing systems which are governed by ordinary differential equations. We will, however, describe a method for achieving nonlinearity with linear prediction filters. The inputs to a system can be anything from external forcing to internal state variables. Space physics provides us the insight we need to choose the appropriate inputs. Linear prediction filter can tell us that a certain input is a poor choice. One common implementation of this method is to forecast some interesting system variable from an external driver. Another common implementation is to nowcast an unknown system variable from other known system variables. In both cases, we can peer into the linear prediction filter model to learn what the system is doing. We often do this with what is called an impulse response, which we will demonstrate later. To introduce the linear prediction filter let us examine the mathematical formula that defines an autoregressive moving average filter in equation (1). There are two terms on the right. The first term is the autoregressive part and the second the moving average.

The autoregressive part (AR) relates the present value of the output to previous values of the output. We generally associate the autoregressive part with the internal dynamics of a system, like oscillation and decay. Technically speaking a strictly autoregressive filter contains the first (no time lag) term of what we have defined to be moving average of the righthand side.

The moving average (MA) part relates the present value of the output to the present and previous values of the input. We generally associate the moving average part with the dynamics of the interaction between the system and the external inputs. The moving average part can encapsulate recurrence or time delays caused by storage and release.

In both cases we assume a constant sampling rate, defined by t. At every time lag, we have a coefficient which tells us how the present or previous values of the input and output effect the present state of the system. Later we will describe how to interpret these coefficients, and we will discuss some special cases which give rise to ambiguity between the autoregressive and moving average parts of an ARMA filter.

Linear prediction filters stem from the broader category of linear filtering. In linear filtering the Fourier spectra of the input and output signals are manipulated to find a transfer function. The inverse Fourier transform of the transfer function gives the response of the system to an input impulse (delta function). The convolution of the transfer function and the input reproduces the output. From the Fourier analysis formalism, we receive several alternate names, for linear prediction filters including: convolution filters, moving average, infinite impulse filter, finite impulse response filter and autoregressive moving average filter.

As we saw above, the response of the system to an input impulse (delta function) can be used to reconstruct the output series from the input series. It is easy to see that an moving average filter is just a discrete version of the impulse response. In mathematics, this is the Green's function. For systems governed by ordinary differential equations (ODEs), the Green's function is unique to the governing ordinary differential equation. To put it another way, if we know the Green's function, we can determine the underlying ordinary differential equation. This is demonstrated in equations (2) to (4) for a first order ordinary differential equation.

For a discretized series

We have related moving average filters to mathematic objects, but we still need to understand what autoregressive filters represent. It turns out that autoregressive filters are just discretized forms of differential equations. Equations 9.5 to 9.7 show how a first-order ordinary differential equation can be discretized. It is apparent that a discretized ordinary differential equation is an autoregressive filter, although it is not nearly as easy to demonstrate the reverse. Complex autoregressive filters can be reinterpreted as ordinary differential equations, but it is unlikely that such an autoregressive filter is the best description of the system. Therefore, except for a few trivial cases, pure autoregressive filters are not recommended for system diagnosis.

The ordinary differential equation

Now we turn to the intensive task of determining the coefficients of linear prediction filter. In the linear prediction filter equation, there are N+M unknowns. If we know O(tk) at N+M or more times tk, we can find the optimal set of coefficients. Usually, we wish to have an overdetermined system, where kmax >>N+M. We construct a matrix equation of the form Ax = y as shown in equation (8). In this case, we want to solve for x which contains the coefficients of the filter (the ai's and bj's).

The A matrix is built row by row, with each time step tk contributing one row. We see that the columns of A include the known values of the output and input at times at or before tk. Finally, we note that the rows of the y vector contain the outputs at each tk. It is customary to construct each subsequent row by shifting the time series by one t, such that tk = t0+kt.

The matrix problem is solved using the method of least-squares, x = (ATA)-1ATy . If available, singular value decomposition should be used, as it results in more stable linear prediction filters. In either case, the solution of this matrix equation provides us with our filter coefficients.

So far, we have largely treated the case of one input. However, it is easy to generalize the formalism to multiple inputs. In such a case, there would be multiple moving average parts, although still only one autoregressive part.

In certain specific cases, a moving average filter can be replaced by a very short, nearly trivial autoregressive or autoregressive moving average filter. One very common example is an exponential decay or oscillation that can better be described with an autoregressive moving average filter than a moving average filter because the moving average filter coefficients are related by a multiplicative recursion relationship: bj+1 = e-tbj. Adding a short autoregressive part can accommodate this exponential decay with more economy than a purely moving average filter.

When constructing linear prediction filters, it is important to understand how to handle data gaps. In some problems, it is possible to interpolate (with a line or a spline) over data gaps. However, this can wash out the coefficients of a linear prediction filter. If we consider that an interpolated point is actually a combination (linear or otherwise) of its neighbors, we can see how this would cause a linear prediction filter to be unable to resolve the different influence of that point and its neighbors. This results in a sort of smearing of the coefficients, which is tolerable in some cases, but is usually not desired. To avoid this, we usually omit any time tk for which one of the inputs or outputs is missing. If interpolation is necessary, an autoregressive or ARMA filter is less sensitive to interpolation in the lag outputs, as that information is usually very overdetermined. ]

Once we have a reliable set of linear prediction filter coefficients, we can identify the underlying dynamics by examining plots of "coefficient vs. lag." This is easiest for moving average filters, where we know that the coefficients represent a discretized Green's function. Examples are shown in Figure 9.1. For autoregressive filters, exponential decay occurs as a single spike at the first lag; oscillations appear as adjacent spikes of opposite sign in the first two lags. Complex autoregressive filters can be very difficult to interpret, so a different approach to the model may be indicated.

 Figure 9.1. Coefficient versus lag for three moving average linear prediction filters: one in which the time series showed exponential decay; one in which the time series showed recurrence; and one in which the output was controlled by the derivative of the input.

Now we turn to the question of nonlinearity. Linear prediction can handle only ordinary differential equations, which means that they can only model linear systems. But through localization we can add some nonlinearity to our linear prediction filter models.

The first approach to localization is time-localization. In this approach, we reconstruct the linear prediction filter at each time step based only on the most recent data. Essentially, we are assuming that the trajectory of the system is determined by some arbitrarily complex set of state variables which are contained in the recent time history of the input and output. This can provide very accurate forecasts, but is often very hard to interpret physically.

The second (recommended) approach, is localization in state-space. We begin with the assumption that the interaction of the input time series and the output time series is influenced by the state of the system. We can determine the state of the system either based on the values of the input and output variables, or we can introduce additional variables which indicate the state of the system. This can provide multivariate nonlinearity, which is a very powerful generalization. In either case, state-space localization is achieved by partitioning (or binning) the data according to the localization variables. One linear prediction filter is calculated for each partition (or bin). If we choose the input or output to be the localization variable, then we achieve low-order nonlinearity. If we choose an additional variable for localization, we are still enforcing the linear criterion, but we are parameterizing that linear behavior in terms of the additional variable. By learning how the system is different in the different partitions (bins), we can identify the nonlinear features of the system. We will demonstrate state-space localization below.

In third example we are trying to learn about local current systems by relating the global current system to a local measurement. The global current system is the magnetospheric ring current, and the local current system is believed to be the partial ring current. These currents create magnetic fields, which are measured on the ground as H. Dst is the global average of 4 or more such measurements, and is therefore linearly related to each. However, the relationship between Dst and an individual H changes as the Earth rotates, moving the ground station beneath the two currents. We can, therefore, parameterize this relationship in the local time (lt) of the station. The filter itself is nearly trivial: an moving average filter with one coefficient b0(lt). We will do this for San Juan (SJG).

We have constructed the localized moving average filter for San Juan by partitioning the Dst and H data according to the local time of the station when the measurements were taken. In each partition we calculated a separate moving average filter. In Figure 9.2 we plot the reciprocal of the one filter coefficient,b0(lt). We see that Dst is less intense than H near local dusk. We attribute this to a local intensification of the ring current system, the partial ring current, which flows in the dusk sector. It might be tempting to instead assume a reverse partial ring current in the dawn sector, but the magnetospheric configuration only allows these type of currents to flow in one direction.

 Figure 9.2. Localized moving average filter for San Juan that predicts Dst from the San Juan H-component. The reciprocal of the filter coefficient b0 (lt) is shown versus local time (lt). The Dst index is believed to be less intense than the localized H-depression in the afternoon and evening sectors because of an enhanced ring current in this region.

In our next demonstration, we would like to know how Dst is related to the solar wind. We know that a good "coupling function" between the solar wind and Dst is VBs. VBs is the velocity of the solar wind multiplied by the southward component of the interplanetary magnetic field (IMF). When the IMF is northward VBs is zero. Otherwise VBs is taken to be positive.

We have an extensive database of hourly measurements of VBs and Dst, so we construct a moving average filter to learn how VBs gives rise to disturbed Dst. We begin with a very long filter (M = 50) because we know that it takes Dst hours to days to recover from the influence of a VBs packet.

 Figure 9.3. Coefficients of the moving average filter for Dst dependence on VBs. These coefficients represent the Green's function or impulse response of the process.

We have constructed the moving average filter according to the process outlined earlier, with no localization. We plot the coefficients bj versus lag (jt) in Figure 9.3. We see that the Green's function here is essentially an exponential decay, beginning in the first hour after the VBs input. Equation 9.11 shows how we would interpret this Green's function as a differential equation.

It is obvious that a short autoregressive moving average filter could do the same job. The autoregressive moving average filter requires only two coefficients to reproduce most of the information in the long moving average filter.

9.2 Neural Networks

So far we have examined linear prediction filters, which constitute a mostly linear approach to data analysis. Now we turn to the more complex neural networks. Neural networks have similar uses to linear prediction filters, and can be applied to the same general set of problems. They can reproduce more complex input-output relationships than linear prediction filters. While it is possible to construct neural networks which map multiple inputs to multiple outputs, this economy is generally not useful in space physics, where we would construct a separate model for each output.

The data which was used in the linear prediction filter to construct the A matrix and the y vector will be referred to as the training data in our treatment of neural networks. They are excellent at performing nonlinear interpolation between points in the training data, but are not reliable for extrapolating beyond the information in the training sample. Linear prediction filters are much stronger at extrapolation, but can generally only provide linear interpolation. Neural networks are excellent for modeling complex phenomena, but do not reveal the physics behind the model in a familiar way. We will demonstrate a powerful technique for extracting physical insight from a neural network. This will consists of running the neural networks model on simulated data to isolate the system response to individual inputs. In global magnetohydrodynamic modeling, this approach is used to see how the magnetosphere responds to a sustained southward interplanetary magnetic field.

Many exotic types of neural networks exist for various tasks. Some are excellent at pattern recognition (images and speech), others are excellent at reconstructing degraded data, and others are excellent at categorizing. We will stick to the types analogous to linear prediction filters.

Neural networks are based loosely on biological neural systems. In biology, a neural network is composed of connections between individual neurons. In an artificial neural network, the connections are replaced by weights and the neurons are modeled with perceptrons. Since we do not know exactly how biological neural networks learn, we have derived some approximate training schemes for our artificial neural networks.

When it comes to qualifying, living neural networks often perform much better than their numerical counterparts. However, when it comes to quantifying, artificial neural networks are in a class of their own. For example, it is very difficult for an artificial neural network to recognize a cow from a new perspective, but it is very easy for a human to do so. On the other hand, it is difficult for a human to estimate how fast an object is moving, but an artificial neural network can do so very easily. To some extent these statements are true due to the sensory input associated with the biological and artificial neural networks, but they are largely features of the networks themselves. One might say that an artificial neural network is too exacting to be able to make the generalizations that humans easily make.

 Figure 9.4. Neural network topology of a standard feed-forward neural network with no recursive connections.

We begin our discussion of neural networks with topology. The type of neural networks we'll be discussing are based on the Feed Forward neural network. Figure 9.4 shows a typical representation of an artificial neural network. Also provided is the mathematical representation of the neural network. The arrows in the diagram correspond to the coefficients b, wi, ci, and vij. We refer to these coefficients as weights and biases or simply as weights. The boxes marked hi and Oi are perceptrons. Since the value of each hi does not correspond to any measurable quantity, we refer to these as hidden units which make up the hidden layer of the neural network. At each hidden unit a weighted sum of the inputs is calculated; then a nonlinear activation function is applied. This is where the nonlinearity of neural networks is introduced. The output is then just a weighted sum of the hidden units.

In continuous systems (as opposed to discrete ones) the activation function is generally a smoothed step function (a sigmoid). The sigmoid can be defined in terms of hyperbolic tangent or an equivalent expression [1+e-x]-1. The latter is marginally better because its derivatives are slightly simpler than those of tanh. In either case, the more hidden units one employs, the more complexity the neural network can handle; we will find out that this is not always a good thing.

A strictly feed forward neural network is equivalent to a moving average filter. Autoregressive behavior is achieved through recurrence.

 Figure 9.5. Recurrent neural networks. (Left) An output recurrent network, useful when the output depends on a previous output. (Right) An Elman network, useful when the output depends on the time history of the inputs.

There are two primary types of recurrent neutral networks as illustrated in Figure 9.5. The first (recommended) type and shown in the left reproduces autoregressive behavior: the output from the previous time step is used as an input at the next time step. This is implicit recurrence, and generally training is performed with the time-delayed output provided from real measurements.

The second type of recurrence is achieved with an Elman Network illustrated on the right hand side of Figure 9.5. In such a network one or more of the hidden units from the previous time step are used as input at the next time step. This allows the output to depend on the time history of the inputs. However, since the hidden units do not correspond to measurements, it is necessary to train and evaluate such networks serially. In fact, continuous time series are necessary for training and evaluation. This makes batch optimization (training) impossible.

Now we address the complex issue of network training. We begin by assigning random values to the network weights. We then train the network by iteratively adjusting the weights to reduce the errors between the network output and the output in the training set. Generally, the error we seek to minimize is mean squared error, but in some pathological cases, relative error must be used. We will demonstrate the theory for MSE optimization, which can be extended with some labor to relative error optimization.

As with linear prediction filters, we want our network coefficients to be overdetermined. Generally it is good to have at least 10 times more training samples than there are weights. Our goal, then, is to achieve a fit which will not only match the training data, but will also work well when new data are provided to the network. We refer the performance on the training data as in sample performance, and that on data which was not used during training as out of sample performance. Poor out of sample performance is an indication of overfitting. Generally the solution to overfitting is a reduction in the number of hidden units or even in the number of inputs. We will return to this point after we have discussed the training algorithms.

The simplest way to optimize a nonlinear system is through gradient descent. We define the error E as in (9.12), where the error at a time tk is defined as the difference between the actual output and the network output at that time. E is proportional to the mean squared error of the network.

where Õ is the neural network output and W is a vector holding all the neural networks weight and basis.

We combine all the weights and biases of the network into one vector W, which we refer to hereafter as the weights. The Jacobian J encodes each weight's influence on the output at each time step. J is a huge matrix, having one row for every time step, and one column for every weight. If we wish to descend the error gradient, we want to adjust the weights by a small amount in the opposite direction given by the error gradient. It is relatively simple calculus to show that the error gradient is given by JTe. The learning rate , , is a positive number less than 1 which determines how quickly the network descends the error gradient. Because the error gradient may not smoothly lead to the optimal set of weights, it is often necessary to stabilize gradient descent with a variable learning rate and with momentum . With momentum, we include a portion of the previous weight adjustment in the new weight adjustment. The momentum should also be a positive number less than 1.

For explicitly recurrent networks, gradient descent should be done serially for each tk. This is because it is necessary to use the hidden units from each time step as inputs to the next time step.

Some readers may be familiar with back propagation . Back propagation is only necessary for networks for which J cannot be calculated due to discontinuous activation functions, and can be completely replaced with gradient descent.

For networks which are not explicitly recurrent Levenberg-Marquardt training is a very good method. This training is much faster than gradient descent, employing both Newton's method and gradient descent.

If we Taylor expand the error vector e to first order in the weights, we get a matrix equation: 0 = e - JW. We can solve this equation explicitly, which is Newton's method. However, if we introduce a slight modification, we combine Newton's method and gradient descent to arrive at Levenberg-Marquardt training. This derivation is shown in equations 9.17 to 9.21. We see that for large µ, we recover gradient descent with = 1/µ. For small µ, we recover Newton's method.

The training algorithm is simple: when Newton's method works (the error decreases with each step), use it (decrease µ), when it fails, use gradient descent (increase µ).

Levenberg-Marquardt training can be memory intensive because J is very large, but a careful implementation of the algorithm can ensure that the largest matrix in memory is JTJ, which is square and symmetric, and has each dimension as long as the weight vector.

Once we know how to train the network, we must take care that it does not overfit the training data. This is called generalization . The first thing we do is reserve a fraction of the training sample for out of sample testing. Identical network topologies can perform differently because they start out with different initial weights. We should, therefore, train several networks (~5), test them on the out of sample reserve, and choose the one with the least out of sample error. We can also starve the neural network by reducing the number of hidden units until the error of the resulting network error is the same in sample and out of sample.

Even with a properly generalized network, we cannot trust all of its output. As we mentioned before, neural networks are exceptional at interpolation, but extrapolation is always dubious. Since the activation function (sigmoid or tanh) saturates, extrapolation is extremely unreliable. To determine when the neural network is extrapolating rather than interpolating, one must know what the distribution of the training data is. This can be rather difficult with many inputs. The best way to address this problem is by plotting histograms of the inputs in the training set, to see which values are most common and which values are rare or absent from the training set.

It is often advantageous to remove one or several inputs to see if the neural network can still perform as well. If the neural network performs equally well with or without a given input, then that input is not influencing the output, and can be assumed to be inconsequential to the physical system as well.

It is nearly impossible to analyze the weights directly. Instead, we run the trained neural network on artificial inputs. Usually, we hold most of the inputs fixed while varying only one or two of them. By examining how the output responds to isolated fluctuations of the inputs, we can determine how each physical input contributes the behavior of the physical system. Another approach to this is the pseudo-impulse response. For neural networks which model time series, we can simulate impulse inputs of various magnitudes. The system response to these inputs can reveal the nature of the underlying physical interaction just as the Green's functions in the linear prediction filter analysis. In either case, we should plot training points in the vicinity of the simulated inputs to verify that there is actual data to support the output provided by the neural network.

We return to the question of H, but this time we will use a neural network model to see if any nonlinearity is needed. The inputs this time will be Dst, VBs, and the local time. We include VBs so that we can find out how the partial ring current responds to strong and weak coupling. By including sine and cosine of the local time, we allow the neural network to implicitly localize the relationship. This type of localization always results in smooth analytical functions, whereas localizing linear prediction filters in partitions results in tabular functions of the localization variables.

We now run the neural network on simulated data, separately varying Dst, VBs, and local time. In Figure 9.6 we plot the H-Dst relationship for several local times at fixed VBs. In the top panel, where VBs = 0 (weak coupling), we see that a localized current system is creating an asymmetry in the magnetic disturbance. At each local time, the relationships are very linear, with the asymmetry disappearing at Dst -20 nT.

 Figure 9.6. The local perturbation in the horizontal component at SJG, DH, at four different local times and for the main phase at the recovery phase separately as a function of the Dst index.

When we then investigate the strong coupling case, VBs = 5 mV/m, we see similar qualitative features. However, we note that the asymmetry has widened by about 30%, indicating a stronger partial ring current. We also note that the asymmetry does not disappear until Dst has nearly recovered to zero.

It is generally impossible to make these kinds of measurements, since the inputs are usually all varying at the same time. The neural network has provided us with an empirical model which we can use to isolate the influence of individual inputs.

The final demonstration is a return to the VBs-Dst relationship. Earlier, we modeled this relation with a moving average filter, and we found that it is probably best described by an autoregressive moving average filter. We will use a neural network to discover whether there are any nonlinear features of the data which were not evident in the linear prediction filter analysis. We will take as inputs Dst at a 1 hour lag, VBs, solar wind dynamic pressure (Psw), and Psw at a 1 hour lag. The inclusion of dynamic pressure was accommodated in the linear prediction filter by a set of correction factors derived separately. Here will let the neural network remove the pressure effects so that we can isolate the VBs effects on Dst. This network is output recurrent, which means we can still train it with the LM algorithm.

After training, we run the network on a set of simulated data. We hold dynamic pressure constant at 3 nPa. For various values of VBs and initial Dst, we let the network calculate the subsequent Dst. We then calculate the change in Dst over the hour (Dst). This is an estimate of the derivative of Dst. If we plot trajectories in phase-space at constant VBs, we can understand the form of the dynamic equation which relates Dst to VBs. What we see is that, in the region of high training density, the phase-space trajectories are linear. This suggests the first-order differential equation in 9.22.

The nonlinearity outside of the high training density region may be artificial. This could be resolved by plotting training points along with the trajectory, so that we could see if there was real data to support the nonlinearity.

 Figure 9.7. Neural network analysis of the relationship between Dst and VBs. Lines show VBs contours derived from the neutral network in a phase space defined by Dst and the change in Dst in the interval. The shaded area is where the neural network was well trained.

For the high training density region, at least, the dynamics of Dst can be described by a simple linear ODE whose coefficients are parameterized in VBs. By studying how each parameter varies with VBs, we can learn about the physical processes which lead to injection and decay.

Since the neural network is poor at extrapolation, we might choose to use the insight we have gained here to go back and construct a Dst-VBs autoregressive moving average filter which is localized in VBs. This is a good example of how neural networks can give us insight into what aspects of a system are nonlinear, and guide us in the construction of our mostly-linear models.

In summary, we now know how to build and interpret two types of empirical models: linear prediction filters and neural networks. We have given examples of their use in space physics. It should now be clear that neural networks can encapsulate more complex multivariate behavior but are not as reliable for extrapolation. In either case careful effort must be taken to analyse the models to gain a physical understanding of how the system is behaving. For physical systems, state-space localized linear prediction filters are often the best means of achieving a simple, accurate, understandable model. We suggest, that neural networks may be used as guides in the construction of simpler local- or nonlinear prediction filters. Again, these techniques can, at best, provide us with the differential equation which governs an observable. We must use our knowledge of space physics to understand the components of that differential equation.

To summarize, linear prediction filters can be used to describe a system and its time evolution. These can be constructed with autoregressive, moving average or mixed terms. The coefficients are calculated using are least squares solution of an overdetemined matrix. They can be localized for studying nonlinearity. They are analagous to Green's functions and local filters can reveal local processes. Neural networks are also used to describe systems and their time evolution. They are based on a biologic analogy. Neural networks are trained with an interative adjustment of weights. One can analyze the results of a neural network analysis by running on simulated data and examining the phase space of a network paying attention mainly to the well trained region of phae space.

Linear Prediction Filtering:

Solar Wind-Magnetosphere Coupling Proceedings, (Kamide and Slavin eds.) Terra Scientific Publishing, 1986. See articles by Clauer (p.39), McPherron et al. (p.93), and Fay et al. (p.111).

Neural Networks:

Matlab Neural Network Toolbox User's Guide.

Beale, Hagan, & Demuth. Neural Network Design. PWS Publishers 1995.

Rumehart, Hinton, Williams "Learning internal representations by error propagation" in

Parallel Data Processing, p.318, Vol. 1, Ch.8. MIT Press (Rumelhart & McClelland eds).

Proceedings of the International Workshop on Artificial Intelligence Applications in Solar-Terrestrial Physics: 1993 and 1997 meetings.