Using Neural Networks to Approximate a Dynamical System

Ever since I started studying atmospheric physics I have been fascinated by the concept of chaos theory. The idea that even in a deterministic system (as long as it is non-linear) the smallest deviations in the initial conditions can lead to unforeseen consequences is captivating.

Deep neural networks are a fascinating topic in their own right, not only because of their amazing capabilities in modern AI research, but also more fundamentally because of the universal approximation theorem. Whereas feed-forward neural networks are universal function approximators, recurrent neural networks can be seen as universal approximators of dynamical systems. Since that means you can learn an arbitrary dynamical system just from data, this suddenly becomes incredibly useful for all kinds of scientific questions. For example, you can parameterize a process that you don’t understand but have lots of measurements of.

I wanted to demo this approach and chose a simple RNN to approximate the famous Lorenz ’63 dynamical system.

First we need to define the system itself, which is governed by the differential equations; here I defined it as a torch nn.Module to make use of accelerated matrix multiplications.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# lorenz system dynamics
class Lorenz(nn.Module):
    """
    chaotic lorenz system
    """
    def __init__(self):
        super(Lorenz, self).__init__()
        # the right-hand side is linear in the features [x, y, z, xz, xy]
        self.lin = nn.Linear(5, 3, bias=False)
        W = torch.tensor([[-10., 10., 0., 0., 0.],
                          [28., -1., 0., -1., 0.],
                          [0., 0., -8. / 3., 0., 1.]])
        self.lin.weight = nn.Parameter(W)

    def forward(self, t, x):
        # build the feature vector [x, y, z, xz, xy] from the current state x
        y = torch.ones([1, 5]).to(device)
        y[0][0] = x[0][0]
        y[0][1] = x[0][1]
        y[0][2] = x[0][2]
        y[0][3] = x[0][0] * x[0][2]
        y[0][4] = x[0][0] * x[0][1]
        return self.lin(y)

We then integrate the differential equations with a fourth-order Runge-Kutta scheme, starting from an initial condition y_0, to obtain a ground truth trajectory.
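
Below is a minimal sketch of what this integration can look like; the step size, number of steps, and initial condition are illustrative choices, not necessarily the ones used in the gist.

def rk4_step(f, t, x, dt):
    # classic fourth-order Runge-Kutta step for dx/dt = f(t, x)
    k1 = f(t, x)
    k2 = f(t + dt / 2, x + dt / 2 * k1)
    k3 = f(t + dt / 2, x + dt / 2 * k2)
    k4 = f(t + dt, x + dt * k3)
    return x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

lorenz = Lorenz().to(device)
dt, n_steps = 0.01, 5000                      # illustrative values
x = torch.tensor([[1., 1., 1.]]).to(device)   # initial condition y_0
trajectory = [x]
with torch.no_grad():
    for i in range(n_steps):
        x = rk4_step(lorenz, i * dt, x, dt)
        trajectory.append(x)
ground_truth = torch.cat(trajectory)          # shape [n_steps + 1, 3]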

So far so good: we have a dynamical system and we are integrating it over time. This ground truth will become the training data for the neural network. We give the network some predefined context and ask it to predict the next state of the system; since we have the ground-truth data, we automatically have our “labels”. The neural network for this simple exercise is quite basic and looks like this:

class fcRNN(nn.Module):
    def __init__(self, input_size, hidden_dim, output_size, n_layers):
        super(fcRNN, self).__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.rnn = nn.RNN(input_size,
                          hidden_dim, n_layers,
                          nonlinearity='relu',
                          batch_first=True)  # RNN hidden units
        self.fc = nn.Linear(hidden_dim, output_size)  # output layer

    def forward(self, x):
        bs, _, _ = x.shape
        # initial hidden state, one per layer and batch element
        h0 = torch.zeros(self.n_layers,
                         bs, self.hidden_dim).to(device)
        out, hidden = self.rnn(x, h0)
        out = out.view(bs, -1, self.hidden_dim)
        out = self.fc(out)
        # return only the prediction for the last time step
        return out[:, -1, :]
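
Before looking at the results, here is a rough sketch of how the (context, next-state) training pairs and the training loop could be set up, building on the snippets above. The window length of 15 matches the context used below; the hidden size, learning rate, and the choice of Adam with an MSE loss are illustrative assumptions rather than the exact settings from the gist.

context = 15

# sliding windows over the ground-truth trajectory:
# input = 15 consecutive states, label = the state that follows
inputs = torch.stack([ground_truth[i:i + context]
                      for i in range(len(ground_truth) - context)])
labels = torch.stack([ground_truth[i + context]
                      for i in range(len(ground_truth) - context)])

model = fcRNN(input_size=3, hidden_dim=32, output_size=3, n_layers=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(5000):
    optimizer.zero_grad()
    pred = model(inputs)                # predicted next state for every window
    loss = criterion(pred, labels)
    loss.backward()
    optimizer.step()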

I put a link to my github gist at the end of the post, but for now let’s look at some of the results. The first one is an animation of the training progress over 5000 epochs. The scatter plot shows the previously calculated ground truth. The network gets the first 15 time steps as context and is then tasked to predict the whole trajectory from there. What you can see is that in the beginning the network fails to correctly represent the system, but over time it becomes more stable and more similar to the ground truth. Keep in mind that exactly fitting all of the scatter points is incredibly difficult due to the “butterfly effect”: if you are only slightly off at one of the previous time steps, the consequences can become arbitrarily large over time.
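
The trajectories in the animations are produced autoregressively. A minimal sketch, reusing the names from the snippets above (the gist may differ in the details):

with torch.no_grad():
    window = ground_truth[:context].unsqueeze(0)   # the 15 context steps, [1, 15, 3]
    rollout = list(window[0])
    for _ in range(len(ground_truth) - context):
        next_state = model(window)                 # predict the next state, [1, 3]
        rollout.append(next_state[0])
        # slide the window: drop the oldest state, append the prediction
        window = torch.cat([window[:, 1:], next_state.unsqueeze(1)], dim=1)
    rollout = torch.stack(rollout)                 # full predicted trajectory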

 

The second animation shows all of the generated trajectories over the course of training, where early models are depicted in dark colors and late models in bright ones. Here it is quite cool to see that the first two models really hadn’t trained enough to capture the system’s dynamics and immediately leave the trajectory once the 15 context samples run out. From there you can observe that models that have trained for longer approximate the true state (shown in blue) better. Again, after one loop or so even the trained models diverge due to minor differences at some point in the past, but it is worth noting that the trained recurrent networks (especially the ones with more training time) correctly characterize the dynamical system with its two lobes spanning the 3D space. What that means is that you can use the network to describe the system, and it probably makes a useful model when combined with some data fusion process like Kalman filtering.

 

Feel free to have a look at the full code and play around with the model yourself @github-gist.

Chaos is beautiful

During my time at the University of Toronto, I dipped my toes into the fascinating field of chaos theory. I always found it to be particularly impressive how small changes in the initial conditions of a dynamical system can lead to what scientists call “deterministic chaos”.

Recently I stumbled across an animation of a triple pendulum and then across this great blog post by Jake VanderPlas. Jake managed to recreate the animation in Python and SymPy. Since the equations for the triple pendulum tend to get messy, Kane’s method is the way to go.

I don’t want to give away too much. Just go to his website, download the IPython notebook Jake prepared, and play around with the parameters. As they say: the flap of a butterfly’s wing really seems capable of causing a tornado.

The Bernoulli equation of fluid dynamics

It has been a while since I did any work in fluid dynamics, but I stumbled across this because of a job interview, so I thought I’d share it here.

The Bernoulli equation of fluid dynamics is basically a conservation law. In physics there are certain conserved quantities that stay the same during a physical transformation, for example energy or momentum. The equation derived below is basically a statement of energy conservation; however, it requires a so-called ideal fluid.

An ideal fluid has the following properties:

  • it is stationary and laminar (so there is no turbulence, and velocity at one point does not change) \frac{\partial v}{\partial t} = 0
  • it has no viscosity (so there is no internal friction) \eta = 0
  • it has no friction (with outside walls or anything else)
  • it is incompressible, so the density is the same everywhere \rho = \text{const.}

Imagine a tube that is filled with this ideal fluid and whose radius shrinks in the middle in a conical fashion. Because we already know the equation of state (\rho = \text{const.}), we can make a statement about the mass flow within the tube.

Consider the mass \Delta m_1 that enters the pipe at one end in a certain amount of time; it must be equal to \Delta m_2, the mass that leaves the pipe at the other end in the same amount of time. This balance, also called the continuity equation, we can write as

(1)    \begin{equation*}    \Delta m \equiv \rho_1 A_1 v_1 \Delta t= \rho_2 A_2 v_2 \Delta t \end{equation*}

When we consider the total work done on the fluid we have to be careful about the sign, and can write it as

(2)    \begin{equation*}    \Delta W = p_1 A_1 v_1 \Delta t  - p_2 A_2 v_2 \Delta t \end{equation*}

This total work must be equal to the change in all other forms of energy in the fluid, which we write normalized by the mass difference

(3)    \begin{equation*}    \Delta W = \Delta m (E_2-E_1) \end{equation*}

(4)    \begin{equation*}    \frac{p_1 A_1 v_1 \Delta t}{\Delta m}  - \frac{p_2 A_2 v_2 \Delta t}{\Delta m} = \frac{1}{2}v_2^2 + \phi_2 + U_2 - \left( \frac{1}{2}v_1^2 + \phi_1 + U_1 \right), \end{equation*}

where \frac{1}{2}v_2^2 represents the kinetic energy, \phi_2 some form of potential (in our case gz), and U_2 the internal energy. Filling in the continuity equation (1) for \Delta m, we get this nice little expression:

(5)    \begin{equation*}     \frac{1}{2}v_1^2 + \phi_1 + \frac{p_1}{\rho_1} + U_1 = \frac{1}{2}v_2^2 + \phi_2 + \frac{p_2}{\rho_2} + U_2 \end{equation*}

Since the internal energy of an ideal fluid does not change (there is no friction or viscosity to dissipate energy into heat, so U_1 = U_2), this basically states

(6)    \begin{equation*}     \left[ \frac{1}{2}v^2 + \phi + \frac{p}{\rho} \right]_{\text{streamline}} = \text{const.} \end{equation*}

This is Bernoulli’s equation of fluid dynamics. It states that within a flow of constant energy the velocity of the fluid rises in regions of low pressure and falls in regions of high pressure.

Even though we simplified the underlying physics massively, there are still a lot of phenomena that can be explained using this simple equation. For example, the lift generated by an airplane wing can be partially explained with it.
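
As a small worked example (all numbers are illustrative), consider water flowing horizontally through a pipe that narrows to half its cross section; continuity gives the new velocity and Bernoulli the new pressure:

# Bernoulli + continuity for a horizontal pipe constriction (phi_1 = phi_2).
rho = 1000.0           # density of water in kg/m^3
A1, A2 = 0.01, 0.005   # cross sections in m^2
v1 = 2.0               # inflow velocity in m/s
p1 = 150e3             # inflow pressure in Pa

v2 = v1 * A1 / A2                        # continuity: rho*A1*v1 = rho*A2*v2
p2 = p1 + 0.5 * rho * (v1**2 - v2**2)    # Bernoulli along the streamline
print(f"v2 = {v2:.1f} m/s, p2 = {p2 / 1000:.1f} kPa")  # v2 = 4.0 m/s, p2 = 144.0 kPa

The faster flow in the narrow section comes with a lower pressure, exactly as the equation states.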

Could machine learning mean the end of understanding in science?

If prediction is in fact the primary goal of science, how should we modify the scientific method, the algorithm that for centuries has allowed us to identify errors and correct them?

Interesting piece by UofT’s Amar Vutha on how machine learning is reshaping the scientific landscape. I fundamentally disagree that the goal of science is predicting nature. Predicting is great for applied problems like weather-forecasting, but neglecting the understanding of things bears a great risk, because then the scientific method is basically reduced to guessing what the next step could be and is no longer effective at iterating towards a fundamental truth. With all due respect, let’s leave that “goal-oriented” approach to problem solvers and engineering.

Spaghetti and Moonrockets

This is a translated version of a book chapter I wrote for German high-school students interested in meteorology. It’s an easy introduction to the Kalman filter and, given the target audience, has as little math in it as possible. For more details and further reading I suggest looking at the sources or at my post coming soon….

It’s always the same old story: you plan your birthday BBQ a week ahead of time or you want to go swimming with your friends, and even though the forecaster promised good weather for that day, it’s raining cats and dogs. Why can’t we get a reliable weather forecast for a longer timespan, and what does the moon landing have to do with it?

Cloudy with a Chance of Spaghetti

The problems of weather forecasting are manifold, but in my opinion there are mainly three of them.

First of all you have to think about why weather forecasting works at all: we need a mathematical model of the weather. The model consists of functions (imagine  y=x^2 ) that describe how air moves, so that temperature, pressure, humidity, wind speed, etc. can be calculated for every time step. Finding and perfecting these functions, an arrangement of mathematical equations, is a herculean problem in itself, which many meteorologists, physicists and mathematicians work on tirelessly. Phenomena like turbulence, aerosols and cloud formation are not thoroughly understood, but at least one has a sense of how large the corresponding errors in the equations are.

Second, there is the fundamental problem of measurement uncertainty. By measuring we try to estimate the state of the system (pressure, temperature, etc.) at this very moment with high accuracy. These measurements are then fed into the mathematical model. But how much do we trust our own measurements? This is a big problem, since small deviations in the beginning can have large consequences in the future. Just imagine trying to measure the temperature of a room with 50 people and a thermometer. Probably each of the 50 people will give a different answer in the second or third digit after the decimal point. Apart from this, we don’t even know how well the thermometer itself measures temperature. Figure 1 shows a time series taken from the so-called Lorenz model of 1963.

This specific model is described by a set of three “functions” inspired by rising and sinking air masses, and it behaves similarly to our atmosphere. In figure 1 you can see 50 different lines that lie very close together in the beginning. In order to get values for tomorrow or the day after tomorrow we need to integrate the functions of the model. If you feed the 50 slightly different values as initial conditions into the model, you can see that after a short amount of time the forecasts drift apart completely. Meteorologists call this figure a spaghetti plot.

The nice thing about spaghetti plots is that you can still use them despite the different results. If, for example, 40 of the 50 “spaghetti” point to warmer temperatures, you can assume with approximately  80\% certainty  \left( \frac{40}{50} \right) that it’s going to be warmer. As the spaghetti drift apart we get a sense for the trustworthiness of our forecasts. This whole thing is closely related to the butterfly effect and chaos theory and is a fascinating story on its own. The more mathematically inclined reader might want to check out my piece on chaos and the butterfly effect.
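
For readers who want to try this themselves, here is a quick sketch of how such an ensemble of “spaghetti” can be generated; the perturbation size and integration settings are illustrative choices.

import numpy as np
from scipy.integrate import solve_ivp

# Lorenz '63: three coupled equations for x, y and z
def lorenz(t, s, sigma=10., rho=28., beta=8. / 3.):
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

rng = np.random.default_rng(0)
t_eval = np.linspace(0, 20, 2000)
ensemble = []
for _ in range(50):
    # 50 slightly perturbed initial conditions -> 50 "spaghetti"
    s0 = np.array([1., 1., 1.]) + rng.normal(scale=1e-3, size=3)
    sol = solve_ivp(lorenz, (0, 20), s0, t_eval=t_eval)
    ensemble.append(sol.y[0])   # plot each x-component over time to get one strand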

By the way, the uncertainty about the initial state is the reason for the wrong weather forecast for our birthday BBQ and, more specifically, the reason why one shouldn’t trust weather forecasts more than about 3 days into the future. But there have been a lot of improvements in the past few decades, namely:

Third, the so-called data assimilation. This term describes how we incorporate all the measurement data we have into one big picture.

Data assimilation can be performed in space and in time. First, let’s take a look at space: peeking at figure 2 one can see a map of the world with all known weather stations drawn as little black dots. Looking closer you realize that these dots are not equally distributed between ocean and land, and only the northern hemisphere has a dense station network. In order to start our weather model we need data about the initial state at every point in space. To get these values even at locations without a station we use a few mathematical tricks. The easiest possibilities to get data at locations without a measurement are drawn in figure 3: one could simply draw a line (green) between data points, or fit a normal distribution around each data point (blue). This way we get values for every location in space.

Second, let’s move on to data assimilation in time. From time to time we get new information from our weather stations, and more information is always good. Wouldn’t it be great if we could grab the spaghetti from figure 1, i.e. the predictions for temperature etc. that have diverged too much from the “truth”, and “bump” them back to the right values? That’s what the Kalman filter does.

Kálmán, we have a Problem

In fact, this problem was already tackled in the 1960s. The Hungarian-American mathematician Rudolf Kálmán developed a method that was used to keep the Apollo capsule on its course to the moon.


Figure 4: According to the model the spacecraft would follow the ideal green line; in reality the course is distorted by small inaccuracies, as are the measurements (red dots).

Kálmán called his technique a “filter”, and it works in two steps. First, one uses Newtonian mechanics to predict the state of the system (in this case the position of the spacecraft) and the magnitude of the uncertainty of this prediction due to effects that were not considered.

Second, one uses a weighted average to incorporate the latest measured data (of course those come with errors, too). For Kálmán those were on-board measurements of the Apollo capsule; for weather prediction purposes those would be new temperature measurements.

To understand this properly, let’s create a small example with a spacecraft moving on a number line between 0 and 1. Let’s say the position predicted by our model (Newtonian mechanics) is at  \frac{1}{4} , while the measured position is at  \frac{3}{4} .


Figure 5: Simple example on the number line. At a given moment in time the spacecraft is located somewhere between 0 and 1.

The Kalman Gain  K is a number that expresses the confidence we have in our measurement relative to the prediction of the model.  K = 1 means that “we are absolutely sure that our measurements are correct” whereas  K = 0 means “we should place all our bets on the prediction, the measurement is rubbish”. Using the Kalman gain  K we can calculate a weighted average

 Estimate_x = \frac{3}{4} K + \frac{1}{4}(1-K)

If one believes equally in the prediction and the measurement \left( K = \frac{1}{2} \right), one would take the average of both values.

 Estimate_x = \frac{\frac{3}{4} + \frac{1}{4}}{2} = \frac{1}{2} = 0.5

This seems to be a perfectly reasonable choice; however, the Kalman filter can do more. It considers not only the uncertainty about the prediction but also the error we make when measuring, and calculates the optimal value for the relative confidence  K . If you trust the prediction more than the measurements, it gives more weight to the prediction; if the measurements are more plausible, it gives those priority.

An example of a weighted average where the measurement is considered more reliable would be  K = \frac{2}{3} . In that case the filter calculates an estimate that is closer to the observed position.

 Estimate_x = \frac{2}{3} \cdot \frac{3}{4} + \frac{1}{3} \cdot \frac{1}{4} \simeq 0.58
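
For the curious, here is a minimal one-dimensional sketch of this update step. The variances are illustrative numbers, not the Apollo values, but they reproduce the K = 2/3 example above:

def kalman_update(x_pred, var_pred, z_meas, var_meas):
    # Kalman gain: how much we trust the measurement relative to the prediction
    K = var_pred / (var_pred + var_meas)
    x_est = x_pred + K * (z_meas - x_pred)   # same weighted average as above
    var_est = (1 - K) * var_pred             # the estimate is more certain than either input
    return x_est, var_est, K

# prediction at 1/4 (variance 2), measurement at 3/4 (variance 1)
x_est, var_est, K = kalman_update(0.25, 2.0, 0.75, 1.0)
print(K, x_est)   # K = 2/3, estimate ~ 0.58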


Figure 6: The Kalman filter compares the prediction of the model (Newtonian mechanics, blue) to on-board measurements (red) in order to get a better estimate of the true position (green, unknown).

Not only does the filter provide a good estimate, the calculation is also simple enough to be performed in real time on a calculator. Everything needed to generate the estimate is the prediction and the latest measurement. Every computation for the moon landing had to be performed on an on-board computer less powerful than your average school calculator.

With this very same technique that helped land men on the moon we can bump our “spaghetti” towards the true values, and this has improved weather forecasting tremendously.

A Filter for every Eventuality

Different implementations of the Kalman filter are used in modern weather forecasting today and have proved to be very useful. These filters are also used to keep your car’s navigation system on course if the connection to one of the satellites drops, or to “filter” the noise out of an audio file. What started with the journey of three men to the moon is now part of our daily life.

Sources and Literature

  • Wikipedia, keyword “Rudolf Kalman”, January 16th, 2017, at 15:22, as seen on https://en.wikipedia.org/wiki/Rudolf_E._Kálmán
  • Wikipedia, keyword “Kalman Filter”, January 27th 2017, at 22:06, as seen on https://en.wikipedia.org/wiki/Kalman_filter
  • Evensen, G. (2003). The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean dynamics 53, 343–367.
  • Jones, D. (2014). University of Toronto, PHY2506: Data Assimilation (lecture material).
  • Kucharski, Adam, “Understanding the unseen”, +plus magazine, version of 31 March 2016, 14:25, available at https://plus.maths.org/content/understanding-unseen
  • Lorenz, E. N. (1963). Deterministic nonperiodic flow. Journal of the atmospheric sciences 20, 130–141.
  • Vose, R. S., et al. The Global Historical Climatology Network: Long-term monthly temperature, precipitation, and pressure data. No. CONF-930133-2. Oak Ridge National Lab., TN (United States), 1992.