Dancelogue is an AI-first company whose main objective is to understand and classify human movement in dance. To this end, being able to understand video structure is of vital importance.

The main thing that separates videos from images is that videos have a temporal structure in addition to the spatial structure found in images. Videos also have other modalities, such as sound, but these are ignored for now. A video is just a collection of images sampled at a specific temporal resolution, i.e. frames per second. This means that information in a video is encoded not only spatially (i.e. in the objects or people in a video), but also sequentially and in a specific order, e.g. catching a ball vs throwing a ball, dancing salsa vs hugging. This extra bit of information is what makes classifying videos interesting and yet challenging at the same time.

Background

There are quite a few deep learning algorithms applied to the spatial domain, ranging from classification and segmentation to scene understanding. However, algorithms that perform well in the temporal domain are fewer and less developed than their spatial counterparts. This is because of the complicated nature of adding time to the equation.

One of the challenges in creating algorithms to detect temporal structure is the rapid increase in the number of parameters required to perform even a simple classification of a video segment. One might naively assume that 2D networks can be extended to 3D simply by stacking video frames as input to the network, so that motion is inherently encoded within the neural network. This has been shown to be computationally intractable, especially when long-range time dependencies are required, and the results have been significantly worse than those of the best hand-crafted shallow representations.
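
To get a feel for how quickly the parameter count grows, here is a quick back-of-the-envelope comparison in PyTorch of how a single first convolutional layer grows when frames are stacked or a 3D convolution is used (the channel counts and kernel sizes below are arbitrary, chosen purely for illustration):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# First layer of a 2D network that sees a single RGB frame ...
single_frame = nn.Conv2d(3, 64, kernel_size=7)
# ... versus the same layer when 16 RGB frames are stacked along the channel axis ...
stacked_frames = nn.Conv2d(3 * 16, 64, kernel_size=7)
# ... versus a full 3D convolution over the 16-frame clip.
clip_3d = nn.Conv3d(3, 64, kernel_size=(7, 7, 7))

print(count(single_frame))    # 64 * 3 * 7 * 7 + 64      =   9,472
print(count(stacked_frames))  # 64 * 48 * 7 * 7 + 64     = 150,592
print(count(clip_3d))         # 64 * 3 * 7 * 7 * 7 + 64  =  65,920
```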

Another issue is that most methods learn features based on image context rather than the inherent action. They categorize each frame and naively vote on a frame-by-frame basis, so the label with the largest number of per-frame votes is used as the final video label. This means they don't use a representation of motion as a classification feature, but instead learn to use spatial cues to infer the temporal information contained in the video. For example, given a video containing water and people in the water, such an algorithm will conclude that the video contains swimming, but will potentially be unable to determine whether the person is doing the front crawl or just floating.

What is desired is an algorithm that prioritizes motion as a key characteristic for classification, and is computationally feasible in a real-world context. There are quite a few ways to achieve this, and optical flow is one of the strongest candidates for achieving the desired result.

Optical Flow

Optical flow is a powerful idea, and it has been used to significantly improve accuracy when classifying videos, at a lower computational cost. It has been around since the 1980s, initially in the form of hand-crafted approaches.

Optical flow is a per-pixel prediction and the main idea is brightness constancy: it tries to estimate how pixel brightness moves across the screen over time. It assumes that I(x, y, t) = I(x + Δx, y + Δy, t + Δt). In plain English, the pixel characteristics at time t (i.e. the RGB values) are the same as the pixel characteristics at time t + Δt but at a different location (denoted by Δx and Δy), and the change in location is what is predicted by the flow field. As an example, suppose that at t = 1 second we have the RGB value (255, 255, 255) at position (10, 10) in the frame. Optical flow assumes that at t = 2 seconds the same RGB value (255, 255, 255) will still exist in the frame and, if there is motion, it will exist at a different position, say (15, 19). The optical flow displacement vector for this motion is then (5, 9). This means that if we take the original pixel position and apply the displacement vector to it, we should recover the new image (this is what warping an image is about, and we will come back to it later).
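
As a minimal illustration of the example above (using NumPy purely for the arithmetic), the flow vector is just the difference between the new and old pixel positions:

```python
import numpy as np

# Brightness constancy for a single pixel, following the example above.
# At t = 1 s the pixel (255, 255, 255) sits at (x, y) = (10, 10).
old_position = np.array([10, 10])
new_position = np.array([15, 19])

# The flow vector is simply the displacement between the two positions.
flow = new_position - old_position      # -> array([5, 9])

# Applying the flow to the original position recovers the new position,
# which is the core idea behind warping an image with a flow field.
assert np.array_equal(old_position + flow, new_position)
```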

Applications

There are quite a few applications of optical flow in Deep Learning as well as outside of it.

Some applications outside deep learning include generating 3D shapes from motion, and global motion compensation, which is used in video camera stabilization as well as video compression. Optical flow was also used in The Matrix movies to smooth and re-time the shots in the bullet-time dodging scene.

Optical flow has quite a few applications in deep learning as well and some of them are as follows.

It is useful for providing temporal smoothing in Generative Adversarial Networks, e.g. the vid2vid network, so that the generated output appears temporally coherent. This is hard to do using GANs alone, as they don't encode a temporal component.

Probably the best-known use of optical flow in deep learning is the two-stream architecture for video recognition, introduced in the "Two-Stream Convolutional Networks for Action Recognition in Videos" paper published in 2014. In this architecture there are two input streams to the network (similar to FlowNetC below): one stream takes the raw image, and a second stream takes a series of optical flow images. This was one of the first successful uses of convolutional networks on videos, as the results were competitive with the more traditional state-of-the-art models. There have since been several other variants with better results, but all of them rely on some form of optical flow representation.
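
To make the two-stream idea concrete, here is a minimal PyTorch sketch; the tiny CNN backbone, the channel and class counts, and the score-averaging fusion are illustrative stand-ins rather than the exact architecture from the 2014 paper:

```python
import torch
import torch.nn as nn

class Stream(nn.Module):
    """A small CNN used for both streams; only the input channels differ."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStream(nn.Module):
    """Spatial stream sees an RGB frame; temporal stream sees stacked flow."""
    def __init__(self, num_classes=10, flow_frames=10):
        super().__init__()
        self.spatial = Stream(3, num_classes)                 # RGB frame
        self.temporal = Stream(2 * flow_frames, num_classes)  # x/y flow per frame

    def forward(self, rgb, flow_stack):
        # Late fusion by averaging the two streams' class scores.
        return (self.spatial(rgb) + self.temporal(flow_stack)) / 2

model = TwoStream()
rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 20, 224, 224)   # 10 flow frames, 2 channels each
scores = model(rgb, flow)
```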

Optical Flow Implementation

There are quite a few optical flow implementations, but investigating all of them is beyond the scope of this blog. The implementation we will be looking at is the one described in the FlowNet 2.0 paper (FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks); the code is hosted on NVIDIA's GitHub repository and written in PyTorch. It was chosen because it is used in other interesting projects such as vid2vid, and because the repository has one of the highest star counts among existing FlowNet codebases at the time of writing this blog.

In order for the code base to make sense, let's skim through the paper on a non-technical basis; we can hopefully delve into the technical details in a later blog. We will look only at the key ideas of the paper, and at some reasons why this particular FlowNet implementation is worth considering. It is worth noting that the FlowNet 2.0 paper builds on the "FlowNet: Learning Optical Flow with Convolutional Networks" paper released in 2015, and it's almost impossible to read FlowNet 2.0 without reading the original paper first.

The heart of the FlowNet 2.0 paper is the combination of the FlowNetS, FlowNetC, FlowNetCSS and FlowNetSD networks. Don't worry, we will break down what these mean and how they fit together to produce the final FlowNet 2.0 algorithm.

FlowNetS

FlowNetS stands for FlowNet Simple, and it was originally introduced in the 2015 FlowNet paper. A visual summary of the algorithm is as follows:

FlowNetS has a structure similar to an encoder-decoder network. This means data is spatially compressed in a contracting part of the network and then refined in an expanding part. It takes two images, concatenates them along the channel dimension and feeds them to the network. The image pair is then processed and motion information is extracted.
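
A minimal sketch of this input handling in PyTorch, assuming 384x512 frames; the layer configuration below is illustrative rather than the exact network from the paper:

```python
import torch
import torch.nn as nn

# Two RGB frames of the same scene, e.g. consecutive video frames.
img1 = torch.randn(1, 3, 384, 512)
img2 = torch.randn(1, 3, 384, 512)

# FlowNetS-style input: simply stack the frames along the channel axis,
# giving a 6-channel tensor that the contracting part of the network consumes.
pair = torch.cat([img1, img2], dim=1)           # shape (1, 6, 384, 512)

# Illustrative first encoder layer.
conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3)
features = conv1(pair)                          # spatially compressed features
```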

The refinement layer is as follows:

In this part of the network, a series of upconvolutions is performed in order to increase the image resolution, which makes it the expansive part of the network, i.e. the decoder. This is done by combining coarse feature maps (from deeper in the network) with the fine local information provided by the lower-level feature maps of the contracting part. There are other refinement strategies, but those are out of the scope of this blog.
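
A rough sketch of a single refinement step; the feature-map sizes and channel counts are arbitrary, chosen only to show the mechanics:

```python
import torch
import torch.nn as nn

# One refinement step: upsample the coarse features and concatenate them with
# the higher-resolution features skipped over from the contracting part.
coarse = torch.randn(1, 128, 6, 8)      # deep, low-resolution feature map
skip = torch.randn(1, 64, 12, 16)       # earlier, higher-resolution feature map

upconv = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
upsampled = upconv(coarse)               # (1, 64, 12, 16)

refined = torch.cat([upsampled, skip], dim=1)   # (1, 128, 12, 16)

# A flow prediction can be made at this resolution with a small convolution
# producing 2 channels (the x and y components of the flow).
predict_flow = nn.Conv2d(128, 2, kernel_size=3, padding=1)
flow = predict_flow(refined)             # (1, 2, 12, 16)
```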

EPE/APE

The loss function for optical flow estimation makes use of the endpoint error (EPE). This is shown diagrammatically as:

The endpoint error is calculated by comparing the predicted optical flow vector with the ground truth (expected) optical flow vector, and is the Euclidean distance between the two vectors, typically averaged over all pixels.
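
A minimal sketch of how the average endpoint error could be computed in PyTorch, assuming flow tensors with separate x and y channels:

```python
import torch

def endpoint_error(pred_flow, gt_flow):
    """Average endpoint error between predicted and ground-truth flow fields.

    Both tensors have shape (batch, 2, height, width), where the 2 channels
    are the x and y components of the flow vectors.
    """
    # Euclidean distance between the two vectors at every pixel ...
    per_pixel = torch.norm(pred_flow - gt_flow, p=2, dim=1)
    # ... averaged over all pixels and all images in the batch.
    return per_pixel.mean()

pred = torch.randn(1, 2, 64, 64)
gt = torch.randn(1, 2, 64, 64)
loss = endpoint_error(pred, gt)
```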

FlowNetC

FlowNetC stands for FlowNet Correlation. It was also introduced in the FlowNet 2015 paper and was meant to be an improvement on FlowNetS, which had a high EPE compared to traditional optical flow implementations. A visual summary of the algorithm is as follows:

This is a less generic network structure, where each image is fed into a separate but identical processing stream. This means meaningful representations (e.g. edges) are learnt for each image separately. The two sets of feature maps then undergo multiplicative patch comparisons in a correlation layer (a similar idea to a matrix multiplication). The result of the correlation then proceeds in a similar fashion to FlowNetS, i.e. it goes through a similar refinement process.
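
A deliberately naive sketch of the multiplicative comparison, which compares single feature vectors rather than full patches; the real correlation layer in the paper and in the NVIDIA code base uses patch comparisons, a larger strided search range and an optimized CUDA kernel:

```python
import torch
import torch.nn.functional as F

def correlation(feat1, feat2, max_displacement=3):
    """Naive correlation layer: for every displacement (dx, dy) within the
    search range, take the per-pixel dot product of feat1 with a shifted feat2.

    feat1, feat2: tensors of shape (batch, channels, height, width).
    Returns a tensor of shape (batch, (2*d+1)**2, height, width).
    """
    b, c, h, w = feat1.shape
    d = max_displacement
    # Zero-pad feat2 so shifted slices stay in bounds.
    feat2 = F.pad(feat2, (d, d, d, d))
    outputs = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = feat2[:, :, dy:dy + h, dx:dx + w]
            # Multiplicative comparison, averaged over channels.
            outputs.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(outputs, dim=1)

f1 = torch.randn(1, 64, 48, 64)
f2 = torch.randn(1, 64, 48, 64)
corr = correlation(f1, f2)       # (1, 49, 48, 64)
```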

FlowNet 2.0

Combining all the above ideas gives the FlowNet 2.0 architecture, whose visual representation is:

From the above final FlowNet 2.0 architecture diagram we can see the very first part of the stack is the FlowNetC network, which takes in 2 images and is designed to detect large displacements.

The resulting flow field is applied to the second image via warping (see below), and this, together with image 1, is fed to the following FlowNetS network, which refines the large-displacement flow. The brightness error, which is the difference between the warped image and the original image 1, is also passed into FlowNetS. This combination of FlowNetC and FlowNetS is called FlowNetCS.

This is then repeated with a second FlowNetS network, which produces a flow field and flow magnitude. This combination is called FlowNetCSS and is one possible variant of the FlowNet 2.0 network. FlowNetCSS was introduced in the FlowNet 2.0 paper, and the main idea behind it is that optical flow estimates can be greatly improved by stacking networks.

A second branch of the network contains FlowNetSD, which is fed the original image 1 and image 2. FlowNetSD stands for FlowNet Small Displacements. In the FlowNet 2.0 paper it was found that, despite stacking, the network was still unable to accurately produce flow fields for small motions, which resulted in a lot of noise. Thus a variation of FlowNetS with some slight modifications was added to the network in order to capture smaller motions; these changes included changing the kernel sizes as well as adding convolutions between the upconvolutions.

The resulting flows from FlowNetCSS and FlowNetSD are combined in a fusion network to produce the final flow field. There wasn't much detail about the fusion network in the paper, but this should be cleared up by studying the PyTorch code base.

Warping

The first FlowNetC network computes an optical flow. This flow is "applied" to the second image, shifting it according to the optical flow field so that it tries to match image 1. This new image 2 is then fed to the following FlowNetS network. This way the next network in the stack can focus on the remaining increment between image 1 and image 2. The reason this approach works is the assumption that if only the first image and the flow vectors are known, then the second image can be generated. This principle is also what makes optical flow useful in video compression, as a flow representation uses fewer parameters than the actual video frames and thus avoids redundancy.
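
A sketch of backward warping with a flow field, using PyTorch's grid_sample; this follows the general technique rather than the exact implementation in the NVIDIA code base:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` using a dense flow field.

    image: (batch, channels, height, width)
    flow:  (batch, 2, height, width), where flow[:, 0] is the x displacement
           and flow[:, 1] the y displacement, in pixels.
    """
    b, _, h, w = image.shape
    # Base sampling grid: every pixel samples from its own location.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys]).float().unsqueeze(0).expand(b, -1, -1, -1)

    # Shift the grid by the flow, then normalize to [-1, 1] for grid_sample.
    coords = base + flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)    # (b, h, w, 2)

    return F.grid_sample(image, grid, align_corners=True)

img1 = torch.randn(1, 3, 64, 64)
img2 = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)        # zero flow leaves the image unchanged
warped2 = warp(img2, flow)

# The brightness error that gets passed to the next network in the stack.
brightness_error = (img1 - warped2).abs()
```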

Training

The network was trained using synthetically generated data: the FlyingChairs dataset created for the original FlowNet implementation; the FlyingThings3D dataset proposed in the "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation" paper, which is used to model 3D motion; and the ChairsSDHom dataset introduced in the FlowNet 2.0 paper, which was used to train the network to detect small displacements.

The data was fed to the network using curriculum learning, which is the strategy of training machine learning models on a series of tasks of gradually increasing difficulty, as it was found that the order of presenting the training data was vital for producing high accuracy. Thus the simpler FlyingChairs dataset was fed in first, then FlyingThings3D was introduced, and finally ChairsSDHom was fed to the network as a fine-tuning step. Additionally, the best results were obtained by keeping the earlier networks in the stack fixed and only training the subsequent networks after the warping operation, in order to fine-tune the results of the first network.
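
A hedged sketch of what the curriculum and freezing strategy might look like in code; the stand-in modules, channel counts and learning rates are placeholders, not values from the paper or code base:

```python
import torch.nn as nn

# Illustrative stand-ins for two networks in the stack; the real modules come
# from the NVIDIA FlowNet 2.0 code base (channel counts here are arbitrary).
flownet_c = nn.Conv2d(6, 2, kernel_size=3, padding=1)
flownet_s = nn.Conv2d(12, 2, kernel_size=3, padding=1)

# Curriculum: simpler data first, harder data later.
curriculum = [("FlyingChairs", 1e-4), ("FlyingThings3D", 1e-5), ("ChairsSDHom", 1e-6)]

# When training the networks after the warping operation, the earlier network
# in the stack is kept fixed.
for p in flownet_c.parameters():
    p.requires_grad = False

for dataset_name, lr in curriculum:
    trainable = [p for p in flownet_s.parameters() if p.requires_grad]
    print(f"train on {dataset_name} at lr={lr} "
          f"({len(trainable)} trainable parameter tensors)")
    # ... build the dataloader for dataset_name and run the usual training loop ...
```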

Variants

Even though the final FlowNet 2.0 network is superior to state-of-the-art approaches, it is still slower than the original FlowNet implementation (i.e. 10 fps vs 8 fps) and can be restrictively slow for most practical applications, given that most videos operate between 25 and 60 fps. Hence variants of FlowNet 2.0 were created to give a significant boost in run time. Luckily the difference between the variants and the original is not too large. One of these variants is FlowNet-css (note the lower case, as denoted in the FlowNet 2.0 paper). This is the same architecture as FlowNetCSS but with a shallower network that is optimized for speed: at 3/8 of the network size, it can process images at 140 fps, although at a reduced but still acceptable accuracy compared to the full network. Depending on the application, such a speed/accuracy tradeoff might be worthwhile.

Interpreting FlowNet Diagrams

To view a flow field there needs to be a representation that is intuitive for humans to understand; hence the flow field color coding was created. The color coding is shown below.

The displacement of every pixel in the top-left (multi-colored) image is the vector from the center of the square to that particular pixel, as indicated by the image on the right. This means the middle of the image, i.e. the white portion, indicates no optical flow; the blue quadrant indicates flow to the left and to the top, and the deeper the shade of blue, the greater the magnitude of the vector. This applies to the other quadrants as well: the more intense the color, the greater the magnitude of the flow in that direction.
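
For completeness, here is one common way to turn a flow field into such a color image, assuming the usual HSV convention (hue encodes direction, saturation encodes magnitude, white means no motion); the exact color wheel used in the FlowNet figures may differ slightly:

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_color(flow):
    """Visualize a flow field of shape (height, width, 2) as an RGB image."""
    dx, dy = flow[..., 0], flow[..., 1]
    angle = np.arctan2(dy, dx)                          # direction, in radians
    magnitude = np.sqrt(dx ** 2 + dy ** 2)

    hsv = np.zeros(flow.shape[:2] + (3,))
    hsv[..., 0] = (angle + np.pi) / (2 * np.pi)         # hue encodes direction
    hsv[..., 1] = magnitude / (magnitude.max() + 1e-8)  # saturation encodes magnitude
    hsv[..., 2] = 1.0                                   # keep full brightness
    return hsv_to_rgb(hsv)                              # (height, width, 3), values in [0, 1]

# A flow field that points increasingly to the right: the image fades from
# white (no motion) towards a saturated color as the magnitude grows.
flow = np.zeros((64, 64, 2))
flow[..., 0] = np.linspace(0, 10, 64)[None, :]
rgb = flow_to_color(flow)
```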

Let's look at an example of this at play:

From the above image it can be seen that the background is green; looking at the flow field color coding, this indicates that the vectors in this region point to the bottom and to the left at varying magnitudes, as shown by the different shades. One interpretation is that the camera is tracking the central individual as he moves across the background, hence the background flows to the left; the downward component might indicate that the central figure is climbing a hill. Looking at the front foot of the humanoid, it is red, which indicates a flow vector pointing to the right, indicative of forward motion.

Conclusion

We have looked briefly at what optical flow is and its possible applications in deep learning. We have also briefly looked at the FlowNet implementation of optical flow and the architecture behind it. Subsequent blogs will delve deeper into the topic: we will look at how to build and train a two-stream network, feed dance training data into it, and see whether the results make sense.

References