CS 194-26 Final Project

Roger Chen and Sarah Kim

For our final project, we chose to do 3 projects: the Tour into the Picture, Video Textures, and High Dynamic Range imaging.

Tour into the Picture

Guided fly-through
Interactive tour
Our TIP GUI
Pre-rendered examples

Video Textures

Transition model

Extracting frames and downsampling
Pairwise L2 norm
Preserving dynamics
Q-learning and avoiding dead ends
Inverse exponentiation
Non-maximal suppression

Examples
GUI

High Dynamic Range

Radiance map construction
Tone mapping
Memorial Church
Other examples

Project 1: Tour into the Picture

The tour into the picture method creates a 3-dimensional world using a single 2-dimensional image that has single-point perspective. This works by assuming the scene of the image can be modeled as a box. 5 sides of the box are visible. By labeling the vanishing point and the sides of the box in the image, we can extrapolate world coordinates for each of the 5 planes and then use 3D rendering techniques to explore the image.

Original TIP paper

We decided to implement our Tour into the Picture with WebGL and CoffeeScript, so that it could run in the web browser. The following demos are not videos, they are rendered directly in the browser.

Guided fly-through

Here is a demonstration of our WebGL Tour into the Picture running in fully automatic guided mode. In this example, we move the camera toward the rear plane while rotating it slightly (we move the Euler angles in a small circle as we progress down the hallway).

Interactive tour

You can also control the WebGL Tour into the Picture yourself using the keyboard or button clicking. First, choose an input image (the image plane parameters have been pre-selected for you), and then use either the buttons or the arrow keys to move through the scene.

Choose an image to view:

When you select an image above, an interactive tour GUI will appear here in your web browser.

Our TIP GUI

We created a GUI for labeling Tour into the Picture data points. This helps us gather the TIP parameters in a visual way, instead of trying to pick out points in Photoshop. The GUI provides:

Labeling the vanishing point, using guidelines.
Labeling the floor edges
Labeling the ceiling height in a symmetric way.
Labeling the size of the backplane based on the 4 box edges.

You can upload an image to the GUI and label it. The GUI outputs JSON data that can be used in many programming languages. You can also just try out the GUI by following this link:

Open the TIP labeling GUI

The GUI also displays the Tour into the Picture WebGL code so you can explore your newly labeled TIP image.

Pre-rendered Examples

In case your browser is too old to use any of the previous demos, we have pre-rendered some examples of our TIP code in action.

City

This city example doesn’t exactly fit into the assumptions of 1-D perspective, since the cars and trees stick out. I chose to use the building facades as the left and right wall. The ceiling is not very well defined, because the sky is the ceiling, but I just chose the highest building. The back plane is also not very well defined either. Nonetheless, I can get some interesting views out of this example.

Original

Floor

Facade

Another facade

Courtyard

This courtyard example works much better, since the geometry is simpler. However, the planters in the center still mess up the 1-D perspective assumptions.

Original

Ground

Plants

Window

Museum

This museum is probably the best example, since the room is empty and the walls and ceiling are very straight. However, I wish that there were more pixels devoted to the left and right walls. In the 3-D model, the left and right walls are very short as a result.

Original

Ground

Ceiling

Corner

Project 2: Video Textures

The goal of the video textures method is to synthesize a video of arbitrary length, given a small clip of “training” video. It is a video texture, because like image textures, video textures look the best when transitions are seamless.

Original video textures paper

We decided to implement our video textures with OpenCV and C++. The OpenCV framework takes care of video decoding/encoding and also provides fast matrix libraries. Our final program can render video textures in real-time as well as generate video clips and save them to a file.

Transition model

An infinite video texture can be represented with 1) the original set of frames from the training video, and 2) a 2D transition model matrix where T_i,j represents the probability that frame j will follow frame i−1. This implies that the last frame is a dead end, because there are no entries in the transition model for it. We compute this transition model matrix with the following steps:

Extracting frames and downsampling

Each frame of the video is first extracted and then reduced to 1/4th of its original size on each dimension. The downsampling was done to speed up computation time, since video frames have a lot of noise anyway.

Pairwise L2 norm

We create a N × N matrix (where N is the number of total video frames) that contains the L2 distance between each frame and each other frame. We call this matrix D. This is a symmetric matrix, so we only need to compute half the values and then just copy them across the diagonal. The values on the diagonal are always 0.

Preserving dynamics

As described in the original video textures paper, we apply a filter over the values of D so that distances don’t only consider the difference between 2 frames, but also their neighboring frames. We use a binomial weight distribution with m = 2 (2 frames before, and 1 frame after). Here is the formula used:

We call this new matrix D′. This new matrix discards the first 2 frames and the last 1 frame, because when we apply the filter, they do not have enough frames before/after them to fully compute the distance.

Q-learning and avoiding dead ends

When the video texture reaches the last valid frame (frame N), it can no longer try to find a good frame to continue the playback. Therefore, when the program is currently displaying frame N - 1, it must always jump backward to a different point in the video, and never to frame N. In general, we can label frame N as off-limits, because it is a dead end. We use the following definition to define a new matrix D″ to capture this idea.

Because this definition is recursively defined, we must iterate through the matrix and update its elements in a loop until they converge. We used the method in the paper to do this efficiently. I defined the stopping condition to be when the sum of the non-infinite values of the min() expressions is below a certain epsilon. The value of epsilon is defined to be the initial sum of the non-infinite values of the min() expression, divided by 10.

Inverse exponentiation

We apply the following formula to turn our distance matrix D″ into a probability matrix. The values are normalized such that every row adds up to 1, unless there are infinite values in the row (for the off-limit frames). Our value of sigma is set to be the average non-infinite distance value, divided by 27 (determined heuristically to be a pretty good choice).

Non-maximal suppression

We suppress transitions that have a neighbor which has a higher probability than the current transition. However, we leave the main diagonal alone, in order to avoid frames that have no possible remaining transitions. We normalize each row to sum up to 1 again after this operation.

This is our final transition model. Here is a visualization of each step in the derivation process for a particular video that exhibits very strong periodicity (spoiler: it’s a pendulum).

Distances

Preserving dynamics

Q-learning

Inverse
exponentiation

Non-maximal
suppression

Examples

Each of these examples contains a source video (left) and a synthesized video (right). The synthesized video has an overlay that displays the most recent jumps (jumps of +1 frame are ignored, because they are the most common) as well as the current frame’s position in the source image.

Candle

The original paper used a candle video as an example, so I started out with that. The nice thing about a candle is that there is essentially only 1 dimension to its motion. That dimension is how far left or right the candle is blowing. In this particular video, the candle starts out very still, and then it flickers. This property can be seen in the visualization of the transition model.

Original

Synthesized

The upper left region corresponds to the region where the candle is still. In this case, the candle tends to jump back to another stationary frame, so the candle remains still. The lower left region corresponds to the frames toward the end where the flame is flickering. In the lower left region, we have a lot of frames that potentially could jump back to the beginning of the video, because of our dead-end avoidance.

Distances

Preserving dynamics

Q-learning

Inverse
exponentiation

Non-maximal
suppression

Pendulum

The pendulum video is periodic, so we get a nice pattern in our transition model. However, you can see that before we apply “Preserving dynamics”, the forward and reverse jumps have an equal amount of weight, which means the pendulum can suddenly change the direction of its velocity, which does not look natural. After preserving dynamics, the jumps that correspond to the opposite direction are attenuated. (The effect is not as noticeable in the visualization of the distance matrix, because we take the logarithm of its values before rendering the visualization. However, you can easily see the effect of preserving dynamics after the inverse exponentiation step.

Original

Synthesized

Distances

Preserving dynamics

Q-learning

Inverse
exponentiation

Non-maximal
suppression

Flower

This flower video benefited greatly from the non-maximal suppression step. The background of this flower video is constantly changing and adds a lot of noise to the transition model. However, non-maximal suppression is able to pick out the best transition points despite the noise and minimize how much discontinuity appears in the blurred background.

Original

Synthesized

Distances

Preserving dynamics

Q-learning

Inverse
exponentiation

Non-maximal
suppression

Water

This was the most difficult input video. I thought that water texture would be easy at first, but there is a very visible discontinuity where the video jumps back to the beginning, because a video of water has very high dimensionality to it. All of the water waves can be in different positions, and it is very difficult to find two unrelated frames that actually look similar.

Original

Synthesized

Distances

Preserving dynamics

Q-learning

Inverse
exponentiation

Non-maximal
suppression

GUI

Our video textures application also runs as a GUI. It displays the same visualization as the rendered video. It can save and load the transition matrix to/from a file, so that it does not have to be recomputed every time.

Project 3: High Dynamic Range

The real world has a very large dynamic range that cannot be simultaneously fully captured by cameras. The goal of HDR is to determine the radiance of a scene from multiple exposures and remap this radiance onto pixel values to attempt to more fully express the range of the original scene with a limited number of values.

HDR is composed of two separate steps: (1) constructing a radiance map of relative intensity values of scene locations and (2) applying tone operators to map the radiance map to the small range of [0, 255] pixel values.

Radiance map construction

This algorithm was taken from Debevec and Malik 1997. Recovering the radiance map of a scene is composed into several steps:

Selecting points in the scene: We will sample the pixel values as these points across differently exposed images to find the camera's response curve.
Finding response curve g: Given that f is the mapping from exposure x to pixel value z, we will find g - the log inverse function of f - by minimizing the following expression, where

N is the number of locations in the scene
P is the number of images of the scene with unique exposures
w is a weight function that increases the value for pixel values at the middle of the [0, 255] range
z_{ij} is the pixel value at location i in image j
g is the response function mapping from pixel values to log exposure
E_i is the scene radiance at location i
Δt_j is the shutter speed of image j
λ is the smoothness constant that ensures that g is a monotonically increasing smooth curve
Z_{max} is the pixel value 255
Z_{min} is the pixel value 0

Compute radiance for every location in the scene: After recovering the response curve, we are able to compute relative radiance values for every location in the image. These values do not necessarily correspond to real-world units of radiance, but they are valid with respect to each other.

Tone mapping

Once we have radiance values, it is necessary to find the proper mapping from the large spread of values to the limited [0, 255] pixel value range. We worked on two different tone mappers:

Global Reinhart operator: The Reinhart operator remaps values according to whether they are greater or less than constant c. Values in the range [0, c] get mapped to [0, 0.5], and values in the range [c, inf] get mapped to [0.5, 1]. It is additionally helpful because it creates an asymptote at 255 (making all values within the pixel range) and has a derivative of 1 at 0, meaning that dark values are left dark.

Local Durand operator: This algorithm was taken from Durand 2002. Also, much thanks to a write-up by a previous student of this course.
The motivation for this local tone mapping operator is to reduce the spread of radiance values without losing the precision of edges and causing a "halo" effect. It aims to do this by creating two layers from the scene intensity/luminance - the base layer, which is scaled to reduce the spread of intensity values, and the details layer, which preserves the precise edges of the image.

Compute the intensity (luminance) of the image.

Compute the chrominance channels and log intensity of the image.
Filter the log intensity with a bilateral filter to form the base layer. We implemented a bilateral filter that computes the value B(x) of each pixel x based on a set of neighboring pixels for speedup, where

ξ is the set of neighboring pixels
k(x) is the normalization term
c(x, p) is the geometric closeness of pixels x and p
s(x, p) is the photometric closeness of pixels x and p
σ_d is the geometric spread sigma
σ_r is the photometric spread sigma
I(p) is the intensity of pixel p

Compute the detail layer by subtracting the filtered base layer from the intensity map.
Scale the base layer to reduce the spread of values.
Reconstruct the color image from the base and detail layers and chrominance channels, applying gamma compression and/or global tone operators as necessary.

Memorial Church

This set of images of Stanford Memorial Church (source) is the classic example used by both Debevec and Durand.

32 s

16 s

8 s

4 s

2 s

1 s

1/2 s

1/4 s

1/8 s

1/16 s

1/32 s

1/64 s

1/128 s

1/256 s

1/512 s

1/1024 s

Here are the recovered response curves for the three channels.

red channel

green channel

blue channel

Of the left side of the row of images below is the result of applying the global Reinhart tone mapper. On the right side is the result of the applying the local Durand tone mapper.