Final Project: Neural Radiance Field!

By Sai Kolasani and Avidan Shah

Introduction

A neural radiance field (NeRF) is a neural network that reconstructs complex three-dimensional scenes from a partial set of two-dimensional images. The network learns the geometry and appearance of a particular scene and can then render photorealistic views from novel viewpoints, filling in the gaps between the captured images. In this project we implement NeRF from scratch.

Part 1: Fit a Neural Field to a 2D Image

This section focuses on implementing a Neural Field to map 2D pixel coordinates to RGB pixel values using a Multilayer Perceptron (MLP) with sinusoidal positional encoding.

The MLP learns to reconstruct the target 2D image by predicting pixel colors based on sampled input coordinates. To achieve this, we:

  • Implemented a sinusoidal positional encoding layer to enhance input representation.
  • Built a dataloader to randomly sample pixel coordinates and colors for efficient training.
  • Trained the network using mean squared error (MSE) loss and evaluated quality with Peak Signal-to-Noise Ratio (PSNR).
  • Experimented with hyperparameters like the number of layers, hidden units, and learning rate to optimize performance.

Through this process, we gained a deeper understanding of neural fields and their ability to fit and reconstruct 2D data.
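To make this concrete, below is a minimal PyTorch sketch of the pieces described above: the sinusoidal positional encoding, the coordinate-to-RGB MLP, and the PSNR metric. The layer count, width, and L value shown here are placeholders rather than our exact configuration (those are listed per image below), and the psnr helper assumes images normalized to [0, 1].

```python
import torch
import torch.nn as nn

def positional_encoding(x, L):
    """Map each coordinate to [x, sin(2^0*pi*x), cos(2^0*pi*x), ..., sin(2^(L-1)*pi*x), cos(2^(L-1)*pi*x)]."""
    out = [x]
    for i in range(L):
        out.append(torch.sin((2.0 ** i) * torch.pi * x))
        out.append(torch.cos((2.0 ** i) * torch.pi * x))
    return torch.cat(out, dim=-1)

class NeuralField2D(nn.Module):
    """MLP that maps encoded 2D pixel coordinates (normalized to [0, 1]) to RGB values."""
    def __init__(self, L=10, hidden=256, num_layers=4):
        super().__init__()
        self.L = L
        in_dim = 2 + 2 * 2 * L                      # raw (x, y) plus a sin/cos pair per frequency per axis
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers += [nn.Linear(hidden, 3), nn.Sigmoid()]   # RGB constrained to [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, xy):
        return self.net(positional_encoding(xy, self.L))

def psnr(mse):
    """PSNR in dB for images in [0, 1]: 10 * log10(1 / MSE)."""
    return 10.0 * torch.log10(1.0 / mse)
```

Each training step samples a random batch of pixel coordinates, predicts their colors, and takes a gradient step on the MSE against the ground-truth pixels; PSNR is logged from that same MSE.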

Fox

Below are the hyperparameters used for training:

  • Hidden Layers: __
  • Highest Frequency (L): __
  • Hidden Neurons per Layer: __
  • Learning Rate: __
  • Batch Size: __
  • Epochs: __

PSNR vs Epoch Training Curve

Cat

Below are the hyperparameters used for training:

  • Hidden Layers: __
  • Highest Frequency (L): __
  • Hidden Neurons per Layer: __
  • Learning Rate: __
  • Batch Size: __
  • Epochs: __

PSNR vs Epoch Training Curve

Hyperparameter Tuning

For the cat images we also tried two separate hyperparameter configurations: one setup with double the number of hidden layers, and one with the learning rate lowered to 0.0002. Both sets of results are below:

Double Layers

Below are the hyperparameters used for training:

  • Hidden Layers: __
  • Highest Frequency (L): __
  • Hidden Neurons per Layer: __
  • Learning Rate: __
  • Batch Size: __
  • Epochs: __

PSNR vs Epoch Training Curve

Lower Learning Rate

Below are the hyperparameters used for training:

  • Hidden Layers: __
  • Highest Frequency (L): __
  • Hidden Neurons per Layer: __
  • Learning Rate: __
  • Batch Size: __
  • Epochs: __

PSNR vs Epoch Training Curve

Part 2.1: Create Rays from Cameras

In this section, we implemented several functions to compute and transform rays from camera parameters. These functions enable us to move between world, camera, and pixel coordinate systems, which are essential for ray tracing and rendering. The key steps and functions implemented are:

  • Camera-to-World Transformation:
    • We implemented the function transform(c2w, x_c) to transform 3D points from camera coordinates to world coordinates using the camera's extrinsic matrix. The inverse transformation, transform(c2w.inv(), x_w), converts points from world coordinates back to camera coordinates.
  • Pixel-to-Camera Conversion:
    • The function pixel_to_camera(K, uv, s) converts 2D pixel coordinates into 3D camera coordinates by applying the inverse of the camera's intrinsic matrix K and scaling by the depth s = z_c along the optical axis.
  • Pixel-to-Ray Conversion:
    • We implemented pixel_to_ray(K, c2w, uv), which generates rays for each pixel. This function computes the ray's origin (r_o) and direction (r_d) in the world coordinate system, normalizing r_d to ensure unit vectors.

By implementing these functions, we can efficiently generate and handle rays for rendering tasks.
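The NumPy sketch below shows one possible implementation of these helpers. The exact conventions used here (a 4x4 c2w matrix with the camera center in its last column, pixel coordinates already shifted to pixel centers, and a scalar depth s) are assumptions for illustration, not a verbatim copy of our code.

```python
import numpy as np

def transform(c2w, x_c):
    """Apply a 4x4 camera-to-world matrix to a batch of 3D points (N, 3)."""
    x_h = np.concatenate([x_c, np.ones((x_c.shape[0], 1))], axis=-1)   # homogeneous coordinates
    return (x_h @ c2w.T)[:, :3]
    # transform(np.linalg.inv(c2w), x_w) gives the inverse world-to-camera mapping.

def pixel_to_camera(K, uv, s):
    """Lift 2D pixel coordinates (N, 2) to 3D camera coordinates at depth s."""
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=-1)
    return s * (uv_h @ np.linalg.inv(K).T)                             # x_c = s * K^{-1} [u, v, 1]^T

def pixel_to_ray(K, c2w, uv):
    """Compute world-space ray origins r_o and unit directions r_d for pixels uv."""
    r_o = np.tile(c2w[:3, 3], (uv.shape[0], 1))                        # camera center in world space
    x_w = transform(c2w, pixel_to_camera(K, uv, s=1.0))                # a world point at depth 1 per pixel
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=-1, keepdims=True)            # normalize to unit vectors
    return r_o, r_d
```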

Part 2.2: Sampling

In this section, we implemented sampling methods to generate rays and points for rendering. These samples form the foundation for training NeRF models and are critical for capturing the 3D structure of scenes. The key steps and functions implemented are:

  • Sampling Rays from Images:
    • We extended the ray sampling process to handle multiple images, leveraging camera intrinsics and extrinsics to compute ray origins and directions.
    • We implemented sampling in one of two ways:
      • Option 1: Sample M images, then sample N / M rays from each image.
      • Option 2: Flatten all pixels across images and perform a global sampling of N rays.
    • To align ray sampling with the pixel grid, we adjusted UV coordinates to account for the pixel center offset (adding 0.5).
  • Sampling Points along Rays:
    • We discretized each ray into samples along its path in 3D space using the function np.linspace(near, far, n_samples).
    • To avoid overfitting to fixed sample locations during training, we added random perturbations to the sampled depths: t = t + np.random.rand(*t.shape) * t_width.
    • The batched 3D samples along each ray are then computed as rays_o + rays_d * t, broadcasting over the samples on each ray.

By implementing these sampling techniques, we ensure that the ray and point distributions provide sufficient coverage of the 3D space, enabling effective training for the NeRF model.
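A minimal NumPy sketch of the point-sampling step is shown below. The near/far bounds and sample count are placeholder defaults, not necessarily the values we used, and perturbation is only enabled during training.

```python
import numpy as np

def sample_points_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Discretize each ray into n_samples 3D points between near and far.

    rays_o, rays_d: (N, 3) ray origins and unit directions.
    Returns points of shape (N, n_samples, 3) and the sample depths t of shape (N, n_samples).
    """
    t = np.linspace(near, far, n_samples)                              # (n_samples,)
    t = np.broadcast_to(t, (rays_o.shape[0], n_samples)).copy()
    if perturb:
        t_width = (far - near) / n_samples
        t += np.random.rand(*t.shape) * t_width                        # jitter sample depths during training
    points = rays_o[:, None, :] + rays_d[:, None, :] * t[..., None]    # broadcast over samples per ray
    return points, t
```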

Part 2.3: Putting the Dataloading All Together

In this section, we implemented a dataloader that takes in a list of images and a list of cameras and outputs a batch of rays and their corresponding pixel colors; a rough sketch is shown below. As a sanity check, we also plotted our sampled data using the visualization code; that visualization appears after the sketch.
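The sketch follows Option 2 from Part 2.2 (flattening all pixels across images and sampling globally). The class name RaysData, the assumed image array shape of (n_images, H, W, 3) in [0, 1], and the use of the pixel_to_ray helper from the Part 2.1 sketch are illustration choices, not necessarily our exact interface.

```python
import numpy as np

class RaysData:
    """Illustrative ray dataset: flattens all pixels across images and samples rays globally."""
    def __init__(self, images, K, c2ws):
        n_images, H, W = images.shape[:3]
        self.K, self.c2ws = K, c2ws
        self.pixels = images.reshape(-1, 3)                            # ground-truth colors, (n_images*H*W, 3)
        # Precompute (image index, u, v) for every pixel; +0.5 shifts to pixel centers.
        img_idx, v, u = np.meshgrid(np.arange(n_images), np.arange(H), np.arange(W), indexing="ij")
        self.img_idx = img_idx.ravel()
        self.uv = np.stack([u.ravel() + 0.5, v.ravel() + 0.5], axis=-1)

    def sample_rays(self, N):
        """Return (rays_o, rays_d, pixels) for N randomly chosen pixels."""
        idx = np.random.choice(self.uv.shape[0], size=N, replace=False)
        rays_o = np.empty((N, 3)); rays_d = np.empty((N, 3))
        for j, i in enumerate(idx):                                    # simple, unvectorized for clarity
            rays_o[j], rays_d[j] = pixel_to_ray(self.K, self.c2ws[self.img_idx[i]], self.uv[i][None])
        return rays_o, rays_d, self.pixels[idx]
```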

Part 2.4: Neural Radiance Field

In this section, we extended the MLP from Part 1 to create a Neural Radiance Field (NeRF) that predicts both density and color for 3D samples. The network was modified to handle 3D inputs and view-dependent outputs. The key steps and modifications include:

  • Input Representation:
    • Inputs are now 3D world coordinates ([x, y, z]) and 3D ray directions ([dx, dy, dz]), instead of 2D pixel coordinates.
    • The ray direction is encoded using positional encoding (PE), but with fewer frequencies (L = 4) than the coordinate PE (L = 10).
  • Output Prediction:
    • The network predicts both density and RGB color values for each 3D sample point.
      • Density: Constrained to be positive using ReLU.
      • Color: Constrained to range [0, 1] using Sigmoid.
      • Color depends on both the point location and view direction.
  • Network Architecture: Our new network roughly follows the diagram below; a code sketch of the same high-level structure is given after the diagram.
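The PyTorch sketch below captures the high-level structure (shared trunk, density head with ReLU, view-dependent color head with Sigmoid). The trunk depth, layer widths, and any skip connections in our actual network follow the diagram rather than this sketch, and it reuses the positional_encoding helper from the Part 1 sketch.

```python
import torch
import torch.nn as nn
# Assumes the positional_encoding(x, L) helper from the Part 1 sketch is in scope.

class NeRFMLP(nn.Module):
    """Sketch of a NeRF network: 3D position + view direction -> (density, RGB)."""
    def __init__(self, L_x=10, L_d=4, hidden=256):
        super().__init__()
        self.L_x, self.L_d = L_x, L_d
        in_x = 3 + 2 * 3 * L_x                      # encoded 3D position
        in_d = 3 + 2 * 3 * L_d                      # encoded view direction
        self.trunk = nn.Sequential(
            nn.Linear(in_x, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())       # sigma >= 0
        self.color_head = nn.Sequential(
            nn.Linear(hidden + in_d, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),                             # RGB in [0, 1]
        )

    def forward(self, x, d):
        h = self.trunk(positional_encoding(x, self.L_x))
        sigma = self.density_head(h)
        rgb = self.color_head(torch.cat([h, positional_encoding(d, self.L_d)], dim=-1))
        return sigma, rgb
```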

Part 2.5: Volume Rendering

In this section, we implemented the volume rendering equation to compute the color of a ray as it passes through 3D space. This process uses densities and colors predicted by the NeRF and combines them along a ray. The key steps include:

  • Volume Rendering Equation:
    • The continuous volume rendering equation integrates the color contributions along a ray, weighted by transmittance and density:
    • $C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt$, where $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$
    • We implemented the discrete approximation for computation:
    • $\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$, where $T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$
  • Verification:
    • We verified the implementation against the provided test case, whose assert statement checks that our rendered outputs match the expected values.

This implementation ensures that the rendered colors accurately represent the cumulative contributions of densities and colors along each ray, forming the basis for synthesizing images with NeRF.
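A compact PyTorch sketch of the discrete equation follows. The function name volrend and the assumption of a uniform step size delta between samples are illustration choices; the signature expected by the provided test may differ.

```python
import torch

def volrend(sigmas, rgbs, step_size):
    """Discrete volume rendering along each ray.

    sigmas: (N_rays, n_samples, 1) densities; rgbs: (N_rays, n_samples, 3) colors;
    step_size: delta_i between consecutive samples (assumed uniform here).
    Returns rendered ray colors of shape (N_rays, 3).
    """
    alpha = 1.0 - torch.exp(-sigmas * step_size)                       # 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance accumulated before sample i
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)   # shift to an exclusive product
    weights = trans * alpha                                            # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights * rgbs).sum(dim=1)                                 # sum of weighted colors per ray
```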

Final Results

Now we put it all together to train the full NeRF and render novel views of the scene!

Below are the hyperparameters used for training:

  • Hidden Layers: __
  • Highest Frequency (L): __
  • Hidden Neurons per Layer: __
  • Learning Rate: __
  • Batch Size: __
  • Epochs: __

Training Curve

Training Process Visualized:

Validation Process Visualized:

Spherical Rendering of Lego Truck

Bells & Whistles: Background Color