The productivity of artists working on high quality photorealistic renders of 3D scenes, especially 3D animations, is limited by the long time to wait for the image to be computed. Microrendering is a novel approach to the final gathering phase of several physically based rendering algorithms that by making use of the massively parallel nature of modern GPUs is able to improve the rendering speed to real time rates. Photon mapping uses a Monte-Carlo approach to approximating the integral in the rendering equation. As a preprocessing step, it shoots photons from the light sources and stores the hits in a photon map. For indirect illumination, final gathering computes the incident radiance from reflected and transmitted rays by density estimation from the nearest hits in the photon map. I propose incorporating microrendering into Yafaray's photon integrator and implementing the GPU based computation using OpenCL, a parallel computing standard backed by many large corporations. This will also facilitate the use of OpenCL in Yafaray for accelerating many other parallel tasks in future.
Since rasterization of a triangle based representation of the scene is quite costly when it needs to be done many times and at low resolutions (can't make use of scan-line/surface coherence, several scattered rays may hit near the same surface point), microrendering uses a point hierarchy representation that is based on QSplat [2].
1. Random points are generated on each triangle in the scene (
where u,v are random numbers
for
). To distribute them more uniformly, best candidate sampling is used i.e the point farthest from all previously selected points is selected next. This can be optimized using a kd-tree or an octree for finding the nearest point. The number of point to generate is proportional to the area of the triangles and the constant of proportionality, the point density, gives the level of detail of the geometry representation.
2. Normals (equal to the normal of their corresponding triangles) and radii are assigned to each point, forming a collection of discs that cover every surface. In [1] the radii depend on the point density. In [2] the location of the points coincide with the vertices and the radii are chosen so that no holes are left.
3. The points are arranged in a hierarchy using binary space partitioning. The output is a sorting of the point set, which represents a balanced tree such that skip pointers to the children need not be stored, they can computed on the fly, when needed. The recursive algorithm takes a set of points, computes the axis on which they have the maximum extent (maximum value - minimum value), sorts the points by the coordinates on that axis and finally, splits the set in half and calls itself recursively on those halves.
4. Bounding spheres are associated to the point hierarchy. Each node has exactly two children i.e we don't have to solve the general bounding sphere of spheres problem, but just this simple case of two spheres centered at
and
with radii
and
. Thus the radius of the bounding sphere is
and is centered at
.
5. The diffuseMap.gather is called for each leaf node in the hierarchy to compute an initial radiance. Radiance is also stored in the internal nodes because often it's enough to get the average radiance from the child nodes.
Design notes:
The algorithms/data structures presented here may be useful for other tasks and there are other approaches to implement them. To facilitate this, they will be encapsulated in classes and the strategy pattern will be applied. For example BoundingSphere (input point set or sphere set, output one bounding sphere), QSplatHierarchy (input scene, output some representation of the hierarchy), BestCandidateSampling (input set of random points and the sampling number, output set of best candidates) etc.
Integration:
To make sure the process of adding the new features does not break existing functionality, a new integrator plugin will be written, PhotonMicroIntegrator. This will reuse code from the existing photon integrator and will be merged back into it once it is stable enough. Most of the preprocessing phase will be the same, but instead of storing the radiance in the radianceMap, it will be stored in the point hierarchy, computed in the same manner.
The microbuffers can be thought of as warped renders of the scene from the point of view of individual final gather points. The term micro refers to the low resolution of these renders i.e the small size of the buffers in which they are stored, 8x8-24x24.
The pixels in the buffers represent different gathering directions, denoted by the mapping
. The solid angle of the volume around the direction to gather from is denoted by
. Although it's not explicitly stated in [1]
should be the center of the microbuffer (because that's how you get rays in all directions for the Nusselt projection, it's not so clear for the warped case). There are index buffers to store the index of the nearest node, depth buffers to store the distance to the nearest node and frame buffers for storing the gathered radiance. In order to support any BSDF (and not just BRDFs as in [1]) two sets of microbuffers are needed for directions toward the upper and lower hemisphere.
is the magnitude of the gradient of
and while intuitively it should be some kind of gradient, that is defined on a scalar field while
is a vector field. Based on the notation of the gradient they could be referring to the divergence
or the magnitude of the curl
. In either case, the derivatives for the formulas can be computed by taking finite differences from the table containing the values of
in the micro buffer. Also, maybe simply computing the magnitude of the average of the neighbours of the pixels is a good enough approximation.![]() |
While
could be taken as a standard hemispherical projection (and multiplying the incident radiance with the BRDF to obtain its contribution), importance sampling should be used to gather more from directions having a higher contribution in order to improve the time/quality ratio. Because the analytical form for importance sampling of the BSDF is implemented (or not) in Yafaray through the material's sample method, it will be ported and used instead of the method given in [1]. The directions cannot be arranged arbitrarily in the buffer, because it will be used to compute partial derivatives for
. Since without importance distortion, the sampling parameters s1, s2
denote the height (Z) and the angle (divided by
) around the normal at the surface point, s1 should decrease (or increase for the lower hemisphere) radially outwards from the center of the buffer and s2 should sweep the angle around the center. A different approach is to generate s1, s2 with Halton scrambling but to radially sort the resulting directions.
After the importance sampled directions are computed, we proceed with finding the nearest nodes in the point hierarchy for every pixel. This is done in two steps:
1. The tree is traversed as follows:
a. start with the root
b. if the current node maps to more than one pixel in the micro buffer, traverse the children
- compute the direction to the node, 
- compute the pixel which
maps to i.e inverse mapping
. Since the sample method is used instead of the approach in [1], we don't have an analytical expression for the inverse mapping so we must search the buffer for the nearest matching angle. TODO: Find a data structure that can speed this up. Maybe put the angles in a grid based on their spherical coordinates ..
- compute the solid angle of the node,
: no formula is given in [1] but
should be the projected area of the node (sphere at distance d with radius r) onto the unit sphere so
where
.
- if
>
then traverse
c. otherwise if the distance to the node is smaller than the distance stored in the depth buffer, update the index and depth buffer
TODO: in [1] there is no discussion about the case when based on the solid angle the node is one pixel sized but it crosses the border between two volumes corresponding to different pixels, strictly speaking it should then be 2 pixels in size, but the algorithm counts it as 1 pixel
2. Some leaf nodes may also be greater than 1 pixel in size, these are added to a post traversal list. After traversal, rays are cast in the directions of each pixel and each ray is intersected with each node in that list to determine the nearest node. It is expected that the number of nodes in the lists should be relatively small. The ray starting from the surface point
in the direction
is given by the parametric equation
. The intersection of this with the sphere centered at
having radius
is given by the equation
. The node having the smallest solution
is the nearest one for the given pixel.
Finally the incident radiance at each of the final gather points is computed by summing up the contribution of each gather direction, the contents of the micro frame buffers.
Integration:
The final gathering phase of the integrator will initialize the OpenCL kernel, for each material get the sampling kernel, create a kernel which calls one of those based on an index, store the indices in the point hierarchy, upload the point hierarchy, start the work items, wait for the results, read the frame buffers and sum up the radiance for each sample.
While some experimental C++ bindings are available [3][4], these do not have extensive documentation or examples. As such, the standard C API will be used for now [3], but should be easy to replace once the C++ bindings mature. Both AMD, Nvidia and Apple provide many examples for the C API [5][6][7]. Custom C++ wrappers will be written instead.
Since there are many implementations of OpenCL, only the headers will be bundled with Yafaray, and the user will be responsible for providing the path to the actual library. OpenCL support will be opt-in and to avoid putting too many #ifdefs throughout the code because of this, files actually calling OpenCL methods will simply not be compiled and the number of calls to methods in such files will be minimized and/or hidden behind interfaces.
Initially when creating the context the code will look for GPU support and then CPU support before giving up. Eventually a build, command line and user interface option may also be added to pick one of them, if both are available.
For each final gather point, a work item will be created. In a manner similar to [1], constant memory will be allocated for the point hierarchy and for each work item, local memory for the microbuffers and post traversal lists.
The maintenance overhead of having two implementations of materials will be reduced by adding a virtual sampleKernel method to material_t that retrieves the OpenCL kernel snippet which samples the BSDF. This preserves code locality so both implementations can be right next to eachother, and hides the information behind the material interface allowing the materials to be moved to plugins for example. Although OpenCL doesn't support OOP, using some #defines in the kernel code and rearranging the existing BSDF code should allow some code to be copy-pasted at least. The virtual methods will return OpenCLKernel objects. These will have stream operators (<<) to simplify building the strings containing the kernel code. Porting the materials will probably take up a very long time, and as such I plan to have results even if it is not complete. Simple materials will be ported first and tested. The sampleKernel method will return an empty Kernel object if the porting has not been done for the material and in that case, rendering will fail. For composite materials the sampleKernel will combine multiple Kernels.
The deliverables of the project will be the OpenCL integration wrappers, the preprocessing phase of the algorithm (until midterm), the kernels and final implementation of the algorithm (until final evaluation). To ensure that these are all properly implemented and on time, they will all be tested as follows:
with normal
at point
will be displayed as
triangles
where
:





The new integrator will have the same user interface as the original, eventually with the addition of options for setting the point density, CPU/GPU rendering choice etc.
- read more theory, get a deeper understanding of the subject and a broader view of global illumination techniques in general
- study the OpenCL specification in detail
- study the Yafaray codebase in more detail
- help with the CMake build system
- iron out remaining details and make a final design document
- add the OpenCL headers and write some test code
- modify the build environment for OpenCL
- test OpenCL on multiple platforms
- implement the 5 steps from the preprocessing phase
- test the preprocessing implementation
- make sure everything is in order for submitting the midterm evaluation
- implement the sampling kernel framework and kernels for a few simple materials
- make the memory allocations for the micro buffers and run the work items
- implement the microbuffer processing kernel
- build the final image from the gathered data, complete integration
- continue writing more material kernels
- final testing and cleanup
[1] Microrendering http://www.mpi-inf.mpg.de/~ritschel/Microrendering/
[2] QSplat http://www.cs.princeton.edu/~smr/papers/qsplat/qsplat_paper.pdf
[3] OpenCL standard http://www.khronos.org/registry/cl
[4] http://code.google.com/p/openclam/
[5] ATI Stream SDK http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx
[6] NVIDIA OpenCL http://developer.nvidia.com/object/opencl-download.html
[7] MAC OpenCL http://developer.apple.com/technologies/mac/snowleopard/opencl.html