DX12/Vulkan: Boost Shadow Map Performance In Flax Engine

by Alex Johnson

In Flax Engine 1.11, the DX12 and Vulkan renderers on Windows currently run slower than the DX11 renderer. This article looks at the performance gap, proposes solutions, and discusses the benefits of optimizing shadow map rendering.

Performance Discrepancies: DX12 vs. DX11

When comparing the performance of the DX12 and DX11 renderers in Flax Engine, there's a noticeable difference, especially in projects like Bistro. Let's break down the specifics:

DX12 Performance

Using the DX12 renderer with the Bistro project, the frame rate hovers around 60 FPS. While this is a functional frame rate, it leaves room for improvement, especially when aiming for smoother, more responsive gameplay experiences. The goal is to optimize the DX12 renderer to match or exceed the performance of DX11.

DX11 Performance

Switching to the DX11 renderer in the same Bistro project yields a significantly higher frame rate of approximately 100 FPS. In frame-time terms, that is roughly 10 ms per frame on DX11 versus roughly 16.7 ms at the 60 FPS seen on DX12, a gap of about 6.7 ms per frame. This contrast indicates that the DX12 renderer is not using the hardware as effectively as its DX11 counterpart, and understanding why is the first step toward targeted optimization. Closing the gap matters beyond the raw numbers: higher frame rates reduce latency and give smoother visuals, especially in fast-paced scenes, and the recovered frame time can be spent on more complex scenes and effects.

Understanding the Bottleneck

The performance disparity between DX11 and DX12 can largely be attributed to how each API handles draw call submission and multithreading. In DX11, the driver does much of the parallelization and merging of draw calls implicitly, whereas DX12 requires explicit management of command lists and synchronization. If the engine does not take over that work itself, DX12 can end up bottlenecked. Understanding this difference is what makes targeted optimization possible: better resource management, lower overhead, and ultimately a smoother experience for end users.

DX11: Driver-Managed Parallelization

In DX11, the driver handles much of the parallelization behind the scenes. Draw calls are submitted to either the immediate or a deferred context, and the driver can parallelize and merge this work internally. This abstraction simplifies development, since the application does not need explicit multithreading to get reasonable submission performance, and it likely contributes to the higher frame rates observed with DX11. Because the driver takes over resource tracking and scheduling, less of that work lands on the engine's own render thread, which is why DX11 can often deliver smooth results even on less powerful hardware.

DX12: Explicit Multithreading and Synchronization

DX12, on the other hand, requires developers to manage command lists and synchronization explicitly. In Flax, all draw calls are currently executed in RenderList::ExecuteDrawCalls, which blocks the main thread. Where the DX11 driver handles parallelization on its own, a DX12 renderer has to create multiple command lists and manage descriptor and allocator pools itself to achieve parallelism. Getting resource management or cross-thread synchronization wrong results in lower frame rates. Done well, DX12 can provide significant performance gains on modern hardware.

Proposed Solutions for DX12/Vulkan Optimization

To address the performance issues in DX12 and Vulkan, several changes can be made. They focus on multithreading, resource management, and API abstraction, with the aim of consistent performance across platforms.

DX12 Context Pool

To improve multithreading in DX12, the engine needs a context pool in addition to the _mainContext. Each pooled context would own its own command allocator, descriptor heaps, and upload buffers, so that every worker thread records with private resources and threads do not contend with each other. The pool should scale with the hardware and the workload, and it must be managed carefully so that resources are released when they are no longer needed and nothing leaks.
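The pooling idea can be sketched in standard C++ without any graphics API. Everything here is hypothetical and is not Flax's code: RecordingContext stands in for an object that would wrap a command allocator, descriptor heaps, and an upload buffer, and ContextPool is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for per-thread recording state. In a real DX12
// backend this would wrap a command allocator, descriptor heaps and an
// upload buffer instead of a plain vector.
struct RecordingContext
{
    int id = 0;                 // index inside the pool, useful for debugging
    std::vector<int> commands;  // placeholder for recorded commands

    void Reset() { commands.clear(); }
};

// Minimal pool: contexts are created lazily, handed out under a mutex and
// returned to a free list once their work has been submitted.
class ContextPool
{
public:
    RecordingContext* Acquire()
    {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_free.empty())
        {
            _all.push_back(std::make_unique<RecordingContext>());
            _all.back()->id = static_cast<int>(_all.size()) - 1;
            return _all.back().get();
        }
        RecordingContext* context = _free.back();
        _free.pop_back();
        return context;
    }

    void Release(RecordingContext* context)
    {
        assert(context != nullptr);
        context->Reset();  // the caller still owns the context at this point
        std::lock_guard<std::mutex> lock(_mutex);
        _free.push_back(context);
    }

    std::size_t Size() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _all.size();
    }

private:
    mutable std::mutex _mutex;
    std::vector<std::unique_ptr<RecordingContext>> _all;  // owns every context
    std::vector<RecordingContext*> _free;                 // contexts ready for reuse
};
```

The mutex is only taken when a context is acquired or released, not while recording, so contention stays low. A real DX12 version would also have to wait, typically on a fence, until the GPU has finished with a context's command allocator before reusing it.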

Descriptor/Allocator Management

The existing ring buffer heaps (DescriptorHeapRingBufferDX12, DescriptorHeapWithSlotsDX12) are not thread-safe, which becomes a problem as soon as command lists are recorded in parallel. These heaps must therefore be either duplicated for each context or protected with locks or atomic pointers. Duplication gives each thread a private heap and removes contention at the cost of extra memory. Locks or atomics keep a single shared heap but add some synchronization overhead. Which option fits better depends on the engine's trade-off between memory usage and synchronization cost. Either way, thread-safe access to descriptor heaps is a prerequisite for parallel recording in DX12.
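As an illustration of the "atomic pointer" option, here is a minimal sketch of a ring allocator whose head is advanced with a compare-and-swap loop. AtomicDescriptorRing is a hypothetical class, not the engine's DescriptorHeapRingBufferDX12, and it only hands out slot indices rather than real descriptors.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical lock-free ring allocator over a fixed number of descriptor
// slots. It never blocks, but it also does not check whether the GPU is
// still reading slots that are about to be reused.
class AtomicDescriptorRing
{
public:
    explicit AtomicDescriptorRing(uint32_t capacity)
        : _capacity(capacity), _head(0)
    {
    }

    // Reserves `count` contiguous slots and returns the index of the first.
    // Safe to call from several threads at once.
    uint32_t Allocate(uint32_t count)
    {
        assert(count > 0 && count <= _capacity);
        uint64_t head = _head.load(std::memory_order_relaxed);
        for (;;)
        {
            const uint64_t start = head % _capacity;
            // If the range would run past the end, skip the tail so the
            // returned slots stay contiguous.
            const uint64_t padding =
                (start + count > _capacity) ? (_capacity - start) : 0;
            const uint64_t next = head + padding + count;
            // On failure `head` is refreshed with the current value and the
            // loop retries with it.
            if (_head.compare_exchange_weak(head, next, std::memory_order_relaxed))
                return static_cast<uint32_t>((head + padding) % _capacity);
        }
    }

private:
    const uint32_t _capacity;
    std::atomic<uint64_t> _head;  // monotonically increasing write position
};
```

With a capacity of 8, three consecutive Allocate(3) calls return 0, 3, and then 0 again, because the third range would not fit before the end of the ring. A production allocator would additionally track which frame each range belongs to so that slots are not overwritten while the GPU still uses them.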

API Extension: Context Management in Render Passes

Modify RenderList::ExecuteDrawCalls and all passes to accept (or request) a context instead of always using GPUDevice::Instance->GetMainContext(). This also requires a mechanism for reassembling the recorded command lists at the end of the frame (CommandQueueDX12::ExecuteCommandLists). Once render passes can operate on different contexts, the engine can distribute recording across multiple threads. Command lists and synchronization must be managed carefully so that all commands still execute in the correct order, but the gains can be significant, especially in complex scenes with many draw calls. It also makes the rendering pipeline more flexible.
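The pattern of "record in parallel, submit in a fixed order" can be sketched as follows. PassDesc, RecordedList, RecordPass, and RecordFrame are hypothetical names invented for this example. Strings stand in for GPU commands, and the final flattening step stands in for a single ordered submission at the end of the frame.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list.
struct RecordedList
{
    std::vector<std::string> commands;
};

// Hypothetical pass description: a name plus the number of draws it issues.
struct PassDesc
{
    std::string name;
    int drawCount;
};

// Records one pass into the list it is given instead of a global main context.
void RecordPass(const PassDesc& pass, RecordedList& out)
{
    assert(pass.drawCount >= 0);
    for (int i = 0; i < pass.drawCount; i++)
        out.commands.push_back(pass.name + ":draw" + std::to_string(i));
}

// Records every pass on its own thread, then flattens the lists in pass
// order, so the submitted order does not depend on thread timing.
std::vector<std::string> RecordFrame(const std::vector<PassDesc>& passes)
{
    std::vector<RecordedList> lists(passes.size());
    std::vector<std::thread> workers;
    workers.reserve(passes.size());
    for (std::size_t i = 0; i < passes.size(); i++)
        workers.emplace_back([&passes, &lists, i] { RecordPass(passes[i], lists[i]); });
    for (std::thread& worker : workers)
        worker.join();

    std::vector<std::string> submitted;
    for (const RecordedList& list : lists)
        submitted.insert(submitted.end(), list.commands.begin(), list.commands.end());
    return submitted;
}
```

Each thread writes only to its own list, so recording needs no locking. Ordering is restored at submission time by walking the lists in pass order. A real engine would use a job system rather than spawning one thread per pass.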

Platform Abstraction: Consistent Code Path

To keep the code path consistent, similar pools should be created for Vulkan (Secondary Command Buffers) and DX11 (Deferred Contexts). The same multithreading techniques can then be used on every platform, which simplifies development and maintenance and lets rendering code stay free of platform-specific details. It should also make porting to new platforms easier. The abstraction layer should be extensible enough to accommodate new APIs, and thin enough that it does not add measurable overhead of its own.
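One possible shape for such an abstraction is a small backend-neutral interface with one implementation per API. All names below (IWorkerContext, WorkerContextDX12, CreateWorkerContext, and so on) are hypothetical, and the implementations only report which native object they would wrap.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical backend-neutral interface for a worker-thread recording context.
class IWorkerContext
{
public:
    virtual ~IWorkerContext() = default;
    // Describes the native object a real implementation would wrap.
    virtual std::string NativeKind() const = 0;
};

class WorkerContextDX12 final : public IWorkerContext
{
public:
    std::string NativeKind() const override { return "command list"; }
};

class WorkerContextVulkan final : public IWorkerContext
{
public:
    std::string NativeKind() const override { return "secondary command buffer"; }
};

class WorkerContextDX11 final : public IWorkerContext
{
public:
    std::string NativeKind() const override { return "deferred context"; }
};

enum class Backend { DX11, DX12, Vulkan };

// Render passes would ask for a context through this factory and never need
// to know which backend is active.
std::unique_ptr<IWorkerContext> CreateWorkerContext(Backend backend)
{
    switch (backend)
    {
    case Backend::DX11:   return std::make_unique<WorkerContextDX11>();
    case Backend::DX12:   return std::make_unique<WorkerContextDX12>();
    case Backend::Vulkan: return std::make_unique<WorkerContextVulkan>();
    }
    assert(false && "unknown backend");
    return nullptr;
}
```

Pass code written against IWorkerContext stays identical across backends. Only the factory and the per-API classes differ.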

Benefits of Implementing These Solutions

Implementing these solutions is a complex undertaking but is essential for achieving optimal performance in the long run. The benefits include:

  • Improved Multithreading: By distributing the workload across multiple threads, the engine can fully leverage the capabilities of modern multi-core processors.
  • Reduced Bottlenecks: Thread-safe resource management and context pools reduce contention and improve parallel execution.
  • Consistent Performance: Platform abstraction ensures that the same multithreading techniques can be used across different platforms, simplifying development and maintenance.
  • Scalability: The proposed solutions are designed to be scalable and efficient, allowing the engine to adapt to different hardware configurations and workloads.

In conclusion, optimizing shadow map rendering performance in Flax Engine for DX12 and Vulkan is crucial for delivering a smooth and responsive gaming experience. The proposed changes (a context pool, thread-safe descriptor management, context-aware render passes, and a shared abstraction across backends) would let the engine make full use of modern multi-core hardware. They require careful planning and execution, but the long-term benefits are well worth the effort.

For more information on DirectX 12, visit the Microsoft DirectX documentation.