ProGPU is a high-performance, GPU-first UI framework and composition substrate for .NET, built on top of Silk.NET and WebGPU (wgpu-native). It provides a lightweight, low-allocation alternative to traditional heavyweight UI frameworks by routing all vector graphics, text layout, and composition operations directly to the GPU using native WebGPU draw pipelines.
ProGPU release packages are built from eng/progpu-package-list.sh by the Release GitHub Actions workflow. Samples, tests, diagnostics, and framework shim projects are intentionally not packed.
| Package | Purpose | Project |
|---|---|---|
| ProGPU.Backend | WebGPU device, swapchain, Silk.NET windowing, and platform backend services. | src/ProGPU.Backend/ProGPU.Backend.csproj |
| ProGPU.DirectX | DirectX-compatible facade and shader-oriented API surface implemented on ProGPU/WebGPU. | src/ProGPU.DirectX/ProGPU.DirectX.csproj |
| ProGPU.Transpiler | Shader/source transformation helpers used by generated GPU pipelines. | src/ProGPU.Transpiler/ProGPU.Transpiler.csproj |
| ProGPU.Compute | Compute pipeline helpers for GPU-side effects, acceleration, and future hit-test indexes. | src/ProGPU.Compute/ProGPU.Compute.csproj |
| ProGPU.Vector | Vector primitives, paths, geometry, brushes, pens, and rasterization data models. | src/ProGPU.Vector/ProGPU.Vector.csproj |
| ProGPU.Text | Text layout, glyph metrics, and GPU-ready text rendering helpers. | src/ProGPU.Text/ProGPU.Text.csproj |
| ProGPU.Scene | Scene graph, compositor commands, retained visuals, effects, and presentation primitives. | src/ProGPU.Scene/ProGPU.Scene.csproj |
| ProGPU.Layout | Measure/arrange layout substrate shared by higher-level UI adapters. | src/ProGPU.Layout/ProGPU.Layout.csproj |
| ProGPU.Virtualization | Virtualization helpers for large retained visual and item surfaces. | src/ProGPU.Virtualization/ProGPU.Virtualization.csproj |
| ProGPU.WinUI | WinUI-shaped controls and app model implemented on ProGPU. | src/ProGPU.WinUI/ProGPU.WinUI.csproj |
| ProGPU.WinUI.Charts | Chart controls and chart rendering primitives for the WinUI-shaped layer. | src/ProGPU.WinUI.Charts/ProGPU.WinUI.Charts.csproj |
| ProGPU.WinUI.Designer | Designer/editor controls and diagnostics for ProGPU WinUI surfaces. | src/ProGPU.WinUI.Designer/ProGPU.WinUI.Designer.csproj |
| ProGPU.Avalonia | Avalonia integration and compositor backend adapter. | src/ProGPU.Avalonia/ProGPU.Avalonia.csproj |
| ProGPU.Uno | Uno/WinUI integration and compositor backend adapter. | src/ProGPU.Uno/ProGPU.Uno.csproj |
| ProGPU.Dxf | DXF import/rendering support for ProGPU vector scenes. | src/ProGPU.Dxf/ProGPU.Dxf.csproj |
| LibreWPF.Interop | LibreWPF portable interop contracts consumed by the ProGPU/Silk.NET SDK lane. | src/ProGPU.Wpf.Interop/ProGPU.Wpf.Interop.csproj |

Local package build:

```shell
PROGPU_PACKAGE_VERSION=11.0.0-dev ./eng/progpu-pack.sh
```

The release workflow validates docs, restores, builds, tests, packs .nupkg/.snupkg artifacts, and can publish to NuGet.org when NUGET_API_KEY is configured. See docs/release.md.

The ProGPU framework is built in a modular, layered stack that bridges native graphics APIs and system windowing up to a modern, declarative WinUI-compatible user interface layer.

```mermaid
graph TD
    subgraph L6 ["Layer 6: Application Layer"]
        App["Gallery Dashboard / LOL/s & MotionMark Benchmarks"]
    end
    subgraph L5 ["Layer 5: WinUI Framework Layer"]
        Controls["Grid, StackPanel, ScrollViewer, Border, Pivot, RichTextBlock"]
        FE["FrameworkElement"]
        LN["LayoutNode - Measure & Arrange Sizing Negotiation"]
    end
    subgraph L4 ["Layer 4: Scene Graph & Effects Layer"]
        CV["ContainerVisual / DrawingVisual / Visual"]
        ILN["ILayoutNode Interface - Decoupled Invalidation"]
        FX["GPGPU Multi-Pass Effects Pipeline - Blur & DropShadow"]
    end
    subgraph L3 ["Layer 3: Compositor, Text & GPGPU Rasterizer"]
        Comp["Compositor - Span-Based Vertex/Index Mesh Compiler"]
        Text["TTF Line Layout & Paragraph Wrapping Engine"]
        Rast["Compute-Bound 4x SSAA Analytical Path Rasterizer"]
    end
    subgraph L2 ["Layer 2: Graphics Infrastructure"]
        Wgpu["WgpuContext - WebGPU Adapter/Device & Swapchain Management"]
    end
    subgraph L1 ["Layer 1: System & Windowing"]
        Silk["Silk.NET Windowing & GLFW OS Event Loop"]
    end
    App --> Controls
    Controls --> FE
    FE --> LN
    LN --> CV
    CV --> ILN
    CV --> FX
    ILN --> Comp
    FX --> Rast
    Comp --> Rast
    Rast --> Wgpu
    Wgpu --> Silk
```

- System & Windowing (Layer 1): Interacts with the operating system event queue and monitors display boundaries via Silk.NET and GLFW. It handles window load, resize, rendering loops, and low-level mouse and keyboard input events.
- Graphics Infrastructure (Layer 2): Manages physical GPU adapter querying, logical device creation, graphics command queues, and swapchain surface configuration.
- Compositor, Text & GPGPU Rasterizer (Layer 3): Compiles high-level drawing primitives into optimized GPU-bound vertex and index buffers. Performs TrueType Font (TTF) line layout, glyph metrics extraction, and text line wrapping. Hosts the compute-bound vector path rasterization engine which performs analytical winding-number raycasting inside custom WGSL shaders at 4x SSAA, completely avoiding CPU segment flattening.
- Scene Graph & Effects Layer (Layer 4): Establishes a hierarchical tree of composition visuals (`ContainerVisual`, `DrawingVisual`). Features the decoupled `ILayoutNode` interface to allow visual tree operations to invoke layout renegotiations without introducing circular project dependencies. Drives a multi-pass offscreen composition effects pipeline that schedules horizontal/vertical Gaussian blur compute shaders to render real-time drop shadows, Gaussian blurs, and neon glows directly on layout elements.
- WinUI Framework Layer (Layer 5): Implements the sizing negotiation lifecycle (`Measure` and `Arrange`) compatible with standard XAML layouts. Handles layout constraints, paddings, margins, alignment calculations, and provides standard UI controls.
- Application Layer (Layer 6): The end-user presentation layer, hosting control gallery panels, real-time performance diagnostics overlays, and benchmark test suites.

Our work introduces eleven core rendering and performance optimization pillars that collectively transform frame times, CPU allocation metrics, visual fidelity, and event dispatcher throughput.
Traditional layout systems recursively traverse the entire scene graph every frame to negotiate sizing, so the per-frame layout cost grows with the size of the tree even when nothing has changed.
ProGPU introduces a cached sizing negotiation model that short-circuits measurements using layout dirty flags and cached input boundaries:

```mermaid
flowchart TD
    Start["Measure Pass availableSize"] --> Cached{"_isMeasureValid and availableSize == _previousAvailableSize?"}
    Cached -- Yes --> O1Exit["O1 Early Exit - Return Cached DesiredSize"]
    Cached -- No --> Calc["Calculate Margin Insets & Bounds Constraints"]
    Calc --> Override["Execute MeasureOverride child passes recursively"]
    Override --> CacheResult["Store DesiredSize, _previousAvailableSize & set _isMeasureValid = true"]
    CacheResult --> ArrangeStart["Arrange Pass finalRect"]
    ArrangeStart --> CachedArr{"_isArrangeValid and _isMeasureValid and finalRect == _previousFinalRect?"}
    CachedArr -- Yes --> O1ExitArr["O1 Early Exit - Return Immediately"]
    CachedArr -- No --> Align["Calculate Offset Coordinates & Horizontal/Vertical Alignments"]
    Align --> OverrideArr["Execute ArrangeOverride child placements recursively"]
    OverrideArr --> CacheResultArr["Store Offset/Size, _previousFinalRect & set _isArrangeValid = true"]
```

- Measure Cache: Inside `LayoutNode.Measure()`, if `_isMeasureValid` is true and the incoming `availableSize` matches `_previousAvailableSize`, the pass returns immediately. `MeasureOverride` and recursive child traversals are fully bypassed in $O(1)$ time.
- Arrange Cache: Inside `LayoutNode.Arrange()`, if `_isArrangeValid` and `_isMeasureValid` are true and the incoming `finalRect` matches `_previousFinalRect`, the pass short-circuits. Children offsets are not recalculated, and recursive child arrangements are bypassed.
- Parent Bubble-Up Invalidation: When layout-affecting properties (such as `Margin`, `Padding`, `WidthConstraint`, `HeightConstraint`, alignments, or child mutations) are changed, they invoke `InvalidateMeasure()` or `InvalidateArrange()`. These clear local flags and bubble up the invalidation recursively to parent nodes, forcing only the dirty subtrees to be re-evaluated during the next frame's deferred layout pass.

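The measure cache and bubble-up invalidation described above can be sketched in a few lines. This is a minimal illustration, not the actual `LayoutNode`: the `Size` record, the `SketchLayoutNode` class, the vertical-stack sizing rule, and the `MeasureOverrideCalls` counter are all hypothetical names introduced here, and the arrange pass is omitted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal size type; record structs provide non-boxing == and !=.
public readonly record struct Size(float Width, float Height);

// Hypothetical, heavily reduced stand-in for LayoutNode (measure pass only).
public class SketchLayoutNode
{
    private bool _isMeasureValid;
    private Size _previousAvailableSize;
    private readonly List<SketchLayoutNode> _children = new();

    public SketchLayoutNode? Parent { get; private set; }
    public Size DesiredSize { get; private set; }
    public int MeasureOverrideCalls { get; private set; } // instrumentation only

    public void AddChild(SketchLayoutNode child)
    {
        child.Parent = this;
        _children.Add(child);
        InvalidateMeasure(); // structural change marks this path dirty
    }

    public Size Measure(Size availableSize)
    {
        // O(1) early exit: flag is valid and the constraint is unchanged.
        if (_isMeasureValid && availableSize == _previousAvailableSize)
            return DesiredSize;

        MeasureOverrideCalls++;
        DesiredSize = MeasureOverride(availableSize);
        _previousAvailableSize = availableSize;
        _isMeasureValid = true;
        return DesiredSize;
    }

    // Stacks children vertically; chosen purely for illustration.
    protected virtual Size MeasureOverride(Size availableSize)
    {
        float width = 0f, height = 0f;
        foreach (var child in _children)
        {
            Size s = child.Measure(availableSize);
            width = Math.Max(width, s.Width);
            height += s.Height;
        }
        return new Size(width, height);
    }

    public void InvalidateMeasure()
    {
        // Clear the local flag and bubble up so only the dirty path is re-measured.
        for (var node = this; node != null; node = node.Parent)
            node._isMeasureValid = false;
    }
}
```

Measuring the root twice with the same `availableSize` runs `MeasureOverride` once; invalidating any descendant makes the next pass re-run it along the path from that descendant to the root, while untouched siblings still hit their caches.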
To prevent circular dependencies between the ProGPU.Scene assembly (base visual layer) and the ProGPU.Layout assembly (WinUI framework layer), the ILayoutNode interface is defined in ProGPU.Scene:

```csharp
public interface ILayoutNode
{
    void InvalidateMeasure();
}
```

Visual tree mutation methods (`ContainerVisual.AddChild`, `RemoveChild`, `ClearChildren`) check if `this` implements `ILayoutNode`. If so, they invoke `InvalidateMeasure()`, ensuring that any changes in visual tree structure automatically mark the layout path dirty without explicit parent layout references.

Layout caching relies heavily on comparing boundary structs (Thickness and Rect) on every node. Without a custom implementation, struct comparison falls back to `ValueType.Equals(object)`, which boxes its argument and may use reflection to compare fields, adding allocations and CPU cost on a very hot path.
To eliminate this bottleneck, we implemented type-safe, non-boxing, custom equality overloads for both structs:
- `Thickness` (Margins and Paddings)
- `Rect` (Layout arrangements and clipping boundaries)

Each struct now provides a strongly typed `Equals(Thickness other)` / `Equals(Rect other)`, overrides `Equals(object? obj)` and `GetHashCode()`, and defines the `==` and `!=` operators:

```csharp
public bool Equals(Rect other)
{
    return X == other.X && Y == other.Y && Width == other.Width && Height == other.Height;
}

public static bool operator ==(Rect left, Rect right)
{
    return left.Equals(right);
}

public static bool operator !=(Rect left, Rect right)
{
    return !left.Equals(right);
}
```

These overloads compile down to direct float comparison instructions, giving allocation-free boundary checks.

To allow graphics and layout benchmarks to be evaluated at their true physical limit, we disabled vertical synchronization (VSync) throttling across all layers of the GPU pipeline:
- Windowing Layer: Window options in the main, developer tools, and dynamic window controllers explicitly configure VSync to be disabled:

  ```csharp
  options.VSync = false;
  ```

- WebGPU Swapchain: Inside `WgpuContext.ConfigureSwapChain()`, the surface capabilities of the GPU adapter are queried. If `PresentMode.Immediate` is supported, the swapchain present configuration bypasses synchronization lockups:

  ```csharp
  PresentMode presentMode = PresentMode.Fifo; // Fallback VSync
  for (uint i = 0; i < capabilities.PresentModeCount; i++)
  {
      if (capabilities.PresentModes[i] == PresentMode.Immediate)
      {
          presentMode = PresentMode.Immediate; // VSync Off
          break;
      }
  }
  ```

This enables the graphics swapchain to present frames as quickly as the GPU queue is filled, releasing the 60 FPS constraint and allowing framerates to soar into the hundreds or thousands of FPS.
The LOL/s benchmark stresses the visual framework by constantly removing and adding hundreds of poolable text controls to a canvas using a background thread loop.
- The Livelock Risk: If a background thread pushes UI events (like `AddChild` or property changes) to the main thread's dispatcher loop as fast as possible without throttling, it quickly overflows the main thread's event queue. The main thread then spends entire frame cycles acquiring queue locks to process actions, and the resulting lock contention starves the UI thread and freezes the application.
- The Backpressure Solution: We introduced a thread-safe `PendingCount` property to the main `UIThread` queue. The background benchmark thread loops continuously without fixed sleep periods, but monitors queue occupancy:

  ```mermaid
  flowchart TD
      Start["Background Task Loop"] --> CheckBackpressure{"UIThread.PendingCount > 100?"}
      CheckBackpressure -- Yes --> Sleep["Thread.Sleep 1ms / Release Monitor Locks"]
      Sleep --> Start
      CheckBackpressure -- No --> Post["Post Action immediately / No Sleep"]
      Post --> UIThread["UIThread.RunPending - Main Thread drains queue"]
      UIThread --> AddChild["AddChild/RemoveChild visual tree mutation"]
  ```

  - Backpressure Active (>100): The background thread calls `Thread.Sleep(1)`. While it sleeps it holds no queue monitor lock and gives up its CPU slice, so the main UI thread can drain the event queue without contention. The application stays responsive and the livelock is avoided.
  - Backpressure Inactive (<=100): The background thread runs with zero sleep, dispatching new visual mutations to the UI thread continuously to maximize throughput.

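The producer side of this scheme can be sketched as follows. The sketch is an assumption-laden stand-in, not the actual `UIThread` implementation: `SketchUiQueue` and `BackpressureProducer` are hypothetical names, and a `ConcurrentQueue<Action>` replaces the monitor-locked queue the text describes. Only the `PendingCount` check and the 100-item threshold come from the description above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for the UIThread queue.
public sealed class SketchUiQueue
{
    private readonly ConcurrentQueue<Action> _queue = new();

    // Thread-safe snapshot of queue occupancy.
    public int PendingCount => _queue.Count;

    public void Post(Action action) => _queue.Enqueue(action);

    // Called on the UI thread once per frame; returns the number of actions run.
    public int RunPending()
    {
        int ran = 0;
        while (_queue.TryDequeue(out var action))
        {
            action();
            ran++;
        }
        return ran;
    }
}

public static class BackpressureProducer
{
    public const int HighWaterMark = 100;

    // Posts `total` actions, backing off whenever the consumer falls behind.
    public static void Produce(SketchUiQueue ui, int total, Action work)
    {
        int posted = 0;
        while (posted < total)
        {
            if (ui.PendingCount > HighWaterMark)
            {
                Thread.Sleep(1); // yield so the UI thread can drain the queue
                continue;
            }
            ui.Post(work);       // below the threshold: post with no delay
            posted++;
        }
    }
}
```

With a single producer, the queue never holds more than `HighWaterMark + 1` items, because a new action is posted only after observing an occupancy of at most `HighWaterMark`.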
In real-time GPU-based vector rendering, compiling high-level primitives (such as Rectangles, Ellipses, Rounded Rectangles, Paths, Lines, and Bezier curves) into dynamic vertex and index buffers is a major CPU bottleneck. Standard implementation using sequential .Add(...) calls on List<T> invokes continuous bounds checks, potential array resizing/reallocations, and element copying overhead.
To maximize throughput, the Compositor is optimized using high-performance Span<T> memory writes:
- Pre-Allocation Throttling: Instead of building meshes incrementally, the compositor determines the exact number of vertices and indices required for a primitive beforehand.
- Backing Buffer SetCount: The internal list count is directly resized using `CollectionsMarshal.SetCount(list, newCount)` to avoid iterative dynamic reallocation/growth logic inside .NET's `List<T>`.
- Direct Memory Access: The internal backing array is extracted as a type-safe memory slice via `CollectionsMarshal.AsSpan(list).Slice(offset, count)`.
- Fast Assembly Assignment: Vertices and indices are written directly to indices in the returned `Span<T>` or pre-filled using `Span.Fill(defaultValue)` for uniform values.
- Bulk Memory Clipping: Clamping vector coordinates to active clipping boundaries is performed in a single linear pass over the direct `Span<VectorVertex>` reference, bypassing indexed list getters.

```csharp
int originalVertexCount = _vectorVerticesList.Count;
int vertexToAdd = 2 * (N + 1);
CollectionsMarshal.SetCount(_vectorVerticesList, originalVertexCount + vertexToAdd);
var vertexSpan = CollectionsMarshal.AsSpan(_vectorVerticesList).Slice(originalVertexCount, vertexToAdd);
vertexSpan.Fill(baseVertex);
```

This ensures that the mesh compiler achieves zero-allocation dynamic buffer construction, minimal instruction-level overhead, and runs at near-native C-speed.

In traditional UI and vector engines, every active visual element in an animation loop is modeled as a heap-allocated class object. During high-count stress tests (such as the MotionMark benchmark rendering thousands of dynamically moving curves), these allocations put immense pressure on the .NET Garbage Collector (GC), leading to periodic micro-stutters and frame drops.
ProGPU eliminates this overhead using lightweight structs and batched pipeline groupings:
- Value-Type Elements: Animated shapes are modeled using compact `Element` and `GridPoint` value-type structs instead of class objects, so each shape is stored inline rather than as a separate heap object:

  ```csharp
  public struct Element
  {
      public SegmentKind Kind;
      public GridPoint Start;
      public GridPoint Control1;
      public GridPoint Control2;
      public GridPoint End;
      public Vector4 Color;
      public float Width;
      public bool Split;
      public SolidColorBrush CachedBrush;
      public Pen CachedPen;
  }
  ```

- Zero-Allocation Layout Mapping: Grid points (e.g. 80x40 logical coordinate system) are converted to physical display boundaries in a single algebraic transform pass during rendering, avoiding intermediate object creations.
- Cohesive Path Batching: Rendering runs in two optimized modes:
  - Direct GPU Shader Pipeline: Iterates through elements, identifying contiguous segments sharing visual style traits (pens/brushes). It batches drawing commands directly to the GPU using direct primitive rendering APIs, reducing draw call state swaps.
  - Path Compute-Rasterizer Mode: Batches continuous curves into a single, combined `PathGeometry` figure until a logical "Split" flag is encountered. This group is drawn in one composite rasterization pass, optimizing path cache locality in the underlying compute pipelines.

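The "batch until Split" grouping can be illustrated with a small range-splitting helper. This is only a sketch: `SketchElement` and `PathBatcher` are hypothetical names, and the choice to include the element carrying the `Split` flag in the batch it terminates is an assumption, since the text does not say on which side of the boundary that element falls.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reduced element: only the field needed for grouping.
public struct SketchElement
{
    public bool Split; // marks the end of the current combined figure
}

public static class PathBatcher
{
    // Returns (Start, Count) ranges; each range would become one combined figure.
    public static List<(int Start, int Count)> BatchBySplit(ReadOnlySpan<SketchElement> elements)
    {
        var batches = new List<(int Start, int Count)>();
        int start = 0;
        for (int i = 0; i < elements.Length; i++)
        {
            bool isLast = i == elements.Length - 1;
            if (elements[i].Split || isLast)
            {
                batches.Add((start, i - start + 1));
                start = i + 1;
            }
        }
        return batches;
    }
}
```

For five elements with `Split` set only on the third, the helper yields the ranges `(0, 3)` and `(3, 2)`, so two rasterization passes are issued instead of five.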
Standard graphics engines struggle to apply dynamic blurred effects (such as Gaussian backdrop blurs, soft ambient drop shadows, and neon glowing halos) to standard layout elements in real-time due to high composition and memory transfer overhead. ProGPU overcomes this with a multi-pass offscreen composition and compute processing system.

```mermaid
graph TD
    Subtree["Subtree Render Pass"] -->|Draw Elements 1x MSAA| Src["Source Offscreen Texture"]
    Src -->|Horiz. Dispatch| HCompute["Gaussian Blur Compute Shader Pass 1"]
    HCompute -->|Vert. Dispatch| VCompute["Gaussian Blur Compute Shader Pass 2"]
    VCompute -->|Output Framebuffer| Dest["Destination Blurs/Shadows Texture"]
    Dest -->|Matrix Align and Z-Order Bind| Framebuffer["Primary Swapchain Framebuffer"]
```

- Dynamic Texture Caching: Textures (`Source`, `Temp`, and `Destination` buffers) are cached per-element in a specialized dictionary (`_effectTextures`). They are dynamically resized only when the element's actual visual bounds mutate, eliminating frame-by-frame allocation/deallocation thrashing.
- Offscreen Redirection: Standard scene-graph rendering in ProGPU uses 4x MSAA for vector geometry. Since WebGPU compute shaders cannot directly read or sample multisampled textures, ProGPU compiles a specialized 1x MSAA offscreen rendering pipeline (`_vectorPipelineOffscreen`, `_textPipelineOffscreen`, `_texturePipelineOffscreen`). When an element has an active effect:
  - The compositor preserves the active vector batch state and clips.
  - It redirects all rendering of the element and its entire visual child subtree into the 1x MSAA offscreen `Source` texture using an isolated orthographic projection matrix.
  - It restores the main batch state after capture.
- Two-Pass Compute Acceleration: The compute pass binds the `Source` texture and executes a horizontal-pass WGSL compute shader, writing intermediate results to the `Temp` texture. It then binds the `Temp` texture to execute a vertical-pass compute shader, outputting the final blurred mask to the `Destination` texture.
- High-Performance Compositing: The final blurred texture is drawn back onto the main screen swapchain as a textured quad. For drop shadows, the texture is drawn with configurable offsets, blending colors, and alpha multipliers, and the original `Source` texture is composited cleanly on top, maintaining crisp bounds.

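The two-pass structure relies on the Gaussian kernel being separable: a 2D blur equals a horizontal 1D pass followed by a vertical 1D pass. The following CPU reference shows that data flow (Source to Temp to Destination) on a single-channel float image. It is a sketch for reasoning about the passes, not the WGSL shader itself; the `SeparableBlur` class, the clamp-to-edge sampling, and the `radius`/`sigma` parameters are assumptions made for the example.

```csharp
using System;

public static class SeparableBlur
{
    // Normalized 1D Gaussian weights for offsets -radius..radius.
    public static float[] Kernel(int radius, float sigma)
    {
        var k = new float[2 * radius + 1];
        float sum = 0f;
        for (int i = -radius; i <= radius; i++)
        {
            k[i + radius] = MathF.Exp(-(i * i) / (2f * sigma * sigma));
            sum += k[i + radius];
        }
        for (int i = 0; i < k.Length; i++) k[i] /= sum;
        return k;
    }

    // One 1D pass: horizontal when (dx, dy) = (1, 0), vertical when (0, 1).
    public static float[] Pass(float[] src, int width, int height, float[] kernel, int dx, int dy)
    {
        int radius = kernel.Length / 2;
        var dst = new float[src.Length];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                float acc = 0f;
                for (int i = -radius; i <= radius; i++)
                {
                    int sx = Math.Clamp(x + i * dx, 0, width - 1);  // clamp-to-edge (assumed)
                    int sy = Math.Clamp(y + i * dy, 0, height - 1);
                    acc += src[sy * width + sx] * kernel[i + radius];
                }
                dst[y * width + x] = acc;
            }
        }
        return dst;
    }

    public static float[] Blur(float[] source, int width, int height, int radius, float sigma)
    {
        float[] kernel = Kernel(radius, sigma);
        float[] temp = Pass(source, width, height, kernel, 1, 0); // Source -> Temp
        return Pass(temp, width, height, kernel, 0, 1);           // Temp -> Destination
    }
}
```

Each pixel costs `2 * (2 * radius + 1)` taps instead of `(2 * radius + 1)^2` for a direct 2D convolution, which is why the blur is split into two dispatches.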
To bypass CPU bottlenecks (e.g. flattening Bezier curves into thousands of lines and performing heavy triangulation), ProGPU integrates a pure GPU-bound vector path rasterizer. The engine computes vector fills analytically directly inside custom WebGPU WGSL compute shaders.
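To satisfy WebGPU/WGSL uniform and storage buffer layout requirements, the GPU-facing data is organized into sequentially laid-out structs whose sizes are multiples of 16 bytes (32, 32, and 48 bytes respectively). Note that `Pack = 16` only caps field alignment; the 16-byte-multiple sizes come from the field order and the explicit `Pad` fields: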

```csharp
using System.Numerics;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 16)]
public struct PathUniforms
{
    public float XStart; public float YStart;
    public float Scale; public uint PathIndex;
    public uint AtlasX; public uint AtlasY;
    public uint Width; public uint Height;
}

[StructLayout(LayoutKind.Sequential, Pack = 16)]
public struct GpuPathRecord
{
    public uint StartSegment; public uint SegmentCount;
    public float MinX; public float MinY;
    public float MaxX; public float MaxY;
    public uint Pad0; public uint Pad1;
}

[StructLayout(LayoutKind.Sequential, Pack = 16)]
public struct GpuPathSegment
{
    public Vector2 P0; public Vector2 P1;
    public Vector2 P2; public Vector2 P3;
    public uint SegmentType; public uint Pad0;
    public uint Pad1; public uint Pad2;
}
```

The rasterizer counts curve intersections analytically using a horizontal ray casting winding-number algorithm directly in WGSL:

- Line Intersection: Evaluates linear roots analytically:

  $$t = \frac{p_y - A_y}{B_y - A_y}$$

- Quadratic Bezier Intersection: Solves the quadratic equation $(1-t)^2 A_y + 2(1-t)t B_y + t^2 C_y - p_y = 0$ for $t \in [0, 1]$. Valid intersections are checked against the ray, and winding adjustments are updated based on the tangent derivative:

  $$P'_y(t) = 2(1-t)(B_y - A_y) + 2t(C_y - B_y)$$

- Cubic Bezier Intersection: Expands the cubic Bezier equation into $a t^3 + b t^2 + c t + d = 0$. The compute shader executes Cardano's formula (`solve_cubic` helper in WGSL) to find up to 3 real roots, updating the winding number according to the cubic tangent derivative:

  $$P'_y(t) = 3 a t^2 + 2 b t + c$$

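The line and quadratic cases can be checked on the CPU with a short reference implementation. This is a sketch rather than a port of the WGSL: the `WindingSketch` class is hypothetical, the ray is assumed to run toward $+x$ from the sample point, upward crossings are assumed to count $+1$ and downward crossings $-1$, and roots are accepted on half-open intervals chosen by the sign of the vertical derivative so that a vertex shared by two segments is counted once.

```csharp
using System;
using System.Numerics;

public static class WindingSketch
{
    // Upward crossings accept t in [0, 1), downward crossings accept t in (0, 1].
    private static bool InRange(float t, float derivY)
    {
        if (derivY > 0f) return t >= 0f && t < 1f;
        if (derivY < 0f) return t > 0f && t <= 1f;
        return false; // horizontal tangent: no crossing
    }

    public static int Line(Vector2 p, Vector2 a, Vector2 b)
    {
        float dy = b.Y - a.Y;
        if (dy == 0f) return 0;                    // horizontal edge never crosses the ray
        float t = (p.Y - a.Y) / dy;                // t = (p_y - A_y) / (B_y - A_y)
        if (!InRange(t, dy)) return 0;
        float x = a.X + t * (b.X - a.X);
        return x > p.X ? Math.Sign(dy) : 0;
    }

    public static int Quadratic(Vector2 p, Vector2 a, Vector2 b, Vector2 c)
    {
        // (1-t)^2 A_y + 2(1-t)t B_y + t^2 C_y - p_y = qa t^2 + qb t + qc
        float qa = a.Y - 2f * b.Y + c.Y;
        float qb = 2f * (b.Y - a.Y);
        float qc = a.Y - p.Y;

        Span<float> roots = stackalloc float[2];
        int n = 0;
        if (MathF.Abs(qa) < 1e-6f)
        {
            if (qb != 0f) roots[n++] = -qc / qb;   // degenerate case: linear in t
        }
        else
        {
            float disc = qb * qb - 4f * qa * qc;
            if (disc > 0f)                         // a tangent touch contributes nothing
            {
                float s = MathF.Sqrt(disc);
                roots[n++] = (-qb - s) / (2f * qa);
                roots[n++] = (-qb + s) / (2f * qa);
            }
        }

        int winding = 0;
        for (int i = 0; i < n; i++)
        {
            float t = roots[i];
            float derivY = 2f * (1f - t) * (b.Y - a.Y) + 2f * t * (c.Y - b.Y);
            if (!InRange(t, derivY)) continue;
            float u = 1f - t;
            float x = u * u * a.X + 2f * u * t * b.X + t * t * c.X;
            if (x > p.X) winding += Math.Sign(derivY);
        }
        return winding;
    }
}
```

Summing the per-segment results over a closed contour gives the winding number of the sample point; a non-zero total means the point is inside under the non-zero fill rule.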
- CPU Path Cache (`_pathGeometryCache`): Compiled segment arrays and pre-calculated local bounds are cached for each unique `PathGeometry`. Dynamic frames skip CPU figures traversal, and copy segment spans directly, reducing CPU path compilation times to 0.30ms for 100,000 shapes.
- Pixel-Level Bounding Box Shader Skip: To eliminate GPU rasterization bottlenecks, the fine-rasterization pixel loop performs a screen-space bounding box check:

  ```wgsl
  if (px < inst.screenMinX || px > inst.screenMaxX || py < inst.screenMinY || py > inst.screenMaxY) {
      continue;
  }
  ```

  Pixels outside the shape boundaries immediately bypass local coordinate transforms, 4-sample subpixel loops, and expensive winding calculations. This discards ~95% of active operations per pixel, resulting in a 15x rendering speedup.
- 4x SSAA Quality Correctness: Replaced screen coordinates with transformed local coordinates in the Sample 2 containment checks of the `PathRasterizerShader`. This ensures that under high multisampling/supersampling, anti-aliased edge pixels align perfectly, delivering sharp, hardware-accurate vector strokes and fills.

Standard Signed Distance Field (SDF) rendering often clips the outer half of strokes or the edges of anti-aliasing gradients because the generated quad boundaries are drawn exactly at the shape's mathematical dimensions. This limits pixel operations outside the bounding box, resulting in a rough, aliased border.
To achieve state-of-the-art vector quality with zero performance degradation, we implemented a dual-stage quad inflation and pixel-distance anti-aliasing framework:
- Separated-Pass Quad Expansion: During shape compilation in `Compositor.cs`, drawing of Rectangles, Ellipses, and Rounded Rectangles is divided into independent Brush (fill) and Pen (stroke) passes.
  - Fill Pass (Brush): Inflates bounding quad vertices and `texCoord` offset variables outwards by a padding of `1.5` pixels.
  - Stroke Pass (Pen): Inflates bounding quad vertices and `texCoord` offsets by `thickness / 2.0 + 1.5` pixels. This expansion guarantees that the outer half of a stroke of width $T$, as well as its smooth anti-aliasing gradient, are fully rendered without quad boundary clipping.
- Pixel-Distance WGSL Stroke Anti-Aliasing: For GPU-expanded Lines, Quadratic Beziers, Cubic Beziers, and elliptical Arcs, the vertex shader computes the exact signed pixel distance from the center spline to the expanded vertex boundaries, passing it to the fragment shader via `gridIndex`. The fragment shader evaluates anti-aliasing dynamically using:

  ```wgsl
  let d_pixels = abs(input.gridIndex);
  let d_shape = d_pixels - input.strokeThickness * 0.5;
  shapeAlpha = 1.0 - smoothstep(-0.5, 0.5, d_shape);
  ```

  This calculates a crisp, subpixel-accurate smoothstep edge transition directly in screen-space pixel coordinates, eliminating aliased jagged edges on all lines and splines.

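To see why the stroke padding of `thickness / 2.0 + 1.5` is sufficient, the coverage formula can be evaluated on the CPU. The sketch below mirrors the three WGSL lines in C#; the `StrokeCoverage` class and its method names are hypothetical, and `Smoothstep` follows the standard definition of the shader built-in.

```csharp
using System;

public static class StrokeCoverage
{
    // Standard smoothstep: cubic Hermite interpolation clamped to [0, 1].
    public static float Smoothstep(float edge0, float edge1, float x)
    {
        float t = Math.Clamp((x - edge0) / (edge1 - edge0), 0f, 1f);
        return t * t * (3f - 2f * t);
    }

    // distancePixels: signed pixel distance from the stroke centre line.
    public static float Alpha(float distancePixels, float strokeThickness)
    {
        float dShape = MathF.Abs(distancePixels) - strokeThickness * 0.5f;
        return 1f - Smoothstep(-0.5f, 0.5f, dShape);
    }

    // Quad inflation used by the stroke pass, in pixels.
    public static float StrokePadding(float thickness) => thickness / 2f + 1.5f;
}
```

For a 4-pixel stroke, coverage is 1 at the centre line, exactly 0.5 at the geometric edge (2 pixels out), and 0 from 2.5 pixels outward. The inflated quad extends 3.5 pixels from the centre line, so the whole one-pixel transition band lies inside the quad with a pixel to spare.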
ProGPU implements a lightweight, high-performance, and memory-safe theming, styling, and templating engine designed to emulate the logical capabilities of WinUI 3 but operating with minimal CPU and memory overhead.

```mermaid
flowchart TD
    Reg["DependencyProperty.Register"] -->|Sequential Indexing| DP["Index-Based Property Mapped Arrays"]
    DP -->|Precedence Resolution| GetVal["O1 GetValue Precedence Sweep"]
    Theme["ThemeManager.ThemeChanged"] -->|Lazy Invalidation| Dirty["Set IsThemeDirty = true"]
    Dirty -->|On-Demand Query| GetVal
    subgraph Storage ["O(1) Parallel Contiguous Value and Theme Arrays"]
        Local["_localValues"]
        Style["_styleValues"]
        DStyle["_defaultStyleValues"]
        LocalTheme["_localThemeResources"]
        StyleTheme["_styleThemeResources"]
        DStyleTheme["_defaultStyleThemeResources"]
    end
```

Traditional XAML frameworks store DependencyObject property values in heavy dictionaries (Dictionary<DependencyProperty, object>), which trigger expensive hash calculation, collisions, and lookup overhead inside tight render or layout loops.
ProGPU bypasses dictionaries entirely by introducing sequential indexing:
- Sequential Indexing: Every registered `DependencyProperty` is assigned a unique, sequential, zero-based `Index` from a thread-safe static list during bootstrap.
- Direct Array Access: `DependencyObject` stores properties in a set of parallel contiguous flat arrays (`_localValues`, `_styleValues`, `_defaultStyleValues`, `_effectiveValues`, and `_valueSources`) matching the index sizes.
- Precedence Resolution: Property value resolution (`GetValue(dp)`) is simplified to direct index checks on these arrays in $O(1)$ time, resolving values via native priority precedence:

  $$\text{Local} \succ \text{Style} \succ \text{Default Style} \succ \text{Inherited} \succ \text{Default}$$

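A reduced version of the index-based storage shows how the precedence sweep becomes a handful of array reads. Everything here is a hypothetical simplification: `SketchProperty` and `SketchObject` stand in for `DependencyProperty` and `DependencyObject`, only the Local, Style, and Default tiers are modeled, and a sentinel object marks unset slots (the real implementation tracks this in `_valueSources`, whose exact encoding the text does not describe).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reduced DependencyProperty: a sequential index plus a default.
public sealed class SketchProperty
{
    private static readonly object Gate = new();
    private static readonly List<SketchProperty> Registry = new();

    public int Index { get; }
    public object? DefaultValue { get; }

    private SketchProperty(int index, object? defaultValue)
    {
        Index = index;
        DefaultValue = defaultValue;
    }

    public static int Count { get { lock (Gate) return Registry.Count; } }

    public static SketchProperty Register(object? defaultValue)
    {
        lock (Gate) // thread-safe sequential index assignment
        {
            var dp = new SketchProperty(Registry.Count, defaultValue);
            Registry.Add(dp);
            return dp;
        }
    }
}

// Hypothetical reduced DependencyObject with two of the parallel arrays.
public class SketchObject
{
    private static readonly object Unset = new(); // sentinel for "no value stored"
    private object[] _localValues = Array.Empty<object>();
    private object[] _styleValues = Array.Empty<object>();

    private static void Store(ref object[] slots, int index, object? value)
    {
        if (index >= slots.Length)
        {
            int oldLength = slots.Length;
            Array.Resize(ref slots, SketchProperty.Count);
            Array.Fill(slots, Unset, oldLength, slots.Length - oldLength);
        }
        slots[index] = value!;
    }

    private static bool TryRead(object[] slots, int index, out object? value)
    {
        value = index < slots.Length ? slots[index] : Unset;
        return !ReferenceEquals(value, Unset);
    }

    public void SetLocal(SketchProperty dp, object? value) => Store(ref _localValues, dp.Index, value);
    public void SetStyle(SketchProperty dp, object? value) => Store(ref _styleValues, dp.Index, value);

    public void ClearLocal(SketchProperty dp)
    {
        if (dp.Index < _localValues.Length) _localValues[dp.Index] = Unset;
    }

    // Local > Style > Default, each tier resolved by a direct index check.
    public object? GetValue(SketchProperty dp)
    {
        if (TryRead(_localValues, dp.Index, out var v)) return v;
        if (TryRead(_styleValues, dp.Index, out v)) return v;
        return dp.DefaultValue;
    }
}
```

Setting a style value and then a local value returns the local value; clearing the local value falls back to the style value without any hashing or dictionary lookup.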
Eagerly traversing and updating dynamic brushes across the entire visual tree on every theme change triggers substantial CPU frame stutters. ProGPU bypasses this via a lazy evaluation pipeline:
- Visual Tree Invalidation: When a theme toggle is triggered, `ThemeManager.ThemeChanged` fires. The system recursively propagates a cheap `IsThemeDirty = true` flag down the scene graph (`NotifyThemeChanged`), avoiding immediate value updates.
- Parallel Flat Theme Mappings: Dynamic references are stored in parallel arrays (`_localThemeResources`, `_styleThemeResources`, and `_defaultStyleThemeResources`). During subsequent property reads (`GetValue(dp)`), if the dirty flag is set, the system sweeps these parallel arrays, re-evaluates active key lookups against the theme palette, and rebuilds only the affected elements' effective values in a single sequential linear pass.

To support lightweight control customization without the heavy reflection, expression compilation, or string-matching of traditional bindings:
- Index-Based Callbacks: `DependencyObject` maintains an index-sequential list of callbacks registered via `RegisterPropertyChangedCallback(dp, callback)`.
- WinUI-Compliant Tokens: Registration returns a unique `long` token, allowing surgical unregistration via `UnregisterPropertyChangedCallback(dp, token)`.
- Weak, Self-Cleaning Template Binding: `TemplateBinding` coordinates bindings between controls and template roots using weak references (`WeakReference<DependencyObject>`). On every callback trigger, if it detects that the target control has been garbage-collected, the binding automatically unregisters itself from the source object, completely preventing memory leaks.

To support robust diagnostic capabilities:
- Multi-Window Visual Inspector: The `DevTools` visual tree population (`RefreshVisualTree`) now traverses all active windows registered in `WindowManager.ActiveWindows` (filtering out the inspector itself) and falls back to the thread-static `InputSystem.Root` for raw Silk.NET window bindings.
- Popup & Dialog Hierarchies: Merges active floating popups and dialogs from `PopupService.ActivePopups` as a dedicated branch in the visual tree, making overlay dialogs fully inspectable.
- Global Invalidation Hub: Replaced thread-local repaints with a public `InvalidateAllMainWindows()` hub in `DevToolsService`, ensuring hover overlays, inspection borders, and property changes instantly refresh across all active window compositors.

Traditional GPU engines suffer from low-resolution stretch blurriness on macOS high-DPI (Retina) screens because they configure the SwapChain to match logical coordinates, letting the operating system scale the output. ProGPU achieves true macOS Retina rendering quality while maintaining high performance through the following techniques:
- Physical-Pixel Backing Store SwapChain: The WebGPU swapchain and render pipelines are driven directly by the window's physical `FramebufferSize` instead of logical size (e.g. `2560x1600` instead of `1280x800`). This aligns all vector and rasterization outputs exactly 1:1 with hardware pixels, eliminating OS-level linear stretching blur.
- DPI-Aware Physical Glyph Caching: Computes the high-DPI scaling factor dynamically (`dpiScale = FramebufferSize.X / Size.X`) and pre-rasterizes glyphs in the `GlyphAtlas` at their actual physical pixel font size (`cmd.FontSize * dpiScale`), ensuring that the atlas contains the high-resolution 2x textures.
- 4x Physical Subpixel Snapping: Scales the screen-transformed baseline cursor position to physical device pixels (`transPos * dpiScale`) and snaps the horizontal coordinate to the nearest 1/4 physical pixel, eliminating subpixel blur on the screen.
- Retina Snap-Back Logical Mapping: Snapped physical coordinates of the drawing quad are divided by `dpiScale` before writing them to the vertex buffer, mapping them back to logical space for the compositor's orthographic projection matrix. The GPU hardware then renders the logical quad exactly 1-to-1 with screen physical pixels.
- Direction-Aware Winding Curve Crossing Corrections: Replaced the static, direction-agnostic interval checks in both the quadratic and cubic Bezier crossing solvers with direction-aware half-open winding intervals based on the vertical derivative sign (`deriv_y`):
  - Upward Crossing (`deriv_y > 0.0`): Valid range is `[0.0, 1.0)` (inclusive of start, exclusive of end).
  - Downward Crossing (`deriv_y < 0.0`): Valid range is `(0.0, 1.0]` (exclusive of start, inclusive of end).

  This eliminates boundary vertex double-counting and zero-counting across all transition types (line-to-curve, curve-to-line, curve-to-curve) in both `GlyphRasterizer` and `PathRasterizer` shaders, completely preventing horizontal seam and drop-out artifacts at curve joins (such as on letters like `G`/`g`).

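The snapping and snap-back steps for the horizontal glyph coordinate reduce to a few lines of arithmetic, sketched below. The `RetinaSnap` class and its method names are hypothetical; only the `dpiScale` ratio, the quarter-pixel grid, and the final division by `dpiScale` are taken from the description above.

```csharp
using System;

public static class RetinaSnap
{
    // dpiScale = FramebufferSize.X / Size.X, e.g. 2560 / 1280 = 2.
    public static float DpiScale(int framebufferWidth, int logicalWidth) =>
        (float)framebufferWidth / logicalWidth;

    // Logical x -> physical pixels -> nearest quarter pixel -> back to logical units.
    public static float SnapX(float logicalX, float dpiScale)
    {
        float physical = logicalX * dpiScale;
        float snapped = MathF.Round(physical * 4f) / 4f;
        return snapped / dpiScale;
    }
}
```

With `dpiScale = 2`, a logical x of 10.3 maps to 20.6 physical pixels, snaps to 20.5, and is written to the vertex buffer as 10.25 logical units, which the projection then places exactly on a quarter-pixel boundary of the physical framebuffer.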
In high-performance GPU-bound UI frameworks, recursively traversing large, static visual subtrees (such as complex sidebar menus, navigation drawers, and presentation panels) every frame at double physical coordinates (FramebufferSize) on macOS Retina screens incurs heavy CPU-to-GPU overhead (layout traversal, vertex mesh generation, matrix multiplications, draw call issuance, and constant buffer uploads).
ProGPU introduces Layered High-DPI Visual Caching (CacheAsLayer) to completely eliminate redundant rendering loops for static or rarely modified subtrees:

```mermaid
flowchart TD
    Compile["CompileVisualTree node"] --> CacheChecked{"node.CacheAsLayer and Compositor.IsCacheAsLayerEnabled?"}
    CacheChecked -- No --> NormalPass["Standard Pass: Recurse Visual Subtree and Compile Primitives"]
    CacheChecked -- Yes --> DirtyCheck{"node.IsDirty or node.LayerTexture == null?"}
    DirtyCheck -- Yes --> RenderOff["Execute RenderOffscreen centered in node.LayerTexture"]
    RenderOff --> MarkClean["Set node.IsDirty = false"]
    MarkClean --> DrawTexture["Compile single DrawTexture command onto Swapchain"]
    DirtyCheck -- No --> DrawTexture
```

- Offscreen Physical Buffering: When `CacheAsLayer = true` is set on a static visual (like the `NavigationView`'s sidebar pane), the compositor redirects rendering of the node and its entire subtree into an isolated offscreen texture (`LayerTexture`) allocated at exact physical pixel dimensions: $$w = \text{logicalWidth} \cdot \text{dpiScale}, \quad h = \text{logicalHeight} \cdot \text{dpiScale}$$
- O(1) Render Bypass: On subsequent frames, if `node.IsDirty == false` and the cache is valid, the compositor completely skips visual tree traversal, geometry generation, and command decoding for the entire subtree. Instead, it issues exactly one texture draw call (rendering the pre-compiled `LayerTexture` back onto the swapchain), achieving a 1.77x rendering speedup.
- Razor-Sharp Typography & 1:1 Pixel Alignment: During offscreen rendering, the projection matrix uses logical boundaries, but text glyphs are snapped and rasterized at the physical `dpiScale` inside `CompileTextCommand`. Drawing this cached layer texture back onto the physical swapchain guarantees 1:1 physical pixel alignment and native-sharp typography on macOS Retina displays without bilinear filtering blur.
- Lazy Dirty-State Propagation: When any child element inside the cached subtree changes (e.g. hovered, clicked, or typed into), invalidation sets `IsDirty = true` and bubbles up to the cached parent node. The compositor detects this dirty state on the next frame, re-runs `RenderOffscreen` to update the cache in a single frame, and marks it clean again.
- Global Settings Switch: The caching system can be enabled or disabled globally at runtime:
  - Individual Control: `Visual.CacheAsLayer = true;`
  - Global Override: `Compositor.IsCacheAsLayerEnabled = true / false;` (toggleable via the Application Settings panel).
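The decision flow above can be sketched in a few lines. This is a minimal illustration, not ProGPU's implementation: `LayerNode`, `CompilePath`, and `LayerCacheSketch` are hypothetical stand-ins, the texture is represented by a placeholder object, and rounding the physical size up with `Ceiling` is an assumption.

```csharp
using System;

// Hypothetical stand-in for a visual node; not ProGPU's actual class.
public sealed class LayerNode
{
    public bool CacheAsLayer;
    public bool IsDirty = true;
    public object LayerTexture;            // placeholder for a GPU texture handle
    public float LogicalWidth, LogicalHeight;
}

public enum CompilePath { StandardPass, RenderOffscreenThenDraw, DrawCachedTexture }

public static class LayerCacheSketch
{
    // Physical size = logical size * dpiScale (rounding up is an assumption).
    public static (int Width, int Height) PhysicalSize(LayerNode node, float dpiScale) =>
        ((int)MathF.Ceiling(node.LogicalWidth * dpiScale),
         (int)MathF.Ceiling(node.LogicalHeight * dpiScale));

    public static CompilePath Compile(LayerNode node, bool cacheEnabled, float dpiScale)
    {
        // Per-node flag and global switch must both be on.
        if (!node.CacheAsLayer || !cacheEnabled)
            return CompilePath.StandardPass;

        if (node.IsDirty || node.LayerTexture is null)
        {
            // Stand-in for allocating the texture and re-rendering the subtree into it.
            node.LayerTexture = PhysicalSize(node, dpiScale);
            node.IsDirty = false;
            return CompilePath.RenderOffscreenThenDraw;
        }

        // Clean cache: a single texture draw replaces the subtree traversal.
        return CompilePath.DrawCachedTexture;
    }
}
```

Calling `Compile` twice on a fresh node with caching enabled takes the offscreen path first and the cached path second; setting `IsDirty = true` in between forces a re-render.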
In retained scene graphs with interleaved primitive types (such as vector geometries, offscreen computer-generated textures, and rich text visual elements), simple bulk-draw grouping causes Z-order overlap bugs. If all textures or all texts are batched and drawn at the very end of layer compilation, solid backgrounds or overlay vectors can draw on top of pre-rendered textures, resulting in black or empty areas.
ProGPU implements a Dynamic Z-Ordered Draw Call Batching mechanism within Compositor.cs to achieve optimal batching performance while strictly preserving visual Z-order:
- Pending Batch Tracking: Instead of immediate submission, consecutive vector shape and text draw commands are accumulated into contiguous ranges tracked via `_pendingVectorStart` and `_pendingTextStart` pointers.
- Ordered Flush Commits (`CommitPendingDrawCalls`): Whenever a boundary-crossing operation is encountered (such as an offscreen compiled texture draw call or layer bounds transition), the compositor flushes accumulated vector and text batches using `CommitPendingDrawCalls()`. This groups consecutive visual primitives into single drawing calls while guaranteeing they are submitted to the GPU command encoder in the exact Z-order depth traversed by the visual tree.
- Zero-Allocation Dynamic Offsets: The batched ranges directly index into GPU-mapped vertex and index backing buffers, avoiding CPU copy operations and preserving near-native rendering speeds.
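A simplified sketch of the flush-at-boundary idea follows. It is an illustration under stated assumptions: `CmdKind` and `BatchSketch` are hypothetical, pending work is tracked as counts rather than buffer ranges, and flushing vectors before text within one commit is an assumed order, not something the description above specifies.

```csharp
using System.Collections.Generic;

public enum CmdKind { Vector, Text, Texture }

public static class BatchSketch
{
    // Turns a Z-ordered command stream into a list of batched draw calls.
    public static List<string> Compile(IReadOnlyList<CmdKind> commands)
    {
        var draws = new List<string>();
        int pendingVector = 0, pendingText = 0;

        // Flush everything accumulated since the last boundary.
        void CommitPending()
        {
            if (pendingVector > 0) draws.Add($"DrawVectors x{pendingVector}");
            if (pendingText > 0) draws.Add($"DrawText x{pendingText}");
            pendingVector = 0;
            pendingText = 0;
        }

        foreach (var kind in commands)
        {
            switch (kind)
            {
                case CmdKind.Vector: pendingVector++; break;
                case CmdKind.Text: pendingText++; break;
                case CmdKind.Texture:
                    // Boundary: earlier primitives must be submitted before the texture.
                    CommitPending();
                    draws.Add("DrawTexture");
                    break;
            }
        }

        CommitPending();
        return draws;
    }
}
```

Primitives recorded after a texture end up in a later draw call than the texture itself, which is what keeps backgrounds from painting over pre-rendered layers.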
High-performance vector rendering loops are highly sensitive to Garbage Collection (GC) pressure. Passing coordinate arrays (such as Vector2[] for complex polylines, curves, or CAD structures) on every frame forces heap allocation and copying, resulting in massive GC thrashing.
ProGPU completely eliminates this overhead by introducing a zero-allocation vector drawing engine driven by ReadOnlySpan<T> and a Skia-like GpuPicture command caching architecture:
```mermaid
flowchart TD
    subgraph AllocPool ["Zero-Allocation Frame Draw (Pooling)"]
        DrawCall["DrawPolyline(Pen, ReadOnlySpan<Vector2> points)"] --> GetPool["Acquire continuous PointBuffer from DrawingContext"]
        GetPool --> CopySpan["Copy points data in bulk using high-speed Span.CopyTo"]
        CopySpan --> RecordCmd["Record RenderCommand with PointBufferOffset and PointBufferCount"]
    end
    subgraph CacheSystem ["Pre-Recorded Caching Loop (GpuPicture)"]
        RecStart["GpuPictureRecorder.BeginRecording(bounds)"] --> RecDraw["Record vector commands into local buffers once"]
        RecDraw --> RecEnd["EndRecording() compiles into immutable GpuPicture"]
        RecEnd --> DrawCache["context.DrawPicture(picture, cameraViewMatrix)"]
        DrawCache --> CompositorPlay["Compositor compiles and plays back directly in-place (Zero-Copy)"]
    end
```
Since ReadOnlySpan<T> is a stack-only ref struct, it cannot be stored on the heap or inside standard lists. To allow zero-allocation span-based rendering, DrawingContext maintains internal pre-allocated continuous memory lists:
- `PointBuffer` (`List<Vector2>`)
- `DoubleBuffer` (`List<double>`)
- `Line3DBuffer` (`List<Line3D>`)
- `FloatBuffer` (`List<float>`)
On every frame refresh, calling .Clear() on these buffers resets their logical Count to 0 but retains their internal backing array capacity. Drawing coordinates are copied into these pre-allocated pools using high-speed bulk Span<T>.CopyTo operations. As long as capacity is sufficient, frame-by-frame rendering runs at near-native speed with absolutely zero heap allocations.
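As a rough sketch of this pooling pattern, the class below appends a span into a reused `List<Vector2>` and hands back the offset and count that a command would record. `PointPoolSketch` and its members are hypothetical names; the use of `CollectionsMarshal.SetCount` (available from .NET 8) is an assumption about one way to bulk-copy into a list, not a statement of how `DrawingContext` does it.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;

public sealed class PointPoolSketch
{
    private readonly List<Vector2> _points = new List<Vector2>(1024);

    public int Capacity => _points.Capacity;

    // Resets Count to 0; the backing array (and its capacity) is retained.
    public void BeginFrame() => _points.Clear();

    // Bulk-copies the span into the pool and returns where it landed.
    public (int Offset, int Count) Append(ReadOnlySpan<Vector2> points)
    {
        int offset = _points.Count;
        CollectionsMarshal.SetCount(_points, offset + points.Length); // grows only if capacity is exceeded
        points.CopyTo(CollectionsMarshal.AsSpan(_points).Slice(offset));
        return (offset, points.Length);
    }

    // Reads a recorded range back without copying.
    public ReadOnlySpan<Vector2> GetPoints(int offset, int count) =>
        CollectionsMarshal.AsSpan(_points).Slice(offset, count);
}
```

Once the pool has grown to its steady-state size, `BeginFrame` followed by `Append` calls performs no heap allocation.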
To support both real-time dynamic rendering (where coordinates live in the active DrawingContext pools) and cached playback (where coordinates live in static arrays), we introduce the IRenderDataProvider interface:
```csharp
public interface IRenderDataProvider
{
    ReadOnlySpan<Vector2> GetPoints(int offset, int count);
    ReadOnlySpan<double> GetDoubles(int offset, int count);
    ReadOnlySpan<Line3D> GetLines3D(int offset, int count);
    ReadOnlySpan<float> GetFloats(int offset, int count);
}
```

Both `DrawingContext` and `GpuPicture` implement `IRenderDataProvider`. Inside WebGPU mesh compilation, the compositor queries coordinate spans directly from the active provider using the offsets and counts recorded in the `RenderCommand`.
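To illustrate why the indirection helps, here is a reduced sketch with a points-only interface and an array-backed provider standing in for a recorded picture. `IPointProvider`, `FrozenPoints`, `PolylineCommand`, and `ProviderSketch` are all hypothetical; only the offset/count lookup pattern mirrors the description above.

```csharp
using System;
using System.Numerics;

// Reduced, hypothetical form of the provider interface: points only.
public interface IPointProvider
{
    ReadOnlySpan<Vector2> GetPoints(int offset, int count);
}

// Array-backed provider, analogous to an immutable recorded picture.
public sealed class FrozenPoints : IPointProvider
{
    private readonly Vector2[] _points;

    // The array is allocated once, when the recording is frozen.
    public FrozenPoints(ReadOnlySpan<Vector2> points) => _points = points.ToArray();

    public ReadOnlySpan<Vector2> GetPoints(int offset, int count) => _points.AsSpan(offset, count);
}

// A command stores only where its data lives, never the data itself.
public readonly struct PolylineCommand
{
    public readonly int PointBufferOffset;
    public readonly int PointBufferCount;

    public PolylineCommand(int offset, int count)
    {
        PointBufferOffset = offset;
        PointBufferCount = count;
    }
}

public static class ProviderSketch
{
    // The consumer is agnostic to whether the provider is a live pool or a frozen picture.
    public static float Length(IPointProvider provider, PolylineCommand cmd)
    {
        var pts = provider.GetPoints(cmd.PointBufferOffset, cmd.PointBufferCount);
        float total = 0f;
        for (int i = 1; i < pts.Length; i++)
            total += Vector2.Distance(pts[i - 1], pts[i]);
        return total;
    }
}
```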
- Recording: Call `GpuPictureRecorder.BeginRecording(bounds)` to retrieve a recording `DrawingContext`. Commands are recorded normally using the zero-allocation span APIs. Call `recorder.EndRecording()` to compile the active lists into an immutable `GpuPicture` object (which allocates static arrays only once during compile time).
- Playback: Render a pre-recorded picture via `context.DrawPicture(picture)` or apply dynamic camera transitions in GPU-space via `context.DrawPicture(picture, cameraViewMatrix)`.
- Zero-Copy Playback: At the compositor level, when a `DrawPicture` command is encountered, it recursively plays back the pre-compiled picture commands directly in-place using the picture itself as the `IRenderDataProvider`, completely avoiding CPU copying or allocation during rendering.
```csharp
// Draws polylines or polygon outlines directly from stack memory
public void DrawPolyline(Pen pen, ReadOnlySpan<Vector2> points, bool isClosed = false);

// Draws quadratic or cubic B-Spline curves
public void DrawSpline(Pen pen, ReadOnlySpan<Vector2> controlPoints, ReadOnlySpan<double> knots, int degree);

// Draws rational, weighted NURBS curves
public void DrawSpline(Pen pen, ReadOnlySpan<Vector2> controlPoints, ReadOnlySpan<double> knots, ReadOnlySpan<double> weights, int degree, bool isClosed);

// Draws 3D ACIS solids or wireframe boundaries
public void DrawAcisSolid(Pen pen, ReadOnlySpan<Line3D> edges, Matrix4x4 modelTransform);

// Hardware-accelerated dynamic chart line series
public void DrawGpuLineSeries(ReadOnlySpan<float> interleavedCoords, int pointsCount, float thickness, Brush brush);

// Hardware-accelerated dynamic chart scatter series
public void DrawGpuScatterSeries(ReadOnlySpan<float> interleavedCoords, int pointsCount, float radius, Brush brush);
```

The array-based overloads wrap standard heap-allocated arrays into `ReadOnlySpan<T>` using `new ReadOnlySpan<T>(array)` and forward to the high-performance pipeline. They also assign the legacy fields (`SplineWeights`, `Edges3D`) on the created `RenderCommand` structures to preserve 100% test compatibility and visual tree diagnostics:

```csharp
public void DrawPolyline(Pen pen, Vector2[] points, bool isClosed = false);
public void DrawSpline(Pen pen, Vector2[] controlPoints, double[] knots, int degree);
public void DrawSpline(Pen pen, Vector2[] controlPoints, double[] knots, double[]? weights, int degree, bool isClosed);
public void DrawAcisSolid(Pen pen, List<Line3D> edges, Matrix4x4 modelTransform);
```

High-performance viewport virtualization is highly sensitive to coordinate math re-calculation and Z-order sorting. To keep the scrollbar overlay on top in Z-order on macOS Retina displays, clip precisely at viewport boundaries, and hold a locked 60 FPS while scrolling, ProGPU implements a WinUI-Style Cooperating Scroll Virtualization architecture:
```mermaid
flowchart TD
    subgraph Parent ["ItemsControl (Templated Control)"]
        Border["Border (Chrome Background)"] --> ScrollViewer["ScrollViewer (Viewport Clipping)"]
    end
    subgraph Child ["VirtualizingPanel (Cooperating Child)"]
        Panel["UniformVirtualizingGridPanel / VirtualizingStackPanel"]
    end
    ScrollViewer -->|Hosts Panel inside Content| Panel
    Panel -->|Traverses Visual Tree| ParentQuery{"Parent ScrollViewer found?"}
    ParentQuery -- Yes --> Cooperate["Cooperating Mode: Dynamic Offset Bindings"]
    ParentQuery -- No --> Standalone["Standalone Mode: Fallback ScrollBarOverlay child"]
    Cooperate -->|MeasurePass: DesiredSize.Y = TotalVirtualHeight| ScrollViewer
    ScrollViewer -->|Updates scrollbars and sets VerticalOffset| Cooperate
    ScrollViewer -->|Physically translates panel by -VerticalOffset| Panel
    Cooperate -->|UpdateViewport: Render cells at absolute position row*ItemHeight| Panel
```
- Cooperating Mode: When hosted inside a parent `ScrollViewer`, `VirtualizingPanel` dynamically traverses up the visual parent chain (`ScrollViewerOwner`) to establish a direct binding link:
  - Unified Offsets: Reading and writing `ScrollOffset` binds directly to `ScrollViewer.VerticalOffset`.
  - Adaptive Viewport: The layout viewport bounds (`ViewportWidth` / `ViewportHeight`) scale automatically with the parent `ScrollViewer` window boundaries.
  - Extent Reporting: During the measure pass (`MeasureOverride`), the panel computes the total height of all items (`TotalVirtualHeight`) and returns it as its desired size. This informs the `ScrollViewer` of the total scroll extent, sizing the capsule scrollbar perfectly.
  - Z-Order Supremacy: The panel's local scrollbar overlay visual is removed, allowing the `ScrollViewer` to draw its native glassmorphic capsule scrollbar in its own `OnRender` pass. Because the scrollbar is rendered after all visual children (including the panel and its cell cards) are painted, the scrollbar remains perfectly on top of all item cards and intercepts clicks first.
- Standalone Mode: If a `ScrollViewer` is not found, the panel falls back to Standalone Mode, drawing its own internal `ScrollBarOverlay` child visual and intercepting pointer wheel events directly, ensuring full backward compatibility.
To eliminate floating-point coordinate drift and keep layout compilation cycles fast:
- In cooperating mode, the `ScrollViewer` physically translates its `Content` container by `-_verticalOffset` and `-_horizontalOffset` during the arrange pass.
- The virtualizing panel detects this physical shift and places the active visible cell visuals at their absolute virtual coordinates (e.g., `row * ItemHeight` for grids or `i * ItemHeight` for stack panels) relative to the panel, letting the parent graphics pipeline translate them onto the screen. This reduces layout calculations to a simple multiplication per cell.
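The arithmetic involved can be sketched as follows. This is a minimal illustration assuming uniform item heights and a single column; `VirtualRangeSketch` and its method names are hypothetical and not part of `VirtualizingPanel`.

```csharp
using System;

public static class VirtualRangeSketch
{
    // Total scroll extent reported during measure.
    public static double TotalVirtualHeight(int itemCount, double itemHeight) =>
        itemCount * itemHeight;

    // Absolute position of a row relative to the panel (not the screen).
    public static double RowTop(int row, double itemHeight) => row * itemHeight;

    // First visible row and the number of rows intersecting the viewport.
    public static (int First, int Count) VisibleRows(
        double verticalOffset, double viewportHeight, double itemHeight, int itemCount)
    {
        if (itemCount <= 0 || itemHeight <= 0 || viewportHeight <= 0)
            return (0, 0);

        int first = Math.Clamp((int)Math.Floor(verticalOffset / itemHeight), 0, itemCount - 1);
        int last = Math.Clamp(
            (int)Math.Ceiling((verticalOffset + viewportHeight) / itemHeight) - 1,
            first, itemCount - 1);
        return (first, last - first + 1);
    }
}
```

With `verticalOffset = 250`, `viewportHeight = 300`, and `itemHeight = 100`, rows 2 through 5 are realized and placed at 200, 300, 400, and 500; the parent's translation by `-250` then moves them into view.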
CAD drawings (like DXF files) contain hundreds of thousands or millions of vector elements (lines, circles, polyline arcs, splines, and complex hatches). Recursively compiling these vector primitives from a dynamic visual tree every frame on camera changes (zoom/pan) is CPU-prohibitive.
ProGPU introduces Hardware-Accelerated Static WebGPU Buffers (Option B) which compiles all vector primitives once into a static, GPU-mapped vertex/index store (DxfStaticBuffer). Panning and zooming are executed entirely on the GPU via updates to the viewport uniforms, maintaining a locked 60+ FPS on massive, million-entity CAD models.
While static geometry scales infinitely on the GPU, TrueType Font (TTF) text is drawn as textured quads pointing to a bitmap-cached GlyphAtlas. Zooming in stretches these pre-rendered quads, causing bilinear texture blur because the glyph atlas texture was rasterized at a static zoom scale.
ProGPU resolves this by implementing Crisp Static Text Buffers via Dynamic Re-compilation:
```mermaid
flowchart TD
    ZoomChange{"Context.Zoom != _lastZoom?"}
    ZoomChange -- No --> DrawStatic["Draw Static Dxf Buffer - 100% GPU Bound (Panning Free)"]
    ZoomChange -- Yes --> Recompile["Trigger RecompileStaticText on CPU"]
    Recompile --> ScaleDPI["Scale effective dpiScale = _currentDpiScale * Context.Zoom"]
    ScaleDPI --> RasterGlyph["Rasterize Glyph at physical FontSize * dpiScale * Zoom inside Atlas"]
    RasterGlyph --> ModelSpace["Divide quad vertex coords by effective dpiScale (cancel out Zoom)"]
    ModelSpace --> WriteGPU["Dynamic Copy-on-Write vertex/index re-upload to GpuBuffer"]
    WriteGPU --> DrawStatic
```
- Panning is Completely Free: Since panning does not affect font size or rasterization dimensions, panning a static drawing remains 100% GPU-bound and runs with zero CPU overhead.
- Retina-Sharp Snapping: On camera zoom changes, the compositor triggers a surgical, sub-millisecond re-compilation of ONLY the text commands using the new zoom factor: $$\text{effectiveDpiScale} = \text{dpiScale} \cdot \text{Zoom}$$
- Glyph Sizing: Glyphs are rasterized into the shared `GlyphAtlas` at their exact, high-resolution physical size (`FontSize * effectiveDpiScale`), ensuring pixel-perfect Retina snapping.
- Automatic Scaling Cancelation: The compiled quad vertex positions ($v_0, v_1, v_2, v_3$) are divided by `effectiveDpiScale` to map them back to base model/world coordinates. When the vertex shader multiplies them by the custom model-to-screen MVP matrix (which scales by `Zoom`), the zoom factor is mathematically canceled out, mapping the quad 1-to-1 to physical screen pixels with zero texture stretching or blur.
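A short worked example makes the cancelation concrete. It assumes that the remaining $\text{dpiScale}$ factor is applied by the logical-to-physical viewport mapping after the MVP transform; the symbols $s_{\text{raster}}$, $s_{\text{model}}$, and $s_{\text{screen}}$ are introduced here for illustration only.

$$s_{\text{raster}} = \text{FontSize} \cdot \text{dpiScale} \cdot \text{Zoom}, \qquad s_{\text{model}} = \frac{s_{\text{raster}}}{\text{effectiveDpiScale}} = \text{FontSize}$$

$$s_{\text{screen}} = s_{\text{model}} \cdot \text{Zoom} \cdot \text{dpiScale} = s_{\text{raster}}$$

For $\text{FontSize} = 12$, $\text{dpiScale} = 2$, and $\text{Zoom} = 3$: the glyph is rasterized at $72$ px, its quad is stored at $72 / 6 = 12$ model units, and it is drawn at $12 \cdot 3 \cdot 2 = 72$ physical pixels, so each atlas texel covers exactly one screen pixel.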
To support instantaneous zoom transitions on massive CAD models containing thousands of text elements (such as Schemat IOS Karvina CZ.dxf), ProGPU integrates three advanced graphics-pipeline optimizations:
- $O(\text{TextCount})$ Pre-Filtered Text Records Cache:
  - Problem: Scanning millions of drawing commands recursively on the CPU during zoomed snapping steps to filter out text elements introduced noticeable interface stutters.
  - Solution: During the initial compilation of the static buffer, the compositor captures the exact `DrawText` commands and their parent block transformations into a flat `TextRecords` array in the `DxfStaticBuffer`:
    ```csharp
    public struct StaticTextRecord
    {
        public RenderCommand Command;
        public Matrix4x4 Transform;
    }
    ```
    Subsequent snapped zoom changes bypass the drawing hierarchy entirely and recompile only the text records, reducing complexity from $O(\text{TotalElements})$ to $O(\text{TextElements})$.
- Discrete Font Snapping & Quad Scaling:
  - Problem: As the camera zoom levels increase, font sizes become extremely large (up to 128f), which rapidly bloats and thrashes the shared `GlyphAtlas` texture ($2048 \times 2048$), triggering frequent cache evictions. Computing 4-way subpixel snap coordinates for huge fonts also increases memory area consumption by $4\times$.
  - Solution:
    - Clamping: Caps the maximum physical font size rasterized into the atlas to `64f` (instead of `128f`). GPU bilinear filtering scales these large high-resolution sources up without visible quality loss, using $4\times$ less atlas area.
    - Size Snapping: Snaps `rasterFontSize` to discrete steps (0.5px steps below 24px, 2px steps above 24px) to maximize cache hits. Quad boundaries are scaled proportionally by `scaleRatio = physicalFontSize / rasterFontSize` so that the on-screen size remains exact.
    - Subpixel Bypassing: Disables subpixel snapping for font sizes larger than `24f` (since subpixel shifts are visually imperceptible on large characters), saving an additional $4\times$ in atlas footprint.
- WebGPU Queue & Driver Submission Batching:
  - Problem: Previously, rasterizing each new glyph synchronously created a temporary uniform buffer, constructed a WebGPU bind group, instantiated a command encoder, and immediately executed a sequential queue submission (`QueueSubmit`). For drawings with thousands of characters, this sequential driver loop caused severe CPU/GPU Metal synchronization bottlenecks on macOS.
  - Solution: Implemented batching APIs (`BeginBatch` / `EndBatch`) in `GlyphAtlas.cs` to lazily pool and combine multiple glyph compute dispatches. All rasterizations are now recorded into a single `CommandEncoder` and executed in one unified `QueueSubmit` at the end of the compile pass, yielding a $1000\times+$ reduction in driver submission overhead.
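The clamping and size-snapping rules can be sketched as below. The constants come from the description above; the rounding mode, the lower bound of one step, and the names `FontSnapSketch`, `SnapRasterSize`, `ScaleRatio`, and `UseSubpixelSnapping` are assumptions made for illustration.

```csharp
using System;

public static class FontSnapSketch
{
    public const float MaxRasterSize = 64f;   // atlas clamp
    public const float StepThreshold = 24f;   // boundary between fine and coarse steps

    // Clamp to the atlas maximum, then snap to 0.5px or 2px steps.
    public static float SnapRasterSize(float physicalFontSize)
    {
        float clamped = MathF.Min(physicalFontSize, MaxRasterSize);
        float step = clamped < StepThreshold ? 0.5f : 2f;
        float snapped = MathF.Round(clamped / step, MidpointRounding.AwayFromZero) * step;
        return MathF.Max(snapped, step);      // assumption: never snap down to zero
    }

    // Factor applied to quad bounds so the on-screen size stays exact.
    public static float ScaleRatio(float physicalFontSize) =>
        physicalFontSize / SnapRasterSize(physicalFontSize);

    // Subpixel variants are only worth their atlas space at small sizes.
    public static bool UseSubpixelSnapping(float rasterFontSize) =>
        rasterFontSize <= StepThreshold;
}
```

A requested physical size of `128f` is rasterized at `64f` and drawn with a `scaleRatio` of 2, while `13.3f` snaps to `13.5f` and shares its atlas entry with every other request that rounds to the same step.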
To eliminate the continuous CPU memory allocation overhead of creating small, temporary GPU uniform buffers on every render pass, we implemented a Pre-allocated Ring Uniform Buffer pattern in both GlyphAtlas and PathAtlas:
- Single Bulk Pre-allocation: Allocates a single large `GpuBuffer` of `256KB` once at system startup. This pre-allocated ring buffer acts as the backing storage for up to 4,000 active glyph or vector path dispatches.
- 256-Byte Alignment Compliance: Follows the WebGPU standard (`minUniformBufferOffsetAlignment` boundary constraint of 256 bytes) by rounding up structural uniform offsets with a fast bitwise operation: $$\text{alignedSize} = (\text{SizeOf} + 255) \,\&\, \sim 255$$
- Fast Queue Copy-on-Write: Inside batch rasterization and pending path loops, parameters are written directly to the pre-allocated ring buffer at the current `_ringOffset` using `QueueWriteBuffer`, completely avoiding buffer creation/destruction:
  ```csharp
  _context.Wgpu.QueueWriteBuffer(_context.Queue, _uniformRingBuffer.BufferPtr, _ringOffset, &uniforms, (uint)Marshal.SizeOf<GlyphUniforms>());
  ```
- Binding Slice Offsets: Dynamic bind groups are configured pointing to the exact slice within the ring buffer using `Offset = _ringOffset` and `Size = Marshal.SizeOf<Uniforms>()`. On each batch completion, `_ringOffset` is incremented by `alignedSize`, and it resets to `0` at the start of a new batch loop. This achieves zero CPU allocations inside dynamic rasterization loops.
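The offset bookkeeping can be sketched without any GPU calls. `UniformRingSketch` is a hypothetical class; throwing on exhaustion is an assumed policy, since the text above does not say what happens when the ring is full. Note that with a 256-byte stride, a 256 KB buffer provides 1,024 slices per batch, so reaching the 4,000 dispatches mentioned above would require a larger buffer or a smaller stride.

```csharp
using System;

public sealed class UniformRingSketch
{
    public const uint Alignment = 256;             // minUniformBufferOffsetAlignment, per the text
    public const uint RingSizeBytes = 256 * 1024;  // 256 KB ring

    private uint _ringOffset;

    // Rounds a size up to the next multiple of 256 with a bit mask.
    public static uint AlignUp(uint size) => (size + (Alignment - 1)) & ~(Alignment - 1);

    // The ring is rewound at the start of each batch.
    public void BeginBatch() => _ringOffset = 0;

    // Reserves one aligned slice and returns its byte offset within the ring.
    public uint Reserve(uint uniformSize)
    {
        uint aligned = AlignUp(uniformSize);
        if (_ringOffset + aligned > RingSizeBytes)
            throw new InvalidOperationException("Uniform ring buffer exhausted for this batch.");

        uint offset = _ringOffset;
        _ringOffset += aligned;
        return offset;
    }
}
```

The returned offset is what would be passed both to the buffer write and to the bind group entry's `Offset`.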
Updating dense vector meshes and text quads during snapped zoom events can cause severe CPU-GPU hardware execution stalls. If the CPU disposes and recreates vertex/index buffers while the GPU command queue is actively reading from them, the graphics driver is forced to block CPU execution to synchronize hardware lifecycles.
To prevent these stalls and achieve perfectly fluid rendering, we implemented a Double-Buffering Swapchain pattern:
- Asynchronous Back-Buffering: Maintains dual buffer sets in `DxfStaticBuffer`:
  - Front-Buffers (`TextVertexBuffer`, `TextIndexBuffer`, `TextIndexCount`) currently being drawn by the compositor.
  - Back-Buffers (`_textVertexBufferBack`, `_textIndexBufferBack`, `_textIndexCountBack`) dedicated to accommodating the next camera layout recalculation.
- Non-Blocking Dynamic Copy: When `UpdateTextBuffer` is invoked during snapped zooms, it resizes and writes to the back-buffers asynchronously.
- Zero-Allocation Swapping: Swaps the front and back buffer references instantly using cheap variable re-assignment on the CPU:
  ```csharp
  var tempVertexBuffer = TextVertexBuffer;
  TextVertexBuffer = _textVertexBufferBack;
  _textVertexBufferBack = tempVertexBuffer;
  ```
- Static Bind-Group Stability: Because vertex and index buffer mappings are bound directly via render encoder draw commands rather than static composition bind groups, swapping front/back buffers bypasses bind-group recreation or layout invalidations entirely, ensuring stutter-free, instant zoom actions.
Offscreen Gaussian blur and drop shadow dispatches are highly sensitive to parameter fluctuations during keyframe animations or hover transitions. Smooth float radius adjustments (e.g. transitioning from `1.0f` to `3.0f`) dynamically modify the computed iteration count from one frame to the next.
To stabilize effect execution, we implemented a Snapped Radii Pipeline:
- Discrete Increments: Symmetrically snaps incoming `radius` and `blurRadius` parameters to discrete `0.5f` pixel boundaries at the entry points of `ApplyGaussianBlur` and `ApplyDropShadow`:
  ```csharp
  float snappedRadius = MathF.Round(radius * 2f) / 2f;
  ```
- Pipeline and Bind-Group Lock: Snapping ensures that the computed iteration count remains perfectly locked and stable during intermediate keyframes. WGSL shader binding entries, textures, and command layouts remain identical across frame transitions, delivering extremely fluid hover animations and eliminating transient render delays.
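To see the effect of snapping on an animation, the sketch below counts how many distinct radii survive over a transition. `RadiusSnapSketch` and `DistinctRadii` are hypothetical helpers; only the `Snap` expression is taken from the text. Note that `MathF.Round` uses round-half-to-even by default, so exact midpoints such as `1.25f` snap down to `1.0f`.

```csharp
using System;
using System.Collections.Generic;

public static class RadiusSnapSketch
{
    // Snap to the nearest 0.5 pixel boundary.
    public static float Snap(float radius) => MathF.Round(radius * 2f) / 2f;

    // Number of distinct snapped radii over a linear animation of `frames` steps.
    public static int DistinctRadii(float from, float to, int frames)
    {
        var seen = new HashSet<float>();
        for (int i = 0; i < frames; i++)
        {
            float t = frames > 1 ? i / (float)(frames - 1) : 0f;
            seen.Add(Snap(from + (to - from) * t));
        }
        return seen.Count;
    }
}
```

A 60-frame transition from `1.0f` to `3.0f` passes through 60 raw radii but only the five snapped values 1.0, 1.5, 2.0, 2.5, and 3.0, so any state derived from the radius changes at most four times.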
The ProGPU solution is partitioned into modular, highly specialized C# projects. Each project governs a specific layer of the UI, vector, or graphics compilation loops:
| Project | Assembly Name | Core Architectural Responsibility | Key Components & Classes |
|---|---|---|---|
| `ProGPU.Backend` | `ProGPU.Backend.dll` | Low-level hardware infrastructure and WebGPU swapchain orchestration. | WgpuContext, Window, Shaders, RenderPipelineCache |
| `ProGPU.Compute` | `ProGPU.Compute.dll` | Orchestration of WebGPU GPGPU compute pipelines and parallel filter dispatches. | ComputeAccelerator, ComputeShaders |
| `ProGPU.Vector` | `ProGPU.Vector.dll` | Mathematical primitives, Bezier models, path segment parsing, and atlas mapping. | PathGeometry, PathFigure, GpuPathSegment, PathAtlas |
| `ProGPU.Text` | `ProGPU.Text.dll` | TrueType Font (TTF) parsing, glyph extraction, word-wrapping, and line layout engines. | TtfFont, GlyphAtlas, TextLayout |
| `ProGPU.Scene` | `ProGPU.Scene.dll` | Retained scene-graph visual tree, decoupled layout boundaries, and compositor compiler. | Compositor, ContainerVisual, DrawingVisual, ILayoutNode |
| `ProGPU.Layout` | `ProGPU.Layout.dll` | XAML-compatible sizing negotiation lifecycle (Measure / Arrange) and layout panels. | LayoutNode, StackPanel, GridPanel, CanvasPanel |
| `ProGPU.WinUI` | `ProGPU.WinUI.dll` | High-level interactive UI control suite layered on top of layout nodes. | Border, Grid, Pivot, RichTextBlock, ScrollViewer, SplitView |
| `ProGPU.Virtualization` | `ProGPU.Virtualization.dll` | Dynamic scrolling viewport orchestration and UI virtualization controllers. | VirtualizingPanel, ViewportInfo |
| `ProGPU.Samples` | `ProGPU.Samples.dll` | Showcase bootstrap, keyframe and physics animation drivers, diagnostics, and stress-test suites. | Program, AppState, MainWindowController, MotionMarkShowcaseVisual |
ProGPU routes all graphics and compute tasks directly to the GPU using specialized WGSL (WebGPU Shading Language) shaders. The following sections detail their purpose, execution pipelines, and exact implementations.
- Role: Primary graphics pipeline shader for standard UI rendering. Responsible for rasterizing vector shapes (rectangles, ellipses, rounded rectangles) and evaluating Bezier curves and elliptical arcs on the GPU.
- Why It is Used: Avoids uploading dense pre-tessellated mesh structures. Instead, it utilizes cheap mathematical Signed Distance Fields (SDFs) and GPU vertex expansion to draw vector primitives with zero CPU overhead.
- Implementation Mechanics:
  - GPU Stroke Expansion & Miter Scaling (`sType == 3u`): Expands lines dynamically in the vertex shader. Computes normal vectors ($miterN$) at segment junctions, scales them by $1/\cos(\theta)$ ($miterScale$), and offsets vertices to form precise, variable-thickness miter joints. Passes the signed pixel distance from the center line to the fragment shader via `gridIndex` for zero-cost edge anti-aliasing.
  - Dynamic Bezier Evaluation (`sType == 5u` and `6u`): Replaces CPU Bezier flattening. For Quadratics and Cubics, the vertex shader interpolates coordinates directly based on the thread's `vertexIndex` and parametric factor $t \in [0, 1]$, calculating curve positions and tangents to offset vertices outward along normal vectors on the fly, storing signed pixel distances in `gridIndex`.
  - Dynamic Arc Evaluation (`sType == 11u`): Replaces CPU arc flattening for valid path strokes. The compositor sends the transformed ellipse center plus two axis vectors, and the vertex shader evaluates arc positions and tangents parametrically before stroke expansion.
  - Analytical SDF Fragment Evaluation (`sType < 3u`): Computes Signed Distance Fields for Rectangles, Ellipses, and Rounded Rectangles. Anti-aliases boundaries dynamically using screen-space partial derivatives: $$\text{fw} = \max(\text{fwidth}(d), 0.0001)$$ $$\alpha = 1.0 - \text{smoothstep}(-0.5 \cdot \text{fw}, 0.5 \cdot \text{fw}, d)$$
  - Pixel-Distance Stroke Anti-Aliasing (`sType == 3u || 5u || 6u || 11u`): Resolves aliasing for lines, curves, and arcs by evaluating screen-space smoothstep transitions using the interpolated `gridIndex` pixel distance to the stroke boundary: $$d_{\text{shape}} = \text{abs}(\text{gridIndex}) - \text{strokeThickness} \cdot 0.5$$ $$\alpha = 1.0 - \text{smoothstep}(-0.5, 0.5, d_{\text{shape}})$$
  - Gradient Interpolation: Evaluates Linear (`brushType == 1u`) and Radial (`brushType == 2u`) gradients dynamically for up to 4 stop colors by calculating projection coordinates and interpolating between bounds using stop offsets.
- Role: Specialized graphics shader for high-speed, sharp text display.
- Why It is Used: Traditional text rasterization blurs heavily under scaling. The TextShader samples high-precision SDF textures and applies dilation offsets and power-based sharpness filters to ensure text remains crisp at any display size or zoom level.
- Implementation Mechanics:
  - Samples the single-channel glyph atlas: `let alpha = textureSample(atlasTexture, atlasSampler, input.texCoord).r;`
  - Applies a dilation scale based on the requested stroke thickness: `let dilated = clamp(alpha * input.strokeThickness, 0.0, 1.0);`
  - Filters sharpness using a power curve driven by the corner radius: `let finalAlpha = pow(dilated, input.cornerRadius);`
- Role: WebGPU compute shader tasked with pre-rasterizing vector glyph outlines into the glyph atlas texture.
- Why It is Used: Bypasses slow CPU-based glyph rasterizers entirely, using parallel GPU threads to rasterize outlines directly on the GPU.
- Implementation Mechanics:
  - Operates on a $16 \times 16$ thread group.
  - Calculates intersections using a 16x supersampled (SSAA) analytical winding-number raycaster.
  - Solves quadratic equations directly inside the WGSL shader (`solve_quadratic`) to evaluate Bezier curve boundaries, updating winding directions according to the curve's vertical tangent derivative.
  - Writes the calculated coverage mask directly to the storage texture: `textureStore(atlasTexture, writeCoord, vec4<f32>(coverage, 0.0, 0.0, 0.0));`
- Role: Advanced WebGPU compute shader that computes analytical non-zero winding fills for arbitrary paths.
- Why It is Used: Bypasses CPU segment flattening and triangulation completely, allowing the GPU to raycast complex Bezier geometry directly.
- Implementation Mechanics:
  - Computes intersections of horizontal rays with Line, Quadratic Bezier, and Cubic Bezier segments.
  - Features an analytical Cardano's formula solver (`solve_cubic` inside WGSL) to evaluate cubic Bezier roots: $$p = b - \frac{a^2}{3}, \quad q = c - \frac{ab}{3} + \frac{2a^3}{27}, \quad D = \frac{q^2}{4} + \frac{p^3}{27}$$ If $D \leq 0$, it extracts up to 3 real roots using trigonometric cosine angles, updating the winding number according to the tangent derivative $P'_y(t) = 3 a t^2 + 2 b t + c$.
  - Executes 4-point supersampling (SSAA) using subpixel sampling coordinate offsets (`+0.25`, `+0.75`) in local space (`fp2`), achieving hardware-accurate anti-aliased edge coverage.
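For reference, the same depressed-cubic approach can be written on the CPU. This is a sketch in C#, not a port of the WGSL `solve_cubic`: it solves the monic cubic $t^3 + a t^2 + b t + c = 0$ using the $p$, $q$, $D$ definitions above, and the names `CubicSketch` and `SolveCubic` are hypothetical. A raycaster would additionally keep only roots with $t \in [0, 1]$.

```csharp
using System;
using System.Collections.Generic;

public static class CubicSketch
{
    // Real roots of t^3 + a t^2 + b t + c = 0 via Cardano's method.
    public static List<double> SolveCubic(double a, double b, double c)
    {
        var roots = new List<double>(3);
        double p = b - a * a / 3.0;
        double q = c - a * b / 3.0 + 2.0 * a * a * a / 27.0;
        double d = q * q / 4.0 + p * p * p / 27.0;
        double shift = -a / 3.0;             // undo the substitution t = x - a/3

        if (d > 0.0)
        {
            // One real root: sum of two real cube roots.
            double s = Math.Sqrt(d);
            roots.Add(Math.Cbrt(-q / 2.0 + s) + Math.Cbrt(-q / 2.0 - s) + shift);
        }
        else if (p == 0.0)
        {
            // D <= 0 with p == 0 implies q == 0: a triple root.
            roots.Add(shift);
        }
        else
        {
            // Three real roots (p < 0 here): trigonometric form.
            double r = Math.Sqrt(-p / 3.0);
            double phi = Math.Acos(Math.Clamp(-q / (2.0 * r * r * r), -1.0, 1.0));
            for (int k = 0; k < 3; k++)
                roots.Add(2.0 * r * Math.Cos((phi - 2.0 * Math.PI * k) / 3.0) + shift);
        }
        return roots;
    }
}
```

For $(t-1)(t-2)(t-3) = t^3 - 6t^2 + 11t - 6$ this gives $p = -1$, $q = 0$, $D = -1/27$, and the trigonometric branch returns the roots 3, 2, and 1.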
- Role: Parallel compute shaders for high-performance backdrop and glass blurs.
- Why It is Used: Bypasses slow pixel shader convolution passes by executing parallel thread blocks directly on texture buffers.
- Implementation Mechanics:
  - Operates in two consecutive passes (Horizontal, then Vertical) to reduce the per-pixel cost from $O(K^2)$ to $O(K)$ instructions.
  - Executes an unrolled 5-tap Gaussian kernel using hardcoded weights to avoid memory fetch latency: $$\text{color} = 0.0625 \cdot T[-2] + 0.25 \cdot T[-1] + 0.375 \cdot T[0] + 0.25 \cdot T[1] + 0.0625 \cdot T[2]$$
  - Clamps texture coordinate bounds inside `textureLoad` to eliminate edge bleed artifacts.
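A CPU reference of the separable 5-tap kernel can help validate shader output. The sketch below works on a single-channel, row-major `float[]` image with clamped sampling; `BlurSketch` and its methods are hypothetical names, and the real shaders operate on textures rather than arrays.

```csharp
using System;

public static class BlurSketch
{
    // Same weights as the kernel above; they sum to exactly 1.
    private static readonly float[] Weights = { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };

    // One 1D pass along (dx, dy) with edge clamping.
    private static float[] Pass(float[] src, int width, int height, int dx, int dy)
    {
        var dst = new float[src.Length];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                float sum = 0f;
                for (int k = -2; k <= 2; k++)
                {
                    int sx = Math.Clamp(x + k * dx, 0, width - 1);
                    int sy = Math.Clamp(y + k * dy, 0, height - 1);
                    sum += Weights[k + 2] * src[sy * width + sx];
                }
                dst[y * width + x] = sum;
            }
        }
        return dst;
    }

    // Horizontal pass followed by vertical pass: 10 taps per pixel instead of 25.
    public static float[] Blur(float[] src, int width, int height) =>
        Pass(Pass(src, width, height, 1, 0), width, height, 0, 1);
}
```

Blurring a single bright pixel in the middle of a 5x5 image leaves $0.375^2 = 0.140625$ at the center, and a constant image is unchanged because the weights sum to 1 and samples are clamped at the edges.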
- Role: WebGPU compute shader calculating soft drop shadows and glowing neon halos for layout elements.
- Why It is Used: Evaluates dynamic blurring and translation offsets over element boundaries in a single dispatch pass.
- Implementation Mechanics:
  - Operates on a $16 \times 16$ thread block.
  - Takes a `Params` uniform block specifying translating `offset`, shadow `color`, and `blurRadius`.
  - Loops over a sliding window of size `[-blurRadius, blurRadius]`.
  - Extracts the source offscreen texture's alpha channel, averages the coverage, and outputs the shifted, blurred, and color-multiplied mask back to the destination buffer: $$\text{shadowColor} = \vec{C}_{\text{params}} \cdot (A_{\text{sum}} / \text{count})$$
To support high-quality rendering diagnostics and verify vector structures, ProGPU includes two dedicated diagnostic utilities:
Located in tools/TtfDiag/, this is a generic console tool designed to inspect outline structures, endpoint coordinates, and control points of TrueType fonts. It is especially useful for diagnosing text rendering quality, drop-out artifacts, or glyph parsing inconsistencies.
- Usage:
  ```bash
  # Run using the system's Arial font (supplemental) fallback to inspect specific glyphs (e.g. 'G' and 'g')
  dotnet run --project tools/TtfDiag -- Arial Gg

  # Run with an absolute path to a custom font and custom character sequence
  dotnet run --project tools/TtfDiag -- /System/Library/Fonts/Supplemental/Georgia.ttf ABC
  ```
- Output: Dumps the exact TrueType outline geometry, closed/filled figure status, segment types (Lines/Quadratic Beziers), and precise coordinates using standard invariant decimal formatting.
Located in `tools/DxfDiag/`, this is a standalone command-line utility to inspect DXF vector files. It lists all available layouts and layers, prints active layout geometric bounds, recursive block hierarchies, and nested insert attributes (tags/values), and detects coordinate outliers exceeding absolute limits, writing the report to `outliers.txt` in the local directory.
- Usage:
  ```bash
  # Run on a target DXF drawing file to inspect the default active space layout
  dotnet run --project tools/DxfDiag -- <path-to-dxf-file>

  # Run on a target DXF drawing file and explicitly target a specific layout space (e.g. 'A0')
  dotnet run --project tools/DxfDiag -- <path-to-dxf-file> --layout A0
  ```
- Output: Generates a detailed audit of entity counts, viewport settings, block trees, and coordinates, saving the report to `outliers.txt` and logging a summary to the console.
ProGPU is designed to act as an embedded high-performance graphics substrate inside standard host XAML frameworks. We provide native integration packages for both Avalonia (ProGPU.Avalonia) and Uno Platform (ProGPU.Uno), allowing developers to overlay low-allocation WebGPU rendering canvases directly inside standard desktop applications.
The integration layer hosts a headless, offscreen WgpuContext and Compositor instance inside a custom control subclass (Control in Avalonia, ContentControl in Uno). WebGPU renders all visual tree and CAD vectors offscreen, which are then blitted directly to the host's screen.
```mermaid
graph TD
    subgraph UIThread ["Host UI Thread (Input & Sizing)"]
        Size[Sizing Negotiation: Measure & Arrange] --> Input[Pointer Event Capture & Translation]
    end
    subgraph GPUThread ["GPU & WebGPU Staging Loop"]
        Input -->|InputSystem.Inject| WG[WebGPU Core Offscreen Render]
        Size -->|Logical Bounds| WG
        WG -->|CommandEncoderCopy| ST[Staging Buffer VRAM]
        ST -->|Sync MapRead| MP[Mapped CPU Pointer]
        MP -->|Direct Pointer Blit| WB[WriteableBitmap 96 DPI]
        WB -->|Invalidate / DrawImage| SCR[High-DPI Retina Screen]
    end
```
Due to standard platform-agnostic FFI limitations in wgpu-native, raw WGPUTexture pointers cannot be shared directly with the compositor's graphics context (Metal/D3D) as IOSurfaceRef or id<MTLTexture> handles without writing custom native Rust/C++ bridging wrappers.
To bypass these FFI opaque struct constraints and deliver 100% stable, platform-independent rendering, ProGPU implements a highly optimized Direct Bitmap Blitting pipeline:
- Aligned GPU Staging Buffers: WebGPU allocates a staging buffer with `BufferUsage.MapRead | BufferUsage.CopyDst`. The row pitch (`BytesPerRow`) is rounded up to the next multiple of 256 bytes, as the WebGPU specification requires for texture-to-buffer copies:

  $$\text{BytesPerRow} = (\text{width} \cdot \text{bytesPerPixel} + 255) \mathbin{\&} \sim 255$$

- Synchronous MapRead Polling: Each frame, a command encoder executes `CopyTextureToBuffer` from the offscreen target to the staging buffer. The buffer is mapped via `BufferMapAsync`, and the UI thread polls `wgpuDevicePoll` in a light spin loop until mapping completes.
- Direct Row Pointer Blitting: Once mapped, the raw address of the mapped buffer is extracted. The control performs a pointer-based copy using `System.Buffer.MemoryCopy` straight into the locked buffer address of the host's high-DPI `WriteableBitmap`:

  ```csharp
  using (var locked = _writeableBitmap.Lock())
  {
      byte* srcBytes = (byte*)mappedPtr;
      byte* dstBytes = (byte*)locked.Address;
      // Copy only the visible pixels of each row; the staging rows are padded to _bytesPerRow.
      uint rowBytes = _renderWidth * bytesPerPixel;
      for (uint y = 0; y < _renderHeight; y++)
      {
          byte* srcRow = srcBytes + (y * _bytesPerRow);
          byte* dstRow = dstBytes + (y * (uint)locked.RowBytes);
          System.Buffer.MemoryCopy(srcRow, dstRow, rowBytes, rowBytes);
      }
  }
  ```

  This row-by-row blit is a plain CPU memory copy, so it adds little overhead per frame and bypasses bilinear filtering blur.
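As a worked illustration of the row-pitch formula above, the following minimal sketch computes the aligned `BytesPerRow` and the resulting staging-buffer size. The helper names `AlignBytesPerRow` and `StagingBufferSize` are hypothetical and are not part of ProGPU's API; the only thing assumed is the 256-byte alignment stated above.

```csharp
using System;

// Hypothetical helper (not a ProGPU API): rounds the tightly packed row size
// up to the next multiple of 256 bytes, matching the formula above.
static uint AlignBytesPerRow(uint width, uint bytesPerPixel)
{
    const uint Alignment = 256;
    return (width * bytesPerPixel + (Alignment - 1)) & ~(Alignment - 1);
}

// Hypothetical helper: total bytes needed when every row uses the padded pitch.
static ulong StagingBufferSize(uint width, uint height, uint bytesPerPixel)
    => (ulong)AlignBytesPerRow(width, bytesPerPixel) * height;

Console.WriteLine(AlignBytesPerRow(1000, 4));       // 4096 (4000 bytes of pixels + 96 bytes of padding)
Console.WriteLine(AlignBytesPerRow(1920, 4));       // 7680 (already a multiple of 256)
Console.WriteLine(StagingBufferSize(1000, 500, 4)); // 2048000
```

The padding is why a row-by-row copy reads `width * bytesPerPixel` bytes per row while stepping the source pointer by the padded pitch.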
On macOS Retina displays (e.g. DpiScale = 2.0), standard platform-specific graphics renderers often apply the display's scaling factor twice when drawing a high-DPI bitmap, blowing up the layout and creating blurry graphics.
ProGPU resolves this double-scaling bug through strict physical-to-logical coordination:
- 96 DPI Isolation: The host `WriteableBitmap` is instantiated at a constant 96 DPI (`new Vector(96, 96)`), making its logical size match its physical size.
- Logical-Bounds Offscreen Rendering: Viewport dimensions passed to `Compositor.RenderOffscreen` are strictly mapped in logical coordinates, while the internal WebGPU pipeline multiplies them by `DpiScale` to align the physical viewport.
- Clean Down-Scaling: During the draw pass, the physical staging bitmap is scaled down into the host control's logical bounds using a standard 1-to-1 stretch layout (`Stretch.Fill` in Uno, `context.DrawImage` in Avalonia). The physical pixels map 1:1 to screen hardware coordinates, yielding sharp text and graphics.
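The size bookkeeping above can be sketched as follows. This is an illustrative assumption about how the logical and physical sizes relate, not ProGPU's actual code: `ToPhysical` is a hypothetical helper, and rounding up with `Math.Ceiling` is just one possible choice.

```csharp
using System;

// Hypothetical helper: converts a logical size to physical pixels for a given scale.
static (int Width, int Height) ToPhysical(double logicalWidth, double logicalHeight, double dpiScale)
    => ((int)Math.Ceiling(logicalWidth * dpiScale), (int)Math.Ceiling(logicalHeight * dpiScale));

// An 800x600 logical control on a 2.0x Retina display.
var (physicalWidth, physicalHeight) = ToPhysical(800, 600, 2.0);
Console.WriteLine($"{physicalWidth}x{physicalHeight}"); // 1600x1200

// A bitmap created at 96 DPI reports a logical size equal to its pixel size,
// so drawing it into the 800x600 logical bounds is the only scaling step:
// 800 logical units cover 1600 device pixels at 2.0x, one per bitmap pixel.
double devicePixelsCovered = 800 * 2.0;
Console.WriteLine(devicePixelsCovered == physicalWidth); // True
```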
The integration libraries bridge the event-handling loop symmetrically:
- Coordinate Translation: Pointer event handlers (`OnPointerMoved`, `OnPointerPressed`, etc.) intercept native positions, translate them into logical `Vector2` coordinates, and route them to ProGPU's input engine:

  ```csharp
  InputSystem.InjectMouseMove(new Vector2((float)pos.X, (float)pos.Y));
  ```

- Input State Invalidation: Input events mark the active WinUI input state dirty, forcing immediate layout hit-testing and scheduling repaint requests so that hover overlays and cursors update promptly.
To allow embedded graphics and animation benches to run at their physical display limit, standard timer loops are replaced by self-scheduling graphics dispatchers:
- Avalonia: Hooks directly into the system's VSync loop using:

  ```csharp
  TopLevel.RequestAnimationFrame(OnAnimationTick);
  ```

  This self-scheduling tick fires callbacks exactly aligned with the physical monitor's refresh rate, unlocking 120 FPS / 144 FPS rendering without frame tearing.
- Uno Platform: Subscribes directly to `CompositionTarget.Rendering` to drive the WebGPU command submissions and refresh statistics exactly aligned with each compositor pass.
To bypass the overhead of copying pixels from VRAM to CPU staging buffers and back to VRAM (double-copy blitting), ProGPU implements a cutting-edge Zero-Copy Shared Texture Rendering Pipeline. This architecture achieves direct GPU-to-GPU memory sharing between the offscreen WebGPU rendering engine and the host UI composition tree.
```mermaid
sequenceDiagram
    participant WebGPU as WebGPU Engine
    participant OS as OS Shared Resource (IOSurface / D3D11)
    participant Avalonia as Avalonia Compositor Tree
    participant GPU as physical GPU VRAM
    WebGPU->>OS: 1. Render directly to Shared Handle (Zero CPU Copy)
    OS->>GPU: 2. Texture contents persist in VRAM
    Avalonia->>OS: 3. Import Shared Handle via ICompositionGpuInterop
    Avalonia->>GPU: 4. Draw directly from VRAM (Zero Copy / 120 FPS+)
```
The Zero-Copy pipeline eliminates host CPU copies entirely by allocating a hardware-backed shared OS memory handle directly in C#, wrapping it inside WebGPU as a render target, and importing it into the host visual tree:
| Operating System | Shared Resource Type | Native Handle Reference | Allocation Strategy |
|---|---|---|---|
| macOS | Apple IOSurface | `IOSurfaceRef` (global handle) | CoreFoundation/AppKit unmanaged dictionary creation |
| Windows | Direct3D11 Shared Texture | DXGI `HANDLE` (global shared key) | Standalone `ID3D11Device` with `D3D11_RESOURCE_MISC_SHARED` |
CoreFoundation and Objective-C runtime P/Invokes are used to construct the surface configuration plist:
- `IOSurfaceWidth` & `IOSurfaceHeight`: Target dimensions.
- `IOSurfaceBytesPerElement`: 4 bytes per pixel.
- `IOSurfacePixelFormat`: `'BGRA'` (packed 32-bit integer `1111970369`).
- `IOSurfaceBytesPerRow`: Aligned to 256 bytes.
- `IOSurfaceAllocSize`: Total byte size.
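To show where the packed pixel-format value comes from, here is a minimal sketch that packs the four ASCII characters of `'BGRA'` into a 32-bit integer, first character in the most significant byte. `FourCC` is a hypothetical helper name used only for illustration.

```csharp
using System;

// Hypothetical helper: packs four ASCII characters into a 32-bit code,
// placing the first character in the most significant byte.
static uint FourCC(string code)
{
    if (code.Length != 4)
        throw new ArgumentException("Expected exactly four characters.", nameof(code));
    return ((uint)code[0] << 24) | ((uint)code[1] << 16) | ((uint)code[2] << 8) | (uint)code[3];
}

Console.WriteLine(FourCC("BGRA"));              // 1111970369
Console.WriteLine(FourCC("BGRA").ToString("X")); // 42475241
```

The hexadecimal form makes the mapping visible: `0x42` is `'B'`, `0x47` is `'G'`, `0x52` is `'R'`, and `0x41` is `'A'`.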
Direct COM VTable indexing is utilized to create resources dynamically:
- `D3D11CreateDevice`: Instantiates a standalone hardware D3D11 device.
- `CreateTexture2D`: Allocates the texture with `D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE` bind flags and the `D3D11_RESOURCE_MISC_SHARED` misc flag.
- `QueryInterface`: Extracts the `IDXGIResource` COM pointer.
- `GetSharedHandle`: Obtains the global shared handle pointer.
The host control hooks into Avalonia's composition engine during initialization:
- Query Interop Interface:

  ```csharp
  var interop = await compositor.TryGetCompositionGpuInterop();
  ```

- Verify Compatibility: Verify that the compositor's graphics backend supports the active platform's handle type (`IOSurfaceRef` on macOS, `D3D11TextureGlobalSharedHandle` on Windows).
- Import Image: Create a `PlatformHandle` from the allocated raw pointer and import it:

  ```csharp
  var platformHandle = new PlatformHandle(_sharedHandle, _gpuHandleType);
  _importedGpuImage = _gpuInterop.ImportImage(platformHandle, properties);
  ```

- Present via Composition Surface: Create a standard `CompositionSurfaceVisual` and assign its `Surface` to a `CompositionDrawingSurface`. On every tick, simply call:

  ```csharp
  _ = _drawingSurface.UpdateAsync(_importedGpuImage);
  ```

  This triggers a hardware-accelerated present, drawing the shared texture directly in the compositor loop without CPU copying.
Standard cross-platform wgpu-native bindings do not export helper functions out-of-the-box to wrap arbitrary IOSurfaceRef or shared ID3D11Texture2D handles into WebGPU texture objects.
To complete the zero-copy pipeline on the WebGPU side, a small custom native wrapper (written in Rust or C++) must bridge the HAL (Hardware Abstraction Layer) boundary:
```rust
// Custom native Rust crate bridging wgpu-core and OS handles
use wgpu_core::hub::Global;
use wgpu_hal::api::{Metal, Dx12};

#[no_mangle]
pub unsafe extern "C" fn wgpuDeviceCreateTextureFromMacIOSurface(
    device_ptr: *mut libc::c_void,
    iosurface_ptr: *mut libc::c_void,
    width: u32,
    height: u32
) -> *mut libc::c_void {
    let global = &*Global::default();
    // 1. Extract raw device representation
    let device_id = std::mem::transmute(device_ptr);
    // 2. Fetch the Metal device and wrap the IOSurface handle via wgpu_hal
    let surface: Metal::Texture = Metal::texture_from_raw(iosurface_ptr as *mut _);
    // 3. Register the newly created texture inside the wgpu-core context
    let texture_id = global.device_create_texture_from_hal::<Metal>(
        device_id,
        surface,
        width,
        height
    );
    std::mem::transmute(texture_id)
}
```

This bridge allows WebGPU command encoders to bind the texture as a standard `RenderPassColorAttachment`, completing the zero-copy pipeline.
To achieve VSync-locked rendering (120 FPS+) and completely eliminate UI-thread blocking or frame flickering, ProGPU utilizes a high-performance Asynchronous Double-Buffered Update Loop driven by a Dedicated Background Device Polling Thread.
This architecture guarantees 0% CPU blocking on the main UI thread and prevents read-write VRAM conflicts between the renderer and the host compositor.
```mermaid
sequenceDiagram
    participant UI as UI Thread (RenderFrameAsync)
    participant BG as Background Polling Thread
    participant WGPU as WebGPU Device / Queue
    participant Swap as SwapchainImage (Double Buffered)
    participant Comp as Avalonia Compositor Thread
    UI->>WGPU: 1. Render scene offscreen to WgpuTexture (Image A)
    UI->>WGPU: 2. Queue CopyTextureToStagingBuffer
    UI->>WGPU: 3. Invoke MapBufferAsync (non-blocking Task)
    Note over UI,BG: UI thread yields control immediately
    loop Continuous Polling
        BG->>WGPU: 4. wgpuDevicePoll(Device, false) every 2ms
    end
    WGPU-->>BG: 5. Mapping complete! Trigger MapCallback
    BG-->>UI: 6. Complete TaskCompletionSource (Resume UI)
    UI->>Swap: 7. CopyMappedToSharedTexture (MemoryCopy / UpdateSubresource)
    UI->>WGPU: 8. BufferUnmap
    UI->>Comp: 9. UpdateAsync (Swapchain Image A)
    Note over UI,Comp: Image A is now bound to Compositor. Swap to Image B.
```
A dedicated SwapchainImage class encapsulates the graphics assets for a single frame. The host control manages a pool of two swapchain images (SwapchainImage[2]):
- Compositor Frame Lock: One image is locked by the Avalonia compositor for current presentation.
- Renderer Target: The other image is being written to asynchronously by the WebGPU rendering loop.
- Role Swap: Once rendering and memory copies are completed, the roles are swapped in an alternating cycle: `_currentWriteImageIndex = (_currentWriteImageIndex + 1) % 2`.
```csharp
private class SwapchainImage : IDisposable
{
    public IntPtr SharedHandle;
    public ICompositionImportedGpuImage? ImportedImage;
    public GpuTexture? WgpuTexture;
    public IntPtr StagingBuffer;
    public uint StagingBufferSize;
    public uint BytesPerRow;

    // Windows Specific Direct3D 11 Resources
    public IntPtr WinD3DDevice;
    public IntPtr WinTexture2D;

    // Dispose() implementation omitted in this excerpt.
}
```

WebGPU asynchronous operations (such as staging buffer mapping) require the device queue event loop to be polled via `wgpuDevicePoll`.
To keep the UI and Avalonia render threads completely unblocked, ProGPU runs a continuous, low-latency background polling thread that executes wgpuDevicePoll every 2 milliseconds:
```csharp
private void StartPolling()
{
    _pollingThread = new Thread(() =>
    {
        while (!_pollingCts.Token.IsCancellationRequested)
        {
            wgpuDevicePoll(_wgpuContext.Device, false, null);
            Thread.Sleep(2);
        }
    }) { IsBackground = true, Name = "ProGpuDevicePolling" };
    _pollingThread.Start();
}
```

The buffer mapping callback is wrapped in a standard C# `TaskCompletionSource<bool>`. Calling `await MapBufferAsync(...)` suspends the rendering task without blocking any CPU execution context. The background polling thread completes the mapping asynchronously, waking up the rendering task instantly:
```csharp
private Task MapBufferAsync(IntPtr buffer, MapMode mode, nuint size)
{
    unsafe
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        var handle = GCHandle.Alloc(tcs);
        var userData = (void*)GCHandle.ToIntPtr(handle);
        _wgpuContext.Wgpu.BufferMapAsync((GpuBuffer*)buffer, mode, 0, size, s_mapCallback, userData);
        return tcs.Task;
    }
}
```

To comply with the C# compiler constraints that prohibit `await` operations inside `unsafe` contexts, ProGPU segregates low-level pointer copying into two dedicated synchronous unsafe helper functions:
- `CopyTextureToStagingBuffer`: Encodes the offscreen render-target texture copy to the staging buffer and submits the command buffer.
- `CopyMappedToSharedTexture`: Retrieves the staging buffer's mapped range, locks the native OS texture, copies raw bytes row-by-row, unlocks the texture, and unmaps the buffer.
```csharp
// macOS row-by-row IOSurface memory copy
GpuSharingInterop.IOSurfaceLock(image.SharedHandle, 0, null);
void* destPtr = GpuSharingInterop.IOSurfaceGetBaseAddress(image.SharedHandle);
// Executed once per row; srcRow and destRow are per-row offsets into the mapped buffer and destPtr.
System.Buffer.MemoryCopy(srcRow, destRow, rowBytes, rowBytes);
GpuSharingInterop.IOSurfaceUnlock(image.SharedHandle, 0, null);

// Windows D3D11 UpdateSubresource call via COM VTable index 49
GpuSharingInterop.COMHelper.CallUpdateSubresource(context, image.WinTexture2D, 0, IntPtr.Zero, mappedPtr, image.BytesPerRow, 0);
```

If graphics interop is not supported by the environment (e.g. software rendering, missing drivers, or Linux configurations lacking Vulkan opaque handles), the control gracefully falls back to the Decoupled Render-Thread Blitting Pipeline (Phase 2). This keeps the control functional, with visual parity, across host configurations.
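The handoff between the background polling thread and `TaskCompletionSource` described in this section can be tried without any GPU dependency. The sketch below is a simulation under stated assumptions: the `pending` queue is a hypothetical stand-in for the device's pending map callbacks, and the local `Poll` and `MapBufferAsync` functions only mimic the roles of `wgpuDevicePoll` and the real `MapBufferAsync`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the device's pending map callbacks: they only run when Poll() is called.
var pending = new ConcurrentQueue<Action>();
var cts = new CancellationTokenSource();

// Mimics a device poll: drains and invokes every queued completion callback.
void Poll()
{
    while (pending.TryDequeue(out var callback)) callback();
}

// Mimics an async map request: the task completes when a later Poll() runs the callback.
Task MapBufferAsync()
{
    // RunContinuationsAsynchronously keeps the awaiting code off the polling thread.
    var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    pending.Enqueue(() => tcs.TrySetResult(true));
    return tcs.Task;
}

// Background thread that pumps the simulated device every 2 ms.
var poller = new Thread(() =>
{
    while (!cts.Token.IsCancellationRequested)
    {
        Poll();
        Thread.Sleep(2);
    }
}) { IsBackground = true, Name = "SketchDevicePolling" };
poller.Start();

// The await suspends here; no thread spins while the "mapping" is pending.
await MapBufferAsync();
Console.WriteLine("mapped");

cts.Cancel();
poller.Join();
cts.Dispose();
```

In this sketch, removing the polling thread leaves the awaited task pending indefinitely, which is the reason a real device needs to be polled for the map callback to fire.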