On Life and Lisp

Dissecting the Apple M1 GPU, the end

Tue, 26 Aug 2025 00:00:00 -0500

In 2020, Apple released the M1 with a custom GPU. We got to work reverse-engineering the hardware and porting Linux. Today, you can run Linux on a range of M1 and M2 Macs, with almost all hardware working: wireless, audio, and full graphics acceleration.

Our story begins in December 2020, when Hector Martin kicked off Asahi Linux. I was working for Collabora working on Panfrost, the open source Mesa3D driver for Arm Mali GPUs. Hector put out a public call for guidance from upstream open source maintainers, and I bit. I just intended to give some quick pointers. Instead, I bought myself a Christmas present and got to work. In between my university coursework and Collabora work, I poked at the shader instruction set.

One thing led to another. Within a few weeks, I drew a triangle.

In 3D graphics, once you can draw a triangle, you can do anything.

Pretty soon, I started work on a shader compiler. After my final exams that semester, I took a few days off from Collabora to bring up an OpenGL driver capable of spinning gears with my new compiler.

Over the next year, I kept reverse-engineering and improving the driver until it could run 3D games on macOS.

Meanwhile, Asahi Lina wrote a kernel driver for the Apple GPU. My userspace OpenGL driver ran on macOS, leaving her kernel driver as the missing piece for an open source graphics stack. In December 2022, we shipped graphics acceleration in Asahi Linux.

In January 2023, I started my final semester in my Computer Science program at the University of Toronto. For years I juggled my courses with my part-time job and my hobby driver. I faced the same question as my peers: what will I do after graduation?

Maybe Panfrost? I started reverse-engineering of the Mali Midgard GPU back in 2017, when I was still in high school. That led to an internship at Collabora in 2019 once I graduated, turning into my job throughout four years of university. During that time, Panfrost grew from a kid’s pet project based on blackbox reverse-engineering, to a professional driver engineered by a team with Arm’s backing and hardware documentation. I did what I set out to do, and the project succeeded beyond my dreams. It was time to move on.

What did I want to do next?

Finish what I started with the M1. Ship a great driver.
Bring full, conformant OpenGL drivers to the M1. Apple’s drivers are not conformant, but we should strive for the industry standard.
Bring full, conformant Vulkan to Apple platforms, disproving the myth that Vulkan isn’t suitable for Apple hardware.
Bring Proton gaming to Asahi Linux. Thanks to Valve’s work for the Steam Deck, Windows games can run better on Linux than even on Windows. Why not reap those benefits on the M1?

Panfrost was my challenge until we “won”. My next challenge? Gaming on Linux on M1.

Once I finished my coursework, I started full-time on gaming on Linux. Within a month, we shipped OpenGL 3.1 on Asahi Linux. A few weeks later, we passed official conformance for OpenGL ES 3.1. That put us at feature parity with Panfrost. I wanted to go further.

OpenGL (ES) 3.2 requires geometry shaders, a legacy feature not supported by either Arm or Apple hardware. The proprietary OpenGL drivers emulate geometry shaders with compute, but there was no open source prior art to borrow. Even though multiple Mesa drivers need geometry/tessellation emulation, nobody did the work to get there.

My early progress on OpenGL was fast thanks to the mature common code in Mesa. It was time to pay it forward. Over the rest of the year, I implemented geometry/tessellation shader emulation. And also the rest of the owl. In January 2024, I passed conformance for the full OpenGL 4.6 specification, finishing up OpenGL.

Vulkan wasn’t too bad, either. I polished the OpenGL driver for a few months, but once I started typing a Vulkan driver, I passed 1.3 conformance in a few weeks.

What remained was wiring up the geometry/tessellation emulation to my shiny new Vulkan driver, since those are required for Direct3D. Et voilà, Proton games.

Along the way, Karol Herbst passed OpenCL 3.0 conformance on the M1, running my compiler atop his “rusticl” frontend.

Meanwhile, when the Vulkan 1.4 specification was published, we were ready and shipped a conformant implementation on the same day.

After that, I implemented sparse texture support, unlocking Direct3D 12 via Proton.

…Now what?

Ship a great driver? Check.
Conformant OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0? Check.
Conformant Vulkan 1.4? Check.
Proton gaming? Check.

That’s a wrap.

We’ve succeeded beyond my dreams. The challenges I chased, I have tackled. The drivers are fully upstream in Mesa. Performance isn’t too bad. With the Vulkan on Apple myth busted, conformant Vulkan is now coming to macOS via LunarG’s KosmicKrisp project building on my work.

Satisfied, I am now stepping away from the Apple ecosystem. My friends in the Asahi Linux orbit will carry the torch from here. As for me?

Onto the next challenge!

Vulkan 1.4 sur Asahi Linux

Mon, 02 Dec 2024 00:00:00 -0500

English version follows.

Aujourd’hui, Khronos Group a sorti la spécification 1.4 de l’API graphique standard Vulkan. Le projet Asahi Linux est fier d’annoncer le premier pilote Vulkan 1.4 pour le matériel d’Apple. En effet, notre pilote graphique Honeykrisp est reconnu par Khronos comme conforme à cette nouvelle version dès aujourd’hui.

Ce pilote est déjà disponible dans nos dépôts officiels. Après avoir installé Fedora Asahi Remix, executez dnf upgrade --refresh pour obtenir la dernière version du pilote.

Vulkan 1.4 standardise plusieurs fonctionnalités importantes, y compris les horodatages et la lecture locale avec le rendu dynamique. L’industrie suppose que ces fonctionnalités devront être plus courantes, et nous y sommes préparés.

Sortir un pilote conforme reflète notre engagement en faveur des standards graphiques et du logiciel libre. Asahi Linux est aussi compatible avec OpenGL 4.6, OpenGL ES 3.2, et OpenCL 3.0, tous conformes aux spécifications pertinentes. D’ailleurs, les nôtres sont les seuls pilotes conformes pour le materiel d’Apple de n’importe quel standard graphique.

Même si le pilote est sorti, il faut encore compiler une version expérimentale de Vulkan-Loader pour utiliser la nouvelle version de Vulkan. Toutes les nouvelles fonctionnalités sont néanmoins disponibles comme extensions à notre pilote Vulkan 1.3 pour en profiter tout de suite.

Pour plus d’informations, consultez l’article du blog de Khronos.

Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp driver is Khronos-recognized as conformant to the new version since day one.

That driver is already available in our official repositories. After installing Fedora Asahi Remix, run dnf upgrade --refresh to get the latest drivers.

Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. The industry expects that these features will become more common, and we are prepared.

Releasing a conformant driver reflects our commitment to graphics standards and software freedom. Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard.

Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions in our Vulkan 1.3 driver.

For more information, see the Khronos blog post.

AAA gaming on Asahi Linux

Thu, 10 Oct 2024 00:00:00 -0500

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0.

Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well!

Installation

First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead.

The stack

Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference:

FEX emulates x86 on Arm.
Wine translates Windows to Linux.
DXVK and vkd3d-proton translate DirectX to Vulkan.

There’s one curveball: page size. Operating systems allocate memory in fixed size “pages”. If an application expects smaller pages than the system uses, they will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages.

While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4.

Vulkan

The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features.

Tessellation

Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024.

Geometry shaders

Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner.

Enhanced robustness

“Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1.

Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness.

Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load.

load R, buffer, index * 16
ulesel R[0], index, size, R[0], 0
ulesel R[1], index, size, R[1], 0
ulesel R[2], index, size, R[2], 0
ulesel R[3], index, size, R[3], 0

There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects.

ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
load R, buffer, index * 16

Two instructions, not four.

Next steps

Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077.

While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed.

Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ.

Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”.

Acknowledgements

This work has been years in the making with major contributions from…

… Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source.

We hope you enjoy the magic.

Happy gaming.

Vulkan 1.3 on the M1 in 1 month

Wed, 05 Jun 2024 00:00:00 -0500

Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers.

Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers.

Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words:

All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.

Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games.

We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period?

It’s unprecedented…

Challenge accepted.

April 2

It begins with a text.

Faith… I think I want to write a Vulkan driver.

Her advice?

Just start typing.

There’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay.

April 3

To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add.

April 4

With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works.

That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes.

April 5

Fleshing out yesterday’s code, all copy tests pass.

April 6

We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on.

April 7

What makes state “dynamic”? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called “pipelines”. If games create all their pipelines during a loading screen, there is no compiler “stutter” during gameplay. The idea hasn’t quite panned out: many game developers don’t know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object extension that makes pipelines optional.

We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines.

Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance.

I am not reasonable.

To eliminate stuttering in OpenGL, we make state dynamic with four strategies:

Conditional code.
Precompiled variants.
Indirection.
Prologs and epilogs.

Wait, what-a-logs?

AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too.

For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs.

Putting it together, we get a (dynamic) triangle:

April 8

Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours.

/* Translate an American VkBorderColor into a Canadian agx_border_colour */
enum agx_border_colour
translate_border_color(VkBorderColor color)
{
   switch (color) {
   case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
      return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
   ...
   }
}

Test results are getting there.

Pass: 149770, Fail: 7741, Crash: 2396

That’s good enough for vkQuake.

April 9

Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line.

Pass: 255209, Fail: 3818, Crash: 599

98.3% pass rate for 1.3 on our 1 week anniversary.

Not bad.

April 10

SuperTuxKart has a Vulkan renderer.

April 11

Zink works too.

April 12

I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks.

April 16

The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to workaround. Fixing that fixes the descriptor indexing tests.

April 17

A few tests crash inside our register allocator. Their shaders contain a peculiar construction:

if (condition) {
   while (true) { }
}

condition is always false, but the compiler doesn’t know that.

Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy.

April 18

Remember copies? They’re slow, and every frame currently requires a copy to get on screen.

For “zero copy” rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses “modifiers” for this purpose, so we implement the EXT_image_drm_format_modifier extension. And by implement, I mean copy.

Copies to avoid copies.

April 20

“I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.”

…

“Ma’am, this is a Wendy’s.”

April 22

As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead.

Is that a problem? After some optimization, vkoverhead says we’re pushing 100 million draws per second.

I think we’re okay.

April 24

Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code.

It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK.

Which means he spent a summer adding it to Honeykrisp.

Thanks, Mohamed ;-)

April 25

Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU.

For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking.

For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right?

…Actually, we can.

A little witchcraft makes GPU query copies as easy as C.

void copy_query(struct params *p, int i) {
  uintptr_t dst = p->dest + i * p->stride;
  int query = p->first + i;

  if (p->available[query] || p->partial) {
    int q = p->index[query];
    write_result(dst, p->_64, p->results[q]);
  }

  ...
}

April 26

The final boss: border colours, hard mode.

Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours:

(0, 0, 0, 0) – transparent black
(0, 0, 0, 1) – opaque black
(1, 1, 1, 1) – opaque white

We handled these on April 8. Unfortunately, there are two problems.

First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the EXT_custom_border_color extension.

Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle.

Some formats are implicitly reordered. Common “BGRA” formats swap red and blue for historical reasons. The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format’s reordering. If the application uses a BARB swizzle with a BGRA format, the driver uses an RABR swizzle with an RGBA format.

There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information.

Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black.

There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha.

That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result.

The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh.

We’re stuck. No known hardware configuration implements correct Vulkan semantics.

Is hope lost?

Do we give up?

A reasonable person would.

I am not reasonable.

Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support.

As you know, I am not reasonable.

Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests.

April 27

All that’s left is some last minute bug fixing, and…

Pass: 686930, Fail: 0

Success.

The future

The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux.

That’s getting ahead of ourselves. In the mean time, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned.

Conformant OpenGL 4.6 on the M1

Wed, 14 Feb 2024 00:00:00 -0500

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers.

Already installed? Just dnf upgrade --refresh.

Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender.

Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2.

While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan.

Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:

Robustness
SPIR-V
Clip control
Cull distance
Compute shaders
Upgraded transform feedback

Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.

How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on.

For a taste of the challenges we overcame, let’s look at robustness.

Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance.

For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”.

“Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface.

All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.

Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:

Zero (Direct3D, Vulkan with robustBufferAccess2)
Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess)
Arbitrary values, but can’t crash (OpenGL ES)

OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.

As an example, a uniform buffer load without robustness might look like:

load.i32 result, buffer, index

Robustness adds a single unsigned minimum (umin) instruction:

umin idx, index, last
load.i32 result, buffer, idx

Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency.

There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up.

We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.

Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:

Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.

One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

imad idx, stride/4, vertex, offset/4
load.i32 result, base, idx

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

load.v4i32 result, base, vertex << 2

…with the hardware calculating the address:

What about robustness?

We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”.

Let’s handle the latter problem. We can rewrite the addressing equation as:

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

umin idx, vertex, last valid
load.v4i32 result, base, idx << 2

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

The driver calculates the right-hand side and passes it into the shader.

One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.

Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.

In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes.

…But it would be no fun if our hardware was reasonable.

Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”.

For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.

With robustness, the specifications all agree that image loads return…

Zero if the X- or Y-coordinate is out-of-bounds
Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

Zero if the X- or Y-coordinate is out-of-bounds
Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.

The obvious workaround is to never load from an invalid level:

if (level <= levels) {
    return imageLoad(x, y, level);
} else {
    return 0;
}

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

vec4 data = imageLoad(x, y, level);

return (level <= levels) ? data : 0;

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

image_load R, x, y, level
ulesel R[0], level, levels, R[0], 0
ulesel R[1], level, levels, R[1], 0
ulesel R[2], level, levels, R[2], 0
ulesel R[3], level, levels, R[3], 0

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

bool valid = (level <= levels);
int x_ = valid ? x : 20000;

return imageLoad(x_, y, level);

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

ulesel x_, level, levels, x, #20000
image_load R, x_, y, level

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.

Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.

The first conformant M1 GPU driver

Tue, 22 Aug 2023 00:00:00 -0500

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux!

For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers.

Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry.

To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max.

Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-)

Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux.

Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics.

Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation…

It’s not too late though. They should follow our lead!

OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster.

Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition.

“Atomic” access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms.

Can we put these two features together to write to an image atomically?

Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).

Other GPUs have dedicated instructions to perform atomics on an images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.

The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address.

If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel.

Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve.

We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better.

There is a well-known “bit twiddling” algorithm to interleave bits. Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.

In practice, only the lower 7-bits (or less) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16-bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:

# Inputs x, y in r0l, r0h.
# Output in r1.

add r2, #0, r0, lsl 4
or  r1, r0, r2
and r1, r1, #0xf0f0f0f
add r2, #0, r1, lsl 2
or  r1, r1, r2
and r1, r1, #0x33333333
add r2, #0, r1, lsl 1
or  r1, r1, r2
and r1, r1, #0x55555555
add r1, r1l, r1h, lsl 1

We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders.

It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.

Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions.

`00`	? ? ?
`01`	Reverse bits
`10`	Count set bits
`11`	Find first set

There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01.

There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.

Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver.

All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion ($2^32$) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.

As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests.

And that’s what matters.

Thank you to Khronos and Software in the Public Interest for supporting open drivers.

OpenGL 3.1 on Asahi Linux

Tue, 06 Jun 2023 00:00:00 -0500

Upgrade your Asahi Linux systems, because your graphics drivers are getting a big boost: leapfrogging from OpenGL 2.1 over OpenGL 3.0 up to OpenGL 3.1! Similarly, the OpenGL ES 2.0 support is bumping up to OpenGL ES 3.0. That means more playable games and more functioning applications.

Back in December, I teased an early screenshot of SuperTuxKart’s deferred renderer working on Asahi, using OpenGL ES 3.0 features like multiple render targets and instancing. Now you too can enjoy SuperTuxKart with advanced lighting the way it’s meant to be:

As before, these drivers are experimental and not yet conformant to the OpenGL or OpenGL ES specifications. For now, you’ll need to run our -edge packages to opt-in to the work-in-progress drivers, understanding that there may be bugs. Please refer to our previous post explaining how to install the drivers and how to report bugs to help us improve.

With that disclaimer out of the way, there’s a LOT of new functionality packed into OpenGL 3.0, 3.1, and OpenGL ES 3.0 to make this release. Highlights include:

Multiple render targets
Multisampling
Transform feedback
Texture buffer objects
..and more.

For now, let’s talk about…

Multisampling

Vulkan and OpenGL support multisampling, short for multisampled anti-aliasing. In graphics, aliasing causes jagged diagonal edges due to rendering at insufficient resolution. One solution to aliasing is rendering at higher resolutions and scaling down. Edges will be blurred, not jagged, which looks better. Multisampling is an efficient implementation of that idea.

A multisampled image contains multiple samples for every pixel. After rendering, a multisampled image is resolved to a regular image with one sample per pixel, typically by averaging the samples within a pixel.

Apple GPUs support multisampled images and framebuffers. There’s quite a bit of typing to plumb the programmer’s view of multisampling into the form understood by the hardware, but there’s no fundamental incompatibility.

The trouble comes with sample shading. Recall that in modern graphics, the colour of each fragment is determined by running a fragment shader given by the programmer. If the fragments are pixels, then each sample within that pixel gets the same colour. Running the fragment shader once per pixel still benefits from multisampling thanks to higher quality rasterization, but it’s not as good as actually rendering at a higher resolution. If instead the fragments are samples, each sample gets a unique colour, equivalent to rendering at a higher resolution (supersampling). In Vulkan and OpenGL, fragment shaders generally run per-pixel, but with “sample shading”, the application can force the fragment shader to run per-sample.

How does sample shading work from the drivers’ perspective? On a typical GPU, it is simple: the driver compiles a fragment shader that calculates the colour of a single sample, and sets a hardware bit to execute it per-sample instead of per-pixel. There is only one bit of state associated with sample shading. The hardware will execute the fragment shader multiple times per pixel, writing out pixel colours independently.

Easy, right?

Alas, Apple’s “AGX” GPU is not typical.

AGX always executes the shader once per pixel, not once per sample, like older GPUs that did not support sample shading. AGX does support it, though.

How? The AGX instruction set allows pixel shaders to output different colours to each sample. The instruction used to output a colour¹ takes a set of samples to modify, encoded as a bit mask. The default all-1’s mask writes the same value to all samples in a pixel, but a mask setting a single bit will write only the single corresponding sample.

This design is unusual, and it requires driver backflips to translate “fragment shaders” into hardware pixel shaders. How do we do it?

Physically, the hardware executes our shader once per pixel. Logically, we’re supposed to execute the application’s fragment shader once per sample. If we know the number of samples per pixel, then we can wrap the application’s shader in a loop over each sample. So, if the original fragment shader is:

interpolated colour = interpolate at current sample(input colour);
output current sample(interpolated colour);

then we will transform the program to the pixel shader:

for (sample = 0; sample < number of samples; ++sample) {
    sample mask = (1 << sample);
    interpolated colour = interpolate at sample(input colour, sample);
    output samples(sample mask, interpolated colour);
}

The original fragment shader runs inside the loop, once per sample. Whenever it interpolates inputs at the current sample position, we change it to instead interpolate at a specific sample given by the loop counter sample. Likewise, when it outputs a colour for a sample, we change it to output the colour to the single sample given by the loop counter.

If the story ended here, this mechanism would be silly. Adding sample masks to the instruction set is more complicated than a single bit to invoke the shader multiple times, as other GPUs do. Even Apple’s own Metal driver has to implement this dance, because Metal has a similar approach to sample shading as OpenGL and Vulkan. With all this extra complexity, is there a benefit?

If we generated that loop at the end, maybe not. But if we know at compile-time that sample shading is used, we can run our full optimizer on this sample loop. If there is an expression that is the same for all samples in a pixel, it can be hoisted out of the loop.² Instead of calculating the same value multiple times, as other GPUs do, the value can be calculated just once and reused for each sample. Although it complicates the driver, this approach to sample shading isn’t Apple cutting corners. If we slapped on the loop at the end and did no optimizations, the resulting code would be comparable to what other GPUs execute in hardware. There might be slight differences from spawning fewer threads but executing more control flow instructions³, but that’s minor. Generating the loop early and running the optimizer enables better performance than possible on other GPUs.

So is the mechanism only an optimization? Did Apple stumble on a better approach to sample shading that other GPUs should adopt? I wouldn’t be so sure.

Let’s pull the curtain back. AGX has its roots as a mobile GPU intended for iPhones, with significant PowerVR heritage. Even if it powers Mac Pros today, the mobile legacy means AGX prefers software implementations of many features that desktop GPUs implement with dedicated hardware.

Yes, I’m talking about blending.

Blending is an operation in graphics APIs to combine the fragment shader output colour with the existing colour in the framebuffer. It is usually used to implement alpha blending, to let the background poke through translucent objects.

When multisampling is used without sample shading, although the fragment shader only runs once per pixel, blending happens per-sample. Even if the fragment shader outputs the same colour to each sample, if the framebuffer already had different colours in different samples, blending needs to happen per-sample to avoid losing that information already in the framebuffer.

A traditional desktop GPU blends with dedicated hardware. In the mobile space, there’s a mix of dedicated hardware and software. On AGX, blending is purely software. Rather than configure blending hardware, the driver must produce variants of the fragment shader that include instructions to implement the desired blend mode. With alpha blending, a fragment shader like:

colour = calculate lighting();
output(colour);

becomes:

colour = calculate lighting();
dest = load destination colour;
alpha = colour.alpha;
blended = (alpha * colour) + ((1 - alpha) * dest));
output(blended);

Where’s the problem?

Blending happens per sample. Even if the application intends to run the fragment shader per pixel, the shader must run per sample for correct blending. Compared to other GPUs, this approach to blending would regress performance when blending and multisampling are enabled but sample shading is not.

On the other hand, exposing multisample pixel shaders to the driver solves the problem neatly. If both the blending and the multisample state are known, we can first insert instructions for blending, and then wrap with the sample loop. The above program would then become:

for (sample = 0; sample < number of samples; ++sample_id) {
    colour = calculate lighting();

    dest = load destination colour at sample (sample);
    alpha = colour.alpha;
    blended = (alpha * colour) + ((1 - alpha) * dest);

    sample mask = (1 << sample);
    output samples(sample_mask, blended);
}

In this form, the fragment shader is asymptotically worse than the application wanted: the fragment shader is executed inside the loop, running per-sample unnecessarily.

Have no fear, the optimizer is here. Since colour is the same for each sample in the pixel, it does not depend on the sample ID. The compiler can move the entire original fragment shader (and related expressions) out of the per-sample loop:

colour = calculate lighting();
alpha = colour.alpha;
inv_alpha = 1 - alpha;
colour_alpha = alpha * colour;

for (sample = 0; sample < number of samples; ++sample_id) {
    dest = load destination colour at sample (sample);
    blended = colour_alpha + (inv_alpha * dest);

    sample mask = (1 << sample);
    output samples(sample_mask, blended);
}

Now blending happens per sample but the application’s fragment shader runs just once, matching the performance characteristics of traditional GPUs. Even better, all of this happens without any special work from the compiler. There’s no magic multisampling optimization happening here: it’s just a loop.

By the way, what do we do if we don’t know the blending and multisample state at compile-time? Hope is not lost…

…but that’s a story for another day.

What’s next?

While OpenGL ES 3.0 is an improvement over ES 2.0, we’re not done. In my work-in-progress branch, OpenGL ES 3.1 support is nearly finished, which will unlock compute shaders.

The final goal is a Vulkan driver running modern games. We’re a while away, but the baseline Vulkan 1.0 requirements parallel OpenGL ES 3.1, so our work translates to Vulkan. For example, the multisampling compiler passes described above are common code between the drivers. We’ve tested them against OpenGL, and now they’re ready to go for Vulkan.

And yes, the team is already working on Vulkan.

Until then, you’re one pacman -Syu away from enjoying OpenGL 3.1!

Store a formatted value to local memory acting as a tilebuffer.↩︎
Via common subexpression elimination if the loop is unrolled, otherwise via code motion.↩︎
Since the number of samples is constant, all threads branch in the same direction so the usual “GPUs are bad at branching” advice does not apply.↩︎

Passing the reins on Panfrost

Mon, 10 Apr 2023 00:00:00 -0500

Today is my last day at Collabora and my last day leading the Panfrost driver.

It’s been a wild ride.

In 2017, I began work on the chai driver for Mali T (Midgard). chai would later be merged into Lyude Paul’s and Connor Abbott’s BiOpenly project for Mali G (Bifrost) to form Panfrost.

In 2019, I joined Collabora to accelerate work on the driver stack. The initial goal was to run GNOME on a Mali-T860 Chromebook.

Huge success.

Today, Panfrost supports a broad spectrum of Mali GPUs, conformant to the OpenGL ES 3.1 specification on Mali-G52 and Mali-G57. It’s hard to overstate how far we’ve come. I’ve had the thrills of architecting several backend shader compilers as well as the Gallium-based OpenGL driver, while my dear colleague Boris Brezillon has put together a proof-of-concept Vulkan driver which I think you’ll hear more about soon.

Lately, my focus has been ensuring the project can stand on its own four legs. I have every confidence in other Collaborans hacking on Panfrost, including Boris and Italo Nicola. The project has a bright future. It’s time for me to pass the reins.

I’m still alive. I plan to continue working on Mesa drivers for a long time, including the common infrastructure upon which Panfrost relies. And I’ll still send the odd Panfrost patch now and then. That said, my focus will shift.

I’m not ready to announce what’s in store yet… but maybe you can read between the lines!

Apple GPU drivers now in Asahi Linux

Wed, 07 Dec 2022 00:00:00 -0500

We’re excited to announce our first Apple GPU driver release!

We’ve been working hard over the past two years to bring this new driver to everyone, and we’re really proud to finally be here. This is still an alpha driver, but it’s already good enough to run a smooth desktop experience and some games.

Read on to find out more about the state of things today, how to install it (it’s an opt-in package), and how to report bugs!

Status

This release features work-in-progress OpenGL 2.1 and OpenGL ES 2.0 support for all current Apple M-series systems. That’s enough for hardware acceleration with desktop environments, like GNOME and KDE. It’s also enough for older 3D games, like Quake3 and Neverball. While there’s always room for improvement, the driver is fast enough to run all of the above at 60 frames per second at 4K.

Please note: these drivers have not yet passed the OpenGL (ES) conformance tests. There will be bugs!

What’s next? Supporting more applications. While OpenGL (ES) 2 suffices for some applications, newer ones (especially games) demand more OpenGL features. OpenGL (ES) 3 brings with it a slew of new features, like multiple render targets, multisampling, and transform feedback. Work on these features is well under way, but they will each take a great deal of additional development effort, and all are needed before OpenGL (ES) 3.0 is available.

What about Vulkan? We’re working on it! Although we’re only shipping OpenGL right now, we’re designing with Vulkan in mind. Most of the work we’re putting toward OpenGL will be reused for Vulkan. We estimated that we could ship working OpenGL 2 drivers much sooner than a working Vulkan 1.0 driver, and we wanted to get hardware accelerated desktops into your hands as soon as possible. For the most part, those desktops use OpenGL, so supporting OpenGL first made more sense to us than diving into the Vulkan deep end, only to use Zink to translate OpenGL 2 to Vulkan to run desktops. Plus, there is a large spectrum of OpenGL support, with OpenGL 2.1 containing a fraction of the features of OpenGL 4.6. The same is true for Vulkan: the baseline Vulkan 1.0 profile is roughly equivalent to OpenGL ES 3.1, but applications these days want Vulkan 1.3 with tons of extensions and “optional” features. Zink’s “layering” of OpenGL on top of Vulkan isn’t magic: it can only expose the OpenGL features that the underlying Vulkan driver has. A baseline Vulkan 1.0 driver isn’t even enough to get OpenGL 2.1 on Zink! Zink itself advertises support for OpenGL 4.6, but of course that’s only when paired with Vulkan drivers that support the equivalent of OpenGL 4.6… and that gets us back to a tremendous amount of time and effort.

When will OpenGL 3 support be ready? OpenGL 4? Vulkan 1.0? Vulkan 1.3? In community open source projects, it’s said that every time somebody asks when a feature will be done, it delays that feature by a month. Well, a lot of people have been asking…

At any rate, for a sneak peek… here is SuperTuxKart’s deferred renderer running at full speed, making liberal use of OpenGL ES 3 features like multiple render targets~

Anatomy of a GPU driver

Modern GPUs consist of many distinct “layered” parts. There is…

a memory management unit and an interface to submit memory-mapped work to the hardware
fixed-function 3D hardware to rasterize triangles, perform depth/stencil testing, and more
programmable “shader cores” (like little CPUs with bespoke instruction sets) with work dispatched by the fixed-function hardware

This “layered” hardware demands a “layered” graphics driver stack. We need…

a kernel driver to map memory and submit memory-mapped work
a userspace driver to translate OpenGL and Vulkan calls into hardware-specific data structures in graphics memory
a compiler translating shading programming languages like GLSL to the hardware’s instruction set

That’s a lot of work, calling for a team effort! Fortunately, that layering gives us natural boundaries to divide work among our small team.

Alyssa Rosenzweig is writing the OpenGL driver and compiler.
Asahi Lina is writing the kernel driver and helping with OpenGL.
Dougall Johnson is reverse-engineering the instruction set with Alyssa.

Meanwhile, Ella Stanforth is working on a Vulkan driver, reusing the kernel driver, the compiler, and some code shared with the OpenGL driver.

Of course, we couldn’t build an OpenGL driver in under two years just ourselves. Thanks to the power of free and open source software, we stand on the shoulders of FOSS giants. The compiler implements a “NIR” backend, where NIR is a powerful intermediate representation, including GLSL to NIR translation. The kernel driver users the “Direct Rendering Manager” (DRM) subsystem of the Linux kernel to minimize boilerplate. Finally, the OpenGL driver implements the “Gallium3D” API inside of Mesa, the home for open source OpenGL and Vulkan drivers. Through Mesa and Gallium3D, we benefit from thirty years of OpenGL driver development, with common code translating OpenGL into the much simpler Gallium3D. Thanks to the incredible engineering of NIR, Mesa, and Gallium3D, our ragtag team of reverse-engineers can focus on what’s left: the Apple hardware.

Installation instructions

To get the new drivers, you need to run the linux-asahi-edge kernel and also install the mesa-asahi-edge Mesa package.

$ sudo pacman -Syu
$ sudo pacman -S linux-asahi-edge mesa-asahi-edge
$ sudo update-grub

Since only one version of Mesa can be installed at a time, pacman will prompt you to replace mesa with mesa-asahi-edge. This is normal!

We also recommend running Wayland instead of Xorg at this point, so if you’re using the KDE Plasma environment, make sure to install the Wayland session:

$ sudo pacman -S plasma-wayland-session

Then reboot, pick the Wayland session at the top of the login screen (SDDM), and enjoy! You might want to adjust the screen scale factor in System Settings → Display and Monitor (Plasma Wayland defaults to 100% or 200%, while 150% is often nicer). If you have “Force font DPI” enabled under Appearance → Fonts, you should disable that (it is saved separately for Wayland and Xorg, and shouldn’t be necessary on Wayland sessions). Log out and back in for these changes to fully apply.

Xorg and Xorg-based desktop environments should work, but there are a few known issues:

Expect screen tearing (this might be fixed soon)
VSync does not work (some KDE animations will be too fast, and GL apps will not limit their FPS even with VSync enabled). This is a limitation of Xorg on the Apple DCP display controllers, which do not support VBlank interrupts.
There are still driver bugs triggered by Xorg/KWin. We’re looking into this.

The linux-asahi-edge kernel can be installed side-by-side with the standard linux-asahi package, but both versions should be kept in sync, so make sure to always update your packages together! You can always pick the linux-asahi kernel in the GRUB boot menu, which will disable GPU acceleration and the DCP display driver.

When the packages are updated in the future, it’s possible that graphical apps will stop starting up after an update until you reboot, or they may fall back to software rendering. This is normal. Until the UAPI is stable, we’ll have to break compatibility between Mesa and the kernel every now and then, so you will need to reboot to make things work after updates. In general, if apps do keep working with acceleration after any particular Mesa update, then it’s probably safe not to reboot, but you should still do it to make sure you’re running the latest kernel!

Reporting bugs

Since the driver is still in development, there are lots of known issues and we’re still working hard on improving conformance test results. Please don’t open new bugs for random apps not working! It’s still the early days and we know there’s a lot of work to do. Here’s a quick guide of how to report bugs:

If you find an app that does not start up at all, please don’t report it as a bug. Lots of apps won’t work because they require a newer GL version than what we support. Please set the LIBGL_ALWAYS_SOFTWARE=1 environment variable for those apps to fall back to software rendering. If it is a popular app that is part of the Arch Linux ARM repository, you can make a comment on this issue instead, so we can add Mesa quirks to workaround.
If you run into issues caused by linux-asahi-edge unrelated to the GPU, please add a comment to this issue. This includes display output issues! (Resolutions, backlight control, display power control, etc.)
If the GPU locks up and all GPU apps stop working, run asahi-diagnose (for example, from an SSH session), open a new bug on the AsahiLinux/linux repository, attach the file generated by that command, and tell us what you were doing that caused the lockup.
For other GPU issues (rendering glitches, apps that crash after starting up correctly, and things like that), run asahi-diagnose and make a comment on this issue, attaching the file generated by that command. Don’t forget to tell us about your environment!
In the future, if a driver update causes a regression (rendering problems or crashes for apps that previously worked properly), you can open a bug directly in the Mesa tracker.

We hope you enjoy our driver! Remember, things are still moving quickly, so make sure to update your packages regularly to get updates and bug fixes!

Co-written with Asahi Lina. Can you tell who wrote what?

Clip control on the Apple GPU

Mon, 22 Aug 2022 00:00:00 -0500

After a year in development, the open source “Asahi” driver for the Apple GPU is running real games. There’s more to do, but Neverball is already playable (and a lot of fun!).

Neverball uses legacy “fixed function” OpenGL. Rather than supply programmable shaders like OpenGL 2, old OpenGL 1 applications configure a fixed set of graphics effects like fog and alpha testing. Modern GPUs don’t implement these features in hardware. Instead, the driver synthesizes shaders implementing the desired graphics. This translation is complicated, but we get it for “free” as an open source driver in Mesa. If we implement the modern shader pipeline, Mesa will handle fixed function OpenGL for us transparently. That’s a win for open source drivers, and a win for GPU acceleration on Asahi Linux.

To implement the modern OpenGL features, we rely on reverse-engineering the behaviour of Apple’s Metal driver, as we don’t have hardware documentation. Although Metal uses the same shader pipeline as OpenGL, it doesn’t support all the OpenGL features that the hardware does, which puts us in bind. In the past, I’ve relied on educated guesswork to bridge the gap, but there’s another solution… and it’s a doozy.

For motivation, consider the clip space used in OpenGL. In every other API on the planet, the Z component (depth) of points in the 3D world range from 0 to 1, where 0 is “near” and 1 is “far”. In OpenGL, however, Z ranges from negative 1 to 1. As Metal uses the 0/1 clip space, implementing OpenGL on Metal requires emulating the -1/1 clip space by inserting extra instructions into the vertex shader to transform the Z coordinate. Although this emulation adds overhead, it works for ANGLE’s open source implementation of OpenGL ES on Metal.

Like ANGLE, Apple’s OpenGL driver internally translates to Metal. Because Metal uses the 0 to 1 clip space, it should require this emulation code. Curiously, when we disassemble shaders compiled with their OpenGL implementation, we don’t see any such emulation. That means Apple’s GPU must support -1/1 clip spaces in addition to Metal’s preferred 0/1. The problem is figuring out how to use this other clip space.

We expect that there’s a bit toggling between these clip spaces. The logical place for such a bit is the viewport packet, but there’s no obvious difference between the viewport packets emitted by Metal and OpenGL-on-Metal. Ordinarily, we would identify the bit by toggling the clip space in Metal and comparing memory dumps. However, according to Apple’s documentation, there’s no way to change the clip space in Metal.

That’s an apparently contradiction. There’s no way to use the -1/1 clip space with Metal, but Apple’s OpenGL-on-Metal translator uses uses the -1/1 clip space. What gives?

Here’s a little secret: there are two graphics APIs called “Metal”. There’s the Metal you know, a limited API that Apple documents for App Store developers, an API that lacks useful features supported by OpenGL and Vulkan.

And there’s the Metal that Apple uses themselves, an internal API adding back features that Apple doesn’t want you using. While ANGLE implements OpenGL ES on the documented Metal, Apple can implement OpenGL on the secret Metal.

Apple does not publish documentation or headers for this richer Metal API, but if we’re lucky, we can catch a glimpse behind the curtain. The undocumented classes and methods making up the internal Metal API are still available in the production Metal binaries. To use them, we only need the missing headers. Fortunately, Objective-C symbols contain enough information to reconstruct header files, allowing us to experiment with undocumented methods with “extra” functionality inherited from OpenGL.

Compared to the desktop GPUs found in Intel Macs, Apple’s own GPU implements a slim, modern feature set mapping well to Metal. Most of the “extra” functionality is emulated. It is interesting to know the emulation happens in their Metal driver instead of their OpenGL frontend, but that’s unsurprising, as it allows their Metal drivers for Intel and AMD GPUs to implement the functionality natively. While this information is fascinating for “macOS hermeneutics”, it won’t help us with our Apple GPU mystery.

What will help us are the catch-all mystery methods named setOpenGLModeEnabled, apparently enabling “OpenGL mode”.

Mystery methods named like just beg to be called.

The render pipeline descriptor has such a method. That descriptor contains state that can change every draw. In some graphics APIs, like OpenGL with ARB_clip_control and Vulkan with VK_EXT_depth_clip_control, the application can change the clip space every draw. Ideally, the clip space state would be part of this descriptor.

We can test this optimistic guess by augmenting our Metal test bench to call [MTLRenderPipelineDescriptorInternal setOpenGLModeEnabled: YES].

It feels strange to call this hidden method. It’s stranger when the code compiles and runs just fine.

We can then compare traces between OpenGL mode and the normal Metal mode. Seemingly, enabling OpenGL mode toggles a plethora of random unknown bits. Even if one of them is what we want, it’s a bit unsatisfying that the “real” Metal would lack a proper [setClipSpace: MTLMinusOneToOne] method, rather than this blunt hack reconfiguring a pile of loosely related API behaviours.

Alas, for all the random changes in “OpenGL mode”, none seem to affect clipping behaviour.

Hope is not yet lost. There’s another setOpenGLModeEnabled method, this time in the render pass descriptor. Rather than pipeline state that can change every draw, this descriptor’s state can only change in between render passes. Changing that state in between draws would require an expensive flush to main memory, similar to the partial renders seen elsewhere with the Apple GPU. Nevertheless, it’s worth a shot.

Changing our test bench to call [MTLRenderPassDescriptorInternal setOpenGLModeEnabled: YES], we find another collection of random bits changed. Most of them are in hardware packets, and none of those seem to control clip space, either.

One bit does stand out. It’s not a hardware bit.

In addition to the packets that the userspace driver prepares for the hardware, userspace passes to the kernel a large block of render pass state describing everything from tile size to the depth/stencil buffers. Such a design is unusual. Ordinarily, GPU kernel drivers are only concerned with memory management and scheduling, remaining oblivious of 3D graphics. By contrast, Apple processes this state in the kernel forwarding the state to the GPU’s firmware to configure the actual hardware.

Comparing traces, the render pass “OpenGL mode” sets an unknown bit in this kernel-processed block. If we set the same bit in our OpenGL driver, we find the clip space changes to -1/1. Victory, right?

Almost. Because this bit is render pass state, we can’t use it to change the clip space between draws. That’s okay for baseline OpenGL and Vulkan, but it prevents us from efficiently implementing the ARB_clip_control and VK_EXT_depth_clip_control extensions. There are at least three (inefficient) implementations.

The first is ignoring the hardware support and emulating one of the clip spaces by inserting extra instructions into the vertex shader when the “wrong” clip space is used. In addition to extra overhead, that requires shader variants for the different clip spaces.

Shader variants are terrible.

In new APIs like Vulkan, Metal, and D3D12, everything needed to compile a shader is known up-front as part of a monolithic pipeline. That means pipelines are compiled when they’re created, not when they’re used, and are never recompiled. By contrast, older APIs like OpenGL and D3D11 allow using the same shader with different API states, requiring some drivers to recompile shaders on the fly. Compiling shaders is slow, so shader variants can cause unpredictable drops in an application’s frame rate, familiar to desktop gamers as stuttering. If we use this approach in our OpenGL driver, switching clip modes could cause stuttering due to recompiling shaders. In bad circumstances, that stutter could even happen long after the mode is switched.

That option is undesirable, so the second approach is always inserting emulation instructions that read the desired clip space at run-time, reserving a uniform (push constant) for the transformation. That way, the same shader is usable with either clip space, eliminating shader variants. However, that has even higher overhead than the first method. If an application frequently changes clip spaces within a render pass, this approach will be the most efficient of the three. If it does not, this approach adds constant overhead to every application. Knowing which approach is better requires the driver to have a magic crystal ball.¹

The final option is using the hardware clip space bit and splitting the render pass when the clip space is changed. Here, the shaders are optimal and do not require variants. However, splitting the render pass wastes tremendous memory bandwidth if the application changes clip spaces frequently. Nevertheless, this approach has some support from the ARB_clip_control specification:

Some [OpenGL] implementations may introduce a flush when changing the clip control state. Hence frequent clip control changes are not recommended.

Each approach has trade-offs. For now, the easiest “option” is sticking our head in the sand and giving up on ARB_clip_control altogether. The OpenGL extension is optional until we get to OpenGL 4.5. Apple doesn’t implement it in their OpenGL stack. Because ARB_clip_control is primarily for porting Direct3D games, native OpenGL games are happy without it. Certainly, Neverball doesn’t mind. For now, we can use the hardware bit to use the -1/1 clip space unconditionally in OpenGL and 0/1 unconditionally in Vulkan. That does not require any emulation or flushing, though it prevents us from advertising the extensions.

That’s enough to run Neverball on macOS, using our userspace OpenGL driver in Mesa, and Apple’s proprietary kernel driver. There’s a catch: Neverball has to present with the deprecated X11 server on macOS. Years ago, Apple engineers² contributed Mesa support for X11 on macOS (XQuartz), allowing us to run X11 applications with our Mesa driver. However, there’s no support for Apple’s own Cocoa windowing system, meaning native macOS applications won’t work with our driver. There’s also no easy way to run Linux’s newer Wayland display server on macOS. Nevertheless, Neverball does not use Cocoa directly. Instead, it uses the cross-platform SDL2 library to create its window, which internally uses Cocoa, X11, or Wayland as appropriate for the operating system. With enough sweat and tears, we can build an macOS/X11 version of SDL2 and link Neverball with that.

This Neverball/macOS/X11 port was frustrating, especially when the game is one apt install away on Linux. That’s a job for Asahi Lina, who has been hard at work writing a Linux kernel driver for Apple’s GPU. When our work converges, my userspace Mesa driver will run on Linux with her kernel driver to implement a full open source graphics stack for 3D acceleration on Asahi Linux.

Please temper your expectations: even with hardware documentation, an optimized Vulkan driver stack (with enough features to layer OpenGL 4.6 with Zink) requires many years of full time work. At least for now, nobody is working on this driver full time³. Reverse-engineering slows the process considerably. We won’t be playing AAA games any time soon.

That said, thanks to the tremendous shared code in Mesa, a basic OpenGL driver is doable by a single person. I’m optimistic that we’ll have native OpenGL 2.1 in Asahi Linux by the end of the year. That’s enough to accelerate your desktop environment and browser. It’s also enough to play older games (like Neverball). Even without fancy features, GPU acceleration means smooth animations and better battery life.

In that light, the Asahi Linux future looks bright.

This crystal ball is called “Vulkan, Metal, or D3D12”, and it has its own problems.↩︎
Hi Jeremy!↩︎
I work full-time at Collabora on my baby, the open source Panfrost driver for Mali GPUs.↩︎

The Apple GPU and the Impossible Bug

Fri, 13 May 2022 00:00:00 -0500

In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:

The driver fails to render large amounts of geometry.

Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.

It’s hard to pinpoint how much we can render without faults. It’s not just the geometry complexity that matters. The same geometry can render with simple shaders but fault with complex ones.

That suggests rendering detailed geometry with a complex shader “takes too long”, and the GPU is timing out. Maybe it renders only the parts it finished in time.

Given the hardware architecture, this explanation is unlikely.

This hypothesis is easy to test, because we can control for timing with a shader that takes as long as we like:

for (int i = 0; i < LARGE_NUMBER; ++i) {
    /* some work to prevent the optimizer from removing the loop */
}

After experimenting with such a shader, we learn…

If shaders have a time limit to protect against infinite loops, it’s astronomically high. There’s no way our bunny hits that limit.
The symptoms of timing out differ from the symptoms of our driver rendering too much geometry.

That theory is out.

Let’s experiment more. Modifying the shader and seeing where it breaks, we find the only part of the shader contributing to the bug: the amount of data interpolated per vertex. Modern graphics APIs allow specifying “varying” data for each vertex, like the colour or the surface normal. Then, for each triangle the hardware renders, these “varyings” are interpolated across the triangle to provide smooth inputs to the fragment shader, allowing efficient implementation of common graphics techniques like Blinn-Phong shading.

Putting the pieces together, what matters is the product of the number of vertices (geometry complexity) times amount of data per vertex (“shading” complexity). That product is “total amount of per-vertex data”. The GPU faults if we use too much total per-vertex data.

Why?

When the hardware processes each vertex, the vertex shader produces per-vertex data. That data has to go somewhere. How this works depends on the hardware architecture. Let’s consider common GPU architectures.¹

Traditional immediate mode renderers render directly into the framebuffer. They first run the vertex shader for each vertex of a triangle, then run the fragment shader for each pixel in the triangle. Per-vertex “varying” data is passed almost directly between the shaders, so immediate mode renderers are efficient for complex scenes.

There is a drawback: rendering directly into the framebuffer requires tremendous amounts of memory access to constantly write the results of the fragment shader and to read out back results when blending. Immediate mode renderers are suited to discrete, power-hungry desktop GPUs with dedicated video RAM.

By contrast, tile-based deferred renderers split rendering into two passes. First, the hardware runs all vertex shaders for the entire frame, not just for a single model. Then the framebuffer is divided into small tiles, and dedicated hardware called a tiler determines which triangles are in each tile. Finally, for each tile, the hardware runs all relevant fragment shaders and writes the final blended result to memory.

Tilers reduce memory traffic required for the framebuffer. As the hardware renders a single tile at a time, it keeps a “cached” copy of that tile of the framebuffer (called the “tilebuffer”). The tilebuffer is small, just a few kilobytes, but tilebuffer access is fast. Writing to the tilebuffer is cheap, and unlike immediate renderers, blending is almost free. Because main memory access is expensive and mobile GPUs can’t afford dedicated video memory, tilers are suited to mobile GPUs, like Arm’s Mali, Imaginations’s PowerVR, and Apple’s AGX.

Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones. Tilers work well on the desktop, but there are some drawbacks.

First, at the start of a frame, the contents of the tilebuffer are undefined. If the application needs to preserve existing framebuffer contents, the driver needs to load the framebuffer from main memory and store it into the tilebuffer. This is expensive.

Second, because all vertex shaders are run before any fragment shaders, the hardware needs a buffer to store the outputs of all vertex shaders. In general, there is much more data required than space inside the GPU, so this buffer must be in main memory. This is also expensive.

Ah-ha. Because AGX is a tiler, it requires a buffer of all per-vertex data. We fault when we use too much total per-vertex data, overflowing the buffer.

…So how do we allocate a larger buffer?

On some tilers, like older versions of Arm’s Mali GPU, the userspace driver computes how large this “varyings” buffer should be and allocates it.² To fix the faults, we can try increasing the sizes of all buffers we allocate, in the hopes that one of them contains the per-vertex data.

No dice.

It’s prudent to observe what Apple’s Metal driver does. We can cook up a Metal program drawing variable amounts of geometry and trace all GPU memory allocations that Metal performs while running our program. Doing so, we learn that increasing the amount of geometry drawn does not increase the sizes of any allocated buffers. In fact, it doesn’t change anything in the command buffer submitted to the kernel, except for the single “number of vertices” field in the draw command.

We know that buffer exists. If it’s not allocated by userspace – and by now it seems that it’s not – it must be allocated by the kernel or firmware.

Here’s a funny thought: maybe we don’t specify the size of the buffer at all. Maybe it’s okay for it to overflow, and there’s a way to handle the overflow.

It’s time for a little reconnaissance. Digging through what little public documentation exists for AGX, we learn from one WWDC presentation:

The Tiled Vertex Buffer stores the Tiling phase output, which includes the post-transform vertex data…

But it may cause a Partial Render if full. A Partial Render is when the GPU splits the render pass in order to flush the contents of that buffer.

Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can overflow. To cope, the GPU stops accepting new geometry, renders the existing geometry, and restarts rendering.

Since partial renders hurt performance, Metal application developers need to know about them to optimize their applications. There should be performance counters flagging this issue. Poking around, we find two:

Number of partial renders.
Number of bytes used of the parameter buffer.

Wait, what’s a “parameter buffer”?

Remember the rumours that AGX is derived from PowerVR? The public PowerVR optimization guides explain:

[The] list containing pointers to each vertex passed in from the application… is called the parameter buffer (PB) and is stored in system memory along with the vertex data.

Each varying requires additional space in the parameter buffer.

The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.

What happens when PowerVR overflows the parameter buffer?

An old PowerVR presentation says that when the parameter buffer is full, the “render is flushed”, meaning “flushed data must be retrieved from the frame buffer as successive tile renders are performed”. In other words, it performs a partial render.

Back to the Apple M1, it seems the hardware is failing to perform a partial render. Let’s revisit the broken render.

Notice parts of the model are correctly rendered. The parts that are not only have the black clear colour of the scene rendered at the start. Let’s consider the logical order of events.

First, the hardware runs vertex shaders for the bunny until the parameter buffer overflows. This works: the partial geometry is correct.

Second, the hardware rasterizes the partial geometry and runs the fragment shaders. This works: the shading is correct.

Third, the hardware flushes the partial render to the framebuffer. This must work for us to see anything at all.

Fourth, the hardware runs vertex shaders for the rest of the bunny’s geometry. This ought to work: the configuration is identical to the original vertex shaders.

Fifth, the hardware rasterizes and shades the rest of the geometry, blending with the old partial render. Because AGX is a tiler, to preserve that existing partial render, the hardware needs to load it back into the tilebuffer. We have no idea how it does this.

Finally, the hardware flushes the render to the framebuffer. This should work as it did the first time.

The only problematic step is loading the framebuffer back into the tilebuffer after a partial render. Usually, the driver supplies two “extra” fragment shaders. One clears the tilebuffer at the start, and the other flushes out the tilebuffer contents at the end.

If the application needs the existing framebuffer contents preserved, instead of writing a clear colour, the “load tilebuffer” program instead reads from the framebuffer to reload the contents. Handling this requires quite a bit of code, but it works in our driver.

Looking closer, AGX requires more auxiliary programs.

The “store” program is supplied twice. I noticed this when initially bringing up the hardware, but the reason for the duplication was unclear. Omitting each copy separately and seeing what breaks, the reason becomes clear: one program flushes the final render, and the other flushes a partial render.³

…What about the program that loads the framebuffer into the tilebuffer?

When a partial render is possible, there are two “load” programs. One writes the clear colour or loads the framebuffer, depending on the application setting. We understand this one. The other always loads the framebuffer.

…Always loads the framebuffer, as in, for loading back with a partial render even if there is a clear at the start of the frame?

If this program is the issue, we can confirm easily. Metal must require it to draw the same bunny, so we can write a Metal application drawing the bunny and stomp over its GPU memory to replace this auxiliary load program with one always loading with black.

Doing so, Metal fails in a similar way. That means we’re at the root cause. Looking at our own driver code, we don’t specify any program for this partial render load. Up until now, that’s worked okay. If the parameter buffer is never overflowed, this program is unused. As soon as a partial render is required, however, failing to provide this program means the GPU dereferences a null pointer and faults. That explains our GPU faults at the beginning.

Following Metal, we supply our own program to load back the tilebuffer after a partial render…

…which does not fix the rendering! Cursed, this GPU. The faults go away, but the render still isn’t quite right for the first few frames, indicating partial renders are still broken. Notice the weird artefacts on the feet.

Curiously, the render “repairs itself” after a few frames, suggesting the parameter buffer stops overflowing. This implies the parameter buffer can be resized (by the kernel or by the firmware), and the system is growing the parameter buffer after a few frames in response to overflow. This mechanism makes sense:

The hardware can’t allocate more parameter buffer space itself.
Overflowing the parameter buffer is expensive, as partial renders require tremendous memory bandwidth.
Overallocating the parameter buffer wastes memory for applications rendering simple geometry.

Starting the parameter buffer small and growing in response to overflow provides a balance, reducing the GPU’s memory footprint and minimizing partial renders.

Back to our misrendering. There are actually two buffers being used by our program, a colour buffer (framebuffer)… and a depth buffer. The depth buffer isn’t directly visible, but facilitates the “depth test”, which discards far pixels that are occluded by other close pixels. While the partial render mechanism discards geometry, the depth test discards pixels.

That would explain the missing pixels on our bunny. The depth test is broken with partial renders. Why? The depth test depends on the depth buffer, so the depth buffer must also be stored after a partial render and loaded back when resuming. Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.

And with that, we get our bunny.

These explanations are massive oversimplifications of how modern GPUs work, but it’s good enough for our purposes here.↩︎
This is a worse idea than it sounds. Starting with the new Valhall architecture, Mali allocates varyings much more efficiently.↩︎
Why the duplication? I have not yet observed Metal using different programs for each. However, for front buffer rendering, partial renders need to be flushed to a temporary buffer for this scheme to work. Of course, you may as well use double buffering at that point.↩︎

Fun and Games with Exposure Notifications

Mon, 07 Sep 2020 00:00:00 -0500

Exposure Notifications is a protocol developed by Apple and Google for facilitating COVID-19 contact tracing on mobile phones by exchanging codes with nearby phones over Bluetooth, implemented within the Android and iOS operating systems, now available here in Toronto.

Wait – phones? Android and iOS only? Can’t my Debian laptop participate? It has a recent Bluetooth chip. What about phones running GNU/Linux distributions like the PinePhone or Librem 5?

Exposure Notifications breaks down neatly into three sections: a Bluetooth layer, some cryptography, and integration with local public health authorities. Linux is up to the task, via BlueZ, OpenSSL, and some Python.

Given my background, will this build to be a reverse-engineering epic resulting in a novel open stack for a closed system?

…

Not at all. The specifications for the Exposure Notifications are available for both the Bluetooth protocol and the underlying cryptography. A partial reference implementation is available for Android, as is an independent Android implementation in microG. In Canada, the key servers run an open source stack originally built by Shopify and now maintained by the Canadian Digital Service, including open protocol documentation.

All in all, this is looking to be a smooth-sailing weekend¹ project.

The devil’s in the details.

Bluetooth

Exposure Notifications operates via Bluetooth Low Energy “advertisements”. Scanning for other devices is as simple as scanning for advertisements, and broadcasting is as simple as advertising ourselves.

On an Android phone, this is handled deep within Google Play Services. Can we drive the protocol from userspace on a regular GNU/Linux laptop? It depends. Not all laptops support Bluetooth, not all Bluetooth implementations support Bluetooth Low Energy, and I hear not all Bluetooth Low Energy implementations properly support undirected transmissions (“advertising”).

Luckily in my case, I develop on an Debianized Chromebook with a Wi-Fi/Bluetooth module. I’ve never used the Bluetooth, but it turns out the module has full support for advertisements, verified with the lescan (Low Energy Scan) command of the hcitool Bluetooth utility.

hcitool is a part of BlueZ, the standard Linux library for Bluetooth. Since lescan is able to detect nearby phones running Exposure Notifications, pouring through its source code is a good first step to our implementation. With some minor changes to hcitool to dump packets as raw hex and to filter for the Exposure Notifications protocol, we can print all nearby Exposure Notifications advertisements. So far, so good.

That’s about where the good ends.

While scanning is simple with reference code in hcitool, advertising is complicated by BlueZ’s lack of an interface at the time of writing. While a general “enable advertising” routine exists, routines to set advertising parameters and data per the Exposure Notifications specification are unavailable. This is not a showstopper, since BlueZ is itself an open source userspace library. We can drive the Bluetooth module the same way BlueZ does internally, filling in the necessary gaps in the API, while continuing to use BlueZ for the heavy-lifting.

Some care is needed to multiplex scanning and advertising within a single thread while remaining power efficient. The key is that advertising, once configured, is handled entirely in hardware without CPU intervention. On the other hand, scanning does require CPU involvement, but it is not necessary to scan continuously. Since COVID-19 is thought to transmit from sustained exposure, we only need to scan every few minutes. (Food for thought: how does this connect to the sampling theorem?)

Thus we can order our operations as:

Configure advertising
Scan for devices
Wait for several minutes
Repeat.

Since most of the time the program is asleep, this loop is efficient. It additionally allows us to reconfigure advertising every ten to fifteen minutes, in order to change the Bluetooth address to prevent tracking.

All of the above amounts to a few hundred lines of C code, treating the Exposure Notifications packets themselves as opaque random data.

Cryptography

Yet the data is far from random; it is the result of a series of operations in terms of secret keys defined by the Exposure Notifications cryptography specification. Every day, a “temporary exposure key” is generated, from which a “rolling proximity identifier key” and an “associated encrypted metadata key” are derived. These are used to generate a “rolling proximity identifier” and the “associated encrypted metadata”, which are advertised over Bluetooth and changed in lockstep with the Bluetooth random addresses.

There are lots of moving parts to get right, but each derivation reuses a common encryption primitive: HKDF-SHA256 for key derivation, AES-128 for the rolling proximity identifier, and AES-128-CTR for the associated encrypted metadata. Ideally, we would grab a state-of-the-art library of cryptography primitives like NaCl or libsodium and wire everything up.

First, some good news: once these routines are written, we can reliably unit test them. Though the specification states that “test vectors… are available upon request”, it isn’t clear who to request from. But Google’s reference implementation is itself unit-tested, and sure enough, it contains a TestVectors.java file, from which we can grab the vectors for a complete set of unit tests.

After patting ourselves on the back for writing unit tests, we’ll need to pick a library to implement the cryptography. Suppose we try NaCl first. We’ll quickly realize the primitives we need are missing, so we move onto libsodium, which is backwards-compatible with NaCl. For a moment, this will work – libsodium has upstream support for HKDF-SHA256. Unfortunately, the version of libsodium shipping in Debian testing is too old for HKDF-SHA256. Not a big problem – we can backwards port the implementation, written in terms of the underlying HMAC-SHA256 operations, and move on to the AES.

AES is a standard symmetric cipher, so libsodium has excellent support… for some modes. However standard, AES is not one cipher; it is a family of ciphers with different key lengths and operating modes, with dramatically different security properties. “AES-128-CTR” in the Exposure Notifications specification is clearly 128-bit AES in CTR (Counter) mode, but what about “AES-128” alone, stated to operate on a “single AES-128 block”?

The mode implicitly specified is known as ECB (Electronic Codebook) mode and is known to have fatal security flaws in most applications. Because AES-ECB is generally insecure, libsodium does not have any support for this cipher mode. Great, now we have two problems – we have to rewrite our cryptography code against a new library, and we have to consider if there is a vulnerability in Exposure Notifications.

ECB’s crucial flaw is that for a given key, identical plaintext will always yield identical ciphertext, regardless of position in the stream. Since AES is block-based, this means identical blocks yield identical ciphertext, leading to trivial cryptanalysis.

In Exposure Notifications, ECB mode is used only to derive rolling proximity identifiers from the rolling proximity identifier key and the timestamp, by the equation:

RPI_ij = AES_128_ECB(RPIK_i, PaddedData_j)

…where PaddedData is a function of the quantized timestamp. Thus the issue is avoided, as every plaintext will be unique (since timestamps are monotonically increasing, unless you’re trying to contact trace Back to the Future).

Nevertheless, libsodium doesn’t know that, so we’ll need to resort to a ubiquitous cryptography library that doesn’t, uh, take security quite so seriously…

I’ll leave the implications up to your imagination.

Database

While the Bluetooth and cryptography sections are governed by upstream specifications, making sense of the data requires tracking a significant amount of state. At minimum, we must:

Record received packets (the Rolling Proximity Identifier and the Associated Encrypted Metadata).
Query received packets for diagnosed identifiers.
Record our Temporary Encryption Keys.
Query our keys to upload if we are diagnosed.

If we were so inclined, we could handwrite all the serialization and concurrency logic and hope we don’t have a bug that results in COVID-19 mayhem.

A better idea is to grab SQLite, perhaps the most deployed software in the world, and express these actions as SQL queries. The database persists to disk, and we can even express natural unit tests with a synthetic in-memory database.

With this infrastructure, we’re now done with the primary daemon, recording Exposure Notification identifiers to the database and broadcasting our own identifiers. That’s not interesting if we never do anything with that data, though. Onwards!

Key retrieval

Once per day, Exposure Notifications implementations are expected to query the server for Temporary Encryption Keys associated with diagnosed COVID-19 cases. From these keys, the cryptography implementation can reconstruct the associated Rolling Proximity Identifiers, for which we can query the database to detect if we have been exposed.

Per Google’s documentation, the servers are expected to return a zip file containing two files:

export.bin: a container serialized as Protocol Buffers containing Diagnosis Keys
export.sig: a signature for the export with the public health agency’s key

The signature is not terribly interesting to us. On Android, it appears the system pins the public keys of recognized public health agencies as an integrity check for the received file. However, this public key is given directly to Google; we don’t appear to have an easy way to access it.

Does it matter? For our purposes, it’s unlikely. The Canadian key retrieval server is already transport-encrypted via HTTPS, so tampering with the data would already require compromising a certificate authority in addition to intercepting the requests to https://canada.ca. Broadly speaking, that limits attackers to nation-states, and since Canada has no reason to attack its own infrastructure, that limits our threat model to foreign nation-states. International intelligence agencies probably have better uses of resources than getting people to take extra COVID tests.

It’s worth noting other countries’ implementations could serve this zip file over plaintext HTTP, in which case this signature check becomes important.

Focusing then on export.bin, we may import the relevant protocol buffer definitions to extract the keys for matching against our database. Since this requires only read-only access to the database and executes infrequently, we can safely perform this work from a separate process written in a higher-level language like Python, interfacing with the cryptography routines over the Python foreign function interface ctypes. Extraction is easy with the Python protocol buffers implementation, and downloading should be as easy as a GET request with the standard library’s urllib, right?

Here we hit a gotcha: the retrieval endpoint is guarded behind an HMAC, requiring authentication to download the zip. The protocol documentation states:

Of course there’s no reliable way to truly authenticate these requests in an environment where millions of devices have immediate access to them upon downloading an Application: this scheme is purely to make it much more difficult to casually scrape these keys.

Ah, security by obscurity. Calculating the HMAC itself is simple given the documentation, but it requires a “secret” HMAC key specific to the server. As the documentation is aware, this key is hardly secret, but it’s not available on the Canadian Digital Service’s official repositories. Interoperating with the upstream servers would require some “extra” tricks.

From purely academic interest, we can write and debug our implementation without any such authorization by running our own sandbox server. Minus the configuration, the server source is available, so after spinning up a virtual machine and fighting with Go versioning, we can test our Python script.

Speaking of a personal sandbox…

Key upload

There is one essential edge case to the contact tracing implementation, one that we can’t test against the Canadian servers. And edge cases matter. In effect, the entire Exposure Notifications infrastructure is designed for the edge cases. If you don’t care about edge cases, you don’t care about digital contact tracing (so please, stay at home.)

The key feature – and key edge case – is uploading Temporary Exposure Keys to the Canadian key server in case of a COVID-19 diagnosis. This upload requires an alphanumeric code generated by a healthcare provider upon diagnosis, so if we used the shared servers, we couldn’t test an implementation. With our sandbox, we can generate as many alphanumeric codes as we’d like.

Once sandboxed, there isn’t much to the implementation itself: the keys are snarfed out of the SQLite database, we handshake with the server over protocol buffers marshaled over POST requests, and we throw in some public-key cryptography via the Python bindings to libsodium.

This functionality neatly fits into a second dedicated Python script which does not interface with the main library. It’s exposed as a command line interface with flow resembling that of the mobile application, adhering reasonably to the UNIX philosophy. Admittedly I’m not sure wrestling with the command line is top on the priority list of a Linux hacker ill with COVID-19. Regardless, the interface is suitable for higher-level (graphical) abstractions.

Problem solved, but of course there’s a gotcha: if the request is malformed, an error should be generated as a key robustness feature. Unfortunately, while developing the script against my sandbox, a bug led the request to be dropped unexpectedly, rather than returning with an error message. On the server implemented in Go, there was an apparent nil dereference. Oops. Fixing this isn’t necessary for this project, but it’s still a bug, even if it requires a COVID-19 diagnosis to trigger. So I went and did the Canadian thing and sent a pull request.

Conclusion

All in all, we end up with a Linux implementation of Exposure Notifications functional in Ontario, Canada. What’s next? Perhaps supporting contact tracing systems elsewhere in the world – patches welcome. Closer to home, while functional, the aesthetics are not (yet) anything to write home about – perhaps we could write a touch-based Linux interface for mobile Linux interfaces like Plasma Mobile and Phosh, maybe even running it on a Android flagship flashed with postmarketOS to go full circle.

Source code for liben is available for any one who dares go near. Compiling from source is straightforward but necessary at the time of writing. As for packaging?

Here’s hoping COVID-19 contact tracing will be obsolete by the time liben hits Debian stable.

Today (Monday) is Labour Day, so this is a 3-day weekend. But I started on Saturday and posted this today, so it technically counts.↩︎

Hilariously Fast Volume Computation with the Divergence Theorem

Fri, 16 Feb 2018 00:00:00 -0500

(No, there won’t be jokes.)

The following presents a fast algorithm for volume computation of a simple, closed, triangulated 3D mesh. This assumption is a consequence of the divergence theorem. Further extensions may generalise to other meshes as well, although that is presently out of scope.

We begin with the definition of volume as the triple integral over a region of the constant one:

\[V = \iiint_R 1 \mathrm{d}V\]

Let $\mathbf{F}$ be a function in $\mathbb{R}^3$ such that its divergence is equal to one. For the purposes of this paper, we choose:

\[\mathbf{F}(x, y, z) = <x, 0, 0>\]

It can easily be verified that

\[\mathrm{div} \mathbf{F} = \frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} + \frac{\partial F}{\partial z} = 1 + 0 + 0 = 1\]

Therefore,

\[V = \iiint_R 1 dV = \iiint_R \mathrm{div} \mathbf{F}(x, y, z) \mathrm{d}V\]

By the Divergence Theorem, this is equal to the surface integral:

\[V = \iint_S \mathbf{F}(x, y, z) \mathrm{d}\mathbf{S}\]

This surface integral, defined over the surface S of the 3D mesh, is equal to the sum of its piecewise triangle parts. Let $T_i$ denote the surface of the $i$’th triangle in the mesh. Then,

\[V = \sum_{i = 0} \iint_{T_i} \mathbf{F}(x, y, z) \mathrm{d}\mathbf{S}\]

Let $T_{in}$ represent the $n$’th vertex of the $i$’th triangle. Let $\Delta_1$ equal the vector difference between $T_{i1}$ and $T_{i0}$, and $\Delta_2$ likewise equal to $T_{i2} - T{i0}$. Each individual triangle $T_i$ may thus be parametrised as:

\[\mathbf{r}(u, v) = T_{i0} + u\Delta_1 + v\Delta_2\]

Then, simple differentiation yields:

\[\mathbf{r}_u = \Delta_1\] \[\mathbf{r}_v = \Delta_2\]

Therefore,

\[\mathbf{r}_u \times \mathbf{r}_v = \Delta_1 \times \Delta_2\]

Thus, the surface integral can be rewritten in terms of this parametrisation, substituting in the definition of $\mathbf{F}$ as needed:

\[V = \sum_{i = 0} \iint_{T_i} \mathbf{F}(x, y, z) (\mathbf{r}_u \times \mathbf{r}_v) dA\] \[= \sum_{i = 0} \iint_{T_i} \mathbf{F}(x, y, z) \dot (\Delta_{i1} \times \Delta_{i2}) dA\] \[= \sum_{i = 0} \iint_{T_i} <x, 0, 0> \dot (\Delta_{i1} \times \Delta_{i2}) dA\]

This cross product is constant throughout the triangle and easy to calculate from the vertex data. Only the X component of the cross product should be calculated; the others are equal to zero due to the dot product with the zero components of $\mathbf{F}$. $V$ can be thus be rewritten as:

\[V = \sum_{i = 0} (\Delta_{i1} \times \Delta_{i2})_x \iint_{T_i} x dA\]

We now focus on the surface integral $\iint_{T_i} x dA$. Expanding with the parametrisation yields:

\[\iint_{T_i} x dA = \int_{0}^{1} \int_{0}^{u} x dv du = \int_{0}^{1} \int_{0}^{u} (T_{i0x} + u \Delta_{i1x} + v \Delta_{i2x}) dv du\]

This integral can be directly evaluated, treating vertex data as constants:

\[\int_{0}^{1} \int_{0}^{1-u} (T_{i0x} + u \Delta_{i1x} + v \Delta_{i2x}) dv du\] \[= T_{i0x} \int_{0}^{1} \int_{0}^{1-u} dv du + \Delta_{i1x} \int_{0}^{1} \int_{0}^{1-u} u dv du + \Delta_{i2x}) \int_{0}^{1} \int_{0}^{1-u} v dv du\] \[= T_{i0x} (\frac{1}{2}) + \Delta_{i1x} (\frac{1}{6}) + \Delta_{i2x} (\frac{1}{6})\] \[= T_{i0x} (\frac{1}{2}) + (T_{i1x} - T_{i0x})(\frac{1}{6}) + (T_{i2x} - T_{i0x})(\frac{1}{6})\] \[= T_{i0x} (\frac{1}{6}) + (T_{i1x})(\frac{1}{6}) + (T_{i2x})(\frac{1}{6})\] \[= \frac{1}{6}(T_{i0x} + T_{i1x} + T_{i2x})\]

Substituting into the original sum and pulling out a constant factor of $\frac{1}{6}$ to avoid the inner loop division, this yields the following compact formula for the volume:

\[V = \frac{1}{6} \sum_{i = 0} (\Delta_{i1} \times \Delta_{i2})_x (T_{i0x} + T_{i1x} + T_{i2x})\]

Performance analysis

The final algorithm contains no numerical integration nor differentiation. In contrast to common naive algorithms for volume, which are equivalent to rendering the mesh and then sampling the render, an expensive operation, there is only a single loop in this algorithm, over the triangles. Thus, this algorithm for volume computation is O(n) to the number of the triangles. Furthermore, the per-triangle calculation is similarly efficient: given the natural expansion of the cross product, the inner part contains seven additions and three multiplications. On the outside of the loop is only a single multiplication. Thus, for a mesh of $n$ triangles, the algorithm requires $8n - 1$ additions and $3n + 1$ multiplications, or $11n$ floating point operations. This is very fast.

For a ballpark number, if volume needs to be calculated every frame in a high-performance 60 frames per second application, without the aid of a GPU, only using the CPU capabilities of a $35 Raspberry Pi, around 30 million triangles could be measured every frame.

Motivation

The vector calculus exam is soon, and I need to study. Plus, who doesn’t love 3D graphics?!

~~I would be (pleasantly) surprised if the algorithm is novel.~~ Further research after posting reveals the paper Efficient Feature Extraction for 2D/3D Objects in Mesh Representation by Cha Zheng and Tsuhan Chen, which appears to describe the same algorithm, although the derivation is different. It was fun while it lasted!