<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"><channel><title>Alyssa Rosenzweig</title><link>https://alyssarosenzweig.ca/</link><description>Musing of a graphics witch</description><docs>http://www.rssboard.org/rss-specification</docs><generator>python-feedgen</generator><language>en</language><lastBuildDate>Wed, 11 Mar 2026 01:22:48 +0000</lastBuildDate><item><title>Plan du réseau du métro de Toronto</title><link>https://alyssarosenzweig.ca/blog/plan-du-reseau-ctt.html</link><description>&lt;p&gt;&lt;a href="/CTT-Plan.png"&gt;&lt;img src="/CTT-Plan.png"
alt="CTT subway network map" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The largest city in a bilingual country deserves bilingual public
transit. Given the enduring interest in the francization of the Queen
City, here at last is a translated network map.&lt;/p&gt;
&lt;p&gt;Several years ago, an internet user &lt;a
href="https://i.redd.it/y9hkgaj5tcty.jpg"&gt;translated the network map of
the Société de transport de Montréal&lt;/a&gt;. The rule is simple: translate
the name of every station, either from French to English or from English
to French. Since the original map mixes our national languages,
translating everything makes it as bilingual as &lt;em&gt;O Canada&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Given the popularity of that Montreal map, I wondered why nobody
had done the same here. Although francophones are a minority in
Toronto, I am certain this map will meet the needs of dozens of
Franco-Torontonian train enthusiasts, most of whom I already
know.&lt;/p&gt;
&lt;p&gt;This map is a derivative work of the &lt;a
href="https://commons.wikimedia.org/wiki/File:Toronto_rapid_transit_map_2026.svg"&gt;English
map by Transportfan70 and Craftwerker&lt;/a&gt;, available under the terms of
the &lt;a
href="https://creativecommons.org/licenses/by/3.0/deed.fr"&gt;Creative
Commons Attribution 3.0 Unported&lt;/a&gt; licence. If you would like to
correct my mistakes, here is &lt;a href="/CTT-Plan.svg"&gt;the modified SVG&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thank you for supporting Canadian bilingualism and the best public
transit system in North America. If you need me tomorrow, I will be
busy waiting for a train that will never come, thanks to a
breakdown.&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/plan-du-reseau-ctt.html</guid><pubDate>Tue, 10 Mar 2026 00:00:00 -0500</pubDate></item><item><title>Dissecting the Apple M1 GPU, the end</title><link>https://alyssarosenzweig.ca/blog/asahi-gpu-part-n.html</link><description>&lt;p&gt;In 2020, Apple released the M1 with a custom GPU. We got to work
reverse-engineering the hardware and porting Linux. Today, you can run
Linux on a range of M1 and M2 Macs, with almost all hardware working:
wireless, audio, and full graphics acceleration.&lt;/p&gt;
&lt;p&gt;Our story begins in December 2020, when &lt;a
href="https://marcan.st/"&gt;Hector Martin&lt;/a&gt; kicked off &lt;a
href="https://asahilinux.org"&gt;Asahi Linux&lt;/a&gt;. I was at &lt;a
href="http://collabora.com/"&gt;Collabora&lt;/a&gt; working on Panfrost, the open
source Mesa3D driver for Arm Mali GPUs. Hector put out a public call for
guidance from upstream open source maintainers, and I bit. I just
intended to give some quick pointers. Instead, I &lt;a
href="https://xkcd.com/356/"&gt;bought myself a Christmas present&lt;/a&gt; and
got to work. In between my university coursework and Collabora work, I
poked at the &lt;a href="/blog/asahi-gpu-part-1.html"&gt;shader instruction
set&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing led to another. Within a few weeks, I &lt;a
href="/blog/asahi-gpu-part-2.html"&gt;drew a triangle&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In 3D graphics, once you can draw a triangle, you can do &lt;a
href="https://www.youtube.com/watch?v=dKmzEgpEdNw"&gt;anything&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Pretty soon, I started work on a &lt;a
href="/blog/asahi-gpu-part-3.html"&gt;shader compiler&lt;/a&gt;. After my final
exams that semester, I took a few days off from Collabora to bring up &lt;a
href="/blog/asahi-gpu-part-4.html"&gt;an OpenGL driver&lt;/a&gt; capable of
spinning gears with my new compiler.&lt;/p&gt;
&lt;p&gt;Over the next year, I kept &lt;a
href="/blog/asahi-gpu-part-5.html"&gt;reverse-engineering&lt;/a&gt; and improving
the driver until &lt;a href="/blog/asahi-gpu-part-6.html"&gt;it could run 3D
games on macOS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Meanwhile, &lt;a href="https://lina.yt"&gt;Asahi Lina&lt;/a&gt; wrote a kernel
driver for the Apple GPU. My userspace OpenGL driver ran on macOS,
leaving her kernel driver as the missing piece for an open source
graphics stack. In December 2022, we &lt;a
href="/blog/asahi-gpu-part-7.html"&gt;shipped graphics acceleration in
Asahi Linux&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In January 2023, I started my final semester in my Computer Science
program at the &lt;a href="https://www.utoronto.ca/"&gt;University of
Toronto&lt;/a&gt;. For years I juggled my courses with my part-time job and my
hobby driver. I faced the same question as my peers: what will I do
after graduation?&lt;/p&gt;
&lt;p&gt;Maybe Panfrost? I started reverse-engineering the Mali Midgard GPU
back in 2017, when I was still in high school. That led to an internship
at Collabora in 2019 once I graduated, turning into my job throughout
four years of university. During that time, Panfrost grew from a kid’s
pet project based on blackbox reverse-engineering, to a professional
driver engineered by a team with Arm’s backing and hardware
documentation. I did what I set out to do, and the project succeeded
beyond my dreams. &lt;a href="/blog/passing-reins-panfrost.html"&gt;It was
time to move on&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What &lt;em&gt;did&lt;/em&gt; I want to do next?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Finish what I started with the M1. Ship a great driver.&lt;/li&gt;
&lt;li&gt;Bring full, conformant OpenGL drivers to the M1. Apple’s drivers are
not conformant, but we should strive for the industry standard.&lt;/li&gt;
&lt;li&gt;Bring full, conformant Vulkan to Apple platforms, disproving the
myth that Vulkan isn’t suitable for Apple hardware.&lt;/li&gt;
&lt;li&gt;Bring Proton gaming to Asahi Linux. Thanks to Valve’s work for the
Steam Deck, Windows games can run better on Linux than even on Windows.
Why not reap those benefits on the M1?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Panfrost was my challenge until we “won”. My next challenge? Gaming
on Linux on M1.&lt;/p&gt;
&lt;p&gt;Once I finished my coursework, I started full-time on gaming on
Linux. Within a month, we shipped &lt;a
href="/blog/opengl3-on-asahi-linux.html"&gt;OpenGL 3.1 on Asahi Linux&lt;/a&gt;.
A few weeks later, we passed &lt;a
href="/blog/first-conformant-m1-gpu-driver.html"&gt;official conformance
for OpenGL ES 3.1&lt;/a&gt;. That put us at feature parity with Panfrost. I
wanted to go further.&lt;/p&gt;
&lt;p&gt;OpenGL (ES) 3.2 requires geometry shaders, a legacy feature not
supported by either Arm or Apple hardware. The proprietary OpenGL
drivers emulate geometry shaders with compute, but there was no open
source prior art to borrow. Even though multiple Mesa drivers need
geometry/tessellation emulation, nobody had done the work to get there.&lt;/p&gt;
&lt;p&gt;My early progress on OpenGL was fast thanks to the mature common code
in Mesa. It was time to pay it forward. Over the rest of the year, I
implemented geometry/tessellation shader emulation. And also the rest of
the owl. In January 2024, I passed conformance for the full &lt;a
href="/blog/conformant-gl46-on-the-m1.html"&gt;OpenGL 4.6&lt;/a&gt;
specification, finishing up OpenGL.&lt;/p&gt;
&lt;p&gt;Vulkan wasn’t too bad, either. I polished the OpenGL driver for a few
months, but once I started typing a Vulkan driver, I passed &lt;a
href="/blog/vk13-on-the-m1-in-1-month.html"&gt;1.3 conformance&lt;/a&gt; in a few
weeks.&lt;/p&gt;
&lt;p&gt;What remained was wiring up the geometry/tessellation emulation to my
shiny new Vulkan driver, since those are required for Direct3D. Et
voilà, &lt;a href="/blog/aaa-gaming-on-m1.html"&gt;Proton games&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Along the way, &lt;a href="https://chaos.social/@karolherbst"&gt;Karol
Herbst&lt;/a&gt; passed OpenCL 3.0 conformance on the M1, running my compiler
atop his “rusticl” frontend.&lt;/p&gt;
&lt;p&gt;Meanwhile, when the Vulkan 1.4 specification was published, we were
ready and &lt;a href="/blog/vulkan-14-sur-asahi-linux.html"&gt;shipped a
conformant implementation on the same day&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After that, I implemented sparse texture support, unlocking Direct3D
12 via Proton.&lt;/p&gt;
&lt;p&gt;…Now what?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ship a great driver? &lt;strong&gt;Check&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conformant OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0?
&lt;strong&gt;Check&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conformant Vulkan 1.4? &lt;strong&gt;Check&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proton gaming? &lt;strong&gt;Check&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a wrap.&lt;/p&gt;
&lt;p&gt;We’ve succeeded beyond my dreams. The challenges I chased, I have
tackled. The drivers are fully upstream in Mesa. Performance isn’t too
bad. With the Vulkan on Apple myth busted, conformant Vulkan is now
coming to macOS via &lt;a
href="https://www.lunarg.com/a-vulkan-on-metal-mesa-3d-graphics-driver/"&gt;LunarG’s
KosmicKrisp&lt;/a&gt; project building on my work.&lt;/p&gt;
&lt;p&gt;Satisfied, I am now stepping away from the Apple ecosystem. My
friends in the Asahi Linux orbit will carry the torch from here. As for
me?&lt;/p&gt;
&lt;p&gt;&lt;a
href="https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-the-xe-hpg-architecture.html"&gt;Onto
the next challenge!&lt;/a&gt;&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/asahi-gpu-part-n.html</guid><pubDate>Tue, 26 Aug 2025 00:00:00 -0500</pubDate></item><item><title>Vulkan 1.4 sur Asahi Linux</title><link>https://alyssarosenzweig.ca/blog/vulkan-14-sur-asahi-linux.html</link><description>&lt;p&gt;&lt;em&gt;English version follows.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Aujourd’hui, &lt;a href="https://www.khronos.org/"&gt;Khronos Group&lt;/a&gt; a
sorti la spécification 1.4 de l’API graphique standard Vulkan. Le projet
&lt;a href="https://asahilinux.org/"&gt;Asahi Linux&lt;/a&gt; est fier d’annoncer le
premier pilote Vulkan 1.4 pour le matériel d’Apple. En effet, notre
pilote graphique &lt;a
href="/blog/vk13-on-the-m1-in-1-month.html"&gt;Honeykrisp&lt;/a&gt; est &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products#submission_812"&gt;reconnu
par Khronos&lt;/a&gt; comme conforme à cette nouvelle version dès
aujourd’hui.&lt;/p&gt;
&lt;p&gt;Ce pilote est déjà disponible dans nos dépôts officiels. Après avoir
installé Fedora Asahi Remix, exécutez
&lt;code style="white-space:nowrap"&gt;dnf upgrade --refresh&lt;/code&gt; pour
obtenir la dernière version du pilote.&lt;/p&gt;
&lt;p&gt;Vulkan 1.4 standardise plusieurs fonctionnalités importantes, y
compris les horodatages et la lecture locale avec le rendu dynamique.
L’industrie suppose que ces fonctionnalités devront être plus courantes,
et nous y sommes préparés.&lt;/p&gt;
&lt;p&gt;Sortir un pilote conforme reflète notre engagement en faveur des
standards graphiques et du logiciel libre. Asahi Linux est aussi
compatible avec &lt;a href="/blog/conformant-gl46-on-the-m1.html"&gt;OpenGL
4.6&lt;/a&gt;, OpenGL ES 3.2, et OpenCL 3.0, tous conformes aux spécifications
pertinentes. D’ailleurs, les nôtres sont les seuls pilotes conformes
pour le matériel d’Apple de n’importe quel standard graphique.&lt;/p&gt;
&lt;p&gt;Même si le pilote est sorti, il faut encore compiler une version
expérimentale de Vulkan-Loader pour utiliser la nouvelle version de
Vulkan. Toutes les nouvelles fonctionnalités sont néanmoins disponibles
comme extensions à notre pilote Vulkan 1.3 pour en profiter tout de
suite.&lt;/p&gt;
&lt;p&gt;Pour plus d’informations, &lt;a
href="https://www.khronos.org/news/press/khronos-streamlines-development-and-deployment-of-gpu-accelerated-applications-with-vulkan-1.4"&gt;consultez
l’article du blog de Khronos&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Today, the &lt;a href="https://www.khronos.org/"&gt;Khronos Group&lt;/a&gt;
released the 1.4 specification of Vulkan, the standard graphics API. The
&lt;a href="https://asahilinux.org/"&gt;Asahi Linux&lt;/a&gt; project is proud to
announce the first Vulkan 1.4 driver for Apple hardware. Our &lt;a
href="/blog/vk13-on-the-m1-in-1-month.html"&gt;Honeykrisp&lt;/a&gt; driver is &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products#submission_812"&gt;Khronos-recognized&lt;/a&gt;
as conformant to the new version since day one.&lt;/p&gt;
&lt;p&gt;That driver is already available in our official repositories. After
installing Fedora Asahi Remix, run &lt;code style="white-space:nowrap"&gt;dnf
upgrade --refresh&lt;/code&gt; to get the latest drivers.&lt;/p&gt;
&lt;p&gt;Vulkan 1.4 standardizes several important features, including
timestamps and dynamic rendering local read. The industry expects that
these features will become more common, and we are prepared.&lt;/p&gt;
&lt;p&gt;Releasing a conformant driver reflects our commitment to graphics
standards and software freedom. Asahi Linux is also compatible with
OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the
relevant specifications. For that matter, ours are the only conformant
drivers on Apple hardware for any graphics standard.&lt;/p&gt;
&lt;p&gt;Although the driver is released, you still need to build an
experimental version of Vulkan-Loader to access the new Vulkan version.
Nevertheless, you can immediately use all the new features as extensions
in our Vulkan 1.3 driver.&lt;/p&gt;
&lt;p&gt;For more information, &lt;a
href="https://www.khronos.org/news/press/khronos-streamlines-development-and-deployment-of-gpu-accelerated-applications-with-vulkan-1.4"&gt;see
the Khronos blog post&lt;/a&gt;.&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/vulkan-14-sur-asahi-linux.html</guid><pubDate>Mon, 02 Dec 2024 00:00:00 -0500</pubDate></item><item><title>AAA gaming on Asahi Linux</title><link>https://alyssarosenzweig.ca/blog/aaa-gaming-on-m1.html</link><description>&lt;p&gt;Gaming on Linux on M1 is here! We’re thrilled to release our Asahi
game playing toolkit, which integrates our Vulkan 1.3 drivers with x86
emulation and Windows compatibility. Plus a bonus: conformant OpenCL
3.0.&lt;/p&gt;
&lt;p&gt;Asahi Linux now ships the only conformant &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_3470"&gt;OpenGL®&lt;/a&gt;,
&lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opencl#submission_433"&gt;OpenCL™&lt;/a&gt;,
and &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products#submission_7910"&gt;Vulkan®&lt;/a&gt;
drivers for this hardware. As for gaming… while today’s release is an
alpha, &lt;a
href="https://store.steampowered.com/app/870780/Control_Ultimate_Edition/"&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/a&gt;
runs well!&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/Control-small.png"&gt;&lt;img src="/blog/Games-Asahi/Control-small.avif" alt="Control"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;h2 id="installation"&gt;Installation&lt;/h2&gt;
&lt;p&gt;First, install &lt;a href="https://asahilinux.org/fedora/"&gt;Fedora Asahi
Remix&lt;/a&gt;. Once installed, get the latest drivers with
&lt;code style="white-space:nowrap"&gt;dnf upgrade --refresh &amp;amp;&amp;amp;
reboot&lt;/code&gt;. Then just &lt;code
style="white-space:nowrap"&gt;dnf install steam&lt;/code&gt; and play. While all
M1/M2-series systems work, most games require 16 GB of memory due to
emulation overhead.&lt;/p&gt;
&lt;h2 id="the-stack"&gt;The stack&lt;/h2&gt;
&lt;p&gt;Games are typically x86 Windows binaries rendering with DirectX,
while our target is Arm Linux with Vulkan. We need to handle each
difference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fex-emu.com/"&gt;FEX&lt;/a&gt; emulates x86 on Arm.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.winehq.org/"&gt;Wine&lt;/a&gt; translates Windows to
Linux.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/doitsujin/dxvk"&gt;DXVK&lt;/a&gt; and &lt;a
href="https://github.com/HansKristian-Work/vkd3d-proton"&gt;vkd3d-proton&lt;/a&gt;
translate DirectX to Vulkan.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There’s one curveball: page size. Operating systems allocate memory
in fixed-size “pages”. If an application expects smaller pages than the
system uses, it will break due to insufficient alignment of
allocations. That’s a problem: x86 expects 4K pages but Apple systems
use 16K pages.&lt;/p&gt;
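&lt;p&gt;To make the alignment problem concrete, here is a small Python sketch. The addresses and the helper name are hypothetical, picked only for illustration: they stand in for page boundaries that a program built for 4K pages might assume. Every one of them is a valid boundary with 4K pages, but only some are with 16K pages.&lt;/p&gt;

```python
PAGE_4K = 4 * 1024
PAGE_16K = 16 * 1024

def is_page_aligned(addr, page_size):
    """True when addr sits exactly on a page boundary (hypothetical helper)."""
    return addr % page_size == 0

# Hypothetical addresses that a program assuming 4K pages might treat
# as page boundaries. They are illustrative, not taken from any game.
addrs = [0x10000, 0x11000, 0x12000, 0x14000]

ok_4k = [is_page_aligned(a, PAGE_4K) for a in addrs]
ok_16k = [is_page_aligned(a, PAGE_16K) for a in addrs]

for addr, a4, a16 in zip(addrs, ok_4k, ok_16k):
    print(hex(addr), "4K boundary:", a4, "16K boundary:", a16)
```

&lt;p&gt;The two middle addresses fall inside a 16K page rather than on its edge, which is the kind of mismatch a 4K guest kernel sidesteps.&lt;/p&gt;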
&lt;p&gt;While Linux can’t mix page sizes between processes, it &lt;em&gt;can&lt;/em&gt;
virtualize another Arm Linux kernel with a different page size. So we
run games inside a tiny virtual machine using &lt;a
href="https://github.com/AsahiLinux/muvm"&gt;muvm&lt;/a&gt;, passing through
devices like the GPU and game controllers. The hardware is happy because
the system is 16K, the game is happy because the virtual machine is 4K,
and you’re happy because you can play &lt;a
href="https://store.steampowered.com/app/377160/Fallout_4/"&gt;&lt;strong&gt;Fallout
4&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/Fallout4-small.png"&gt;&lt;img src="/blog/Games-Asahi/Fallout4-small.avif" alt="Fallout 4"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;h2 id="vulkan"&gt;Vulkan&lt;/h2&gt;
&lt;p&gt;The final piece is an adult-level Vulkan driver, since translating
DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote
&lt;a href="/blog/vk13-on-the-m1-in-1-month.html"&gt;Honeykrisp&lt;/a&gt;, the only
Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support.
Let’s look at some new features.&lt;/p&gt;
&lt;h3 id="tessellation"&gt;Tessellation&lt;/h3&gt;
&lt;p&gt;Tessellation enables games like &lt;a
href="https://store.steampowered.com/app/292030/The_Witcher_3_Wild_Hunt/"&gt;&lt;strong&gt;The
Witcher 3&lt;/strong&gt;&lt;/a&gt; to generate geometry. The M1 has hardware
tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We
must instead tessellate with arcane compute shaders, as detailed in &lt;a
href="https://www.youtube.com/live/pDsksRBLXPk"&gt;today’s talk at
XDC2024&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/Witcher3-small.png"&gt;&lt;img src="/blog/Games-Asahi/Witcher3-small.avif" alt="The Witcher 3"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;h3 id="geometry-shaders"&gt;Geometry shaders&lt;/h3&gt;
&lt;p&gt;Geometry shaders are an older, cruder method to generate geometry.
As with tessellation, we emulate them with compute, since the M1 lacks
geometry shader hardware. Is that fast? No, but geometry shaders are slow &lt;a
href="http://www.joshbarczak.com/blog/?p=667"&gt;even on desktop GPUs&lt;/a&gt;.
They don’t need to be fast – just fast enough for games like &lt;a
href="https://store.steampowered.com/app/1139900/Ghostrunner/"&gt;&lt;strong&gt;Ghostrunner&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/Ghostrunner-small.png"&gt;&lt;img src="/blog/Games-Asahi/Ghostrunner-small.avif" alt="Ghostrunner"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;h3 id="enhanced-robustness"&gt;Enhanced robustness&lt;/h3&gt;
&lt;p&gt;“Robustness” permits an application’s shaders to access buffers
out-of-bounds without crashing the hardware. In OpenGL and Vulkan,
out-of-bounds loads may return arbitrary elements, and out-of-bounds
stores may corrupt the buffer. Our OpenGL driver &lt;a
href="/blog/conformant-gl46-on-the-m1.html"&gt;exploits this definition&lt;/a&gt;
for efficient robustness on the M1.&lt;/p&gt;
&lt;p&gt;Some games require stronger guarantees. In DirectX, out-of-bounds
loads return zero, and out-of-bounds stores are ignored. DXVK therefore
requires &lt;a
href="https://docs.vulkan.org/guide/latest/robustness.html#_vk_ext_robustness2"&gt;&lt;code&gt;VK_EXT_robustness2&lt;/code&gt;&lt;/a&gt;,
a Vulkan extension strengthening robustness.&lt;/p&gt;
&lt;p&gt;Like before, we implement robustness with compare-and-select
instructions. A naïve implementation would &lt;em&gt;compare&lt;/em&gt; a loaded
index with the buffer size and &lt;em&gt;select&lt;/em&gt; a zero result if
out-of-bounds. However, our GPU loads are vector while arithmetic is
scalar. Even if we disabled page faults, we would need up to four
compare-and-selects per load.&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb1-1"&gt;&lt;a href="#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt; R&lt;span class="op"&gt;,&lt;/span&gt; buffer&lt;span class="op"&gt;,&lt;/span&gt; index &lt;span class="op"&gt;*&lt;/span&gt; &lt;span class="dv"&gt;16&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-2"&gt;&lt;a href="#cb1-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; size&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-3"&gt;&lt;a href="#cb1-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; size&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-4"&gt;&lt;a href="#cb1-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;2&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; size&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;2&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-5"&gt;&lt;a href="#cb1-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;3&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; size&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;3&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There’s a trick: reserve &lt;em&gt;64 gigabytes&lt;/em&gt; of zeroes using
virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in
64 gigabytes, any index into this region loads zeroes. For out-of-bounds
loads, we simply replace the buffer address with the reserved address
while preserving the index. Replacing a 64-bit address costs just two
32-bit compare-and-selects.&lt;/p&gt;
&lt;div class="sourceCode" id="cb2"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb2-1"&gt;&lt;a href="#cb2-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel buffer&lt;span class="op"&gt;.&lt;/span&gt;lo&lt;span class="op"&gt;,&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; size&lt;span class="op"&gt;,&lt;/span&gt; buffer&lt;span class="op"&gt;.&lt;/span&gt;lo&lt;span class="op"&gt;,&lt;/span&gt; RESERVED&lt;span class="op"&gt;.&lt;/span&gt;lo&lt;/span&gt;
&lt;span id="cb2-2"&gt;&lt;a href="#cb2-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel buffer&lt;span class="op"&gt;.&lt;/span&gt;hi&lt;span class="op"&gt;,&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; size&lt;span class="op"&gt;,&lt;/span&gt; buffer&lt;span class="op"&gt;.&lt;/span&gt;hi&lt;span class="op"&gt;,&lt;/span&gt; RESERVED&lt;span class="op"&gt;.&lt;/span&gt;hi&lt;/span&gt;
&lt;span id="cb2-3"&gt;&lt;a href="#cb2-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt; R&lt;span class="op"&gt;,&lt;/span&gt; buffer&lt;span class="op"&gt;,&lt;/span&gt; index &lt;span class="op"&gt;*&lt;/span&gt; &lt;span class="dv"&gt;16&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Two instructions, not four.&lt;/p&gt;
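&lt;p&gt;To sanity-check the arithmetic, here is a minimal Python sketch of the address selection. The base addresses and helper names are made up for illustration, and the in-bounds test is an assumption about what the comparison checks; none of this is the driver’s actual code.&lt;/p&gt;

```python
GIB = 2 ** 30
STRIDE = 16                  # bytes per element, from "index * 16"
MAX_INDEX = 2 ** 32 - 1      # largest 32-bit index
RESERVED_BYTES = 64 * GIB    # size of the all-zero region

# Hypothetical 64-bit base addresses, chosen only for illustration.
BUFFER_BASE = 0x1_0000_0000
RESERVED_BASE = 0x40_0000_0000

def select_base(index, size):
    """Model the two 32-bit selects: keep the buffer address when the
    index is in bounds, otherwise swap in the reserved address."""
    in_bounds = size > index  # assumed meaning of the comparison
    chosen = BUFFER_BASE if in_bounds else RESERVED_BASE
    lo = chosen % 2 ** 32     # first select picks the low half
    hi = chosen // 2 ** 32    # second select picks the high half
    return hi * 2 ** 32 + lo

def load_address(index, size):
    """Address the single vector load reads from; the index is preserved."""
    return select_base(index, size) + index * STRIDE

# The last element reachable by a 32-bit index ends exactly at 64 GiB.
assert MAX_INDEX * STRIDE + STRIDE == RESERVED_BYTES

print(hex(load_address(3, 8)))  # in bounds: lands in the real buffer
print(hex(load_address(8, 8)))  # out of bounds: lands in the zero region
```

&lt;p&gt;Whatever the index, an out-of-bounds load stays inside the reserved region, so it can only ever read zeroes.&lt;/p&gt;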
&lt;h2 id="next-steps"&gt;Next steps&lt;/h2&gt;
&lt;p&gt;Sparse texturing is next for Honeykrisp, which will unlock more DX12
games. The alpha already runs DX12 games that don’t require sparse, like
&lt;a
href="https://store.steampowered.com/app/1091500/Cyberpunk_2077/"&gt;&lt;strong&gt;Cyberpunk
2077&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/Cyberpunk2077-small.png"&gt;&lt;img src="/blog/Games-Asahi/Cyberpunk2077-small.avif" alt="Cyberpunk 2077"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;p&gt;While many games are playable, newer AAA titles don’t hit 60fps
&lt;em&gt;yet&lt;/em&gt;. Correctness comes first. Performance improves next. Indie
games like &lt;a
href="https://store.steampowered.com/app/367520/Hollow_Knight/"&gt;&lt;strong&gt;Hollow
Knight&lt;/strong&gt;&lt;/a&gt; do run full speed.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/HollowKnight-small.png"&gt;&lt;img src="/blog/Games-Asahi/HollowKnight-small.avif" alt="Hollow Knight"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;p&gt;Beyond gaming, we’re adding general purpose x86 emulation based on
this stack. For more information, &lt;a
href="https://docs.fedoraproject.org/en-US/fedora-asahi-remix/x86-support/"&gt;see
the FAQ&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Today’s alpha is a taste of what’s to come. Not the final form, but
enough to enjoy &lt;a
href="https://store.steampowered.com/app/620/Portal_2/"&gt;&lt;strong&gt;Portal
2&lt;/strong&gt;&lt;/a&gt; while we work towards “1.0”.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/blog/Games-Asahi/Portal2-small.png"&gt;&lt;img src="/blog/Games-Asahi/Portal2-small.avif" alt="Portal 2"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;h2 id="acknowledgements"&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;This work has been years in the making with major contributions
from…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/"&gt;Alyssa Rosenzweig&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lina.yt/me"&gt;Asahi Lina&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a
href="https://social.treehouse.systems/@chaos_princess"&gt;chaos_princess&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/davide125"&gt;Davide Cavalca&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastodon.social/@dougall"&gt;Dougall Johnson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ella.gay"&gt;Ella Stanforth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gfxstrand.net/faith/welcome/"&gt;Faith
Ekstrand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://social.treehouse.systems/@janne"&gt;Janne
Grunau&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chaos.social/@karolherbst"&gt;Karol Herbst&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://social.treehouse.systems/@marcan"&gt;marcan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mary.zone"&gt;Mary Guillemard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://neal.gompa.dev/"&gt;Neal Gompa&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sinrega.org"&gt;Sergio López&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/TellowKrinkle"&gt;TellowKrinkle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/teohhanhui"&gt;Teoh Han Hui&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastodon.gamedev.place/@robclark"&gt;Rob
Clark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sonicadvance1"&gt;Ryan Houdek&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;… Plus hundreds of developers whose work we build upon, spanning the
Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the
magic of open source.&lt;/p&gt;
&lt;p&gt;We hope you enjoy the magic.&lt;/p&gt;
&lt;p&gt;Happy gaming.&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/aaa-gaming-on-m1.html</guid><pubDate>Thu, 10 Oct 2024 00:00:00 -0500</pubDate></item><item><title>Vulkan 1.3 on the M1 in 1 month</title><link>https://alyssarosenzweig.ca/blog/vk13-on-the-m1-in-1-month.html</link><description>&lt;style&gt;u{text-decoration-thickness:0.09em;text-decoration-color:skyblue}&lt;/style&gt;
&lt;p&gt;Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is
the first &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/vulkan#submission_780"&gt;conformant
Vulkan®&lt;/a&gt; for Apple hardware on any operating system, implementing the
full 1.3 spec without “portability” waivers.&lt;/p&gt;
&lt;p&gt;Honeykrisp is &lt;strong&gt;not yet released&lt;/strong&gt; for end users. We’re
continuing to add features, improve performance, and port to more
hardware. &lt;a
href="https://gitlab.freedesktop.org/alyssa/mesa/-/tree/honeykrisp-20240506-2/src/asahi/vulkan?ref_type=heads"&gt;Source
code&lt;/a&gt; is available for developers.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/holocure.png"&gt;&lt;img src="/holocure.avif" alt="HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton."&gt;&lt;/a&gt;
&lt;figcaption aria-hidden="true"&gt;
&lt;a href="https://kay-yu.itch.io/holocure"&gt;HoloCure&lt;/a&gt; running on
Honeykrisp ft. DXVK, FEX, and Proton.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Honeykrisp is not based on prior M1 Vulkan efforts, but rather &lt;a
href="https://mastodon.gamedev.place/@gfxstrand"&gt;Faith Ekstrand&lt;/a&gt;’s
open source &lt;a
href="https://www.collabora.com/news-and-blog/news-and-events/introducing-nvk.html"&gt;NVK
driver&lt;/a&gt; for NVIDIA GPUs. In her words:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan
driver and started by copying+pasting from it. My hope is that NVK will
eventually become the driver that everyone copies and pastes from. To
that end, I’m building NVK with all the best practices we’ve developed
for Vulkan drivers over the last 7.5 years and trying to keep the
code-base clean and well-organized.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Why spend years implementing features from scratch when we can reuse
NVK? There will be friction starting out, given NVIDIA’s desktop
architecture differs from the M1’s mobile roots. In exchange, we get a
modern driver designed for desktop games.&lt;/p&gt;
&lt;p&gt;We’ll need to pass a half-million tests ensuring correctness, &lt;a
href="https://www.khronos.org/conformance/adopters"&gt;submit the
results&lt;/a&gt;, and then we’ll become conformant after 30 days of industry
review. Starting from NVK and our OpenGL 4.6 driver… can we write a
driver passing the Vulkan 1.3 conformance test suite &lt;em&gt;faster&lt;/em&gt;
than the 30 day review period?&lt;/p&gt;
&lt;p&gt;It’s unprecedented…&lt;/p&gt;
&lt;p&gt;Challenge accepted.&lt;/p&gt;
&lt;h3 id="april-2"&gt;April 2&lt;/h3&gt;
&lt;p&gt;It begins with a text.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Faith… I think I want to write a Vulkan driver.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Her advice?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Just start typing.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There’s no copy-pasting yet – we just add M1 code to NVK and remove
NVIDIA as we go. Since the kernel mediates our access to the hardware,
we begin connecting “NVK” to &lt;a href="https://vt.social/@lina"&gt;Asahi
Lina&lt;/a&gt;’s kernel driver using code shared with OpenGL. Then we plug in
our shader compiler and hit the hay.&lt;/p&gt;
&lt;h3 id="april-3"&gt;April 3&lt;/h3&gt;
&lt;p&gt;To access resources, GPUs use “descriptors” containing the address,
format, and size of a resource. Vulkan bundles descriptors into “sets”
per the application’s “descriptor set layout”. When compiling shaders,
the driver lowers descriptor accesses to marry the set layout with the
hardware’s data structures. As our descriptors differ from NVIDIA’s, our
next task is adapting NVK’s descriptor set lowering. We start with a
simple but correct approach, deleting far more code than we add.&lt;/p&gt;
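&lt;p&gt;As a rough illustration of what “lowering” means here, the sketch below resolves a binding and array element to a slot within a set, given a toy layout. The &lt;code&gt;toy_&lt;/code&gt; names and the flat, tightly packed layout are assumptions made for this example, not the actual data structures of Honeykrisp or NVK.&lt;/p&gt;

```c
/* Hypothetical sketch: flatten (binding, element) into a slot in a set.
 * Real drivers use richer layouts; every name here is invented. */
enum { TOY_MAX_BINDINGS = 8 };

struct toy_set_layout {
   /* Number of descriptors (array size) in each binding */
   unsigned array_size[TOY_MAX_BINDINGS];
};

/* Slot of a descriptor within its set: the sizes of all earlier
 * bindings, plus the array element being accessed. */
unsigned
toy_slot_in_set(const struct toy_set_layout *layout, unsigned binding,
                unsigned element)
{
   unsigned slot = 0;

   for (unsigned b = 0; b != binding; ++b)
      slot += layout->array_size[b];

   return slot + element;
}
```

&lt;p&gt;A compiler pass in this spirit would replace each abstract descriptor access in the shader with a load from the computed slot.&lt;/p&gt;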
&lt;h3 id="april-4"&gt;April 4&lt;/h3&gt;
&lt;p&gt;With working descriptors, we can compile compute shaders. Now we
program the fixed-function hardware to dispatch compute. We first add
bookkeeping to map Vulkan command buffers to lists of M1 “control
streams”, then we generate a compute control stream. We copy that code
from our OpenGL driver, translate the GL into Vulkan, and compute
works.&lt;/p&gt;
&lt;p&gt;That’s enough to move on to “copies” of buffers and images. We
implement Vulkan’s copies with compute shaders, internally dispatched
with Vulkan commands as if we were the application. The first copy test
passes.&lt;/p&gt;
&lt;h3 id="april-5"&gt;April 5&lt;/h3&gt;
&lt;p&gt;Fleshing out yesterday’s code, &lt;em&gt;all&lt;/em&gt; copy tests pass.&lt;/p&gt;
&lt;h3 id="april-6"&gt;April 6&lt;/h3&gt;
&lt;p&gt;We’re ready to tackle graphics. The novelty is handling graphics
state like depth/stencil. That’s straightforward, but there’s a
&lt;em&gt;lot&lt;/em&gt; of state to handle. Faith’s code collects all “dynamic
state” into a single structure, which we translate into hardware control
words. As usual, we grab that translation from our OpenGL driver, blend
with NVK, and move on.&lt;/p&gt;
&lt;h3 id="april-7"&gt;April 7&lt;/h3&gt;
&lt;p&gt;What makes state “dynamic”? Dynamic state can change without
recompiling shaders. By contrast, static state is baked into shader
binaries called “pipelines”. If games create all their pipelines during
a loading screen, there is no compiler “stutter” during gameplay. The
idea hasn’t quite panned out: many game developers don’t know their
state ahead-of-time so cannot create pipelines early. In response,
Vulkan has &lt;u&gt;&lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state.html"&gt;made&lt;/a&gt;&lt;/u&gt;
&lt;u&gt;&lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state2.html"&gt;ever&lt;/a&gt;&lt;/u&gt;
&lt;u&gt;&lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state3.html"&gt;more&lt;/a&gt;&lt;/u&gt;
&lt;u&gt;&lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_vertex_input_dynamic_state.html"&gt;state&lt;/a&gt;&lt;/u&gt;
&lt;u&gt;&lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_graphics_pipeline_library.html"&gt;dynamic&lt;/a&gt;&lt;/u&gt;,
punctuated with the &lt;a
href="https://www.khronos.org/blog/you-can-use-vulkan-without-pipelines-today"&gt;&lt;code&gt;EXT_shader_object&lt;/code&gt;&lt;/a&gt;
extension that makes pipelines &lt;em&gt;optional&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;We want full dynamic state and shader objects. Unfortunately, the M1
bakes random state into shaders: vertex attributes, fragment outputs,
blending, even linked interpolation qualifiers. Like most of the
industry in the 2010s, the M1’s designers bet on pipelines.&lt;/p&gt;
&lt;p&gt;Faced with this hardware, a reasonable driver developer would
double down on pipelines. DXVK would stutter, but we’d pass
conformance.&lt;/p&gt;
&lt;p&gt;I am not reasonable.&lt;/p&gt;
&lt;p&gt;To eliminate stuttering in OpenGL, we make state dynamic with four
strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Conditional code.&lt;/li&gt;
&lt;li&gt;Precompiled variants.&lt;/li&gt;
&lt;li&gt;Indirection.&lt;/li&gt;
&lt;li&gt;Prologs and epilogs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Wait, what-a-logs?&lt;/p&gt;
&lt;p&gt;AMD also bakes state into shaders… with a twist. They divide the
hardware binary into three parts: a &lt;em&gt;prolog&lt;/em&gt;, the shader, and an
&lt;em&gt;epilog&lt;/em&gt;. Confining dynamic state to the periphery eliminates
shader variants. They compile prologs and epilogs on the fly, but that’s
fast and doesn’t stutter. Linking shader parts is a quick concatenation,
or long jumps avoid linking altogether. This strategy works for the M1,
too.&lt;/p&gt;
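&lt;p&gt;To make “a quick concatenation” concrete, here is a minimal sketch that treats each shader part as a plain byte array and links by copying the three parts back to back. The &lt;code&gt;toy_&lt;/code&gt; names and the byte-array representation are assumptions for illustration; real binaries may need fix-ups that this ignores.&lt;/p&gt;

```c
/* Hypothetical sketch of linking by concatenation. Binaries are plain
 * byte arrays here; names and layout are invented for illustration. */
struct toy_part {
   const unsigned char *code;
   unsigned size;
};

/* Append one part to the output at the given offset, returning the
 * offset just past the copied bytes. */
static unsigned
toy_append(unsigned char *out, unsigned offset, struct toy_part part)
{
   for (unsigned i = 0; i != part.size; ++i)
      out[offset + i] = part.code[i];

   return offset + part.size;
}

/* Link prolog, main shader, and epilog back to back. The caller must
 * provide room for the sum of the three sizes. Returns the total size. */
unsigned
toy_link(unsigned char *out, struct toy_part prolog, struct toy_part shader,
         struct toy_part epilog)
{
   unsigned offset = 0;

   offset = toy_append(out, offset, prolog);
   offset = toy_append(out, offset, shader);
   offset = toy_append(out, offset, epilog);
   return offset;
}
```

&lt;p&gt;Only the prolog and epilog depend on dynamic state, so in this picture the main shader is compiled once and reused for every state combination.&lt;/p&gt;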
&lt;p&gt;For Honeykrisp, let’s follow NVK’s lead and treat &lt;em&gt;all&lt;/em&gt; state
as dynamic. No other Vulkan driver has implemented full dynamic state
and shader objects this early on, but it avoids refactoring later. Today
we add the code to build, compile, and cache prologs and epilogs.&lt;/p&gt;
&lt;p&gt;Putting it together, we get a (dynamic) triangle:&lt;/p&gt;
&lt;p&gt;&lt;a href="/hk-triangle.png"&gt;&lt;img src="/hk-triangle.avif"
alt="Classic rainbow triangle" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="april-8"&gt;April 8&lt;/h3&gt;
&lt;p&gt;Guided by the list of failing tests, we wire up the little bits
missed along the way, like translating border colours.&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb1-1"&gt;&lt;a href="#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="co"&gt;/* Translate an American VkBorderColor into a Canadian agx_border_colour */&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-2"&gt;&lt;a href="#cb1-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="kw"&gt;enum&lt;/span&gt; agx_border_colour&lt;/span&gt;
&lt;span id="cb1-3"&gt;&lt;a href="#cb1-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;translate_border_color&lt;span class="op"&gt;(&lt;/span&gt;VkBorderColor color&lt;span class="op"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-4"&gt;&lt;a href="#cb1-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-5"&gt;&lt;a href="#cb1-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;   &lt;span class="cf"&gt;switch&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;color&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-6"&gt;&lt;a href="#cb1-6" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;   &lt;span class="cf"&gt;case&lt;/span&gt; VK_BORDER_COLOR_INT_TRANSPARENT_BLACK&lt;span class="op"&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-7"&gt;&lt;a href="#cb1-7" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;      &lt;span class="cf"&gt;return&lt;/span&gt; AGX_BORDER_COLOUR_TRANSPARENT_BLACK&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-8"&gt;&lt;a href="#cb1-8" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;   &lt;span class="op"&gt;...&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-9"&gt;&lt;a href="#cb1-9" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;   &lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-10"&gt;&lt;a href="#cb1-10" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Test results are getting there.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pass&lt;/strong&gt;: 149770, &lt;strong&gt;Fail&lt;/strong&gt;: 7741,
&lt;strong&gt;Crash&lt;/strong&gt;: 2396&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That’s good enough for &lt;a
href="https://github.com/Novum/vkQuake"&gt;vkQuake&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="/vkquake.png"&gt;&lt;img src="/vkquake.avif"
alt="Vulkan port of Quake running on Honeykrisp" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;/p&gt;
&lt;h3 id="april-9"&gt;April 9&lt;/h3&gt;
&lt;p&gt;Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1.
Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to
the finish line.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pass&lt;/strong&gt;: 255209, &lt;strong&gt;Fail&lt;/strong&gt;: 3818,
&lt;strong&gt;Crash&lt;/strong&gt;: 599&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;98.3% pass rate for 1.3 on our one-week anniversary.&lt;/p&gt;
&lt;p&gt;Not bad.&lt;/p&gt;
&lt;h3 id="april-10"&gt;April 10&lt;/h3&gt;
&lt;p&gt;SuperTuxKart has a Vulkan renderer.&lt;/p&gt;
&lt;p&gt;&lt;a href="/hkr-stk.png"&gt;&lt;img src="/hkr-stk.avif"
alt="SuperTuxKart rendering with Honeykrisp, showing Pepper (from Pepper and Carrot) riding her broomstick in the STK Enterprise" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="april-11"&gt;April 11&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://docs.mesa3d.org/drivers/zink.html"&gt;Zink&lt;/a&gt; works
too.&lt;/p&gt;
&lt;p&gt;&lt;a href="/hkr-stk-zink.png"&gt;&lt;img src="/hkr-stk-zink.avif"
alt="SuperTuxKart rendering with Zink on Honeykrisp, same scene but with better lighting" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="april-12"&gt;April 12&lt;/h3&gt;
&lt;p&gt;I tracked down some fails to a test bug, where an arbitrary
verification threshold was too strict to pass on some devices. I filed a
bug report, and it’s &lt;a
href="https://github.com/KhronosGroup/VK-GL-CTS/commit/5fd73c841d775dff1ad52d8340d79dc120d64696"&gt;resolved&lt;/a&gt;
within a few weeks.&lt;/p&gt;
&lt;h3 id="april-16"&gt;April 16&lt;/h3&gt;
&lt;p&gt;The tests for “descriptor indexing” revealed a compiler bug affecting
subgroup shuffles in non-uniform control flow. The M1’s shuffle
instruction is quirky, but it’s easy to work around. Fixing that fixes
the descriptor indexing tests.&lt;/p&gt;
&lt;h3 id="april-17"&gt;April 17&lt;/h3&gt;
&lt;p&gt;A few tests crash inside our register allocator. Their shaders
contain a peculiar construction:&lt;/p&gt;
&lt;div class="sourceCode" id="cb2"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb2-1"&gt;&lt;a href="#cb2-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="cf"&gt;if&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;condition&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb2-2"&gt;&lt;a href="#cb2-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;   &lt;span class="cf"&gt;while&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;&lt;span class="kw"&gt;true&lt;/span&gt;&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt; &lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb2-3"&gt;&lt;a href="#cb2-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;condition&lt;/code&gt; is always false, but the compiler doesn’t know
that.&lt;/p&gt;
&lt;p&gt;Infinite loops are nominally invalid since shaders must terminate in
finite time, but this shader is syntactically valid. “All loops contain
a break” seems obvious for a shader, but it’s false. It’s
straightforward to fix register allocation, but what a doozy.&lt;/p&gt;
&lt;h3 id="april-18"&gt;April 18&lt;/h3&gt;
&lt;p&gt;Remember copies? They’re slow, and every frame currently requires a
copy to get on screen.&lt;/p&gt;
&lt;p&gt;For “zero copy” rendering, we need enough Linux window system
integration to negotiate an efficient surface layout across process
boundaries. Linux uses “modifiers” for this purpose, so we implement the
&lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_drm_format_modifier.html"&gt;&lt;code&gt;EXT_image_drm_format_modifier&lt;/code&gt;&lt;/a&gt;
extension. And by implement, I mean copy.&lt;/p&gt;
&lt;p&gt;Copies to avoid copies.&lt;/p&gt;
&lt;h3 id="april-20"&gt;April 20&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux
Vulkan Mac.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“Ma’am, this is a Wendy’s.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="april-22"&gt;April 22&lt;/h3&gt;
&lt;p&gt;As bug fixing slows down, we step back and check our driver
architecture. Since we treat all state as dynamic, we don’t pre-pack
control words during pipeline creation. That adds theoretical CPU
overhead.&lt;/p&gt;
&lt;p&gt;Is that a problem? After some optimization, &lt;a
href="https://github.com/zmike/vkoverhead"&gt;vkoverhead&lt;/a&gt; says we’re
pushing 100 million draws per second.&lt;/p&gt;
&lt;p&gt;I think we’re okay.&lt;/p&gt;
&lt;h3 id="april-24"&gt;April 24&lt;/h3&gt;
&lt;p&gt;Time to light up YCbCr. If we don’t use special YCbCr hardware, this
feature is “software-only”. However, it touches a &lt;em&gt;lot&lt;/em&gt; of
code.&lt;/p&gt;
&lt;p&gt;It touches so much code that &lt;a
href="https://mohamexiety.github.io/posts/final_report/"&gt;Mohamed
Ahmed&lt;/a&gt; spent an entire summer adding it to NVK.&lt;/p&gt;
&lt;p&gt;Which means he spent a summer adding it to Honeykrisp.&lt;/p&gt;
&lt;p&gt;Thanks, Mohamed ;-)&lt;/p&gt;
&lt;h3 id="april-25"&gt;April 25&lt;/h3&gt;
&lt;p&gt;Query copies are next. In Vulkan, the application can query the
number of samples rendered, writing the result into an opaque “query
pool”. The result can be copied from the query pool on the CPU or
GPU.&lt;/p&gt;
&lt;p&gt;For the CPU, the driver maps the pool’s internal data structure and
copies the result. This may require nontrivial repacking.&lt;/p&gt;
&lt;p&gt;For the GPU, we need to repack in a compute shader. That’s harder,
because we can’t just run C code on the GPU, right?&lt;/p&gt;
&lt;p&gt;…Actually, we can.&lt;/p&gt;
&lt;p&gt;A little witchcraft makes GPU query copies as easy as C.&lt;/p&gt;
&lt;div class="sourceCode" id="cb3"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb3-1"&gt;&lt;a href="#cb3-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="dt"&gt;void&lt;/span&gt; copy_query&lt;span class="op"&gt;(&lt;/span&gt;&lt;span class="kw"&gt;struct&lt;/span&gt; params &lt;span class="op"&gt;*&lt;/span&gt;p&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="dt"&gt;int&lt;/span&gt; i&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-2"&gt;&lt;a href="#cb3-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;  &lt;span class="dt"&gt;uintptr_t&lt;/span&gt; dst &lt;span class="op"&gt;=&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;dest &lt;span class="op"&gt;+&lt;/span&gt; i &lt;span class="op"&gt;*&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;stride&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-3"&gt;&lt;a href="#cb3-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;  &lt;span class="dt"&gt;int&lt;/span&gt; query &lt;span class="op"&gt;=&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;first &lt;span class="op"&gt;+&lt;/span&gt; i&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-4"&gt;&lt;a href="#cb3-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb3-5"&gt;&lt;a href="#cb3-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;  &lt;span class="cf"&gt;if&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;available&lt;span class="op"&gt;[&lt;/span&gt;query&lt;span class="op"&gt;]&lt;/span&gt; &lt;span class="op"&gt;||&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;partial&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-6"&gt;&lt;a href="#cb3-6" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="dt"&gt;int&lt;/span&gt; q &lt;span class="op"&gt;=&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;index&lt;span class="op"&gt;[&lt;/span&gt;query&lt;span class="op"&gt;];&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-7"&gt;&lt;a href="#cb3-7" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    write_result&lt;span class="op"&gt;(&lt;/span&gt;dst&lt;span class="op"&gt;,&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;_64&lt;span class="op"&gt;,&lt;/span&gt; p&lt;span class="op"&gt;-&amp;gt;&lt;/span&gt;results&lt;span class="op"&gt;[&lt;/span&gt;q&lt;span class="op"&gt;]);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-8"&gt;&lt;a href="#cb3-8" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;  &lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-9"&gt;&lt;a href="#cb3-9" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb3-10"&gt;&lt;a href="#cb3-10" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;  &lt;span class="op"&gt;...&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-11"&gt;&lt;a href="#cb3-11" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="april-26"&gt;April 26&lt;/h3&gt;
&lt;p&gt;The final boss: border colours, hard mode.&lt;/p&gt;
&lt;p&gt;Direct3D lets the application choose an arbitrary border colour when
creating a sampler. By contrast, Vulkan only requires three border
colours:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;(0, 0, 0, 0)&lt;/code&gt;&lt;/strong&gt; – transparent black&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;(0, 0, 0, 1)&lt;/code&gt;&lt;/strong&gt; – opaque black&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;(1, 1, 1, 1)&lt;/code&gt;&lt;/strong&gt; – opaque white&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We handled these on April 8. Unfortunately, there are two
problems.&lt;/p&gt;
&lt;p&gt;First, we need custom border colours for Direct3D compatibility. Both
&lt;a href="https://github.com/doitsujin/dxvk"&gt;DXVK&lt;/a&gt; and &lt;a
href="https://github.com/HansKristian-Work/vkd3d-proton"&gt;vkd3d-proton&lt;/a&gt;
require the &lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_custom_border_color.html"&gt;&lt;code&gt;EXT_custom_border_color&lt;/code&gt;&lt;/a&gt;
extension.&lt;/p&gt;
&lt;p&gt;Second, there’s a subtle problem with our hardware, causing dozens of
fails even without custom border colours. To understand the issue, let’s
revisit texture descriptors, which contain a pixel &lt;em&gt;format&lt;/em&gt; and a
component reordering &lt;em&gt;swizzle&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Some formats are implicitly reordered. Common “BGRA” formats swap red
and blue for &lt;a
href="https://stackoverflow.com/questions/74924790/why-bgra-instead-of-rgba"&gt;historical
reasons&lt;/a&gt;. The M1 does not directly support these formats. Instead,
the driver composes the swizzle with the format’s reordering. If the
application uses a &lt;code&gt;BARB&lt;/code&gt; swizzle with a &lt;code&gt;BGRA&lt;/code&gt;
format, the driver uses an &lt;code&gt;RABR&lt;/code&gt; swizzle with an
&lt;code&gt;RGBA&lt;/code&gt; format.&lt;/p&gt;
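&lt;p&gt;The composition can be sketched in a few lines of C. This is an illustration, not the driver’s code: swizzles and formats are modelled as four-letter strings, and the &lt;code&gt;toy_&lt;/code&gt; helpers are invented for the example.&lt;/p&gt;

```c
/* Sketch: compose an application swizzle with a format's channel order.
 * "format" lists which application channel each hardware RGBA channel
 * holds, e.g. "BGRA". Only the letters R, G, B, A are handled. */
static char
toy_hw_channel(const char *format, char app_channel)
{
   const char hw[] = "RGBA";

   for (unsigned i = 0; i != 4; ++i) {
      if (format[i] == app_channel)
         return hw[i];
   }

   return app_channel; /* not a colour channel: pass through */
}

/* out receives four channels plus a terminating NUL */
void
toy_compose_swizzle(char out[5], const char *format, const char *swizzle)
{
   for (unsigned i = 0; i != 4; ++i)
      out[i] = toy_hw_channel(format, swizzle[i]);

   out[4] = '\0';
}
```

&lt;p&gt;Composing the &lt;code&gt;BARB&lt;/code&gt; swizzle with the &lt;code&gt;BGRA&lt;/code&gt; order yields &lt;code&gt;RABR&lt;/code&gt;, matching the example above.&lt;/p&gt;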
&lt;p&gt;There’s a catch: swizzles apply to the border colour, but formats do
not. We need to &lt;em&gt;undo&lt;/em&gt; the format reordering when programming the
border colour for correct results after the hardware applies the
composed swizzle. Our OpenGL driver implements border colours this way,
because it knows the texture format when creating the sampler.
Unfortunately, Vulkan doesn’t give us that information.&lt;/p&gt;
&lt;p&gt;Without custom border colour support, we “should” be okay. Swapping
red and blue doesn’t change anything if the colour is white or
black.&lt;/p&gt;
&lt;p&gt;There’s an even &lt;em&gt;subtler&lt;/em&gt; catch. Vulkan mandates support for a
packed 16-bit format with 4-bit components. The M1 supports a similar
format… but with reversed “endianness”, swapping red and
&lt;em&gt;alpha&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;That still seems okay. For transparent black (all zero) and opaque
white (all one), swapping components doesn’t change the result.&lt;/p&gt;
&lt;p&gt;The problem is opaque black: &lt;code style="white-space:nowrap"&gt;(0, 0,
0, 1)&lt;/code&gt;. Swapping red and alpha gives
&lt;code style="white-space:nowrap"&gt;(1, 0, 0, 0)&lt;/code&gt;. Transparent red?
Uh-oh.&lt;/p&gt;
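&lt;p&gt;A tiny sketch makes the failure mechanical. Swapping red and alpha is invisible exactly when the two components are equal, which holds for two of the three required colours. The types and &lt;code&gt;toy_&lt;/code&gt; names below are invented for illustration.&lt;/p&gt;

```c
/* Sketch: the effect of a red/alpha swap on Vulkan's three required
 * border colours. Types and names are invented for illustration. */
struct toy_colour {
   float r, g, b, a;
};

struct toy_colour
toy_swap_red_alpha(struct toy_colour c)
{
   struct toy_colour swapped = { c.a, c.g, c.b, c.r };
   return swapped;
}

/* Nonzero if the swap leaves the colour unchanged, which happens
 * exactly when red equals alpha. */
int
toy_survives_swap(struct toy_colour c)
{
   return c.r == c.a;
}
```

&lt;p&gt;Transparent black and opaque white survive the swap; opaque black does not, and comes out as &lt;code style="white-space:nowrap"&gt;(1, 0, 0, 0)&lt;/code&gt;.&lt;/p&gt;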
&lt;p&gt;We’re stuck. No known hardware configuration implements correct
Vulkan semantics.&lt;/p&gt;
&lt;p&gt;Is hope lost?&lt;/p&gt;
&lt;p&gt;Do we give up?&lt;/p&gt;
&lt;p&gt;A reasonable person would.&lt;/p&gt;
&lt;p&gt;I am not reasonable.&lt;/p&gt;
&lt;p&gt;Let’s jump into the deep end. If we implement custom border colours,
opaque black becomes a special case. But how? The M1’s custom border
colours entangle the texture format with the sampler. A reasonable
person would skip Direct3D support.&lt;/p&gt;
&lt;p&gt;As you know, I am not reasonable.&lt;/p&gt;
&lt;p&gt;Although the hardware is unsuitable, we control software. Whenever a
shader samples a texture, we’ll inject code to fix up the border colour.
This emulation is simple, correct, and slow. We’ll use dirty driver
tricks to speed it up later. For now, we eat the cost, advertise full
custom border colours, and pass the opaque black tests.&lt;/p&gt;
&lt;h3 id="april-27"&gt;April 27&lt;/h3&gt;
&lt;p&gt;All that’s left is some last minute bug fixing, and…&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pass&lt;/strong&gt;: 686930, &lt;strong&gt;Fail&lt;/strong&gt;: 0&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Success.&lt;/p&gt;
&lt;h3 id="the-future"&gt;The future&lt;/h3&gt;
&lt;p&gt;The next task is implementing everything that &lt;a
href="https://github.com/doitsujin/dxvk/blob/master/VP_DXVK_requirements.json"&gt;DXVK&lt;/a&gt;
and &lt;a
href="https://github.com/HansKristian-Work/vkd3d-proton/blob/master/VP_D3D12_VKD3D_PROTON_profile.json"&gt;vkd3d-proton&lt;/a&gt;
require to layer Direct3D. That includes esoteric extensions like &lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_transform_feedback.html"&gt;transform
feedback&lt;/a&gt;. Then &lt;a href="https://www.winehq.org/"&gt;Wine&lt;/a&gt; and an &lt;a
href="https://github.com/FEX-Emu/FEX"&gt;open source x86 emulator&lt;/a&gt; will
run Windows games on &lt;a href="https://asahilinux.org/"&gt;Asahi
Linux&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That’s getting ahead of ourselves. In the meantime, enjoy Linux
games with our &lt;a href="/blog/conformant-gl46-on-the-m1.html"&gt;conformant
OpenGL 4.6&lt;/a&gt; drivers… and stay tuned.&lt;/p&gt;
&lt;figure&gt;
&lt;a href="/babystorm.png"&gt;&lt;img src="/babystorm.avif" alt="Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton."&gt;&lt;/a&gt;
&lt;figcaption aria-hidden="true"&gt;
&lt;a href="https://store.steampowered.com/app/2176400/Baby_Storm/"&gt;Baby
Storm&lt;/a&gt; running on Honeykrisp ft. DXVK, FEX, and Proton.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;hr /&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/vk13-on-the-m1-in-1-month.html</guid><pubDate>Wed, 05 Jun 2024 00:00:00 -0500</pubDate></item><item><title>Conformant OpenGL 4.6 on the M1</title><link>https://alyssarosenzweig.ca/blog/conformant-gl46-on-the-m1.html</link><description>&lt;p&gt;For years, the M1 has only supported OpenGL 4.1. That changes today –
with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! &lt;a
href="https://fedora-asahi-remix.org/"&gt;Install Fedora&lt;/a&gt; for the latest
M1/M2-series drivers.&lt;/p&gt;
&lt;p&gt;Already installed? Just &lt;code style="white-space:nowrap"&gt;dnf upgrade
--refresh&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Unlike the vendor’s non-conformant 4.1 drivers, our &lt;a
href="https://gitlab.freedesktop.org/asahi/mesa"&gt;open source&lt;/a&gt; Linux
drivers are &lt;strong&gt;conformant&lt;/strong&gt; to the latest OpenGL versions,
finally promising broad compatibility with modern OpenGL workloads, like
&lt;a href="https://www.blender.org/"&gt;Blender&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="/Blender-Wanderer-high.avif"&gt;&lt;img title="Screenshot of Blender running on Apple M1 on Fedora Linux 39. The scene is 'Wanderer', depicting a humanoid in a space suit on a rocky terrain, beside a rover with solar panels." src="/Blender-Wanderer.avif" width="1465" height="993" style="height:auto;background:linear-gradient(180deg,#000 0%,#000 5%, #bdcbd0 5%,#7d5a37);color:rgba(0,0,0,0)"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure
correctness. The official list of conformant drivers now includes &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_347"&gt;our
OpenGL 4.6&lt;/a&gt; and &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1045"&gt;ES
3.2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While the vendor doesn’t yet support graphics standards like modern
OpenGL, we do. For this Valentine’s Day, we want to profess our love for
interoperable open standards. We want to free users and developers from
lock-in, enabling applications to run anywhere the heart wants without
special ports. For that, we need standards conformance. Six months ago,
we became the &lt;a href="/blog/first-conformant-m1-gpu-driver.html"&gt;first
conformant driver for any standard graphics API for the M1&lt;/a&gt; with the
release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the
full 4.6… and we’re well on the road to Vulkan.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Compared to 4.1, OpenGL 4.6 adds dozens of required features,
including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Robustness&lt;/li&gt;
&lt;li&gt;SPIR-V&lt;/li&gt;
&lt;li&gt;&lt;a href="/blog/asahi-gpu-part-6.html"&gt;Clip control&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Cull distance&lt;/li&gt;
&lt;li&gt;&lt;a href="/blog/first-conformant-m1-gpu-driver.html"&gt;Compute
shaders&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Upgraded transform feedback&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regrettably, the M1 doesn’t map well to any graphics standard newer
than OpenGL ES 3.1. While Vulkan makes some of these features optional,
the missing features are required to layer DirectX and OpenGL on top. No
existing solution on M1 gets past the OpenGL 4.1 feature set.&lt;/p&gt;
&lt;p&gt;How do we break the 4.1 barrier? Without hardware support, new
features need new tricks. Geometry shaders, tessellation, and transform
feedback become compute shaders. Cull distance becomes a transformed
interpolated value. Clip control becomes a vertex shader epilogue. The
list goes on.&lt;/p&gt;
&lt;p&gt;For a taste of the challenges we overcame, let’s look at
&lt;strong&gt;robustness&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Built for gaming, GPUs traditionally prioritize raw performance over
safety. Invalid application code, like a shader that reads a buffer
out-of-bounds, can trigger undefined behaviour. Drivers exploit that to
maximize performance.&lt;/p&gt;
&lt;p&gt;For applications like web browsers, that trade-off is undesirable.
Browsers handle untrusted shaders, which they must sanitize to ensure
stability and security. Clicking a malicious link should not crash the
browser. While some sanitization is necessary as graphics APIs are not
security barriers, reducing undefined behaviour in the API can assist
“defence in depth”.&lt;/p&gt;
&lt;p&gt;“Robustness” features can help. Without robustness, out-of-bounds
buffer access in a shader can crash. With robustness, the application
can opt for defined out-of-bounds behaviour, trading some performance
for less attack surface.&lt;/p&gt;
&lt;p&gt;All modern cross-vendor APIs include robustness. Many games even
(accidentally?) rely on robustness. Strangely, the vendor’s proprietary
API omits buffer robustness. We must do better for conformance,
correctness, and compatibility.&lt;/p&gt;
&lt;p&gt;Let’s first define the problem. Different APIs have different
definitions of what an out-of-bounds load returns when robustness is
enabled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zero (Direct3D, Vulkan with &lt;code&gt;robustBufferAccess2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Either zero or some data in the buffer (OpenGL, Vulkan with
&lt;code&gt;robustBufferAccess&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Arbitrary values, but can’t crash (OpenGL ES)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenGL uses the second definition: return zero or data from the
buffer. One approach is to return the &lt;em&gt;last&lt;/em&gt; element of the
buffer for out-of-bounds access. Given the buffer size, we can calculate
the last index. Now consider the &lt;em&gt;minimum&lt;/em&gt; of the index being
accessed and the last index. That equals the index being accessed if it
is valid, and some other valid index otherwise. Loading the minimum
index is safe and gives a spec-compliant result.&lt;/p&gt;
&lt;p&gt;As an example, a uniform buffer load without robustness might look
like:&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb1-1"&gt;&lt;a href="#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt;&lt;span class="op"&gt;.&lt;/span&gt;i32 result&lt;span class="op"&gt;,&lt;/span&gt; buffer&lt;span class="op"&gt;,&lt;/span&gt; index&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Robustness adds a single unsigned minimum (&lt;code&gt;umin&lt;/code&gt;)
instruction:&lt;/p&gt;
&lt;div class="sourceCode" id="cb2"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb2-1"&gt;&lt;a href="#cb2-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;umin idx&lt;span class="op"&gt;,&lt;/span&gt; index&lt;span class="op"&gt;,&lt;/span&gt; last&lt;/span&gt;
&lt;span id="cb2-2"&gt;&lt;a href="#cb2-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt;&lt;span class="op"&gt;.&lt;/span&gt;i32 result&lt;span class="op"&gt;,&lt;/span&gt; buffer&lt;span class="op"&gt;,&lt;/span&gt; idx&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
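&lt;p&gt;For readers who prefer C to assembly, the same clamp might be sketched as follows. This is an illustration rather than driver code: the &lt;code&gt;toy_&lt;/code&gt; names are invented, the buffer is an array of &lt;code&gt;int&lt;/code&gt;, and the element count is assumed to be nonzero.&lt;/p&gt;

```c
/* Sketch of the clamped load. "count" is the number of elements in the
 * buffer and must be nonzero; names are invented for illustration. */
static unsigned
toy_umin(unsigned a, unsigned b)
{
   return a > b ? b : a;
}

int
toy_robust_load(const int *buffer, unsigned count, unsigned index)
{
   unsigned last = count - 1;
   unsigned idx = toy_umin(index, last);

   /* idx is always a valid index, so this load cannot go out of bounds */
   return buffer[idx];
}
```

&lt;p&gt;In-bounds indices are untouched, while any out-of-bounds index reads the last element, which is “some data in the buffer” as the second definition allows.&lt;/p&gt;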
&lt;p&gt;Is the robust version slower? It can be. The difference should be
small percentage-wise, as arithmetic is faster than memory. With
thousands of threads running in parallel, the arithmetic cost may even
be hidden by the load’s latency.&lt;/p&gt;
&lt;p&gt;There’s another trick that speeds up robust uniform buffers. Like
other GPUs, the M1 supports “preambles”. The idea is simple: instead of
calculating the same value in every thread, it’s faster to calculate
once and reuse the result. The compiler identifies eligible calculations
and moves them to a preamble executed before the main shader. These
redundancies are common, so preambles provide a nice speed-up.&lt;/p&gt;
&lt;p&gt;We usually move uniform buffer loads to the preamble when every
thread loads the same index. Since the size of a uniform buffer is
fixed, extra robustness arithmetic is &lt;em&gt;also&lt;/em&gt; moved to the
preamble. The robustness is “free” for the main shader. For robust
storage buffers, the clamping might move to the preamble even if the
load or store cannot.&lt;/p&gt;
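&lt;p&gt;The hoisting can be pictured on the CPU with a loop standing in for the GPU’s threads. The sketch below models the last case, where only the clamp moves to the preamble and the load stays in the main shader. All &lt;code&gt;toy_&lt;/code&gt; names are invented, and the buffer is assumed to be non-empty.&lt;/p&gt;

```c
/* Sketch of hoisting: the clamp depends only on uniform inputs, so it is
 * computed once and every "thread" reuses the result. */
unsigned
toy_preamble(unsigned index, unsigned count)
{
   unsigned last = count - 1;
   return index > last ? last : index;
}

/* Main shader body for one thread: no robustness arithmetic left */
int
toy_main(const int *buffer, unsigned clamped, int thread_value)
{
   return buffer[clamped] + thread_value;
}

/* Run n threads, writing one result per thread */
void
toy_dispatch(int *results, unsigned n, const int *buffer, unsigned count,
             unsigned index)
{
   unsigned clamped = toy_preamble(index, count); /* once per dispatch */

   for (unsigned t = 0; t != n; ++t)
      results[t] = toy_main(buffer, clamped, (int)t);
}
```

&lt;p&gt;However many threads run, the clamp executes once, so its cost does not scale with the size of the dispatch.&lt;/p&gt;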
&lt;p&gt;Armed with robust uniform and storage buffers, let’s consider robust
“vertex buffers”. In graphics APIs, the application can set vertex
buffers with a base GPU address and a chosen layout of “attributes”
within each buffer. Each attribute has an offset and a format, and the
buffer has a “stride” indicating the number of bytes per vertex. The
vertex shader can then read attributes, implicitly indexing by the
vertex. To do so, the shader loads the address:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Base plus stride times vertex plus offset" style="display:block;margin:0 auto;max-width:18em" src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink' style='width:31.875ex;height:2.5ex;vertical-align:-.75ex;margin:1px 0' viewBox='0 -778.581 13744.556 1057.161'%3E%3Cg stroke='%23000' stroke-width='0' transform='scale(1 -1)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23a'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='713'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='1218'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1617'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='2288'/%3E%3Cg transform='translate(3293)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g' x='561'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h' x='955'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='1352'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='1635'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='2196'/%3E%3C/g%3E%3Cg transform='translate(6105)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k'/%3E%3Cg transform='translate(394)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23l'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='755'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h' x='1204'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g' x='1601'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1995'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23m' x='2444'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23n' 
x='3371'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='10092'/%3E%3Cg transform='translate(11097)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23o'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23p' x='783'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23p' x='1094'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='1405'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1804'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g' x='2253'/%3E%3C/g%3E%3C/g%3E%3Cdefs%3E%3Cpath id='k' stroke-width='10' d='M94 250q0 69 10 131t23 107 37 88 38 67 42 52 33 34 25 21h17q14 0 14-9 0-3-17-21t-41-53-49-86-42-138-17-193 17-192 41-139 49-86 42-53 17-21q0-9-15-9h-16l-28 24q-94 85-137 212T94 250Z'/%3E%3Cpath id='e' stroke-width='10' d='M56 237v13l14 20h299v150l1 150q10 13 19 13 13 0 20-15V270h298q15-8 15-20t-15-20H409V-68q-8-14-18-14h-4q-12 0-18 14v298H70q-14 7-14 20Z'/%3E%3Cpath id='n' stroke-width='10' d='m60 749 4 1h22l28-24q94-85 137-212t43-264q0-68-10-131T261 12t-37-88-38-67-41-51-32-33-23-19l-4-4H63q-3 0-5 3t-3 9q1 1 11 13Q221-64 221 250T66 725q-10 12-11 13 0 8 5 11Z'/%3E%3Cpath id='a' stroke-width='10' d='M131 622q-7 7-11 9t-16 3-43 3H28v46h318q77 0 113-5t72-27q43-24 68-61t25-78q0-51-41-93t-107-59l-10-3q73-9 129-55t56-115q0-68-51-120T469 3q-13-2-227-3H28v46h33q42 1 51 3t19 12v561Zm380-109q0 47-26 81t-69 42h-45q-20 0-38 1-67 0-82-1t-19-8q-3-4-3-129V374h83l84 1 10 2q4 1 11 3t25 13 32 24 25 39 12 57Zm26-325q0 51-28 94t-79 54l-101 1H229V116q0-59 5-64 6-5 100-5h49q42 0 60 6 43 14 68 51t26 84Z'/%3E%3Cpath id='b' stroke-width='10' d='M137 305h-22l-37 15-15 39q0 35 34 62t121 27q73 0 118-32t60-76q5-14 5-31t1-115v-70q0-48 5-66t21-18q15 0 20 16t5 53v36h40v-39q-1-40-3-47-9-30-35-47T400-6t-47 18-24 42v4l-2-3q-2-3-5-6t-8-9-12-11-15-12-18-11-22-8-26-6-31-3q-60 0-108 31t-48 87q0 21 7 40t27 41 48 37 78 28 110 15h14v22q0 34-6 50-22 
71-97 71-18 0-34-1t-25-4-8-3q22-15 22-44 0-25-16-39Zm-11-199q0-31 24-55t59-25q38 0 67 23t39 60q2 7 3 66 0 58-1 58-8 0-21-1t-45-9-58-20-46-37-21-60Z'/%3E%3Cpath id='c' stroke-width='10' d='M295 316q0 40-27 69t-78 29q-36 0-62-13-30-19-30-52-1-5 0-13t16-24 43-25q18-5 44-9t44-9 32-13q17-8 33-20t32-41 17-62q0-62-38-102T198-10h-8q-52 0-96 36l-8-7-9-9Q71 4 65-1L54-11H42q-3 0-9 6v137q0 21 2 25t10 5h9q12 0 16-4t5-12 7-27 19-42q35-51 97-51 97 0 97 78 0 29-18 47-20 24-83 36t-83 23q-36 17-57 46t-21 62q0 39 17 66t43 40 50 18 44 5h11q40 0 70-15l15-8 9 7q10 9 22 17h12q3 0 9-6V310l-6-6h-28q-6 6-6 12Z'/%3E%3Cpath id='d' stroke-width='10' d='M28 218q0 55 20 100t50 73 65 42 66 15q53 0 91-18t58-50 28-64 9-71q0-7-7-14H126v-15q0-148 100-180 20-6 44-6 42 0 72 32 17 17 27 42l10 24q3 3 16 3h3q17 0 17-10 0-4-3-13-19-55-63-87t-99-32q-95 0-158 69T28 218Zm305 57q-11 128-95 136h-2q-8 0-16-1t-25-8-29-21-23-41-16-66v-7h206v8Z'/%3E%3Cpath id='l' stroke-width='10' d='m114 620-4 4-3 3-4 3q-4 3-5 2t-7 2-11 1-13 1-19 1H19v46h9q18-3 124-3 121 0 142 3h11v-46h-21q-61-3-61-17 0-2 90-248t91-246l86 232q85 230 85 239 0 19-21 29t-46 11h-5v46h9q15-3 115-3 91 0 97 3h6v-46h-7q-75 0-96-41 0-1-112-305T401-14q-5-8-19-8h-15q-14 0-19 8-2 2-117 317-117 314-117 317Z'/%3E%3Cpath id='h' stroke-width='10' d='M36 46h14q39 0 47 14v31q0 14 1 31t0 39 0 42v125l-1 23q-3 19-14 25t-45 9H20v23q0 23 2 23l10 1q10 1 28 2t36 2q16 1 35 2t29 3 11 1h3v-69q39 68 97 68h6q45 0 66-22t21-46q0-21-13-36t-38-15q-25 0-37 16t-13 34q0 9 2 16t5 12 3 5q-2 2-23-4-16-8-24-15-47-45-47-179V101q0-12 1-20t0-15v-5q1-2 3-4t5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9q-18 3-127 3Q37 3 28 0h-8v46h16Z'/%3E%3Cpath id='g' stroke-width='10' d='M27 422q53 4 82 56t32 122v15h40V431h135v-46H181V241q1-125 1-141t7-32q14-39 49-39 44 0 54 71 1 8 1 46v35h40v-47q0-77-42-117-27-27-70-27-34 0-59 12t-38 31-19 35-7 32q-1 7-1 148v137H18v37h9Z'/%3E%3Cpath id='m' stroke-width='10' d='M201 0q-12 3-99 3-76 0-85-3h-6v46h14q23 1 42 6t29 9 25 17 18 18 21 26 20 28l46 60-58 78q-9 13-19 
27t-16 21-11 15-9 12-6 7-7 6-6 3-6 2-8 2q-6 0-36 2H16v46h7q36-2 103-2 93 0 103 2h8v-46q-36-4-36-16 0-2 10-16t28-38 29-41l4-4 25 34q32 41 32 54 0 6-2 11t-5 7-5 4-7 4l-3 1h-5v46h7q15-3 99-3 79 0 85 3h6v-46h-7q-49 0-81-17-17-8-34-27t-65-84l-16-21 62-85q66-90 71-94t17-7q18-4 53-4h17V0h-14q-8 1-20 1t-25 1-25 0-18 1h-37q-26 0-50-2l-23-1h-9v46h3q11 0 22 5t11 12q0 2-40 57l-41 55q-1-1-31-42t-34-45q-4-5-4-14 0-11 7-19t18-9q2 0 2-23V0h-7Z'/%3E%3Cpath id='f' stroke-width='10' d='M55 507q0 83 57 140t131 57h14q85 0 148-63l21 31q5 7 10 15t10 13l3 4h4q3 0 6 1h4q3 0 9-6V462l-6-6h-18q-11 0-13 3t-5 20q-17 126-101 167-37 16-75 16-53 0-86-36t-33-84q0-34 17-62t48-45q10-4 86-23t84-23q57-22 93-75t37-123q0-81-52-146T301-21q-56 0-100 17t-61 31l-18 14q-4-5-15-20T87-7t-9-14q-2-1-10-1h-4q-3 0-9 6v117q0 119 1 121 2 5 20 5h13q6-6 6-13 0-32 10-63t34-61 66-48 100-18q47 0 81 38t34 93q0 43-22 78t-58 48q-56 14-74 19-5 1-27 6t-33 8-32 11-33 18-29 24-27 35q-30 49-30 105Z'/%3E%3Cpath id='i' stroke-width='10' d='M69 609q0 28 18 44t44 16q23-2 40-17t17-43q0-30-17-45t-42-15q-25 0-42 15t-18 45ZM247 0q-15 3-104 3h-37Q80 3 56 1L34 0h-8v46h16q28 0 49 3 9 4 11 11t2 42v191q0 52-2 66t-14 19q-14 7-47 7H30v23q0 23 2 23l10 1q10 1 28 2t36 2 36 2 29 3 11 1h3V62q5-10 12-12t35-4h23V0h-8Z'/%3E%3Cpath id='j' stroke-width='10' d='M376 495v40q0 24 1 33 0 45-10 56t-51 13h-18v23q0 23 2 23l10 1q10 1 29 2t37 2 37 2 30 3 11 1h3V390q0-306 1-309 3-20 14-26t45-9h18V0q-2 0-76-5t-79-6h-7v55l-8-7q-58-48-130-48-77 0-139 61T34 215q0 100 63 163t147 64q75 0 132-49v102Zm-3-153q-45 63-113 63-49 0-87-36-27-28-34-64t-8-94q0-56 7-91t35-61q30-33 78-33 71 0 122 77v239Z'/%3E%3Cpath id='o' stroke-width='10' d='M56 340q0 83 30 154t78 116 106 70 118 25q133 0 233-104t101-260q0-81-29-150T617 75 510 4 388-22 267 3 160 74 85 189 56 340Zm411 307q-41 18-79 18-28 0-57-11t-62-34-56-71-34-110q-5-28-5-85 0-210 103-293 50-41 108-41h6q83 0 146 79 66 89 66 255 0 57-5 85-21 153-131 208Z'/%3E%3Cpath id='p' stroke-width='10' d='M273 0q-18 3-127 3Q43 3 34 
0h-8v46h16q28 0 49 3 8 3 12 11 1 2 1 164v161H33v46h71v66l1 67 2 10q19 65 64 94t95 36h9q8 0 14 1 41-3 62-26t21-52q0-23-14-37t-37-14-37 14-14 37q0 20 18 40h-4q-4 1-11 1-28 0-50-21t-34-55q-6-20-7-95v-66h111v-46H185V225q0-162 1-164t3-4 5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9Z'/%3E%3C/defs%3E%3C/svg%3E"/&gt;&lt;/p&gt;
&lt;p&gt;Some hardware implements robust vertex fetch natively. Other hardware
has bounds-checked buffers to accelerate robust software vertex fetch.
Unfortunately, the M1 has neither. We need to implement vertex fetch
with raw memory loads.&lt;/p&gt;
&lt;p&gt;One instruction set feature helps. In addition to a 64-bit base
address, the M1 GPU’s memory loads also take an offset in
&lt;em&gt;elements&lt;/em&gt;. The hardware shifts the offset and adds it to the 64-bit
base to determine the address to fetch. Additionally, the M1 has a
combined integer multiply-add instruction &lt;code&gt;imad&lt;/code&gt;. Together,
these features let us implement vertex loads in two instructions. For
example, a 32-bit attribute load looks like:&lt;/p&gt;
&lt;div class="sourceCode" id="cb3"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb3-1"&gt;&lt;a href="#cb3-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;imad idx&lt;span class="op"&gt;,&lt;/span&gt; stride&lt;span class="op"&gt;/&lt;/span&gt;&lt;span class="dv"&gt;4&lt;/span&gt;&lt;span class="op"&gt;,&lt;/span&gt; vertex&lt;span class="op"&gt;,&lt;/span&gt; offset&lt;span class="op"&gt;/&lt;/span&gt;&lt;span class="dv"&gt;4&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-2"&gt;&lt;a href="#cb3-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt;&lt;span class="op"&gt;.&lt;/span&gt;i32 result&lt;span class="op"&gt;,&lt;/span&gt; base&lt;span class="op"&gt;,&lt;/span&gt; idx&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The hardware load can perform an additional small shift. Suppose our
attribute is a vector of 4 32-bit values, densely packed into a buffer
with no offset. We can load that attribute in one instruction:&lt;/p&gt;
&lt;div class="sourceCode" id="cb4"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb4-1"&gt;&lt;a href="#cb4-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt;&lt;span class="op"&gt;.&lt;/span&gt;v4i32 result&lt;span class="op"&gt;,&lt;/span&gt; base&lt;span class="op"&gt;,&lt;/span&gt; vertex &lt;span class="op"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="dv"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;…with the hardware calculating the address:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Base plus 4 times vertex left shifted 2, which equals Base plus 16 times vertex" style="display:block;margin:0 auto;max-width:15em" src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink' style='width:25.375ex;height:5.75ex;vertical-align:-2.375ex;margin:1px 0' viewBox='0 -1478.581 10898.731 2457.161'%3E%3Cg stroke='%23000' stroke-width='0'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23a' y='-700' transform='matrix(1 0 0 -1 153 0)'/%3E%3Cg transform='matrix(1 0 0 -1 936 -660)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='713'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1218'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1617'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='2288'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g' x='3293'/%3E%3Cg transform='translate(3965)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h'/%3E%3Cg transform='translate(394)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='755'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='1204'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k' x='1601'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1995'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23l' x='2444'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23m' x='3648'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23n' x='4931'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23o' x='5436'/%3E%3C/g%3E%3C/g%3E%3Cg transform='matrix(1 0 0 -1 936 700)'%3E%3Cuse 
xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='713'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1218'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1617'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='2288'/%3E%3Cg transform='translate(3293)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23p'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23q' x='505'/%3E%3C/g%3E%3Cg transform='translate(4470)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h'/%3E%3Cg transform='translate(394)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='755'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='1204'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k' x='1601'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1995'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23l' x='2444'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23o' x='3371'/%3E%3C/g%3E%3C/g%3E%3C/g%3E%3Cdefs%3E%3Cpath id='a' stroke-width='10' d='M56 347q0 13 14 20h637q15-8 15-20 0-11-14-19l-318-1H72q-16 5-16 20Zm0-194q0 15 16 20h636q14-10 14-20 0-13-15-20H70q-14 7-14 20Z'/%3E%3Cpath id='h' stroke-width='10' d='M94 250q0 69 10 131t23 107 37 88 38 67 42 52 33 34 25 21h17q14 0 14-9 0-3-17-21t-41-53-49-86-42-138-17-193 17-192 41-139 49-86 42-53 17-21q0-9-15-9h-16l-28 24q-94 85-137 212T94 250Z'/%3E%3Cpath id='f' stroke-width='10' d='M56 237v13l14 20h299v150l1 150q10 13 19 13 13 0 20-15V270h298q15-8 15-20t-15-20H409V-68q-8-14-18-14h-4q-12 0-18 14v298H70q-14 7-14 20Z'/%3E%3Cpath id='p' stroke-width='10' d='m213 578-13-5q-14-5-40-10t-58-7H83v46h19q47 2 87 15t56 24 28 22q2 3 12 3 9 0 
17-6V361l1-300q7-7 12-9t24-4 62-2h26V0h-11q-21 3-159 3-136 0-157-3H88v46h64q16 0 25 1t16 3 8 2 6 5 6 4v517Z'/%3E%3Cpath id='o' stroke-width='10' d='m60 749 4 1h22l28-24q94-85 137-212t43-264q0-68-10-131T261 12t-37-88-38-67-41-51-32-33-23-19l-4-4H63q-3 0-5 3t-3 9q1 1 11 13Q221-64 221 250T66 725q-10 12-11 13 0 8 5 11Z'/%3E%3Cpath id='n' stroke-width='10' d='M109 429q-27 0-43 18t-16 44q0 71 53 123t132 52q91 0 152-56t62-145q0-43-20-82t-48-68-80-74q-36-31-100-92l-59-56 76-1q157 0 167 5 7 2 24 89v3h40v-3q-1-3-13-91T421 3V0H50v31q0 7 6 15t30 35q29 32 50 56 9 10 34 37t34 37 29 33 28 34 23 30 21 32 15 29 13 32 7 30 3 33q0 63-34 109t-97 46q-33 0-58-17t-35-33-10-19q0-1 5-1 18 0 37-14t19-46q0-25-16-42t-45-18Z'/%3E%3Cpath id='b' stroke-width='10' d='M131 622q-7 7-11 9t-16 3-43 3H28v46h318q77 0 113-5t72-27q43-24 68-61t25-78q0-51-41-93t-107-59l-10-3q73-9 129-55t56-115q0-68-51-120T469 3q-13-2-227-3H28v46h33q42 1 51 3t19 12v561Zm380-109q0 47-26 81t-69 42h-45q-20 0-38 1-67 0-82-1t-19-8q-3-4-3-129V374h83l84 1 10 2q4 1 11 3t25 13 32 24 25 39 12 57Zm26-325q0 51-28 94t-79 54l-101 1H229V116q0-59 5-64 6-5 100-5h49q42 0 60 6 43 14 68 51t26 84Z'/%3E%3Cpath id='c' stroke-width='10' d='M137 305h-22l-37 15-15 39q0 35 34 62t121 27q73 0 118-32t60-76q5-14 5-31t1-115v-70q0-48 5-66t21-18q15 0 20 16t5 53v36h40v-39q-1-40-3-47-9-30-35-47T400-6t-47 18-24 42v4l-2-3q-2-3-5-6t-8-9-12-11-15-12-18-11-22-8-26-6-31-3q-60 0-108 31t-48 87q0 21 7 40t27 41 48 37 78 28 110 15h14v22q0 34-6 50-22 71-97 71-18 0-34-1t-25-4-8-3q22-15 22-44 0-25-16-39Zm-11-199q0-31 24-55t59-25q38 0 67 23t39 60q2 7 3 66 0 58-1 58-8 0-21-1t-45-9-58-20-46-37-21-60Z'/%3E%3Cpath id='d' stroke-width='10' d='M295 316q0 40-27 69t-78 29q-36 0-62-13-30-19-30-52-1-5 0-13t16-24 43-25q18-5 44-9t44-9 32-13q17-8 33-20t32-41 17-62q0-62-38-102T198-10h-8q-52 0-96 36l-8-7-9-9Q71 4 65-1L54-11H42q-3 0-9 6v137q0 21 2 25t10 5h9q12 0 16-4t5-12 7-27 19-42q35-51 97-51 97 0 97 78 0 29-18 47-20 24-83 36t-83 23q-36 17-57 46t-21 62q0 39 17 66t43 40 50 18 44 5h11q40 0 
70-15l15-8 9 7q10 9 22 17h12q3 0 9-6V310l-6-6h-28q-6 6-6 12Z'/%3E%3Cpath id='e' stroke-width='10' d='M28 218q0 55 20 100t50 73 65 42 66 15q53 0 91-18t58-50 28-64 9-71q0-7-7-14H126v-15q0-148 100-180 20-6 44-6 42 0 72 32 17 17 27 42l10 24q3 3 16 3h3q17 0 17-10 0-4-3-13-19-55-63-87t-99-32q-95 0-158 69T28 218Zm305 57q-11 128-95 136h-2q-8 0-16-1t-25-8-29-21-23-41-16-66v-7h206v8Z'/%3E%3Cpath id='g' stroke-width='10' d='M462 0q-18 3-129 3-116 0-134-3h-9v46h58q7 0 17 2t14 5 7 8q1 2 1 54v50H28v46l151 231q153 232 155 233 2 2 21 2h18l6-6V211h92v-46h-92V66q0-7 6-12 8-7 57-8h29V0h-9ZM293 211v334L74 212l109-1h110Z'/%3E%3Cpath id='i' stroke-width='10' d='m114 620-4 4-3 3-4 3q-4 3-5 2t-7 2-11 1-13 1-19 1H19v46h9q18-3 124-3 121 0 142 3h11v-46h-21q-61-3-61-17 0-2 90-248t91-246l86 232q85 230 85 239 0 19-21 29t-46 11h-5v46h9q15-3 115-3 91 0 97 3h6v-46h-7q-75 0-96-41 0-1-112-305T401-14q-5-8-19-8h-15q-14 0-19 8-2 2-117 317-117 314-117 317Z'/%3E%3Cpath id='j' stroke-width='10' d='M36 46h14q39 0 47 14v31q0 14 1 31t0 39 0 42v125l-1 23q-3 19-14 25t-45 9H20v23q0 23 2 23l10 1q10 1 28 2t36 2q16 1 35 2t29 3 11 1h3v-69q39 68 97 68h6q45 0 66-22t21-46q0-21-13-36t-38-15q-25 0-37 16t-13 34q0 9 2 16t5 12 3 5q-2 2-23-4-16-8-24-15-47-45-47-179V101q0-12 1-20t0-15v-5q1-2 3-4t5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9q-18 3-127 3Q37 3 28 0h-8v46h16Z'/%3E%3Cpath id='k' stroke-width='10' d='M27 422q53 4 82 56t32 122v15h40V431h135v-46H181V241q1-125 1-141t7-32q14-39 49-39 44 0 54 71 1 8 1 46v35h40v-47q0-77-42-117-27-27-70-27-34 0-59 12t-38 31-19 35-7 32q-1 7-1 148v137H18v37h9Z'/%3E%3Cpath id='l' stroke-width='10' d='M201 0q-12 3-99 3-76 0-85-3h-6v46h14q23 1 42 6t29 9 25 17 18 18 21 26 20 28l46 60-58 78q-9 13-19 27t-16 21-11 15-9 12-6 7-7 6-6 3-6 2-8 2q-6 0-36 2H16v46h7q36-2 103-2 93 0 103 2h8v-46q-36-4-36-16 0-2 10-16t28-38 29-41l4-4 25 34q32 41 32 54 0 6-2 11t-5 7-5 4-7 4l-3 1h-5v46h7q15-3 99-3 79 0 85 3h6v-46h-7q-49 0-81-17-17-8-34-27t-65-84l-16-21 62-85q66-90 71-94t17-7q18-4 53-4h17V0h-14q-8 1-20 1t-25 1-25 
0-18 1h-37q-26 0-50-2l-23-1h-9v46h3q11 0 22 5t11 12q0 2-40 57l-41 55q-1-1-31-42t-34-45q-4-5-4-14 0-11 7-19t18-9q2 0 2-23V0h-7Z'/%3E%3Cpath id='m' stroke-width='10' d='M639-48q0-6-5-12t-15-7h-1q-6 0-82 41Q430 33 329 88 61 235 59 239q-3 4-3 11t3 11q3 5 277 154t279 152l4 1q3-1 6-1 14-5 14-19 0-8-6-14-1-2-259-143L117 250l257-141Q632-32 633-34q6-6 6-14Zm305 0q0-6-5-12t-15-7h-1q-6 0-82 41Q735 33 634 88 366 235 364 239q-3 4-3 11t3 11q3 5 277 154t279 152l4 1q3-1 6-1 14-5 14-19 0-8-6-14-1-2-259-143L422 250l257-141Q937-32 938-34q6-6 6-14Z'/%3E%3Cpath id='q' stroke-width='10' d='M42 313q0 163 81 258t180 95q69 0 99-36t30-80q0-25-14-40t-39-15q-23 0-38 14t-15 39q0 44 47 53-22 22-62 25-71 0-117-60-47-66-47-202l1-4q5 6 8 13 41 60 107 60h4q46 0 81-19 24-14 48-40t39-57q21-49 21-107v-18q0-23-5-43-11-59-64-115T253-22q-28 0-54 8t-56 30-51 59-36 97-14 141Zm215 84q-30 0-52-17t-34-45-17-57-6-62q0-83 12-119t38-58q24-18 53-18 51 0 78 38 13 18 18 45t5 105q0 80-5 107t-18 45q-27 36-72 36Z'/%3E%3C/defs%3E%3C/svg%3E"/&gt;&lt;/p&gt;
&lt;p&gt;What about robustness?&lt;/p&gt;
&lt;p&gt;We want to implement robustness with a clamp, like we did for uniform
buffers. The problem is that the vertex buffer size is given in bytes,
while our optimized load takes an index in “vertices”. A single vertex
buffer can contain multiple attributes with different formats and
offsets, so we can’t convert the size in bytes to a size in
“vertices”.&lt;/p&gt;
&lt;p&gt;Let’s handle the latter problem. We can rewrite the addressing
equation as:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Base plus offset, which is the attribute base, plus stride times vertex" style="display:block;margin:0 auto;max-width:18em" src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink' style='width:33.75ex;height:6ex;vertical-align:-4.25ex;margin:1px 0' viewBox='0 -778.581 14532.556 2598.036'%3E%3Cg stroke='%23000' stroke-width='0' transform='scale(1 -1)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23a'/%3E%3Cg transform='translate(394)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='713'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1218'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1617'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='2682'/%3E%3Cg transform='translate(3687)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h' x='783'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h' x='1094'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1405'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1804'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='2253'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='6334'/%3E%3Cg transform='translate(12 -783)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k' x='19'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23l' transform='matrix(5.77434 0 0 1 512.872 0)'/%3E%3Cg transform='translate(2909)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23m'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23n' x='455'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' 
xlink:href='%23l' transform='matrix(5.77434 0 0 1 3848.094 0)'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23o' x='6249'/%3E%3C/g%3E%3Cg transform='matrix(.7071 0 0 .7071 1116 -1687)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23p'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='754'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='1149'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23q' x='1543'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23r' x='1940'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23s' x='2223'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23t' x='2784'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='3345'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='3739'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23u' x='4188'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23s' x='4443'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='5004'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='5509'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='5908'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='6950'/%3E%3Cg transform='translate(7955)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23v'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='561'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23q' x='955'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23r' x='1352'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23w' x='1635'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='2196'/%3E%3C/g%3E%3Cg transform='translate(10767)'%3E%3Cuse 
xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23a'/%3E%3Cg transform='translate(394)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23x'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='755'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23q' x='1204'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='1601'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1995'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23y' x='2444'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='3371'/%3E%3C/g%3E%3C/g%3E%3Cdefs%3E%3Cpath id='a' stroke-width='10' d='M94 250q0 69 10 131t23 107 37 88 38 67 42 52 33 34 25 21h17q14 0 14-9 0-3-17-21t-41-53-49-86-42-138-17-193 17-192 41-139 49-86 42-53 17-21q0-9-15-9h-16l-28 24q-94 85-137 212T94 250Z'/%3E%3Cpath id='f' stroke-width='10' d='M56 237v13l14 20h299v150l1 150q10 13 19 13 13 0 20-15V270h298q15-8 15-20t-15-20H409V-68q-8-14-18-14h-4q-12 0-18 14v298H70q-14 7-14 20Z'/%3E%3Cpath id='j' stroke-width='10' d='m60 749 4 1h22l28-24q94-85 137-212t43-264q0-68-10-131T261 12t-37-88-38-67-41-51-32-33-23-19l-4-4H63q-3 0-5 3t-3 9q1 1 11 13Q221-64 221 250T66 725q-10 12-11 13 0 8 5 11Z'/%3E%3Cpath id='b' stroke-width='10' d='M131 622q-7 7-11 9t-16 3-43 3H28v46h318q77 0 113-5t72-27q43-24 68-61t25-78q0-51-41-93t-107-59l-10-3q73-9 129-55t56-115q0-68-51-120T469 3q-13-2-227-3H28v46h33q42 1 51 3t19 12v561Zm380-109q0 47-26 81t-69 42h-45q-20 0-38 1-67 0-82-1t-19-8q-3-4-3-129V374h83l84 1 10 2q4 1 11 3t25 13 32 24 25 39 12 57Zm26-325q0 51-28 94t-79 54l-101 1H229V116q0-59 5-64 6-5 100-5h49q42 0 60 6 43 14 68 51t26 84Z'/%3E%3Cpath id='c' stroke-width='10' d='M137 305h-22l-37 15-15 39q0 35 34 62t121 27q73 0 118-32t60-76q5-14 5-31t1-115v-70q0-48 5-66t21-18q15 0 20 16t5 53v36h40v-39q-1-40-3-47-9-30-35-47T400-6t-47 18-24 42v4l-2-3q-2-3-5-6t-8-9-12-11-15-12-18-11-22-8-26-6-31-3q-60 0-108 31t-48 87q0 21 7 
40t27 41 48 37 78 28 110 15h14v22q0 34-6 50-22 71-97 71-18 0-34-1t-25-4-8-3q22-15 22-44 0-25-16-39Zm-11-199q0-31 24-55t59-25q38 0 67 23t39 60q2 7 3 66 0 58-1 58-8 0-21-1t-45-9-58-20-46-37-21-60Z'/%3E%3Cpath id='d' stroke-width='10' d='M295 316q0 40-27 69t-78 29q-36 0-62-13-30-19-30-52-1-5 0-13t16-24 43-25q18-5 44-9t44-9 32-13q17-8 33-20t32-41 17-62q0-62-38-102T198-10h-8q-52 0-96 36l-8-7-9-9Q71 4 65-1L54-11H42q-3 0-9 6v137q0 21 2 25t10 5h9q12 0 16-4t5-12 7-27 19-42q35-51 97-51 97 0 97 78 0 29-18 47-20 24-83 36t-83 23q-36 17-57 46t-21 62q0 39 17 66t43 40 50 18 44 5h11q40 0 70-15l15-8 9 7q10 9 22 17h12q3 0 9-6V310l-6-6h-28q-6 6-6 12Z'/%3E%3Cpath id='e' stroke-width='10' d='M28 218q0 55 20 100t50 73 65 42 66 15q53 0 91-18t58-50 28-64 9-71q0-7-7-14H126v-15q0-148 100-180 20-6 44-6 42 0 72 32 17 17 27 42l10 24q3 3 16 3h3q17 0 17-10 0-4-3-13-19-55-63-87t-99-32q-95 0-158 69T28 218Zm305 57q-11 128-95 136h-2q-8 0-16-1t-25-8-29-21-23-41-16-66v-7h206v8Z'/%3E%3Cpath id='x' stroke-width='10' d='m114 620-4 4-3 3-4 3q-4 3-5 2t-7 2-11 1-13 1-19 1H19v46h9q18-3 124-3 121 0 142 3h11v-46h-21q-61-3-61-17 0-2 90-248t91-246l86 232q85 230 85 239 0 19-21 29t-46 11h-5v46h9q15-3 115-3 91 0 97 3h6v-46h-7q-75 0-96-41 0-1-112-305T401-14q-5-8-19-8h-15q-14 0-19 8-2 2-117 317-117 314-117 317Z'/%3E%3Cpath id='q' stroke-width='10' d='M36 46h14q39 0 47 14v31q0 14 1 31t0 39 0 42v125l-1 23q-3 19-14 25t-45 9H20v23q0 23 2 23l10 1q10 1 28 2t36 2q16 1 35 2t29 3 11 1h3v-69q39 68 97 68h6q45 0 66-22t21-46q0-21-13-36t-38-15q-25 0-37 16t-13 34q0 9 2 16t5 12 3 5q-2 2-23-4-16-8-24-15-47-45-47-179V101q0-12 1-20t0-15v-5q1-2 3-4t5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9q-18 3-127 3Q37 3 28 0h-8v46h16Z'/%3E%3Cpath id='i' stroke-width='10' d='M27 422q53 4 82 56t32 122v15h40V431h135v-46H181V241q1-125 1-141t7-32q14-39 49-39 44 0 54 71 1 8 1 46v35h40v-47q0-77-42-117-27-27-70-27-34 0-59 12t-38 31-19 35-7 32q-1 7-1 148v137H18v37h9Z'/%3E%3Cpath id='y' stroke-width='10' d='M201 0q-12 3-99 3-76 0-85-3h-6v46h14q23 1 42 6t29 9 25 
17 18 18 21 26 20 28l46 60-58 78q-9 13-19 27t-16 21-11 15-9 12-6 7-7 6-6 3-6 2-8 2q-6 0-36 2H16v46h7q36-2 103-2 93 0 103 2h8v-46q-36-4-36-16 0-2 10-16t28-38 29-41l4-4 25 34q32 41 32 54 0 6-2 11t-5 7-5 4-7 4l-3 1h-5v46h7q15-3 99-3 79 0 85 3h6v-46h-7q-49 0-81-17-17-8-34-27t-65-84l-16-21 62-85q66-90 71-94t17-7q18-4 53-4h17V0h-14q-8 1-20 1t-25 1-25 0-18 1h-37q-26 0-50-2l-23-1h-9v46h3q11 0 22 5t11 12q0 2-40 57l-41 55q-1-1-31-42t-34-45q-4-5-4-14 0-11 7-19t18-9q2 0 2-23V0h-7Z'/%3E%3Cpath id='v' stroke-width='10' d='M55 507q0 83 57 140t131 57h14q85 0 148-63l21 31q5 7 10 15t10 13l3 4h4q3 0 6 1h4q3 0 9-6V462l-6-6h-18q-11 0-13 3t-5 20q-17 126-101 167-37 16-75 16-53 0-86-36t-33-84q0-34 17-62t48-45q10-4 86-23t84-23q57-22 93-75t37-123q0-81-52-146T301-21q-56 0-100 17t-61 31l-18 14q-4-5-15-20T87-7t-9-14q-2-1-10-1h-4q-3 0-9 6v117q0 119 1 121 2 5 20 5h13q6-6 6-13 0-32 10-63t34-61 66-48 100-18q47 0 81 38t34 93q0 43-22 78t-58 48q-56 14-74 19-5 1-27 6t-33 8-32 11-33 18-29 24-27 35q-30 49-30 105Z'/%3E%3Cpath id='r' stroke-width='10' d='M69 609q0 28 18 44t44 16q23-2 40-17t17-43q0-30-17-45t-42-15q-25 0-42 15t-18 45ZM247 0q-15 3-104 3h-37Q80 3 56 1L34 0h-8v46h16q28 0 49 3 9 4 11 11t2 42v191q0 52-2 66t-14 19q-14 7-47 7H30v23q0 23 2 23l10 1q10 1 28 2t36 2 36 2 29 3 11 1h3V62q5-10 12-12t35-4h23V0h-8Z'/%3E%3Cpath id='w' stroke-width='10' d='M376 495v40q0 24 1 33 0 45-10 56t-51 13h-18v23q0 23 2 23l10 1q10 1 29 2t37 2 37 2 30 3 11 1h3V390q0-306 1-309 3-20 14-26t45-9h18V0q-2 0-76-5t-79-6h-7v55l-8-7q-58-48-130-48-77 0-139 61T34 215q0 100 63 163t147 64q75 0 132-49v102Zm-3-153q-45 63-113 63-49 0-87-36-27-28-34-64t-8-94q0-56 7-91t35-61q30-33 78-33 71 0 122 77v239Z'/%3E%3Cpath id='g' stroke-width='10' d='M56 340q0 83 30 154t78 116 106 70 118 25q133 0 233-104t101-260q0-81-29-150T617 75 510 4 388-22 267 3 160 74 85 189 56 340Zm411 307q-41 18-79 18-28 0-57-11t-62-34-56-71-34-110q-5-28-5-85 0-210 103-293 50-41 108-41h6q83 0 146 79 66 89 66 255 0 57-5 85-21 153-131 208Z'/%3E%3Cpath id='h' stroke-width='10' 
d='M273 0q-18 3-127 3Q43 3 34 0h-8v46h16q28 0 49 3 8 3 12 11 1 2 1 164v161H33v46h71v66l1 67 2 10q19 65 64 94t95 36h9q8 0 14 1 41-3 62-26t21-52q0-23-14-37t-37-14-37 14-14 37q0 20 18 40h-4q-4 1-11 1-28 0-50-21t-34-55q-6-20-7-95v-66h111v-46H185V225q0-162 1-164t3-4 5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9Z'/%3E%3Cpath id='k' stroke-width='10' d='m-24 327 6 6h33q4 0 7-4t5-7 8-14 19-24q61-81 171-122t216-42q13 0 16-3t3-22V28q0-20-3-24t-15-4q-87 0-182 36Q75 118-16 278l-8 14v35Z'/%3E%3Cpath id='o' stroke-width='10' d='M-10 60v35q0 18 3 21t16 4q142 0 241 51t146 113q8 9 16 21t12 19 7 7q2 2 20 2h17l6-6v-35l-8-14Q375 118 190 36 95 0 8 0-5 0-7 3t-3 21v36Z'/%3E%3Cpath id='m' stroke-width='10' d='M-10 60v51q0 7 5 7 4 2 15 2 86 0 180-36Q375 2 466-158l8-14v-35l-6-6h-34q-3 0-6 4t-5 7-9 15-18 24Q331-82 224-41T9 0Q-4 0-7 3t-3 22v35Z'/%3E%3Cpath id='n' stroke-width='10' d='m-18-213-6 6v35l8 14Q75 2 260 84q74 29 155 35h12q9 0 13 1 14 0 17-3t3-19V25q0-18-3-21t-16-4Q308 0 193-55T25-205q-4-6-7-7t-19-1h-17Z'/%3E%3Cpath id='l' stroke-width='10' d='M-10 0v120h420V0H-10Z'/%3E%3Cpath id='p' stroke-width='10' d='M255 0q-15 3-115 3Q48 3 39 0h-7v46h15q72 3 92 42 1 3 53 157t103 308 53 155q3 8 18 8h10q20-1 24-7 2-2 108-319L617 67q7-13 19-16t51-5h30V0h-9q-9 3-127 3-123 0-144-3h-10v46h13q70 0 70 18 0 2-24 74l-24 71H229l-20-59q-20-59-20-65 0-13 20-26t50-13h5V0h-9Zm192 255L345 557 244 256q0-1 101-1h102Z'/%3E%3Cpath id='s' stroke-width='10' d='M307-11q-73 0-139 66l-10-18q-2-3-5-9t-6-11-4-7l-5-9-20-1H98v298q0 301-1 305-3 19-14 25t-45 9H20v23q0 23 2 23l10 1q10 1 29 2t37 2 37 2 30 3 11 1h3V543q0-152 1-152l3 3q3 3 9 7t15 10 21 10 26 10 32 8 37 3q78 0 138-63t61-163q0-101-64-164T307-11ZM182 98q0-1 5-8t9-11 10-12 12-12 15-11 17-9 21-6 24-3q35 0 68 20t49 67q12 35 12 99 0 75-12 111-27 82-112 82-30 0-61-15t-51-43l-6-8V98Z'/%3E%3Cpath id='t' stroke-width='10' d='M383 58q-56-68-127-68h-7q-125 0-144 99-1 7-2 137-1 109-1 122t-6 21q-10 16-60 16H25v23q0 23 2 23l11 1q10 1 29 2t38 2q17 1 37 2t30 3 12 1h3V261q1-184 3-197 
3-15 14-24 20-14 60-14 26 0 47 9t32 23 20 32 12 30 4 24v17q0 16 1 40t0 47v67q0 46-10 57t-50 13h-18v46q2 0 76 5t79 6h7V264q0-180 1-183 3-20 14-26t45-9h18V0q-2 0-75-5t-77-6h-7v69Z'/%3E%3C/defs%3E%3C/svg%3E"/&gt;&lt;/p&gt;
&lt;p&gt;That is: one buffer with many attributes at different offsets is
equivalent to many buffers with one attribute and no offset. This gives
an alternate perspective on the same data layout. Is this an
improvement? It avoids an addition in the shader, at the cost of passing
more data – addresses are 64-bit while attribute offsets are &lt;a
href="https://vulkan.gpuinfo.org/listreports.php?limit=maxVertexInputAttributeOffset&amp;amp;value=4294967295&amp;amp;platform=all0"&gt;16-bit&lt;/a&gt;.
More importantly, it lets us translate the vertex buffer size in bytes
into a size in “vertices” for &lt;em&gt;each&lt;/em&gt; vertex attribute. Instead of
clamping the offset, we clamp the vertex index. We still make full use
of the hardware addressing modes, now with robustness:&lt;/p&gt;
&lt;div class="sourceCode" id="cb5"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb5-1"&gt;&lt;a href="#cb5-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;umin idx&lt;span class="op"&gt;,&lt;/span&gt; vertex&lt;span class="op"&gt;,&lt;/span&gt; last valid&lt;/span&gt;
&lt;span id="cb5-2"&gt;&lt;a href="#cb5-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;load&lt;/span&gt;&lt;span class="op"&gt;.&lt;/span&gt;v4i32 result&lt;span class="op"&gt;,&lt;/span&gt; base&lt;span class="op"&gt;,&lt;/span&gt; idx &lt;span class="op"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="dv"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We need to calculate the last valid vertex index ahead of time for
each attribute. Each attribute has a format with a particular size.
Manipulating the addressing equation, we can calculate the last
&lt;em&gt;byte&lt;/em&gt; accessed in the buffer (plus 1) relative to the base:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Offset plus stride times vertex plus format" style="display:block;margin:0 auto;max-width:18em" src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink' style='width:31.5ex;height:2.5ex;vertical-align:-.75ex;margin:1px 0' viewBox='0 -778.581 13568.556 1057.161'%3E%3Cg stroke='%23000' stroke-width='0' transform='scale(1 -1)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23a'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='783'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='1094'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='1405'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1804'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='2253'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='2869'/%3E%3Cg transform='translate(3874)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='561'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h' x='955'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='1352'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='1635'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='2196'/%3E%3C/g%3E%3Cg transform='translate(6686)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k'/%3E%3Cg transform='translate(394)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23l'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='755'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h' x='1204'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1601'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' 
x='1995'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23m' x='2444'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23n' x='3371'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='10673'/%3E%3Cg transform='translate(11678)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23o'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23p' x='658'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='1496'/%3E%3C/g%3E%3C/g%3E%3Cdefs%3E%3Cpath id='k' stroke-width='10' d='M94 250q0 69 10 131t23 107 37 88 38 67 42 52 33 34 25 21h17q14 0 14-9 0-3-17-21t-41-53-49-86-42-138-17-193 17-192 41-139 49-86 42-53 17-21q0-9-15-9h-16l-28 24q-94 85-137 212T94 250Z'/%3E%3Cpath id='f' stroke-width='10' d='M56 237v13l14 20h299v150l1 150q10 13 19 13 13 0 20-15V270h298q15-8 15-20t-15-20H409V-68q-8-14-18-14h-4q-12 0-18 14v298H70q-14 7-14 20Z'/%3E%3Cpath id='n' stroke-width='10' d='m60 749 4 1h22l28-24q94-85 137-212t43-264q0-68-10-131T261 12t-37-88-38-67-41-51-32-33-23-19l-4-4H63q-3 0-5 3t-3 9q1 1 11 13Q221-64 221 250T66 725q-10 12-11 13 0 8 5 11Z'/%3E%3Cpath id='c' stroke-width='10' d='M295 316q0 40-27 69t-78 29q-36 0-62-13-30-19-30-52-1-5 0-13t16-24 43-25q18-5 44-9t44-9 32-13q17-8 33-20t32-41 17-62q0-62-38-102T198-10h-8q-52 0-96 36l-8-7-9-9Q71 4 65-1L54-11H42q-3 0-9 6v137q0 21 2 25t10 5h9q12 0 16-4t5-12 7-27 19-42q35-51 97-51 97 0 97 78 0 29-18 47-20 24-83 36t-83 23q-36 17-57 46t-21 62q0 39 17 66t43 40 50 18 44 5h11q40 0 70-15l15-8 9 7q10 9 22 17h12q3 0 9-6V310l-6-6h-28q-6 6-6 12Z'/%3E%3Cpath id='d' stroke-width='10' d='M28 218q0 55 20 100t50 73 65 42 66 15q53 0 91-18t58-50 28-64 9-71q0-7-7-14H126v-15q0-148 100-180 20-6 44-6 42 0 72 32 17 17 27 42l10 24q3 3 16 3h3q17 0 17-10 0-4-3-13-19-55-63-87t-99-32q-95 0-158 69T28 218Zm305 57q-11 128-95 136h-2q-8 0-16-1t-25-8-29-21-23-41-16-66v-7h206v8Z'/%3E%3Cpath id='l' stroke-width='10' d='m114 620-4 4-3 3-4 3q-4 3-5 2t-7 
2-11 1-13 1-19 1H19v46h9q18-3 124-3 121 0 142 3h11v-46h-21q-61-3-61-17 0-2 90-248t91-246l86 232q85 230 85 239 0 19-21 29t-46 11h-5v46h9q15-3 115-3 91 0 97 3h6v-46h-7q-75 0-96-41 0-1-112-305T401-14q-5-8-19-8h-15q-14 0-19 8-2 2-117 317-117 314-117 317Z'/%3E%3Cpath id='h' stroke-width='10' d='M36 46h14q39 0 47 14v31q0 14 1 31t0 39 0 42v125l-1 23q-3 19-14 25t-45 9H20v23q0 23 2 23l10 1q10 1 28 2t36 2q16 1 35 2t29 3 11 1h3v-69q39 68 97 68h6q45 0 66-22t21-46q0-21-13-36t-38-15q-25 0-37 16t-13 34q0 9 2 16t5 12 3 5q-2 2-23-4-16-8-24-15-47-45-47-179V101q0-12 1-20t0-15v-5q1-2 3-4t5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9q-18 3-127 3Q37 3 28 0h-8v46h16Z'/%3E%3Cpath id='e' stroke-width='10' d='M27 422q53 4 82 56t32 122v15h40V431h135v-46H181V241q1-125 1-141t7-32q14-39 49-39 44 0 54 71 1 8 1 46v35h40v-47q0-77-42-117-27-27-70-27-34 0-59 12t-38 31-19 35-7 32q-1 7-1 148v137H18v37h9Z'/%3E%3Cpath id='m' stroke-width='10' d='M201 0q-12 3-99 3-76 0-85-3h-6v46h14q23 1 42 6t29 9 25 17 18 18 21 26 20 28l46 60-58 78q-9 13-19 27t-16 21-11 15-9 12-6 7-7 6-6 3-6 2-8 2q-6 0-36 2H16v46h7q36-2 103-2 93 0 103 2h8v-46q-36-4-36-16 0-2 10-16t28-38 29-41l4-4 25 34q32 41 32 54 0 6-2 11t-5 7-5 4-7 4l-3 1h-5v46h7q15-3 99-3 79 0 85 3h6v-46h-7q-49 0-81-17-17-8-34-27t-65-84l-16-21 62-85q66-90 71-94t17-7q18-4 53-4h17V0h-14q-8 1-20 1t-25 1-25 0-18 1h-37q-26 0-50-2l-23-1h-9v46h3q11 0 22 5t11 12q0 2-40 57l-41 55q-1-1-31-42t-34-45q-4-5-4-14 0-11 7-19t18-9q2 0 2-23V0h-7Z'/%3E%3Cpath id='g' stroke-width='10' d='M55 507q0 83 57 140t131 57h14q85 0 148-63l21 31q5 7 10 15t10 13l3 4h4q3 0 6 1h4q3 0 9-6V462l-6-6h-18q-11 0-13 3t-5 20q-17 126-101 167-37 16-75 16-53 0-86-36t-33-84q0-34 17-62t48-45q10-4 86-23t84-23q57-22 93-75t37-123q0-81-52-146T301-21q-56 0-100 17t-61 31l-18 14q-4-5-15-20T87-7t-9-14q-2-1-10-1h-4q-3 0-9 6v117q0 119 1 121 2 5 20 5h13q6-6 6-13 0-32 10-63t34-61 66-48 100-18q47 0 81 38t34 93q0 43-22 78t-58 48q-56 14-74 19-5 1-27 6t-33 8-32 11-33 18-29 24-27 35q-30 49-30 105Z'/%3E%3Cpath id='i' stroke-width='10' 
d='M69 609q0 28 18 44t44 16q23-2 40-17t17-43q0-30-17-45t-42-15q-25 0-42 15t-18 45ZM247 0q-15 3-104 3h-37Q80 3 56 1L34 0h-8v46h16q28 0 49 3 9 4 11 11t2 42v191q0 52-2 66t-14 19q-14 7-47 7H30v23q0 23 2 23l10 1q10 1 28 2t36 2 36 2 29 3 11 1h3V62q5-10 12-12t35-4h23V0h-8Z'/%3E%3Cpath id='j' stroke-width='10' d='M376 495v40q0 24 1 33 0 45-10 56t-51 13h-18v23q0 23 2 23l10 1q10 1 29 2t37 2 37 2 30 3 11 1h3V390q0-306 1-309 3-20 14-26t45-9h18V0q-2 0-76-5t-79-6h-7v55l-8-7q-58-48-130-48-77 0-139 61T34 215q0 100 63 163t147 64q75 0 132-49v102Zm-3-153q-45 63-113 63-49 0-87-36-27-28-34-64t-8-94q0-56 7-91t35-61q30-33 78-33 71 0 122 77v239Z'/%3E%3Cpath id='a' stroke-width='10' d='M56 340q0 83 30 154t78 116 106 70 118 25q133 0 233-104t101-260q0-81-29-150T617 75 510 4 388-22 267 3 160 74 85 189 56 340Zm411 307q-41 18-79 18-28 0-57-11t-62-34-56-71-34-110q-5-28-5-85 0-210 103-293 50-41 108-41h6q83 0 146 79 66 89 66 255 0 57-5 85-21 153-131 208Z'/%3E%3Cpath id='b' stroke-width='10' d='M273 0q-18 3-127 3Q43 3 34 0h-8v46h16q28 0 49 3 8 3 12 11 1 2 1 164v161H33v46h71v66l1 67 2 10q19 65 64 94t95 36h9q8 0 14 1 41-3 62-26t21-52q0-23-14-37t-37-14-37 14-14 37q0 20 18 40h-4q-4 1-11 1-28 0-50-21t-34-55q-6-20-7-95v-66h111v-46H185V225q0-162 1-164t3-4 5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9Z'/%3E%3Cpath id='o' stroke-width='10' d='M128 619q-7 7-11 9t-16 3-43 3H25v46h557v-4q2-6 14-116t14-116v-4h-40v4q-7 49-9 57-6 37-18 62t-27 38-39 21-46 9-57 2h-88q-34 0-42-2t-11-10q-1-2-1-131V363h71q16 0 24 1t22 3 23 6 17 12q18 18 21 74v21h40V200h-40v21q-3 55-21 75-8 7-18 11t-23 6-21 3-24 1-19 0h-52V189l1-128q7-7 12-9t25-4 63-2h27V0h-12q-24 3-166 3Q51 3 36 0H25v46h33q42 1 51 3t19 12v558Z'/%3E%3Cpath id='p' stroke-width='10' d='M41 46h14q39 0 47 14v62q0 17 1 39t0 42v66q0 35-1 59v23q-3 19-14 25t-45 9H25v23q0 23 2 23l10 1q10 1 28 2t37 2q17 1 36 2t29 3 11 1h3v-40q0-38 1-38t5 5 12 15 19 18 29 19 38 16q20 5 51 5 15 0 28-2t23-6 19-8 15-9 11-11 9-11 7-11 4-10 3-8l2-5 3 4 6 8q3 4 9 11t13 13 15 13 20 12 23 10 26 7 31 3q126 0 
137-113 1-7 1-139v-86q0-38 2-45t11-10q21-3 49-3h16V0h-8l-23 1q-24 1-51 1t-38 1Q596 3 587 0h-8v46h16q61 0 61 16 1 2 1 138-1 135-2 143-6 28-20 42t-24 17-26 2q-45 0-79-34-27-27-34-55t-8-83V108q0-30 1-40t3-13 9-6q21-3 49-3h16V0h-8l-24 1q-23 1-50 1t-38 1Q319 3 310 0h-8v46h16q61 0 61 16 1 2 1 138-1 135-2 143-6 28-20 42t-24 17-26 2q-45 0-79-34-27-27-34-55t-8-83V108q0-30 1-40t3-13 9-6q21-3 49-3h16V0h-8l-23 1q-24 1-51 1t-38 1Q42 3 33 0h-8v46h16Z'/%3E%3C/defs%3E%3C/svg%3E"/&gt;&lt;/p&gt;
&lt;p&gt;The load is valid when that value is at most the buffer size in
bytes. We solve the integer inequality as:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Vertex less than or equal to the floor of size minus offset minus format divided by stride" style="display:block;margin:0 auto;max-width:18em" src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink' style='width:34.5ex;height:5.75ex;vertical-align:-2.375ex;margin:1px 0' viewBox='0 -1478.081 14863.222 2456.161'%3E%3Cg stroke='%23000' stroke-width='0' transform='scale(1 -1)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23a'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='755'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='1204'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1601'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='1995'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23e' x='2444'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23f' x='3254'/%3E%3Cg transform='translate(4315)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23g' x='277' y='-1'/%3E%3Cpath stroke='none' d='M985 220h8853v60H985z'/%3E%3Cg transform='translate(1045 676)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='561'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23j' x='844'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='1293'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k' x='1964'/%3E%3Cg transform='translate(2969)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23l'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23m' x='783'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23m' x='1094'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23n' x='1405'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' 
xlink:href='%23b' x='1804'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='2253'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23k' x='5838'/%3E%3Cg transform='translate(6843)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23o'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23p' x='658'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='1496'/%3E%3C/g%3E%3C/g%3E%3Cg transform='translate(4089 -690)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23h'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23d' x='561'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23c' x='955'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23i' x='1352'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23q' x='1635'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23b' x='2196'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23r' x='9959' y='-1'/%3E%3C/g%3E%3C/g%3E%3Cdefs%3E%3Cpath id='n' stroke-width='10' d='M295 316q0 40-27 69t-78 29q-36 0-62-13-30-19-30-52-1-5 0-13t16-24 43-25q18-5 44-9t44-9 32-13q17-8 33-20t32-41 17-62q0-62-38-102T198-10h-8q-52 0-96 36l-8-7-9-9Q71 4 65-1L54-11H42q-3 0-9 6v137q0 21 2 25t10 5h9q12 0 16-4t5-12 7-27 19-42q35-51 97-51 97 0 97 78 0 29-18 47-20 24-83 36t-83 23q-36 17-57 46t-21 62q0 39 17 66t43 40 50 18 44 5h11q40 0 70-15l15-8 9 7q10 9 22 17h12q3 0 9-6V310l-6-6h-28q-6 6-6 12Z'/%3E%3Cpath id='b' stroke-width='10' d='M28 218q0 55 20 100t50 73 65 42 66 15q53 0 91-18t58-50 28-64 9-71q0-7-7-14H126v-15q0-148 100-180 20-6 44-6 42 0 72 32 17 17 27 42l10 24q3 3 16 3h3q17 0 17-10 0-4-3-13-19-55-63-87t-99-32q-95 0-158 69T28 218Zm305 57q-11 128-95 136h-2q-8 0-16-1t-25-8-29-21-23-41-16-66v-7h206v8Z'/%3E%3Cpath id='a' stroke-width='10' d='m114 620-4 4-3 3-4 3q-4 3-5 2t-7 2-11 1-13 1-19 1H19v46h9q18-3 124-3 121 0 142 
3h11v-46h-21q-61-3-61-17 0-2 90-248t91-246l86 232q85 230 85 239 0 19-21 29t-46 11h-5v46h9q15-3 115-3 91 0 97 3h6v-46h-7q-75 0-96-41 0-1-112-305T401-14q-5-8-19-8h-15q-14 0-19 8-2 2-117 317-117 314-117 317Z'/%3E%3Cpath id='c' stroke-width='10' d='M36 46h14q39 0 47 14v31q0 14 1 31t0 39 0 42v125l-1 23q-3 19-14 25t-45 9H20v23q0 23 2 23l10 1q10 1 28 2t36 2q16 1 35 2t29 3 11 1h3v-69q39 68 97 68h6q45 0 66-22t21-46q0-21-13-36t-38-15q-25 0-37 16t-13 34q0 9 2 16t5 12 3 5q-2 2-23-4-16-8-24-15-47-45-47-179V101q0-12 1-20t0-15v-5q1-2 3-4t5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9q-18 3-127 3Q37 3 28 0h-8v46h16Z'/%3E%3Cpath id='d' stroke-width='10' d='M27 422q53 4 82 56t32 122v15h40V431h135v-46H181V241q1-125 1-141t7-32q14-39 49-39 44 0 54 71 1 8 1 46v35h40v-47q0-77-42-117-27-27-70-27-34 0-59 12t-38 31-19 35-7 32q-1 7-1 148v137H18v37h9Z'/%3E%3Cpath id='e' stroke-width='10' d='M201 0q-12 3-99 3-76 0-85-3h-6v46h14q23 1 42 6t29 9 25 17 18 18 21 26 20 28l46 60-58 78q-9 13-19 27t-16 21-11 15-9 12-6 7-7 6-6 3-6 2-8 2q-6 0-36 2H16v46h7q36-2 103-2 93 0 103 2h8v-46q-36-4-36-16 0-2 10-16t28-38 29-41l4-4 25 34q32 41 32 54 0 6-2 11t-5 7-5 4-7 4l-3 1h-5v46h7q15-3 99-3 79 0 85 3h6v-46h-7q-49 0-81-17-17-8-34-27t-65-84l-16-21 62-85q66-90 71-94t17-7q18-4 53-4h17V0h-14q-8 1-20 1t-25 1-25 0-18 1h-37q-26 0-50-2l-23-1h-9v46h3q11 0 22 5t11 12q0 2-40 57l-41 55q-1-1-31-42t-34-45q-4-5-4-14 0-11 7-19t18-9q2 0 2-23V0h-7Z'/%3E%3Cpath id='h' stroke-width='10' d='M55 507q0 83 57 140t131 57h14q85 0 148-63l21 31q5 7 10 15t10 13l3 4h4q3 0 6 1h4q3 0 9-6V462l-6-6h-18q-11 0-13 3t-5 20q-17 126-101 167-37 16-75 16-53 0-86-36t-33-84q0-34 17-62t48-45q10-4 86-23t84-23q57-22 93-75t37-123q0-81-52-146T301-21q-56 0-100 17t-61 31l-18 14q-4-5-15-20T87-7t-9-14q-2-1-10-1h-4q-3 0-9 6v117q0 119 1 121 2 5 20 5h13q6-6 6-13 0-32 10-63t34-61 66-48 100-18q47 0 81 38t34 93q0 43-22 78t-58 48q-56 14-74 19-5 1-27 6t-33 8-32 11-33 18-29 24-27 35q-30 49-30 105Z'/%3E%3Cpath id='i' stroke-width='10' d='M69 609q0 28 18 44t44 16q23-2 
40-17t17-43q0-30-17-45t-42-15q-25 0-42 15t-18 45ZM247 0q-15 3-104 3h-37Q80 3 56 1L34 0h-8v46h16q28 0 49 3 9 4 11 11t2 42v191q0 52-2 66t-14 19q-14 7-47 7H30v23q0 23 2 23l10 1q10 1 28 2t36 2 36 2 29 3 11 1h3V62q5-10 12-12t35-4h23V0h-8Z'/%3E%3Cpath id='q' stroke-width='10' d='M376 495v40q0 24 1 33 0 45-10 56t-51 13h-18v23q0 23 2 23l10 1q10 1 29 2t37 2 37 2 30 3 11 1h3V390q0-306 1-309 3-20 14-26t45-9h18V0q-2 0-76-5t-79-6h-7v55l-8-7q-58-48-130-48-77 0-139 61T34 215q0 100 63 163t147 64q75 0 132-49v102Zm-3-153q-45 63-113 63-49 0-87-36-27-28-34-64t-8-94q0-56 7-91t35-61q30-33 78-33 71 0 122 77v239Z'/%3E%3Cpath id='l' stroke-width='10' d='M56 340q0 83 30 154t78 116 106 70 118 25q133 0 233-104t101-260q0-81-29-150T617 75 510 4 388-22 267 3 160 74 85 189 56 340Zm411 307q-41 18-79 18-28 0-57-11t-62-34-56-71-34-110q-5-28-5-85 0-210 103-293 50-41 108-41h6q83 0 146 79 66 89 66 255 0 57-5 85-21 153-131 208Z'/%3E%3Cpath id='m' stroke-width='10' d='M273 0q-18 3-127 3Q43 3 34 0h-8v46h16q28 0 49 3 8 3 12 11 1 2 1 164v161H33v46h71v66l1 67 2 10q19 65 64 94t95 36h9q8 0 14 1 41-3 62-26t21-52q0-23-14-37t-37-14-37 14-14 37q0 20 18 40h-4q-4 1-11 1-28 0-50-21t-34-55q-6-20-7-95v-66h111v-46H185V225q0-162 1-164t3-4 5-3 5-3 7-2 7-1 9-1 9 0 10-1 10 0h31V0h-9Z'/%3E%3Cpath id='o' stroke-width='10' d='M128 619q-7 7-11 9t-16 3-43 3H25v46h557v-4q2-6 14-116t14-116v-4h-40v4q-7 49-9 57-6 37-18 62t-27 38-39 21-46 9-57 2h-88q-34 0-42-2t-11-10q-1-2-1-131V363h71q16 0 24 1t22 3 23 6 17 12q18 18 21 74v21h40V200h-40v21q-3 55-21 75-8 7-18 11t-23 6-21 3-24 1-19 0h-52V189l1-128q7-7 12-9t25-4 63-2h27V0h-12q-24 3-166 3Q51 3 36 0H25v46h33q42 1 51 3t19 12v558Z'/%3E%3Cpath id='p' stroke-width='10' d='M41 46h14q39 0 47 14v62q0 17 1 39t0 42v66q0 35-1 59v23q-3 19-14 25t-45 9H25v23q0 23 2 23l10 1q10 1 28 2t37 2q17 1 36 2t29 3 11 1h3v-40q0-38 1-38t5 5 12 15 19 18 29 19 38 16q20 5 51 5 15 0 28-2t23-6 19-8 15-9 11-11 9-11 7-11 4-10 3-8l2-5 3 4 6 8q3 4 9 11t13 13 15 13 20 12 23 10 26 7 31 3q126 0 137-113 1-7 1-139v-86q0-38 
2-45t11-10q21-3 49-3h16V0h-8l-23 1q-24 1-51 1t-38 1Q596 3 587 0h-8v46h16q61 0 61 16 1 2 1 138-1 135-2 143-6 28-20 42t-24 17-26 2q-45 0-79-34-27-27-34-55t-8-83V108q0-30 1-40t3-13 9-6q21-3 49-3h16V0h-8l-24 1q-23 1-50 1t-38 1Q319 3 310 0h-8v46h16q61 0 61 16 1 2 1 138-1 135-2 143-6 28-20 42t-24 17-26 2q-45 0-79-34-27-27-34-55t-8-83V108q0-30 1-40t3-13 9-6q21-3 49-3h16V0h-8l-23 1q-24 1-51 1t-38 1Q42 3 33 0h-8v46h16Z'/%3E%3Cpath id='f' stroke-width='10' d='M674 636q8 0 14-6t6-15-7-14q-1-1-270-129L151 346l248-118Q687 92 691 87q3-6 3-11 0-18-18-20h-6L382 192Q92 329 90 331q-7 5-7 17 1 11 13 17 8 4 286 135t283 134q4 2 9 2ZM84-118q0 10 15 20h579q16-6 16-20 0-12-15-20H98q-14 7-14 20Z'/%3E%3Cpath id='j' stroke-width='10' d='M42 263q2 7 6 82t5 78v8h340q6-6 6-16 0-12-1-13l-17-24q-17-23-50-69t-66-89L134 41l48-1h24q48 0 77 6t48 31q21 28 28 108l2 16q0 1 20 1h20v-6q0-1-8-93t-9-97V0H209L34 1l-3 2q-3 5-3 14 0 13 1 14t131 179 134 184h-58q-67-1-84-6-25-6-39-21-24-23-31-103v-9H42v8Z'/%3E%3Cpath id='k' stroke-width='10' d='M84 237v13l14 20h581q15-8 15-20t-15-20H98q-14 7-14 20Z'/%3E%3Cpath id='g' stroke-width='10' d='M246-949v2399h62V-887h263v-62H246Z'/%3E%3Cpath id='r' stroke-width='10' d='M274-887v2337h62V-949H11v62h263Z'/%3E%3C/defs%3E%3C/svg%3E"/&gt;&lt;/p&gt;
&lt;p&gt;The driver calculates the right-hand side and passes it into the
shader.&lt;/p&gt;
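&lt;p&gt;As a rough illustration of that driver-side calculation, here is a minimal Python sketch. The function name &lt;code&gt;last_valid_vertex&lt;/code&gt; and its parameter names are invented for this example and are not taken from the actual driver. It only evaluates the right-hand side of the inequality with integer arithmetic, and it assumes a positive stride.&lt;/p&gt;

```python
def last_valid_vertex(size, offset, stride, format_size):
    """Largest vertex index whose load stays inside the buffer.

    Solves "offset + stride * vertex + format_size is at most size"
    for an integer vertex. All quantities are in bytes. The result is
    negative when not even vertex 0 fits. Assumes stride is positive.
    """
    # Python's // rounds toward negative infinity, matching the floor.
    return (size - offset - format_size) // stride


# A 64-byte buffer of 16-byte vertices with a 16-byte attribute at
# offset 0: vertices 0 through 3 fit.
print(last_valid_vertex(64, 0, 16, 16))  # prints 3

# Same layout, but the buffer is too small for even one vertex.
print(last_valid_vertex(8, 0, 16, 16))   # prints -1
```

&lt;p&gt;The negative result in the second call is the case that needs separate handling, since an unsigned clamp cannot represent it.&lt;/p&gt;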
&lt;p&gt;One last problem: what if a buffer is too small to load
&lt;em&gt;anything&lt;/em&gt;? Clamping won’t save us – the code would clamp to a
negative index. In that case, the attribute is entirely invalid, so we
swap the application’s buffer for a small buffer of zeroes. Since we
gave each attribute its own base address, this determination is
per-attribute. Then clamping the index to zero correctly loads
zeroes.&lt;/p&gt;
&lt;p&gt;Putting it together, a little driver math gives us robust buffers at
the cost of one &lt;code&gt;umin&lt;/code&gt; instruction.&lt;/p&gt;
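&lt;p&gt;The whole scheme can be modelled in a few lines of Python. This is a toy simulation under stated assumptions, not the driver’s code: &lt;code&gt;ZERO_BUFFER&lt;/code&gt;, &lt;code&gt;prepare_attribute&lt;/code&gt; and &lt;code&gt;fetch&lt;/code&gt; are hypothetical names, buffers are plain &lt;code&gt;bytes&lt;/code&gt; objects, the stride is positive, and no attribute format is wider than 16 bytes.&lt;/p&gt;

```python
ZERO_BUFFER = bytes(16)  # hypothetical small buffer of zeroes


def prepare_attribute(buffer, offset, stride, format_size):
    """Driver side: choose the base and the clamp for one attribute."""
    last_valid = (len(buffer) - offset - format_size) // stride
    if last_valid >= 0:
        # Fold the offset into the base, as if the attribute had its
        # own buffer starting at that offset.
        return buffer[offset:], last_valid
    # Nothing can be loaded: point at zeroes and clamp every index to 0.
    return ZERO_BUFFER, 0


def fetch(base, last_valid, stride, format_size, vertex):
    """Shader side: one unsigned minimum, then an ordinary load."""
    idx = min(vertex, last_valid)
    start = idx * stride
    return base[start:start + format_size]


data = bytes(range(32))  # two 16-byte vertices
base, last = prepare_attribute(data, 0, 16, 16)
print(fetch(base, last, 16, 16, 1) == data[16:32])   # in bounds
print(fetch(base, last, 16, 16, 99) == data[16:32])  # clamped to the last vertex

tiny_base, tiny_last = prepare_attribute(bytes(4), 0, 16, 16)
print(fetch(tiny_base, tiny_last, 16, 16, 5) == bytes(16))  # reads zeroes
```

&lt;p&gt;All three comparisons print &lt;code&gt;True&lt;/code&gt;. In this model an out-of-range vertex reads the last valid vertex instead of running off the end of the buffer, and a buffer that is too small reads from the zero buffer.&lt;/p&gt;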
&lt;hr /&gt;
&lt;p&gt;In addition to buffer robustness, we need image robustness. Like its
buffer counterpart, image robustness requires that out-of-bounds image
loads return zero. That formalizes a guarantee that reasonable hardware
already makes.&lt;/p&gt;
&lt;p&gt;…But it would be no fun if our hardware were reasonable.&lt;/p&gt;
&lt;p&gt;Running the conformance tests for image robustness turns up a single
test failure, affecting “mipmapping”.&lt;/p&gt;
&lt;p&gt;For background, mipmapped images contain multiple “levels of detail”.
The base level is the original image; each successive level is the
previous level downscaled. When rendering, the hardware selects the
level closest to matching the on-screen size, improving efficiency and
visual quality.&lt;/p&gt;
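&lt;p&gt;To make the level structure concrete, here is a small Python sketch that lists the size of every level. It assumes the usual convention that each level halves both dimensions, rounding down and never going below one pixel. That convention is an assumption of this example and is not spelled out in the text. The name &lt;code&gt;mip_chain&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
def mip_chain(width, height):
    """List the (width, height) of every level, base level first.

    Assumes width and height are at least 1 and that each level
    halves the previous one, rounding down with a floor of 1.
    """
    levels = [(width, height)]
    while max(width, height) != 1:
        width, height = max(1, width // 2), max(1, height // 2)
        levels.append((width, height))
    return levels


print(mip_chain(8, 2))  # prints [(8, 2), (4, 1), (2, 1), (1, 1)]
```

&lt;p&gt;An 8×2 image therefore has four levels, numbered 0 to 3. Any other level is out-of-bounds.&lt;/p&gt;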
&lt;p&gt;With robustness, the specifications all agree that image loads
return…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zero if the X- or Y-coordinate is out-of-bounds&lt;/li&gt;
&lt;li&gt;Zero if the level is out-of-bounds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Meanwhile, image loads on the M1 GPU return…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zero if the X- or Y-coordinate is out-of-bounds&lt;/li&gt;
&lt;li&gt;Values from the last level if the level is out-of-bounds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Uh-oh. Rather than returning zero for out-of-bounds levels, the
hardware clamps the level and returns nonzero values. It’s a mystery
why. The vendor does not document their hardware publicly, forcing us to
rely on reverse engineering to build drivers. Without documentation, we
don’t know if this behaviour is intentional or a hardware bug. Either
way, we need a workaround to pass conformance.&lt;/p&gt;
&lt;p&gt;The obvious workaround is to never load from an invalid level:&lt;/p&gt;
&lt;div class="sourceCode" id="cb6"&gt;&lt;pre
class="sourceCode glsl"&gt;&lt;code class="sourceCode glsl"&gt;&lt;span id="cb6-1"&gt;&lt;a href="#cb6-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="kw"&gt;if&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;level &lt;span class="op"&gt;&amp;lt;=&lt;/span&gt; levels&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-2"&gt;&lt;a href="#cb6-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="kw"&gt;return&lt;/span&gt; &lt;span class="bu"&gt;imageLoad&lt;/span&gt;&lt;span class="op"&gt;(&lt;/span&gt;x&lt;span class="op"&gt;,&lt;/span&gt; y&lt;span class="op"&gt;,&lt;/span&gt; level&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-3"&gt;&lt;a href="#cb6-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt; &lt;span class="kw"&gt;else&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-4"&gt;&lt;a href="#cb6-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="kw"&gt;return&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-5"&gt;&lt;a href="#cb6-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That involves branching, which is inefficient. Loading an
out-of-bounds level doesn’t crash, so we can speculatively load and then
use a compare-and-select operation instead of branching:&lt;/p&gt;
&lt;div class="sourceCode" id="cb7"&gt;&lt;pre
class="sourceCode glsl"&gt;&lt;code class="sourceCode glsl"&gt;&lt;span id="cb7-1"&gt;&lt;a href="#cb7-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="dt"&gt;vec4&lt;/span&gt; data &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="bu"&gt;imageLoad&lt;/span&gt;&lt;span class="op"&gt;(&lt;/span&gt;x&lt;span class="op"&gt;,&lt;/span&gt; y&lt;span class="op"&gt;,&lt;/span&gt; level&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb7-2"&gt;&lt;a href="#cb7-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb7-3"&gt;&lt;a href="#cb7-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="kw"&gt;return&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;level &lt;span class="op"&gt;&amp;lt;=&lt;/span&gt; levels&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;?&lt;/span&gt; data &lt;span class="op"&gt;:&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This workaround is okay, but it could be improved. While the M1 GPU
has combined compare-and-select instructions, the instruction set is
&lt;em&gt;scalar&lt;/em&gt;. Each thread processes one value at a time, not a vector
of multiple values. However, image loads return a vector of four
components (red, green, blue, alpha). While the pseudo-code looks
efficient, the resulting assembly is not:&lt;/p&gt;
&lt;div class="sourceCode" id="cb8"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb8-1"&gt;&lt;a href="#cb8-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;image_load R&lt;span class="op"&gt;,&lt;/span&gt; x&lt;span class="op"&gt;,&lt;/span&gt; y&lt;span class="op"&gt;,&lt;/span&gt; level&lt;/span&gt;
&lt;span id="cb8-2"&gt;&lt;a href="#cb8-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; level&lt;span class="op"&gt;,&lt;/span&gt; levels&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb8-3"&gt;&lt;a href="#cb8-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; level&lt;span class="op"&gt;,&lt;/span&gt; levels&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb8-4"&gt;&lt;a href="#cb8-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;2&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; level&lt;span class="op"&gt;,&lt;/span&gt; levels&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;2&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb8-5"&gt;&lt;a href="#cb8-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;3&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; level&lt;span class="op"&gt;,&lt;/span&gt; levels&lt;span class="op"&gt;,&lt;/span&gt; R&lt;span class="op"&gt;[&lt;/span&gt;&lt;span class="dv"&gt;3&lt;/span&gt;&lt;span class="op"&gt;],&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Fortunately, the vendor driver has a trick. We know the hardware
returns zero if either X or Y is out-of-bounds, so we can &lt;em&gt;force&lt;/em&gt;
a zero output by &lt;em&gt;setting&lt;/em&gt; X or Y out-of-bounds. As the maximum
image size is 16384 pixels wide, any X of 16384 or more is
out-of-bounds for every image. That justifies an alternate workaround:&lt;/p&gt;
&lt;div class="sourceCode" id="cb9"&gt;&lt;pre
class="sourceCode glsl"&gt;&lt;code class="sourceCode glsl"&gt;&lt;span id="cb9-1"&gt;&lt;a href="#cb9-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="dt"&gt;bool&lt;/span&gt; valid &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;level &lt;span class="op"&gt;&amp;lt;=&lt;/span&gt; levels&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb9-2"&gt;&lt;a href="#cb9-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="dt"&gt;int&lt;/span&gt; x_ &lt;span class="op"&gt;=&lt;/span&gt; valid &lt;span class="op"&gt;?&lt;/span&gt; x &lt;span class="op"&gt;:&lt;/span&gt; &lt;span class="dv"&gt;20000&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb9-3"&gt;&lt;a href="#cb9-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb9-4"&gt;&lt;a href="#cb9-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="kw"&gt;return&lt;/span&gt; &lt;span class="bu"&gt;imageLoad&lt;/span&gt;&lt;span class="op"&gt;(&lt;/span&gt;x_&lt;span class="op"&gt;,&lt;/span&gt; y&lt;span class="op"&gt;,&lt;/span&gt; level&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Why is this better? We only change a single scalar, not a whole
vector, compiling to compact scalar assembly:&lt;/p&gt;
&lt;div class="sourceCode" id="cb10"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb10-1"&gt;&lt;a href="#cb10-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;ulesel x_&lt;span class="op"&gt;,&lt;/span&gt; level&lt;span class="op"&gt;,&lt;/span&gt; levels&lt;span class="op"&gt;,&lt;/span&gt; x&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="dv"&gt;20000&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb10-2"&gt;&lt;a href="#cb10-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;image_load R&lt;span class="op"&gt;,&lt;/span&gt; x_&lt;span class="op"&gt;,&lt;/span&gt; y&lt;span class="op"&gt;,&lt;/span&gt; level&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If we preload the constant to a uniform register, the workaround is a
single instruction. That’s optimal – and it passes conformance.&lt;/p&gt;
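&lt;p&gt;The behaviour and the workaround can be checked with a toy model in Python. Everything here is a sketch: &lt;code&gt;hardware_image_load&lt;/code&gt; only imitates the M1 behaviour as described above, &lt;code&gt;robust_image_load&lt;/code&gt; is a hypothetical name for the workaround, images are nested lists, and levels are assumed to be non-negative.&lt;/p&gt;

```python
OUT_OF_BOUNDS_X = 20000  # past the 16384-pixel maximum width


def hardware_image_load(image, x, y, level):
    """Toy model of the hardware behaviour described above.

    `image` is a list of mip levels, each a list of rows of RGBA tuples.
    """
    level = min(level, len(image) - 1)  # the hardware clamps the level
    rows = image[level]
    if y not in range(len(rows)) or x not in range(len(rows[0])):
        return (0, 0, 0, 0)             # X or Y out of bounds gives zero
    return rows[y][x]


def robust_image_load(image, x, y, level):
    """Workaround: push X out of bounds when the level is invalid."""
    valid = level in range(len(image))
    x_ = x if valid else OUT_OF_BOUNDS_X
    return hardware_image_load(image, x_, y, level)


image = [
    [[(1, 1, 1, 1), (2, 2, 2, 2)], [(3, 3, 3, 3), (4, 4, 4, 4)]],  # level 0
    [[(9, 9, 9, 9)]],                                              # level 1
]
print(hardware_image_load(image, 0, 0, 5))  # prints (9, 9, 9, 9)
print(robust_image_load(image, 0, 0, 5))    # prints (0, 0, 0, 0)
print(robust_image_load(image, 1, 1, 0))    # prints (4, 4, 4, 4)
```

&lt;p&gt;The first call shows the leak from the last level, the second shows the workaround returning zero for the same out-of-bounds level, and the third confirms that valid loads are unchanged.&lt;/p&gt;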
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Blender &lt;a
href="https://download.blender.org/demo/eevee/wanderer/wanderer.blend"&gt;“Wanderer”&lt;/a&gt;
demo by &lt;a href="https://www.artstation.com/dbystedt"&gt;Daniel
Bystedt&lt;/a&gt;, licensed CC BY-SA.&lt;/em&gt;&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/conformant-gl46-on-the-m1.html</guid><pubDate>Wed, 14 Feb 2024 00:00:00 -0500</pubDate></item><item><title>The first conformant M1 GPU driver</title><link>https://alyssarosenzweig.ca/blog/first-conformant-m1-gpu-driver.html</link><description>&lt;p&gt;Conformant OpenGL® ES 3.1 drivers are now available for M1- and
M2-family GPUs. That means the drivers are compatible with any OpenGL ES
3.1 application. Interested? &lt;a
href="https://fedora-asahi-remix.org/"&gt;Just install Linux!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For existing &lt;a href="https://asahilinux.org/"&gt;Asahi Linux&lt;/a&gt; users,
upgrade your system with &lt;code style="white-space:nowrap;"&gt;dnf
upgrade&lt;/code&gt; (Fedora) or &lt;code style="white-space:nowrap;"&gt;pacman
-Syu&lt;/code&gt; (Arch) for the latest drivers.&lt;/p&gt;
&lt;p&gt;Our reverse-engineered, free and &lt;a
href="https://gitlab.freedesktop.org/asahi/mesa"&gt;open source graphics
drivers&lt;/a&gt; are the world’s &lt;strong&gt;&lt;em&gt;only&lt;/em&gt;&lt;/strong&gt; conformant
OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware.
That means our driver passed tens of thousands of tests to demonstrate
correctness and is now recognized by the industry.&lt;/p&gt;
&lt;p&gt;To become conformant, an “implementation” must pass the official
conformance test suite, designed to verify every feature in the
specification. The test results are submitted to Khronos, the standards
body. After a &lt;a
href="https://www.khronos.org/conformance/adopters/"&gt;30-day review
period&lt;/a&gt;, if no issues are found, the implementation becomes
conformant. The Khronos website lists all conformant implementations,
including our drivers for the &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1007"&gt;M1&lt;/a&gt;,
&lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1014"&gt;M1
Pro/Max/Ultra&lt;/a&gt;, &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1016"&gt;M2&lt;/a&gt;,
and &lt;a
href="https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1017"&gt;M2
Pro/Max&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Today’s milestone isn’t just about OpenGL ES. We’re releasing the
first conformant implementation of &lt;em&gt;any&lt;/em&gt; graphics standard for
the M1. And we don’t plan to stop here ;-)&lt;/p&gt;
&lt;p&gt;&lt;a href="/vkinstancing.webp"&gt;&lt;img src="/blog/vkinstancing2.webp"
style="width: 85%;margin: 0 auto;display: block"
alt="Teaser of the “Vulkan instancing” demo running on Asahi Linux" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Unlike ours, the manufacturer’s M1 drivers are unfortunately not
conformant for &lt;em&gt;any&lt;/em&gt; standard graphics API, whether Vulkan or
OpenGL or OpenGL ES. That means there is no guarantee that
applications using the standards will work on your M1/M2 (if you’re not
running Linux). This isn’t just a theoretical issue. Consider Vulkan.
The third-party &lt;a
href="https://github.com/KhronosGroup/MoltenVK"&gt;MoltenVK&lt;/a&gt; layers a
subset of Vulkan on top of the proprietary drivers. However, those
drivers lack key functionality, breaking valid Vulkan applications. That
hinders developers and users alike, if they haven’t yet switched their
M1/M2 computers to Linux.&lt;/p&gt;
&lt;p&gt;Why did &lt;em&gt;we&lt;/em&gt; pursue standards conformance when the
manufacturer did not? Above all, our commitment to quality. We want our
users to know that they can depend on our Linux drivers. We want
standard software to run without M1-specific hacks or porting. We want
to set the right example for the ecosystem: the way forward is
implementing open standards, conformant to the specifications, without
compromises for “portability”. We are not satisfied with proprietary
drivers, proprietary APIs, and refusal to implement standards. The rest
of the industry knows that progress comes from cross-vendor
collaboration. We know it, too. Achieving conformance is a win for our
community, for open source, and for open graphics.&lt;/p&gt;
&lt;p&gt;Of course, &lt;a href="https://vt.social/@lina/"&gt;Asahi Lina&lt;/a&gt; and I
are two individuals with minimal funding. It’s a little awkward that we
beat the big corporation…&lt;/p&gt;
&lt;p&gt;It’s not too late though. They should follow our lead!&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;OpenGL ES 3.1 updates the experimental &lt;a
href="/blog/opengl3-on-asahi-linux.html"&gt;OpenGL ES 3.0 and OpenGL
3.1&lt;/a&gt; we shipped in June. Notably, ES 3.1 adds compute shaders,
typically used to accelerate general computations within graphics
applications. For example, a 3D game could run its physics simulations
in a compute shader. The simulation results can then be used for
rendering, eliminating stalls that would otherwise be required to
synchronize the GPU with a CPU physics simulation. That lets the game
run faster.&lt;/p&gt;
&lt;p&gt;Let’s zoom in on one new feature: atomics on images. Older versions
of OpenGL ES allowed an application to read an image in order to display
it on screen. ES 3.1 allows the application to &lt;em&gt;write&lt;/em&gt; to the
image, typically from a compute shader. This new feature enables
flexible image processing algorithms, which previously needed to fit
into the fixed-function 3D pipeline. However, GPUs are massively
parallel, running thousands of threads at the same time. If two threads
write to the same location, there is a conflict: depending on which thread
runs first, the result will be different. We have a race condition.&lt;/p&gt;
&lt;p&gt;“Atomic” access to memory provides a solution to race conditions.
With atomics, special hardware in the memory subsystem guarantees
consistent, well-defined results for select operations, regardless of
the order of the threads. Modern graphics hardware supports various
atomic operations, like addition, serving as building blocks for complex
parallel algorithms.&lt;/p&gt;
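&lt;p&gt;As a rough illustration, here is a sequential Python model of that difference. It is a sketch rather than GPU code: the helper &lt;code&gt;run&lt;/code&gt; and the values 5 and 7 are made up, and each operation is modelled as indivisible, which is the guarantee the hardware provides for atomics.&lt;/p&gt;

```python
from itertools import permutations

def run(ops, initial=0):
    """Apply each thread's operation to one memory location, in the given order."""
    value = initial
    for op in ops:
        value = op(value)
    return value

# Hypothetical threads: two plain stores, and two atomic additions.
plain_stores = [lambda value: 5, lambda value: 7]
atomic_adds = [lambda value: value + 5, lambda value: value + 7]

# Try every order in which the two threads could run.
store_results = {run(order) for order in permutations(plain_stores)}
add_results = {run(order) for order in permutations(atomic_adds)}

print(sorted(store_results))  # [5, 7]: the outcome depends on the order
print(sorted(add_results))    # [12]: the same outcome in either order
```

&lt;p&gt;The plain stores leave behind whichever value was written last, while the additions reach the same total no matter which thread goes first.&lt;/p&gt;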
&lt;p&gt;Can we put these two features together to write to an image
atomically?&lt;/p&gt;
&lt;p&gt;Yes. A ubiquitous OpenGL ES &lt;a
href="https://registry.khronos.org/OpenGL/extensions/OES/OES_shader_image_atomic.txt"&gt;extension&lt;/a&gt;,
required for ES 3.2, adds atomics operating on pixels in an image. For
example, a compute shader could atomically increment the value at pixel
(10, 20).&lt;/p&gt;
&lt;p&gt;Other GPUs have dedicated instructions to perform atomics on
images, making the driver implementation straightforward. For us, the
story is more complicated. The M1 lacks hardware instructions for image
atomics, even though it has non-image atomics and non-atomic images. We
need to reframe the problem.&lt;/p&gt;
&lt;p&gt;The idea is simple: to perform an atomic on a pixel, we instead
calculate the address of the pixel in memory and perform a regular
atomic on that address. Since the hardware supports regular atomics, our
task is “just” calculating the pixel’s address.&lt;/p&gt;
&lt;p&gt;If the image were laid out linearly in memory, this would be
straightforward: multiply the Y-coordinate by the number of bytes per
row (“stride”), multiply the X-coordinate by the number of bytes per
pixel, and add. That gives the pixel’s offset in bytes relative to the
first pixel of the image. To get the final address, we add that offset
to the address of the first pixel.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Address of (X, Y) equals Address of (0, 0) + Y times Stride + X times Bytes Per Pixel" src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink' viewBox='0 -776.7226410624452 27924.66666666667 1053.4452821248904' style='width: 64.861ex; height: 2.5ex; vertical-align: -0.694ex; margin: 1px 0px;'%3E%3Cg stroke='black' fill='black' stroke-width='0' transform='matrix(1 0 0 -1 0 0)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-41'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-64' x='755' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-64' x='1316' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-72' x='1877' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-65' x='2274' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-73' x='2723' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-73' x='3122' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-28' x='3521' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMATHI-58' x='3915' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-2C' x='4772' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMATHI-59' x='5221' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-29' x='5989' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-3D' x='6661' y='0'/%3E%3Cg transform='translate(7722,0)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-41'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-64' x='755' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-64' x='1316' y='0'/%3E%3Cuse 
xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-72' x='1877' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-65' x='2274' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-73' x='2723' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-73' x='3122' y='0'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-28' x='11243' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-30' x='11637' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-2C' x='12142' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-30' x='12591' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-29' x='13096' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-2B' x='13713' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMATHI-59' x='14718' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-22C5' x='15708' y='0'/%3E%3Cg transform='translate(16213,0)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-53'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-74' x='561' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-72' x='955' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-69' x='1352' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-64' x='1635' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-65' x='2196' y='0'/%3E%3C/g%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-2B' x='19081' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMATHI-58' x='20086' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' 
xlink:href='%23MJMAIN-22C5' x='21165' y='0'/%3E%3Cg transform='translate(21670,0)'%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-42'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-79' x='713' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-74' x='1246' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-65' x='1640' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-73' x='2089' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-50' x='2488' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-65' x='3174' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-72' x='3623' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-50' x='4020' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-69' x='4706' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-78' x='4989' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-65' x='5522' y='0'/%3E%3Cuse xmlns:xlink='http://www.w3.org/1999/xlink' xlink:href='%23MJMAIN-6C' x='5971' y='0'/%3E%3C/g%3E%3C/g%3E%3Cdefs id='MathJax_SVG_glyphs'%3E%3Cpath id='MJSZ2-2211' stroke-width='10' d='M60 948Q63 950 665 950H1267L1325 815Q1384 677 1388 669H1348L1341 683Q1320 724 1285 761Q1235 809 1174 838T1033 881T882 898T699 902H574H543H251L259 891Q722 258 724 252Q725 250 724 246Q721 243 460 -56L196 -356Q196 -357 407 -357Q459 -357 548 -357T676 -358Q812 -358 896 -353T1063 -332T1204 -283T1307 -196Q1328 -170 1348 -124H1388Q1388 -125 1381 -145T1356 -210T1325 -294L1267 -449L666 -450Q64 -450 61 -448Q55 -446 55 -439Q55 -437 57 -433L590 177Q590 178 557 222T452 366T322 544L56 909L55 924Q55 945 60 948Z'/%3E%3Cpath id='MJMATHI-69' stroke-width='10' d='M184 600Q184 624 203 642T247 661Q265 661 277 649T290 619Q290 
596 270 577T226 557Q211 557 198 567T184 600ZM21 287Q21 295 30 318T54 369T98 420T158 442Q197 442 223 419T250 357Q250 340 236 301T196 196T154 83Q149 61 149 51Q149 26 166 26Q175 26 185 29T208 43T235 78T260 137Q263 149 265 151T282 153Q302 153 302 143Q302 135 293 112T268 61T223 11T161 -11Q129 -11 102 10T74 74Q74 91 79 106T122 220Q160 321 166 341T173 380Q173 404 156 404H154Q124 404 99 371T61 287Q60 286 59 284T58 281T56 279T53 278T49 278T41 278H27Q21 284 21 287Z'/%3E%3Cpath id='MJMAIN-3D' stroke-width='10' d='M56 347Q56 360 70 367H707Q722 359 722 347Q722 336 708 328L390 327H72Q56 332 56 347ZM56 153Q56 168 72 173H708Q722 163 722 153Q722 140 707 133H70Q56 140 56 153Z'/%3E%3Cpath id='MJMAIN-30' stroke-width='10' d='M96 585Q152 666 249 666Q297 666 345 640T423 548Q460 465 460 320Q460 165 417 83Q397 41 362 16T301 -15T250 -22Q224 -22 198 -16T137 16T82 83Q39 165 39 320Q39 494 96 585ZM321 597Q291 629 250 629Q208 629 178 597Q153 571 145 525T137 333Q137 175 145 125T181 46Q209 16 250 16Q290 16 318 46Q347 76 354 130T362 333Q362 478 354 524T321 597Z'/%3E%3Cpath id='MJMATHI-6E' stroke-width='10' d='M21 287Q22 293 24 303T36 341T56 388T89 425T135 442Q171 442 195 424T225 390T231 369Q231 367 232 367L243 378Q304 442 382 442Q436 442 469 415T503 336T465 179T427 52Q427 26 444 26Q450 26 453 27Q482 32 505 65T540 145Q542 153 560 153Q580 153 580 145Q580 144 576 130Q568 101 554 73T508 17T439 -10Q392 -10 371 17T350 73Q350 92 386 193T423 345Q423 404 379 404H374Q288 404 229 303L222 291L189 157Q156 26 151 16Q138 -11 108 -11Q95 -11 87 -5T76 7T74 17Q74 30 112 180T152 343Q153 348 153 366Q153 405 129 405Q91 405 66 305Q60 285 60 284Q58 278 41 278H27Q21 284 21 287Z'/%3E%3Cpath id='MJMAIN-28' stroke-width='10' d='M94 250Q94 319 104 381T127 488T164 576T202 643T244 695T277 729T302 750H315H319Q333 750 333 741Q333 738 316 720T275 667T226 581T184 443T167 250T184 58T225 -81T274 -167T316 -220T333 -241Q333 -250 318 -250H315H302L274 -226Q180 -141 137 -14T94 250Z'/%3E%3Cpath id='MJMAIN-2B' stroke-width='10' d='M56 
237T56 250T70 270H369V420L370 570Q380 583 389 583Q402 583 409 568V270H707Q722 262 722 250T707 230H409V-68Q401 -82 391 -82H389H387Q375 -82 369 -68V230H70Q56 237 56 250Z'/%3E%3Cpath id='MJMAIN-31' stroke-width='10' d='M213 578L200 573Q186 568 160 563T102 556H83V602H102Q149 604 189 617T245 641T273 663Q275 666 285 666Q294 666 302 660V361L303 61Q310 54 315 52T339 48T401 46H427V0H416Q395 3 257 3Q121 3 100 0H88V46H114Q136 46 152 46T177 47T193 50T201 52T207 57T213 61V578Z'/%3E%3Cpath id='MJMAIN-29' stroke-width='10' d='M60 749L64 750Q69 750 74 750H86L114 726Q208 641 251 514T294 250Q294 182 284 119T261 12T224 -76T186 -143T145 -194T113 -227T90 -246Q87 -249 86 -250H74Q66 -250 63 -250T58 -247T55 -238Q56 -237 66 -225Q221 -64 221 250T66 725Q56 737 55 738Q55 746 60 749Z'/%3E%3Cpath id='MJMAIN-32' stroke-width='10' d='M109 429Q82 429 66 447T50 491Q50 562 103 614T235 666Q326 666 387 610T449 465Q449 422 429 383T381 315T301 241Q265 210 201 149L142 93L218 92Q375 92 385 97Q392 99 409 186V189H449V186Q448 183 436 95T421 3V0H50V19V31Q50 38 56 46T86 81Q115 113 136 137Q145 147 170 174T204 211T233 244T261 278T284 308T305 340T320 369T333 401T340 431T343 464Q343 527 309 573T212 619Q179 619 154 602T119 569T109 550Q109 549 114 549Q132 549 151 535T170 489Q170 464 154 447T109 429Z'/%3E%3Cpath id='MJMAIN-41' stroke-width='10' d='M255 0Q240 3 140 3Q48 3 39 0H32V46H47Q119 49 139 88Q140 91 192 245T295 553T348 708Q351 716 366 716H376Q396 715 400 709Q402 707 508 390L617 67Q624 54 636 51T687 46H717V0H708Q699 3 581 3Q458 3 437 0H427V46H440Q510 46 510 64Q510 66 486 138L462 209H229L209 150Q189 91 189 85Q189 72 209 59T259 46H264V0H255ZM447 255L345 557L244 256Q244 255 345 255H447Z'/%3E%3Cpath id='MJMAIN-64' stroke-width='10' d='M376 495Q376 511 376 535T377 568Q377 613 367 624T316 637H298V660Q298 683 300 683L310 684Q320 685 339 686T376 688Q393 689 413 690T443 693T454 694H457V390Q457 84 458 81Q461 61 472 55T517 46H535V0Q533 0 459 -5T380 -11H373V44L365 37Q307 -11 235 -11Q158 -11 96 50T34 215Q34 315 97 378T244 
442Q319 442 376 393V495ZM373 342Q328 405 260 405Q211 405 173 369Q146 341 139 305T131 211Q131 155 138 120T173 59Q203 26 251 26Q322 26 373 103V342Z'/%3E%3Cpath id='MJMAIN-72' stroke-width='10' d='M36 46H50Q89 46 97 60V68Q97 77 97 91T98 122T98 161T98 203Q98 234 98 269T98 328L97 351Q94 370 83 376T38 385H20V408Q20 431 22 431L32 432Q42 433 60 434T96 436Q112 437 131 438T160 441T171 442H174V373Q213 441 271 441H277Q322 441 343 419T364 373Q364 352 351 337T313 322Q288 322 276 338T263 372Q263 381 265 388T270 400T273 405Q271 407 250 401Q234 393 226 386Q179 341 179 207V154Q179 141 179 127T179 101T180 81T180 66V61Q181 59 183 57T188 54T193 51T200 49T207 48T216 47T225 47T235 46T245 46H276V0H267Q249 3 140 3Q37 3 28 0H20V46H36Z'/%3E%3Cpath id='MJMAIN-65' stroke-width='10' d='M28 218Q28 273 48 318T98 391T163 433T229 448Q282 448 320 430T378 380T406 316T415 245Q415 238 408 231H126V216Q126 68 226 36Q246 30 270 30Q312 30 342 62Q359 79 369 104L379 128Q382 131 395 131H398Q415 131 415 121Q415 117 412 108Q393 53 349 21T250 -11Q155 -11 92 58T28 218ZM333 275Q322 403 238 411H236Q228 411 220 410T195 402T166 381T143 340T127 274V267H333V275Z'/%3E%3Cpath id='MJMAIN-73' stroke-width='10' d='M295 316Q295 356 268 385T190 414Q154 414 128 401Q98 382 98 349Q97 344 98 336T114 312T157 287Q175 282 201 278T245 269T277 256Q294 248 310 236T342 195T359 133Q359 71 321 31T198 -10H190Q138 -10 94 26L86 19L77 10Q71 4 65 -1L54 -11H46H42Q39 -11 33 -5V74V132Q33 153 35 157T45 162H54Q66 162 70 158T75 146T82 119T101 77Q136 26 198 26Q295 26 295 104Q295 133 277 151Q257 175 194 187T111 210Q75 227 54 256T33 318Q33 357 50 384T93 424T143 442T187 447H198Q238 447 268 432L283 424L292 431Q302 440 314 448H322H326Q329 448 335 442V310L329 304H301Q295 310 295 316Z'/%3E%3Cpath id='MJMATHI-58' stroke-width='10' d='M42 0H40Q26 0 26 11Q26 15 29 27Q33 41 36 43T55 46Q141 49 190 98Q200 108 306 224T411 342Q302 620 297 625Q288 636 234 637H206Q200 643 200 645T202 664Q206 677 212 683H226Q260 681 347 681Q380 681 408 681T453 682T473 682Q490 682 490 
671Q490 670 488 658Q484 643 481 640T465 637Q434 634 411 620L488 426L541 485Q646 598 646 610Q646 628 622 635Q617 635 609 637Q594 637 594 648Q594 650 596 664Q600 677 606 683H618Q619 683 643 683T697 681T738 680Q828 680 837 683H845Q852 676 852 672Q850 647 840 637H824Q790 636 763 628T722 611T698 593L687 584Q687 585 592 480L505 384Q505 383 536 304T601 142T638 56Q648 47 699 46Q734 46 734 37Q734 35 732 23Q728 7 725 4T711 1Q708 1 678 1T589 2Q528 2 496 2T461 1Q444 1 444 10Q444 11 446 25Q448 35 450 39T455 44T464 46T480 47T506 54Q523 62 523 64Q522 64 476 181L429 299Q241 95 236 84Q232 76 232 72Q232 53 261 47Q262 47 267 47T273 46Q276 46 277 46T280 45T283 42T284 35Q284 26 282 19Q279 6 276 4T261 1Q258 1 243 1T201 2T142 2Q64 2 42 0Z'/%3E%3Cpath id='MJMAIN-2C' stroke-width='10' d='M78 35T78 60T94 103T137 121Q165 121 187 96T210 8Q210 -27 201 -60T180 -117T154 -158T130 -185T117 -194Q113 -194 104 -185T95 -172Q95 -168 106 -156T131 -126T157 -76T173 -3V9L172 8Q170 7 167 6T161 3T152 1T140 0Q113 0 96 17Z'/%3E%3Cpath id='MJMATHI-59' stroke-width='10' d='M66 637Q54 637 49 637T39 638T32 641T30 647T33 664T42 682Q44 683 56 683Q104 680 165 680Q288 680 306 683H316Q322 677 322 674T320 656Q316 643 310 637H298Q242 637 242 624Q242 619 292 477T343 333L346 336Q350 340 358 349T379 373T411 410T454 461Q546 568 561 587T577 618Q577 634 545 637Q528 637 528 647Q528 649 530 661Q533 676 535 679T549 683Q551 683 578 682T657 680Q684 680 713 681T746 682Q763 682 763 673Q763 669 760 657T755 643Q753 637 734 637Q662 632 617 587Q608 578 477 424L348 273L322 169Q295 62 295 57Q295 46 363 46Q379 46 384 45T390 35Q390 33 388 23Q384 6 382 4T366 1Q361 1 324 1T232 2Q170 2 138 2T102 1Q84 1 84 9Q84 14 87 24Q88 27 89 30T90 35T91 39T93 42T96 44T101 45T107 45T116 46T129 46Q168 47 180 50T198 63Q201 68 227 171L252 274L129 623Q128 624 127 625T125 627T122 629T118 631T113 633T105 634T96 635T83 636T66 637Z'/%3E%3Cpath id='MJMAIN-22C5' stroke-width='10' d='M78 250Q78 274 95 292T138 310Q162 310 180 294T199 251Q199 226 182 208T139 190T96 207T78 
250Z'/%3E%3Cpath id='MJMATHI-53' stroke-width='10' d='M308 24Q367 24 416 76T466 197Q466 260 414 284Q308 311 278 321T236 341Q176 383 176 462Q176 523 208 573T273 648Q302 673 343 688T407 704H418H425Q521 704 564 640Q565 640 577 653T603 682T623 704Q624 704 627 704T632 705Q645 705 645 698T617 577T585 459T569 456Q549 456 549 465Q549 471 550 475Q550 478 551 494T553 520Q553 554 544 579T526 616T501 641Q465 662 419 662Q362 662 313 616T263 510Q263 480 278 458T319 427Q323 425 389 408T456 390Q490 379 522 342T554 242Q554 216 546 186Q541 164 528 137T492 78T426 18T332 -20Q320 -22 298 -22Q199 -22 144 33L134 44L106 13Q83 -14 78 -18T65 -22Q52 -22 52 -14Q52 -11 110 221Q112 227 130 227H143Q149 221 149 216Q149 214 148 207T144 186T142 153Q144 114 160 87T203 47T255 29T308 24Z'/%3E%3Cpath id='MJMATHI-74' stroke-width='10' d='M26 385Q19 392 19 395Q19 399 22 411T27 425Q29 430 36 430T87 431H140L159 511Q162 522 166 540T173 566T179 586T187 603T197 615T211 624T229 626Q247 625 254 615T261 596Q261 589 252 549T232 470L222 433Q222 431 272 431H323Q330 424 330 420Q330 398 317 385H210L174 240Q135 80 135 68Q135 26 162 26Q197 26 230 60T283 144Q285 150 288 151T303 153H307Q322 153 322 145Q322 142 319 133Q314 117 301 95T267 48T216 6T155 -11Q125 -11 98 4T59 56Q57 64 57 83V101L92 241Q127 382 128 383Q128 385 77 385H26Z'/%3E%3Cpath id='MJMATHI-72' stroke-width='10' d='M21 287Q22 290 23 295T28 317T38 348T53 381T73 411T99 433T132 442Q161 442 183 430T214 408T225 388Q227 382 228 382T236 389Q284 441 347 441H350Q398 441 422 400Q430 381 430 363Q430 333 417 315T391 292T366 288Q346 288 334 299T322 328Q322 376 378 392Q356 405 342 405Q286 405 239 331Q229 315 224 298T190 165Q156 25 151 16Q138 -11 108 -11Q95 -11 87 -5T76 7T74 17Q74 30 114 189T154 366Q154 405 128 405Q107 405 92 377T68 316T57 280Q55 278 41 278H27Q21 284 21 287Z'/%3E%3Cpath id='MJMATHI-64' stroke-width='10' d='M366 683Q367 683 438 688T511 694Q523 694 523 686Q523 679 450 384T375 83T374 68Q374 26 402 26Q411 27 422 35Q443 55 463 131Q469 151 473 152Q475 153 483 
153H487H491Q506 153 506 145Q506 140 503 129Q490 79 473 48T445 8T417 -8Q409 -10 393 -10Q359 -10 336 5T306 36L300 51Q299 52 296 50Q294 48 292 46Q233 -10 172 -10Q117 -10 75 30T33 157Q33 205 53 255T101 341Q148 398 195 420T280 442Q336 442 364 400Q369 394 369 396Q370 400 396 505T424 616Q424 629 417 632T378 637H357Q351 643 351 645T353 664Q358 683 366 683ZM352 326Q329 405 277 405Q242 405 210 374T160 293Q131 214 119 129Q119 126 119 118T118 106Q118 61 136 44T179 26Q233 26 290 98L298 109L352 326Z'/%3E%3Cpath id='MJMATHI-65' stroke-width='10' d='M39 168Q39 225 58 272T107 350T174 402T244 433T307 442H310Q355 442 388 420T421 355Q421 265 310 237Q261 224 176 223Q139 223 138 221Q138 219 132 186T125 128Q125 81 146 54T209 26T302 45T394 111Q403 121 406 121Q410 121 419 112T429 98T420 82T390 55T344 24T281 -1T205 -11Q126 -11 83 42T39 168ZM373 353Q367 405 305 405Q272 405 244 391T199 357T170 316T154 280T149 261Q149 260 169 260Q282 260 327 284T373 353Z'/%3E%3Cpath id='MJMAIN-53' stroke-width='10' d='M55 507Q55 590 112 647T243 704H257Q342 704 405 641L426 672Q431 679 436 687T446 700L449 704Q450 704 453 704T459 705H463Q466 705 472 699V462L466 456H448Q437 456 435 459T430 479Q413 605 329 646Q292 662 254 662Q201 662 168 626T135 542Q135 508 152 480T200 435Q210 431 286 412T370 389Q427 367 463 314T500 191Q500 110 448 45T301 -21Q245 -21 201 -4T140 27L122 41Q118 36 107 21T87 -7T78 -21Q76 -22 68 -22H64Q61 -22 55 -16V101Q55 220 56 222Q58 227 76 227H89Q95 221 95 214Q95 182 105 151T139 90T205 42T305 24Q352 24 386 62T420 155Q420 198 398 233T340 281Q284 295 266 300Q261 301 239 306T206 314T174 325T141 343T112 367T85 402Q55 451 55 507Z'/%3E%3Cpath id='MJMAIN-74' stroke-width='10' d='M27 422Q80 426 109 478T141 600V615H181V431H316V385H181V241Q182 116 182 100T189 68Q203 29 238 29Q282 29 292 100Q293 108 293 146V181H333V146V134Q333 57 291 17Q264 -10 221 -10Q187 -10 162 2T124 33T105 68T98 100Q97 107 97 248V385H18V422H27Z'/%3E%3Cpath id='MJMAIN-69' stroke-width='10' d='M69 609Q69 637 87 653T131 669Q154 667 171 
652T188 609Q188 579 171 564T129 549Q104 549 87 564T69 609ZM247 0Q232 3 143 3Q132 3 106 3T56 1L34 0H26V46H42Q70 46 91 49Q100 53 102 60T104 102V205V293Q104 345 102 359T88 378Q74 385 41 385H30V408Q30 431 32 431L42 432Q52 433 70 434T106 436Q123 437 142 438T171 441T182 442H185V62Q190 52 197 50T232 46H255V0H247Z'/%3E%3Cpath id='MJMAIN-42' stroke-width='10' d='M131 622Q124 629 120 631T104 634T61 637H28V683H229H267H346Q423 683 459 678T531 651Q574 627 599 590T624 512Q624 461 583 419T476 360L466 357Q539 348 595 302T651 187Q651 119 600 67T469 3Q456 1 242 0H28V46H61Q103 47 112 49T131 61V622ZM511 513Q511 560 485 594T416 636Q415 636 403 636T371 636T333 637Q266 637 251 636T232 628Q229 624 229 499V374H312L396 375L406 377Q410 378 417 380T442 393T474 417T499 456T511 513ZM537 188Q537 239 509 282T430 336L329 337H229V200V116Q229 57 234 52Q240 47 334 47H383Q425 47 443 53Q486 67 511 104T537 188Z'/%3E%3Cpath id='MJMAIN-79' stroke-width='10' d='M69 -66Q91 -66 104 -80T118 -116Q118 -134 109 -145T91 -160Q84 -163 97 -166Q104 -168 111 -168Q131 -168 148 -159T175 -138T197 -106T213 -75T225 -43L242 0L170 183Q150 233 125 297Q101 358 96 368T80 381Q79 382 78 382Q66 385 34 385H19V431H26L46 430Q65 430 88 429T122 428Q129 428 142 428T171 429T200 430T224 430L233 431H241V385H232Q183 385 185 366L286 112Q286 113 332 227L376 341V350Q376 365 366 373T348 383T334 385H331V431H337H344Q351 431 361 431T382 430T405 429T422 429Q477 429 503 431H508V385H497Q441 380 422 345Q420 343 378 235T289 9T227 -131Q180 -204 113 -204Q69 -204 44 -177T19 -116Q19 -89 35 -78T69 -66Z'/%3E%3Cpath id='MJMAIN-50' stroke-width='10' d='M130 622Q123 629 119 631T103 634T60 637H27V683H214Q237 683 276 683T331 684Q419 684 471 671T567 616Q624 563 624 489Q624 421 573 372T451 307Q429 302 328 301H234V181Q234 62 237 58Q245 47 304 46H337V0H326Q305 3 182 3Q47 3 38 0H27V46H60Q102 47 111 49T130 61V622ZM507 488Q507 514 506 528T500 564T483 597T450 620T397 635Q385 637 307 637H286Q237 637 234 628Q231 624 231 483V342H302H339Q390 342 423 349T481 382Q507 411 507 
488Z'/%3E%3Cpath id='MJMAIN-78' stroke-width='10' d='M201 0Q189 3 102 3Q26 3 17 0H11V46H25Q48 47 67 52T96 61T121 78T139 96T160 122T180 150L226 210L168 288Q159 301 149 315T133 336T122 351T113 363T107 370T100 376T94 379T88 381T80 383Q74 383 44 385H16V431H23Q59 429 126 429Q219 429 229 431H237V385Q201 381 201 369Q201 367 211 353T239 315T268 274L272 270L297 304Q329 345 329 358Q329 364 327 369T322 376T317 380T310 384L307 385H302V431H309Q324 428 408 428Q487 428 493 431H499V385H492Q443 385 411 368Q394 360 377 341T312 257L296 236L358 151Q424 61 429 57T446 50Q464 46 499 46H516V0H510H502Q494 1 482 1T457 2T432 2T414 3Q403 3 377 3T327 1L304 0H295V46H298Q309 46 320 51T331 63Q331 65 291 120L250 175Q249 174 219 133T185 88Q181 83 181 74Q181 63 188 55T206 46Q208 46 208 23V0H201Z'/%3E%3Cpath id='MJMAIN-6C' stroke-width='10' d='M42 46H56Q95 46 103 60V68Q103 77 103 91T103 124T104 167T104 217T104 272T104 329Q104 366 104 407T104 482T104 542T103 586T103 603Q100 622 89 628T44 637H26V660Q26 683 28 683L38 684Q48 685 67 686T104 688Q121 689 141 690T171 693T182 694H185V379Q185 62 186 60Q190 52 198 49Q219 46 247 46H263V0H255L232 1Q209 2 183 2T145 3T107 3T57 1L34 0H26V46H42Z'/%3E%3C/defs%3E%3C/svg%3E"/&gt;&lt;/p&gt;
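&lt;p&gt;As a small worked example of the linear case, the Python sketch below evaluates that formula for pixel (10, 20). The function name, the 64-pixel-wide image with 4-byte pixels, and the base address &lt;code&gt;0x1000&lt;/code&gt; are all assumptions chosen for illustration, not values from the driver.&lt;/p&gt;

```python
def linear_pixel_address(base, x, y, stride, bytes_per_pixel):
    """Byte address of pixel (x, y) in a linear, row-major image."""
    return base + y * stride + x * bytes_per_pixel

# Hypothetical image: 64 pixels wide, 4 bytes per pixel, no row padding,
# with pixel (0, 0) stored at address 0x1000.
bytes_per_pixel = 4
stride = 64 * bytes_per_pixel

address = linear_pixel_address(0x1000, 10, 20, stride, bytes_per_pixel)
print(hex(address))  # 0x2428
```

&lt;p&gt;Here the stride is 256 bytes, so the offset is 20 rows of 256 bytes plus 10 pixels of 4 bytes, or 5160 bytes past the first pixel.&lt;/p&gt;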
&lt;p&gt;Alas, images are rarely linear in memory. To improve cache
efficiency, modern graphics hardware interleaves the X- and
Y-coordinates. Instead of one row after the next, pixels in memory
follow a &lt;a
href="https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/"&gt;spiral-like
curve&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We need to amend our previous equation to interleave the coordinates.
We could use many instructions to mask one bit at a time, shifting to
construct the interleaved result, but that’s inefficient. We can do
better.&lt;/p&gt;
&lt;p&gt;There is a well-known &lt;a
href="https://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN"&gt;“bit
twiddling” algorithm to interleave bits&lt;/a&gt;. Rather than shuffle one bit
at a time, the algorithm shuffles groups of bits, parallelizing the
problem. Implementing this algorithm in shader code takes fewer
instructions than the bit-at-a-time approach.&lt;/p&gt;
&lt;p&gt;In practice, only the lower 7 bits (or fewer) of each coordinate are
interleaved. That lets us use 32-bit instructions to “vectorize” the
interleave, by putting the X- and Y-coordinates in the low and high
16 bits of a 32-bit register. Those 32-bit instructions let us
interleave X and Y at the same time, halving the instruction count.
Plus, we can exploit the GPU’s combined shift-and-add instruction.
Putting the tricks together, we interleave in 10 instructions of M1 GPU
assembly:&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre
class="sourceCode asm"&gt;&lt;code class="sourceCode fasm"&gt;&lt;span id="cb1-1"&gt;&lt;a href="#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;# Inputs x&lt;span class="op"&gt;,&lt;/span&gt; y in r0l&lt;span class="op"&gt;,&lt;/span&gt; r0h&lt;span class="op"&gt;.&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-2"&gt;&lt;a href="#cb1-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;# Output in r1&lt;span class="op"&gt;.&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-3"&gt;&lt;a href="#cb1-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb1-4"&gt;&lt;a href="#cb1-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;add&lt;/span&gt; r2&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;,&lt;/span&gt; r0&lt;span class="op"&gt;,&lt;/span&gt; lsl &lt;span class="dv"&gt;4&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-5"&gt;&lt;a href="#cb1-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;or&lt;/span&gt;  r1&lt;span class="op"&gt;,&lt;/span&gt; r0&lt;span class="op"&gt;,&lt;/span&gt; r2&lt;/span&gt;
&lt;span id="cb1-6"&gt;&lt;a href="#cb1-6" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;and&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="bn"&gt;0xf0f0f0f&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-7"&gt;&lt;a href="#cb1-7" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;add&lt;/span&gt; r2&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; lsl &lt;span class="dv"&gt;2&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-8"&gt;&lt;a href="#cb1-8" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;or&lt;/span&gt;  r1&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; r2&lt;/span&gt;
&lt;span id="cb1-9"&gt;&lt;a href="#cb1-9" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;and&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="bn"&gt;0x33333333&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-10"&gt;&lt;a href="#cb1-10" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;add&lt;/span&gt; r2&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; lsl &lt;span class="dv"&gt;1&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-11"&gt;&lt;a href="#cb1-11" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;or&lt;/span&gt;  r1&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; r2&lt;/span&gt;
&lt;span id="cb1-12"&gt;&lt;a href="#cb1-12" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;and&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; &lt;span class="op"&gt;#&lt;/span&gt;&lt;span class="bn"&gt;0x55555555&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-13"&gt;&lt;a href="#cb1-13" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="bu"&gt;add&lt;/span&gt; r1&lt;span class="op"&gt;,&lt;/span&gt; r1l&lt;span class="op"&gt;,&lt;/span&gt; r1h&lt;span class="op"&gt;,&lt;/span&gt; lsl &lt;span class="dv"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
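&lt;p&gt;To check a listing like the one above, it helps to have a slow reference for the value it should produce. The Python sketch below is that reference, not the driver’s code: it deliberately uses the simple one-bit-at-a-time approach, and the name &lt;code&gt;interleave_reference&lt;/code&gt; and the default of 7 bits are assumptions for illustration. Bits of X land in even positions and bits of Y in odd positions, following the final add of the low half and the shifted high half.&lt;/p&gt;

```python
def interleave_reference(x, y, bits=7):
    """Interleave the low `bits` bits of x and y, one bit at a time.

    Bit i of x lands at position 2*i and bit i of y at position 2*i + 1,
    matching the final `add r1, r1l, r1h, lsl 1` in the listing.
    """
    result = 0
    for i in range(bits):
        x_bit = (x >> i) % 2
        y_bit = (y >> i) % 2
        # 4**i is the weight of bit position 2*i.
        result += (x_bit + 2 * y_bit) * 4**i
    return result

print(bin(interleave_reference(0b111, 0b000)))  # 0b10101
print(bin(interleave_reference(0b000, 0b111)))  # 0b101010
```

&lt;p&gt;With 7 bits per coordinate there are only 16384 input pairs, so a faster implementation can be compared against this reference exhaustively.&lt;/p&gt;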
&lt;p&gt;We could stop here, but what if there’s a &lt;em&gt;dedicated&lt;/em&gt;
instruction to interleave bits? PowerVR has a “shuffle” instruction &lt;a
href="https://docs.imgtec.com/reference-manuals/powervr-instruction-set-reference/topics/bitwise-instructions/SHFL.html"&gt;&lt;code&gt;shfl&lt;/code&gt;&lt;/a&gt;,
and the M1 GPU borrows from PowerVR. Perhaps that instruction was
borrowed too. Unfortunately, even if it was, the proprietary compiler
won’t use it when compiling our test shaders. That makes it difficult to
reverse-engineer the instruction – if it exists – by observing compiled
shaders.&lt;/p&gt;
&lt;p&gt;It’s time to dust off a powerful reverse-engineering technique from
magic kindergarten: guess and check.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://mastodon.social/@dougall"&gt;Dougall Johnson&lt;/a&gt;
provided the guess. When considering the instructions we already know
about, he took special notice of the “reverse bits” instruction. Since
reversing bits is a type of bit shuffle, the interleave instruction
should be encoded similarly. The bit reverse instruction has a two-bit
field specifying the operation, with value &lt;code&gt;01&lt;/code&gt;. Related
instructions to &lt;em&gt;count the number of set bits&lt;/em&gt; and &lt;em&gt;find the
first set bit&lt;/em&gt; have values &lt;code&gt;10&lt;/code&gt; and &lt;code&gt;11&lt;/code&gt;
respectively. That encompasses all known “complex bit manipulation”
instructions.&lt;/p&gt;
&lt;style id="center-rule"&gt;tr:first-child &gt; td:nth-child(2) { text-align:center !important } td &gt; strong &gt; a:visited { color: #0000EE }&lt;/style&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr class="odd"&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;00&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href=""
onclick="const p = document.createElement(&amp;#39;span&amp;#39;);p.innerText = &amp;#39;Interleave!&amp;#39;;this.replaceWith(p);document.getElementById(&amp;#39;center-rule&amp;#39;).remove();return false;"&gt;?
? ?&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="even"&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;01&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reverse bits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count set bits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="even"&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;11&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find first set&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is one value of the two-bit enumeration that is unobserved and
unknown: &lt;code&gt;00&lt;/code&gt;. If this interleave instruction exists, it’s
probably encoded like the bit reverse but with operation code
&lt;code&gt;00&lt;/code&gt; instead of &lt;code&gt;01&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There’s a difficulty: the three known instructions have one single
input source, but our instruction interleaves two sources. Where does
the second source go? We can make a guess based on symmetry. Presumably
to simplify the hardware decoder, M1 GPU instructions usually encode
their sources in consistent locations across instructions. The other
three instructions have a gap where we would expect the second source to
be, in a two-source arithmetic instruction. Probably the second source
is there.&lt;/p&gt;
&lt;p&gt;Armed with a guess, it’s our turn to check. Rather than handwrite GPU
assembly, we can hack our compiler to replace some two-source integer
operation (like multiply) with our guessed encoding of “interleave”.
Then we write a compute shader using this operation (by “multiplying”
numbers) and run it with the newfangled compute support in our
driver.&lt;/p&gt;
&lt;p&gt;All that’s left is writing a &lt;a
href="/blog/interleave.shader_test"&gt;shader&lt;/a&gt; that checks that the
mystery instruction returns the interleaved result for each possible
input. Since the instruction takes two 16-bit sources, there are about 4
billion (&lt;span class="math inline"&gt;\(2^{32}\)&lt;/span&gt;) inputs. With our
driver, the M1 GPU manages to check them all in under a second, and the
verdict is in: this is our interleave instruction.&lt;/p&gt;
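&lt;p&gt;For reference, the property the shader checks can be sketched on the
CPU. The C below is only a sketch, not the linked shader test:
&lt;code&gt;interleave_ref&lt;/code&gt;, &lt;code&gt;spread16&lt;/code&gt; and
&lt;code&gt;interleave_fast&lt;/code&gt; are hypothetical names, and putting the
first source in the even bits is an assumption about the bit order.&lt;/p&gt;
&lt;div class="sourceCode"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;#include &amp;lt;stdint.h&amp;gt;

/* Reference: bit i of x lands in bit 2i, bit i of y in bit 2i + 1.
 * Which source takes the even bits is an assumption. */
static uint32_t interleave_ref(uint16_t x, uint16_t y)
{
    uint32_t out = 0;
    for (unsigned i = 0; i &amp;lt; 16; ++i) {
        out |= (uint32_t)((x &amp;gt;&amp;gt; i) &amp;amp; 1u) &amp;lt;&amp;lt; (2 * i);
        out |= (uint32_t)((y &amp;gt;&amp;gt; i) &amp;amp; 1u) &amp;lt;&amp;lt; (2 * i + 1);
    }
    return out;
}

/* Spread the low 16 bits of v into the even bit positions. */
static uint32_t spread16(uint32_t v)
{
    v &amp;amp;= 0xFFFFu;
    v = (v | (v &amp;lt;&amp;lt; 8)) &amp;amp; 0x00FF00FFu;
    v = (v | (v &amp;lt;&amp;lt; 4)) &amp;amp; 0x0F0F0F0Fu;
    v = (v | (v &amp;lt;&amp;lt; 2)) &amp;amp; 0x33333333u;
    v = (v | (v &amp;lt;&amp;lt; 1)) &amp;amp; 0x55555555u;
    return v;
}

/* Branch-free interleave built from two spreads. */
static uint32_t interleave_fast(uint16_t x, uint16_t y)
{
    return spread16(x) | (spread16(y) &amp;lt;&amp;lt; 1);
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Comparing the two functions over every pair of 16-bit inputs is the
CPU analogue of the exhaustive check described above.&lt;/p&gt;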
&lt;p&gt;As for our clever vectorized assembly to interleave coordinates? We
can replace it with one instruction. It’s anticlimactic, but it’s fast
and it passes the conformance tests.&lt;/p&gt;
&lt;p&gt;And that’s what matters.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Thank you to &lt;a href="https://www.khronos.org/"&gt;Khronos&lt;/a&gt; and
&lt;a href="https://www.spi-inc.org/"&gt;Software in the Public Interest&lt;/a&gt;
for supporting open drivers.&lt;/em&gt;&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/first-conformant-m1-gpu-driver.html</guid><pubDate>Tue, 22 Aug 2023 00:00:00 -0500</pubDate></item><item><title>OpenGL 3.1 on Asahi Linux</title><link>https://alyssarosenzweig.ca/blog/opengl3-on-asahi-linux.html</link><description>&lt;p&gt;Upgrade your &lt;a href="https://asahilinux.org/"&gt;Asahi Linux&lt;/a&gt;
systems, because your graphics drivers are getting a big boost:
leapfrogging from OpenGL 2.1 over OpenGL 3.0 up to OpenGL 3.1!
Similarly, the OpenGL ES 2.0 support is bumping up to OpenGL ES 3.0.
That means more playable games and more functioning applications.&lt;/p&gt;
&lt;p&gt;Back in December, I teased an early screenshot of SuperTuxKart’s
deferred renderer working on Asahi, using OpenGL ES 3.0 features like
multiple render targets and instancing. Now you too can enjoy
SuperTuxKart with advanced lighting the way it’s meant to be:&lt;/p&gt;
&lt;figure&gt;
&lt;img src="/blog/STK-1080p.webp"
alt="SuperTuxKart rendering with advanced lighting" /&gt;
&lt;figcaption aria-hidden="true"&gt;SuperTuxKart rendering with advanced
lighting&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;As before, these drivers are experimental and not yet conformant to
the OpenGL or OpenGL ES specifications. For now, you’ll need to run our
&lt;code&gt;-edge&lt;/code&gt; packages to opt in to the work-in-progress drivers,
understanding that there may be bugs. Please refer to &lt;a
href="https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/"&gt;our
previous post&lt;/a&gt; explaining how to install the drivers and how to
report bugs to help us improve.&lt;/p&gt;
&lt;p&gt;With that disclaimer out of the way, there’s a LOT of new
functionality packed into OpenGL 3.0, 3.1, and OpenGL ES 3.0 to make
this release. Highlights include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple render targets&lt;/li&gt;
&lt;li&gt;Multisampling&lt;/li&gt;
&lt;li&gt;&lt;a
href="https://cgit.freedesktop.org/mesa/mesa/commit/?id=d72e1418ce4f66c42f20779f50f40091d3d310b0"&gt;Transform
feedback&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="/blog/texture-buffer-objects-on-asahi.html"&gt;Texture buffer
objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;…and more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For now, let’s talk about…&lt;/p&gt;
&lt;h2 id="multisampling"&gt;Multisampling&lt;/h2&gt;
&lt;p&gt;Vulkan and OpenGL support &lt;em&gt;multisampling&lt;/em&gt;, short for
&lt;em&gt;multisampled anti-aliasing&lt;/em&gt;. In graphics, &lt;em&gt;aliasing&lt;/em&gt;
causes jagged diagonal edges due to rendering at insufficient
resolution. One solution to aliasing is rendering at higher resolutions
and scaling down. Edges will be blurred, not jagged, which looks better.
Multisampling is an efficient implementation of that idea.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;multisampled&lt;/em&gt; image contains multiple &lt;em&gt;samples&lt;/em&gt; for
every pixel. After rendering, a multisampled image is &lt;em&gt;resolved&lt;/em&gt;
to a regular image with one sample per pixel, typically by averaging the
samples within a pixel.&lt;/p&gt;
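&lt;p&gt;As a rough illustration of an averaging resolve, here is a sketch for
a single 8-bit channel. The name &lt;code&gt;resolve_pixel&lt;/code&gt; and the
round-to-nearest behaviour are assumptions for the example; real
resolves work on every colour channel and may filter differently.&lt;/p&gt;
&lt;div class="sourceCode"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;#include &amp;lt;stdint.h&amp;gt;

/* Resolve one pixel of a single-channel image by averaging its samples.
 * Rounding to nearest is an assumption of this sketch. */
static uint8_t resolve_pixel(const uint8_t *samples, unsigned num_samples)
{
    unsigned sum = 0;
    for (unsigned s = 0; s &amp;lt; num_samples; ++s)
        sum += samples[s];
    return (uint8_t)((sum + num_samples / 2) / num_samples);
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A pixel straddling an edge, with half its samples white and half
black, resolves to a mid grey: blurred rather than jagged.&lt;/p&gt;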
&lt;p&gt;Apple GPUs support multisampled images and framebuffers. There’s
quite a bit of typing to plumb the programmer’s view of multisampling
into the form understood by the hardware, but there’s no fundamental
incompatibility.&lt;/p&gt;
&lt;p&gt;The trouble comes with &lt;em&gt;sample shading&lt;/em&gt;. Recall that in modern
graphics, the colour of each &lt;em&gt;fragment&lt;/em&gt; is determined by running
a &lt;em&gt;fragment shader&lt;/em&gt; given by the programmer. If the fragments are
pixels, then each sample within that pixel gets the same colour. Running
the fragment shader once per pixel still benefits from multisampling
thanks to higher quality rasterization, but it’s not as good as
&lt;em&gt;actually&lt;/em&gt; rendering at a higher resolution. If instead the
fragments are samples, each sample gets a unique colour, equivalent to
rendering at a higher resolution (supersampling). In Vulkan and OpenGL,
fragment shaders generally run per-pixel, but with “sample shading”, the
application can force the fragment shader to run per-sample.&lt;/p&gt;
&lt;p&gt;How does sample shading work from the drivers’ perspective? On a
typical GPU, it is simple: the driver compiles a fragment shader that
calculates the colour of a single sample, and sets a hardware bit to
execute it per-sample instead of per-pixel. There is only one bit of
state associated with sample shading. The hardware will execute the
fragment shader multiple times per pixel, writing out sample colours
independently.&lt;/p&gt;
&lt;p&gt;Easy, right?&lt;/p&gt;
&lt;p&gt;Alas, Apple’s “AGX” GPU is not typical.&lt;/p&gt;
&lt;p&gt;AGX always executes the shader once per pixel, not once per sample,
like older GPUs that did not support sample shading. AGX &lt;em&gt;does&lt;/em&gt;
support it, though.&lt;/p&gt;
&lt;p&gt;How? The AGX instruction set allows pixel shaders to output different
colours to each sample. The instruction used to output a colour&lt;a
href="#fn1" class="footnote-ref" id="fnref1"
role="doc-noteref"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; takes a &lt;em&gt;set&lt;/em&gt; of samples to
modify, encoded as a bit mask. The default all-1’s mask writes the same
value to all samples in a pixel, but a mask setting a single bit will
write only the single corresponding sample.&lt;/p&gt;
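&lt;p&gt;The mask semantics can be modelled in a few lines of C. This is a
software model under assumed names (&lt;code&gt;output_samples&lt;/code&gt;,
&lt;code&gt;MAX_SAMPLES&lt;/code&gt;), with a pixel represented as a plain array of
packed colours; it is not the hardware instruction’s encoding.&lt;/p&gt;
&lt;div class="sourceCode"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;#include &amp;lt;stdint.h&amp;gt;

#define MAX_SAMPLES 4 /* assumed sample count for the model */

/* Write `colour` to every sample of the pixel whose bit is set in `mask`. */
static void output_samples(uint32_t pixel[MAX_SAMPLES], uint8_t mask,
                           uint32_t colour)
{
    for (unsigned s = 0; s &amp;lt; MAX_SAMPLES; ++s) {
        if (mask &amp;amp; (1u &amp;lt;&amp;lt; s))
            pixel[s] = colour;
    }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With four samples, a mask of &lt;code&gt;0xF&lt;/code&gt; behaves like an ordinary
per-pixel write, while &lt;code&gt;1 &amp;lt;&amp;lt; 2&lt;/code&gt; touches only the third
sample.&lt;/p&gt;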
&lt;p&gt;This design is unusual, and it requires driver backflips to translate
“fragment shaders” into hardware pixel shaders. How do we do it?&lt;/p&gt;
&lt;p&gt;Physically, the hardware executes our shader once per pixel.
Logically, we’re supposed to execute the application’s fragment shader
once per sample. If we know the number of samples per pixel, then we can
wrap the application’s shader in a loop over each sample. So, if the
original fragment shader is:&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb1-1"&gt;&lt;a href="#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;interpolated colour &lt;span class="op"&gt;=&lt;/span&gt; interpolate at current sample&lt;span class="op"&gt;(&lt;/span&gt;input colour&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-2"&gt;&lt;a href="#cb1-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;output current sample&lt;span class="op"&gt;(&lt;/span&gt;interpolated colour&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then we will transform the program to the pixel shader:&lt;/p&gt;
&lt;div class="sourceCode" id="cb2"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb2-1"&gt;&lt;a href="#cb2-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="cf"&gt;for&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;sample &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt; sample &lt;span class="op"&gt;&amp;lt;&lt;/span&gt; number of samples&lt;span class="op"&gt;;&lt;/span&gt; &lt;span class="op"&gt;++&lt;/span&gt;sample&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb2-2"&gt;&lt;a href="#cb2-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    sample mask &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt; &lt;span class="op"&gt;&amp;lt;&amp;lt;&lt;/span&gt; sample&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb2-3"&gt;&lt;a href="#cb2-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    interpolated colour &lt;span class="op"&gt;=&lt;/span&gt; interpolate at sample&lt;span class="op"&gt;(&lt;/span&gt;input colour&lt;span class="op"&gt;,&lt;/span&gt; sample&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb2-4"&gt;&lt;a href="#cb2-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    output samples&lt;span class="op"&gt;(&lt;/span&gt;sample mask&lt;span class="op"&gt;,&lt;/span&gt; interpolated colour&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb2-5"&gt;&lt;a href="#cb2-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The original fragment shader runs inside the loop, once per sample.
Whenever it interpolates inputs at the current sample position, we
change it to instead interpolate at a specific sample given by the loop
counter &lt;code&gt;sample&lt;/code&gt;. Likewise, when it outputs a colour for a
sample, we change it to output the colour to the single sample given by
the loop counter.&lt;/p&gt;
&lt;p&gt;If the story ended here, this mechanism would be silly. Adding sample
masks to the instruction set is more complicated than a single bit to
invoke the shader multiple times, as other GPUs do. Even Apple’s own
Metal driver has to implement this dance, because Metal has a similar
approach to sample shading as OpenGL and Vulkan. With all this extra
complexity, is there a benefit?&lt;/p&gt;
&lt;p&gt;If we generated that loop at the end, maybe not. But if we know at
compile-time that sample shading is used, we can run our full optimizer
on this sample loop. If there is an expression that is the same for all
samples in a pixel, it can be hoisted out of the loop.&lt;a href="#fn2"
class="footnote-ref" id="fnref2" role="doc-noteref"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;
Instead of calculating the same value multiple times, as other GPUs do,
the value can be calculated just once and reused for each sample.
Although it complicates the driver, this approach to sample shading
isn’t Apple cutting corners. If we slapped on the loop at the end and
did no optimizations, the resulting code would be comparable to what
other GPUs execute in hardware. There might be slight differences from
spawning fewer threads but executing more control flow instructions&lt;a
href="#fn3" class="footnote-ref" id="fnref3"
role="doc-noteref"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;, but that’s minor. Generating the
loop early and running the optimizer enables better performance than
possible on other GPUs.&lt;/p&gt;
&lt;p&gt;So is the mechanism only an optimization? Did Apple stumble on a
better approach to sample shading that other GPUs should adopt? I
wouldn’t be so sure.&lt;/p&gt;
&lt;p&gt;Let’s pull the curtain back. AGX has its roots as a &lt;em&gt;mobile&lt;/em&gt;
GPU intended for iPhones, with significant PowerVR heritage. Even if it
powers Mac Pros today, the mobile legacy means AGX prefers software
implementations of many features that desktop GPUs implement with
dedicated hardware.&lt;/p&gt;
&lt;p&gt;Yes, I’m talking about blending.&lt;/p&gt;
&lt;p&gt;Blending is an operation in graphics APIs to combine the fragment
shader output colour with the existing colour in the framebuffer. It is
usually used to implement &lt;a
href="https://en.wikipedia.org/wiki/Alpha_compositing"&gt;alpha
blending&lt;/a&gt;, to let the background poke through translucent
objects.&lt;/p&gt;
&lt;p&gt;When multisampling is used &lt;em&gt;without&lt;/em&gt; sample shading, although
the fragment shader only runs once per pixel, blending happens
per-sample. Even if the fragment shader outputs the same colour to each
sample, if the framebuffer already had different colours in different
samples, blending needs to happen per-sample to avoid losing that
information already in the framebuffer.&lt;/p&gt;
&lt;p&gt;A traditional desktop GPU blends with dedicated hardware. In the
mobile space, there’s a mix of dedicated hardware and software. On AGX,
blending is purely software. Rather than configure blending hardware,
the driver must produce &lt;em&gt;variants&lt;/em&gt; of the fragment shader that
include instructions to implement the desired blend mode. With alpha
blending, a fragment shader like:&lt;/p&gt;
&lt;div class="sourceCode" id="cb3"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb3-1"&gt;&lt;a href="#cb3-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;colour &lt;span class="op"&gt;=&lt;/span&gt; calculate lighting&lt;span class="op"&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb3-2"&gt;&lt;a href="#cb3-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;output&lt;span class="op"&gt;(&lt;/span&gt;colour&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;becomes:&lt;/p&gt;
&lt;div class="sourceCode" id="cb4"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb4-1"&gt;&lt;a href="#cb4-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;colour &lt;span class="op"&gt;=&lt;/span&gt; calculate lighting&lt;span class="op"&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb4-2"&gt;&lt;a href="#cb4-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;dest &lt;span class="op"&gt;=&lt;/span&gt; load destination colour&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb4-3"&gt;&lt;a href="#cb4-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;alpha &lt;span class="op"&gt;=&lt;/span&gt; colour&lt;span class="op"&gt;.&lt;/span&gt;alpha&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb4-4"&gt;&lt;a href="#cb4-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;blended &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;alpha &lt;span class="op"&gt;*&lt;/span&gt; colour&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;+&lt;/span&gt; &lt;span class="op"&gt;((&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt; &lt;span class="op"&gt;-&lt;/span&gt; alpha&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;*&lt;/span&gt; dest&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb4-5"&gt;&lt;a href="#cb4-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;output&lt;span class="op"&gt;(&lt;/span&gt;blended&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Where’s the problem?&lt;/p&gt;
&lt;p&gt;Blending happens per sample. Even if the application intends to run
the fragment shader per pixel, the shader &lt;em&gt;must&lt;/em&gt; run per sample
for correct blending. Compared to other GPUs, this approach to blending
would regress performance when blending and multisampling are enabled
but sample shading is not.&lt;/p&gt;
&lt;p&gt;On the other hand, exposing multisample pixel shaders to the driver
solves the problem neatly. If both the blending and the multisample
state are known, we can first insert instructions for blending, and then
wrap with the sample loop. The above program would then become:&lt;/p&gt;
&lt;div class="sourceCode" id="cb5"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb5-1"&gt;&lt;a href="#cb5-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="cf"&gt;for&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;sample &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt; sample &lt;span class="op"&gt;&amp;lt;&lt;/span&gt; number of samples&lt;span class="op"&gt;;&lt;/span&gt; &lt;span class="op"&gt;++&lt;/span&gt;sample&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-2"&gt;&lt;a href="#cb5-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    colour &lt;span class="op"&gt;=&lt;/span&gt; calculate lighting&lt;span class="op"&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-3"&gt;&lt;a href="#cb5-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb5-4"&gt;&lt;a href="#cb5-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    dest &lt;span class="op"&gt;=&lt;/span&gt; load destination colour at sample &lt;span class="op"&gt;(&lt;/span&gt;sample&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-5"&gt;&lt;a href="#cb5-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    alpha &lt;span class="op"&gt;=&lt;/span&gt; colour&lt;span class="op"&gt;.&lt;/span&gt;alpha&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-6"&gt;&lt;a href="#cb5-6" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    blended &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;alpha &lt;span class="op"&gt;*&lt;/span&gt; colour&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;+&lt;/span&gt; &lt;span class="op"&gt;((&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt; &lt;span class="op"&gt;-&lt;/span&gt; alpha&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;*&lt;/span&gt; dest&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-7"&gt;&lt;a href="#cb5-7" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb5-8"&gt;&lt;a href="#cb5-8" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    sample mask &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt; &lt;span class="op"&gt;&amp;lt;&amp;lt;&lt;/span&gt; sample&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-9"&gt;&lt;a href="#cb5-9" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    output samples&lt;span class="op"&gt;(&lt;/span&gt;sample mask&lt;span class="op"&gt;,&lt;/span&gt; blended&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb5-10"&gt;&lt;a href="#cb5-10" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this form, the fragment shader is asymptotically worse than the
application wanted: the fragment shader is executed inside the loop,
running per-sample unnecessarily.&lt;/p&gt;
&lt;p&gt;Have no fear, the optimizer is here. Since &lt;code&gt;colour&lt;/code&gt; is the
same for each sample in the pixel, it does not depend on the sample ID.
The compiler can move the entire original fragment shader (and related
expressions) out of the per-sample loop:&lt;/p&gt;
&lt;div class="sourceCode" id="cb6"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb6-1"&gt;&lt;a href="#cb6-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;colour &lt;span class="op"&gt;=&lt;/span&gt; calculate lighting&lt;span class="op"&gt;();&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-2"&gt;&lt;a href="#cb6-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;alpha &lt;span class="op"&gt;=&lt;/span&gt; colour&lt;span class="op"&gt;.&lt;/span&gt;alpha&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-3"&gt;&lt;a href="#cb6-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;inv_alpha &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="dv"&gt;1&lt;/span&gt; &lt;span class="op"&gt;-&lt;/span&gt; alpha&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-4"&gt;&lt;a href="#cb6-4" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;colour_alpha &lt;span class="op"&gt;=&lt;/span&gt; alpha &lt;span class="op"&gt;*&lt;/span&gt; colour&lt;span class="op"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-5"&gt;&lt;a href="#cb6-5" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb6-6"&gt;&lt;a href="#cb6-6" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="cf"&gt;for&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;sample &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt; sample &lt;span class="op"&gt;&amp;lt;&lt;/span&gt; number of samples&lt;span class="op"&gt;;&lt;/span&gt; &lt;span class="op"&gt;++&lt;/span&gt;sample&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-7"&gt;&lt;a href="#cb6-7" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    dest &lt;span class="op"&gt;=&lt;/span&gt; load destination colour at sample &lt;span class="op"&gt;(&lt;/span&gt;sample&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-8"&gt;&lt;a href="#cb6-8" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    blended &lt;span class="op"&gt;=&lt;/span&gt; colour_alpha &lt;span class="op"&gt;+&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;inv_alpha &lt;span class="op"&gt;*&lt;/span&gt; dest&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-9"&gt;&lt;a href="#cb6-9" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;/span&gt;
&lt;span id="cb6-10"&gt;&lt;a href="#cb6-10" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    sample mask &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;&lt;span class="dv"&gt;1&lt;/span&gt; &lt;span class="op"&gt;&amp;lt;&amp;lt;&lt;/span&gt; sample&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-11"&gt;&lt;a href="#cb6-11" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    output samples&lt;span class="op"&gt;(&lt;/span&gt;sample mask&lt;span class="op"&gt;,&lt;/span&gt; blended&lt;span class="op"&gt;);&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb6-12"&gt;&lt;a href="#cb6-12" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now blending happens per sample but the application’s fragment shader
runs just once, matching the performance characteristics of traditional
GPUs. Even better, all of this happens without any special work from the
compiler. There’s no magic multisampling optimization happening here:
it’s just a loop.&lt;/p&gt;
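&lt;p&gt;To make the equivalence concrete, here is a CPU-side sketch of the
loop before and after hoisting. Everything in it is illustrative: the
names &lt;code&gt;shade_naive&lt;/code&gt;, &lt;code&gt;shade_hoisted&lt;/code&gt; and
&lt;code&gt;lighting_calls&lt;/code&gt; are hypothetical, the lighting values are
made up, colours are single floats, and writing &lt;code&gt;tile[s]&lt;/code&gt;
stands in for an output with a one-bit sample mask.&lt;/p&gt;
&lt;div class="sourceCode"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;#define NUM_SAMPLES 4 /* assumed sample count */

static unsigned lighting_calls; /* counts runs of the "fragment shader" */

/* Stand-in for the application's fragment shader (made-up values). */
static void calculate_lighting(float *colour, float *alpha)
{
    ++lighting_calls;
    *colour = 0.75f;
    *alpha = 0.5f;
}

/* Blending wrapped in the sample loop, before optimization. */
static void shade_naive(float tile[NUM_SAMPLES])
{
    for (unsigned s = 0; s &amp;lt; NUM_SAMPLES; ++s) {
        float colour, alpha;
        calculate_lighting(&amp;amp;colour, &amp;amp;alpha);
        tile[s] = (alpha * colour) + ((1.0f - alpha) * tile[s]);
    }
}

/* Same result after hoisting the sample-invariant work out of the loop. */
static void shade_hoisted(float tile[NUM_SAMPLES])
{
    float colour, alpha;
    calculate_lighting(&amp;amp;colour, &amp;amp;alpha);
    float inv_alpha = 1.0f - alpha;
    float colour_alpha = alpha * colour;

    for (unsigned s = 0; s &amp;lt; NUM_SAMPLES; ++s)
        tile[s] = colour_alpha + (inv_alpha * tile[s]);
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both functions leave the same values in the tile, but the hoisted
version calls the lighting function once instead of once per sample.&lt;/p&gt;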
&lt;p&gt;By the way, what do we do if we &lt;em&gt;don’t&lt;/em&gt; know the blending and
multisample state at compile-time? Hope is not lost…&lt;/p&gt;
&lt;p&gt;…but that’s a story for another day.&lt;/p&gt;
&lt;h2 id="whats-next"&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;While OpenGL ES 3.0 is an improvement over ES 2.0, we’re not done. In
my work-in-progress branch, OpenGL ES 3.1 support is nearly finished,
which will unlock compute shaders.&lt;/p&gt;
&lt;p&gt;The final goal is a Vulkan driver running modern games. We’re a while
away, but the baseline Vulkan 1.0 requirements parallel OpenGL ES 3.1,
so our work translates to Vulkan. For example, the multisampling
compiler passes described above are common code between the drivers.
We’ve tested them against OpenGL, and now they’re ready to go for
Vulkan.&lt;/p&gt;
&lt;p&gt;And yes, &lt;a href="https://github.com/ella-0"&gt;the team&lt;/a&gt; is already
working on Vulkan.&lt;/p&gt;
&lt;p&gt;Until then, you’re one &lt;code&gt;pacman -Syu&lt;/code&gt; away from enjoying
OpenGL 3.1!&lt;/p&gt;
&lt;section id="footnotes" class="footnotes footnotes-end-of-document"
role="doc-endnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn1"&gt;&lt;p&gt;Store a formatted value to local memory acting as a
tilebuffer.&lt;a href="#fnref1" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id="fn2"&gt;&lt;p&gt;Via &lt;a
href="https://en.wikipedia.org/wiki/Common_subexpression_elimination"&gt;common
subexpression elimination&lt;/a&gt; if the &lt;a
href="https://en.wikipedia.org/wiki/Loop_unrolling"&gt;loop is
unrolled&lt;/a&gt;, otherwise via &lt;a
href="https://en.wikipedia.org/wiki/Code_motion"&gt;code motion&lt;/a&gt;.&lt;a
href="#fnref2" class="footnote-back" role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id="fn3"&gt;&lt;p&gt;Since the number of samples is constant, all threads
branch in the same direction so the usual “GPUs are bad at branching”
advice does not apply.&lt;a href="#fnref3" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/opengl3-on-asahi-linux.html</guid><pubDate>Tue, 06 Jun 2023 00:00:00 -0500</pubDate></item><item><title>Passing the reins on Panfrost</title><link>https://alyssarosenzweig.ca/blog/passing-reins-panfrost.html</link><description>&lt;p&gt;Today is my last day at &lt;a
href="https://www.collabora.com/"&gt;Collabora&lt;/a&gt; and my last day leading
the &lt;a href="https://docs.mesa3d.org/drivers/panfrost.html"&gt;Panfrost&lt;/a&gt;
driver.&lt;/p&gt;
&lt;p&gt;It’s been a wild ride.&lt;/p&gt;
&lt;p&gt;In 2017, I began work on the &lt;code&gt;chai&lt;/code&gt; driver for Mali T
(Midgard). &lt;code&gt;chai&lt;/code&gt; would later be merged into &lt;a
href="https://queer.party/@Lyude"&gt;Lyude Paul&lt;/a&gt;’s and Connor Abbott’s
BiOpenly project for Mali G (Bifrost) to form Panfrost.&lt;/p&gt;
&lt;p&gt;In 2019, I joined Collabora to accelerate work on the driver stack.
The initial goal was to run GNOME on a Mali-T860 Chromebook.&lt;/p&gt;
&lt;p&gt;&lt;a
href="https://www.collabora.com/news-and-blog/blog/2019/06/26/gnome-meets-panfrost/"&gt;Huge
success&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="/glmark-gears-gnome-panfrost-crop.webp"
alt="GNOME running on Panfrost in 2019" /&gt;&lt;br /&gt;
&lt;/p&gt;
&lt;p&gt;Today, Panfrost supports a broad spectrum of Mali GPUs, conformant to
the OpenGL ES 3.1 specification on Mali-G52 and Mali-G57. It’s hard to
overstate how far we’ve come. I’ve had the thrills of architecting
several backend shader compilers as well as the Gallium-based OpenGL
driver, while my dear colleague Boris Brezillon has put together a
proof-of-concept Vulkan driver which I think you’ll hear more about
soon.&lt;/p&gt;
&lt;p&gt;Lately, my focus has been ensuring the project can stand on its own
four legs. I have every confidence in other Collaborans hacking on
Panfrost, including Boris and Italo Nicola. The project has a bright
future. It’s time for me to pass the reins.&lt;/p&gt;
&lt;p&gt;I’m still alive. I plan to continue working on Mesa drivers for a
long time, including the common infrastructure upon which Panfrost
relies. And I’ll still send the odd Panfrost patch now and then. That
said, my focus will shift.&lt;/p&gt;
&lt;p&gt;I’m not ready to announce what’s in store yet… but maybe you can read
between the lines!&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/passing-reins-panfrost.html</guid><pubDate>Mon, 10 Apr 2023 00:00:00 -0500</pubDate></item><item><title>Apple GPU drivers now in Asahi Linux</title><link>https://alyssarosenzweig.ca/blog/asahi-gpu-part-7.html</link><description>&lt;p&gt;&lt;a href="/Quake3.png"&gt;&lt;img src="/Quake3.webp" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We’re excited to announce our first Apple GPU driver release!&lt;/p&gt;
&lt;p&gt;We’ve been working hard over the past two years to bring this new
driver to everyone, and we’re really proud to finally be here. This is
still an alpha driver, but it’s already good enough to run a smooth
desktop experience and some games.&lt;/p&gt;
&lt;p&gt;Read on to find out more about the state of things today, how to
install it (it’s an opt-in package), and how to report bugs!&lt;/p&gt;
&lt;h2 id="status"&gt;Status&lt;/h2&gt;
&lt;p&gt;This release features work-in-progress OpenGL 2.1 and OpenGL ES 2.0
support for all current Apple M-series systems. That’s enough for
hardware acceleration with desktop environments, like GNOME and KDE.
It’s also enough for older 3D games, like Quake3 and Neverball. While
there’s always room for improvement, the driver is fast enough to run
all of the above at 60 frames per second at 4K.&lt;/p&gt;
&lt;p&gt;Please note: these drivers have not yet passed the OpenGL (ES)
conformance tests. There will be bugs!&lt;/p&gt;
&lt;p&gt;What’s next? Supporting more applications. While OpenGL (ES) 2
suffices for some applications, newer ones (especially games) demand
more OpenGL features. OpenGL (ES) 3 brings with it a slew of new
features, like multiple render targets, multisampling, and transform
feedback. Work on these features is well under way, but they will each
take a great deal of additional development effort, and all are needed
before OpenGL (ES) 3.0 is available.&lt;/p&gt;
&lt;p&gt;What about Vulkan? We’re working on it! Although we’re only shipping
OpenGL right now, we’re designing with Vulkan in mind. Most of the work
we’re putting toward OpenGL will be reused for Vulkan. We estimated that
we could ship working OpenGL 2 drivers much sooner than a working Vulkan
1.0 driver, and we wanted to get hardware accelerated desktops into your
hands as soon as possible. For the most part, those desktops use OpenGL,
so supporting OpenGL first made more sense to us than diving into the
Vulkan deep end, only to use Zink to translate OpenGL 2 to Vulkan to run
desktops. Plus, there is a large spectrum of OpenGL support, with OpenGL
2.1 containing a fraction of the features of OpenGL 4.6. The same is
true for Vulkan: the baseline Vulkan 1.0 profile is roughly equivalent
to OpenGL ES 3.1, but applications these days want Vulkan 1.3 with tons
of extensions and “optional” features. Zink’s “layering” of OpenGL on
top of Vulkan isn’t magic: it can only expose the OpenGL features that
the underlying Vulkan driver has. A baseline Vulkan 1.0 driver isn’t
even enough to get OpenGL 2.1 on Zink! Zink itself advertises support
for OpenGL 4.6, but of course that’s only when paired with Vulkan
drivers that support the equivalent of OpenGL 4.6… and that gets us back
to a tremendous amount of time and effort.&lt;/p&gt;
&lt;p&gt;When will OpenGL 3 support be ready? OpenGL 4? Vulkan 1.0? Vulkan
1.3? In community open source projects, it’s said that every time
somebody asks when a feature will be done, it delays that feature by a
month. Well, a lot of people have been asking…&lt;/p&gt;
&lt;p&gt;At any rate, for a sneak peek… here is SuperTuxKart’s deferred
renderer running at full speed, making liberal use of OpenGL ES 3
features like multiple render targets~&lt;/p&gt;
&lt;p&gt;&lt;a href="/SuperTuxKart-Deferred.png"&gt;&lt;img
src="/SuperTuxKart-Deferred.webp" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="anatomy-of-a-gpu-driver"&gt;Anatomy of a GPU driver&lt;/h2&gt;
&lt;p&gt;Modern GPUs consist of many distinct “layered” parts. There is…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a memory management unit and an interface to submit memory-mapped
work to the hardware&lt;/li&gt;
&lt;li&gt;fixed-function 3D hardware to rasterize triangles, perform
depth/stencil testing, and more&lt;/li&gt;
&lt;li&gt;programmable “shader cores” (like little CPUs with bespoke
instruction sets) with work dispatched by the fixed-function
hardware&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This “layered” hardware demands a “layered” graphics driver stack. We
need…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a kernel driver to map memory and submit memory-mapped work&lt;/li&gt;
&lt;li&gt;a userspace driver to translate OpenGL and Vulkan calls into
hardware-specific data structures in graphics memory&lt;/li&gt;
&lt;li&gt;a compiler translating shading programming languages like GLSL to
the hardware’s instruction set&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a lot of work, calling for a team effort! Fortunately, that
layering gives us natural boundaries to divide work among our small
team.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/"&gt;&lt;strong&gt;Alyssa Rosenzweig&lt;/strong&gt;&lt;/a&gt; is writing the
OpenGL driver and compiler.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vt.social/@lina"&gt;&lt;strong&gt;Asahi Lina&lt;/strong&gt;&lt;/a&gt; is
writing the kernel driver and helping with OpenGL.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mastodon.social/@dougall"&gt;&lt;strong&gt;Dougall
Johnson&lt;/strong&gt;&lt;/a&gt; is reverse-engineering the instruction set with
Alyssa.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Meanwhile, &lt;a href="https://tech.lgbt/@ella"&gt;&lt;strong&gt;Ella
Stanforth&lt;/strong&gt;&lt;/a&gt; is working on a Vulkan driver, reusing the kernel
driver, the compiler, and some code shared with the OpenGL driver.&lt;/p&gt;
&lt;p&gt;Of course, we couldn’t build an OpenGL driver in under two years just
ourselves. Thanks to the power of free and open source software, we
stand on the shoulders of FOSS giants. The compiler implements a “NIR”
backend, where NIR is a powerful intermediate representation, including
GLSL to NIR translation. The kernel driver uses the “Direct Rendering
Manager” (DRM) subsystem of the Linux kernel to minimize boilerplate.
Finally, the OpenGL driver implements the “Gallium3D” API inside of &lt;a
href="https://mesa3d.org/"&gt;Mesa&lt;/a&gt;, the home for open source OpenGL and
Vulkan drivers. Through Mesa and Gallium3D, we benefit from thirty years
of OpenGL driver development, with common code translating OpenGL into
the much simpler Gallium3D. Thanks to the incredible engineering of NIR,
Mesa, and Gallium3D, our ragtag team of reverse-engineers can focus on
what’s left: the Apple hardware.&lt;/p&gt;
&lt;h2 id="installation-instructions"&gt;Installation instructions&lt;/h2&gt;
&lt;p&gt;To get the new drivers, you need to run the
&lt;code&gt;linux-asahi-edge&lt;/code&gt; kernel and also install the
&lt;code&gt;mesa-asahi-edge&lt;/code&gt; Mesa package.&lt;/p&gt;
&lt;pre class="shell"&gt;&lt;code&gt;$ sudo pacman -Syu
$ sudo pacman -S linux-asahi-edge mesa-asahi-edge
$ sudo update-grub&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since only one version of Mesa can be installed at a time, pacman
will prompt you to replace &lt;code&gt;mesa&lt;/code&gt; with
&lt;code&gt;mesa-asahi-edge&lt;/code&gt;. This is normal!&lt;/p&gt;
&lt;p&gt;We also recommend running Wayland instead of Xorg at this point, so
if you’re using the KDE Plasma environment, make sure to install the
Wayland session:&lt;/p&gt;
&lt;pre class="shell"&gt;&lt;code&gt;$ sudo pacman -S plasma-wayland-session&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then reboot, pick the Wayland session at the top of the login screen
(SDDM), and enjoy! You might want to adjust the screen scale factor in
&lt;em&gt;System Settings → Display and Monitor&lt;/em&gt; (Plasma Wayland defaults
to 100% or 200%, while 150% is often nicer). If you have “Force font
DPI” enabled under &lt;em&gt;Appearance → Fonts&lt;/em&gt;, you should disable that
(it is saved separately for Wayland and Xorg, and shouldn’t be necessary
on Wayland sessions). Log out and back in for these changes to fully
apply.&lt;/p&gt;
&lt;p&gt;Xorg and Xorg-based desktop environments should work, but there are a
few known issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expect screen tearing (this might be fixed &lt;a
href="https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests/1006"&gt;soon&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;VSync does not work (some KDE animations will be too fast, and GL
apps will not limit their FPS even with VSync enabled). This is a
limitation of Xorg on the Apple DCP display controllers, which do not
support VBlank interrupts.&lt;/li&gt;
&lt;li&gt;There are still driver bugs triggered by Xorg/KWin. We’re looking
into this.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;linux-asahi-edge&lt;/code&gt; kernel can be installed
side-by-side with the standard &lt;code&gt;linux-asahi&lt;/code&gt; package, but
both versions should be kept in sync, so make sure to always update your
packages together! You can always pick the &lt;code&gt;linux-asahi&lt;/code&gt;
kernel in the GRUB boot menu, which will disable GPU acceleration and
the DCP display driver.&lt;/p&gt;
&lt;p&gt;When the packages are updated in the future, it’s possible that
graphical apps will stop starting up after an update until you reboot,
or they may fall back to software rendering. This is normal. Until the
UAPI is stable, we’ll have to break compatibility between Mesa and the
kernel every now and then, so you will need to reboot to make things
work after updates. In general, if apps &lt;em&gt;do&lt;/em&gt; keep working with
acceleration after any particular Mesa update, then it’s probably safe
not to reboot, but you should still do it to make sure you’re running
the latest kernel!&lt;/p&gt;
&lt;h2 id="reporting-bugs"&gt;Reporting bugs&lt;/h2&gt;
&lt;p&gt;Since the driver is still in development, there are lots of known
issues and we’re still working hard on improving conformance test
results. Please don’t open new bugs for random apps not working! It’s
still the early days and we know there’s a lot of work to do. Here’s a
quick guide of how to report bugs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you find an app that does not start up at all, please don’t
report it as a bug. Lots of apps won’t work because they require a newer
GL version than what we support. Please set the
&lt;code&gt;LIBGL_ALWAYS_SOFTWARE=1&lt;/code&gt; environment variable for those apps
to fall back to software rendering. If it is a popular app that is part
of the Arch Linux ARM repository, you can make a comment on &lt;a
href="https://github.com/AsahiLinux/linux/issues/73"&gt;this issue&lt;/a&gt;
instead, so we can add Mesa quirks to work around the problem.&lt;/li&gt;
&lt;li&gt;If you run into issues caused by &lt;code&gt;linux-asahi-edge&lt;/code&gt;
unrelated to the GPU, please add a comment to &lt;a
href="https://github.com/AsahiLinux/linux/issues/70"&gt;this issue&lt;/a&gt;.
This includes display output issues! (Resolutions, backlight control,
display power control, etc.)&lt;/li&gt;
&lt;li&gt;If the GPU locks up and all GPU apps stop working, run
&lt;code&gt;asahi-diagnose&lt;/code&gt; (for example, from an SSH session), open a
new bug on &lt;a href="https://github.com/AsahiLinux/linux"&gt;the
AsahiLinux/linux repository&lt;/a&gt;, attach the file generated by that
command, and tell us what you were doing that caused the lockup.&lt;/li&gt;
&lt;li&gt;For other GPU issues (rendering glitches, apps that crash after
starting up correctly, and things like that), run
&lt;code&gt;asahi-diagnose&lt;/code&gt; and make a comment on &lt;a
href="https://github.com/AsahiLinux/linux/issues/72"&gt;this issue&lt;/a&gt;,
attaching the file generated by that command. Don’t forget to tell us
about your environment!&lt;/li&gt;
&lt;li&gt;In the future, if a driver update causes a regression (rendering
problems or crashes for apps that previously worked properly), you can
open a bug &lt;a
href="https://gitlab.freedesktop.org/asahi/mesa/-/issues"&gt;directly in
the Mesa tracker&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We hope you enjoy our driver! Remember, things are still moving
quickly, so make sure to update your packages regularly to get updates
and bug fixes!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Co-written with Asahi Lina. Can you tell who wrote what?&lt;/em&gt;&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/asahi-gpu-part-7.html</guid><pubDate>Wed, 07 Dec 2022 00:00:00 -0500</pubDate></item><item><title>Clip control on the Apple GPU</title><link>https://alyssarosenzweig.ca/blog/asahi-gpu-part-6.html</link><description>&lt;figure&gt;
&lt;img src="/Neverball.webp"
alt="Neverball rendered on the Apple M1 GPU with an open source OpenGL driver" /&gt;
&lt;figcaption aria-hidden="true"&gt;Neverball rendered on the Apple M1 GPU
with an open source OpenGL driver&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;After a year in development, &lt;a
href="https://docs.mesa3d.org/drivers/asahi.html"&gt;the open source
“Asahi” driver for the Apple GPU&lt;/a&gt; is running real games. There’s more
to do, but &lt;a href="https://neverball.org/"&gt;Neverball&lt;/a&gt; is already
playable (and a lot of fun!).&lt;/p&gt;
&lt;p&gt;Neverball uses legacy “fixed function” OpenGL. Rather than supply
programmable shaders like OpenGL 2, old OpenGL 1 applications configure
a fixed set of graphics effects like fog and alpha testing. Modern GPUs
don’t implement these features in hardware. Instead, the driver
synthesizes shaders implementing the desired graphics. This translation
is complicated, but we get it for “free” as an open source driver in
Mesa. If we implement the modern shader pipeline, Mesa will handle fixed
function OpenGL for us transparently. That’s a win for open source
drivers, and a win for GPU acceleration on &lt;a
href="https://asahilinux.org/"&gt;Asahi Linux&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To implement the modern OpenGL features, we rely on
reverse-engineering the behaviour of Apple’s Metal driver, as we don’t
have hardware documentation. Although Metal uses the same shader
pipeline as OpenGL, it doesn’t support all the OpenGL features that the
hardware does, which puts us in a bind. In the past, I’ve relied on &lt;a
href="/blog/asahi-gpu-part-4.html"&gt;educated guesswork&lt;/a&gt; to bridge the
gap, but there’s another solution… and it’s a doozy.&lt;/p&gt;
&lt;p&gt;For motivation, consider the &lt;em&gt;clip space&lt;/em&gt; used in OpenGL. In
every other API on the planet, the Z component (depth) of points in the
3D world ranges from 0 to 1, where 0 is “near” and 1 is “far”. In OpenGL,
however, Z ranges from &lt;em&gt;negative 1&lt;/em&gt; to 1. As Metal uses the 0/1
clip space, implementing OpenGL on Metal requires emulating the -1/1
clip space by inserting extra instructions into the vertex shader to
transform the Z coordinate. Although this emulation adds overhead, it
works for &lt;a href="https://github.com/google/angle"&gt;ANGLE&lt;/a&gt;’s open
source implementation of OpenGL ES on Metal.&lt;/p&gt;
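&lt;p&gt;As a minimal sketch of that emulation (the helper name is hypothetical, and this is not ANGLE’s actual code), the extra vertex shader arithmetic amounts to averaging Z with W before the perspective divide:&lt;/p&gt;
&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;/* Hypothetical helper, for illustration only.
 * Remaps an OpenGL clip-space Z (between -w and w) to a 0/1-style
 * clip-space Z (between 0 and w). */
static float gl_z_to_zero_one(float z, float w)
{
    return 0.5f * (z + w);
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the divide by W, a depth of -1 lands on 0 and a depth of 1 stays at 1, so the cost is roughly one add and one multiply per vertex.&lt;/p&gt;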
&lt;p&gt;Like ANGLE, Apple’s OpenGL driver internally translates to Metal.
Because Metal uses the 0 to 1 clip space, it should require this
emulation code. Curiously, when we disassemble shaders compiled with
their OpenGL implementation, we don’t see any such emulation. That means
Apple’s GPU must support -1/1 clip spaces in addition to Metal’s
preferred 0/1. The problem is figuring out how to use this other clip
space.&lt;/p&gt;
&lt;p&gt;We expect that there’s a bit toggling between these clip spaces. The
logical place for such a bit is the viewport packet, but there’s no
obvious difference between the viewport packets emitted by Metal and
OpenGL-on-Metal. Ordinarily, we would identify the bit by toggling the
clip space in Metal and comparing memory dumps. However, according to
Apple’s documentation, there’s no way to change the clip space in
Metal.&lt;/p&gt;
&lt;p&gt;That’s an apparent contradiction. There’s no way to use the -1/1
clip space with Metal, but Apple’s OpenGL-on-Metal translator uses
the -1/1 clip space. What gives?&lt;/p&gt;
&lt;p&gt;Here’s a little secret: there are two graphics APIs called “Metal”.
There’s the Metal you know, a limited API that Apple documents for App
Store developers, an API that lacks useful features supported by OpenGL
and Vulkan.&lt;/p&gt;
&lt;p&gt;And there’s the Metal that Apple uses themselves, an internal API
adding back features that Apple doesn’t want you using. While ANGLE
implements OpenGL ES on the documented Metal, Apple can implement OpenGL
on the secret Metal.&lt;/p&gt;
&lt;p&gt;Apple does not publish documentation or headers for this richer Metal
API, but if we’re lucky, we can catch a glimpse behind the curtain. The
undocumented classes and methods making up the internal Metal API are
still available in the production Metal binaries. To use them, we only
need the missing headers. Fortunately, Objective-C symbols contain
enough information to reconstruct header files, allowing us to
experiment with undocumented methods with “extra” functionality
inherited from OpenGL.&lt;/p&gt;
&lt;p&gt;Compared to the desktop GPUs found in Intel Macs, Apple’s own GPU
implements a slim, modern feature set mapping well to Metal. Most of the
“extra” functionality is emulated. It is interesting to know the
emulation happens in their Metal driver instead of their OpenGL
frontend, but that’s unsurprising, as it allows their Metal drivers for
Intel and AMD GPUs to implement the functionality natively. While this
information is fascinating for “macOS hermeneutics”, it won’t help us
with our Apple GPU mystery.&lt;/p&gt;
&lt;p&gt;What &lt;em&gt;will&lt;/em&gt; help us are the catch-all mystery methods named
&lt;code&gt;setOpenGLModeEnabled&lt;/code&gt;, apparently enabling “OpenGL
mode”.&lt;/p&gt;
&lt;p&gt;Mystery methods named like that just &lt;em&gt;beg&lt;/em&gt; to be called.&lt;/p&gt;
&lt;p&gt;The render pipeline descriptor has such a method. That descriptor
contains state that can change every draw. In some graphics APIs, like
OpenGL with &lt;a
href="https://registry.khronos.org/OpenGL/extensions/ARB/ARB_clip_control.txt"&gt;&lt;code&gt;ARB_clip_control&lt;/code&gt;&lt;/a&gt;
and Vulkan with &lt;a
href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_depth_clip_control.html"&gt;&lt;code&gt;VK_EXT_depth_clip_control&lt;/code&gt;&lt;/a&gt;,
the application can change the clip space every draw. Ideally, the clip
space state would be part of this descriptor.&lt;/p&gt;
&lt;p&gt;We can test this optimistic guess by augmenting our Metal test bench
to call
&lt;code&gt;[MTLRenderPipelineDescriptorInternal setOpenGLModeEnabled: YES]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It feels strange to call this hidden method. It’s stranger when the
code compiles and runs just fine.&lt;/p&gt;
&lt;p&gt;We can then compare traces between OpenGL mode and the normal Metal
mode. Seemingly, enabling OpenGL mode toggles a plethora of random
unknown bits. Even if one of them is what we want, it’s a bit
unsatisfying that the “real” Metal would lack a proper
&lt;code&gt;[setClipSpace: MTLMinusOneToOne]&lt;/code&gt; method, rather than this
blunt hack reconfiguring a pile of loosely related API behaviours.&lt;/p&gt;
&lt;p&gt;Alas, for all the random changes in “OpenGL mode”, none seem to
affect clipping behaviour.&lt;/p&gt;
&lt;p&gt;Hope is not yet lost. There’s another
&lt;code&gt;setOpenGLModeEnabled&lt;/code&gt; method, this time in the render pass
descriptor. Rather than pipeline state that can change every draw, this
descriptor’s state can only change in between render passes. Changing
that state in between draws would require an expensive flush to main
memory, similar to &lt;a href="/blog/asahi-gpu-part-5.html"&gt;the partial
renders seen elsewhere with the Apple GPU&lt;/a&gt;. Nevertheless, it’s worth
a shot.&lt;/p&gt;
&lt;p&gt;Changing our test bench to call
&lt;code&gt;[MTLRenderPassDescriptorInternal setOpenGLModeEnabled: YES]&lt;/code&gt;,
we find another collection of random bits changed. Most of them are in
hardware packets, and none of those seem to control clip space,
either.&lt;/p&gt;
&lt;p&gt;One bit does stand out. It’s not a hardware bit.&lt;/p&gt;
&lt;p&gt;In addition to the packets that the userspace driver prepares for the
hardware, userspace passes to the &lt;em&gt;kernel&lt;/em&gt; a large block of
render pass state describing everything from tile size to the
depth/stencil buffers. Such a design is unusual. Ordinarily, GPU kernel
drivers are only concerned with memory management and scheduling,
remaining oblivious of 3D graphics. By contrast, Apple processes this
state in the kernel, forwarding the state to the GPU’s firmware to
configure the actual hardware.&lt;/p&gt;
&lt;p&gt;Comparing traces, the render pass “OpenGL mode” sets an unknown bit
in this kernel-processed block. If we set the same bit in our OpenGL
driver, we find the clip space changes to -1/1. Victory, right?&lt;/p&gt;
&lt;p&gt;Almost. Because this bit is render pass state, we can’t use it to
change the clip space between draws. That’s okay for baseline OpenGL and
Vulkan, but it prevents us from efficiently implementing the
&lt;code&gt;ARB_clip_control&lt;/code&gt; and &lt;code&gt;VK_EXT_depth_clip_control&lt;/code&gt;
extensions. There &lt;em&gt;are&lt;/em&gt; at least three (inefficient)
implementations.&lt;/p&gt;
&lt;p&gt;The first is ignoring the hardware support and emulating one of the
clip spaces by inserting extra instructions into the vertex shader when
the “wrong” clip space is used. In addition to extra overhead, that
requires &lt;em&gt;shader variants&lt;/em&gt; for the different clip spaces.&lt;/p&gt;
&lt;p&gt;Shader variants are terrible.&lt;/p&gt;
&lt;p&gt;In new APIs like Vulkan, Metal, and D3D12, everything needed to
compile a shader is known up-front as part of a monolithic pipeline.
That means pipelines are compiled when they’re created, not when they’re
used, and are never recompiled. By contrast, older APIs like OpenGL and
D3D11 allow using the same shader with different API states, requiring
some drivers to recompile shaders on the fly. Compiling shaders is slow,
so shader variants can cause unpredictable drops in an application’s
frame rate, familiar to desktop gamers as stuttering. If we use this
approach in our OpenGL driver, switching clip modes could cause
stuttering due to recompiling shaders. In bad circumstances, that
stutter could even happen long after the mode is switched.&lt;/p&gt;
&lt;p&gt;That option is undesirable, so the second approach is &lt;em&gt;always&lt;/em&gt;
inserting emulation instructions that read the desired clip space at
run-time, reserving a uniform (push constant) for the transformation.
That way, the same shader is usable with either clip space, eliminating
shader variants. However, that has even higher overhead than the first
method. If an application frequently changes clip spaces within a render
pass, this approach will be the most efficient of the three. If it does
not, this approach adds constant overhead to &lt;em&gt;every&lt;/em&gt; application.
Knowing which approach is better requires the driver to have a magic
crystal ball.&lt;a href="#fn1" class="footnote-ref" id="fnref1"
role="doc-noteref"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
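&lt;p&gt;To make the second approach concrete, here is a rough sketch. It assumes the hardware stays in the 0/1 clip space and that the driver uploads a hypothetical (scale, bias) pair as the reserved uniform. None of these names come from the real driver.&lt;/p&gt;
&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;/* Hypothetical uniform contents, chosen by the driver at draw time. */
struct clip_transform {
    float scale; /* multiplies the shader's Z */
    float bias;  /* multiplies the shader's W */
};

/* Emulation code appended to every vertex shader in this sketch.
 * The same code serves both clip spaces; only the uniform changes. */
static float apply_clip_transform(struct clip_transform t, float z, float w)
{
    return t.scale * z + t.bias * w;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions, the driver would upload (1, 0) when the application already uses 0/1 and (0.5, 0.5) when it wants -1/1. Either way the shader pays for the multiply-add, which is the constant overhead described above.&lt;/p&gt;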
&lt;p&gt;The final option is &lt;em&gt;using&lt;/em&gt; the hardware clip space bit and
splitting the render pass when the clip space is changed. Here, the
shaders are optimal and do not require variants. However, splitting the
render pass wastes tremendous memory bandwidth if the application
changes clip spaces frequently. Nevertheless, this approach has some
support from the &lt;code&gt;ARB_clip_control&lt;/code&gt; specification:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Some [OpenGL] implementations may introduce a flush when changing the
clip control state. Hence frequent clip control changes are not
recommended.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each approach has trade-offs. For now, the easiest “option” is
sticking our head in the sand and giving up on
&lt;code&gt;ARB_clip_control&lt;/code&gt; altogether. The OpenGL extension is
optional until we get to OpenGL 4.5. Apple doesn’t implement it in their
OpenGL stack. Because &lt;code&gt;ARB_clip_control&lt;/code&gt; is primarily for
porting Direct3D games, native OpenGL games are happy without it.
Certainly, Neverball doesn’t mind. For now, we can use the hardware bit
to use the -1/1 clip space unconditionally in OpenGL and 0/1
unconditionally in Vulkan. That does not require any emulation or
flushing, though it prevents us from advertising the extensions.&lt;/p&gt;
&lt;p&gt;That’s enough to run Neverball on macOS, using our userspace OpenGL
driver in Mesa, and Apple’s proprietary kernel driver. There’s a catch:
Neverball has to present with the deprecated X11 server on macOS. Years
ago, Apple engineers&lt;a href="#fn2" class="footnote-ref" id="fnref2"
role="doc-noteref"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; contributed Mesa support for X11 on
macOS (XQuartz), allowing us to run X11 applications with our Mesa
driver. However, there’s no support for Apple’s own Cocoa windowing
system, meaning native macOS applications won’t work with our driver.
There’s also no easy way to run Linux’s newer Wayland display server on
macOS. Nevertheless, Neverball does not use Cocoa directly. Instead, it
uses the cross-platform &lt;a
href="https://github.com/libsdl-org/SDL"&gt;SDL2&lt;/a&gt; library to create its
window, which internally uses Cocoa, X11, or Wayland as appropriate for
the operating system. With enough sweat and tears, we can build a
macOS/X11 version of SDL2 and link Neverball with that.&lt;/p&gt;
&lt;p&gt;This Neverball/macOS/X11 port was frustrating, especially when the
game is one &lt;code&gt;apt install&lt;/code&gt; away on Linux. That’s a job for &lt;a
href="https://twitter.com/linaasahi"&gt;Asahi Lina&lt;/a&gt;, who has been hard
at work writing a Linux kernel driver for Apple’s GPU. When our work
converges, my userspace Mesa driver will run on Linux with her kernel
driver to implement a full open source graphics stack for 3D
acceleration on Asahi Linux.&lt;/p&gt;
&lt;p&gt;Please temper your expectations: even with hardware documentation, an
optimized Vulkan driver stack (with enough features to layer OpenGL 4.6
with Zink) requires many years of full time work. At least for now,
nobody is working on this driver full time&lt;a href="#fn3"
class="footnote-ref" id="fnref3" role="doc-noteref"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.
Reverse-engineering slows the process considerably. We won’t be playing
AAA games any time soon.&lt;/p&gt;
&lt;p&gt;That said, thanks to the tremendous shared code in Mesa, a basic
OpenGL driver is doable by a single person. I’m optimistic that we’ll
have native OpenGL 2.1 in Asahi Linux by the end of the year. That’s
enough to accelerate your desktop environment and browser. It’s also
enough to play older games (like Neverball). Even without fancy
features, GPU acceleration means smooth animations and better battery
life.&lt;/p&gt;
&lt;p&gt;In that light, the Asahi Linux future looks bright.&lt;/p&gt;
&lt;p&gt;&lt;img src="/Neverball2.webp" /&gt;&lt;/p&gt;
&lt;section id="footnotes" class="footnotes footnotes-end-of-document"
role="doc-endnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn1"&gt;&lt;p&gt;This crystal ball is called “Vulkan, Metal, or D3D12”,
and it has its own problems.&lt;a href="#fnref1" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id="fn2"&gt;&lt;p&gt;Hi Jeremy!&lt;a href="#fnref2" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id="fn3"&gt;&lt;p&gt;I work full-time at &lt;a
href="https://collabora.com"&gt;Collabora&lt;/a&gt; on my baby, the open source
Panfrost driver for Mali GPUs.&lt;a href="#fnref3" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/asahi-gpu-part-6.html</guid><pubDate>Mon, 22 Aug 2022 00:00:00 -0500</pubDate></item><item><title>The Apple GPU and the Impossible Bug</title><link>https://alyssarosenzweig.ca/blog/asahi-gpu-part-5.html</link><description>&lt;p&gt;In late 2020, Apple debuted the M1 with Apple’s GPU architecture,
AGX, rumoured to be derived from Imagination’s PowerVR series. Since
then, &lt;a href="https://asahilinux.org"&gt;we’ve&lt;/a&gt; been
reverse-engineering AGX and building open source graphics drivers. Last
January, I &lt;a href="/blog/asahi-gpu-part-2.html"&gt;rendered a triangle&lt;/a&gt;
with my own code, but there has since been a heinous bug lurking:&lt;/p&gt;
&lt;p&gt;The driver fails to render large amounts of geometry.&lt;/p&gt;
&lt;p&gt;Spinning a cube is fine, low polygon geometry is okay, but detailed
models won’t render. Instead, the GPU renders only part of the model and
then faults.&lt;/p&gt;
&lt;figure&gt;
&lt;img src="/PartialPhong.webp" alt="Partially rendered bunny" /&gt;
&lt;figcaption aria-hidden="true"&gt;Partially rendered bunny&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;It’s hard to pinpoint how much we can render without faults. It’s not
just the geometry complexity that matters. The same geometry can render
with simple shaders but fault with complex ones.&lt;/p&gt;
&lt;p&gt;That suggests rendering detailed geometry with a complex shader
“takes too long”, and the GPU is timing out. Maybe it renders only the
parts it finished in time.&lt;/p&gt;
&lt;p&gt;Given the hardware architecture, this explanation is unlikely.&lt;/p&gt;
&lt;p&gt;This hypothesis is easy to test, because we can control for timing
with a shader that takes as long as we like:&lt;/p&gt;
&lt;div class="sourceCode" id="cb1"&gt;&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;&lt;span id="cb1-1"&gt;&lt;a href="#cb1-1" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="cf"&gt;for&lt;/span&gt; &lt;span class="op"&gt;(&lt;/span&gt;&lt;span class="dt"&gt;int&lt;/span&gt; i &lt;span class="op"&gt;=&lt;/span&gt; &lt;span class="dv"&gt;0&lt;/span&gt;&lt;span class="op"&gt;;&lt;/span&gt; i &lt;span class="op"&gt;&amp;lt;&lt;/span&gt; LARGE_NUMBER&lt;span class="op"&gt;;&lt;/span&gt; &lt;span class="op"&gt;++&lt;/span&gt;i&lt;span class="op"&gt;)&lt;/span&gt; &lt;span class="op"&gt;{&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-2"&gt;&lt;a href="#cb1-2" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;    &lt;span class="co"&gt;/* some work to prevent the optimizer from removing the loop */&lt;/span&gt;&lt;/span&gt;
&lt;span id="cb1-3"&gt;&lt;a href="#cb1-3" aria-hidden="true" tabindex="-1"&gt;&lt;/a&gt;&lt;span class="op"&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After experimenting with such a shader, we learn…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If shaders have a time limit to protect against infinite loops, it’s
astronomically high. There’s no way our bunny hits that limit.&lt;/li&gt;
&lt;li&gt;The symptoms of timing out differ from the symptoms of our driver
rendering too much geometry.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That theory is out.&lt;/p&gt;
&lt;p&gt;Let’s experiment more. Modifying the shader and seeing where it
breaks, we find the only part of the shader contributing to the bug: the
amount of data interpolated per vertex. Modern graphics APIs allow
specifying “varying” data for each vertex, like the colour or the
surface normal. Then, for each triangle the hardware renders, these
“varyings” are interpolated across the triangle to provide smooth inputs
to the fragment shader, allowing efficient implementation of common
graphics techniques like Blinn-Phong shading.&lt;/p&gt;
&lt;p&gt;Putting the pieces together, what matters is the &lt;em&gt;product&lt;/em&gt; of
the number of vertices (geometry complexity) &lt;em&gt;times&lt;/em&gt; amount of
data per vertex (“shading” complexity). That product is “total amount of
per-vertex data”. The GPU faults if we use too much &lt;em&gt;total&lt;/em&gt;
per-vertex data.&lt;/p&gt;
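&lt;p&gt;As a back-of-the-envelope sketch of that product, assume every varying component is a 32-bit float. That is an assumption: the hardware may pack data differently, and the function name is made up.&lt;/p&gt;
&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;/* Hypothetical estimate of the total per-vertex data, in bytes,
 * assuming 4 bytes per varying component. */
static unsigned long total_varying_bytes(unsigned long vertices,
                                         unsigned long floats_per_vertex)
{
    return vertices * floats_per_vertex * 4ul;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Doubling either the vertex count or the number of varyings doubles the estimate, which matches the observation that a simple shader can render geometry that a complex shader cannot.&lt;/p&gt;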
&lt;p&gt;Why?&lt;/p&gt;
&lt;p&gt;When the hardware processes each vertex, the vertex shader produces
per-vertex data. That data has to &lt;em&gt;go&lt;/em&gt; somewhere. How this works
depends on the hardware architecture. Let’s consider common GPU
architectures.&lt;a href="#fn1" class="footnote-ref" id="fnref1"
role="doc-noteref"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Traditional &lt;strong&gt;immediate mode renderers&lt;/strong&gt; render directly
into the framebuffer. They first run the vertex shader for each vertex
of a triangle, then run the fragment shader for each pixel in the
triangle. Per-vertex “varying” data is passed almost directly between
the shaders, so immediate mode renderers are efficient for complex
scenes.&lt;/p&gt;
&lt;p&gt;There is a drawback: rendering directly into the framebuffer requires
tremendous amounts of memory access to constantly write the results of
the fragment shader and to read back results when blending.
Immediate mode renderers are suited to discrete, power-hungry desktop
GPUs with dedicated video RAM.&lt;/p&gt;
&lt;p&gt;By contrast, &lt;strong&gt;tile-based deferred renderers&lt;/strong&gt; split
rendering into two passes. First, the hardware runs all vertex shaders
for the entire frame, not just for a single model. Then the framebuffer
is divided into small tiles, and dedicated hardware called a
&lt;em&gt;tiler&lt;/em&gt; determines which triangles are in each tile. Finally, for
each tile, the hardware runs all relevant fragment shaders and writes
the final blended result to memory.&lt;/p&gt;
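&lt;p&gt;For intuition about the tiler’s job, here is a simplified sketch that counts the tiles touched by a triangle’s bounding box. The 32 pixel tile size and the function name are assumptions for illustration, and real tilers are more precise than a bounding box test.&lt;/p&gt;
&lt;pre class="sourceCode c"&gt;&lt;code class="sourceCode c"&gt;#define TILE_SIZE 32u /* assumed tile edge in pixels */

/* Number of tiles overlapped by a pixel-space bounding box.
 * All four bounds are inclusive pixel coordinates. */
static unsigned tiles_touched(unsigned min_x, unsigned min_y,
                              unsigned max_x, unsigned max_y)
{
    unsigned tiles_x = max_x / TILE_SIZE - min_x / TILE_SIZE + 1u;
    unsigned tiles_y = max_y / TILE_SIZE - min_y / TILE_SIZE + 1u;
    return tiles_x * tiles_y;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A small triangle usually lands in one tile, while a triangle that straddles a tile corner is recorded in four, so its vertex data must stay available until all of those tiles have been shaded.&lt;/p&gt;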
&lt;p&gt;Tilers reduce memory traffic required for the framebuffer. As the
hardware renders a single tile at a time, it keeps a “cached” copy of
that tile of the framebuffer (called the “tilebuffer”). The tilebuffer
is small, just a few kilobytes, but tilebuffer access is &lt;em&gt;fast&lt;/em&gt;.
Writing to the tilebuffer is cheap, and unlike immediate renderers,
blending is almost free. Because main memory access is expensive and
mobile GPUs can’t afford dedicated video memory, tilers are suited to
mobile GPUs, like Arm’s Mali, Imagination’s PowerVR, and Apple’s
AGX.&lt;/p&gt;
&lt;p&gt;Yes, AGX is a &lt;em&gt;mobile&lt;/em&gt; GPU, designed for the iPhone. The M1 is
a screaming fast desktop, but its unified memory and tiler GPU have
roots in mobile phones. Tilers work well on the desktop, but there are
some drawbacks.&lt;/p&gt;
&lt;p&gt;First, at the start of a frame, the contents of the tilebuffer are
undefined. If the application needs to preserve existing framebuffer
contents, the driver needs to load the framebuffer from main memory and
store it into the tilebuffer. This is expensive.&lt;/p&gt;
&lt;p&gt;Second, because all vertex shaders are run before any fragment
shaders, the hardware needs a buffer to store the outputs of all vertex
shaders. In general, there is much more data required than space inside
the GPU, so this buffer must be in main memory. This is also
expensive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ah-ha&lt;/strong&gt;. Because AGX is a tiler, it requires a buffer
of &lt;em&gt;all&lt;/em&gt; per-vertex data. We fault when we use too &lt;em&gt;much&lt;/em&gt;
total per-vertex data, overflowing the buffer.&lt;/p&gt;
&lt;p&gt;…So how do we allocate a larger buffer?&lt;/p&gt;
&lt;p&gt;On some tilers, like older versions of Arm’s Mali GPU, the userspace
driver computes how large this “varyings” buffer should be and allocates
it.&lt;a href="#fn2" class="footnote-ref" id="fnref2"
role="doc-noteref"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; To fix the faults, we can try
increasing the sizes of all buffers we allocate, in the hopes that one
of them contains the per-vertex data.&lt;/p&gt;
&lt;p&gt;No dice.&lt;/p&gt;
&lt;p&gt;It’s prudent to observe what Apple’s Metal driver does. We can cook
up a Metal program drawing variable amounts of geometry and trace all
GPU memory allocations that Metal performs while running our program.
Doing so, we learn that increasing the amount of geometry drawn does
&lt;em&gt;not&lt;/em&gt; increase the sizes of any allocated buffers. In fact, it
doesn’t change &lt;em&gt;anything&lt;/em&gt; in the command buffer submitted to the
kernel, except for the single “number of vertices” field in the draw
command.&lt;/p&gt;
&lt;p&gt;We &lt;em&gt;know&lt;/em&gt; that buffer exists. If it’s not allocated by
userspace – and by now it seems that it’s not – it must be allocated by
the kernel or firmware.&lt;/p&gt;
&lt;p&gt;Here’s a funny thought: maybe we don’t specify the size of the buffer
at all. Maybe it’s &lt;em&gt;okay&lt;/em&gt; for it to overflow, and there’s a way
to handle the overflow.&lt;/p&gt;
&lt;p&gt;It’s time for a little reconnaissance. Digging through what little
public documentation exists for AGX, we learn from one &lt;a
href="https://developer.apple.com/videos/play/wwdc2020/10602/"&gt;WWDC
presentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Tiled Vertex Buffer stores the Tiling phase output, which
includes the post-transform vertex data…&lt;/p&gt;
&lt;p&gt;But it may cause a Partial Render if full. A Partial Render is when
the GPU splits the render pass in order to flush the contents of that
buffer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can
overflow. To cope, the GPU stops accepting new geometry, renders the
existing geometry, and restarts rendering.&lt;/p&gt;
&lt;p&gt;Since partial renders hurt performance, Metal application developers
need to know about them to optimize their applications. There should be
&lt;em&gt;performance counters&lt;/em&gt; flagging this issue. Poking around, we
find two:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Number of partial renders.&lt;/li&gt;
&lt;li&gt;Number of bytes used of the parameter buffer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Wait, what’s a “parameter buffer”?&lt;/p&gt;
&lt;p&gt;Remember the rumours that AGX is derived from PowerVR? The public
PowerVR &lt;a
href="https://docs.imgtec.com/Profiling_and_Optimisations/PerfRec/topics/c_PerfRec_parameter_buffer.html"&gt;optimization&lt;/a&gt;
&lt;a
href="https://github.com/powervr-graphics/WebGL_SDK/blob/4.0/Documentation/Architecture%20Guides/PowerVR.Performance%20Recommendations.pdf"&gt;guides&lt;/a&gt;
explain:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[The] list containing pointers to each vertex passed in from the
application… is called the &lt;strong&gt;parameter buffer&lt;/strong&gt; (PB) and is
stored in system memory along with the vertex data.&lt;/p&gt;
&lt;p&gt;Each varying requires additional space in the parameter buffer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Tiled Vertex Buffer &lt;em&gt;is&lt;/em&gt; the Parameter Buffer. PB is the
PowerVR name, TVB is the public Apple name, and PB is still an internal
Apple name.&lt;/p&gt;
&lt;p&gt;What happens when PowerVR overflows the parameter buffer?&lt;/p&gt;
&lt;p&gt;An old &lt;a
href="http://imgtec.eetrend.com/sites/imgtec.eetrend.com/files/download/201402/1458-2110-1385011428.pdf"&gt;PowerVR
presentation&lt;/a&gt; says that when the parameter buffer is full, the
“render is flushed”, meaning “flushed data must be retrieved from the
frame buffer as successive tile renders are performed”. In other words,
it performs a partial render.&lt;/p&gt;
&lt;p&gt;Back to the Apple M1, it seems the hardware is failing to perform a
partial render. Let’s revisit the broken render.&lt;/p&gt;
&lt;figure&gt;
&lt;img src="/PartialPhong.webp" alt="Partially rendered bunny, again" /&gt;
&lt;figcaption aria-hidden="true"&gt;Partially rendered bunny,
again&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Notice &lt;em&gt;parts&lt;/em&gt; of the model are correctly rendered. The parts
that are not show only the scene’s black clear colour, rendered at
the start. Let’s consider the logical order of events.&lt;/p&gt;
&lt;p&gt;First, the hardware runs vertex shaders for the bunny until the
parameter buffer overflows. This works: the partial geometry is
correct.&lt;/p&gt;
&lt;p&gt;Second, the hardware rasterizes the partial geometry and runs the
fragment shaders. This works: the shading is correct.&lt;/p&gt;
&lt;p&gt;Third, the hardware flushes the partial render to the framebuffer.
This must work for us to see anything at all.&lt;/p&gt;
&lt;p&gt;Fourth, the hardware runs vertex shaders for the rest of the bunny’s
geometry. This ought to work: the configuration is identical to the
original vertex shaders.&lt;/p&gt;
&lt;p&gt;Fifth, the hardware rasterizes and shades the rest of the geometry,
blending with the old partial render. Because AGX is a tiler, to
preserve that existing partial render, the hardware needs to load it
back into the tilebuffer. We have no idea how it does this.&lt;/p&gt;
&lt;p&gt;Finally, the hardware flushes the render to the framebuffer. This
should work as it did the first time.&lt;/p&gt;
&lt;p&gt;The only problematic step is &lt;em&gt;loading the framebuffer back into
the tilebuffer after a partial render&lt;/em&gt;. Usually, the driver supplies
two “extra” fragment shaders. One clears the tilebuffer at the start,
and the other flushes out the tilebuffer contents at the end.&lt;/p&gt;
&lt;p&gt;If the application needs the existing framebuffer contents preserved,
the “load tilebuffer” program reads from the framebuffer to reload the
contents instead of writing a clear colour. Handling this
requires quite a bit of code, but it works in our driver.&lt;/p&gt;
&lt;p&gt;Looking closer, AGX requires more auxiliary programs.&lt;/p&gt;
&lt;p&gt;The “store” program is supplied &lt;em&gt;twice&lt;/em&gt;. I noticed this when
initially bringing up the hardware, but the reason for the duplication
was unclear. Omitting each copy separately and seeing what breaks, the
reason becomes clear: one program flushes the &lt;em&gt;final&lt;/em&gt; render, and
the other flushes a &lt;em&gt;partial render&lt;/em&gt;.&lt;a href="#fn3"
class="footnote-ref" id="fnref3" role="doc-noteref"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;…What about the program that loads the framebuffer into the
tilebuffer?&lt;/p&gt;
&lt;p&gt;When a partial render is possible, there are two “load” programs. One
writes the clear colour or loads the framebuffer, depending on the
application setting. We understand this one. The other &lt;em&gt;always&lt;/em&gt;
loads the framebuffer.&lt;/p&gt;
&lt;p&gt;…Always loads the framebuffer, as in, for loading back with a partial
render even if there is a clear at the start of the frame?&lt;/p&gt;
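&lt;p&gt;A toy model makes the need for that second load program concrete. The
sketch below is purely illustrative and is not how AGX works internally:
the function &lt;code&gt;render&lt;/code&gt;, its parameters, and the use of Python sets
to stand in for pixels are all assumptions made for the example. It
draws a list of triangles, forces a partial render whenever a tiny
“parameter buffer” fills, and lets us choose what happens to the
tilebuffer when rendering resumes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def render(triangles, capacity, resume_load):
    """Toy tiler: draw triangles, flushing whenever the parameter buffer fills."""
    framebuffer = set()   # what main memory holds
    tilebuffer = set()    # on-chip copy; the frame starts with a clear
    pending = []          # geometry binned but not yet shaded

    def flush():
        tilebuffer.update(pending)        # shade the binned geometry
        pending.clear()
        framebuffer.clear()
        framebuffer.update(tilebuffer)    # the "store" program

    for tri in triangles:
        if len(pending) == capacity:      # parameter buffer is full
            flush()                       # partial render
            tilebuffer.clear()            # the tilebuffer is reused, contents are gone
            if resume_load == 'framebuffer':
                tilebuffer.update(framebuffer)   # the "load" program used when resuming
        pending.append(tri)
    flush()                               # final render
    return framebuffer


complete = render(range(10), 4, 'framebuffer')   # every triangle survives
broken = render(range(10), 4, 'clear')           # only the last batch survives&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this model, resuming with a clear throws away everything drawn before
the overflow, even though the application asked for a clear only at the
start of the frame. That is why the resume path has to load the
framebuffer unconditionally.&lt;/p&gt;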
&lt;p&gt;If this program is the issue, we can confirm easily. Metal must
require it to draw the same bunny, so we can write a Metal application
drawing the bunny and stomp over its GPU memory to replace this
auxiliary load program with one always loading with black.&lt;/p&gt;
&lt;figure&gt;
&lt;img src="/Metal-Artefacts.webp"
alt="Metal drawing the bunny, stomping over its memory." /&gt;
&lt;figcaption aria-hidden="true"&gt;Metal drawing the bunny, stomping over
its memory.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Doing so, Metal fails in a similar way. That means we’re at the root
cause. Looking at our own driver code, we don’t specify &lt;em&gt;any&lt;/em&gt;
program for this partial render load. Up until now, that’s worked okay.
If the parameter buffer is never overflowed, this program is unused. As
soon as a partial render is required, however, failing to provide this
program means the GPU dereferences a null pointer and faults. That
explains our GPU faults at the beginning.&lt;/p&gt;
&lt;p&gt;Following Metal, we supply our own program to load back the
tilebuffer after a partial render…&lt;/p&gt;
&lt;figure&gt;
&lt;img src="/BrokenDepth.webp" alt="Bunny with the fix" /&gt;
&lt;figcaption aria-hidden="true"&gt;Bunny with the fix&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;…which does &lt;em&gt;not&lt;/em&gt; fix the rendering! Cursed, this GPU. The
faults go away, but the render still isn’t quite right for the first few
frames, indicating partial renders are still broken. Notice the weird
artefacts on the feet.&lt;/p&gt;
&lt;p&gt;Curiously, the render “repairs itself” after a few frames, suggesting
the parameter buffer stops overflowing. This implies the parameter
buffer can be resized (by the kernel or by the firmware), and the system
is growing the parameter buffer after a few frames in response to
overflow. This mechanism makes sense:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The hardware can’t allocate more parameter buffer space itself.&lt;/li&gt;
&lt;li&gt;Overflowing the parameter buffer is expensive, as partial renders
require tremendous memory bandwidth.&lt;/li&gt;
&lt;li&gt;Overallocating the parameter buffer wastes memory for applications
rendering simple geometry.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Starting the parameter buffer small and growing in response to
overflow provides a balance, reducing the GPU’s memory footprint and
minimizing partial renders.&lt;/p&gt;
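&lt;p&gt;We don’t know the actual growth policy, but the behaviour we observe is
consistent with something as simple as the sketch below. Everything here
is hypothetical: the doubling rule, the function name
&lt;code&gt;simulate&lt;/code&gt;, and the byte counts are assumptions chosen only to
show how the partial renders can disappear after a few frames.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def simulate(frames, bytes_per_frame, initial_size):
    """Return (partial renders per frame, final buffer size)."""
    size = initial_size
    history = []
    for _ in range(frames):
        # Number of passes needed to push the frame through the buffer
        # (ceiling division); every pass but the last is a partial render.
        passes = -(-bytes_per_frame // size)
        partials = passes - 1
        history.append(partials)
        if partials != 0:
            size *= 2   # assumed policy: grow after any frame that overflowed
    return history, size


history, final_size = simulate(5, 1000, 256)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these made-up numbers, the first frame needs three partial renders,
the second needs one, and from the third frame on the buffer is large
enough that none are needed.&lt;/p&gt;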
&lt;p&gt;Back to our misrendering. There are actually &lt;em&gt;two&lt;/em&gt; buffers
being used by our program, a colour buffer (framebuffer)… and a depth
buffer. The depth buffer isn’t directly visible, but facilitates the
“depth test”, which discards far pixels that are occluded by other close
pixels. While the partial render mechanism discards geometry, the depth
test discards pixels.&lt;/p&gt;
&lt;p&gt;That would explain the missing pixels on our bunny. The depth test is
broken with partial renders. Why? The depth test depends on the depth
buffer, so the depth buffer must &lt;em&gt;also&lt;/em&gt; be stored after a partial
render and loaded back when resuming. Comparing a trace from our driver
to a trace from Metal, looking for any relevant difference, we
eventually stumble on the configuration required to make depth buffer
flushes work.&lt;/p&gt;
&lt;p&gt;And with that, we get our bunny.&lt;/p&gt;
&lt;figure&gt;
&lt;img src="/Final.webp" alt="The final Phong shaded bunny" /&gt;
&lt;figcaption aria-hidden="true"&gt;The final Phong shaded bunny&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;section id="footnotes" class="footnotes footnotes-end-of-document"
role="doc-endnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn1"&gt;&lt;p&gt;These explanations are massive oversimplifications of
how modern GPUs work, but it’s good enough for our purposes here.&lt;a
href="#fnref1" class="footnote-back" role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id="fn2"&gt;&lt;p&gt;This is a worse idea than it sounds. Starting with the
new Valhall architecture, Mali allocates varyings much more
efficiently.&lt;a href="#fnref2" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id="fn3"&gt;&lt;p&gt;Why the duplication? I have not yet observed Metal using
different programs for each. However, for front buffer rendering,
partial renders need to be flushed to a temporary buffer for this scheme
to work. Of course, you may as well use double buffering at that
point.&lt;a href="#fnref3" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/asahi-gpu-part-5.html</guid><pubDate>Fri, 13 May 2022 00:00:00 -0500</pubDate></item><item><title>Fun and Games with Exposure Notifications</title><link>https://alyssarosenzweig.ca/blog/fun-and-games-with-exposure-notifications.html</link><description>&lt;p&gt;&lt;a
href="https://en.wikipedia.org/wiki/Exposure_Notification"&gt;&lt;em&gt;Exposure
Notifications&lt;/em&gt;&lt;/a&gt; is a protocol developed by Apple and Google for
facilitating COVID-19 contact tracing on &lt;em&gt;mobile phones&lt;/em&gt; by
exchanging codes with nearby phones over &lt;a
href="https://en.wikipedia.org/wiki/Bluetooth"&gt;Bluetooth&lt;/a&gt;,
implemented within the Android and iOS operating systems, now available
here in Toronto.&lt;/p&gt;
&lt;p&gt;Wait – phones? Android and iOS only? Can’t my &lt;a
href="https://debian.org"&gt;Debian&lt;/a&gt; laptop participate? It has a recent
Bluetooth chip. What about phones running GNU/Linux distributions like
the &lt;a href="https://en.wikipedia.org/wiki/PinePhone"&gt;PinePhone&lt;/a&gt; or
&lt;a href="https://en.wikipedia.org/wiki/Librem_5"&gt;Librem 5&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;Exposure Notifications breaks down neatly into three sections: a
Bluetooth layer, some cryptography, and integration with local public
health authorities. Linux is up to the task, via &lt;a
href="http://www.bluez.org/"&gt;BlueZ&lt;/a&gt;, &lt;a
href="https://en.wikipedia.org/wiki/OpenSSL"&gt;OpenSSL&lt;/a&gt;, and some &lt;a
href="https://en.wikipedia.org/wiki/Python_(programming_language)"&gt;Python&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Given my background, will this build to be a &lt;a
href="/blog/my-name-is-cafe-beverage.html"&gt;reverse-engineering epic&lt;/a&gt;
resulting in a novel open stack for a closed system?&lt;/p&gt;
&lt;p&gt;…&lt;/p&gt;
&lt;p&gt;Not at all. The specifications for Exposure Notifications are
available for both the &lt;a
href="https://covid19-static.cdn-apple.com/applications/covid19/current/static/contact-tracing/pdf/ExposureNotification-BluetoothSpecificationv1.2.pdf?1"&gt;Bluetooth
protocol&lt;/a&gt; and the &lt;a
href="https://covid19-static.cdn-apple.com/applications/covid19/current/static/contact-tracing/pdf/ExposureNotification-CryptographySpecificationv1.2.pdf?1"&gt;underlying
cryptography&lt;/a&gt;. A &lt;a
href="https://github.com/google/exposure-notifications-internals"&gt;partial
reference implementation&lt;/a&gt; is available for Android, as is an
independent Android implementation in &lt;a
href="https://github.com/microg/android_packages_apps_GmsCore"&gt;microG&lt;/a&gt;.
In Canada, the key servers run an &lt;a
href="https://github.com/cds-snc/covid-alert-server"&gt;open source
stack&lt;/a&gt; originally built by Shopify and now maintained by the &lt;a
href="https://digital.canada.ca/"&gt;Canadian Digital Service&lt;/a&gt;,
including open &lt;a
href="https://github.com/cds-snc/covid-alert-server/tree/master/proto"&gt;protocol
documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All in all, this is looking to be a smooth-sailing weekend&lt;a
href="#fn1" class="footnote-ref" id="fnref1"
role="doc-noteref"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; project.&lt;/p&gt;
&lt;p&gt;The devil’s in the details.&lt;/p&gt;
&lt;h2 id="bluetooth"&gt;Bluetooth&lt;/h2&gt;
&lt;p&gt;Exposure Notifications operates via Bluetooth Low Energy
“advertisements”. Scanning for other devices is as simple as scanning
for advertisements, and broadcasting is as simple as advertising
ourselves.&lt;/p&gt;
&lt;p&gt;On an Android phone, this is handled deep within Google Play
Services. Can we drive the protocol from userspace on a regular
GNU/Linux laptop? It depends. Not all laptops support Bluetooth, not all
Bluetooth implementations support Bluetooth Low Energy, and I hear not
all Bluetooth Low Energy implementations properly support undirected
transmissions (“advertising”).&lt;/p&gt;
&lt;p&gt;Luckily in my case, I develop on a Debianized Chromebook with a
Wi-Fi/Bluetooth module. I’ve never used the Bluetooth, but it turns out
the module has full support for advertisements, verified with the
&lt;code&gt;lescan&lt;/code&gt; (&lt;strong&gt;L&lt;/strong&gt;ow &lt;strong&gt;E&lt;/strong&gt;nergy
&lt;strong&gt;Scan&lt;/strong&gt;) command of the &lt;code&gt;hcitool&lt;/code&gt; Bluetooth
utility.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;hcitool&lt;/code&gt; is a part of BlueZ, the standard Linux library
for Bluetooth. Since &lt;code&gt;lescan&lt;/code&gt; is able to detect nearby phones
running Exposure Notifications, poring over its source code is a
good first step to our implementation. With some minor changes to
&lt;code&gt;hcitool&lt;/code&gt; to dump packets as raw hex and to filter for the
Exposure Notifications protocol, we can print all nearby Exposure
Notifications advertisements. So far, so good.&lt;/p&gt;
&lt;p&gt;That’s about where the good ends.&lt;/p&gt;
&lt;p&gt;While scanning is simple with reference code in &lt;code&gt;hcitool&lt;/code&gt;,
advertising is complicated by BlueZ’s lack of an interface at the time
of writing. While a general “enable advertising” routine exists,
routines to set advertising parameters and data per the Exposure
Notifications specification are unavailable. This is not a showstopper,
since BlueZ is itself an open source userspace library. We can drive the
Bluetooth module the same way BlueZ does internally, filling in the
necessary gaps in the API, while continuing to use BlueZ for the
heavy lifting.&lt;/p&gt;
&lt;p&gt;Some care is needed to multiplex scanning and advertising within a
single thread while remaining power efficient. The key is that
advertising, once configured, is handled entirely in hardware without
CPU intervention. On the other hand, scanning does require CPU
involvement, but it is &lt;em&gt;not&lt;/em&gt; necessary to scan continuously.
Since COVID-19 is thought to transmit from &lt;em&gt;sustained&lt;/em&gt; exposure,
we only need to scan every few minutes. (Food for thought: how does this
connect to the sampling theorem?)&lt;/p&gt;
&lt;p&gt;Thus we can order our operations as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Configure advertising&lt;/li&gt;
&lt;li&gt;Scan for devices&lt;/li&gt;
&lt;li&gt;Wait for several minutes&lt;/li&gt;
&lt;li&gt;Repeat.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since most of the time the program is asleep, this loop is efficient.
It additionally allows us to reconfigure advertising every ten to
fifteen minutes, in order to change the Bluetooth address to prevent
tracking.&lt;/p&gt;
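&lt;p&gt;The real daemon is written in C against BlueZ, but the shape of the loop
can be sketched in a few lines of Python. The names here
(&lt;code&gt;run_daemon&lt;/code&gt;, &lt;code&gt;configure_advertising&lt;/code&gt;,
&lt;code&gt;scan&lt;/code&gt;) and the default timings are assumptions for
illustration, with the Bluetooth operations passed in as callables rather
than implemented. Unlike the list above, this variant only reconfigures
advertising when it is time to rotate the address.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time


def run_daemon(configure_advertising, scan, cycles,
               scan_interval_s=300.0, rotate_every=3, sleep=time.sleep):
    """Alternate short scans with long sleeps, rotating the advertisement periodically."""
    for cycle in range(cycles):
        if cycle % rotate_every == 0:
            # New random Bluetooth address and identifier. After this call
            # the hardware keeps advertising without CPU involvement.
            configure_advertising()
        scan()                   # brief, CPU-driven scan for nearby advertisements
        sleep(scan_interval_s)   # the process is asleep for most of each cycle&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the assumed defaults, a scan runs every five minutes and the
advertisement is reconfigured every fifteen.&lt;/p&gt;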
&lt;p&gt;All of the above amounts to a few hundred lines of C code, treating
the Exposure Notifications packets themselves as opaque random data.&lt;/p&gt;
&lt;h2 id="cryptography"&gt;Cryptography&lt;/h2&gt;
&lt;p&gt;Yet the data is far from random; it is the result of a series of
operations in terms of secret keys defined by the Exposure Notifications
cryptography specification. Every day, a “temporary exposure key” is
generated, from which a “rolling proximity identifier key” and an
“associated encrypted metadata key” are derived. These are used to
generate a “rolling proximity identifier” and the “associated encrypted
metadata”, which are advertised over Bluetooth and changed in lockstep
with the Bluetooth random addresses.&lt;/p&gt;
&lt;p&gt;There are lots of moving parts to get right, but each derivation
uses a standard cryptographic primitive: HKDF-SHA256 for key derivation,
AES-128 for the rolling proximity identifier, and AES-128-CTR for the
associated encrypted metadata. Ideally, we would grab a state-of-the-art
library of cryptography primitives like &lt;a
href="https://nacl.cr.yp.to/"&gt;&lt;code&gt;NaCl&lt;/code&gt;&lt;/a&gt; or &lt;a
href="https://doc.libsodium.org/"&gt;&lt;code&gt;libsodium&lt;/code&gt;&lt;/a&gt; and wire
everything up.&lt;/p&gt;
&lt;p&gt;First, some good news: once these routines are written, we can
reliably unit test them. Though the specification states that “test
vectors… are available upon request”, it isn’t clear &lt;em&gt;who&lt;/em&gt; to
request from. But Google’s reference implementation is itself
unit-tested, and sure enough, it contains a &lt;a
href="https://github.com/google/exposure-notifications-internals/blob/a0394e69c51aa118f5000b8a2c2f15f1f9aedb7d/app/src/androidTest/java/com/google/samples/exposurenotification/testing/TestVectors.java"&gt;&lt;code&gt;TestVectors.java&lt;/code&gt;&lt;/a&gt;
file, from which we can grab the vectors for a complete set of unit
tests.&lt;/p&gt;
&lt;p&gt;After patting ourselves on the back for writing unit tests, we’ll
need to pick a library to implement the cryptography. Suppose we try
&lt;code&gt;NaCl&lt;/code&gt; first. We’ll quickly realize the primitives we need
are missing, so we move on to &lt;code&gt;libsodium&lt;/code&gt;, which is
backwards-compatible with NaCl. For a moment, this will work –
&lt;code&gt;libsodium&lt;/code&gt; has upstream support for HKDF-SHA256.
Unfortunately, the version of &lt;code&gt;libsodium&lt;/code&gt; shipping in Debian
testing is too old for HKDF-SHA256. Not a big problem – we can backport
the implementation, written in terms of the underlying HMAC-SHA256
operations, and move on to the AES.&lt;/p&gt;
&lt;p&gt;AES is a standard symmetric cipher, so &lt;code&gt;libsodium&lt;/code&gt; has
excellent support… for some modes. However standard, AES is not
&lt;em&gt;one&lt;/em&gt; cipher; it is a family of ciphers with different key
lengths and operating modes, with dramatically different security
properties. “AES-128-CTR” in the Exposure Notifications specification is
clearly 128-bit AES in CTR
(&lt;strong&gt;C&lt;/strong&gt;oun&lt;strong&gt;t&lt;/strong&gt;e&lt;strong&gt;r&lt;/strong&gt;) mode, but
what about “AES-128” alone, stated to operate on a “single AES-128
block”?&lt;/p&gt;
&lt;p&gt;The mode implicitly specified is known as ECB
(&lt;strong&gt;E&lt;/strong&gt;lectronic &lt;strong&gt;C&lt;/strong&gt;ode&lt;strong&gt;b&lt;/strong&gt;ook)
mode and is known to have fatal security flaws in most applications.
Because AES-ECB is generally insecure, &lt;code&gt;libsodium&lt;/code&gt; does not
have any support for this cipher mode. Great, now we have &lt;em&gt;two&lt;/em&gt;
problems – we have to rewrite our cryptography code against a new
library, and we have to consider if there is a vulnerability in Exposure
Notifications.&lt;/p&gt;
&lt;p&gt;ECB’s crucial flaw is that for a given key, identical plaintext will
always yield identical ciphertext, regardless of position in the stream.
Since AES is block-based, this means identical blocks yield identical
ciphertext, leading to &lt;a
href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#ECB"&gt;trivial
cryptanalysis&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In Exposure Notifications, ECB mode is used only to derive rolling
proximity identifiers from the rolling proximity identifier key and the
timestamp, by the equation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RPI_ij = AES_128_ECB(RPIK_i, PaddedData_j)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;…where &lt;code&gt;PaddedData&lt;/code&gt; is a function of the quantized
timestamp. Thus the issue is avoided, as every plaintext will be unique
(since timestamps are monotonically increasing, unless you’re trying to
contact trace &lt;em&gt;Back to the Future&lt;/em&gt;).&lt;/p&gt;
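&lt;p&gt;To make the inputs concrete, here is a minimal Python sketch of the
non-AES half of the derivation, using only the standard library. It is
not the code in &lt;code&gt;liben&lt;/code&gt;; the function names are mine, and the
info strings, the ten-minute interval, and the layout of
&lt;code&gt;PaddedData&lt;/code&gt; follow my reading of version 1.2 of the
cryptography specification, so check them against the specification
before relying on them. The final AES-128 step is left out because the
Python standard library has no AES.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib
import hmac


def hkdf_sha256(key, info, length, salt=None):
    """HKDF (RFC 5869) with SHA-256, written in terms of HMAC-SHA256."""
    if salt is None:
        salt = bytes(32)                                  # default salt: 32 zero bytes
    prk = hmac.new(salt, key, hashlib.sha256).digest()    # extract step
    okm, block = b'', b''
    for counter in range(1, -(-length // 32) + 1):        # expand step
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]


def derive_keys(tek):
    """Derive (RPIK, AEMK) from a 16-byte temporary exposure key."""
    rpik = hkdf_sha256(tek, b'EN-RPIK', 16)
    aemk = hkdf_sha256(tek, b'EN-AEMK', 16)
    return rpik, aemk


def padded_data(unix_time):
    """Build the 16-byte block that is encrypted to produce the RPI."""
    interval = unix_time // 600    # quantize the timestamp to 10-minute windows
    return b'EN-RPI' + bytes(6) + interval.to_bytes(4, 'little')&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the interval number is baked into the block, two different time
windows can never produce the same plaintext under one key, which is the
property the argument above relies on.&lt;/p&gt;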
&lt;p&gt;Nevertheless, &lt;code&gt;libsodium&lt;/code&gt; doesn’t know that, so we’ll need
to resort to a ubiquitous cryptography library that doesn’t, uh, take
security quite so seriously…&lt;/p&gt;
&lt;p&gt;I’ll leave &lt;a href="https://en.wikipedia.org/wiki/Heartbleed"&gt;the
implications&lt;/a&gt; up to your imagination.&lt;/p&gt;
&lt;h2 id="database"&gt;Database&lt;/h2&gt;
&lt;p&gt;While the Bluetooth and cryptography sections are governed by
upstream specifications, making sense of the data requires tracking a
significant amount of state. At &lt;em&gt;minimum&lt;/em&gt;, we must:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Record received packets (the Rolling Proximity Identifier and the
Associated Encrypted Metadata).&lt;/li&gt;
&lt;li&gt;Query received packets for diagnosed identifiers.&lt;/li&gt;
&lt;li&gt;Record our Temporary Exposure Keys.&lt;/li&gt;
&lt;li&gt;Query our keys to upload if we are diagnosed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If we were so inclined, we could handwrite all the serialization and
concurrency logic and hope we don’t have a bug that results in COVID-19
mayhem.&lt;/p&gt;
&lt;p&gt;A better idea is to grab &lt;a
href="https://sqlite.org/index.html"&gt;SQLite&lt;/a&gt;, perhaps &lt;a
href="https://sqlite.org/mostdeployed.html"&gt;the most deployed software
in the world&lt;/a&gt;, and express these actions as SQL queries. The database
persists to disk, and we can even express natural unit tests with a
synthetic in-memory database.&lt;/p&gt;
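&lt;p&gt;As a rough sketch of what that looks like, the four operations above map
onto a couple of tables and a handful of queries. The schema, table
names, and function names below are hypothetical and are not the ones
&lt;code&gt;liben&lt;/code&gt; uses; the example is in Python for brevity and defaults
to an in-memory database, as one might in a unit test.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3


def open_db(path=':memory:'):
    """Open (or create) the database; the default lives only in memory."""
    db = sqlite3.connect(path)
    db.executescript('''
        CREATE TABLE IF NOT EXISTS received (rpi BLOB, aem BLOB, seen_at INTEGER);
        CREATE TABLE IF NOT EXISTS own_keys (tek BLOB, start_interval INTEGER);
    ''')
    return db


def record_packet(db, rpi, aem, seen_at):
    """Store one received advertisement."""
    db.execute('INSERT INTO received VALUES (?, ?, ?)', (rpi, aem, seen_at))
    db.commit()


def find_exposures(db, diagnosed_rpis):
    """Return (rpi, seen_at) rows matching any diagnosed identifier."""
    hits = []
    for rpi in diagnosed_rpis:
        hits += db.execute(
            'SELECT rpi, seen_at FROM received WHERE rpi = ?', (rpi,)).fetchall()
    return hits


def record_key(db, tek, start_interval):
    """Store one of our own temporary exposure keys."""
    db.execute('INSERT INTO own_keys VALUES (?, ?)', (tek, start_interval))
    db.commit()


def keys_to_upload(db):
    """Return our own keys, oldest first, for upload after a diagnosis."""
    return db.execute(
        'SELECT tek, start_interval FROM own_keys ORDER BY start_interval').fetchall()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Passing a file path instead of the default gives the persistent variant,
with SQLite handling serialization and locking for us.&lt;/p&gt;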
&lt;p&gt;With this infrastructure, we’re now done with the primary daemon,
recording Exposure Notification identifiers to the database and
broadcasting our own identifiers. That’s not interesting if we never
&lt;em&gt;do&lt;/em&gt; anything with that data, though. Onwards!&lt;/p&gt;
&lt;h2 id="key-retrieval"&gt;Key retrieval&lt;/h2&gt;
&lt;p&gt;Once per day, Exposure Notifications implementations are expected to
query the server for Temporary Exposure Keys associated with diagnosed
COVID-19 cases. From these keys, the cryptography implementation can
reconstruct the associated Rolling Proximity Identifiers, for which we
can query the database to detect if we have been exposed.&lt;/p&gt;
&lt;p&gt;Per Google’s &lt;a
href="https://developers.google.com/android/exposure-notifications/exposure-key-file-format"&gt;documentation&lt;/a&gt;,
the servers are expected to return a &lt;code&gt;zip&lt;/code&gt; file containing
two files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;export.bin&lt;/code&gt;: a container serialized as &lt;a
href="https://en.wikipedia.org/wiki/Protocol_Buffers"&gt;Protocol
Buffers&lt;/a&gt; containing Diagnosis Keys&lt;/li&gt;
&lt;li&gt;&lt;code&gt;export.sig&lt;/code&gt;: a signature for the export with the public
health agency’s key&lt;/li&gt;
&lt;/ul&gt;
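&lt;p&gt;Unpacking the archive needs nothing beyond the Python standard library.
The sketch below is an illustration rather than the script described
here: the function name &lt;code&gt;read_export&lt;/code&gt; is made up, and the
16-byte space-padded &lt;code&gt;EK Export v1&lt;/code&gt; header is my understanding
of Google’s file format documentation, so treat it as an assumption.
Decoding the protocol buffer payload is left out.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import io
import zipfile

# Assumed header: the ASCII text padded with spaces to 16 bytes.
HEADER = b'EK Export v1'.ljust(16)


def read_export(zip_bytes):
    """Split a downloaded archive into (protobuf payload, signature)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        export = archive.read('export.bin')
        signature = archive.read('export.sig')
    if export[:16] != HEADER:
        raise ValueError('unexpected export.bin header')
    return export[16:], signature&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The returned payload is what would be handed to the protocol buffers
library, and the signature can be kept around or ignored.&lt;/p&gt;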
&lt;p&gt;The signature is not terribly interesting to us. On Android, it
appears the system pins the public keys of recognized public health
agencies as an integrity check for the received file. However, this
public key is given directly to Google; we don’t appear to have an easy
way to access it.&lt;/p&gt;
&lt;p&gt;Does it matter? For our purposes, it’s unlikely. The Canadian key
retrieval server is already transport-encrypted via HTTPS, so tampering
with the data would already require compromising a certificate authority
in addition to intercepting the requests to &lt;a href="https://canada.ca"
class="uri"&gt;https://canada.ca&lt;/a&gt;. Broadly speaking, that limits
attackers to nation-states, and since Canada has no reason to attack its
own infrastructure, that limits our threat model to foreign
nation-states. International intelligence agencies probably have better
uses of resources than getting people to take extra COVID tests.&lt;/p&gt;
&lt;p&gt;It’s worth noting other countries’ implementations could serve this
zip file over plaintext HTTP, in which case this signature check becomes
important.&lt;/p&gt;
&lt;p&gt;Focusing then on &lt;code&gt;export.bin&lt;/code&gt;, we may import the relevant
protocol buffer definitions to extract the keys for matching against our
database. Since this requires only read-only access to the database and
executes infrequently, we can safely perform this work from a separate
process written in a higher-level language like Python, interfacing with
the cryptography routines over the Python &lt;a
href="https://docs.python.org/3/library/ctypes.html"&gt;foreign function
interface &lt;code&gt;ctypes&lt;/code&gt;&lt;/a&gt;. Extraction is easy with the Python
protocol buffers implementation, and downloading should be as easy as a
&lt;code&gt;GET&lt;/code&gt; request with the standard library’s
&lt;code&gt;urllib&lt;/code&gt;, right?&lt;/p&gt;
&lt;p&gt;Here we hit a gotcha: the retrieval endpoint is guarded behind an &lt;a
href="https://en.wikipedia.org/wiki/HMAC"&gt;HMAC&lt;/a&gt;, requiring
authentication to download the &lt;code&gt;zip&lt;/code&gt;. The protocol
documentation states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Of course there’s no reliable way to truly authenticate these
requests in an environment where millions of devices have immediate
access to them upon downloading an Application: this scheme is purely to
make it much more difficult to casually scrape these keys.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Ah, security by obscurity. Calculating the HMAC itself is simple
given the documentation, but it requires a “secret” HMAC key specific to
the server. As the documentation is aware, this key is hardly secret,
but it’s not available on the Canadian Digital Service’s &lt;a
href="https://github.com/cds-snc/covid-alert-app"&gt;official
repositories&lt;/a&gt;. Interoperating with the upstream servers would require
some “extra” tricks.&lt;/p&gt;
&lt;p&gt;Out of purely academic interest, we can write and debug our
implementation without any such authorization by running our own sandbox
server. Minus the configuration, the server source is available, so
after spinning up a virtual machine and fighting with Go versioning, we
can test our Python script.&lt;/p&gt;
&lt;p&gt;Speaking of a personal sandbox…&lt;/p&gt;
&lt;h2 id="key-upload"&gt;Key upload&lt;/h2&gt;
&lt;p&gt;There is one essential edge case to the contact tracing
implementation, one that we &lt;em&gt;can’t&lt;/em&gt; test against the Canadian
servers. And edge cases matter. In effect, the entire Exposure
Notifications infrastructure is designed for the edge cases. If you
don’t care about edge cases, you don’t care about digital contact
tracing (so please, stay at home).&lt;/p&gt;
&lt;p&gt;The key feature – and key edge case – is uploading Temporary Exposure
Keys to the Canadian key server in case of a COVID-19 diagnosis. This
upload requires an alphanumeric code generated by a healthcare provider
upon diagnosis, so if we used the shared servers, we couldn’t test an
implementation. With our sandbox, we can generate as many alphanumeric
codes as we’d like.&lt;/p&gt;
&lt;p&gt;Once sandboxed, there isn’t much to the implementation itself: the
keys are snarfed out of the SQLite database, we handshake with the
server over protocol buffers marshaled over POST requests, and we throw
in some public-key cryptography via the Python bindings to
&lt;code&gt;libsodium&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This functionality neatly fits into a second dedicated Python script
which does &lt;em&gt;not&lt;/em&gt; interface with the main library. It’s exposed as
a command line interface with flow resembling that of the mobile
application, adhering reasonably to the UNIX philosophy. Admittedly I’m
not sure wrestling with the command line is at the top of the priority list of
a Linux hacker ill with COVID-19. Regardless, the interface is suitable
for higher-level (graphical) abstractions.&lt;/p&gt;
&lt;p&gt;Problem solved, but of course there’s a gotcha: if the request is
malformed, an error should be generated as a key robustness feature.
Unfortunately, while developing the script against my sandbox, a bug led
the request to be dropped unexpectedly, rather than returning with an
error message. On the server implemented in &lt;a
href="https://en.wikipedia.org/wiki/Go_(programming_language)"&gt;Go&lt;/a&gt;,
there was an apparent &lt;code&gt;nil&lt;/code&gt; dereference. Oops. Fixing this
isn’t necessary for this project, but it’s still a bug, even if it
requires a COVID-19 diagnosis to trigger. So I went and did the Canadian
thing and sent a pull request.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;All in all, we end up with a Linux implementation of Exposure
Notifications functional in Ontario, Canada. What’s next? Perhaps
supporting contact tracing systems elsewhere in the world – patches
welcome. Closer to home, while functional, the aesthetics are not (yet)
anything to write home about – perhaps we could write a touch-based
interface for mobile Linux environments like &lt;a
href="https://en.wikipedia.org/wiki/KDE_Plasma_5#Plasma_Mobile"&gt;Plasma
Mobile&lt;/a&gt; and &lt;a
href="https://developer.puri.sm/Librem5/Software_Reference/Environments/Phosh.html"&gt;Phosh&lt;/a&gt;,
maybe even running it on an Android flagship flashed with &lt;a
href="https://en.wikipedia.org/wiki/PostmarketOS"&gt;postmarketOS&lt;/a&gt; to go
full circle.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gitlab.freedesktop.org/alyssa/liben"&gt;Source code for
&lt;code&gt;liben&lt;/code&gt; is available&lt;/a&gt; for anyone who dares go near.
Compiling from source is straightforward but necessary at the time of
writing. As for packaging?&lt;/p&gt;
&lt;p&gt;Here’s hoping COVID-19 contact tracing will be obsolete by the time
&lt;code&gt;liben&lt;/code&gt; hits Debian stable.&lt;/p&gt;
&lt;section id="footnotes" class="footnotes footnotes-end-of-document"
role="doc-endnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn1"&gt;&lt;p&gt;Today (Monday) is Labour Day, so this is a 3-day
weekend. But I started on Saturday and posted this today, so it
&lt;em&gt;technically&lt;/em&gt; counts.&lt;a href="#fnref1" class="footnote-back"
role="doc-backlink"&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/fun-and-games-with-exposure-notifications.html</guid><pubDate>Mon, 07 Sep 2020 00:00:00 -0500</pubDate></item><item><title>Hilariously Fast Volume Computation with the Divergence Theorem</title><link>https://alyssarosenzweig.ca/blog/hilariously-fast-volume-computation-with-the-divergence-theorem.html</link><description>&lt;p&gt;(No, there won’t be jokes.)&lt;/p&gt;
&lt;p&gt;The following presents a fast algorithm for computing the volume of a
simple, closed, triangulated 3D mesh. These assumptions are required by
the divergence theorem, on which the derivation relies. Further
extensions may generalise to other meshes as well, although that is
presently out of scope.&lt;/p&gt;
&lt;p&gt;We begin with the definition of volume as the triple integral over a
region of the constant one:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \iiint_R 1 \mathrm{d}V\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let &lt;span class="math inline"&gt;\(\mathbf{F}\)&lt;/span&gt; be a vector field on
&lt;span class="math inline"&gt;\(\mathbb{R}^3\)&lt;/span&gt; whose
divergence is equal to one. For the purposes of this paper, we
choose:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\mathbf{F}(x, y, z) = \langle x, 0,
0 \rangle\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It can easily be verified that&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\mathrm{div} \mathbf{F} = \frac{\partial
F_x}{\partial x} + \frac{\partial F_y}{\partial y} + \frac{\partial
F_z}{\partial z} = 1 + 0 + 0 = 1\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Therefore,&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \iiint_R 1 dV = \iiint_R
\mathrm{div} \mathbf{F}(x, y, z) \mathrm{d}V\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;By the Divergence Theorem, this is equal to the surface integral:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \iint_S \mathbf{F}(x, y, z) \cdot
\mathrm{d}\mathbf{S}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This surface integral, defined over the surface S of the 3D mesh, is
equal to the sum of its piecewise triangle parts. Let &lt;span
class="math inline"&gt;\(T_i\)&lt;/span&gt; denote the surface of the &lt;span
class="math inline"&gt;\(i\)&lt;/span&gt;’th triangle in the mesh. Then,&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \sum_{i = 0} \iint_{T_i}
\mathbf{F}(x, y, z) \cdot \mathrm{d}\mathbf{S}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let &lt;span class="math inline"&gt;\(T_{in}\)&lt;/span&gt; represent the &lt;span
class="math inline"&gt;\(n\)&lt;/span&gt;’th vertex of the &lt;span
class="math inline"&gt;\(i\)&lt;/span&gt;’th triangle. Let &lt;span
class="math inline"&gt;\(\Delta_{i1}\)&lt;/span&gt; equal the vector difference
&lt;span class="math inline"&gt;\(T_{i1} - T_{i0}\)&lt;/span&gt;, and &lt;span
class="math inline"&gt;\(\Delta_{i2}\)&lt;/span&gt; likewise equal &lt;span
class="math inline"&gt;\(T_{i2} - T_{i0}\)&lt;/span&gt;. Each individual triangle
&lt;span class="math inline"&gt;\(T_i\)&lt;/span&gt; may thus be parametrised
as:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\mathbf{r}(u, v) = T_{i0} + u\Delta_{i1} +
v\Delta_{i2}, \quad u \geq 0, \; v \geq 0, \; u + v \leq 1\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Then, simple differentiation yields:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\mathbf{r}_u = \Delta_{i1}\]&lt;/span&gt; &lt;span
class="math display"&gt;\[\mathbf{r}_v = \Delta_{i2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Therefore,&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\mathbf{r}_u \times \mathbf{r}_v =
\Delta_{i1} \times \Delta_{i2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the surface integral can be rewritten in terms of this
parametrisation, substituting in the definition of &lt;span
class="math inline"&gt;\(\mathbf{F}\)&lt;/span&gt; as needed:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \sum_{i = 0} \iint_{T_i}
\mathbf{F}(x, y, z) \cdot (\mathbf{r}_u \times \mathbf{r}_v) dA\]&lt;/span&gt; &lt;span
class="math display"&gt;\[= \sum_{i = 0} \iint_{T_i} \mathbf{F}(x, y, z)
\cdot (\Delta_{i1} \times \Delta_{i2}) dA\]&lt;/span&gt; &lt;span
class="math display"&gt;\[= \sum_{i = 0} \iint_{T_i} \langle x, 0, 0 \rangle \cdot
(\Delta_{i1} \times \Delta_{i2}) dA\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This cross product is constant throughout the triangle and easy to
calculate from the vertex data. Only the x component of the cross
product needs to be calculated; the other components vanish in the
dot product with the zero components of &lt;span
class="math inline"&gt;\(\mathbf{F}\)&lt;/span&gt;. &lt;span
class="math inline"&gt;\(V\)&lt;/span&gt; can thus be rewritten as:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \sum_{i = 0} (\Delta_{i1} \times
\Delta_{i2})_x \iint_{T_i} x dA\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We now focus on the surface integral &lt;span
class="math inline"&gt;\(\iint_{T_i} x dA\)&lt;/span&gt;. Expanding with the
parametrisation yields:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\iint_{T_i} x dA = \int_{0}^{1}
\int_{0}^{1-u} x dv du = \int_{0}^{1} \int_{0}^{1-u} (T_{i0x} + u
\Delta_{i1x} + v \Delta_{i2x}) dv du\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This integral can be directly evaluated, treating vertex data as
constants:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[\int_{0}^{1} \int_{0}^{1-u} (T_{i0x} + u
\Delta_{i1x} + v \Delta_{i2x}) dv du\]&lt;/span&gt; &lt;span
class="math display"&gt;\[= T_{i0x} \int_{0}^{1} \int_{0}^{1-u} dv du +
\Delta_{i1x} \int_{0}^{1} \int_{0}^{1-u} u dv du + \Delta_{i2x}
\int_{0}^{1} \int_{0}^{1-u} v dv du\]&lt;/span&gt; &lt;span
class="math display"&gt;\[= T_{i0x} (\frac{1}{2}) + \Delta_{i1x}
(\frac{1}{6}) + \Delta_{i2x} (\frac{1}{6})\]&lt;/span&gt; &lt;span
class="math display"&gt;\[= T_{i0x} (\frac{1}{2}) + (T_{i1x} -
T_{i0x})(\frac{1}{6}) + (T_{i2x} - T_{i0x})(\frac{1}{6})\]&lt;/span&gt; &lt;span
class="math display"&gt;\[= T_{i0x} (\frac{1}{6}) + (T_{i1x})(\frac{1}{6})
+ (T_{i2x})(\frac{1}{6})\]&lt;/span&gt; &lt;span class="math display"&gt;\[=
\frac{1}{6}(T_{i0x} + T_{i1x} + T_{i2x})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Substituting into the original sum and pulling out a constant factor
of &lt;span class="math inline"&gt;\(\frac{1}{6}\)&lt;/span&gt; to avoid the inner
loop division, this yields the following compact formula for the
volume:&lt;/p&gt;
&lt;p&gt;&lt;span class="math display"&gt;\[V = \frac{1}{6} \sum_{i = 0}
(\Delta_{i1} \times \Delta_{i2})_x (T_{i0x} + T_{i1x} +
T_{i2x})\]&lt;/span&gt;&lt;/p&gt;
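&lt;p&gt;As a minimal sketch, the formula above might be implemented as follows. The function name &lt;code&gt;mesh_volume&lt;/code&gt;, the vertex-tuple representation and the test tetrahedron are illustrative assumptions rather than part of the derivation. The sketch also assumes every triangle is wound counter-clockwise when viewed from outside the mesh, so that the cross product points outward as the divergence theorem requires.&lt;/p&gt;

```python
def mesh_volume(triangles):
    """Volume of a closed triangle mesh.

    triangles: iterable of (t0, t1, t2), each vertex an (x, y, z) tuple.
    Assumes outward-facing (counter-clockwise from outside) winding.
    """
    total = 0.0
    for t0, t1, t2 in triangles:
        # Only the y and z components of the edge vectors contribute
        # to the x component of the cross product.
        d1y, d1z = t1[1] - t0[1], t1[2] - t0[2]
        d2y, d2z = t2[1] - t0[1], t2[2] - t0[2]
        cross_x = d1y * d2z - d1z * d2y
        total += cross_x * (t0[0] + t1[0] + t2[0])
    # The constant factor of 1/6 is applied once, outside the loop.
    return total / 6.0


# Hypothetical test mesh: a right-angled tetrahedron with unit legs.
O, A, B, C = (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)
tetrahedron = [(O, B, A), (O, A, C), (O, C, B), (A, B, C)]

print(mesh_volume(tetrahedron))
```

&lt;p&gt;The tetrahedron has volume one sixth, which gives a quick sanity check. Reversing the winding of every triangle flips the sign of the result.&lt;/p&gt;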
&lt;h2 id="performance-analysis"&gt;Performance analysis&lt;/h2&gt;
&lt;p&gt;The final algorithm involves neither numerical integration nor
differentiation. In contrast to common naive algorithms for volume,
which are equivalent to rendering the mesh and then sampling the render,
an expensive operation, there is only a single loop in this algorithm,
over the triangles. Thus, this algorithm for volume computation is O(n)
in the number of triangles. Furthermore, the per-triangle
calculation is similarly efficient: given the natural expansion of the
cross product, the inner part contains seven additions and three
multiplications. On the outside of the loop is only a single
multiplication. Thus, for a mesh of &lt;span
class="math inline"&gt;\(n\)&lt;/span&gt; triangles, the algorithm requires &lt;span
class="math inline"&gt;\(8n - 1\)&lt;/span&gt; additions and &lt;span
class="math inline"&gt;\(3n + 1\)&lt;/span&gt; multiplications, for a total of &lt;span
class="math inline"&gt;\(11n\)&lt;/span&gt; floating-point operations. This is
&lt;em&gt;very&lt;/em&gt; fast.&lt;/p&gt;
&lt;p&gt;For a ballpark number, if volume needs to be calculated every frame
in a high-performance 60 frames per second application, without the aid
of a GPU, only using the CPU capabilities of a &lt;a
href="https://raspberrypi.stackexchange.com/questions/55862/what-is-the-performance-and-the-performance-per-watt-of-raspberry-pi-3-in-gflops"&gt;$35
Raspberry Pi&lt;/a&gt;, around 30 million triangles could be measured every
frame.&lt;/p&gt;
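&lt;p&gt;The arithmetic behind that ballpark can be spelled out. The sketch below assumes the 11 floating-point operations per triangle counted above and ignores memory bandwidth and loop overhead, so it only shows the sustained throughput the figure implies. The variable names are illustrative.&lt;/p&gt;

```python
# Throughput implied by the ballpark figure, under the stated assumptions.
triangles_per_frame = 30_000_000
frames_per_second = 60
flops_per_triangle = 11  # 8 additions + 3 multiplications, from the count above

required_gflops = (
    triangles_per_frame * frames_per_second * flops_per_triangle / 1e9
)
print(required_gflops)  # 19.8
```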
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;The vector calculus exam is soon, and I need to study. Plus, who
doesn’t love 3D graphics?!&lt;/p&gt;
&lt;p&gt;&lt;del&gt;I would be (pleasantly) surprised if the algorithm is
novel.&lt;/del&gt; Further research &lt;em&gt;after&lt;/em&gt; posting reveals the paper
&lt;a
href="http://chenlab.ece.cornell.edu/Publication/Cha/icip01_Cha.pdf"&gt;Efficient
Feature Extraction for 2D/3D Objects in Mesh Representation&lt;/a&gt; by Cha
Zheng and Tsuhan Chen, which appears to describe the same algorithm,
although the derivation is different. It was fun while it lasted!&lt;/p&gt;
</description><guid isPermaLink="true">https://alyssarosenzweig.ca/blog/hilariously-fast-volume-computation-with-the-divergence-theorem.html</guid><pubDate>Fri, 16 Feb 2018 00:00:00 -0500</pubDate></item></channel></rss>