Ubuntu Linux with Ollama ROCm on AMD Ryzen 780M iGPU


If you’ve got an AMD Ryzen 7040 or 8040 series chip (laptops, mini PCs, the Phoenix family of APUs), you’ve got a Radeon 780M iGPU sitting there. gfx1103 in ROCm terminology. It’s a perfectly capable RDNA3 GPU that, on my 28 GiB machine, can address about 16 GiB of memory once you count UMA + GTT. It would be a great target for local LLM inference via Ollama.

Except nothing in the standard Linux stack ships kernels for it. Not Ubuntu’s librocblas5, not AMD’s official ROCm 7.0 .deb, not Ollama’s bundled ROCm libraries. gfx1103 is treated as “consumer mobile” and quietly skipped by every official precompiled-kernel package. You can confirm this for yourself by running ls /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ | grep gfx on a stock Ubuntu 26.04 system — you’ll find kernels for gfx1030, gfx1100, gfx1101, gfx1151, gfx1200, gfx1201 and several data-center archs, but no gfx1103.
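A small helper makes that check repeatable. This is a minimal sketch: the function names are mine, and it assumes the target name (for example gfx1103) appears somewhere in each kernel file name, which is how the files I have seen are named.

```shell
# List the distinct gfx targets that have files in a rocBLAS library directory.
# Assumption: the target name (e.g. gfx1103) is part of each file name.
list_rocblas_archs() {
    dir=$1
    ls "$dir" 2>/dev/null | grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Succeeds only if the given target has at least one file in the directory.
has_rocblas_arch() {
    list_rocblas_archs "$1" | grep -qx "$2"
}

# Example, using the path from this post:
#   has_rocblas_arch /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library gfx1103 \
#       && echo "gfx1103 kernels present" || echo "gfx1103 kernels missing"
```

The same helper is handy again later, after copying kernel files in, to confirm that gfx1103 now shows up in the list.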

This post documents the working setup I landed on for getting native ROCm acceleration on the Phoenix iGPU with Ollama. By “native” I mean Ollama reports compute=gfx1103 (not an HSA_OVERRIDE’d alias) and inference uses real GPU kernels, not CPU fallback. With the full setup below, throughput on gemma4:e4b sustains 6–14 tokens/sec depending on prompt length, and most importantly 52 of 52 consecutive Open WebUI web-search chats completed successfully in the most recent measurement window.

Revision note (late May 2026): this post has been rewritten three times now. The original recipe worked but had a ~42% mid-inference crash rate due to a Fedora rocBLAS kernel version mismatch. The first revision corrected that to Fedora 7.1.1 kernels, dropping to ~14%. The second revision identified that the remaining failures were largely a Linux kernel bug in amdgpu’s MES (Micro Engine Scheduler) hardware-queue teardown, mitigated by Ubuntu’s HWE kernel 7.0.0-15. The third revision — this one — identifies the last piece: if you’re consuming Ollama from Open WebUI with web search enabled, OWUI’s bypass_embedding_and_retrieval setting must be False (not True as previously documented). Ollama Cloud Search returns full page content (Wikipedia articles can be ~41k tokens per result), and shipping all of it to the model unchunked produced 150k+ token prompts that Ollama truncated to 16k and which exercised the residual MES issue at scale. With chunking + retrieval enabled, prompts drop to ~3k tokens and the bug stops triggering in practice.

If you followed an earlier version of this recipe and are still seeing crashes, the most likely fixes (in order of how recently this post was wrong about each):

  1. If you’re using Open WebUI: flip bypass_embedding_and_retrieval from True to False in Admin Panel → Settings → Web Search. (Or via SQLite directly — see Step 5 below.)
  2. Run sudo apt install linux-image-generic-hwe-26.04 && sudo reboot to pick up the HWE kernel 7.0.0-15 or newer.
  3. Re-run setup.sh from the companion repo — it now downloads Fedora 44’s rocBLAS 7.1.1 kernels instead of Fedora 43’s 6.4 set.
  4. Confirm OLLAMA_FLASH_ATTENTION=0, OLLAMA_CONTEXT_LENGTH=16384, and OLLAMA_NUM_BATCH=256 are all set in your Ollama systemd drop-in.

TL;DR

If you just want the recipe and trust me on the why:

  1. Be on Ubuntu kernel 7.0.0-15 or newer (HWE kernel for 26.04). Older kernels have an MES bug that causes intermittent GPU resets mid-inference.
  2. Build ollama-for-amd from source with -DAMDGPU_TARGETS=gfx1103.
  3. Apply three patches to ml/device.go (sort fix, validation skip, parent-env respect).
  4. Extract Fedora 44’s rocblas-7.1.1-7.fc44.x86_64.rpm (just the gfx1103 Tensile kernel files) into your system’s rocBLAS library directory. The rocBLAS version of the Fedora package must match your system rocBLAS version; an earlier version of this post used Fedora 43’s rocBLAS 6.4 kernels, and that ABI mismatch was the source of much pain.
  5. Configure a systemd drop-in with OLLAMA_FLASH_ATTENTION=0, OLLAMA_CONTEXT_LENGTH=16384, OLLAMA_NUM_BATCH=256, OLLAMA_KEEP_ALIVE=24h, and ROCR_VISIBLE_DEVICES=N if you have multiple GPUs and only want Ollama on one. The context-length and batch-size knobs keep compute graphs small enough that the residual MES instability rarely triggers.
  6. If you’re using Open WebUI as the consumer, in Admin Panel → Settings → Web Search set bypass_web_loader to True AND bypass_embedding_and_retrieval to False. The asymmetry is counterintuitive but necessary — the False value makes OWUI chunk Ollama Cloud’s full-page-content results down to retrievable pieces instead of shoving 100k+ tokens at the model.

With all six conditions met I measured 52 of 52 consecutive Open WebUI web-search chat completions in the most recent session (real-world workload, mix of fresh-thread and continuation chats across multiple topics, with reasoning enabled). Zero MES events, zero GPU resets. Throughput is 6–14 tok/s on gemma4:e4b depending on context length. Bumping OLLAMA_CONTEXT_LENGTH above 16k or restoring OLLAMA_NUM_BATCH to its default 512 will reintroduce frequent MES-related crashes; the current values are sized specifically to keep the residual susceptibility from triggering.

I put a one-command setup script and the patches in a companion GitHub repo: johnsonfarmsus/ollama-rocm-gfx1103-ubuntu. The script now refuses to proceed on a kernel older than 7.0.0-15, so following the README on a fresh box will save you from the surprise of a working ROCm setup that’s still constantly crashing. The rest of this post explains why each piece is necessary.

The hardware

My specific setup:

  • AMD Ryzen 7 H255 (Hawk Point family, Phoenix-class iGPU)
  • Radeon 780M iGPU = gfx1103 = the target
  • Radeon RX 5500 discrete GPU = gfx1012 (RDNA1, officially dropped from ROCm 6+) — reserved for gaming, NOT used for AI
  • 28 GiB system RAM, of which the iGPU can address ~16 GiB via UMA + GTT
  • Ubuntu 26.04 LTS “resolute,” kernel 7.0.0-15-generic (HWE), ROCm 7.1 system libs

If you only have one GPU and it’s a Phoenix iGPU, the recipe is even simpler — you can skip the multi-GPU pinning. If you have a different combination (say, Phoenix iGPU + a working RDNA3+ dGPU), most of this still applies but the device-selection patches will need adjusting.

The six gaps in the ecosystem

It took a long time to figure out that there wasn’t one thing broken — there were six, each blocking the next. Working through them in order:

Gap 1: Ollama’s bundled rocBLAS lacks gfx1103 Tensile kernels

Ollama (both the official builds and the AMD-tuned ollama-for-amd fork) bundles its own copy of the ROCm runtime libraries under /usr/local/lib/ollama/rocm/. If you look inside the bundled rocblas/library/ directory, you’ll see Kernels.so-000-gfxXXXX.hsaco files for gfx1030, gfx1100, gfx1101, gfx1102, gfx1150 — but no gfx1103. When Ollama tries to initialize rocBLAS on a gfx1103 device, the runtime calls rocblas_initialize(), can’t find kernels for the actual hardware, and SIGABRTs the runner process. Ollama interprets the crash as “unsupported device,” falls back to CPU, and you get inference at CPU speed.

The well-publicized workaround is HSA_OVERRIDE_GFX_VERSION=11.0.0, which makes ROCm report the iGPU as gfx1100 and load gfx1100 kernels for it. This works for discovery: Ollama detects the device and lists it as a ROCm GPU. But the gfx1100 kernels were compiled and tuned for Navi 31 (a 96-CU discrete part), and Phoenix is a 12-CU iGPU that is not guaranteed to be binary-compatible with them. In my testing, the first time inference executed a kernel that Phoenix could not run as compiled, the GPU faulted and the runner process SIGABRTed again. So HSA_OVERRIDE gets you most of the way and then bails at the worst time.

Gap 2: Ubuntu’s system rocBLAS ships the runtime but not the kernels

Ubuntu 26.04 ships ROCm 7.1 in the universe repository: rocm-dev, libamdhip64-7, librocblas5, libhipblas3, the whole stack. You can run rocminfo and it cleanly enumerates both my GPUs. But check the actual kernel data: /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ contains the runtime library but no precompiled kernels. Ubuntu’s packaging policy doesn’t ship the multi-gigabyte Tensile binary blobs that AMD generates for each architecture. So even pointing Ollama at the system libs doesn’t help; it’s the same gap as the bundled libs.

Gap 3: AMD’s official ROCm 7.0 .deb skips gfx1103 too

So download AMD’s own .deb directly, right? I did. rocblas7.0.0_5.0.0.70000-38~24.04_amd64.deb from repo.radeon.com/rocm/apt/7.0/. It’s 152 MB and contains gfx908, gfx90a, gfx942, gfx950 (data center), gfx1030 (RDNA2), gfx1100/1101/1102 (RDNA3 discrete), gfx1151 (RDNA3.5), gfx1200/1201 (RDNA4) — and intentionally not gfx1103. AMD’s stance is that Phoenix is consumer-mobile silicon, not a supported ROCm target. There’s an open issue (ROCm/rocBLAS#1536) asking for gfx1103 inclusion that’s been alive for over a year without resolution.

Gap 4: The community fork is Windows-only on the binary side

The likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU repo ships community-built rocBLAS libraries with gfx1103 kernels — exactly what’s missing. But the README explicitly says “ROCm is available for Linux on the releases page, recommend against using it directly,” and the prebuilt 7z files are formatted as Windows HIP SDK drop-ins (target directory %HIP_PATH%\bin\rocblas\library). The linked likelovewant/ollama-for-amd fork — which has the AMD-specific code changes — only ships Windows binaries in its releases. The Linux path on this fork is “build from source.”

Gap 5: Ollama’s scheduler picks the wrong GPU on multi-GPU systems

This one only matters on multi-GPU systems. Ollama’s ByFreeMemory.Less sort function unconditionally ranks integrated GPUs as less-than discrete GPUs regardless of memory:

func (a ByFreeMemory) Less(i, j int) bool {
    if a[i].Integrated && !a[j].Integrated {
        return true  // integrated always "less than" discrete
    } else if !a[i].Integrated && a[j].Integrated {
        return false
    }
    return a[i].FreeMemory < a[j].FreeMemory
}

After sort.Reverse(), this puts discrete GPUs first. On my box that means Ollama prefers the 4 GiB RX 5500 (RDNA1, broken under ROCm 6+) over the 16 GiB Phoenix iGPU. Even if the iGPU is the only one that actually works, the scheduler picks the dGPU and crashes. The assumption that discrete > integrated is true for most consumer PCs (one nice GPU + a tiny iGPU), but inverted for any APU-with-extra-iGPU-memory setup.
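To make the ranking difference concrete, here is a toy model of the two orderings in shell. It is not Ollama's code: the device table and function names are made up, and the only thing it shares with the Go snippet above is the comparison rule.

```shell
# Each line: name, integrated flag (1 = iGPU, 0 = dGPU), free memory in GiB.
# The two devices mirror the box described in this post.
DEVICES='RX5500 0 4
780M 1 16'

# Upstream-style ranking: discrete before integrated, then most free memory.
pick_with_bias() {
    printf '%s\n' "$1" | sort -k2,2n -k3,3nr | head -n 1 | cut -d' ' -f1
}

# Memory-only ranking: most free memory wins, regardless of device type.
pick_by_free_memory() {
    printf '%s\n' "$1" | sort -k3,3nr | head -n 1 | cut -d' ' -f1
}

# pick_with_bias "$DEVICES"        selects the 4 GiB dGPU
# pick_by_free_memory "$DEVICES"   selects the 16 GiB iGPU
```

With only two devices the bias decides everything: free memory is never even compared, which is why the smaller dGPU wins.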

Gap 6: The amdgpu kernel driver has a gfx11 MES bug on Linux ≤ 6.13

This is the gap that took the longest to find. Once Gaps 1–5 were closed, the setup looked like it worked: inference ran, output was correct, performance was good. But about 1 in 7 requests would crash mid-completion with ROCm error: unspecified launch failure from hipStreamSynchronize. I assumed it was a userspace bug for several weeks. I rebuilt rocBLAS from upstream source. I rebuilt ollama-for-amd’s HIP backend. Neither moved the needle.

The actual symptom was hiding in dmesg:

amdgpu 0000:c8:00.0: MES failed to respond to msg=REMOVE_QUEUE
amdgpu 0000:c8:00.0: failed to remove hardware queue from MES, doorbell=0x1002
amdgpu 0000:c8:00.0: MES might be in unrecoverable state, issue a GPU reset
amdgpu 0000:c8:00.0: GPU reset succeeded, trying to resume
amdgpu 0000:c8:00.0: GPU reset(N) succeeded!

MES is the AMD GPU’s Micro Engine Scheduler — firmware running on a microcontroller inside the GPU that schedules work onto the compute units. Under certain hardware-queue teardown patterns (specifically when the kernel sends a REMOVE_QUEUE message during a runner-process exit), the MES firmware on gfx11.x chips could hang. amdgpu’s only recovery was a full GPU reset, which killed any in-flight inference work. From userspace, that surfaces as the unspecified launch failure in the hipStreamSynchronize call.

The fix landed upstream in the Linux 6.11–6.13 window: better timeout handling, more graceful MES recovery, and reduced likelihood of the hang in the first place. Ubuntu’s HWE kernel 7.0.0-15-generic (in package linux-image-generic-hwe-26.04) contains the backports. The -14/-15 suffix is Ubuntu’s own package revision, not an upstream version number; what matters is that the -15 package has the fixes and the -14 package doesn’t.

Upgrading from -14 to -15 took the measured failure rate from ~14% to ~1% (1 failure in 84 stress tests at the conservative env-var config). The improvement is real and large, but the bug is not 100% eliminated: pushing context above 16k or batch size to its default of 512 reintroduces frequent MES events on this hardware even on the new kernel. The conservative env-var configuration (16k context, 256 batch) keeps the compute graphs small enough that the residual susceptibility rarely triggers. I measured 4 MES events in dmesg when I briefly tested 32k context with the default batch, so the bug is still there in the firmware, just much harder to trigger.

The actual fix

Six gaps, fixed in dependency order. Gap 6’s fix (kernel upgrade) goes first because everything else depends on it.

Step 0: Get on a recent enough kernel

Before anything else, confirm or install the HWE kernel:

uname -r
# Want: 7.0.0-15-generic or newer

# If older:
sudo apt install linux-image-generic-hwe-26.04
sudo reboot

Then re-verify uname -r after the reboot. If you skip this step, you’ll do all the work in steps 1–4 and still be hitting the ~14% crash rate that I spent weeks chasing. The companion repo’s setup.sh now refuses to proceed on kernels older than 7.0.0-15 for exactly this reason.
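If you want to script that check yourself, a version comparison along these lines should do. It is a sketch, not necessarily how setup.sh does it: kernel_at_least is a name I made up, and it relies on sort -V (GNU coreutils), which stock Ubuntu has.

```shell
# usage: kernel_at_least RUNNING MINIMUM
#   e.g. kernel_at_least "$(uname -r)" 7.0.0-15
# Succeeds if RUNNING is the same as or newer than MINIMUM.
kernel_at_least() {
    # Keep only the numeric part, e.g. 7.0.0-15-generic -> 7.0.0-15
    have=$(printf '%s' "$1" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/')
    want=$2
    # sort -V orders version strings; if MINIMUM sorts first (or ties),
    # the running kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Example:
#   kernel_at_least "$(uname -r)" 7.0.0-15 || echo "kernel too old, see Step 0"
```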

Step 1: Build ollama-for-amd from source

The likelovewant/ollama-for-amd fork has changes to the GPU detection and HIP backend code that aren’t in upstream Ollama. The fork’s HIP backend has gfx1103 in its target list, so when you build with -DAMDGPU_TARGETS=gfx1103, the generated libggml-hip.so contains gfx1103-native kernel code for the operations Ollama itself does (everything outside of rocBLAS calls).

Build prerequisites on Ubuntu 26.04:

sudo apt install -y golang-go cmake clang rocm-cmake ninja-build \
                    rocm-dev libamdhip64-dev librocblas-dev librocm-smi-dev \
                    libhipblas-dev \
                    libarchive-tools  # for bsdtar (extracting Fedora RPM later)

Then build:

git clone --depth 1 https://github.com/likelovewant/ollama-for-amd.git
cd ollama-for-amd
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS="gfx1103"
cmake --build build -j$(nproc)        # 2-25 minutes depending on cores
go build -trimpath -o ollama .         # ~1-2 minutes

The output is a single ollama binary plus build/lib/ollama/libggml-*.so shared libraries.

Step 2: Apply the patches

Three small patches to ml/device.go:

Patch A — sort by free memory only (skip the integrated-vs-discrete bias):

// In func (a ByFreeMemory) Less, remove the Integrated check:
func (a ByFreeMemory) Less(i, j int) bool {
    return a[i].FreeMemory < a[j].FreeMemory
}

Patch B — skip Ollama’s deep-init validation:

func (d DeviceInfo) AddInitValidation(env map[string]string) {
    // env["GGML_CUDA_INIT"] = "1" // patched out
}

The original code sets GGML_CUDA_INIT=1, which triggers rocblas_initialize() during discovery. The comment in the source code is honest about it: “force deep initialization to trigger crash on unsupported GPUs.” Once we have the kernels in place (next step), this crash-detection is unnecessary and just risks crashing the iGPU on first init.

Patch C — respect parent env’s ROCR_VISIBLE_DEVICES:

// Add "os" to the imports

// In func (d DeviceInfo) updateVisibleDevicesEnv, inside the ROCm case:
case "ROCm":
    envVar = "ROCR_VISIBLE_DEVICES"
    if runtime.GOOS != "linux" {
        envVar = "HIP_VISIBLE_DEVICES"
    }
    if os.Getenv(envVar) != "" {
        return  // respect parent env's pin
    }

This is the trickiest one to explain. Without it, when Ollama spawns the model-load runner subprocess, it builds ROCR_VISIBLE_DEVICES from the chosen device’s FilterID. If the parent process (Ollama main, started from systemd with ROCR_VISIBLE_DEVICES=1) already pre-filtered to a single device, that device’s FilterID is 0 (re-indexed). So Ollama sets ROCR_VISIBLE_DEVICES=0 in the runner — which in a fresh process means physical device 0, the dGPU. With this patch, if the parent env already has the variable set, Ollama leaves it alone.

Rebuild after applying:

go build -trimpath -o ollama .

Step 3: Get gfx1103 Tensile kernels from Fedora 44 (rocBLAS 7.1.1)

Fedora’s rocblas package builds Tensile kernels for gfx1103 as part of their normal build process. The kernel binaries are pure GPU bytecode (.hsaco files compiled from HIP) and the Tensile metadata (.dat files) is msgpack-encoded — both are distribution-agnostic at the file level. What’s not distribution-agnostic is the rocBLAS runtime ABI: the version of Fedora you pull from must match the version of rocBLAS installed on your Ubuntu system.

For Ubuntu 26.04’s rocBLAS 7.1, the right source is Fedora 44’s rocBLAS 7.1.1 build:

# Download Fedora 44's rocblas package (276 MB)
cd /tmp
curl -fLO https://kojipkgs.fedoraproject.org/packages/rocblas/7.1.1/7.fc44/x86_64/rocblas-7.1.1-7.fc44.x86_64.rpm

# Extract with bsdtar (Ubuntu's rpm2cpio can't handle Fedora's zstd-compressed cpio)
mkdir fedora-rocblas
cd fedora-rocblas
bsdtar -xf /tmp/rocblas-7.1.1-7.fc44.x86_64.rpm

# Copy just the gfx1103 files into the system rocBLAS library dir
sudo cp /tmp/fedora-rocblas/usr/lib64/rocblas/library/*gfx1103* \
        /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/

That copies 96 files (a few hundred KB to a few MB each) — kernel binaries and Tensile selection metadata for every GEMM type x layout combination on gfx1103. The system’s librocblas5 picks them up automatically at next load.
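A quick count confirms the copy landed. This sketch uses hypothetical function names; the file counts it recognizes are the ones I observed (96 for the Fedora 44 set, 56 for the old Fedora 43 set), so a different Fedora build may legitimately produce another number.

```shell
# usage: count_arch_files DIR ARCH
# Counts regular files in DIR whose name contains ARCH.
count_arch_files() {
    find "$1" -maxdepth 1 -type f -name "*$2*" | wc -l | tr -d ' '
}

# Interprets the gfx1103 file count using the numbers reported in this post.
describe_gfx1103_set() {
    n=$(count_arch_files "$1" gfx1103)
    case $n in
        96) echo "96 files: matches the Fedora 44 (rocBLAS 7.1.1) set" ;;
        56) echo "56 files: looks like the old Fedora 43 (rocBLAS 6.4) set" ;;
        0)  echo "0 files: no gfx1103 kernels installed" ;;
        *)  echo "$n files: unexpected count, consider re-extracting the RPM" ;;
    esac
}

# Example:
#   describe_gfx1103_set /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library
```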

If you already installed Fedora 43’s rocBLAS 6.4 files from the old version of this recipe, replace them. Copying the new set over the top works where file names collide, but any old file without a same-named replacement would be left behind, so remove the old gfx1103 files first:

sudo rm /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/*gfx1103*
# then copy the Fedora 44 files in as above

Step 4: Install the patched Ollama and configure systemd

Replace the binary and libraries:

sudo systemctl stop ollama

sudo cp /usr/local/bin/ollama /usr/local/bin/ollama.upstream.bak
sudo install -m 0755 ./ollama /usr/local/bin/ollama

sudo mv /usr/local/lib/ollama /usr/local/lib/ollama.upstream.bak
sudo mkdir /usr/local/lib/ollama
sudo cp build/lib/ollama/* /usr/local/lib/ollama/

One subtle thing about the lib layout: put libggml-hip.so at the top level of /usr/local/lib/ollama/, NOT in a rocm/ subdirectory. Ollama’s backend loader looks for variants matching libggml-hip-*.so (with a trailing dash), and falls back to libggml-hip.so if no variants are found. The fallback path only checks the search paths it was given — putting the file in a subdirectory makes it invisible to the fallback unless you also tell Ollama to search that subdir.
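A layout check like the following catches the misplaced-library case before you start the service. It is only a sketch with a made-up function name, and it checks just the two locations discussed above.

```shell
# usage: check_hip_lib_layout LIBDIR
# Reports whether libggml-hip.so sits at the top level of LIBDIR.
check_hip_lib_layout() {
    libdir=$1
    if [ -f "$libdir/libggml-hip.so" ]; then
        echo "ok: libggml-hip.so is at the top level of $libdir"
    elif [ -f "$libdir/rocm/libggml-hip.so" ]; then
        echo "misplaced: libggml-hip.so is in $libdir/rocm/; move it up one level"
        return 1
    else
        echo "missing: no libggml-hip.so under $libdir"
        return 1
    fi
}

# Example:
#   check_hip_lib_layout /usr/local/lib/ollama
```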

Create a systemd drop-in:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
# Pin to the iGPU on multi-GPU systems. Adjust the index based on rocminfo.
# Single-GPU systems can omit this line.
Environment="ROCR_VISIBLE_DEVICES=1"

# Bind to localhost. Change to 0.0.0.0:11434 only if you want LAN access
# AND have a private overlay network (Tailscale, etc.) protecting it.
Environment="OLLAMA_HOST=127.0.0.1:11434"

# Disable Flash Attention. Brittle on this stack even with kernel -15.
Environment="OLLAMA_FLASH_ATTENTION=0"

# Keep models loaded for 24h. Reloading a 6.6 GB model into iGPU memory
# takes ~3 seconds, and idle gaps >5 min are common.
Environment="OLLAMA_KEEP_ALIVE=24h"

# Cap context at 16k. The residual MES bug returns at higher contexts on
# this hardware even with kernel -15. 16k is plenty for chat and for OWUI
# web-search-style RAG; if you genuinely need 32k+, raise this but expect
# more frequent crashes.
Environment="OLLAMA_CONTEXT_LENGTH=16384"

# Reduce prompt batch size from default 512. Smaller batches dispatch
# smaller compute graphs, which the residual MES bug rarely chokes on.
# ~10-15% prompt-eval speed cost; the difference between ~99% and ~50%
# reliability on long-context prompts.
Environment="OLLAMA_NUM_BATCH=256"
EOF

sudo systemctl daemon-reload
sudo systemctl start ollama

Replace ROCR_VISIBLE_DEVICES=1 with the physical index of your iGPU (use rocminfo to see the device list — the iGPU is the one with Marketing Name: AMD Radeon 780M Graphics; usually device 0 if it’s the only GPU, or whatever index the kernel assigned otherwise).
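Reading the index off rocminfo by eye is error-prone, so here is a rough parser. It rests on two assumptions I have not verified beyond my own machine: that rocminfo prints a "Marketing Name:" line before the "Device Type:" line of each agent, and that the zero-based position among GPU agents is the index ROCR_VISIBLE_DEVICES expects. The function names are hypothetical.

```shell
# Reads rocminfo output on stdin.
# Prints "INDEX<TAB>MARKETING NAME" for each agent whose Device Type is GPU.
list_gpu_agents() {
    awk '
        /Marketing Name:/ {
            name = $0
            sub(/^.*Marketing Name:[ \t]*/, "", name)
            sub(/[ \t]+$/, "", name)
        }
        /Device Type:/ && $NF == "GPU" { printf "%d\t%s\n", idx++, name }
    '
}

# Prints the index of the first GPU agent named like the 780M.
igpu_index() {
    list_gpu_agents | awk -F '\t' '$2 == "AMD Radeon 780M Graphics" { print $1; exit }'
}

# Example:
#   rocminfo | igpu_index
```

Treat the result as a starting point and confirm it against the "inference compute" line in the Ollama journal after restarting the service.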

Step 5: If you’re using Open WebUI, fix its bypass setting

If your Ollama is being consumed by Open WebUI with web search enabled (a really common setup), there’s one more knob that matters as much as the Ollama-side env-vars. It’s in OWUI’s config, not Ollama’s, but it interacts directly with the MES bug that Steps 0–4 were all about avoiding.

Two settings under Admin Panel → Settings → Web Search:

| Setting | Value | Why |
|---|---|---|
| bypass_web_loader | True | Skip the langchain HTML scrape of every search result URL. Without this, OWUI re-fetches each URL via a default user-agent that Cloudflare-protected sites (egpu.io, dev.to, virtualizationhowto.com, etc.) block. Symptom: model says “I see N sources but no content.” |
| bypass_embedding_and_retrieval | False (counterintuitive) | Earlier guidance (including my own, in earlier versions of this post) said True. That’s wrong for any search engine that returns full page content as the “content” field, which Ollama Cloud Search does. I measured 164,165 chars (~41k tokens) of content for the Wikipedia Paul McCartney result. With 3 query variants × 3 results = up to 9 such results per turn, prompts hit 150k-200k tokens, which Ollama truncates to 16k and which exercises the residual MES bug at scale. With this setting False, OWUI chunks the content (default chunk_size=1000), embeds chunks on CPU using sentence-transformers/all-MiniLM-L6-v2, retrieves only the top-k most relevant chunks (default top_k=3), and ships ~3k tokens to the model instead of 150k. Same answer quality, ~50× smaller prompt, dramatically more stable. |
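The arithmetic behind those prompt sizes can be sketched in shell. It assumes roughly 4 characters per token, which is simply the ratio implied by my measurement (164,165 chars ≈ 41k tokens); real tokenizers vary. The function names are made up for illustration.

```shell
# Rough token estimate, assuming ~4 characters per token (an approximation).
estimate_tokens() {
    echo $(( $1 / 4 ))
}

# Upper bound on prompt size when every search result is shipped unchunked.
# usage: unchunked_prompt_tokens CHARS_PER_RESULT QUERY_VARIANTS RESULTS_PER_QUERY
unchunked_prompt_tokens() {
    echo $(( $(estimate_tokens "$1") * $2 * $3 ))
}

# estimate_tokens 164165              -> 41041
# unchunked_prompt_tokens 164165 3 3  -> 369369
```

A single Wikipedia-sized result is already about 2.5× the 16,384-token context, and nine of them would be roughly 369k tokens as a worst case. Most results are smaller than that, which is consistent with the 150k-200k prompts I actually saw.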

The OWUI UI toggle can be overwritten by OWUI’s in-memory state on restart, so the most reliable way to flip bypass_embedding_and_retrieval from True to False is via SQLite while the container is stopped:

sudo docker stop open-webui

# patch the setting in the SQLite blob while OWUI is at rest
sudo docker run --rm -v open-webui:/data alpine sh -c '
  apk add --quiet sqlite > /dev/null
  sqlite3 /data/webui.db "
    UPDATE config
    SET data = json_set(data, '\''$.rag.web.search.bypass_embedding_and_retrieval'\'', json('\''false'\''));"
'

sudo docker start open-webui

The failure mode this fixes is sneaky because the userspace symptom is identical to the kernel MES bug from Gap 6: ROCm error: unspecified launch failure from hipStreamSynchronize, sometimes accompanied by MES might be in unrecoverable state in dmesg. The difference is that this version is triggered by the bypass setting making prompts huge, not by the kernel scheduler bug itself. The Ollama-side env-vars + kernel update reduce the chance of MES failure on small prompts to ~0%, but a 100k+ token prompt is still going to trigger it. Step 5 keeps the prompts small enough that the rest of the stack stays in its happy zone.

Verification

Check that discovery reports compute=gfx1103, not an HSA_OVERRIDE’d alias:

sudo journalctl -u ollama -n 30 | grep "inference compute"

You should see something like:

level=INFO source=types.go:42 msg="inference compute"
  id=0 library=ROCm compute=gfx1103
  name=ROCm0 description="AMD Radeon 780M Graphics"
  type=iGPU total="16.3 GiB" available="16.1 GiB"

If you see library=cpu instead, something failed silently — check the full journal for the actual error.
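For scripting the same check, the compute value can be pulled out of that log line. This is a sketch with a hypothetical function name; it assumes the key=value layout shown above and takes the last matching record on stdin.

```shell
# Reads "inference compute" journal lines on stdin and prints the compute=
# value of the last one (for example gfx1103). Prints nothing if none match.
detected_compute() {
    grep -o 'compute=[^ ]*' | tail -n 1 | cut -d= -f2
}

# Example:
#   sudo journalctl -u ollama -n 30 | grep "inference compute" | detected_compute
```

Anything other than gfx1103 in the output (including no output at all) means discovery did not land on the native target.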

Then check for the MES bug to confirm you’re on a fixed kernel:

sudo dmesg | grep MES

On 7.0.0-15 or newer, this should be empty or show only benign messages like ring registration. If you see MES failed to respond to msg=REMOVE_QUEUE or MES might be in unrecoverable state you’re on the broken kernel — go back to Step 0.
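To turn that into a number you can track over time, count the hang messages. A minimal sketch: count_mes_hangs is my own name, and it assumes each hang produces exactly one "MES failed to respond" line, as in the dmesg excerpt earlier in this post.

```shell
# Reads kernel log text on stdin and prints how many MES hang messages it holds.
# "|| true" keeps the function from failing when the count is zero,
# because grep -c exits non-zero when nothing matches.
count_mes_hangs() {
    grep -c 'MES failed to respond to msg=' || true
}

# Example:
#   sudo dmesg | count_mes_hangs
```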

Finally, run an actual inference:

ollama pull gemma4:e4b
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the capital of France?"}]
}' | python3 -m json.tool

You should get a sensible answer in a few seconds. The response JSON includes timing fields. eval_duration is in nanoseconds, so tokens/sec is eval_count divided by eval_duration, multiplied by 1e9.
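Because eval_duration is reported in nanoseconds, the conversion is easy to get wrong by a factor of a billion. Here is a small helper, with a made-up name, that takes the two fields from the response and prints tokens per second.

```shell
# usage: tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# eval_duration is in nanoseconds, so scale by 1e9 to get tokens per second.
tokens_per_sec() {
    awk -v count="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) { print "eval_duration must be positive" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", count * 1e9 / ns
    }'
}

# Example: 140 tokens generated in 10 seconds
#   tokens_per_sec 140 10000000000
```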

Performance

Numbers from gemma4:e4b (~9.6 GB quantized) on the Phoenix iGPU after this setup, kernel 7.0.0-15-generic:

| Workload | Speed | Comparison |
|---|---|---|
| Prompt eval (input tokens, <1k) | ~290 tok/s | ~5× CPU baseline |
| Generation (short response, <1k context) | ~13–14 tok/s | well above reading speed |
| Generation (medium response, ~6k context) | ~10 tok/s | still smooth chat |
| Generation (long context, ~14k tokens) | ~6–7 tok/s | usable for long-doc Q&A |

The drop-off at longer contexts is dominated by KV cache scans — the iGPU shares DDR5 with the rest of the system, and attention-over-context is fundamentally bandwidth-bound. For chat-style use, 10–14 tok/s sustained generation is well above reading speed and feels instant.

Stability

With all six gaps closed and the conservative env-var config above — plus, if you’re using Open WebUI, the bypass fix from Step 5 — the setup is reliable enough for daily use. Most recent measurement: 52 of 52 consecutive Open WebUI web-search chat completions succeeded in a real-world session (mixed topics, reasoning enabled, in-thread continuations and fresh threads alike). Zero MES events, zero GPU resets, zero Ollama runner terminations during that window.

The journey to that number was longer than I’d like to admit. Here’s the abridged history:

| State | Crash rate | The fix |
|---|---|---|
| Original recipe, Fedora 6.4 rocBLAS kernels under Ubuntu’s 7.1 runtime | ~42% | Use the matching Fedora 7.1.1 build instead. ABI mismatch was producing kernels that loaded but failed on certain code paths. |
| Correct kernels, but kernel 7.0.0-14 | ~14% | Upgrade to HWE kernel 7.0.0-15. The MES REMOVE_QUEUE fix landed in upstream Linux 6.11–6.13 and is backported. |
| Kernel -15 + Ollama env-vars, but consuming from OWUI with default bypass settings | ~30% specifically on OWUI web-search workloads | Flip OWUI’s bypass_embedding_and_retrieval from True to False. Ollama Cloud returns full pages as the “content” field; without chunking, OWUI was sending 150k-token prompts at the model, exercising the residual MES bug at scale. |
| All four layers correct | 0% observed | 52/52 in the latest session. The stack is reliable when all four pieces are right. |

Two things worth knowing about the residual susceptibility (even though it’s not currently triggering):

  • The MES bug isn’t fully eliminated in kernel -15, only mitigated. If you push OLLAMA_CONTEXT_LENGTH above 16k or restore OLLAMA_NUM_BATCH to its default 512, the larger compute graphs reintroduce MES failures. I measured 4 MES events in 7 prompts at 32k context + default batch on this same kernel. The conservative limits in the recipe are sized specifically to stay below that threshold.
  • Userspace symptoms of the bug are identical regardless of cause. Wrong kernel, wrong Fedora version, wrong OWUI bypass setting, and just pushing context too high all produce the same ROCm error: unspecified launch failure at hipStreamSynchronize. The differentiator is dmesg | grep MES versus checking your config layers in order. The triage list below covers the order I’d check.

What you should expect at the recommended config:

  • Routine chat (short prompts): snappy, ~13–14 tok/s, near-zero failures
  • OWUI web-search queries (1–4k token prompts after chunking + retrieval): 10–12 tok/s, near-zero failures
  • Long-document analysis (12–16k tokens): ~6–7 tok/s, occasional retry
  • Always-on agent or bot use: reliable enough for daily traffic; still wise to wrap in retry logic for the rare case

If you’re seeing failures after following the recipe in order, check in this order:

  1. Wrong kernel. uname -r should report 7.0.0-15-generic or newer. If older, install linux-image-generic-hwe-26.04 and reboot. This is by far the biggest stability lever.
  2. Wrong Fedora rocBLAS version. The kernels in /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ should be from Fedora 44 (rocBLAS 7.1.1), not Fedora 43 (rocBLAS 6.4). The file count for gfx1103 should be 96; if it’s 56 you have the old set.
  3. If consuming from OWUI: wrong bypass setting. Check OWUI’s webui.db: bypass_embedding_and_retrieval should be False, bypass_web_loader should be True. If it’s the other way around — especially the first one set to True — that’s almost certainly your crash source on web-search workloads. The SQLite-while-stopped procedure in Step 5 is the reliable way to fix it.
  4. Context or batch too high. Verify OLLAMA_CONTEXT_LENGTH=16384 and OLLAMA_NUM_BATCH=256 in your systemd drop-in. Defaults are noticeably less reliable.
  5. Flash Attention enabled. Verify OLLAMA_FLASH_ATTENTION=0. Brittle on this stack regardless of kernel version.
  6. Check dmesg for MES events. sudo dmesg | grep MES. If you see MES failed to respond to msg=REMOVE_QUEUE messages clustered around crashes, you’re hitting the residual bug — one of items 1–5 above is misconfigured. If dmesg is clean but you still have crashes, it’s something else and worth opening an issue.
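Items 4 and 5 of that list are easy to check mechanically. The sketch below uses a hypothetical function name and expects the exact Environment="NAME=value" line format used in the drop-in from Step 4; lines written differently (extra whitespace, several variables on one line) would be reported as missing.

```shell
# usage: check_ollama_dropin FILE
# Prints one line per expected setting that is absent or has another value.
# Returns non-zero if anything is off.
check_ollama_dropin() {
    file=$1
    rc=0
    for want in OLLAMA_FLASH_ATTENTION=0 OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_NUM_BATCH=256; do
        # -x matches the whole line, -F treats the pattern as a literal string.
        if ! grep -qxF "Environment=\"$want\"" "$file"; then
            echo "missing or different: $want"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_ollama_dropin /etc/systemd/system/ollama.service.d/override.conf
```

This only inspects one file. If you have several drop-ins, systemctl show ollama --property=Environment prints the environment systemd will actually apply.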

Shelf life and what to watch for

This setup is held together by a specific configuration of components. Things to watch:

  • Ubuntu eventually ships gfx1103 kernels. Likely in 27.04 or whenever AMD’s ROCm policy on Phoenix changes. At that point Step 3 (the Fedora extraction) becomes unnecessary — you can apt install and be done.
  • Ollama integrates the AMD-tuned changes upstream. The patches in Step 2 are workarounds for issues in upstream Ollama’s GPU selection logic; upstream may absorb them or change the surrounding code. Watch the ml/device.go diff when you upgrade.
  • An apt upgrade of librocblas5 will overwrite anything in /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ with the package’s gfx1103-less version. After such an upgrade, re-extract the Fedora RPM and re-copy.
  • Ubuntu jumps rocBLAS major versions. When 26.10 or 27.04 ships rocBLAS 7.2 or 8.x, you’ll need a matching Fedora rocBLAS build at that version. The rule from Step 3 applies: match the Tensile kernel files to the rocBLAS runtime version.
  • Kernel regressions. The MES fix in 7.0.0-15 is a backport of upstream patches; future kernel updates could in theory regress, though so far they haven’t. If you ever see MES failed to respond reappear in dmesg after an apt upgrade, file an issue.

None of these are imminent as of writing (late May 2026).

The companion repo

I packaged the whole flow as a one-command setup at github.com/johnsonfarmsus/ollama-rocm-gfx1103-ubuntu. It contains:

  • setup.sh — automates everything in Steps 1–4: installs build deps, clones the fork, applies the patches, builds, downloads the matching Fedora RPM, extracts, installs, writes the systemd drop-in (with the FA-disable and keep-alive defaults), restarts the service, and runs a smoke test. Refuses to proceed if your running kernel is older than 7.0.0-15 — saves you from the surprise of a working ROCm setup that’s still constantly crashing.
  • patches/ — the three ml/device.go patches as standalone .patch files.
  • override.conf — the systemd drop-in template, including the recommended env vars.
  • README that mirrors what’s above, with the same Stability story.

If you ran an earlier version of setup.sh, the right move is: upgrade your kernel to -15 if you haven’t, then pull the latest repo and re-run setup.sh. It’ll swap your old Fedora 6.4 kernel files for the matching 7.1.1 ones and add the recommended env vars to your systemd drop-in.

The first ml/device.go patch (sort by free memory) fixes a legitimate correctness issue; it is not just a workaround. ByFreeMemory.Less ignoring memory in favor of integrated-vs-discrete preference affects anyone with an iGPU that has more usable memory than a small dGPU. A patch is being prepared for an upstream PR.

References

If you find an issue in the recipe, the companion repo is the place. Open an issue or a PR.