Ubuntu Linux with Ollama ROCm on AMD Ryzen 780M iGPU


If you’ve got an AMD Ryzen 7040 or 8040 series chip (laptops, mini PCs, the Phoenix family of APUs), you’ve got a Radeon 780M iGPU sitting there, known as gfx1103 in ROCm terminology. It’s a perfectly capable RDNA3 GPU, and on my 28 GiB machine it can address about 16 GiB of memory once you count UMA + GTT. It would be a great target for local LLM inference via Ollama.

Except nothing in the standard Linux stack ships kernels for it. Not Ubuntu’s librocblas5, not AMD’s official ROCm 7.0 .deb, not Ollama’s bundled ROCm libraries. gfx1103 is treated as “consumer mobile” and quietly skipped by every official precompiled-kernel package. You can confirm this for yourself by running ls /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ | grep gfx on Ubuntu 26.04 — you’ll find kernels for gfx1030, gfx1100, gfx1101, gfx1151, gfx1200, gfx1201 and several data-center archs, but no gfx1103.
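The ls | grep check above can be wrapped in a small helper if you want to run it against several directories (the system rocBLAS, Ollama's bundled copy, an extracted package). This is a minimal sketch: the function names are mine, and it assumes each file name embeds its target as "gfx" followed by hex digits.

```shell
# List the GPU architectures that have files in a rocBLAS library directory.
# Assumption: each file name embeds its target as "gfx" + hex digits,
# e.g. Kernels.so-000-gfx1100.hsaco.
list_rocblas_archs() {
    ls "$1" 2>/dev/null | grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Succeed only if the directory has at least one file for the given arch.
has_rocblas_arch() {
    list_rocblas_archs "$1" | grep -qx "$2"
}

# Example:
#   has_rocblas_arch /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library gfx1103 \
#       || echo "no gfx1103 kernels here"
```

A non-zero exit from has_rocblas_arch is the situation this post is about: the runtime is installed, but nothing was built for your GPU.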

This post documents the working setup I landed on, after a multi-hour debugging session, for getting native ROCm acceleration on the Phoenix iGPU with Ollama. By “native” I mean Ollama reports compute=gfx1103 (not an HSA_OVERRIDE’d alias) and inference uses real GPU kernels, not CPU fallback. The whole thing runs at 24–48 tokens/sec on gemma4:e2b, roughly 1.5–3× the CPU-only baseline.

TL;DR

If you just want the recipe and trust me on the why:

  1. Build ollama-for-amd from source with -DAMDGPU_TARGETS=gfx1103.
  2. Apply three patches to ml/device.go (sort fix, validation skip, parent-env respect).
  3. Install Fedora 43’s rocblas-6.4.0-7.fc43.x86_64.rpm gfx1103 Tensile kernel files into your system’s rocBLAS library directory — without installing the RPM itself, just extracting the relevant files with bsdtar.
  4. Configure a systemd drop-in with ROCR_VISIBLE_DEVICES=1 if you have multiple GPUs and only want Ollama on one.

I put a one-command setup script and the patches in a companion GitHub repo: johnsonfarmsus/ollama-rocm-gfx1103-ubuntu. The rest of this post explains why each piece is necessary.

The hardware

My specific setup:

  • AMD Ryzen 7 H255 (Hawk Point family, Phoenix-class iGPU)
  • Radeon 780M iGPU = gfx1103 = the target
  • Radeon RX 5500 discrete GPU = gfx1012 (RDNA1, officially dropped from ROCm 6+) — reserved for gaming, NOT used for AI
  • 28 GiB system RAM, of which the iGPU can address ~16 GiB via UMA + GTT
  • Ubuntu 26.04 LTS “resolute,” kernel 7.0.0-14-generic

If you only have one GPU and it’s a Phoenix iGPU, the recipe is even simpler — you can skip the multi-GPU pinning. If you have a different combination (say, Phoenix iGPU + a working RDNA3+ dGPU), most of this still applies but the device-selection patches will need adjusting.

The five gaps in the ecosystem

It took me a long time to figure out that there wasn’t one thing broken — there were five things, each blocking the next. Working through them in order:

Gap 1: Ollama’s bundled rocBLAS lacks gfx1103 Tensile kernels

Ollama (both the official builds and the AMD-tuned ollama-for-amd fork) bundles its own copy of the ROCm runtime libraries under /usr/local/lib/ollama/rocm/. If you look inside the bundled rocblas/library/ directory, you’ll see Kernels.so-000-gfxXXXX.hsaco files for gfx1030, gfx1100, gfx1101, gfx1102, gfx1150 — but no gfx1103. When Ollama tries to initialize rocBLAS on a gfx1103 device, the runtime calls rocblas_initialize(), can’t find kernels for the actual hardware, and SIGABRTs the runner process. Ollama interprets the crash as “unsupported device,” falls back to CPU, and you get inference at CPU speed.

The well-publicized workaround is HSA_OVERRIDE_GFX_VERSION=11.0.0, which makes ROCm report the iGPU as gfx1100 and load gfx1100 kernels for it. This works for discovery: Ollama detects the device and lists it as a ROCm GPU. But the gfx1100 kernels were compiled and tuned for Navi 31, a 96-CU discrete part, and Phoenix is a 12-CU iGPU. Whatever the precise mismatch, the first time inference actually executed one of those kernels on my machine, the GPU faulted and the runner process SIGABRTed again. So HSA_OVERRIDE gets you most of the way and then bails at the worst time.

Gap 2: Ubuntu’s system rocBLAS ships the runtime but not the gfx1103 kernels

Ubuntu 26.04 ships ROCm 7.1 in the universe repository: rocm-dev, libamdhip64-7, librocblas5, libhipblas3, the whole stack. You can run rocminfo and it cleanly enumerates both my GPUs. But check the actual kernel data: /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ contains nothing built for gfx1103. Ubuntu’s packaging doesn’t ship the full set of multi-gigabyte Tensile binary blobs that AMD generates for each architecture, and Phoenix is among the ones left out. So even pointing Ollama at the system libs doesn’t help. It is the same gap as the bundled libs.

Gap 3: AMD’s official ROCm 7.0 .deb skips gfx1103 too

So download AMD’s own .deb directly, right? I did. rocblas7.0.0_5.0.0.70000-38~24.04_amd64.deb from repo.radeon.com/rocm/apt/7.0/.... It’s 152 MB and contains gfx908, gfx90a, gfx942, gfx950 (data center), gfx1030 (RDNA2), gfx1100/1101/1102 (RDNA3 discrete), gfx1151 (RDNA3.5), gfx1200/1201 (RDNA4) — and intentionally not gfx1103. AMD’s stance is that Phoenix is consumer-mobile silicon, not a supported ROCm target. There’s an open issue (ROCm/rocBLAS#1536) asking for gfx1103 inclusion that’s been alive for over a year without resolution.

Gap 4: The community fork is Windows-only on the binary side

The likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU repo ships community-built rocBLAS libraries with gfx1103 kernels — exactly what’s missing. But the README explicitly says “ROCm is available for Linux on the releases page, recommend against using it directly,” and the prebuilt 7z files are formatted as Windows HIP SDK drop-ins (target directory %HIP_PATH%\bin\rocblas\library). The linked likelovewant/ollama-for-amd fork — which has the AMD-specific code changes — only ships Windows binaries in its releases. The Linux path on this fork is “build from source.”

Gap 5: Even after a working setup, Ollama’s scheduler picks the wrong GPU

This one only matters on multi-GPU systems. Ollama’s ByFreeMemory.Less sort function unconditionally ranks integrated GPUs as less-than discrete GPUs regardless of memory:

func (a ByFreeMemory) Less(i, j int) bool {
    if a[i].Integrated && !a[j].Integrated {
        return true  // integrated always "less than" discrete
    } else if !a[i].Integrated && a[j].Integrated {
        return false
    }
    return a[i].FreeMemory < a[j].FreeMemory
}

After sort.Reverse(), this puts discrete GPUs first. On my box that means Ollama prefers the 4 GiB RX 5500 (RDNA1, broken under ROCm 6+) over the 16 GiB Phoenix iGPU. Even if the iGPU is the only one that actually works, the scheduler picks the dGPU and crashes. The assumption that discrete > integrated is true for most consumer PCs (one nice GPU + a tiny iGPU), but inverted for any APU-with-extra-iGPU-memory setup.
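To see the effect of that comparator without reading Go, here is a toy shell re-enactment of the two orderings. It is not Ollama's code: the function names and the "name integrated free_mib" input format are made up for illustration, and the device sizes are the round numbers from my machine.

```shell
# Toy re-enactment of the two orderings (not Ollama's code).
# Input on stdin: one device per line as "name integrated free_mib",
# where integrated is 1 for an iGPU and 0 for a discrete GPU.

# Biased order: discrete devices first, free memory only breaks ties.
pick_with_bias() {
    sort -k2,2n -k3,3nr | head -n 1 | cut -d' ' -f1
}

# Memory-only order: the device with the most free memory wins.
pick_by_free_memory() {
    sort -k3,3nr | head -n 1 | cut -d' ' -f1
}

# Example:
#   printf 'RX5500 0 4096\n780M 1 16384\n' | pick_with_bias        # RX5500
#   printf 'RX5500 0 4096\n780M 1 16384\n' | pick_by_free_memory   # 780M
```

With the bias in place, no amount of free memory on the iGPU can move it ahead of a discrete card.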

The actual fix

Five gaps, four steps to close them. The steps are listed in dependency order: each one is necessary for the next to work.

Step 1: Build ollama-for-amd from source

The likelovewant/ollama-for-amd fork has changes to the GPU detection and HIP backend code that aren’t in upstream Ollama. The fork’s HIP backend has gfx1103 in its target list, so when you build with -DAMDGPU_TARGETS=gfx1103, the generated libggml-hip.so contains gfx1103-native kernel code for the operations Ollama itself does (everything outside of rocBLAS calls).

Build prerequisites on Ubuntu 26.04:

sudo apt install -y golang-go cmake clang rocm-cmake ninja-build \
                    rocm-dev libamdhip64-dev librocblas-dev librocm-smi-dev \
                    libarchive-tools  # for bsdtar (extracting Fedora RPM later)

Then build:

git clone --depth 1 https://github.com/likelovewant/ollama-for-amd.git
cd ollama-for-amd
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS="gfx1103"
cmake --build build -j$(nproc)        # 5-25 minutes depending on cores
go build -trimpath -o ollama .         # ~1-2 minutes

The output is a single ollama binary plus build/lib/ollama/libggml-*.so shared libraries.

Step 2: Apply the patches

Three small patches to ml/device.go:

Patch A — sort by free memory only (skip the integrated-vs-discrete bias):

// In func (a ByFreeMemory) Less, remove the Integrated check:
func (a ByFreeMemory) Less(i, j int) bool {
    return a[i].FreeMemory < a[j].FreeMemory
}

Patch B — skip Ollama’s deep-init validation:

func (d DeviceInfo) AddInitValidation(env map[string]string) {
    // env["GGML_CUDA_INIT"] = "1" // patched out
}

The original code sets GGML_CUDA_INIT=1, which triggers rocblas_initialize() during discovery. The comment in the source code is honest about it: “force deep initialization to trigger crash on unsupported GPUs.” Once we have the kernels in place (next step), this crash-detection is unnecessary and just risks crashing the iGPU on first init.

Patch C — respect parent env’s ROCR_VISIBLE_DEVICES:

// Add "os" to the imports

// In func (d DeviceInfo) updateVisibleDevicesEnv, inside the ROCm case:
case "ROCm":
    envVar = "ROCR_VISIBLE_DEVICES"
    if runtime.GOOS != "linux" {
        envVar = "HIP_VISIBLE_DEVICES"
    }
    if os.Getenv(envVar) != "" {
        return  // respect parent env's pin
    }

This is the trickiest one to explain. Without it, when Ollama spawns the model-load runner subprocess, it builds ROCR_VISIBLE_DEVICES from the chosen device’s FilterID. If the parent process (Ollama main, started from systemd with ROCR_VISIBLE_DEVICES=1) already pre-filtered to a single device, that device’s FilterID is 0 (re-indexed). So Ollama sets ROCR_VISIBLE_DEVICES=0 in the runner — which in a fresh process means physical device 0, the dGPU. With this patch, if the parent env already has the variable set, Ollama leaves it alone.
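The re-indexing is easier to follow with a toy model. The sketch below is not ROCm or Ollama code. It only mimics how an index list in the style of ROCR_VISIBLE_DEVICES selects from a physical device list, with resolve_visible as a made-up name and the device order taken from my machine (dGPU at 0, iGPU at 1).

```shell
# Toy model of device-index resolution (not ROCm or Ollama code).
# resolve_visible "PHYSICAL DEVICES" "INDEX[,INDEX...]"
#   $1: device names in physical order, space-separated
#   $2: index list in the style of ROCR_VISIBLE_DEVICES
# Prints the devices a fresh process would see, one per line, in order.
resolve_visible() {
    for idx in $(printf '%s' "$2" | tr ',' ' '); do
        # awk fields are 1-based; device indices are 0-based
        printf '%s\n' "$1" | awk -v i="$idx" '{ print $(i + 1) }'
    done
}

# Example:
#   resolve_visible "RX5500 780M" 1   # what the parent sees: 780M
#   resolve_visible "RX5500 780M" 0   # what an unpatched runner sees: RX5500
```

The parent was pinned to index 1 and sees the 780M as its device 0. Writing that 0 back into the runner's environment resolves against the physical list again, which is how the dGPU sneaks back in.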

Rebuild after applying:

go build -trimpath -o ollama .

Step 3: Get gfx1103 Tensile kernels from Fedora 43

This is the keystone. Fedora has been packaging rocblas-gfx1103 as an experimental package since Fedora 40, and starting with Fedora 41 they fold gfx1103 into the main rocblas package. The kernel binaries are pure GPU bytecode (.hsaco files compiled from HIP) — they don’t depend on the host distribution’s libstdc++ or libc. The Tensile metadata (.dat files) is msgpack-encoded and has been stable across rocBLAS minor versions.

So: download Fedora’s RPM, extract just the gfx1103 files, drop them next to your system rocBLAS library:

# Download Fedora 43's rocblas package (170 MB)
# -f fails on HTTP errors instead of saving an error page, -L follows redirects
cd /tmp
curl -fLO https://kojipkgs.fedoraproject.org/packages/rocblas/6.4.0/7.fc43/x86_64/rocblas-6.4.0-7.fc43.x86_64.rpm

# Extract with bsdtar (Ubuntu's rpm2cpio can't handle Fedora's zstd-compressed cpio)
mkdir -p fedora-rocblas   # -p so the step can be re-run
cd fedora-rocblas
bsdtar -xf /tmp/rocblas-6.4.0-7.fc43.x86_64.rpm

# Copy just the gfx1103 files into the system rocBLAS library dir
sudo cp /tmp/fedora-rocblas/usr/lib64/rocblas/library/*gfx1103* \
        /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/

That copies 56 files (a few KB each): kernel binaries and Tensile selection metadata for every GEMM type × layout combination on gfx1103. The system’s librocblas5 picks them up automatically at next load.
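If you want to confirm the copy landed, or check later whether a package upgrade wiped it, a small comparison helper is enough. This is a sketch with a made-up function name. It only compares files whose names contain the architecture string and makes no attempt to validate the kernels themselves.

```shell
# verify_arch_copy SRC_DIR DST_DIR ARCH
# Succeeds only if every SRC_DIR file whose name contains ARCH exists in
# DST_DIR with identical contents. Prints each file that fails the check.
verify_arch_copy() {
    checked=0
    bad=0
    for f in "$1"/*"$3"*; do
        [ -f "$f" ] || continue        # the glob matched nothing
        checked=$((checked + 1))
        name=$(basename "$f")
        if ! cmp -s "$f" "$2/$name" 2>/dev/null; then
            echo "missing or different: $name"
            bad=$((bad + 1))
        fi
    done
    echo "checked $checked file(s), $bad problem(s)"
    [ "$checked" -gt 0 ] && [ "$bad" -eq 0 ]
}

# Example:
#   verify_arch_copy /tmp/fedora-rocblas/usr/lib64/rocblas/library \
#       /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library gfx1103
```

On a download that matches mine, the summary line should report 56 files checked. A non-zero exit means the copy step needs to be repeated.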

Step 4: Install the patched Ollama and configure systemd

Replace the binary and libraries:

sudo systemctl stop ollama

sudo cp /usr/local/bin/ollama /usr/local/bin/ollama.upstream.bak
sudo install -m 0755 ./ollama /usr/local/bin/ollama

sudo mv /usr/local/lib/ollama /usr/local/lib/ollama.upstream.bak
sudo mkdir /usr/local/lib/ollama
sudo cp build/lib/ollama/* /usr/local/lib/ollama/

One subtle thing about the lib layout: put libggml-hip.so at the top level of /usr/local/lib/ollama/, NOT in a rocm/ subdirectory. Ollama’s backend loader looks for variants matching libggml-hip-*.so (with a trailing dash), and falls back to libggml-hip.so if no variants are found. The fallback path only checks the search paths it was given — putting the file in a subdirectory makes it invisible to the fallback unless you also tell Ollama to search that subdir.
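The lookup order described above can be re-enacted in a few lines of shell, which is handy for checking your layout before starting the service. This is my own approximation of the behavior as I understand it, not Ollama's loader, and find_hip_backend is a made-up name.

```shell
# Re-enactment of the lookup order described above (not Ollama's code).
# find_hip_backend DIR...
# Prints the first HIP backend found: variants named libggml-hip-*.so win,
# plain libggml-hip.so is the fallback. Only the given directories are
# checked; subdirectories are never searched.
find_hip_backend() {
    for dir in "$@"; do
        for f in "$dir"/libggml-hip-*.so; do
            if [ -f "$f" ]; then
                echo "$f"
                return 0
            fi
        done
    done
    for dir in "$@"; do
        if [ -f "$dir/libggml-hip.so" ]; then
            echo "$dir/libggml-hip.so"
            return 0
        fi
    done
    return 1
}

# Example:
#   find_hip_backend /usr/local/lib/ollama || echo "HIP backend not visible"
```

If the only copy sits in /usr/local/lib/ollama/rocm/, the call above fails, which matches the invisible-fallback problem.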

Create a systemd drop-in to pin the iGPU on a multi-GPU system:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="ROCR_VISIBLE_DEVICES=1"
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF

sudo systemctl daemon-reload
sudo systemctl start ollama

Replace ROCR_VISIBLE_DEVICES=1 with the physical index of your iGPU (use rocminfo to see the device list — the iGPU is the one with Marketing Name: AMD Radeon 780M Graphics; usually device 0 if it’s the only GPU, or whatever index the kernel assigned otherwise).
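Reading the index off rocminfo by eye is error-prone because CPU agents are listed too. The sketch below counts GPU agents only. It rests on two assumptions you should check against your own output: that each agent block prints its "Marketing Name:" line before its "Device Type:" line, and that ROCR_VISIBLE_DEVICES numbers GPU agents in the order rocminfo lists them. The function name gpu_index_for is mine.

```shell
# gpu_index_for NAME  (reads rocminfo output on stdin)
# Prints the 0-based index, counting GPU agents only, of each GPU whose
# "Marketing Name" contains NAME.
# Assumptions: every agent block prints "Marketing Name:" before
# "Device Type:", and ROCR_VISIBLE_DEVICES numbers GPU agents in this order.
gpu_index_for() {
    awk -v want="$1" '
        /Marketing Name:/ {
            name = $0
            sub(/^.*Marketing Name:[ \t]*/, "", name)
        }
        /Device Type:/ {
            if ($NF == "GPU") {
                if (index(name, want) > 0) print gpu + 0
                gpu++
            }
        }
    '
}

# Example:
#   rocminfo | gpu_index_for "780M"
```

The Device Type check matters because the CPU agent's marketing name can also mention the integrated graphics, and it must not be counted.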

Verification

Check that discovery reports compute=gfx1103, not an HSA_OVERRIDE’d alias:

sudo journalctl -u ollama -n 30 | grep "inference compute"

You should see something like:

level=INFO source=types.go:42 msg="inference compute"
  id=0 library=ROCm compute=gfx1103
  name=ROCm0 description="AMD Radeon 780M Graphics"
  type=iGPU total="16.3 GiB" available="16.1 GiB"

If you see library=cpu instead, something failed silently — check the full journal for the actual error.

Run an actual inference. Pull a small model (the example uses Gemma 4 e2b but any will do):

ollama pull gemma4:e2b
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the capital of France?"}]
}' | python3 -m json.tool

You should get a sensible answer in a few seconds. The response JSON includes timing fields. eval_duration is in nanoseconds, so tokens/sec is eval_count divided by eval_duration, multiplied by 1e9.
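The unit conversion is easy to get wrong by a factor of a billion, so here is the arithmetic as a tiny helper. It is a sketch with a made-up function name, and it takes the two numbers as arguments rather than parsing the JSON for you.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# eval_duration is reported in nanoseconds, so scale it to seconds first.
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) exit 1              # avoid dividing by zero
        printf "%.1f\n", n / (ns / 1e9)
    }'
}

# Example with illustrative (not measured) values:
#   tokens_per_sec 96 2000000000   # 96 tokens in 2 s, prints 48.0
```

The same formula applies to prompt_eval_count and prompt_eval_duration if you want the prompt-processing rate.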

What I’m getting on this hardware

Benchmark numbers from gemma4:e2b (~7 GB quantized) on the Phoenix iGPU after this setup:

Workload                                  Speed         Comparison
Prompt eval (input tokens)                ~290 tok/s    ~5× CPU baseline
Generation (short response)               ~48 tok/s     ~3× CPU baseline (16 tok/s)
Generation (long response, 300+ tokens)   ~24 tok/s     ~1.5× CPU baseline

The drop-off at longer generation is mostly KV cache contention with system memory bandwidth — the iGPU shares DDR5 with the rest of the system. For chat-style use, 24 tok/s sustained generation is well above reading speed and feels instant.

Shelf life and what to watch for

This whole approach is held together by the gap between distribution-built rocBLAS (no gfx1103 kernels) and Fedora’s community-built rocBLAS (has them). The day one of these things changes, you may need to revise:

  • Ubuntu eventually ships gfx1103 kernels. Likely in 27.04 or whenever ROCm 7.x integrates Phoenix. At that point you can skip Step 3 entirely.
  • Ollama integrates the AMD-tuned changes upstream. The patches I’m applying are workarounds for the integrated/discrete sort bias and the env-passing quirks; the upstream Ollama may absorb them or change the surrounding code. Watch the ml/device.go diff when you upgrade.
  • An apt upgrade of librocblas5 will overwrite anything in /usr/lib/x86_64-linux-gnu/rocblas/5.1.0/library/ with the package’s version, which doesn’t include gfx1103. After such an upgrade, re-run Step 3.
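  • Fedora version jumps could change the Tensile data format. The two distributions number things differently: Fedora’s package carries the ROCm release number (6.4.0), while Ubuntu’s library directory carries rocBLAS’s own version (5.1.0, from ROCm 7.1). If a later Fedora build emits .dat files that Ubuntu’s runtime can’t read, the kernels won’t load. So far (Fedora’s ROCm 6.4 build vs Ubuntu’s ROCm 7.1 runtime) the formats happen to be compatible.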

None of these are imminent. As of writing (May 2026), the recipe above is the most reliable path I’ve found to a working ROCm-Ollama setup on the AMD Phoenix iGPU.

The companion repo

I packaged the whole flow as a one-command setup at github.com/johnsonfarmsus/ollama-rocm-gfx1103-ubuntu. It contains:

  • setup.sh — automates everything above: installs build deps, clones the fork, applies the patches, builds, downloads the Fedora RPM, extracts, installs, writes the systemd drop-in, restarts the service, and runs a smoke test.
  • patches/ — the three ml/device.go patches as standalone .patch files.
  • override.conf — the systemd drop-in template.
  • README with run instructions.

For the curious, the upstream Ollama sort fix is also being submitted as a PR — ByFreeMemory.Less ignoring memory in favor of integrated-vs-discrete preference is a legitimate correctness issue, not a workaround, and affects anyone with an iGPU that has more usable memory than a small dGPU.

If you find an issue in the recipe, or a cleaner way to handle any of the five gaps, the companion repo is the place — open an issue or a PR.