May 28, 2026

Building a Windows on Arm porting agent for Python wheels

How I used Codex, a deep spec, hosted Windows on Arm runners, and a real WoA laptop to make Python wheel migration more repeatable

By Michael Gamble


Python support on Windows on Arm is improving, but many packages still do not publish native win_arm64 wheels. This gap matters because “it installs” and “it works well on native WoA” are not the same thing. The porting work is often repetitive: you identify the build system, wire up CI, build the wheel, install the artifact, smoke test it, then repeat the validation on real hardware. Resources are limited. AI tooling is not. So I started with a simple question: could I use Codex and a software engineering agent to do most of the porting work for me?

That question came from a very practical constraint: I was trying to improve Python support on Windows on Arm with limited resources. I had a good reference point, though. A coworker had already shown how useful automation could be for Arm enablement work in another ecosystem, and that encouraged me to try the same idea for Python packaging. I started manually, working with Codex and my SWE agent until I had a couple of packages building and smoke testing on GitHub-hosted windows-11-arm runners. After repeating the process several times, a pattern emerged: this was not complex, one-off engineering. It was mostly a repeatable workflow that had never been systematized.

So I wrote a deep specification and turned the manual process into what I now think of as a WoA Porting Agent. Take a backlog of important packages, prioritize them, patch the forks, build native wheels on WoA runners, validate the exact artifacts in CI, send the same artifacts to a physical WoA laptop, rerun the same smoke contract there, and record the result. In other words, use hosted Arm CI for throughput and real hardware for truth.

The hypothesis

The hypothesis was not “an agent can magically port everything.” It was narrower and more useful:

  1. Most wheel migrations would fail for a small number of repeatable reasons.
  2. Those reasons could be discovered and fixed by an agentic loop.
  3. Exact-artifact validation mattered more than another static checklist.

That last point turned out to be critical. A wheel that builds successfully is not automatically a wheel that works on Windows on Arm. The system had to validate the same built artifact twice:

  • First on GitHub-hosted windows-11-arm
  • Then on a physical Windows on Arm laptop

That requirement shaped the design more than anything else.
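One way to make “the same built artifact” checkable is to record a digest when CI produces the wheel and compare it on the laptop before installing. The post does not say how artifact identity was enforced, so the sketch below is an assumption, and the function names `wheel_digest` and `is_same_artifact` are hypothetical.

```python
import hashlib
from pathlib import Path


def wheel_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a wheel file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def is_same_artifact(path: Path, expected_digest: str) -> bool:
    """True when the local file matches the digest recorded at build time."""
    return wheel_digest(path) == expected_digest
```

In this sketch, CI would store the digest next to the artifact, and the device stage would refuse to run the smoke contract if the downloaded wheel did not match it.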

The workflow I built

The WoA Porting Agent ended up being a controller plus a validation loop:

  1. Look at the target package backlog.
  2. Pick the next unblocked package.
  3. Analyze the repo shape and build system.
  4. Patch the fork, not upstream.
  5. Build a native win_arm64 wheel on windows-11-arm.
  6. Install the wheel artifact and run a package-specific smoke test in CI.
  7. Queue the same artifact to a physical WoA laptop.
  8. Install it in a fresh environment on the laptop and run the same smoke contract again.
  9. Report the result back into the controller inventory.

That gave me a bounded system where the controller tracked package status, smoke contracts, validation results, and the fork branch that carried the work. The laptop was not used as a self-hosted CI runner. Instead, it acted as a final validation stage that pulled artifacts and ran them on real hardware.
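The "pick the next unblocked package" step can be sketched as a small inventory query. This is a minimal illustration, not the actual controller: the `PackageEntry` fields, the status names, and the "lower number means higher priority" convention are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PackageEntry:
    name: str
    priority: int                       # assumed: lower number = more important
    status: str = "pending"             # assumed states: pending, device_passed, failed
    blocked_by: list[str] = field(default_factory=list)
    fork_branch: Optional[str] = None   # branch on the fork that carries the work


def next_unblocked(inventory: list[PackageEntry]) -> Optional[PackageEntry]:
    """Return the highest-priority pending package whose blockers all passed."""
    done = {p.name for p in inventory if p.status == "device_passed"}
    candidates = [
        p for p in inventory
        if p.status == "pending" and all(dep in done for dep in p.blocked_by)
    ]
    return min(candidates, key=lambda p: p.priority, default=None)
```

With an inventory where fastparquet is blocked by cramjam, this returns cramjam first even if fastparquet has the higher priority, and returns fastparquet only after cramjam is marked `device_passed`.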

Here is the shape of one smoke-test contract:

- smoke_test_id: asyncpg-main
  package: asyncpg
  execution_modes: [ci, device]
  steps:
    - step_id: import-asyncpg
      kind: import_module
      module: asyncpg
    - step_id: load-native-pgproto-codec
      kind: python_code
      code: |
        import asyncpg
        from asyncpg.pgproto import pgproto
        value = "a10f71f2-3d8d-4ec2-a9a3-f8b4d2f2a8ee"
        parsed = pgproto.UUID(value)
        assert str(parsed) == value

The important detail is that the same contract runs in both places. I was not using one test for CI and a weaker or different test on the device.
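A runner for a contract of this shape could look roughly like the sketch below. It assumes the YAML has already been parsed into a dict by whatever loader the controller uses, and it handles only the two step kinds shown above (`import_module` and `python_code`). The function name `run_contract` and the result tuple format are hypothetical, not taken from the real agent.

```python
import importlib


def run_contract(contract: dict, mode: str) -> list[tuple[str, bool, str]]:
    """Run every step of a contract; return (step_id, passed, detail) per step."""
    if mode not in contract["execution_modes"]:
        raise ValueError(f"contract does not allow mode {mode!r}")
    results = []
    for step in contract["steps"]:
        try:
            if step["kind"] == "import_module":
                importlib.import_module(step["module"])
            elif step["kind"] == "python_code":
                # Each code step runs in its own fresh namespace.
                exec(step["code"], {"__name__": "__smoke__"})
            else:
                raise ValueError(f"unknown step kind {step['kind']!r}")
            results.append((step["step_id"], True, ""))
        except Exception as exc:
            # Record the failure and continue so the report covers every step.
            results.append((step["step_id"], False, f"{type(exc).__name__}: {exc}"))
    return results
```

Because the same function would be called with `mode="ci"` on the hosted runner and `mode="device"` on the laptop, the two environments cannot drift into running different checks.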

What the first 13 packages taught me

I ran the workflow successfully on 13 repositories. The surprising part was not that the agent could help; it was where the problems actually were.

Out of the first 13 packages:

  • 10 were basically workflow-only enablement
  • 1 needed validation-environment shaping
  • 1 needed a real build-config fix
  • 1 needed a real packaging fix

That means only 2 of the first 13 required actual package-file changes. The rest were mostly blocked by missing WoA CI/build paths and missing artifact-based validation.

The most common issues were:

  1. No native windows-11-arm build or test lane
  2. Wrong or incomplete build environment on WoA
  3. Packaging logic that selected the wrong native payload
  4. Runtime dependency gaps during validation
  5. Validation plumbing problems, like artifact naming or staging assumptions

This was the most useful result of the exercise. If I had assumed that most packages needed invasive source work, I probably would have deprioritized the effort. The data pointed the other way: the work was usually simpler, more repetitive, and far easier to automate than I expected.

Three representative examples

To make that concrete, here are three packages that capture the range of work.

  • safetensors was the clean case. It needed a WoA workflow, installed-wheel validation, and device validation, but no package code changes. Once the Rust-backed wheel was built on a real ARM64 runner and tested from the artifact, it passed.
  • asyncpg was a build-config case. It needed a WoA workflow, a setuptools pin in pyproject.toml, and proper submodule handling to make the ARM64 build path stable. That is a real package change, but it is still a build-chain issue, not a rewrite of the library itself.
  • fastparquet was a validation-environment case. The package itself did not need source edits, but the validation loop had to install runtime dependencies and stage an unpublished WoA cramjam wheel alongside it so the parquet roundtrip could run on the laptop. This is exactly the kind of issue static analysis tends to miss.
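For the fastparquet case, staging an unpublished wheel next to the one under test can be done with pip's `--find-links` option, which lets pip consider wheels in a local directory in addition to the package index. The sketch below only builds the command. The helper name `install_command`, the `wheelhouse` directory, and the wheel file name are hypothetical, and the post does not say which mechanism the agent actually used.

```python
import sys
from pathlib import Path


def install_command(wheel: Path, wheelhouse: Path) -> list[str]:
    """Build a pip command that installs `wheel` and lets pip resolve
    dependencies from locally staged wheels in `wheelhouse` as well."""
    return [
        sys.executable, "-m", "pip", "install",
        "--find-links", str(wheelhouse),
        str(wheel),
    ]
```

Passing the resulting list to `subprocess.run(..., check=True)` inside a fresh virtual environment would install the wheel under test, with the staged cramjam wheel available as a candidate for the dependency.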

Across the full set, I also hit real packaging bugs. The clearest example was a wheel that built successfully but bundled the wrong ARM64 native payload. That kind of failure is why I do not trust “it compiled” as the finish line.
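A wrong native payload can be caught before the smoke test runs by reading the machine type from each bundled binary. A wheel is a zip archive, and every `.pyd` or `.dll` inside it is a PE file whose COFF header records the target architecture. The sketch below is one possible check, not the agent's actual implementation, and the function names are hypothetical. It inspects only the header field and does not validate the wheel's filename tags.

```python
import struct
import zipfile

IMAGE_FILE_MACHINE_ARM64 = 0xAA64  # PE/COFF machine value for ARM64
IMAGE_FILE_MACHINE_AMD64 = 0x8664  # PE/COFF machine value for x64


def pe_machine(data: bytes) -> int:
    """Read the COFF Machine field from the bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ header")
    # The 4-byte offset of the PE signature is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The 2-byte Machine field follows the signature directly.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return machine


def non_arm64_binaries(wheel_path: str) -> list[str]:
    """List .pyd/.dll members of a wheel whose machine type is not ARM64."""
    bad = []
    with zipfile.ZipFile(wheel_path) as whl:
        for name in whl.namelist():
            if name.lower().endswith((".pyd", ".dll")):
                if pe_machine(whl.read(name)) != IMAGE_FILE_MACHINE_ARM64:
                    bad.append(name)
    return bad
```

An empty list from `non_arm64_binaries` means every bundled binary declares ARM64. Any other result is a reason to stop before the wheel reaches the device.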

What I would and would not claim

I would confidently say that this approach works as a practical engineering workflow. I now have a controller-driven system that can prioritize targets, patch forks, build and smoke test wheels on hosted WoA runners, validate those same artifacts on a physical WoA laptop, and record the results for later review.

I would not describe this as fully autonomous software delivery. I would still keep humans in the loop for upstream PR cleanup, release decisions, and tasks that need broader maintainer coordination. I also did not instrument this tightly enough to quote a precise “hours saved” number, so I cannot provide an exact ROI figure. What I can say is that the nature of the work changed: once the workflow existed, the effort moved away from ad hoc setup and toward repeated execution.

Why this matters

For me, the key lesson is that Windows on Arm ecosystem work is often limited less by difficult engineering problems and more by missing process. If you can give an agent the right specification, the right validation loop, and access to hosted Arm runners plus one real device, a lot of wheel migration work becomes tractable.

That is the real story of this project. I did not set out to build a flashy autonomous system. I set out to remove the repetitive and expensive parts of porting Python packages to Windows on Arm. The result was a WoA Porting Agent that made the work repeatable, measurable, and easier to scale.

That may be the most useful way to think about AI-assisted enablement on Arm: not as magic, but as leverage for the repetitive engineering work that keeps ecosystems from moving faster.

Call to action

If you maintain Python packages, or you care about developer tooling on Windows on Arm, the takeaway is simple: start validating the actual wheel artifacts on real WoA hardware. Hosted windows-11-arm CI is a huge enabler, but the combination of hosted CI plus exact-artifact device validation is what turns a port from “probably works” into something you can trust.

To learn more about this methodology, check out my talk at Microsoft Build next week: Automate Windows on Arm migrations with AI-assisted specs. Or come to the Arm booth and talk to me or other experts about accelerating your application's move to Windows on Arm.



Re-use is only permitted for informational and non-commercial or personal use only.
