Master of Puppets: Foundation Models Tool Calling in Swift

A few months ago I wrote about running four AI features entirely on-device in VinylCrate using Apple’s Foundation Models framework. That post was about @Generable — type-safe structured output from a local model, no network, no per-token bill. It held up. Those features still ship.

But there was a hole in that post, and a few people poked it.

Everything I described was a closed loop. I assembled context up front, handed the model a prompt, and got a typed struct back. The model never reached into anything. When I needed a real Discogs release ID, I treated the model’s guess as a hint and verified it against the API afterward. I even called that out as a limitation: “You can’t give the model a tool to call Discogs mid-generation. You build the context up front, generate, then validate and enrich afterward.”

That was true for the features I’d built. It was not true for the framework.

The model can reach into your app. Mid-generation. It can call a function you wrote, get a result, and fold that result into its answer — all in a single respond call. That’s the Tool protocol — on-device tool calling — and it’s the difference between an AI feature that talks about your data and one that talks to it.

Here’s the moment that made me build it.

Battery — The Question Text Generation Couldn’t Answer

VinylCrate’s collection lives in SwiftData. A user has anywhere from a dozen to a few thousand records. The existing AI features summarize shapes — the character of a crate, the vibe of a collection. They’re good at “your collection leans heavy on ’70s soul.” They’re useless at “do I own anything by Tool?”

Because that’s a lookup. It’s not a vibe. It’s a fact about this user’s actual rows in SwiftData, and the model has no idea those rows exist. Ask it “what should I spin tonight?” and it’ll happily recommend a record you don’t own. Confidently. Which is worse than useless — it’s a bug that reads like a feature.

You can fake your way around this. Stuff the entire collection into the prompt and let the model pretend it knows. I did the math on that in the last post: a 500-record collection with metadata blows past the on-device context window long before you finish serializing it. Truncate to fit and the model is now answering questions about a sample of the collection while sounding authoritative about all of it.
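
To make that arithmetic concrete, here is a rough sketch of the budget. The 4,096-token window is the size Apple documents for the on-device model at the time of writing. The per-record and reserved token counts are assumptions picked for illustration, not measurements from VinylCrate.

```swift
/// Back-of-the-envelope context budget. Only `windowTokens` comes from
/// Apple's documentation; the other constants are illustrative assumptions.
enum ContextBudget {
    static let windowTokens = 4_096    // documented on-device context window
    static let tokensPerRecord = 30    // assumed: title, artist, year, genre, separators
    static let reservedTokens = 600    // assumed: instructions, question, and the answer

    /// Estimated prompt size, in tokens, for a collection of `count` records.
    static func estimatedTokens(forRecords count: Int) -> Int {
        count * tokensPerRecord
    }

    /// How many records fit once the reserved tokens are set aside.
    static func recordsThatFit() -> Int {
        (windowTokens - reservedTokens) / tokensPerRecord
    }
}
```

With these assumed numbers, 500 records come to about 15,000 tokens, while only 116 records fit in the window. Everything past that point is what truncation silently drops.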

That’s the trap. The honest fix isn’t a bigger prompt. It’s giving the model a way to ask the database itself.

Master of Puppets — The `Tool` Protocol, Pulled Apart

A Tool is a function the model can decide to call. This is tool calling — the same function-calling pattern you know from server-side LLMs, except the whole loop runs on-device. You describe it in natural language, declare its arguments as a @Generable type, and implement the body in plain Swift. At runtime the model reads your description, decides whether the user’s request warrants calling it, generates the arguments, and waits for your result before composing its final answer.

Four pieces matter:

name — a stable identifier the model uses to reference the tool.
description — natural-language explanation of when to use it. This is prompt engineering, not documentation. The model reads it.
A nested @Generable struct Arguments — the typed parameters the model fills in.
func call(arguments:) async throws -> String — your implementation. The return string goes back into the model’s context.

Here’s the smallest honest version — a tool that answers “do I own anything by this artist?”

import FoundationModels

struct ArtistLookupTool: Tool, Sendable {
    let name = "lookupArtist"
    let description = """
        Search the user's own vinyl collection for records by a specific \
        artist. Use this whenever the user asks whether they own something, \
        what they have by an artist, or anything that depends on their \
        actual collection rather than general music knowledge.
        """

    @Generable
    struct Arguments: Sendable {
        @Guide(description: "The artist or band name to search for")
        var artist: String
    }

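    // The store is injected so the tool can query the real collection.
    // CollectionStore is defined in a later section.
    private let store: CollectionStore

    init(store: CollectionStore) {
        self.store = store
    }

    func call(arguments: Arguments) async throws -> String {
        // ...query the real collection, return a string the model can read
        "..."
    }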
}

That’s the whole shape. The interesting decisions are all inside those four members, so let’s slow down on each.

The description is doing the most work, and it’s the part people underinvest in. The model isn’t matching keywords. It reads that sentence and reasons about whether the user’s request fits. “Use this whenever the user asks whether they own something” is an instruction aimed at the model’s judgment. Vague descriptions produce a tool that fires at the wrong times or never fires at all. Mine went through four rewrites before it stopped triggering on general “tell me about this band” questions where no collection lookup was needed.

The Thing That Should Not Be — Shaping What The Model Hands You

The Arguments struct is @Generable, which means everything from the @Generable structured-output post applies — @Guide constraints, soft suggestions on strings, hard contracts on ranges and counts. The difference is who fills it in. With output generation, the model produces the struct as your answer. With a tool, the model produces the struct as a question to your code.

That distinction changes how you constrain it. You’re not shaping a UI payload — you’re shaping a query. Over-constrain and the model can’t express what the user actually asked. Under-constrain and you get garbage arguments your call body has to defend against.

A recommendation tool makes this concrete. “What should I spin tonight?” carries intent — a mood, maybe a decade, maybe a genre. I want the model to extract that intent into structured arguments so my SwiftData query can act on it:

struct SpinSuggestionTool: Tool, Sendable {
    let name = "suggestFromCollection"
    let description = """
        Suggest records to play from the user's own collection based on a \
        mood, genre, or era they describe. Only returns records the user \
        actually owns. Use this for "what should I play" style requests.
        """

    @Generable
    struct Arguments: Sendable {
        @Guide(description: "A mood or vibe the user described, e.g. 'late night', 'energetic', 'mellow'. Empty if none was given.")
        var mood: String

        @Guide(description: "A musical genre to filter by, or empty if the user did not specify one")
        var genre: String

        @Guide(description: "How many records to suggest", .range(1...5))
        var limit: Int
    }

    private let store: CollectionStore

    init(store: CollectionStore) {
        self.store = store
    }

    func call(arguments: Arguments) async throws -> String {
        let matches = try await store.records(
            matchingGenre: arguments.genre.isEmpty ? nil : arguments.genre,
            limit: arguments.limit
        )
        return Self.render(matches, mood: arguments.mood)
    }
}

Note the .range(1...5) on limit. That’s a hard contract — the model cannot hand my code a limit of 200 and make me page through the whole collection. And note the empty-string convention on mood and genre: the on-device model handles “absent” awkwardly with optionals, so I model “the user didn’t say” as an empty string and branch on it in call. Not elegant. But honest, and it survives Swift 6 without an optional dance the model keeps getting wrong.
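
The empty-string convention can be tightened a little before the values reach the query. The sketch below is a hypothetical helper layer, not code from VinylCrate: RawSpinArguments mirrors the tool's Arguments as a plain struct so the example runs without the framework, and SpinQuery, nonEmpty, and makeQuery are names invented for illustration. It assumes whitespace-only strings should count as "the user didn't say", and it clamps the limit again even though the guide should already enforce the range.

```swift
import Foundation

/// Plain-Swift mirror of the tool's arguments (hypothetical, for illustration).
struct RawSpinArguments {
    var mood: String
    var genre: String
    var limit: Int
}

/// What a query layer would rather receive: optionals and a bounded limit.
struct SpinQuery: Equatable {
    var mood: String?
    var genre: String?
    var limit: Int
}

/// Treats empty or whitespace-only strings as "not specified".
func nonEmpty(_ text: String) -> String? {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    return trimmed.isEmpty ? nil : trimmed
}

/// Converts model-supplied arguments into a query the store can trust.
func makeQuery(from raw: RawSpinArguments) -> SpinQuery {
    SpinQuery(
        mood: nonEmpty(raw.mood),
        genre: nonEmpty(raw.genre),
        // The guide's range should already hold; clamping is cheap insurance.
        limit: min(max(raw.limit, 1), 5)
    )
}
```

Doing the conversion in one place keeps the call body free of scattered isEmpty checks, and the optionals only exist on the Swift side, where the model never has to generate them.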

One thing the reference makes explicit and I’ll repeat: under Swift 6 strict concurrency, both the Tool conformer and its @Generable Arguments need to be Sendable. A tool gets dispatched concurrently — that’s the whole point — so shared mutable state inside one is a real data race, not a theoretical one. My tools hold an immutable reference to a store and nothing else, which makes them naturally Sendable. No @unchecked, no nonisolated(unsafe). If you find yourself reaching for either, the tool is holding state it shouldn’t.

Welcome Home — Wiring `call` To Real SwiftData

This is where it stops being a framework demo and starts touching production. The call body queries the actual collection. VinylCrate’s records live in SwiftData (the same collection I wired into Siri through App Intents), and the container’s main context is @MainActor-bound, so I keep the read on an actor that owns its own ModelContext and hand back plain Sendable value types.

struct CollectionRecord: Sendable {
    let title: String
    let artist: String
    let year: Int?
    let genre: String
}

actor CollectionStore {
    private let context: ModelContext

    init(container: ModelContainer) {
        self.context = ModelContext(container)
    }

    func records(matchingGenre genre: String?, limit: Int) async throws -> [CollectionRecord] {
        var descriptor = FetchDescriptor<Album>(
            sortBy: [SortDescriptor(\.dateAdded, order: .reverse)]
        )
        if let genre {
            descriptor.predicate = #Predicate { $0.genre.localizedStandardContains(genre) }
        }
        descriptor.fetchLimit = limit * 4  // over-fetch to leave headroom for post-filtering (not shown here)

        let albums = try context.fetch(descriptor)
        return albums.prefix(limit).map {
            CollectionRecord(title: $0.title, artist: $0.artist, year: $0.year, genre: $0.genre)
        }
    }

    func records(byArtist artist: String) async throws -> [CollectionRecord] {
        let descriptor = FetchDescriptor<Album>(
            predicate: #Predicate { $0.artist.localizedStandardContains(artist) }
        )
        return try context.fetch(descriptor).map {
            CollectionRecord(title: $0.title, artist: $0.artist, year: $0.year, genre: $0.genre)
        }
    }
}

The SwiftData model objects never cross the actor boundary. They can’t: an Album instance is tied to the ModelContext that fetched it, and @Model classes aren’t Sendable. What crosses is CollectionRecord, a flat value type that is. The tool gets clean, sendable data and never touches a managed object on the wrong isolation. This is the part the compiler will fight you on if you get it wrong, and that’s the compiler doing its job.
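
The same boundary rule can be shown without SwiftData at all. The following is a minimal sketch under stated assumptions: ManagedAlbum is a hypothetical stand-in for a managed model class, and AlbumSnapshot and AlbumVault are invented names. The point is only the shape: reference types stay inside the actor, value types come out.

```swift
import Foundation

/// Hypothetical stand-in for a managed model class: a mutable reference
/// type that is deliberately not Sendable.
final class ManagedAlbum {
    var title: String
    var artist: String

    init(title: String, artist: String) {
        self.title = title
        self.artist = artist
    }
}

/// Flat value type that is safe to hand across the actor boundary.
struct AlbumSnapshot: Sendable, Equatable {
    let title: String
    let artist: String

    init(_ album: ManagedAlbum) {
        self.title = album.title
        self.artist = album.artist
    }
}

/// Owns the managed objects; callers only ever receive snapshots.
actor AlbumVault {
    private let albums: [ManagedAlbum]

    /// Builds the managed objects inside the actor from Sendable seed data.
    init(seed: [(title: String, artist: String)]) {
        self.albums = seed.map { ManagedAlbum(title: $0.title, artist: $0.artist) }
    }

    /// Case-insensitive artist match, mapped to snapshots before returning.
    func snapshots(byArtist artist: String) -> [AlbumSnapshot] {
        albums
            .filter { $0.artist.localizedCaseInsensitiveContains(artist) }
            .map { AlbumSnapshot($0) }
    }
}
```

From an async context the call would read `await vault.snapshots(byArtist: "tool")`. If the method tried to return [ManagedAlbum] instead, strict concurrency checking would reject the call site, which is the behavior the paragraph above describes.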

Now the tool’s call finishes the job — turn rows into a string the model can actually read:

extension SpinSuggestionTool {
    static func render(_ records: [CollectionRecord], mood: String) -> String {
        guard !records.isEmpty else {
            return "The user owns no records matching that request."
        }
        let lines = records.map { record in
            let year = record.year.map { " (\($0))" } ?? ""
            return "- \(record.title) by \(record.artist)\(year), genre: \(record.genre)"
        }
        return """
            The user owns these matching records:
            \(lines.joined(separator: "\n"))
            """
    }
}

Two things I want to be loud about here.

First, the empty case matters more than the populated one. “The user owns no records matching that request.” is the sentence that stops the model from hallucinating a recommendation. Without an explicit “you have nothing here,” the model fills the silence — and it fills it with records the user doesn’t own. The tool’s job isn’t only to return data. It’s to return the absence of data in language the model respects.

Second, you’re writing for a reader, not a parser. The return value is a String that goes straight back into the model’s context. Prose with structure beats raw JSON here — the model reads “The user owns these matching records:” and grounds its answer in it. I tried returning a serialized array first. The model treated it like data to summarize rather than facts to reason from. A sentence framed as ground truth lands better.

Disposable Heroes — Registering Tools And Keeping Context

Tools register on the session, alongside instructions. Once registered, the model decides when to call them. You don’t invoke them — you describe them and step back.

For VinylCrate’s collection chat, the session drives the UI, so it lives on a @MainActor @Observable view model. The tools hold a reference to the store; the session holds the tools.

@MainActor
@Observable
final class CollectionChatModel {
    private let session: LanguageModelSession
    var messages: [ChatMessage] = []
    var isResponding = false

    init(store: CollectionStore) {
        self.session = LanguageModelSession(
            tools: [
                ArtistLookupTool(store: store),
                SpinSuggestionTool(store: store)
            ],
            instructions: """
                You are a knowledgeable record collecting companion for the \
                user's personal vinyl collection. When a question depends on \
                what the user actually owns, use the available tools to look \
                it up. Never claim the user owns a record unless a tool \
                confirms it. Keep answers warm and conversational.
                """
        )
    }

    func send(_ text: String) async {
        guard !session.isResponding else { return }
        isResponding = true
        defer { isResponding = false }

        messages.append(ChatMessage(role: .user, text: text))
        do {
            let response = try await session.respond(to: text)
            messages.append(ChatMessage(role: .assistant, text: response.content))
        } catch {
            messages.append(ChatMessage(role: .assistant, text: friendlyMessage(for: error)))
        }
    }
}

The instructions carry one load-bearing sentence: “Never claim the user owns a record unless a tool confirms it.” That line plus the empty-result string from the tool are what keep the feature honest. Belt and suspenders. The model is being asked, explicitly, to defer to the tools on questions of fact — and the tools are built to answer “nothing” when there’s nothing.

Because it’s one long-lived session, context carries across turns. The user asks “do I own anything by Tool?” — the model calls lookupArtist, sees the result, answers. Then “what about something in that vibe for tonight?” — and that refers back to the previous turn. The session remembers the artist, the genre, the answer it gave. A fresh session per message would lose all of it. For a chat feature, one session, reused, is the whole point. (This is the opposite of my advice in the last post, where each one-shot feature got its own purpose-built session. Different shape, different rule.)

The guard !session.isResponding is not optional. The session handles one request at a time. Fire a second respond before the first finishes and the second call fails instead of queuing. The @MainActor isolation plus the guard prevent it.

Damage, Inc. — When The Tool, Or The Model, Says No

Tools add a failure mode the closed-loop features never had: the tool itself can throw. Foundation Models wraps that in ToolCallError, which carries both the offending tool and the underlying error. You want to handle it distinctly from a generation failure, because the fix is different — a thrown tool is your bug, not the model’s.

private func friendlyMessage(for error: Error) -> String {
    switch error {
    case let toolError as LanguageModelSession.ToolCallError:
        // Your code threw inside call(arguments:); log it and recover gracefully.
        // `logger` is assumed to be an os.Logger declared elsewhere in the app.
        logger.error("Tool \(toolError.tool.name) failed: \(toolError.underlyingError)")
        return "I had trouble checking your collection just now. Try again?"

    case LanguageModelSession.GenerationError.guardrailViolation:
        return "I can't help with that one."

    case LanguageModelSession.GenerationError.refusal:
        logger.notice("Model refused the request")
        return "I'd rather not answer that."

    case LanguageModelSession.GenerationError.rateLimited(_):
        return "Give me a second to catch up."

    case LanguageModelSession.GenerationError.exceededContextWindowSize(_):
        return "This conversation got long — let's start fresh."

    default:
        return "Something went sideways. Try again?"
    }
}

A few field notes.

ToolCallError is the one you’ll cause yourself. A SwiftData fetch throws, an actor hop fails, your render code force-unwraps something it shouldn’t — it all surfaces here, tagged with toolError.tool.name so you know exactly which tool fell over. Log it with the tool name. Future-you will thank present-you when one specific tool starts misbehaving in the field.

exceededContextWindowSize arrives faster with tools than without. Every tool call appends its result string into the transcript, and the transcript is the context. A long collection chat with several lookups accumulates. The fix in a chat surface is to offer a clean session — “let’s start fresh” — rather than silently truncating and confusing the model.

And the same availability check from before still gates everything. None of this runs if SystemLanguageModel.default.availability isn’t .available. Tools don’t change that: without a model there is nothing to call the tools, so there is no feature. Check first, degrade to the non-AI screen, never show a broken chat.
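
One way to structure that gate is to keep the routing decision separate from the framework query, so the decision can be tested on any machine. This is a sketch, not VinylCrate's code: ChatRoute, chatRoute, and onDeviceModelIsAvailable are hypothetical names, and the OS versions in the availability check are an assumption about where the framework ships.

```swift
#if canImport(FoundationModels)
import FoundationModels
#endif

/// Hypothetical routing decision for the chat entry point.
enum ChatRoute: Equatable {
    case aiChat
    case plainCollectionScreen
}

/// Pure decision, kept separate so it can be unit-tested without a model.
func chatRoute(modelIsAvailable: Bool) -> ChatRoute {
    modelIsAvailable ? .aiChat : .plainCollectionScreen
}

/// Asks the framework where it exists; reports false everywhere else.
func onDeviceModelIsAvailable() -> Bool {
    #if canImport(FoundationModels)
    // Assumed minimum OS versions for the framework.
    if #available(iOS 26.0, macOS 26.0, visionOS 26.0, *) {
        if case .available = SystemLanguageModel.default.availability {
            return true
        }
    }
    #endif
    return false
}
```

A view would then switch on `chatRoute(modelIsAvailable: onDeviceModelIsAvailable())` before it ever constructs a session.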

The Verdict — What Actually Works, And What’s Still Rough

I’ve been living with this in VinylCrate for a few weeks. Honest read:

Grounding works, and it’s the whole game. The model genuinely defers to the tools. “Do I own anything by X” now returns the truth instead of a confident guess. That alone justified the build. The closed-loop features could never do this, full stop.

Tool selection is mostly right, occasionally weird. The model picks the right tool the large majority of the time. But it’ll sometimes skip a tool on a question that clearly needs one, or call a tool on a general-knowledge question that didn’t. Sharpening the description field fixes most of it. It is prompt engineering, and it is iterative, and you will rewrite those descriptions more than you expect.

Latency is real and it stacks. A plain generation is a second or two. A generation that calls a tool is that plus your call execution plus a second model pass to fold the result in. Two tool calls in one turn and you’re staring at a spinner for four or five seconds on current hardware. For a chat surface that’s tolerable — people expect chat to think. For anything that needs to feel instant, tools are the wrong reach. Streaming the final response helps the perception, but the tool round-trip happens before the first token, so the wait to first token is still there.
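
The stacking can be written down as a simple model. This is a sketch with assumed numbers, not a benchmark: it supposes each tool call costs one extra model pass plus the execution of the call body, and that calls happen one after another. TurnLatencyEstimate and its default values are hypothetical.

```swift
/// Rough per-turn latency model. The defaults are illustrative assumptions.
struct TurnLatencyEstimate {
    var generationPassMs = 1_500   // assumed cost of one model pass
    var toolExecutionMs = 200      // assumed cost of one call(arguments:) body

    /// Total turn time: one pass per tool call to produce its arguments,
    /// one final pass for the answer, plus the tool executions themselves.
    func totalMs(toolCalls: Int) -> Int {
        let calls = max(toolCalls, 0)
        return (calls + 1) * generationPassMs + calls * toolExecutionMs
    }
}
```

Under these assumptions a turn with no tools takes 1.5 seconds, one tool call takes 3.2 seconds, and two take 4.9 seconds, which lands in the same four-to-five-second range described above.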

The Arguments-modeling friction is real. The optional-versus-empty-string dance is the kind of thing that feels like fighting the framework. It works. It isn’t pretty. I expect this to get better as the model improves, but today you design around it.

So: is on-device tool calling everything cloud function-calling is? No. It’s slower, the model is smaller, and you’ll do more work to keep tool selection sharp. But it queries the user’s real, private collection without a single byte leaving the device — and for VinylCrate, that’s the entire reason the app exists. The records stay where they belong. Now the model can finally read them.

The strings were always there. I just hadn’t picked them up.
