Voice and speaker identification

Ostler captures voice in two narrow places: the Companion's optional conversation capture, and the Watch app's quick-capture button. Where voice is captured, Ostler does speaker identification only – matching a voice against people you have already told the Hub about. It does not infer mood, emotion, sentiment, or stress. This page is the precise version of that promise, including the regional consent rules.

What this page is for

The privacy nutrition label in the App Store is a one-line summary. This page is the long version: what voice processing actually does, what it does not do, and which consent gate fires before any of it runs, depending on where you are in the world.

What we mean by speaker identification

Speaker identification is the technical task of matching a recorded voice sample against a known speaker. Ostler uses speaker identification for one purpose: to label who said what in a captured conversation.

If you record a meeting with three people, Ostler's transcript wants to attribute lines to Alice, Bob, and You, not to "Speaker 1", "Speaker 2", "Speaker 3". To do that, the Hub keeps a small voice profile for each person you choose to tag, and the conversation pipeline matches new audio segments against those profiles.

The voice profile is a numerical representation – a vector – not a recording of the person. It is computed locally on the Hub from samples you provided. It cannot be played back as audio. It cannot be reverse-engineered into the original speech.
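To make "a vector, not a recording" concrete, here is a minimal sketch of how matching a new audio segment's embedding against saved profiles could work. The function names, the embedding format, and the threshold value are illustrative assumptions, not Ostler's actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_speaker(segment_embedding, profiles, threshold=0.75):
    """Return the best-matching profile name, or None if nothing clears
    the threshold (in which case the turn stays anonymously labelled).

    `profiles` maps a person's name to their stored voice embedding.
    """
    best_name, best_score = None, threshold
    for name, profile in profiles.items():
        score = cosine_similarity(segment_embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

The key property the sketch illustrates: the stored profile is just a list of numbers compared by similarity. There is no inverse function from the vector back to speech.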

What speaker ID is not

There is a class of voice features that adjacent products ship and Ostler deliberately does not. We list them explicitly so the line is clear.

| Feature | Ostler | Sometimes seen elsewhere |
| --- | --- | --- |
| Identify who is speaking (against your own labelled profiles) | Yes – this is the entire feature | Yes |
| Infer emotion or mood from voice | No | Yes (commercial sentiment APIs) |
| Infer stress, fatigue, drunkenness | No | Yes (some HR / call-centre vendors) |
| Match against a stranger's voice | No | Yes (forensic / surveillance tooling) |
| Train a voice model on your audio | No | Often opted-in by default |
| Send audio off-device | No | Standard for cloud STT vendors |

The first row is the product. Everything below it is a category of voice processing Ostler is built not to do, and the Hub has no code path that does any of them.

Where voice is captured

There are two voice surfaces in v0.1:

Companion conversation capture

When you start a conversation capture in the iPhone Companion, microphone audio is encoded locally and streamed to the Hub over your local network (or over Tailscale when you are away from home). The Hub:

  1. Runs a local speech-to-text model (Whisper) on the audio.
  2. Segments the transcript by speaker turn.
  3. Matches each turn against your saved voice profiles.
  4. Writes the labelled transcript to your local conversation store.
  5. Discards the audio buffer.

The audio is not retained by default. The transcript is. Both stay on the Hub.
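The five numbered steps can be sketched as a single function. The callables passed in stand in for the Hub's real Whisper, speaker-matching, and storage components (all names here are hypothetical); the sketch exists to show the ordering, the anonymous-label fallback, and that the audio buffer is dropped at the end.

```python
def process_capture(audio_buffer, transcribe, match_speaker, save_transcript):
    """Sketch of the conversation pipeline described above.

    transcribe: audio -> list of speaker turns (steps 1-2)
    match_speaker: turn embedding -> profile name or None (step 3)
    save_transcript: writes the labelled turns locally (step 4)
    """
    turns = transcribe(audio_buffer)
    for n, turn in enumerate(turns, start=1):
        speaker = match_speaker(turn["embedding"])
        # Fall back to an anonymous label when no profile matches
        # (or when speaker identification is toggled off).
        turn["speaker"] = speaker or f"Speaker {n}"
    save_transcript(turns)
    audio_buffer.clear()  # step 5: the audio is discarded, not retained
    return turns
```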

Watch quick-capture

Tapping the quick-capture complication on the Watch records a short voice memo and sends it to the iPhone Companion, which forwards it to the Hub via the same local-network channel. Same flow as above: Whisper transcription, speaker matching, transcript saved, audio discarded.

What never leaves the device

Even with voice features enabled, none of the following ever crosses the wire:

  • Raw audio. Audio is transcribed locally on the Hub and the audio buffer is dropped after transcription. There is no audio upload to Creative Machines or to any third party.
  • Voice profiles. Your speaker embeddings live in the Hub's encrypted local store, alongside the people they are tagged to.
  • Transcripts. Conversation transcripts are written to local Markdown files on the Hub. They sit inside the same encryption layers as the rest of your knowledge graph.

If you disconnect the Hub from the internet, voice capture and speaker identification both keep working in full. They do not depend on a cloud STT service or a hosted speaker-ID model.

Regional consent rules

Voice processing carries different regulatory weight depending on where you are. Ostler's defaults vary by region, with the strictest rules applied first.

United States and United Kingdom

Voice features are default-on when you turn on conversation capture, with a simple opt-out in the Companion's privacy settings. The reasoning: you have already chosen to capture audio in the first place; speaker identification is the labelling step that makes the capture useful.

If you would rather have the unlabelled "Speaker 1 / Speaker 2" transcripts and skip the speaker-matching step:

Settings > Privacy > Speaker identification

Toggle it off and the Hub stops matching voices to profiles. New transcripts come back with anonymous speaker labels you can rename by hand.

European Union, EEA, and Switzerland

Voice characteristics are special-category data under Article 9 of the GDPR. The bar for processing them lawfully is higher than for other personal data – essentially, "you must explicitly consent, and the consent has to be informed and specific."

Ostler honours that with an explicit consent gate. On EU / EEA / Swiss installs:

  • Voice features (capture, transcription, speaker identification) are default-off.
  • The first time you try to start a conversation capture, the Companion shows a consent screen describing exactly what will be processed and where.
  • You must tick a clearly-labelled checkbox to enable voice processing. Without that tick, the Hub refuses to accept audio and the Companion does not start the recording.
  • The consent decision is recorded in ~/.ostler/posture/consent.json on the Hub so it can be cited later.
  • You can revoke the consent at any time from the same Privacy settings panel, which immediately stops voice processing and clears the saved consent flag.

The consent screen does not pre-tick the box. There is no "consent by silence" path.
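A minimal sketch of the Hub-side consent gate, assuming a JSON schema for consent.json that is illustrative only (the real field names may differ). The important behaviours are the ones stated above: a missing or unreadable file means no consent, and revocation fails closed.

```python
import json
from pathlib import Path

# Path from the documentation; the schema below is an assumption.
CONSENT_PATH = Path.home() / ".ostler" / "posture" / "consent.json"

def voice_consent_granted(path=CONSENT_PATH):
    """Return True only if an explicit, unrevoked consent record exists.

    No file, an unreadable file, or a revoked record all mean "no
    consent" – there is no consent-by-silence path, so the Hub would
    refuse to accept audio in every one of those cases.
    """
    try:
        record = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return False
    return record.get("voice_processing") is True and not record.get("revoked", False)
```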

Other regions

We default to the EU posture (consent gate) anywhere we are not certain that the simpler US / UK default applies. If you find your install is showing the consent gate when you would prefer the default-on behaviour, the region detection is configurable in ~/.ostler/config/.env; see reference / configuration.
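For illustration, a region override in that file might look like the fragment below. The variable name is hypothetical – check reference / configuration for the actual key your Hub version uses.

```shell
# ~/.ostler/config/.env
# Hypothetical key name; see reference / configuration for the real one.
OSTLER_REGION=us
```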

What is logged

When voice processing runs, the Hub writes a short audit-log entry – timestamp, event type ("voice capture started", "speaker matched", "consent granted"), and a hash of the transcript identifier. The log entry does not contain the audio, the transcript, or the speaker name in plain text. It is there so that, if you ever want to reconstruct what voice processing happened on what day, you can.
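The shape of such an audit entry can be sketched as follows. The field names are illustrative assumptions; the point is that only a timestamp, an event type, and a hash of the transcript identifier are recorded – never the audio, the transcript text, or a plain-text speaker name.

```python
import hashlib
from datetime import datetime, timezone

def audit_entry(event_type, transcript_id):
    """Build an audit-log record of the kind described above.

    The transcript identifier is stored only as a SHA-256 hash, so the
    log can show that processing happened without revealing what was
    processed.
    """
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event_type,
        "transcript_hash": hashlib.sha256(transcript_id.encode()).hexdigest(),
    }
```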

The audit log is itself encrypted at rest in the Hub's SQLCipher store. See architecture / encryption for the details.

Disabling voice features entirely

If you do not want any voice processing at all – no capture, no STT, no speaker identification – the simplest path is:

  1. Companion: Settings > Privacy > Voice features > Off. This disables the capture button on the iPhone and the quick-capture complication on the Watch.
  2. OS level: revoke the microphone permission in iOS Settings > Privacy & Security > Microphone > Ostler Companion > Off. This is the belt-and-braces option.

The Hub's voice pipeline only runs when audio is sent to it. With the Companion's mic disabled, no audio is sent, and nothing is processed.

App Store privacy nutrition label

For transparency, this is exactly what the Companion's App Store privacy label declares for v0.1:

| Data type | Linked to identity | Used for tracking | Purpose |
| --- | --- | --- | --- |
| Health | No | No | App functionality |
| Fitness | No | No | App functionality |
| Precise location | No | No | App functionality |
| Photos or videos | No | No | App functionality |
| Other user content | No | No | App functionality |
| Media library | No | No | App functionality |

"Other user content" is the bucket Apple's nutrition label uses for transcripts and similar generated content. None of these categories are linked to your identity, none are used to track you across apps or websites, and all are used only to make the app function.

The full machine-readable declaration is in the app's PrivacyInfo.xcprivacy and is what App Review checks against.