Voice API v1

The what3words Voice API allows a user to say three words into any application or service and receive back a list of suggested what3words addresses, all through a single API call.

Utilising WebSockets for realtime audio streaming, and powered by the Speechmatics WebSocket Speech API, the fast and simple interface provides a powerful AutoSuggest function that can validate and autocorrect user input and limit it to certain geographic areas.

All coordinates are latitude,longitude pairs in standard WGS-84 as commonly used worldwide in GPS systems. All latitudes must be in the range of -90 to 90 (inclusive).

Overview

Get started

  1. Select our free plan and create an API key
  2. Add a Voice API plan in your account

Quick Start

A simple standalone Python application has been provided to demonstrate the capabilities of the Voice API. It enables the recording of a what3words address using the microphone (audio is streamed in realtime to the API), with a list of suggestions returned to the console.

WebSockets

WebSockets are used to provide a two-way transport layer between your client and the what3words Voice API, enabling use with most modern web browsers and programming languages. See RFC 6455 for the detailed specification of the WebSocket protocol.

The wire protocol used with the WebSocket consists mostly of packets of stringified JSON objects comprising a message name plus other fields that are message-dependent. The only exception is the binary message used for transmitting audio.

You can develop your real-time client using any programming language that supports WebSockets. This reference provides a list of the messages required for Client and Server communication. Some messages must be sent in a particular order (outlined below), whilst others are optional.

Resource URL

wss://voiceapi.what3words.com/v1/autosuggest
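
A minimal connection sketch in Python, assuming the third-party websockets package (the API key below is a placeholder; voice-language is mandatory, as described under Configuration):

import asyncio

import websockets  # third-party: pip install websockets

# Placeholder key; voice-language is mandatory (see Configuration below).
URL = "wss://voiceapi.what3words.com/v1/autosuggest?key=YOUR-API-KEY&voice-language=en"

async def main():
    async with websockets.connect(URL) as ws:
        # The connection is now open; the next step is to send a
        # StartRecognition message (see Messages below).
        print("connected")

asyncio.run(main())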

Configuration

Voice Language

The voice-language request parameter is mandatory. The language code provided is used to configure both the Speechmatics ASR and the what3words AutoSuggest algorithm.

Please provide one of the following voice-language codes.

Voice Language Code   Description
ar                    Arabic
cmn                   Mandarin Chinese
de                    German
en                    Global English
es                    Spanish
hi                    Hindi
ja                    Japanese
ko                    Korean

Clipping and Focus

Our clipping allows you to specify a country (or list of countries) and/or geographic area to exclude results that are not likely to be relevant to your users. To give a more targeted, shorter set of results to your users, we recommend you use the clipping parameters. If you know your user's current location, we also strongly recommend that you use focus to return results that are likely to be more relevant (i.e. results near the user).

In summary, the clipping policy is used to optionally restrict the list of candidate AutoSuggest results; if focus has been supplied, it is then used to weight the results in order of relevance to the focus.

Multiple clipping policies can be specified, though only one of each type. For example you can clip to country and clip to circle in the same AutoSuggest call, and it will clip to the intersection of the two (results must be in the circle AND in the country). However, you can't specify two clip-to-circle policies in the same call.

Parameters

key (required)
  A valid API key; if not supplied as a parameter, a key must be supplied as a request header.
  Example: key=[API-KEY]

voice-language (required)
  One of: ar, cmn, de, en, es, hi, ja or ko.
  Example: voice-language=en

focus (optional)
  A location, specified as latitude,longitude (often where the user making the query is). If specified, the results will be weighted to give preference to those near the focus. For convenience, longitude is allowed to wrap around the 180 line, so 361 is equivalent to 1.
  Example: focus=51.521251,-0.203586

clip-to-country (optional)
  Restricts AutoSuggest to only return results inside the countries specified by a comma-separated list of ISO 3166-1 alpha-2 country codes (for example, to restrict to Belgium and the UK, use clip-to-country=GB,BE). Both uppercase and lowercase codes are accepted; each entry must be two a-z letters. WARNING: if a two-letter code does not correspond to a country, there is no error: the API simply returns no results.
  Example: clip-to-country=NZ,AU

clip-to-bounding-box (optional)
  Restricts AutoSuggest results to a bounding box, specified by coordinates south_lat,west_lng,north_lat,east_lng, where south_lat must be less than or equal to north_lat, and west_lng less than or equal to east_lng. In other words, latitudes and longitudes should be specified in order of increasing size. Lng is allowed to wrap, so that you can specify bounding boxes which cross the anti-meridian: -4,178.2,22,195.4.
  Example: clip-to-bounding-box=51.521,-0.343,52.6,2.3324

clip-to-circle (optional)
  Restricts AutoSuggest results to a circle, specified by lat,lng,kilometres, where kilometres is the radius of the circle. For convenience, longitude is allowed to wrap around 180 degrees; for example, 181 is equivalent to -179.
  Example: clip-to-circle=51.521,-0.343,142

clip-to-polygon (optional)
  Restricts AutoSuggest results to a polygon, specified by a comma-separated list of lat,lng pairs. The polygon should be closed, i.e. the first element should be repeated as the last element; the list should contain at least 4 entries. The API is currently limited to accepting up to 25 pairs.
  Example: clip-to-polygon=51.521,-0.343,52.6,2.3324,54.234,8.343,51.521,-0.343
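
To sketch how these parameters combine, the following Python snippet builds a request URL with focus plus two clipping policies (the key and all values are illustrative placeholders):

import urllib.parse

params = {
    "key": "YOUR-API-KEY",  # placeholder
    "voice-language": "en",
    "focus": "51.521251,-0.203586",
    "clip-to-country": "GB,BE",
    "clip-to-circle": "51.521,-0.343,142",
}
# safe="," keeps the comma-separated values readable in the URL.
url = "wss://voiceapi.what3words.com/v1/autosuggest?" + urllib.parse.urlencode(params, safe=",")
print(url)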

Client ↔ API endpoint

The communication is done using WebSockets, which are implemented in most modern web browsers as well as in many common programming languages.

Messages

Each message that the Server accepts is a stringified JSON object with the following fields:

  • message (String): The name of the message we are sending. Any other fields depend on the value of the message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is a binary message sent from the Client to the Server containing a chunk of audio which will be referred to as AddAudio.

The following values of the message field are supported:

StartRecognition

Initiates recognition, based on details provided in the following fields:

  • message: "StartRecognition"
  • audio_format (Object:AudioType): Required. Audio stream type you are going to send: see Supported audio types.

A StartRecognition message must be sent exactly once after the WebSocket connection is opened. The client must wait for a RecognitionStarted message before sending any audio.

In case of success, a message with the following format is sent as a response:

  • message: "RecognitionStarted"
  • id (String): Required. A randomly-generated GUID which acts as an identifier for the session, e.g. "807670e9-14af-4fa2-9e8f-5d525c22156e".

In case of failure, an error message is sent, with type being one of the following: invalid_model, invalid_audio_type or job_error.

An example of the StartRecognition message:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  }
}

The example above starts a session ready to consume raw PCM-encoded audio with float samples at 16kHz.

An example of the RecognitionStarted message a client will receive following the successful receipt of a StartRecognition message:

{
  "message": "RecognitionStarted",
  "id": "807670e9-14af-4fa2-9e8f-5d525c22156e"
}
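
Putting the two messages together, a minimal Python sketch of this handshake might look as follows (ws is an open connection, as in the earlier snippet):

import json

async def start_recognition(ws):
    # Declare the audio format for the session.
    await ws.send(json.dumps({
        "message": "StartRecognition",
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 16000,
        },
    }))
    # Block until the server confirms; no audio may be sent before this.
    reply = json.loads(await ws.recv())
    assert reply["message"] == "RecognitionStarted", reply
    return reply["id"]  # session GUID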

AddAudio

Adds more audio data to the recognition job started on the WebSocket using StartRecognition. The server will only accept audio after it is initialized with a job, which is indicated by a RecognitionStarted message. Only one audio stream in one format is currently supported per WebSocket (and hence one recognition job). AddAudio is a binary message containing a chunk of audio data and no additional metadata.
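
As an illustration, here is a sketch that streams a pre-recorded WAV file as AddAudio chunks (a stand-in for live microphone capture; the chunk size is an arbitrary choice, and the file is assumed to match the declared audio_format):

import wave

async def send_audio(ws, path, chunk_frames=4096):
    # Assumes the WAV file matches the audio_format declared in
    # StartRecognition (e.g. pcm_s16le, mono, 16000 Hz).
    with wave.open(path, "rb") as w:
        while True:
            frames = w.readframes(chunk_frames)
            if not frames:
                break
            # Each AddAudio is a plain binary frame: raw bytes, no metadata.
            await ws.send(frames)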

AudioAdded

If the AddAudio message is successfully received, an AudioAdded message is sent as a response. The following fields are present in the response:

  • message: "AudioAdded"
  • seq_no (Int): Required. An incrementing number which is equal to the number of audio chunks that the server has processed so far in the session. The count begins at 1, meaning that the 5th AddAudio message sent by the client, for example, should be answered by an AudioAdded message with seq_no equal to 5.

Please note: following the completion of a three word utterance, automatic end-of-speech detection will engage, at which point a Suggestions message will be returned. A Suggestions message is the last response a client will receive from the server; any outstanding AudioAdded acknowledgements that have not yet been transmitted will be left unsent.
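
As an illustration of the seq_no contract, a client could track its own chunk count and sanity-check each acknowledgement (a minimal sketch; sent_chunks is assumed to be maintained by the caller):

def check_audio_added(msg, sent_chunks):
    # seq_no counts processed chunks from 1, so it can never exceed
    # the number of AddAudio frames the client has sent so far.
    assert msg["message"] == "AudioAdded"
    assert 1 <= msg["seq_no"] <= sent_chunks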

Suggestions

This message is sent from the Server to the Client when it's been determined that an utterance is complete. The following fields are present in the response:

  • message: "Suggestions"
  • suggestions (Object[]): Required. A list of suggestion objects, each describing a valid what3words address (see the example below). The list of returned suggestions can be fine-tuned by the request parameters.

An example of the Suggestions message a client will receive upon successful transcription:

{
  "message": "Suggestions",
  "suggestions": [
    {
      "country": "GB",
      "nearestPlace": "West Bromwich, Sandwell",
      "words": "star.words.forced",
      "distanceToFocusKm": 166,
      "rank": 1,
      "language": "en"
    },
    {
      "country": "AU",
      "nearestPlace": "Katherine, Northern Territory",
      "words": "star.wards.force",
      "distanceToFocusKm": 14079,
      "rank": 2,
      "language": "en"
    },
    {
      "country": "US",
      "nearestPlace": "Rossmoor, Maryland",
      "words": "star.words.force",
      "distanceToFocusKm": 5880,
      "rank": 3,
      "language": "en"
    }
  ]
}
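
A client might unpack this message as follows (a minimal sketch using the field names from the example above):

def handle_suggestions(payload):
    # Suggestions arrive ranked; pick out the top-ranked entry.
    best = min(payload["suggestions"], key=lambda s: s["rank"])
    print("best match:", best["words"], "near", best["nearestPlace"])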

Supported audio types

An AudioType object always has three mandatory fields: type, encoding and sample_rate.

The following values are supported:

type

  • raw

encoding

  • pcm_f32le - Corresponds to 32 bit float PCM used in the WAV audio format, little-endian architecture.
  • pcm_s16le - Corresponds to 16 bit signed integer PCM used in the WAV audio format, little-endian architecture.
  • mulaw - Corresponds to 8 bit μ-law (mu-law) encoding.

sample_rate

  • (Int): Sample rate of the audio in Hz.
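
Before declaring an audio_format it can be worth sanity-checking the source audio. A sketch using Python's standard wave module, for a file intended to be sent as pcm_s16le (the file name is hypothetical):

import wave

with wave.open("address.wav", "rb") as w:  # hypothetical file name
    assert w.getnchannels() == 1, "expected mono audio"
    assert w.getsampwidth() == 2, "2 bytes per sample matches pcm_s16le"
    sample_rate = w.getframerate()         # value to declare in StartRecognition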

Error Handling

Error states fall into two categories: errors during the initial connection, before a StartRecognition message is sent, and errors during WebSocket streaming.

Connection Errors

Errors during the initial connection will cause the WebSocket connection to close immediately, with a status code greater than 1000. As well as providing a status code in the error range, a reason will also be provided.

Error reasons produce JSON output in the following format:

{
  "code": "MissingKey",
  "message": "Authentication failed; missing required API key parameter or header"
}

The following code values could be returned:

  • MissingKey
  • InvalidKey
  • SuspendedKey
  • BadInput
  • NotFound
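
Because these errors surface as an early close of the WebSocket, one way to report them with the Python websockets package is to catch the close exception (a sketch; it assumes the close reason carries the JSON payload shown above):

import json

import websockets  # third-party: pip install websockets

async def run_session(url):
    try:
        async with websockets.connect(url) as ws:
            ...  # StartRecognition / AddAudio, as in the sketches above
    except websockets.exceptions.ConnectionClosedError as exc:
        # exc.code holds the close status (> 1000); the reason is assumed
        # to carry the JSON error payload shown above.
        detail = json.loads(exc.reason) if exc.reason else {}
        print("rejected:", exc.code, detail.get("code"), detail.get("message"))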

Streaming Errors

Once a WebSocket connection has been established, it is possible that an error will be returned during communication with the Server.

Streaming error messages have the following fields:

  • message: "Error"
  • code - (Int): Optional. A numerical code for the error.
  • type - (String): Required. A code for the error message. See the list of possible errors below.
  • reason - (String): Required. A human-readable reason for the error message.

Error types

The following type values could be returned:

  • invalid_message - The message received was not understood.
  • invalid_audio_type - The audio type is not supported, is deprecated, or is malformed.
  • job_error - Unable to do any work on this request; the Server might have timed out, etc.
  • data_error - Unable to accept the data specified - usually because there is too much data being sent at once.
  • buffer_error - Unable to fit the data in a corresponding buffer. This can happen for clients sending the input data faster than realtime.
  • protocol_error - Message received was syntactically correct, but could not be accepted due to protocol limitations. This is usually caused by messages sent in the wrong order.
  • unknown_error - An error that did not fit any of the types above.

Note that invalid_message, protocol_error and unknown_error can be triggered in response to any type of message.

The connection is closed after any error.
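
In practice a client can dispatch on the message field in its receive loop and treat Error as terminal (a minimal sketch; the Info handling anticipates the next section):

import json

async def receive_loop(ws):
    # Dispatch on the message field; Error is terminal, since the
    # server closes the connection after any error.
    async for raw in ws:
        msg = json.loads(raw)
        kind = msg["message"]
        if kind == "AudioAdded":
            continue                     # progress acknowledgement only
        if kind == "Info":               # see Info Messages below
            print("info:", msg.get("type"), msg.get("reason"))
        elif kind == "Suggestions":
            return msg["suggestions"]    # final message of the session
        elif kind == "Error":
            raise RuntimeError(f"{msg['type']}: {msg['reason']}")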

Info Messages

Info messages denote additional information sent from the Server to the Client. They are similar to Error messages in syntax, but do not denote any problem. The Client can safely ignore these messages or use them for additional client-side logging.

  • message: "Info"
  • code - (Int): Optional. A numerical code for the informational message.
  • type - (String): Required. A code for the info message. See the list of possible info messages below.
  • reason - (String): Required. A human-readable reason for the informational message.

Info message types

  • recognition_quality - Informs the Client which quality-based model is used to handle the recognition. It provides the following extra field:
    • quality - (String): Quality-based model name; one of "telephony" or "broadcast". The model is selected automatically: the broadcast model is used for high-quality audio (12kHz+), and the telephony model for lower-quality audio.

Example communication

The communication consists of three stages: initialization, transcription and final suggestion.

On initialization, the StartRecognition message is sent from the Client to the API and the Client must block and wait until it receives a RecognitionStarted message.

Afterwards, the transcription stage happens. The client keeps sending AddAudio messages. The API asynchronously replies with AudioAdded messages.

The API will automatically detect the completion of a three word utterance, at which point the Server sends a Suggestions message as its last message. No more messages are handled by the API afterwards, and the connection is closed by the Server. At this point, the Client can also safely disconnect.

Note: In the example below, -> denotes a message sent by the Client to the API, <- denotes a message sent by the API to the Client. Any comments are denoted "[like this]".

-> {
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  }
}

<- {
  "message": "RecognitionStarted",
  "id": "807670e9-14af-4fa2-9e8f-5d525c22156e"
}

-> "[binary message - AddAudio 1]"
-> "[binary message - AddAudio 2]"

<- {
  "message": "AudioAdded",
  "seq_no": 1
}

<- {
  "message": "Info",
  "type": "recognition_quality",
  "quality": "broadcast",
  "reason": "Running recognition using a broadcast model quality."
}

<- {
  "message": "AudioAdded",
  "seq_no": 2
}

"[asynchronously received transcripts]"

<- {
  "message": "Suggestions",
  "suggestions": [
    {
      "country": "GB",
      "nearestPlace": "West Bromwich, Sandwell",
      "words": "star.words.forced",
      "distanceToFocusKm": 166,
      "rank": 1,
      "language": "en"
    },
    {
      "country": "AU",
      "nearestPlace": "Katherine, Northern Territory",
      "words": "star.wards.force",
      "distanceToFocusKm": 14079,
      "rank": 2,
      "language": "en"
    },
    {
      "country": "US",
      "nearestPlace": "Rossmoor, Maryland",
      "words": "star.words.force",
      "distanceToFocusKm": 5880,
      "rank": 3,
      "language": "en"
    }
  ]
}

"[WebSocket closes]"
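
Tying the three stages together, here is a compact end-to-end sketch of this exchange in Python (using the third-party websockets package; the key and WAV file name are placeholders, and a pre-recorded file stands in for live microphone capture):

import asyncio
import json
import urllib.parse
import wave

import websockets  # third-party: pip install websockets

PARAMS = {"key": "YOUR-API-KEY", "voice-language": "en"}  # placeholder values
URL = "wss://voiceapi.what3words.com/v1/autosuggest?" + urllib.parse.urlencode(PARAMS)

async def main(path="address.wav"):  # hypothetical 16kHz mono pcm_s16le WAV
    async with websockets.connect(URL) as ws:
        with wave.open(path, "rb") as w:
            # 1. Initialization: declare the audio format, wait for the ack.
            await ws.send(json.dumps({
                "message": "StartRecognition",
                "audio_format": {
                    "type": "raw",
                    "encoding": "pcm_s16le",
                    "sample_rate": w.getframerate(),
                },
            }))
            started = json.loads(await ws.recv())
            assert started["message"] == "RecognitionStarted", started

            # 2. Transcription: stream binary AddAudio chunks, paced
            #    roughly in real time to avoid buffer_error.
            while chunk := w.readframes(4096):
                await ws.send(chunk)
                await asyncio.sleep(4096 / w.getframerate())

        # 3. Final suggestion: read until the Suggestions message arrives.
        async for raw in ws:
            msg = json.loads(raw)
            if msg["message"] == "Suggestions":
                for s in msg["suggestions"]:
                    print(s["rank"], s["words"], s["nearestPlace"])
                break
            if msg["message"] == "Error":
                raise RuntimeError(msg)

asyncio.run(main())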