~geb/numen

Simple voice control for handsfree computing on Linux
62f8b992 — John Gebbie 4 months ago
run: silence kill
8b0e099e — John Gebbie 4 months ago
record: anchor microphone grep and use sysdefault
1ca0780e — John Gebbie 4 months ago
phrases: have "shouting" clear modifiers beforehand

#Numen

Simple voice control for handsfree computing on Linux. Words are mapped to actions:

egg:press e
troll:mod ctrl
import:run xclip -selection clipboard -o
change:eval { text="$(./dictmenu -a)" && [ "$text" ] && printf '%s\n' "nevermind type $text"; } &
@instant index:click left
@instant @cancel no
@instant @dictate dictate

And there's literal transcription too. This is the real setup I use daily, with phrases and speech recognition included. It's libre software and it's around 200 lines.
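
The mapping lines above follow a pattern that can be read mechanically. As a minimal sketch, assuming the format is only what the example shows (optional `@tags`, one spoken word, then an optional `:action`), a line could be split like this in Python. `parse_phrase_line` is a hypothetical helper for illustration, not code from Numen, and the real parser may treat edge cases differently:

```python
def parse_phrase_line(line):
    """Split one mapping line into (tags, word, action).

    Assumed format, inferred only from the README example:
    zero or more @tags, one spoken word, then an optional ":action".
    """
    # Split on the first colon only, so colons inside the action survive.
    head, colon, action = line.strip().partition(":")
    tokens = head.split()
    tags = [t[1:] for t in tokens if t.startswith("@")]
    words = [t for t in tokens if not t.startswith("@")]
    if len(words) != 1:
        raise ValueError(f"expected exactly one spoken word in {line!r}")
    # Lines such as "@instant @cancel no" carry no action at all.
    return tags, words[0], action if colon else None
```

With this sketch, `parse_phrase_line("@instant index:click left")` gives the tags `["instant"]`, the word `"index"` and the action `"click left"`.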

#The Speech Recognition

The speech recognition uses Vosk. It's open source and offline. The language model installed by ./install-vosk.sh is under 70MB unzipped.

#Requirements

Vosk currently requires a 64-bit processor.

  • An X server, which you very likely already use as your graphical environment (application windows, mouse cursor, etc.)
  • arecord from alsa-utils, to record the microphone
  • python3, the programming language
  • pip, a Python package manager (make sure it's the one for Python 3), to install the speech recognition programming interface
  • xdotool, to simulate keyboard and mouse
  • xset, to query Caps Lock
  • Optionally dmenu, used for a dictation result menu
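
To check the command-line requirements before installing, something like the following could help. It is only a sketch: the command names are taken from the list above, `missing_commands` is a hypothetical helper, and on some systems pip is installed as `pip3` instead, so treat a "missing" report as a hint and not a verdict. It does not check for a running X server.

```python
import shutil

# Command names taken from the requirements list; dmenu is optional.
REQUIRED = ["arecord", "python3", "pip", "xdotool", "xset"]
OPTIONAL = ["dmenu"]

def missing_commands(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    for name in missing_commands(REQUIRED):
        print(f"missing: {name}")
    for name in missing_commands(OPTIONAL):
        print(f"missing (optional): {name}")
```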

#Getting Started

git clone https://git.sr.ht/~geb/numen
cd numen
./install-vosk.sh  # installs the speech recognition
./run

Try writing something in your text editor, looking at phrases/char for the alphabet, numbers and symbols, and phrases/control for space, return, arrow keys, shift, etc.

#Phrases

Without any arguments passed, ./run gets phrase mappings from whatever files are in the phrases directory. I have given you:

  • char with the alphabet, numbers and symbols
  • control with space, return, backspace, left, right, pagedown, pageup, paste, shift, control, alt, etc
  • voice with speech recognition stuff: "no" to cancel your sentence, "dictate" to type a sentence literally, etc
  • wm with mappings to keybindings I use for window management (see Workflow)
  • app with a tiny amount of application-specific stuff

Words need to be in the language model's vocabulary. You can add more words to it but it looks a faff.
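
Since every spoken word has to be in the vocabulary, it can be handy to list the words your phrase files use that fall outside it. The sketch below assumes the mapping format shown at the top of this README (the spoken word is the last token before the first colon, after any `@tags`) and leaves it to you to supply the vocabulary as a collection of words; how to extract that from the model is not covered here. Both function names are hypothetical.

```python
from pathlib import Path

def spoken_words(phrases_dir):
    """Collect the spoken word of every mapping in a phrases directory."""
    words = set()
    for path in sorted(Path(phrases_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            head = line.partition(":")[0].split()
            # Assumption: the spoken word is the last token before the
            # colon; blank lines and tag-only lines are skipped.
            if head and not head[-1].startswith("@"):
                words.add(head[-1])
    return words

def words_outside_vocabulary(phrases_dir, vocabulary):
    """Return the sorted spoken words that are not in the vocabulary."""
    return sorted(spoken_words(phrases_dir) - set(vocabulary))
```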

#Files

  • model/ The language model for the speech recognition.
  • phrases/ The phrase mappings.
  • record Outputs your microphone audio data formatted well for the speech recognition. You can listen to the output with the aplay command.
  • sr The speech recognition part, prints actions when it hears phrases and conditionally dictation results.
  • handler The action handler that mainly emulates keyboard and mouse.
  • middleman A simple filter between sr and handler that writes dictation results to a file.
  • dictmenu A simple menu that outputs a chosen dictation result for use in your phrases or scripts.
  • run Runs a pipeline of record -> sr -> middleman -> handler, killing any previous instance.

I like having it all in one directory but ./run has optional arguments and options for specifying different paths.
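
To make the record -> sr -> middleman -> handler chain concrete, here is a rough Python sketch of wiring programs together so that each one's output feeds the next. It illustrates the idea only and is not how ./run is written; `start_pipeline` is a hypothetical helper, and it does not handle killing a previous instance.

```python
import subprocess

def start_pipeline(commands, stdout=None):
    """Start each command with its stdout piped into the next one's stdin.

    `commands` is a list of argv lists. `stdout` is where the last
    stage writes (None means inherit the terminal).
    """
    procs = []
    upstream = None  # stdout of the previous stage, None for the first
    for index, argv in enumerate(commands):
        is_last = index == len(commands) - 1
        proc = subprocess.Popen(
            argv,
            stdin=upstream,
            stdout=stdout if is_last else subprocess.PIPE,
        )
        if upstream is not None:
            # Close our copy so the earlier stage notices if this one exits.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    return procs
```

Calling `start_pipeline([["./record"], ["./sr"], ["./middleman"], ["./handler"]])` from the repository directory would mirror the chain described above, assuming each script runs without arguments.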

#Workflow

As you can see, this isn't "call Randy and apologize profusely" but a keyboard (and simple mouse) alternative. Numen requires a good text editor to be effective for programming and writing. I'd recommend Neovim. It's keyboard based, with commands like indent paragraph, go to the next j, capitalize four lines, as well as normal editing. For example, to delete text in brackets you type dib, so we'd say "drum ice bat". Neovim is the great-grandchild of Vi, and there's a Vi style program for nearly everything. I browse the web with qutebrowser (you type the two keys next to a link to click it) and documents with zathura. There are also Vi modes for shells, and I even use a Vi style extension to select stuff on my terminal.
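
As a small illustration of how "drum ice bat" ends up as dib, the sketch below turns spoken words into an xdotool command line. The four word-to-key pairs are the only ones this README mentions (the full set lives in phrases/char), `xdotool_command` is a hypothetical name, and the real handler may well call xdotool differently.

```python
# Only these word-to-key pairs appear in this README.
WORD_TO_KEY = {"drum": "d", "ice": "i", "bat": "b", "egg": "e"}

def xdotool_command(spoken):
    """Build an `xdotool key ...` argv for a run of spoken words."""
    # An unknown word raises KeyError, much as an unmapped word does nothing useful.
    keys = [WORD_TO_KEY[word] for word in spoken.split()]
    return ["xdotool", "key", *keys]

print(xdotool_command("drum ice bat"))  # ['xdotool', 'key', 'd', 'i', 'b']
```

Passing the resulting list to `subprocess.run` would press the keys in the focused window.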

My window manager is bspwm, which arranges windows so you don't have to drag them about, and lets you control everything with keybindings. I use the phrases in phrases/wm to navigate windows, close them, switch desktop, etc. These phrases just simulate keychords I have bspwm listen for; this way everything still works with a keyboard too. I've made an installer for it here.

#Limitations

Numen is limited to the X environment. As said above, the words have to be in the language model's vocabulary.

The Vosk API is strangely unable to do multiword keyphrases, so it's single word phrases only, but I think this will change. It would be nice for rarer things like "function 1" for F1 and "please mount" to mount devices.