~n0mn0m/blog

976eedf6fece3c9a6ac27f162e57e7015dbbf6c7 — n0mn0m a month ago 12f6c9e
Wrapping up.

Add post wrapping up train all the things project.
A content/blog/train-all-the-things-custom-model.md => content/blog/train-all-the-things-custom-model.md +67 -0
@@ 0,0 1,67 @@
+++
title = "Train All the Things: Model Training"
description = ""
date = 2020-02-08
in_search_index = true

[taxonomies]
tags = ["hackaday", "maker", "machine-learning", "training", "train-all-the-things", "tensorflow", "esp"]
categories = ["programming"]
+++

Recently I spent some time learning how to generate synthetic voices using [espeak](@/blog/train-all-the-things-data-generation.md). After working with the tools to align the output with the TensorFlow keyword model's expectations, I was ready to train and see how well the synthetic data performed. TL;DR: not well :)

I started by training with the keywords `hi`, `smalltalk` and `on`. This gave me a known working word alongside the two synthetic ones. Training itself went well:

```text
INFO:tensorflow:Saving to "/Users/n0mn0m/projects/on-air/voice-assistant/train/model/speech_commands_train/tiny_conv.ckpt-18000"
I0330 10:34:28.514455 4629171648 train.py:297] Saving to "/Users/n0mn0m/projects/on-air/voice-assistant/train/model/speech_commands_train/tiny_conv.ckpt-18000"
INFO:tensorflow:set_size=1445
I0330 10:34:28.570324 4629171648 train.py:301] set_size=1445
WARNING:tensorflow:Confusion Matrix:
 [[231   3   3   0   4]
 [  2 178   6  29  26]
 [  3  12 146   2   2]
 [  4  17   2 352  21]
 [  2  16   7  16 361]]
W0330 10:34:32.116044 4629171648 train.py:320] Confusion Matrix:
 [[231   3   3   0   4]
 [  2 178   6  29  26]
 [  3  12 146   2   2]
 [  4  17   2 352  21]
 [  2  16   7  16 361]]
WARNING:tensorflow:Final test accuracy = 87.8% (N=1445)
W0330 10:34:32.116887 4629171648 train.py:322] Final test accuracy = 87.8% (N=1445)
```
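
For context, the run above came from the standard `speech_commands` trainer in TF 1.15. The sketch below shows roughly how such a run is invoked; the flags are real `train.py` options, but the paths and values here are illustrative — the exact ones live in the repo's training scripts.

```bash
# Sketch of the speech_commands training invocation (TF 1.15), run from the
# speech_commands example directory. Paths and values are illustrative.
python train.py \
  --data_dir=./speech_dataset \
  --wanted_words=hi,smalltalk,on \
  --model_architecture=tiny_conv \
  --preprocess=micro \
  --window_stride_ms=20 \
  --how_many_training_steps=15000,3000 \
  --train_dir=./model/speech_commands_train
```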

Despite the test accuracy, the model didn't respond well once it was loaded onto the ESP-EYE. I tried a couple more rounds with other keywords and spectrogram samples with similar results.

Because of the brute-force way I generated the audio, the synthetic training data isn't very representative of real human voices. While the experiment didn't work out, I do think generating data this way could be useful given the right amount of time and research. Instead of scaling parameters in a loop, researching the characteristics of various human voices and using those to tune the data generated via espeak could work out well. That said, it's possible the model would also pick up on characteristics of the espeak program itself. Regardless, voice data that is ready for training is still a hard problem in need of more open solutions.

Along with the way I scaled the espeak parameters, another monkey wrench is that the micro_speech model uses a CNN over a spectrogram of the input audio rather than processing the full signal. This makes it likely the model will work well for voices close to the training spectrograms but won't generalize, so picking spectrograms representative of the actual user is another key task.
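
If you want to eyeball how different a synthetic sample looks from a real recording, `sox` can render spectrograms directly; the filenames below are just examples following the `db/<word>` layout used by the generation script.

```bash
# Render spectrogram PNGs for a synthetic sample and a real recording to compare.
sox db/smalltalk/tf_0.wav -n spectrogram -o smalltalk_synthetic.png
sox my_recording.wav -n spectrogram -o smalltalk_recorded.png
```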

Because of these results and bigger issues I ended up tweaking my approach and used [`visual`](https://git.sr.ht/~n0mn0m/on-air/tree/master/voice-assistant/smalltalk/main/main_functions.cc) as my wake word, followed by on/off. All three are available in the TF speech commands dataset, and visual seems like an acceptable wake word when controlling a display. If you're working on a generic voice assistant you will want to work on audio segmentation, since many datasets are sentences, or consider using something like [Skainet](https://github.com/espressif/esp-skainet). All of this was less fun than running my own model trained from synthetic data, but I needed to keep moving forward. After a final round of training with all three words I followed the TF [docs](https://www.tensorflow.org/lite/microcontrollers) to represent the model as a C array and then flashed it onto the board with the rest of the program. Using `idf monitor` I was able to observe the model working as expected:

```text
I (31) boot: ESP-IDF v4.1
I (31) boot: compile time 13:35:43
I (704) wifi: config NVS flash: enabled
I (734) WIFI STATION: Setting WiFi configuration SSID Hallow...
I (824) WIFI STATION: wifi_init_sta finished.
I (1014) TF_LITE_AUDIO_PROVIDER: Audio Recording started
Waking up
Recognized on
I (20434) HTTPS_HANDLING: HTTPS Status = 200, content_length = 1
I (20434) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED
I (20444) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED
Going back to sleep.
Waking up
Recognized off
I (45624) HTTPS_HANDLING: HTTPS Status = 200, content_length = 1
I (45624) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED
I (45634) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED
```
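
The C array step mentioned above is the usual `xxd` conversion from the TF Lite Micro docs; the file names here are illustrative.

```bash
# Convert the exported TFLite flatbuffer into a C source file the firmware can compile in.
# The generated symbol name is derived from the input filename.
xxd -i tiny_conv.tflite > model_data.cc
```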

This was an educational experiment. It helped me put some new tools in my belt while thinking further about the problem of voice and audio processing. I developed some [scripts](https://git.sr.ht/~n0mn0m/on-air/tree/master/voice-assistant/train) to run through the full data generation, training and export cycle. Training will need to be adapted to whatever architecture somebody is using, but hopefully it's a useful starting point.

The code, docs, images etc for the project can be found [here](https://git.sr.ht/~n0mn0m/on-air) and I'll be posting updates as I continue along to [HackadayIO](https://hackaday.io/project/170228-on-air) and this blog. If you have any questions or ideas reach [out](mailto:alexander@unexpextedeof.net).
\ No newline at end of file

A content/blog/train-all-the-things-data-generation.md => content/blog/train-all-the-things-data-generation.md +77 -0
@@ 0,0 1,77 @@
+++
title = "Train All the Things: Synthetic Voice Generation"
description = ""
date = 2020-03-19
in_search_index = true

[taxonomies]
tags = ["hackaday", "maker", "machine-learning", "train-all-the-things", "data", "synthesis"]
categories = ["programming"]
+++

After getting the display and worker up and running I started down the path of training my model for keyword recognition. Right now I've settled on the wake phrase `Hi Smalltalk`. After the wake phrase is detected, the model listens for `silence`, `on`, `off`, or `unknown`.

My starting point for training the model was the [`micro_speech`](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/micro_speech) and [`speech_commands`](https://github.com/tensorflow/docs/blob/master/site/en/r1/tutorials/sequences/audio_recognition.md) tutorials that are part of the Tensorflow project. One of the first things I noticed while planning this step was the lack of good wake words in the speech commands dataset. There are [many](https://github.com/jim-schwoebel/voice_datasets) voice datasets available online, but most are unlabeled or conversational. Since digging didn't turn up much in the way of open, labeled word datasets, I decided to use `on` and `off` from the speech commands [dataset](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html), since that gave me a baseline for comparison with my custom words. After recording myself saying `hi` and `smalltalk` fewer than ten times I knew I did not want to generate my own samples at the scale of the other labeled keywords.
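
Pulling the baseline data down is straightforward; the sketch below assumes the v0.02 speech commands archive (the same one the training script can fetch on its own) and an arbitrary target directory.

```bash
# Download and unpack Google's speech commands dataset for the baseline on/off samples.
mkdir -p speech_dataset
curl -LO http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
tar -xzf speech_commands_v0.02.tar.gz -C speech_dataset
```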

Instead of giving up on my wake word combination I started digging around for options and found an interesting [project](https://github.com/JohannesBuchner/spoken-command-recognition) where somebody had started down the path of generating labeled words with text to speech. After reading through the repo I ended up using [espeak](http://espeak.sourceforge.net/) and [sox](http://sox.sourceforge.net/) to generate my labeled dataset. 

The first step was to generate the [phonemes](https://en.wikipedia.org/wiki/Phoneme) for the wake words:

```bash
$ espeak -v en -X smalltalk
 sm'O:ltO:k
```

I then stored the phonemes in a `words` file used by `generate.sh`:

```bash
$ cat words
hi 001 [[h'aI]]
busy 002 [[b'Izi]]
free 003 [[fr'i:]]
smalltalk 004 [[sm'O:ltO:k]]
```

After modifying `generate.sh` from the spoken command repo (eliminating some extra commands and extending the loop to generate more samples) I had everything I needed to synthesize a new labeled word dataset.

```bash
#!/bin/bash
# Each loop below varies an espeak parameter (dialect, pitch, words per
# minute) so that successive samples of a word differ from one another.

lastword=""

cat words | while read word wordid phoneme

do
	echo $word
	mkdir -p db/$word

	if [[ $word != $lastword ]]; then
		versionid=0
	fi

	lastword=$word

	# Generate voices with various dialects
	for i in english english-north en-scottish english_rp english_wmids english-us en-westindies
	do 
	    # Loop changing the pitch in each iteration
	    for k in $(seq 1 99)
		do
		    # Change the speed of words per minute
		    for j in 80 100 120 140 160; do 
			    echo $versionid "$phoneme" $i $j $k
			    echo "$phoneme" | espeak -p $k -s $j -v $i -w db/$word/$versionid.wav
			    # Set sox options for Tensorflow
			    sox db/$word/$versionid.wav -b 16 --endian little db/$word/tf_$versionid.wav rate 16k
			    ((versionid++))
		    done
		done
	done
done
```

After the run I had samples and labels in quantities comparable to the other words provided by Google. The pitch, speed and tone of voice change with each loop, which will hopefully provide enough variety to make this dataset useful in training. Even if it doesn't work out, learning about `espeak` and `sox` was interesting and I've already got some ideas for future uses. If it does work, the ability to generate training data on demand seems incredibly useful.
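
A quick sanity check on the output is just counting the converted files and confirming the format `sox` wrote, assuming the `db/<word>` layout from `generate.sh`:

```bash
# Count the Tensorflow-ready samples per word and spot check one file's format.
for word in hi busy free smalltalk; do
    echo "$word: $(ls db/$word/tf_*.wav | wc -l) samples"
done
soxi db/smalltalk/tf_0.wav   # expect 16 kHz, 16-bit, little endian
```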

Next up: training the model and loading it onto the ESP-EYE. The code, docs, images etc. for the project can be found [here](https://git.sr.ht/~n0mn0m/on-air) and I'll be posting updates as I continue along to [HackadayIO](https://hackaday.io/project/170228-on-air) and this blog. If you have any questions or ideas reach [out](mailto:alexander@unexpextedeof.net).
\ No newline at end of file

A content/blog/train-all-the-things-dependencies-and-bugs.md => content/blog/train-all-the-things-dependencies-and-bugs.md +73 -0
@@ 0,0 1,73 @@
+++
title = "Train All the Things: Speed bumps"
description = ""
date = 2020-02-08
in_search_index = true

[taxonomies]
tags = ["hackaday", "maker", "dependency-management", "bugs", "esp-idf", "esp", "tensorflow", "train-all-the-things"]
categories = ["programming"]
+++

As part of getting started on my project a couple of months back I took a look at which boards were supported by [Tensorflow lite](https://www.tensorflow.org/lite/microcontrollers#supported_platforms). Seeing an ESP board I went that route, since I've heard a lot about it from the maker/hacker community and thought it would be a good opportunity to learn more. Additionally, it's been quite a while since I had a project that was primarily `C/C++`, so that was exciting. Like any good project I ran into multiple unexpected bumps, bugs and issues. Some were minor, others were frustrating. I'm capturing some of those here for anybody else starting down the path of using Tensorflow Lite and an ESP32 board.

### Tensorflow speed bumps

Getting started with TF Lite is easy enough, but something I noticed as I continued to work on the project is how little is designed specifically for each platform. Instead the examples are set up with Arduino as the default, and then work is done to make that run on a given target. In the case of the `ESP-EYE` this looks like packing everything into an Arduino-compatible loop and handling it in a single FreeRTOS task. I get the reason for this, but it's also a bit of a headache later on; it feels like an anti-pattern once you start adding new tasks and event handlers.

Another bump you are likely to notice is that the TF Lite examples rely on functionality present in the TF `1.x` branch for training, but require TF `>= 2.2` for the micro libs. Not the end of the world, but it means you're going to manage multiple environments. If you manage this with `venv`/`virtualenv`, keep in mind you're going to need the `esp-idf` requirements in the 2.x environment, or just install them in both since you may find yourself switching back and forth. In addition to Python lib versions, the examples note `esp-idf 4.0`, but you will want a `>=4.0` checkout that includes [this](https://github.com/espressif/esp-idf/pull/4251) commit or you will run into compiler failures. I ended up on `4.1` eventually, but it's something to note.
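
As a rough sketch, keeping the two setups apart with `venv` might look like this; the environment names are arbitrary, and the requirements file path assumes the `esp-idf` clone location used later in this post.

```bash
# Training environment: TF 1.x for the speech_commands scripts.
python3 -m venv ~/.venvs/tf1-train
~/.venvs/tf1-train/bin/pip install 'tensorflow==1.15'

# Micro/deployment environment: TF >= 2.2 plus the esp-idf Python requirements.
python3 -m venv ~/.venvs/tf2-micro
~/.venvs/tf2-micro/bin/pip install 'tensorflow>=2.2'
~/.venvs/tf2-micro/bin/pip install -r $HOME/projects/esp-idf/requirements.txt
```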

Finally, interaction with the model feels flaky. It's an example, so this kind of makes sense, but I found that while the detected word was pretty accurate, `new_command` and some of the other attributes the model provides for the keyword weren't matching my expectations or use. I ended up using the `score` value and monitoring the model to set up the conditionals for responding to commands in my application.

Overall the examples are great to have, and walking through the train, test and load cycle is really helpful. The main thing I wish I had known up front is that the TF Arduino path for ESP is pretty much the same as the ESP native path in terms of utility and functionality, just using the `esp-idf` toolchain.

### ESP speed bumps

From the ESP side of things the core `idf` tooling is nice. I like how open it is and how much of it I can understand, which helped a few times when I ran into unexpected behavior. One thing to note: if you follow the documented path of cloning `esp-idf`, consider how you manage the release branch you use and when you merge updates. Fixes are not pushed into minor/bug-fix branches; they land on the targeted release branch when merged.
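
In practice that means sitting on a release branch and pulling when you want fixes, roughly:

```bash
# Stay on a release branch and pick up fixes as they land on it.
cd $HOME/projects/esp-idf
git checkout release/v4.1
git pull
git submodule update --init --recursive
./install.sh
```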

Being new to the ESP platform, something I didn't know when I got started was that [`esp-idf 4.x`](https://github.com/espressif/esp-idf/releases/tag/v4.0) released in February of 2020. Because of this a lot of the documentation and examples, such as [`ESP-WHO`](https://github.com/espressif/esp-who) and [`esp-skainet`](https://github.com/espressif/esp-skainet), are still based on `3.x`, which has a variety of differences in things like the TCP/network stack. Checking the version used in various docs and examples is (as usual) important. Since the TF examples reference version 4 that's where I started, but a lot of what's out there is based on v3.

One other bump somebody may run into is struct initialization in a modern toolchain when calling the underlying ESP C libraries from C++. I spent some time digging around after porting the HTTP request example into the TF C++ `command_responder` code, when the compiler complained about uninitialized struct fields whose declaration order effectively made them required.

The example code:

```c
esp_http_client_config_t config = {
        .url = "http://httpbin.org/get",
        .event_handler = _http_event_handler,
        .user_data = local_response_buffer,
};
esp_http_client_handle_t client = esp_http_client_init(&config);
esp_err_t err = esp_http_client_perform(client);
```

And how I had to do it in C++:

```c++
esp_http_client_config_t* config = (esp_http_client_config_t*)calloc(1, sizeof(esp_http_client_config_t));
config->url = URL;
config->cert_pem = unexpectedeof_casa_root_cert_pem_start;
config->event_handler = _http_event_handler;

esp_http_client_handle_t client = esp_http_client_init(config);
esp_http_client_set_method(client, HTTP_METHOD_PUT);
esp_err_t err = esp_http_client_perform(client);
```

I had a similar issue with wifi and you can see the solution [here](https://git.sr.ht/~n0mn0m/on-air/tree/master/voice-assistant/smalltalk/main/http/wifi.cc#L40).

I really enjoyed my lite trip into `idf`. It's an interesting set of components and it follows a workflow I use and appreciate. I wrote a couple of aliases that somebody might find useful:

```bash
alias adf="export ADF_PATH=$HOME/projects/esp-adf"
alias idf-refresh="rm -rf $HOME/projects/esp-idf && git clone --recursive git@github.com:espressif/esp-idf.git $HOME/projects/esp-idf && $HOME/projects/esp-idf/install.sh"
alias idf=". $HOME/projects/esp-idf/export.sh"
alias idf3="pushd $HOME/projects/esp-idf && git checkout release/v3.3 && popd && . $HOME/projects/esp-idf/export.sh"
alias idf4x="pushd $HOME/projects/esp-idf && git checkout release/v4.0 && popd && . $HOME/projects/esp-idf/export.sh"
alias idf4="pushd $HOME/projects/esp-idf && git checkout release/v4.1 && popd && . $HOME/projects/esp-idf/export.sh"
alias idf-test="idf.py --port /dev/cu.SLAB_USBtoUART flash monitor"
```

And I look forward to writing more about esp as I continue to use it in new projects.

Approaching the end of this project it's been a larger undertaking than I expected, but I've learned a lot. It's definitely generated a few new project ideas. The code, docs, images etc for the project can be found [here](https://git.sr.ht/~n0mn0m/on-air) and I'll be posting updates as I continue along to [HackadayIO](https://hackaday.io/project/170228-on-air) and this blog. If you have any questions or ideas reach [out](mailto:alexander@unexpextedeof.net).
\ No newline at end of file

A content/blog/train-all-the-things-finished.md => content/blog/train-all-the-things-finished.md +31 -0
@@ 0,0 1,31 @@
+++
title = "Train All the Things: Wrapping up"
description = ""
date = 2020-02-08
in_search_index = true

[taxonomies]
tags = ["hackaday", "maker", "machine-learning", "tensorflow", "esp", "train-all-the-things"]
categories = ["programming"]
+++

And now I'm at `v0.1` of the [`on-air`](https://git.sr.ht/~n0mn0m/on-air) project. I was able to achieve what I was hoping to along the way. I learned more about model development, tensorflow and esp. While this version has some distinct differences from what I outlined for the logic flow (keywords, VAD) it achieves the functional goal. The code, docs, images etc for the project can be found in [this](https://git.sr.ht/~n0mn0m/on-air) repo, and the project details live on [HackadayIO](https://hackaday.io/project/170228-on-air). When I get back to this project and work on `v1.x` I'll make updates available to each.


![Voice Display Demo](/images/transition_one.gif "Display responding to voice.")
![Voice Display Demo Two](/images/transition_two_speed.gif "Display responding to voice.")


A couple of thoughts after working through this in the evenings over a couple of months:

- I really should have outlined the states the ESP program was going to cycle through, and then mapped those into tasks on the FreeRTOS event loop. While the high level flow captures the external systems' behavior, the ESP has the most moving parts at the application level and is where most of the state lives.

- I want to spend some more time with C++ 14/17 understanding the gotchas of interfacing with C99. I ran into a few different struct init issues and found a few ways to solve them. I'm sure there are good reasons for the different solutions, but it's not something I've spent a lot of time dealing with, so I need to learn.

- While continuing to learn about esp-idf I want to look into some of the ESP HAL work too. I briefly explored esp-adf and Skainet while working through `on-air`. Both focus on a couple of boards but seem to have functionality that would be interesting for a variety of devices. Understanding the HAL and components better seems to be the place to start.

- Data, specifically structured data, is going to continue to be a large barrier for open models and for anybody trying to train a model for their own wants/needs. While sources like Kaggle, arXiv, data.world and others have worked to help, there's still a gulf between what I can get at home and what I can get at work. Additionally, many open datasets are numeric or text based, while video, audio and other sources are still lacking.

- Document early, document often. Too many times I got so caught up in writing code, or just getting one more thing done, that doing a thorough write-up of the issues I experienced, interesting findings, or even successful moments became difficult. I know I put this off sometimes, so different parts of the project are not as well documented, or details have been lost to the days in between.

- There's a lot of fun stuff left to explore here. I can see why I've heard so much about ESP, and I look forward to building more.

A content/blog/train-all-the-things-v01.md => content/blog/train-all-the-things-v01.md +73 -0
@@ 0,0 1,73 @@
+++
title = "Train All the Things: Version 0.1"
description = ""
date = 2020-02-08
in_search_index = true

[taxonomies]
tags = ["hackaday", "maker", "machine-learning", "tensorflow", "esp", "train-all-the-things"]
categories = ["programming"]
+++

My first commit to `on-air` shows March 3, 2020. In the weeks leading up to that commit I spent some time reading through the TF Lite documentation, playing with Cloudflare Workers K/V and getting my first setup of `esp-idf` squared away. After that it was off to the races. I outlined my original goal in the [planning](@/blog/train-all-the-things-planning.md) post. I didn't quite get to that goal. The project currently doesn't have a VAD to handle the scenario where I forget to activate the display before starting a call or hangout. Additionally, I wasn't able to train a custom keyword, as highlighted in the [custom model](@/blog/train-all-the-things-custom-model.md) post. I was however able to get a functional implementation of the concept. I can hang the display up, and then in my lab with the `ESP-EYE` plugged in use the wake word `visual` followed by `on/off` to toggle the display status.

![Voice Display Demo](/images/transition_one.gif "Display responding to voice.")
![Voice Display Demo Two](/images/transition_two_speed.gif "Display responding to voice.")

While it's not quite what I had planned, it's a foundation, and I've got a lot more tools and knowledge under my belt. Round 2 will probably involve [Skainet](https://github.com/espressif/esp-skainet), just due to the limitations in readily available voice data. Keep an eye out for a couple more posts highlighting some bumps along the way and final takeaways.

The code, docs, images etc. for the project can be found [here](https://git.sr.ht/~n0mn0m/on-air) and I'll be posting any further updates to [HackadayIO](https://hackaday.io/project/170228-on-air). For anybody interested in building this, the instructions below provide a brief outline. Updated versions will be hosted in the [repo](https://git.sr.ht/~n0mn0m/on-air/tree/master/docs). If you have any questions or ideas reach [out](mailto:alexander@unexpextedeof.net).

**Required Hardware:**

1. [ESP-EYE](https://www.espressif.com/en/products/hardware/esp-eye/overview)
    1. Optional [ESP-EYE case](https://www.thingiverse.com/thing:3586384)
2. [PyPortal](https://www.adafruit.com/product/4116)
    1. Optional [PyPortal case](https://www.thingiverse.com/thing:3469747)
3. Two 3.3V USB to outlet adapters and two USB to USB mini cables

**OR**

1. Two 3.3V micro USB wall outlet chargers

**Build Steps:**

1. Clone the [on-air](https://git.sr.ht/~n0mn0m/on-air) repo.
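
Cloning over HTTPS works fine with sourcehut:

```bash
git clone https://git.sr.ht/~n0mn0m/on-air
cd on-air
```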

**Cloudflare Worker:**

1. Setup [Cloudflare](https://www.cloudflare.com/dns/) DNS records for your domain and endpoint, or setup a new [domain](https://www.cloudflare.com/products/registrar/) with Cloudflare if you don't have one to resolve the endpoint.
2. Setup a [Cloudflare workers](https://workers.cloudflare.com/) account with worker K/V.
3. Setup the [Wrangler](https://developers.cloudflare.com/workers/tooling/wrangler) CLI tool.
4. `cd` into the `on-air/sighandler` directory.
5. Update [`wrangler.toml`](https://git.sr.ht/~n0mn0m/on-air/tree/master/sighandler/wrangler.toml).
6. Run `wrangler preview`.
7. Run `wrangler publish`.
8. Update the [`Makefile`](https://git.sr.ht/~n0mn0m/on-air/tree/master/sighandler/Makefile) with your domain and test calling (see the sketch below).
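
A rough sketch of the publish-and-test loop; the endpoint URL is whatever your DNS records and worker route resolve to, so the one below is purely illustrative.

```bash
# Deploy the worker, then poke the endpoint to confirm it responds.
cd sighandler
wrangler preview
wrangler publish
# Replace with the host/route you configured; the firmware toggles status with a PUT.
curl -i -X PUT https://yourdomain.example/your-route
curl -i https://yourdomain.example/your-route
```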

**PyPortal:**

1. Setup CircuitPython 5.x on the [PyPortal](https://circuitpython.org/board/pyportal/).
    1. If you're new to CircuitPython you should [read](https://learn.adafruit.com/welcome-to-circuitpython/circuitpython-essentials) this first.
2. Go to the directory where you cloned on-air.
3. `cd` into display.
4. Update [`secrets.py`](https://git.sr.ht/~n0mn0m/on-air/tree/master/display/secrets.py) with your WiFi information and status URL endpoint.
5. Copy `code.py`, `secrets.py` and the bitmap files in `screens/` to the root of the PyPortal.
6. The display is now good to go.
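
On a Mac the copy in step 5 is just a write to the mounted `CIRCUITPY` volume; the mount point and the `.bmp` extension are assumptions, so adjust for your OS and files.

```bash
# Copy the application, secrets and screen bitmaps onto the PyPortal.
cp display/code.py display/secrets.py /Volumes/CIRCUITPY/
cp display/screens/*.bmp /Volumes/CIRCUITPY/
```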

**ESP-EYE:**

1. Setup [`esp-idf`](https://docs.espressif.com/projects/esp-idf/en/latest/esp32/get-started/) using the 4.1 release branch.
2. Install [espeak](http://espeak.sourceforge.net/) and [sox](http://sox.sourceforge.net/).
3. Setup a Python 3.7 virtual environment and install Tensorflow 1.15.
4. `cd` into `on-air/voice-assistant/train`
5. `chmod +x orchestrate.sh` and `./orchestrate.sh`
6. Once training completes `cd ../smalltalk`
7. Activate the `esp-idf` tooling so that `$IDF_PATH` is set correctly and all requirements are met.
8. `idf.py menuconfig` and set your wifi settings.
9. Update the URL in [`toggle_status.cc`](https://git.sr.ht/~n0mn0m/on-air/tree/master/voice-assistant/smalltalk/main/http/toggle_status.cc).
    1. This should match the host and endpoint you deployed the Cloudflare worker to above.
10. `idf.py build`
11. `idf.py --port <device port> flash monitor`
12. You should see the device start, attach to WiFi and begin listening for the wake word "visual" followed by "on" or "off".
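
Condensed into commands, the ESP-EYE portion looks roughly like this; the virtual environment name and serial port are illustrative (the port matches my Mac's SLAB USB/UART driver).

```bash
# Train and export the model, then build and flash smalltalk with esp-idf 4.1.
source ~/.venvs/tf1-train/bin/activate
cd on-air/voice-assistant/train
chmod +x orchestrate.sh && ./orchestrate.sh

cd ../smalltalk
. $HOME/projects/esp-idf/export.sh      # puts idf.py and IDF_PATH on the environment
idf.py menuconfig                       # set WiFi credentials
idf.py build
idf.py --port /dev/cu.SLAB_USBtoUART flash monitor
```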


A static/images/transition_one.gif => static/images/transition_one.gif +0 -0

A static/images/transition_two.gif => static/images/transition_two.gif +0 -0

A static/images/transition_two_speed.gif => static/images/transition_two_speed.gif +0 -0