The process is usually straightforward. I select the part of a WAV file from a voice acting session that best encapsulates a given word for use in my speech synthesis, and save it as a separate file. This file is named after that word and placed in an appropriate folder.

Below are screenshots of both ends of the word "acts", zoomed in so that I can ensure that it begins and ends with silence.
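For anyone who would rather check this in code than by eye, here is a minimal sketch in Python. It is not part of my own workflow, which is done by zooming in inside an audio editor. The function names, the 5 ms window and the threshold of 200 (out of 32767) are all assumptions chosen for illustration, and the sketch only handles 16-bit mono PCM files.

```python
import array
import sys
import wave


def read_mono_16bit(path):
    """Return (samples, frame_rate) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def edges_are_silent(samples, rate, window_ms=5, threshold=200):
    """True if the first and last window_ms of audio stay within +/- threshold.

    window_ms and threshold are illustrative guesses, not measured values.
    """
    n = max(1, rate * window_ms // 1000)
    head = samples[:n]
    tail = samples[-n:]
    return (all(abs(s) <= threshold for s in head)
            and all(abs(s) <= threshold for s in tail))
```

Something like `edges_are_silent(*read_mono_16bit("acts.wav"))` would then flag a clip that starts or ends mid-sound, which is where audible clicks tend to come from.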

You may have noticed that the word I have extracted does not include all the sound that the voice actor made when saying that word. This is often the case, because actors read words out individually and placed more emphasis on them, effectively treating them as words at the ends of sentences. Naturally, I do sometimes include all of the recorded word if it is a noun and therefore certain to be at the end of a sentence (within the limited set of sentence structures I'm using).

Which elements of the original spoken words I use may also have consequences for how my work is read. If I go to a great effort to make the voices sound real by leaving emphasis, breathing, tics and trailing off in the final words, people may assume that the main focus of my work is how people speak. Conversely, if I strip my work of the natural irregularities of spoken language, people might assume that I'm more concerned with the overarching system that dictates what they can say. To put it another way: do I want people to think I'm looking at social structure or social agency? What about both? I must also consider the fact that my limited technical abilities prevent me from seamlessly blending different words together, which results in a talking clock effect. Therefore, including irregular elements of speech may further the sense of machine-speech, because the different words will seem even more disparate in source. This problem doesn't have an easy solution, though fortunately the tension can be read as useful, in that I tend to think of this project as exploring the mixing of abstract models of human social identity with more concrete social processes.

Problems and Solutions

Occasionally, things go wrong. Words are missing or mispronounced in my original recordings. In these cases, where possible, I've mixed together parts of separate words to get an approximation of the desired word. In the example below, the word "Mongoloid" is created from two separate sources:

The two parts were repositioned and their gain (volume) envelopes tweaked until they could be mixed into a convincing word.
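I did this by hand in an audio editor, but the underlying idea can be sketched in a few lines of Python. The sketch below is an assumption about one simple way to do it, a linear crossfade, and not a record of the envelopes I actually drew. The function name `crossfade` and the `overlap` parameter (the number of samples shared by both parts) are made up for this example, and both clips are treated as plain sequences of sample values at the same sample rate.

```python
def crossfade(first, second, overlap):
    """Join two clips, fading the first out while the second fades in.

    The last `overlap` samples of `first` are mixed with the first
    `overlap` samples of `second` using straight-line gain envelopes.
    """
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both clips")
    start = len(first) - overlap
    head = list(first[:start])       # untouched part of the first clip
    tail = list(second[overlap:])    # untouched part of the second clip
    mixed = []
    for i in range(overlap):
        fade_in = (i + 1) / (overlap + 1)  # rises towards 1 across the overlap
        fade_out = 1.0 - fade_in           # falls towards 0 across the overlap
        mixed.append(first[start + i] * fade_out + second[i] * fade_in)
    return head + mixed + tail
```

Repositioning the parts corresponds to changing `overlap`, and tweaking the envelopes corresponds to replacing the straight-line fades with some other curve.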

In other cases, I have been less fortunate with missing and malformed words and have resorted to copying words spoken in one of the other tones of voice; all voice actors contributed three sets of words with different emotive intonations.

I recently made the mistake of extracting all 196 of a set of words without first increasing the volume of the source recording so that all of its peaks were at 0 decibels (maximum volume). Instead of re-doing the entire recording, which would have taken several hours, I found a free audio editing program, Wavosaur, with batch processing. At first I was disappointed by the limited selection of pre-defined batch operations. However, I soon discovered that I could process multiple audio files using any VST plug-in I could get my hands on. Enter Blue Cat's Gain Suite, a completely free VST plug-in for adjusting audio volume. Despite Wavosaur's crude interface for accessing VST parameters, it worked well enough.
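For comparison, the same kind of batch gain change could probably be scripted without a plug-in. This is a rough sketch under a few assumptions: the files are 16-bit PCM, the same gain in decibels is applied to every file (so the words keep their loudness relative to each other), and the names `apply_gain` and `batch_gain` are my own inventions rather than anything Wavosaur or Gain Suite provides.

```python
import array
import sys
import wave
from pathlib import Path


def apply_gain(samples, gain_db):
    """Scale 16-bit samples by gain_db decibels, clipping to the 16-bit range."""
    factor = 10 ** (gain_db / 20)
    return [max(-32768, min(32767, round(s * factor))) for s in samples]


def batch_gain(src_dir, dst_dir, gain_db):
    """Write a gain-adjusted copy of every .wav file in src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).iterdir()):
        if path.suffix.lower() != ".wav":
            continue
        with wave.open(str(path), "rb") as wav:
            params = wav.getparams()
            if params.sampwidth != 2:
                raise ValueError(f"{path.name}: only 16-bit PCM is handled here")
            samples = array.array("h")
            samples.frombytes(wav.readframes(params.nframes))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        out = array.array("h", apply_gain(samples, gain_db))
        if sys.byteorder == "big":
            out.byteswap()
        with wave.open(str(dst / path.name), "wb") as wav:
            wav.setparams(params)
            wav.writeframes(out.tobytes())
```

Writing to a separate output folder keeps the original extractions intact in case the chosen gain turns out to clip some of the louder words.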

Hours of work reduced to seconds. Even if I take into account time spent researching the problem, I'm happy with this solution. Try doing something like this for free on a Mac!

Cleaning up After Myself

Being a tad dyspraxic means I need to spell-check everything, including my file names. Computer programs aren't forgiving of typos and misspellings; they just crash.

The best method I could come up with is to use Command Prompt to save a list of the files in a directory to a text file.

I then remove the file extensions and run this text through an online spell-checker. I have my own spell-checkers, but I'm worried that years of abuse have immunised them to some of my misspellings, as I may have simply added the misspelled words to their dictionaries.
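The listing and extension-stripping steps could also be rolled into one small script. The following is only a sketch of that idea in Python, not what I actually ran; the function names and the one-word-per-line output format are assumptions made for the example.

```python
from pathlib import Path


def word_list(folder):
    """Return the extension-free names of the files in folder, sorted."""
    return sorted(p.stem for p in Path(folder).iterdir() if p.is_file())


def save_word_list(folder, out_path):
    """Write one word per line, ready to paste into a spell-checker."""
    words = word_list(folder)
    Path(out_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words
```

The output file is best saved outside the folder being listed, otherwise its own name ends up in the list on the next run.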

At this point I will correct the name of Asains.WAV and any other offending files. The need for all files to be named correctly intersects with the question of how I'm accessing the words (audio files) from my program. This will be the topic of a future post.

If you've made it this far, gentle reader, I commend your tolerance for dullness. The topic of this post is extremely boring and I've gone to little effort to liven it up.