Talking About Social Identity is the somewhat cumbersome working title of my final project for the BA I'm taking in fine art.

My main claim is that our social identities – our own and others' conceptions of our sex, race, class, gender and sexuality – are constructed through an ongoing process of conversation and negotiation. I'm investigating this claim by building a sound installation that simulates a conversation between up to six virtual participants. These participants (literally electronic speakers) negotiate their own and each other's social identities according to a simple set of rules based on their belonging, or not belonging, to different identity groups such as middle-class or feminine. In this blog series I'm going to look at the major processes involved in making the sound installation as well as some theory and planning. In the remainder of this post, I will introduce three key components of the project: motion tracking, conversation simulation, and speech sample extraction.

Motion Tracking

As I want this installation to be interactive, I need a way to monitor the audience's behaviour. This is where motion tracking comes in: I plan to install a webcam above the installation space and use it to track people's positions in the space. At this point, I haven't decided precisely how I'm going to use the motion tracking data, though I do have some ideas which I will discuss in a future post.

I have, however, put together a GUI which allows me to easily use a motion detection and tracking algorithm from the open-source computer vision library AForge.NET. While waiting for the delivery of the webcam I hope to use for this project, I have experimented with the motion tracking using my laptop's inbuilt webcam.

While the camera's optics are poor and its sensor low quality, the algorithm – which looks for differences between the current frame of camera input and the background frame – can still detect motion. In the above image, I've set the background frame to one in which my hand and the pair of scissors were not in the field of view. The fact that the right handle of the pair of scissors is not recognised as a motion object against the darker background indicates a weakness in this kind of motion detection: its reliance on the difference between pixel values. Where an object and the background behind it have similar pixel values, the difference is too small to register as motion. Practically, I intend to address this weakness in three ways: get a better camera, apply a contrast filter to the camera input before motion detection is carried out, and, most importantly, make my installation space as white as possible, with consistent lighting.
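To make the weakness concrete, here is a minimal sketch of background differencing in plain Python. It is not AForge.NET's API or the code behind my GUI; the function names, the threshold of 40 and the contrast gain are all assumptions chosen for illustration, and the "images" are single rows of greyscale values.

```python
def detect_motion(background, frame, threshold=40):
    """Mark a pixel as motion when it differs enough from the background frame."""
    return [
        [abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def stretch_contrast(image, gain=2.0, midpoint=128):
    """Hypothetical contrast filter: push values away from a midpoint, clamped to 0-255."""
    return [
        [max(0, min(255, round(midpoint + gain * (p - midpoint)))) for p in row]
        for row in image
    ]


# One row of four pixels: a light backdrop (220) on the left, a darker one (60) on the right.
background = [[220, 220, 60, 60]]
# A mid-grey object (90) now covers the two middle pixels.
frame = [[220, 90, 90, 60]]

# Without filtering, the object is only found where it sits against the light backdrop.
raw_mask = detect_motion(background, frame)

# Applying the same contrast filter to both frames widens the small difference.
filtered_mask = detect_motion(stretch_contrast(background), stretch_contrast(frame))

print(raw_mask)       # [[False, True, False, False]]
print(filtered_mask)  # [[False, True, True, False]]
```

The contrast filter only helps when the stretched values stay inside the 0-255 range; if both the object and the background are clamped to the same extreme, the difference disappears entirely, which is one reason a white, evenly lit space matters more than any filtering.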

Conversation Simulation

A lot of what you can see in the above image is unchanged since the first version of my conversation generation program that I made at the very end of 2012.

The topmost section gives me access to variables which determine, for example, how different the beliefs of my conversation's participants (which I call interlocutors) will be to start with and how much memory they have to store past utterances (things which they, and others, have said). The only major change I have made here is adding fields for disposition: a property which represents how participants in a conversation would feel about a belief, were it true. If my terminology makes no sense to you at this point, please bear with me. I will write about these settings and the algorithm that they control in much more detail in future posts.

The middle section for file output is fairly self-explanatory. This was the only form of output I used in 2012's version of the project. It allowed for a conversation, of a limited length, to be saved as a WAV file, each participant's speech occupying a different channel. In the version of the program I'm working on now, I want it to be possible to output in real time and to file simultaneously. I have not implemented this yet and will likely post something about the technical challenges it poses.

The bottom section houses settings specific to real-time output. At the moment, I have settings for the positions of six speakers. These positions will roughly correspond to where each speaker is in my webcam's field of view and allow me to use audience members' positions relative to the speakers to change how the speakers function. For example, a speaker could be more likely to emit speech if an audience member was closer to it.
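As a rough illustration of that last idea, the sketch below weights each speaker by its distance to the nearest audience member and then picks a speaker at random using those weights. None of this is implemented in my program yet; the function names, the inverse-distance formula, the `falloff` parameter and the coordinates are all assumptions made for the example.

```python
import math
import random


def emission_weights(speakers, audience, falloff=1.0):
    """Give each speaker a weight that grows as the nearest audience member gets closer."""
    weights = []
    for sx, sy in speakers:
        if not audience:
            # Nobody in view: treat every speaker as equally likely.
            weights.append(1.0)
            continue
        nearest = min(math.hypot(sx - ax, sy - ay) for ax, ay in audience)
        weights.append(1.0 / (1.0 + falloff * nearest))
    return weights


def pick_speaker(speakers, audience, rng=random):
    """Return the index of the speaker chosen to emit speech next."""
    weights = emission_weights(speakers, audience)
    return rng.choices(range(len(speakers)), weights=weights, k=1)[0]


# Hypothetical positions of six speakers, in arbitrary camera-space units.
speakers = [(0, 0), (5, 0), (10, 0), (0, 5), (5, 5), (10, 5)]
# One tracked audience member standing next to the first speaker.
audience = [(1, 0)]

print(emission_weights(speakers, audience))
print(pick_speaker(speakers, audience))
```

Because the choice is still random, distant speakers are never silenced outright; they are just chosen less often, which seems closer to how a conversation behaves than a hard cut-off.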

Speech Sample Extraction

This is one of the most time-consuming, and certainly the most repetitive, of the processes involved in this project. I have to manually extract approximately 4,704 words in total from voice recordings and save them as individual files to be used to synthesise speech.

The above image shows the extraction of the word Caucasoids, a word which I will have to extract on 24 separate occasions. While I would like to be able to get a computer to do all of this automatically, the quality of these samples is going to have a major impact on the final outcome of this project. The computer saves me a lot of time in that I can automatically amplify each file, so that the loudest point in every file is equally loud, and remove any background noise. However, this is where it stops being automatic: each word needs to be listened to several times until I find the best range of it to save to file. The words also have to sound good in context, and this often means cutting them short, especially when a word is likely to be in the middle of a sentence. I can allow words such as nouns, which are likely to be at the end of a sentence, to trail off and have greater emphasis; with adjectives, I have to compromise, as they can be both in the middle and at the end of sentences. There is also the matter of attempting to repair mispronunciations and missing words. These tasks, and some issues relating to how I want the piece to be interpreted, will be discussed in a future post.
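The automatic amplification step is simple enough to sketch. The snippet below is a minimal peak normalisation in plain Python, assuming samples are floats between -1.0 and 1.0; the function name and the target peak of 0.9 are assumptions for illustration, not the settings of the audio software I actually use, and noise removal is not covered.

```python
def normalise_peak(samples, target_peak=0.9):
    """Scale a recording so that its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        # Silent or empty input: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# Two hypothetical word recordings at very different levels.
quiet_word = [0.05, -0.10, 0.08]
loud_word = [0.40, -0.80, 0.60]

# After normalisation, the loudest point of each recording is equally loud.
for word in (quiet_word, loud_word):
    levelled = normalise_peak(word)
    print(max(abs(s) for s in levelled))
```

This only matches the loudest point of each file, so two words can still sound different in perceived loudness, which is part of why listening to each one remains necessary.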

I hope you gained something from reading this. At the very least, I gained something from writing it, as I find it easier to clarify my ideas once I have expressed them in writing.