Friday, August 27, 2010

/ay/, Animated.

The Animation

Click on the image to see the animation.

The Data

The data underlying the animation consists of 928 tokens of /ay/ drawn from an interview with a 60-year-old Philadelphian. The data was transcribed by an undergraduate supported by an NSF grant. It is coincidental that I was also the interviewer. A forced alignment of the transcript to the audio was performed using the Penn Phonetics Forced Aligner (P2FA). I extracted formant measurements at 6-millisecond intervals from every stressed /ay/ using Praat. I coded contextual information based on a syllabification of CMU dictionary transcriptions.

One super-long token of /ay/ was excluded because it was extremely poorly tracked, possibly due to a misalignment.

The Analysis

I rescaled the time variable to run from 0 to 1 for all tokens. I then fit smoothing spline ANOVA models for F1 and F2 in R using ssanova() from the gss package with the following formulas:
  • F1 ~ Voice*log(Duration)*Time
  • F2 ~ Voice*log(Duration)*Time
These models took a long time to fit. Using these F1 and F2 models, I got the predicted F1 and F2 values at a given time point in a vowel of a given duration in a given voicing context.
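As a rough illustration of the time rescaling step, here is a minimal Python sketch. It assumes the rescaling was done per token, from that token's first measurement to its last; the function name and the example token are hypothetical, and the actual analysis was done in R.

```python
def rescale_time(times):
    """Map one token's measurement times onto the interval [0, 1].

    Assumption: each token is rescaled on its own, from its first
    measurement (0) to its last measurement (1).
    """
    start, end = min(times), max(times)
    span = end - start
    if span == 0:
        # A token with a single time point has no extent to rescale.
        return [0.0 for _ in times]
    return [(t - start) / span for t in times]


# Hypothetical token: six measurements taken 6 ms apart.
token_times = [0.100 + 0.006 * i for i in range(6)]
scaled = rescale_time(token_times)
```

After this step every token runs from 0 to 1, so duration has to enter the model as its own predictor, which is what log(Duration) does in the formulas above.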

The Animation (again)

The "velocity" of the "gesture" is represented in two ways:
  1. The larger the point, the slower the velocity.
  2. The bluer the point, the slower the velocity.
However, these two indicators have different scales.
  1. Size: Size represents velocity relative to vowels of any duration. Two points of the same size in a short vowel and a long vowel represent the same velocity.
  2. Color: Color represents velocity relative to vowels of the same duration. So, a very blue point means "slow for a vowel of this duration." Points with the same color from vowels of different durations do not necessarily represent the same velocity.
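The difference between the two scales can be sketched in Python. This is only an illustration under assumptions: velocity is taken to be the Euclidean distance between successive predicted points divided by the elapsed time, and all names and numbers are hypothetical rather than taken from the actual plotting code.

```python
import math


def velocities(track, duration):
    """Speed between successive (x, y) points of one predicted track.

    Assumption: points are evenly spaced in time across the vowel.
    """
    dt = duration / (len(track) - 1)
    return [
        math.hypot(b[0] - a[0], b[1] - a[1]) / dt
        for a, b in zip(track, track[1:])
    ]


def rescale(values, lo, hi):
    """Linearly map values onto [0, 1] given the scale limits lo and hi."""
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two hypothetical tracks: a short vowel and a long vowel.
v_short = velocities([(0.0, 0.0), (3.0, 4.0), (9.0, 12.0)], duration=0.1)
v_long = velocities([(0.0, 0.0), (6.0, 8.0), (9.0, 12.0)], duration=0.4)

# Size: one scale shared by vowels of every duration.
all_v = v_short + v_long
size_short = rescale(v_short, min(all_v), max(all_v))
size_long = rescale(v_long, min(all_v), max(all_v))

# Color: a separate scale within each duration.
color_short = rescale(v_short, min(v_short), max(v_short))
color_long = rescale(v_long, min(v_long), max(v_long))
```

On the shared scale, equal values always mean equal velocities. On the per-duration scale, each vowel spans the full range on its own, so the slowest point of the short vowel and the slowest point of the long vowel get the same color even though their velocities differ.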

The x and y axes are negative logged hertz values, and are constrained so that an inch of plot space corresponds to the same amount of negative logged formant space for both x and y.
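A minimal Python sketch of the axis transform, assuming a natural logarithm (the base is not stated above, and the function name is hypothetical):

```python
import math


def neg_log_hz(hz):
    """Negative natural log of a formant value in hertz."""
    return -math.log(hz)


# Higher frequencies map to lower values, which flips the axis.
flipped = neg_log_hz(800.0) < neg_log_hz(400.0)

# Equal frequency ratios become equal distances on the transformed scale.
step_low = neg_log_hz(400.0) - neg_log_hz(800.0)
step_high = neg_log_hz(1000.0) - neg_log_hz(2000.0)
```

Because a doubling in hertz is the same distance anywhere on this scale, fixing the plot's aspect ratio means a given proportional change in F1 and in F2 covers the same length on the page.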

I generated 100 frames representing a smooth transition from the minimum duration to the maximum duration. At some point, voiceless context /ay/ disappears. This is because no pre-voiceless /ay/s were longer than 0.240 seconds.
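The frame sequence can be sketched in Python as follows. This assumes the durations are evenly spaced on a linear scale, which is not stated above; the minimum and maximum durations and the context labels are hypothetical placeholders, and only the 0.240 second cutoff comes from the data.

```python
MAX_VOICELESS_DURATION = 0.240  # longest pre-voiceless /ay/, in seconds


def frame_durations(d_min, d_max, n_frames=100):
    """Evenly spaced durations from d_min to d_max, one per frame."""
    step = (d_max - d_min) / (n_frames - 1)
    return [d_min + i * step for i in range(n_frames)]


def contexts_for(duration):
    """Voicing contexts to draw in a frame of the given duration."""
    contexts = ["voiced"]
    if duration <= MAX_VOICELESS_DURATION:
        # Beyond the longest observed token there is no data to
        # support a prediction, so the voiceless track is dropped.
        contexts.append("voiceless")
    return contexts


# Hypothetical duration range of 0.05 to 0.5 seconds.
frames = frame_durations(0.05, 0.5)
frames_with_voiceless = [d for d in frames if "voiceless" in contexts_for(d)]
```

Dropping the pre-voiceless track past the cutoff avoids extrapolating the model beyond the durations that were actually observed in that context.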

Each frame was generated using the ggplot2 library in R, then saved as a .png file. Then, I used png2swf from swftools to stitch the .png files together into a Flash animation running at 15 frames per second.

Room for Improvement

The formant data was very messy. I simply set the maximum formant and number of formants once for the entire file, without making any per-token adjustments. I might try to implement some kind of estimate evaluation like the one in Keelan Evanini's dissertation, except bootstrapping from the speaker's own data.
