generating syllable timing from audio track


dear anyone,

i want to create a track of timing data that corresponds to the spoken syllables in an audio track.  i.e. for a given audio track (of speech) i'd like to create a track with one marker per syllable, so 'video' (for instance) would get markers matching the Vid-Ee-O sounds.  it's for creating kinetic typography.

firstly, if anyone knows of an audio utility (or github project) that does something like this, that would be great.  if not i might have to make my own.  i was hoping i could make an HTML5 app that might do it, roughly as follows....

>> get the app to display the waveform data on the screen & play through it.  ideally with a checkbox for changing the playback speed (to slow it down)

>> underneath have another 'track' that - when the above audio track is playing - drops a marker every time the user presses a key (while playback keeps going).  hopefully with the ability to slide these markers around later (a la editing), but i can probably figure this bit out myself
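
A rough sketch of the marker track described above, in case it helps anyone picture it. The `MarkerTrack` class and the `speech` element id are made-up names for illustration, and the browser wiring assumes a plain `<audio>` element rather than a full Web Audio graph.

```javascript
// Hypothetical helper: keeps a sorted list of marker times (in seconds).
class MarkerTrack {
  constructor() {
    this.markers = [];
  }

  // Record a marker at the given playback time; returns its index after sorting.
  add(timeSec) {
    this.markers.push(timeSec);
    this.markers.sort((a, b) => a - b);
    return this.markers.indexOf(timeSec);
  }

  // Slide an existing marker to a new time (the "editing" step).
  move(index, newTimeSec) {
    this.markers.splice(index, 1);
    return this.add(newTimeSec);
  }
}

// Browser wiring. Assumption: the page contains <audio id="speech">.
if (typeof document !== "undefined") {
  const audio = document.getElementById("speech");
  const track = new MarkerTrack();
  if (audio) {
    document.addEventListener("keydown", (event) => {
      // Ignore auto-repeat from a held key, and key presses while paused.
      if (event.repeat || audio.paused) return;
      track.add(audio.currentTime); // playback carries on uninterrupted
    });
  }
}
```

For the slow-playback checkbox, setting `audio.playbackRate = 0.5` on the same element is probably the simplest route, since `currentTime` still reports positions in the original timeline.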

the attached image is a rough idea of how it might look in the browser (blue = marker, red = playhead)

if anyone knows whether this is easy to do with the web audio api (or in any other way), any suggestions are very welcome




I'd suggest doing it offline: run the audio files through a specialist tool that extracts the syllable data into a cue sheet - for example:


Then process that sheet as your audio plays, synchronising the audio and visuals as necessary.  Main benefits: 1) a higher level of audio analysis (using other people's expertise), including phonetics, and 2) not bogging down the browser at runtime with what would otherwise be constant analysis.
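
As a minimal sketch of the playback side: once the cue sheet is loaded, finding the syllable to show is just a lookup against the current playback time. The cue format here (`{ time, text }` objects sorted by time) and the function name `activeCueIndex` are assumptions for illustration, not the output format of any particular tool.

```javascript
// Assumed cue format: [{ time: <seconds>, text: "<syllable>" }, ...], sorted by time.
// Returns the index of the last cue that has started at currentTime, or -1 if none has.
function activeCueIndex(cues, currentTime) {
  let lo = 0;
  let hi = cues.length - 1;
  let found = -1;
  // Binary search for the rightmost cue with cue.time <= currentTime.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (cues[mid].time <= currentTime) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Example data (made up): the three syllables of "video".
const exampleCues = [
  { time: 0.0, text: "Vid" },
  { time: 0.25, text: "Ee" },
  { time: 0.5, text: "O" },
];
```

In the browser you could call this from a `requestAnimationFrame` loop, passing the audio element's `currentTime`, and only redraw when the returned index changes.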

Or, if you're looking to do things dynamically, then a rudimentary peak analysis might suffice for syllables (e.g. quantise the amplitude into 100ms windows and assume a syllable whenever the value rises by 75% or more from one window to the next)?
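
Here is one way that peak analysis could look, as a rough sketch rather than anything tuned. It uses RMS per window as the "amplitude", and the `floor` parameter (a noise floor so silence doesn't trigger markers) is my own addition; the function name and defaults are illustrative.

```javascript
// samples: array (or Float32Array) of PCM values in [-1, 1]; sampleRate in Hz.
// Returns the start times (in seconds) of windows where the level jumps
// by at least `rise` (0.75 = +75%) relative to the previous window.
function detectSyllableOnsets(samples, sampleRate, windowMs = 100, rise = 0.75, floor = 0.01) {
  const windowSize = Math.max(1, Math.round((sampleRate * windowMs) / 1000));
  const onsets = [];
  let previous = 0; // level of the previous window

  for (let start = 0; start < samples.length; start += windowSize) {
    const end = Math.min(start + windowSize, samples.length);

    // Quantise: one RMS value per window.
    let sumSquares = 0;
    for (let i = start; i < end; i++) sumSquares += samples[i] * samples[i];
    const level = Math.sqrt(sumSquares / (end - start));

    // Assume a syllable if the level is audible and rose by >= 75%.
    if (level > floor && level >= previous * (1 + rise)) {
      onsets.push(start / sampleRate);
    }
    previous = level;
  }
  return onsets;
}
```

In a browser the samples could come from `AudioBuffer.getChannelData(0)` after decoding the file with `decodeAudioData`. Expect false hits on plosives and misses on smoothly joined syllables, so the results would likely still want hand-correcting.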

