
WebAudio support for version 2.0


JCPalmer

DK,

After looking in the repository & seeing that the new builds were for 2.0, I started digging into what is in it.  Saw WebAudio.

 

Is this just playback, or will you be able to record? I already have a use for that!

 

Cocoon also has sound playback; could that be called on detection?  I have completed Cocoon tolerance changes, but not put in a pull request yet.  Was going to do capabilities testing, but maybe I should just get it up there & let others do some of the work.

 

Jeff


Hey,

 

My application is a game development tool, not a game itself.

 

https://googledrive.com/host/0B6-s6ZjHyEwUSDVBUGpHOXdtbHc

 

The idea is to make game characters talk.  It is very difficult syncing meshes to audio.  Going from the other direction, syncing audio to meshes, just means a little practice looking at the mesh in the tool.  Pre-recorded audio is also likely copyrighted, so making your own audio keeps you free from issues.

 

This tool does look-ups of sentences in Carnegie Mellon University's DARPA-funded phoneme database.  I wrote a Java program which converts it to a JavaScript module.  The database is big, but you will only need it in the tools, not the final application.
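To give a rough idea of the kind of look-up involved (just a sketch - the module layout and sample entries below are made up for illustration, not the actual interface of the generated cmudict module):

// Hypothetical shape of the generated dictionary module: word -> Arpabet token string.
// The real cmudict.0.7.a.js is organized differently (look-up by index), so treat this as a sketch.
var CMUDICT_SKETCH: { [word: string]: string } = {
    "MAKE": "M EY K",
    "GAME": "G EY M",
    "TALK": "T AO K",
    "HOT" : "HH AA T"
};

// Turn a typed sentence into one Arpabet token string for the tool to animate.
function sentenceToArpabet(sentence: string): string {
    return sentence.toUpperCase().split(/\s+/).map(function (word) {
        return CMUDICT_SKETCH[word.replace(/[^A-Z']/g, "")] || ".";   // "." as a rest for unknown words
    }).join(" . ");                                                   // brief rest between words
}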

 

It is not really done, but getting there.  Will be adding a check box for starting a countdown.  You could use a phone to make the actual recording, but it might be less fumbling if the tool could do it.

 

Jeff


After your comment on "Media Capture API", I have been doing some searching.  Here are some notes for myself, or anyone who might stumble on this:

 

  1. See the Media Capture API here: http://www.w3.org/TR/html-media-capture/ (<form> tag based.  OK for starting, but how do you stop in code?  Is this implemented anywhere?  No need for integration into BJS.)
  2. There is the navigator.getUserMedia() here: http://w3c.github.io/mediacapture-main/getusermedia.html & old but historic commentary here: http://www.html5rocks.com/en/tutorials/getusermedia/intro/

Need to finish all other aspects, then might circle back to this.  The only desktop I have with a microphone is a MacBook Pro, so hopefully I can get something that is implemented there.
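Rough shape of option 2, as a note to myself (a sketch only, not tested; vendor prefixes & browser support vary):

// Sketch: ask for a microphone stream and wire it into a Web Audio context.
navigator.getUserMedia = navigator.getUserMedia ||
                         (<any> navigator).webkitGetUserMedia ||
                         (<any> navigator).mozGetUserMedia;

if (navigator.getUserMedia) {
    navigator.getUserMedia({ audio: true },
        function (stream) {
            var ctx = new ((<any> window).AudioContext || (<any> window).webkitAudioContext)();
            var micSource = ctx.createMediaStreamSource(stream);  // feed this into a processing graph
        },
        function (err) { window.alert('Could not open microphone: ' + err); }
    );
} else {
    window.alert('navigator.getUserMedia not supported');
}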



Works really well Jeff :)

 

I need to give him eyes and teeth - less of a zombie look :o

 

cheers, gryff :)


 

DK, thanks for giving some info on the audio playback / synthesizer you are building upon.  Having a mockup, maybe in .d.ts form, of how this is planned to be worked into BJS could mean I could start implementing against it before it was done.

 

If I can successfully get this tool to record crisply timed audio files, then I could maybe work their playback right into the same Voice.ts with just an optional arg.  Could even have 3 buttons in this tool:  "Say", "Record", & "Playback".

- - - - - - - -

Gryff, a better head would be welcome.  It seems like every time I get a new .blend with more features from you, it shows up new problems.  I want this to work, in development, with a fully developed head that someone would actually use.

 

Also, thinking that I am going to want a fourth shape key, called 'mouth-smile'.  Smiling does involve more than just the mouth, but cheeks should be their own ShapeKeyGroup.  If you remember from the multi-group test (with the left, right, drumming, & conflict buttons), having different ShapeKeyGroups modifying the same vertices can cause problems.



Jeff, the mouth-smile would involve cheeks as you state, but it would also involve at least the corners of the mouth.  Currently, the mouth-wide is modifying vertices of the other two shape keys - so how big are the problems if the corners of the mouth were modified?  And if you want to go for modifying the corners of the mouth and cheeks - one or two keys (one for each side)?

 

As for improving the head - started that (added eyes this morning). Can I add additional bones to the armature (currently only a single bone)?

 

By the way, I forgot to mention, I liked the loudness effect and its impact on the shape keys.

 

cheers, gryff :)


Gryff,

 

There might be some combinations of mouth-smile & mouth-wide that only the "Joker" could pull off, but it is not a code problem.  The additive shape key concept of MORPH.Mesh v1.1 (formerly BABYLON.Automaton) means you just toss a ReferenceDeformation a 0 to 1 for each key you wish to combine.  It then builds a vertex endpoint that combines them all together & computes matching endpoint normals, or gets a precompiled set, if one exists.

 

On every beforeRender() call to MORPH.Mesh, each ShapeKeyGroup compares its new endpoint to the endpoint of the prior move, and interpolates new vertex positions & normals based on the amount of time that has elapsed over how long you said the move was supposed to take.  No pre-compiled "frames".  That just leads to choppiness, kind of like doing rounding before the last step in math.
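In sketch form, the per-frame step is just a time-ratio interpolation between the prior endpoint and the new one (simplified names here, not the actual ShapeKeyGroup code):

// Sketch only: interpolate vertex positions between two endpoints by elapsed time.
function interpolatePositions(priorEndpoint: Float32Array,
                              targetEndpoint: Float32Array,
                              elapsedMS: number,
                              durationMS: number): Float32Array {
    var ratio = Math.min(elapsedMS / durationMS, 1);        // clamp once the move is complete
    var result = new Float32Array(priorEndpoint.length);
    for (var i = 0; i < priorEndpoint.length; i++) {
        result[i] = priorEndpoint[i] + (targetEndpoint[i] - priorEndpoint[i]) * ratio;
    }
    return result;    // assigned to the mesh's position data; normals are handled the same way
}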

 

The problem with vertices shared across ShapeKeyGroups is that the changes cannot be additive like intra-group keys.  It is last-group-run wins.

 

One key for smile & one bone is fine. 

 

The version of the tool in the link is already old, & does not pre-compile endpoints.  Doing this makes it even smoother, especially for quick moves.

 

Thought you would like loudness.  The loudest is 100% of the values for the keys you provided.  Looks unnatural.  Thinking about dropping it, as the # of pre-compiles is based on the total # of shapes it could take.  Currently 10 VISEMES * 3 loudnesses = 30.  Add in 2 smile settings, and that would be 60.  They are small, but if no one would use the loudest, I could drop it & only have 40.

 

Jeff


  • 2 weeks later...

Hi,

 

I'm in charge of the Web Audio engine.  I'll focus on sound playback.  Recording needs the Media Capture API.  Why would you need sound recording in a game?

 

Bye,

 

David

 

Though all these standards are conflicting, I did manage to get the Web Audio API to record as well.  Updated the link above.  Ignore the Mood slider; it is not implemented yet.  Did not do a countdown.  Just recording over and over is faster.  Use the Playback button to verify the result.  The ".wav" button puts a file in your downloads directory.

 

Not sure on the timing of your implementation of Web Audio in 2.0.  Am going with my own at least for recording.  I do not know how to read back in the saved .wav file.  Here is my AudioRecorder.ts file:

module MORPH {
    /** has its origins from: http://bytearray.org/wp-content/projects/WebAudioRecorder/ */
    export class AudioRecorder {
        private context : AudioContext;

        public initialized = false;  // set in prepMic; will remain false if WebAudio or navigator.getUserMedia not supported
        public playbackReady = false;
        private recording = false;
        private requestedDuration : number;
        private startTime : number;
        private recordCompletionCallback : () => void;

        // arrays of FloatArrays made during recording
        private leftchannel  = new Array<Float32Array>();
        private rightchannel = new Array<Float32Array>();
        // consolidated versions of the buffer, after recording, for playback or written to .WAV
        private leftBuffer  : Float32Array;
        private rightBuffer : Float32Array;
        private recorder : ScriptProcessorNode = null;
        private recordingLength = 0;
        private volume : GainNode = null;
        private audioInput = null;

        private objectUrl : string;

        private static instance : AudioRecorder;

        public static getInstance() : AudioRecorder {
            if (AudioRecorder.instance) return AudioRecorder.instance;

            AudioRecorder.instance = new AudioRecorder();

            var audioContext = window.AudioContext || window.webkitAudioContext;
            if (audioContext) {
                window.alert('Asking for audio recording permission in advance,\nto avoid asking when actually trying to record.');

                navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia;
                if (navigator.getUserMedia) {
                    navigator.getUserMedia({audio:true}, AudioRecorder.prepMic, function(stream : any) { window.alert('Error capturing audio.' + stream); });
                } else {
                    window.alert('Navigator.getUserMedia not supported.');
                }
            } else {
                window.alert('WebAudio not supported');
            }
            return AudioRecorder.instance;
        }

        /**
         * static because it is in a callback for navigator.getUserMedia()
         */
        private static prepMic(stream : any) {
            AudioRecorder.instance.context = new (window.AudioContext || window.webkitAudioContext)();
            AudioRecorder.instance.context.sampleRate = 44100;

            // creates a gain node
            AudioRecorder.instance.volume = AudioRecorder.instance.context.createGain();
            // creates an audio node from the microphone incoming stream
            AudioRecorder.instance.audioInput = AudioRecorder.instance.context.createMediaStreamSource(stream);
            // connect the stream to the gain node
            AudioRecorder.instance.audioInput.connect(AudioRecorder.instance.volume);

            /* From the spec: This value controls how frequently the audioprocess event is
               dispatched and how many sample-frames need to be processed each call.
               Lower values for buffer size will result in a lower (better) latency.
               Higher values will be necessary to avoid audio breakup and glitches */
            var bufferSize = 2048;
            AudioRecorder.instance.recorder = AudioRecorder.instance.context.createScriptProcessor(bufferSize, 2, 2);

            // cannot reference using 'this' inside of callback
            AudioRecorder.instance.recorder.onaudioprocess = function(e) {
                if (!AudioRecorder.instance.recording) return;
                var left  = e.inputBuffer.getChannelData(0);
                var right = e.inputBuffer.getChannelData(1);
                // we clone the samples
                AudioRecorder.instance.leftchannel .push(new Float32Array(left ));
                AudioRecorder.instance.rightchannel.push(new Float32Array(right));
                AudioRecorder.instance.recordingLength += bufferSize;
                // determine if the duration required has yet occurred
                if (Mesh.now() - AudioRecorder.instance.requestedDuration >= AudioRecorder.instance.startTime) AudioRecorder.instance.recordStop();
            };

            // we connect the recorder
            AudioRecorder.instance.volume.connect(AudioRecorder.instance.recorder);
            AudioRecorder.instance.recorder.connect(AudioRecorder.instance.context.destination);
            AudioRecorder.instance.initialized = true;
        }

        public recordStart(durationMS : number, doneCallback? : () => void) {
            if (this.recording) { BABYLON.Tools.Warn("already recording"); return; }
            this.recording = true;
            this.requestedDuration = durationMS;
            this.startTime = Mesh.now();
            this.recordCompletionCallback = doneCallback ? doneCallback : null;

            // delete previous merged buffers, if they exist
            this.leftBuffer = this.rightBuffer = null;
            this.playbackReady = false;
        }

        public recordStop() : void {
            if (!this.recording) { BABYLON.Tools.Warn("recordStop when not recording"); return; }
            this.recording = false;

            // we flatten the left and right channels down
            this.leftBuffer  = this.mergeBuffers(this.leftchannel );
            this.rightBuffer = this.mergeBuffers(this.rightchannel);
            this.playbackReady = true;

            this.clean();
            if (this.recordCompletionCallback) this.recordCompletionCallback();
        }

        public playback() : void {
            if (!this.playbackReady) { BABYLON.Tools.Warn("playback when not playbackReady"); return; }
            var newSource = this.context.createBufferSource();
            var newBuffer = this.context.createBuffer(2, this.leftBuffer.length, this.context.sampleRate);

            newBuffer.getChannelData(0).set(this.leftBuffer);
            newBuffer.getChannelData(1).set(this.rightBuffer);
            newSource.buffer = newBuffer;
            newSource.connect(this.context.destination);
            newSource.start(0);
        }

        public saveToWAV(filename : string) : void {
            if (!this.playbackReady) { BABYLON.Tools.Warn("save when not playbackReady"); return; }

            if (filename.length === 0) {
                window.alert("No name specified");
                return;
            }
            else if (filename.toLowerCase().lastIndexOf(".wav") !== filename.length - 4) {
                filename += ".wav";
            }

            var blob = new Blob([ this.encodeWAV() ], { type : 'audio/wav' });

            // turn blob into an object URL; saved as a member, so it can be cleaned out later
            this.objectUrl = (window.webkitURL || window.URL).createObjectURL(blob);

            var link = window.document.createElement('a');
            link.href = this.objectUrl;
            link.download = filename;
            var click = document.createEvent("MouseEvents");
            click.initEvent("click", true, false);
            link.dispatchEvent(click);
        }

        private clean() : void {
            if (this.objectUrl) {
                (window.webkitURL || window.URL).revokeObjectURL(this.objectUrl);
                this.objectUrl = null;
            }

            // reset the buffers for the new recording
            this.leftchannel.length = this.rightchannel.length = 0;
            this.recordingLength = 0;
        }

        private mergeBuffers(channelBuffer : Array<Float32Array>) : Float32Array {
            var result = new Float32Array(this.recordingLength);
            var offset = 0;
            var lng = channelBuffer.length;
            for (var i = 0; i < lng; i++) {
                var buffer = channelBuffer[i];
                result.set(buffer, offset);
                offset += buffer.length;
            }
            return result;
        }

        private interleave() : Float32Array {
            var length = this.leftBuffer.length + this.rightBuffer.length;
            var result = new Float32Array(length);
            var inputIndex = 0;
            for (var index = 0; index < length; ) {
                result[index++] = this.leftBuffer [inputIndex];
                result[index++] = this.rightBuffer[inputIndex];
                inputIndex++;
            }
            return result;
        }

        private encodeWAV() : DataView {
            // we interleave both channels together
            var interleaved = this.interleave();
            var buffer = new ArrayBuffer(44 + interleaved.length * 2);
            var view = new DataView(buffer);

            // RIFF chunk descriptor
            this.writeUTFBytes(view, 0, 'RIFF');
            view.setUint32(4, 44 + interleaved.length * 2, true);
            this.writeUTFBytes(view, 8, 'WAVE');
            // FMT sub-chunk
            this.writeUTFBytes(view, 12, 'fmt ');
            view.setUint32(16, 16, true);
            view.setUint16(20, 1, true);
            // stereo (2 channels)
            view.setUint16(22, 2, true);
            view.setUint32(24, this.context.sampleRate, true);
            view.setUint32(28, this.context.sampleRate * 4, true);
            view.setUint16(32, 4, true);
            view.setUint16(34, 16, true);
            // data sub-chunk
            this.writeUTFBytes(view, 36, 'data');
            view.setUint32(40, interleaved.length * 2, true);

            // write the PCM samples
            var lng = interleaved.length;
            var index = 44;
            var volume = 1;
            for (var i = 0; i < lng; i++) {
                view.setInt16(index, interleaved[i] * (0x7FFF * volume), true);
                index += 2;
            }
            return view;
        }

        private writeUTFBytes(view, offset, string) {
            var lng = string.length;
            for (var i = 0; i < lng; i++) {
                view.setUint8(offset + i, string.charCodeAt(i));
            }
        }
    }
}
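Typical usage from the tool's buttons is along these lines (a sketch; only the public methods above are assumed):

// Sketch: wiring the "Record", "Playback" & ".wav" buttons to AudioRecorder.
var recorder = MORPH.AudioRecorder.getInstance();   // asks for mic permission up front

// "Record" button: capture 3 seconds, then report when done
recorder.recordStart(3000, function () {
    console.log("recording complete; playbackReady = " + recorder.playbackReady);
});

// "Playback" button
if (recorder.playbackReady) recorder.playback();

// ".wav" button
if (recorder.playbackReady) recorder.saveToWAV("sentence01.wav");

(For reading a saved .wav back in, the usual Web Audio route is an XMLHttpRequest with responseType 'arraybuffer' followed by context.decodeAudioData(), though I have not tried that here.)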

Hi Jeff,

 

Impressive. ;)

 

What you are trying to build is a huge leap towards a valuable tool.  With more than 15 years of experience in feature film and game facial animation such as The Hobbit and GTA IV, I would tackle this as two different tasks that have to be separate functions, and possibly separate applications initially - as they are very different from each other.  For speech, if you narrow the available phonemes to approximately a dozen, you will be able to provide reliable quality speech and can focus on a feature set / API that is robust.  To begin with, there are a few methods for speech aside from keyframing, but video analysis and speech recognition are the most reliable in my opinion.  And, since it took us years to perfect video analysis at Weta for Avatar, my vote is speech recognition driving morph targets and at least one bone driving the jaw (or a morph target - a bone provides better results and can also be used in the facial expressions for both vertical and lateral movement).  This bone or morph target is used to attenuate/amplify any phoneme.

 

For the voice recognition, take a look at the source for Google's speech recognition tool:

https://github.com/GoogleChrome/webplatform-samples/tree/master/webspeechdemo

 

I can't imagine that anyone needs all of the functionality of their Web Speech API, as more than 12 - 16 phonemes for real time or post analysis will produce very choppy results when applied.

 

Facial is far more difficult as there are primarily 4 viable methods to achieve this:

1. Keyframing (time consuming and choppy)

2. Audio analysis (poor results)

3. Puppeteering (real time)

4. Facial recognition (video analysis - currently prohibitive)

 

For many of the Trolls in The Hobbit, I built a real time digital and animatronic puppeteering application - which took me approximately a week to begin to test in production.  Even all the way back to Nickelodeon's CatDog, I found the simplest path to good facial was to puppeteer their faces in real time.  This is as simple as connecting a joystick controller to morph targets on a mesh, and setting up conditional blending between morph targets - which is really basic math.
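In WebGL terms, the core of it is just reading controller axes each frame and mapping them onto morph target weights, with a little conditional blending - something like this sketch (the setWeight() calls are placeholders, not a real morphing API):

// Sketch: puppeteer two morph target weights from a gamepad each frame.
function puppeteerFrame(smileGroup: any, jawGroup: any): void {
    var getPads = (<any> navigator).getGamepads;
    var pad = getPads ? getPads.call(navigator)[0] : null;
    if (!pad) return;

    // map stick axes (-1..1) onto weights (0..1)
    var smile = (pad.axes[0] + 1) / 2;
    var jaw   = (pad.axes[1] + 1) / 2;

    // conditional blending: keep a wide-open jaw and a full smile from fighting each other
    if (jaw > 0.7) smile = Math.min(smile, 1 - jaw);

    smileGroup.setWeight(smile);   // placeholder call
    jawGroup.setWeight(jaw);       // placeholder call
}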

 

My goal is also to create a function in babylon.js to achieve quality facial animation with minimal effort.  At this time, I'm tasked with production deliveries, and am still stumbling through the tiny (and not so tiny) pitfalls of a new and unforgiving development framework (only a couple of weeks in.)  I chose babylon.js over three.js for many, many reasons, and am now certain it is the most straight forward and flexible framework available.  And as my co-partner/developer owns patents for multimedia streaming, we already have compiled streaming applications that we are in process of adapting to WebGL.  So our plate is "currently" full.  However, with years of experience in facial, I can provide a list of a dozen or so key phonemes that are essential to good speech, as well as key morph targets for facial puppeteering.  It could be a fun project.  :)

A tongue, eyeballs, and eyelids will also be required eventually to improve speech and facial to a believable threshold.  If I can assist in any way, please let me know.

 

Cheers,

 

David B.


David B.

Thanks for responding.  I am doing this for a commercial, interactive, entertainment application.  Call it a "game".  If you are not sure what to make of that:  Mission Accomplished :).  Saw a news report back on 4/1 about Facebook buying Oculus, and an idea popped into my head.  I had not even heard of Oculus.  My app has nothing to do with Oculus Rift, but it was part of my early searches that pointed me to BJS.

 

This MORPH package is just one of the parts I am going to need.  With Gryff pointing out speech / shape keys, and helping me tremendously, I was able to get this far.  I have adapted my plans to also make characters talk.  Should point out, if you did not know, that meshes do not even need to be human, much less talk, to use this morphing capability.

 

My background is heavily in Java / SQL & some OpenCL in the domain of finance / investing.  I have approached speaking like a couple of informal databases.  My database skills are quite good (it took less than 1 day to write a Java program to turn an Arpabet flat webpage into a JavaScript module with a look-up by index).  I also distinguish between sounds & mouth shapes (phonemes & visemes).  Currently, there is a 39-row Arpabet phoneme "table" which references an 11-row viseme "table".

 

MORPH.Voice.makeSentence() takes an Arpabet token string.  It looks up each token in the phoneme table to get an index into the viseme table as well as the duration of the phoneme.  The viseme table has shape key settings to use for each viseme.  The values for the keys are combined to derive a final morph target.
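Roughly, the look-up chain is this (a sketch with simplified names, not the actual Voice.ts internals):

// Sketch of the look-up chain in makeSentence(); names & structures simplified.
interface PhonemeRow { visemeIdx: number; durationMS: number; }
interface VisemeRow  { mouthOpen: number; mouthWide: number; lipsTogether: number; }  // shape key settings, 0..1

declare var ARPABET_TABLE: { [token: string]: PhonemeRow };   // 39-row phoneme "table"
declare var VISEME_TABLE : VisemeRow[];                       // 11-row viseme "table"

function makeSentenceSketch(arpabet: string): { keys: VisemeRow; durationMS: number }[] {
    return arpabet.trim().split(/\s+/).map(function (token) {
        var phoneme = ARPABET_TABLE[token];                   // token -> viseme index + duration
        return { keys: VISEME_TABLE[phoneme.visemeIdx],       // shape key values combined into the morph target
                 durationMS: phoneme.durationMS };
    });
}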

 

Some of the table values need to be refined, but others seem to be dead on.  If you have data on visemes or how they should be mapped to Arpabet phonemes that would be great!

 

The big development you should know about concerns what MakeHuman is doing.  Apparently, they had Blender viseme shape keys you could import.  They are close to coming out with a new version where this is supposed to be greatly improved.  The number of keys they are using is different from Gryff's.  I am not good enough in Blender to make my own characters or good shape keys.  I hope to do a database restructure to support this instead when it comes out.  You might want to get familiar with MakeHuman, if you are not already.

 

Kind of on hold on this for now.  Going to work on a couple of bugs with the Blender exporter, and publish v1.1 as is.  V1.1 works with BJS 1.14, but v1.2 is going to work with BJS 2.0.  (FYI, I am not going to clog up the repository with the Arpabet module, cmudict.0.7.a.js, but will include the Java source / class file which builds it.)

 

Jeff


Hi Jeff,

 

Wow, I thought I was one of the only people who knew what Arpabet phonemes were. :)  Almost all of the work we have done in voice recognition utilizes Arpabet phonemes.  I took a look at the UC Berkeley study you posted, and was concerned that it might provide too much info for anyone trying to animate speech.  I can see you have a good grasp of how to generate reasonable animated speech - far more than most people I've had the pleasure of working with.  It's no easy task, and you're the first person I've found to make a serious step forward to try and create a framework for analysis/mapping of speech in WebGL.

 

The main thought I had was that most people don't fully understand animated speech until they've done more work than necessary - i.e. cmudict.0.7.a.js contains far more Arpabet phonemes than animators would ever use, as most people tend to want to use too many of these as morph targets.  However, it's good to reference these in audio recognition, and to pare them down to a third or fewer morph targets to be driven.

 

I come from a development background that couldn't be further away from HTML 5 / WebGL, so I'm a newbie in the community.  On this board, you've been someone who has pointed me in the right direction as I'm going through the birth pains.  I'm currently working in Blender to accelerate my learning curve in WebGL, but I prefer Maya for most of my modelling, rigging, etc.  I have recently discovered MakeHuman, and it's certainly interesting.  Much of the work I've done the past few years for software plugins is in Python - too many languages, too little time. :wacko:

 

I'm not certain if I can assist, but if you have a mesh and need morph targets, let me know and I can generate these for you to use - providing I better understand the design/constraints of your analysis process, and your vision for a result/output.  This is of extreme interest to me, and of course would be a valuable tool for the WebGL community.

 

Cheers,

 

David B.


David,

I listed my "contributors", like the Berkeley study, only to give them credit.  The user has no operational need to know this.  I grabbed the duration setting for each row in the Arpabet phoneme table from there.

 

As far as cmudict.0.7.a.js is concerned, you probably know this, but that big file only needs to be used in the dev tool itself, not in any game.  The outputs of this tool are tiny Arpabet sentences (the tool does not actually allow you to put them on the clipboard yet), and your .wav files.  The tool shows the intermediate Arpabet precisely because it is an output that needs to be copied to the game.  As for a list of Arpabet phonemes that I should ignore, I am all ears!

 

There is one nice consequence, for the low-cost / hobby developer, of being an HTML 5 page.  You could email someone who is willing to make recordings for you a list of sentences & speeds, and a link to the tool.  They could just email the .wav files back.

 

The slider controls, loudness & speed, are just settings you need to remember, so you can set them to the same values in the game, with your actual mesh.

 

You are going to like that MakeHuman & Blender are extensively written in Python.  MakeHuman does not look like it handles third-party plugins, but Blender does.  MakeHuman has a Blender importer, & there are 2 variations of an exporter from Blender.

 

If you are referring to JavaScript coding, you might be better off doing any coding in TypeScript & compiling to JavaScript.  It is much closer to Python than JavaScript is (JavaScript is so screwed up in my opinion).  One issue is that all the sample code / tutorials seem to be written in JavaScript, though.  The Blender exporter variant I have in the Extensions repository actually outputs generated .js or .ts source code files, not a .babylon.

 

Jeff


Hi Jeff,

 

Here is a list of fundamental  face states and phonemes I use for audio analysis and real time animation:

 

Fundamental Facial / Mouth States

Default

Silence

Breath

Loud/Shout

 

Phonemes

AE

AO

AX

E

FV

H

IY

KG

L

M

N

OW

PB

SZ

TD

UH

UW

 

This can be overkill, depending on the characteristics of the .wav file you're working with.  Every recording device and every voice has different characteristics.  I never run .wav analysis without a waveform equalizer in line.  If the voice is recorded on a quality recording device (20Hz - 18kHz+), then you will want to push the frequencies from 1kHz+ to (3kHz and 5kHz), depending on the result.  A 3-band parametric EQ works the best.  As audio recording varies so dramatically by device and by person, a 20Hz to 20kHz 3-band parametric EQ is the best - I was an audio engineer prior to wearing my pseudo-developer's hat.

 

Another very key element for quality facial from waveform analysis has been to provide the user with an attenuation / amplification slider per Facial State / Phoneme, as the quality of facial animation generally comes from individual phoneme attenuation / amplification.  It's not enough to recognize a phoneme; you then have to adjust for audio quality, dialect, idiosyncrasies in speech, etc., and this will dramatically affect output.  Adding these controls will improve animation quality by immeasurable amounts.
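In code terms, this is just a per-phoneme gain applied before the viseme's shape key values are used - a sketch only, assuming a hypothetical 0 to 2 slider range:

// Sketch: apply a per-phoneme attenuation/amplification slider before driving the morpher.
function applyPhonemeGain(shapeKeyValues: number[], phonemeGain: number): number[] {
    return shapeKeyValues.map(function (v) {
        return Math.min(Math.max(v * phonemeGain, 0), 1);   // clamp back to the 0..1 range the keys expect
    });
}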

 

It took me a long time to perfect this for film, television, and games, as I went into it with complete ignorance.  You're way ahead of the curve and truly onto something.

 

I really appreciate you introducing me to TypeScript - it's just what I've been looking for.  It seems simple enough to use, and I prefer to work in Visual Studio - depending on its integration.

 

Cheers,

 

David B.

 

P.S. I made some corrections to this post, as I copy a bit of my posts from docs as well as use a text editor that will spell words based on the text in play, and spelled phonemes wrong for most of this post. :blink:


David,

Thanks!  Am beginning to work in the ignores.  Your statement from your first post, that I was kind of doing multiple things at the same time, has been marinating.  I am in the process of addressing this and will be publishing a new version soon.  It will only be a start.  I do not expect to nail it the first time.  Stay tuned.

 

Your recording hints were interesting, but it would be really good if you could elaborate on "quality recording" from a computer sound card perspective.  Things like:

  • Using the sound card jacks vs going through the computer via USB.  I bought a cheap sound-jack headset, since I am running Linux.  The medium-priced ones were USB, but had no Linux support.  The high-priced ones were USB & had Linux drivers, but is this really a better mic, or is all the extra in the headphones part?
  • Are all sound boards the same?  Are there some that are worth upgrading to for recording capabilities?  Are these all just commodity components nowadays?
  • PC vs Mac recording.  Any difference?  I have both.
  • Stereo vs mono.  My code writes stereo, but I think it is worth the effort to make this switchable (see the sketch below).
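Here is what I think the mono / stereo switch would touch in the .wav header - a sketch only, assuming 16-bit PCM, where numSamples means sample frames per channel:

// Sketch: a .wav header writer where only numChannels differs between mono & stereo.
function writeWavHeader(view: DataView, sampleRate: number, numChannels: number, numSamples: number): void {
    function writeString(offset: number, s: string): void {
        for (var i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
    }
    var dataBytes = numSamples * numChannels * 2;            // 16-bit samples
    writeString(0, 'RIFF');
    view.setUint32(4, 36 + dataBytes, true);                 // RIFF chunk size
    writeString(8, 'WAVE');
    writeString(12, 'fmt ');
    view.setUint32(16, 16, true);                            // fmt sub-chunk size
    view.setUint16(20, 1, true);                             // PCM
    view.setUint16(22, numChannels, true);                   // 1 = mono, 2 = stereo
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * numChannels * 2, true);  // byte rate
    view.setUint16(32, numChannels * 2, true);               // block align
    view.setUint16(34, 16, true);                            // bits per sample
    writeString(36, 'data');
    view.setUint32(40, dataBytes, true);
}

For mono, the interleave() step would be skipped and leftBuffer written directly after the header.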

Jeff


Am in the process of turning off phonemes that are not on the list.  What if the word is 'hot'?  The result is 'HH' 'AA' 'T'.  None of those is on your list, so no change would happen.  What were you doing in this case?

 

Here is my phoneme table.  The first argument of the Phoneme constructor is an index into the VISEMES table; the second is an index into an array used for duration.  The higher the index, the shorter the duration.  Comments to the right with examples.

        private static ARPABET_DICT = {
            "." : new Phoneme(10, 0), // rest; durationIdx ignored, SPEECH_RATE used
            "AA": new Phoneme( 0, 1), // VOWEL: hOt, wAnt, bOUGHt, Odd
            "AE": new Phoneme( 0, 1), // VOWEL: At
            "AH": new Phoneme( 0, 4), // VOWEL: Up, Alone, hUt
            "AO": new Phoneme( 2, 1), // VOWEL: Off, fAll, frOst, hAUl, drAW
            "AW": new Phoneme( 2, 0), // VOWEL: cOW, OUt, mOUsE, hOUsE
            "AY": new Phoneme( 0, 1), // VOWEL: fInd, rIdE, lIGHt, flY, pIE
            "B" : new Phoneme( 7, 2), // CONS : Big, ruBBer
            "CH": new Phoneme( 9, 1), // CONS : CHip, maTCH
            "D" : new Phoneme( 4, 3), // CONS : Dog, aDD, fillED
            "DH": new Phoneme( 6, 4), // CONS : THis, feaTHer, THen
            "EH": new Phoneme( 0, 2), // VOWEL: bEd, brEAd
            "ER": new Phoneme( 3, 1), // VOWEL: bURn, fIRst, fERn, hEARd, wORk, dollAR
            "EY": new Phoneme( 0, 1), // VOWEL: bAcon, lAtE, dAY, trAIn, thEY, EIght, vEIn
            "F" : new Phoneme( 5, 2), // CONS : Fish, PHone
            "G" : new Phoneme( 4, 3), // CONS : Go, eGG
            "HH": new Phoneme( 0, 4), // CONS : Hot, House
            "IH": new Phoneme( 0, 2), // VOWEL: mIRRor, chEER, nEAR, If, bIg, wIn
            "IY": new Phoneme( 1, 1), // VOWEL: shE, thEsE, bEAt, fEEt, kEY, chIEf, babY
            "JH": new Phoneme( 9, 1), // CONS : Jet, caGe, barGe, juDGE, Gym
            "K" : new Phoneme( 4, 2), // CONS : Cat, Kitten, duCK, sCHool, oCCur
            "L" : new Phoneme( 6, 2), // CONS : Leg, beLL
            "M" : new Phoneme( 7, 2), // CONS : Mad, haMMer, laMB
            "N" : new Phoneme( 4, 4), // CONS : No, diNNer, KNee, GNome
            "NG": new Phoneme( 4, 1), // CONS : siNG, moNkey, siNk
            "OW": new Phoneme( 2, 1), // VOWEL: nO, nOtE, bOAt, sOUl, rOW
            "OY": new Phoneme( 2, 1), // VOWEL: cOIn, tOY
            "P" : new Phoneme( 7, 4), // CONS : Pie, aPPle
            "R" : new Phoneme( 4, 3), // CONS : Run, maRRy, WRite
            "S" : new Phoneme( 4, 2), // CONS : Sun, mouSE, dreSS, City, iCE, SCienCE
            "SH": new Phoneme( 9, 2), // CONS : SHip, miSSion, CHef, moTIon, speCIal
            "T" : new Phoneme( 4, 4), // CONS : Top, leTTer, sToppED
            "TH": new Phoneme( 4, 3), // CONS : THumb, THin, THing
            "UH": new Phoneme( 2, 3), // VOWEL: bOOk, pUt, cOULd
            "UW": new Phoneme( 2, 2), // VOWEL: hUman, UsE, fEW, tWO
            "V" : new Phoneme( 5, 4), // CONS : Vet, giVe
            "W" : new Phoneme( 8, 3), // CONS : Wet, Win, sWim, WHat
            "Y" : new Phoneme( 1, 2), // CONS : Yes, onIon
            "Z" : new Phoneme( 4, 2), // CONS : Zip, fiZZ, sneeZE, laSer, iS, waS, pleaSE, Xerox, Xylophone
            "ZH": new Phoneme( 9, 1), // CONS : garaGE, meaSure, diviSion
        };

Hi Jeff,

 

My apologies, as I've been slammed with work before the holiday.  REALLY looking forward to Thanksgiving.  :D  There is no advantage to using USB over analogue microphones or inputs, as every microphone is analogue, and USB is simply a conversion to digital.  The analogue mic will be converted to a digital signal in any computer audio board, and all audio boards now are 32 bit and 64 bit - which for speech, 32 bit is just as good.  

 

As for microphones, the rule is that the size of the microphone (actually the diaphragm inside the microphone) is roughly proportional to audio quality; however, for human speech, a headset microphone from Audiotechnica or Plantronics is fine.  I've used $300 Sennheiser headsets, and have found nominal improvement to frequency response in recording human speech - so a microphone in the $50 range plugged directly into the audio jack of almost any modern computer is very good for this application.  I would use a headset microphone without headphones (as headphones only add to the cost), and not a stationary microphone, so that you don't have to be concerned about directional attributes.  If you use the stationary microphone in a laptop, you'll see a dramatic decrease in audio quality - which cannot be improved much.

 

One key piece of external hardware that I find important in a portable audio rig is a small mixer, for audio quality control as well as other features.  I use a Mackie 402VLZ4 4-channel mixer, as it provides 48 volt phantom power to some microphones which require this.  So if you're shopping for microphones, make sure you check whether they require 48 volt phantom power, which will not work without a mixer that supports it.  Phantom power is a bonus, as the audio signal is consistently strong and it's simply a better microphone.  These usually run in the neighborhood of $100 for entry level, which is fine for your application.  I would like to use a mixer such as the Behringer 802, as it provides a single band of parametric EQ in addition to the low band and high band analogue EQs, but it does not provide phantom power - so it's a balance of budget and features.  I use an outboard digital parametric EQ, but these are quite expensive. :(

The Mackie mixer I use will cost about $99, and the Behringer costs about $65.  Also, if you do place a mixer in-line, don't forget to order a 1/8" adapter for your microphone (some come with this) and a cable with a mono 1/4" jack on one end and a mono or stereo 1/8" jack on the other end.  Be aware that if you are using a mono input, the signal will be missing from one of the stereo channels your computer audio board is processing - which is fine if you're aware.  I'm sure this is remedial info, but I didn't want to skip anything.

 

I hope this helps, as audio signal quality and versatility is the most important element in the system, as everything is on a level playing field once the signal is converted by your audio board.

 

Cheers,

 

David B.


Hi Jeff,

 

I wrote a reply to this today which seems to have been lost.  :(  I'll do my best to repeat the info.  First, my apologies for not responding sooner, but I've been in overdrive trying to produce work before the holidays.

 

Here's what I believe I posted earlier today - I should have used my text editor - which I'm still ignoring now after half a bottle of Wild Turkey and listening to LOUD music- a new Joe Walsh record (Analogue Man) - a Thanksgiving tradition!  :blink:(Wild Turkey)

 

So here goes!  Wow, I can't remember - The microphone is very important, and the size of the diaphragm in the microphone is somewhat proportional to audio recording quality.  However, for recording human voices, a cardioid headset microphone is sufficient.  There are two primary types - both analogue - phantom powered and low powered microphones.  I prefer a 48 volt phantom powered microphone, which requires a 48 volt power supply which some audio mixers provide (this is best), but you can also build your own using a couple of 9 volt batteries.  I use Audiotechnica and Plantronics headset microphones, both low power ($50 - $100 range) and phantom power ($100 and up).  I have also used Sennheiser headset microphones beginning at around $300 each, but these have shown nominal improvement.  Just check whether your microphone requires phantom power, as it will produce a low level or no level if plugged into your computer without running through an audio mixer or box that provides phantom power.

 

As for audio boards, since all microphones are analogue, the conversion to digital happens on your audio board.  Unless you are using software such as Pro Tools which has strict hardware requirements, my opinion is that most outboard audio boards are a waste and actually provide less versatility.  As all audio boards are either 32 bit or 64 bit, anything that converts the signal to 32 bit or above is fine.  I don't really like USB components for audio, since manipulating the analogue audio signal is much more versatile and far less expensive before your audio board converts the signal to a digital waveform.  What I would avoid is a stationary microphone as the level is difficult to control, and never use the microphone on a laptop (unless you have no alternatives) as this is the very worst in frequency response.  

 

A key component is a small 2 or 4 channel audio mixer.  I use the Mackie 402VLZ4 audio mixer ($99) as it provides me with phantom power for the microphones I use.  And I place an out-board parametric EQ in-line prior to the output of the mixer - but an out-board parametric EQ is a bit expensive.  If I were to go with the least expensive, but highest quality audio output for voice analysis, I would use a Behringer 802 4 channel mixer ($65.)  It doesn't provide phantom power, but does have a single band parametric EQ, and both low and high frequency analogue EQ sweeps - which should do the job.

 

You'll also require a 1/8" to 1/4" audio adapter for the microphone (many microphones come with this), and a mono or stereo cable with a 1/4" audio plug on one end and a 1/8" plug on the other end - which is required to plug into your computer.  I should also point out that mono is fine; however, as I'm sure you realize, only your left channel will produce an audio signal.  I use mono, as stereo is useless for this recording application.

 

I know I wrote more earlier today, and hope this post isn't lost to the ages as the last one.  Please let me know if there is any more I might do to assist in this area, and HAVE A GREAT THANKSGIVING!  I'm already doing so!   B)

 

Cheers,

 

David B.

 

Too much Holiday!  I didn't see the page addendum. :blink:


Hi Jeff,

 

Regardless of what the studies show, I use an AX, but AH or AU for the vowel in such a word as HOT is fine.  My personal preferences are certainly not outlined in any studies; however, in looking at the processing in many voice recognition applications, one of these should be supported.  AH is almost always supported.  If not, as long as the source is editable, you can define these identifiers and write them in yourself.  AA will never suffice for a word such as HOT.

 

Cheers,

 

David


Hi Jeff,

 

Just to make my last point clear, my own personal experience has demonstrated to me that AA does not seem to be very reliable in practice - which is why I try and avoid using AA when I can use AO (I hastily wrote AU instead of AO in my last post).  AA is generally defined too broadly in VR software, and often unconditionally picks up too broad a range of phonemes.  I like to be as selective as possible, and there's considerable redundancy in the Arpabet as it's defined, and I find even more redundancy in practice; generally due to differences in dialect, culture, and pronunciation, and also using untrained VR software.

 

Cheers,

 

David


David, Thanks.

 

I have a new version coming out soon, I hope.  I'll respond to you when it is ready.

 

In the meantime, I was wondering if I will have to switch to .mp3 format to get the upcoming Web Audio support for BJS to play my files?  (Am trying to get both mono & stereo output for .wav & it is difficult enough.)  I see posts that <audio> tags in IE do not work with .wav files.  Is this also going to be the case here?  I do not have Windows, so I cannot answer this myself.

 

Thanks,

 

Jeff

