Audio: Puzzled

igor1960

Posted: 20 January 2010 12:21 PM

Jr. Member

Total Posts: 49

Joined 2010-01-15

I had some time to play with the PS3Eye audio and even wrote a DirectShow filter that splits the multichannel audio stream into separate mono channels, one per source channel. I also implemented an algorithm that finds the source direction, and here I’ve stumbled upon a strange thing.
I don’t know about other systems, but on my Windows XP, if you go to GraphEdit you might find 2 audio capture filters corresponding to USB Camera B4.04.27.1:
1) The first one is the Audio Capture Source filter (implemented by qcap.dll and located in the “Audio Capture Source” category);
2) The second one is the WDM USB Audio device filter (implemented by ksproxy.ax and located in the “WDM Streaming Capture Devices” category).

So, the first one (Audio Capture Source filter) provides a huge bunch of WAVEFORMATEX formats, from 8 kHz up to 96 kHz, but the best of them is just stereo (2 channels). Also, on XP I’m experiencing a huge delay when using this filter.

The second one (WDM USB Audio device) provides just one WAVEFORMATEX: 4 channels at 16 kHz. So I assumed this is the one that sends separate data for each of the 4 microphones of the PS3Eye.
Using the WDM USB Audio device driver, I’m able to extract each of the 4 channels and process channels 1 and 4, finding the delay between them and thus estimating the direction of the sound.
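
For readers who want to try this outside DirectShow, here is a minimal Python sketch of the two steps described above: splitting an interleaved 4-channel buffer into mono channels, and finding the integer-sample delay between channels 1 and 4 by brute-force cross-correlation. It assumes 16-bit little-endian interleaved PCM (the post does not state the sample width), and the names `split_channels` and `integer_delay` are made up for this illustration. It is not the filter from the post.

```python
import struct


def split_channels(pcm_bytes, channels=4):
    """Split interleaved signed 16-bit little-endian PCM into one list per channel."""
    count = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    return [list(samples[ch::channels]) for ch in range(channels)]


def integer_delay(a, b, max_lag=3):
    """Return the lag (in samples) that maximizes sum(a[n] * b[n + lag]).

    A positive result means b lags behind a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0
        for n, value in enumerate(a):
            j = n + lag
            if 0 <= j < len(b):
                score += value * b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic buffer (assumption): a click hits channel 1 at frame 10 and channel 4 at frame 12.
frames = [[0, 0, 0, 0] for _ in range(32)]
frames[10][0] = 1000
frames[12][3] = 1000
pcm = b"".join(struct.pack("<4h", *frame) for frame in frames)

ch = split_channels(pcm)
print(integer_delay(ch[0], ch[3]))  # delay of channel 4 relative to channel 1, in samples
```

The search range of ±3 samples matches the maximum shift quoted for 16 kHz in this thread.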

The problem is that only one sample rate, 16 kHz, is available. This gives a maximum shift of just 3 samples between channel 1 and channel 4. It would be great to have 44.1 kHz; that would give a difference of around 8 samples. Running UVCView confirms that this is real: only 16 kHz streaming audio data is available on this USB port.

So, my questions are:
- How come the Audio Capture Source filter (the first one) is able to provide high capture sample rates from a device that natively (I think) provides just 16 kHz?! Even if we assume they do it by converting 4 channels into 2, the maximum rate on each stereo channel should be 2 * 16 kHz = 32 kHz.
- Maybe the PS3Eye audio can natively digitize at more than 16 kHz, but this capability is not exposed through the WDM USB Audio device driver? If so, maybe someone knows a way to obtain some other WDM driver?

So, it’s more of a question for the gurus (unless I’m completely stupid, which is extremely and highly possible). LOL

AlexP

Posted: 20 January 2010 01:05 PM

[ # 1 ]

Administrator

Total Posts: 585

Joined 2009-09-17

I haven’t investigated the audio part of the camera much yet, but I’m planning to do that once I’m done with the camera part.
You bring up a good point, and I’m pretty sure that the camera does provide a higher audio sampling rate. What I think you get is two stereo channels that carry information from the 4 microphones.

AlexP

igor1960

Posted: 20 January 2010 02:59 PM

[ # 2 ]

Jr. Member

Total Posts: 49

Joined 2010-01-15

Reason found and problem solved!

OK, the USB Audio device also appears among “WDM Streaming System Devices”.
That instance supports the whole bunch of formats,
among them both 16 kHz 4-channel and 48 kHz 2-channel.

So it looks like either the PS3Eye hardware natively does only 16 kHz 4-channel, or Microsoft’s ksproxy can only work with that format and doesn’t expose higher frequencies at 4 channels.
I doubt that MSFT couldn’t do better, though, so my conclusion for now is that it’s the hardware.

chuck

Posted: 21 January 2010 04:38 PM

[ # 3 ]

New Member

Total Posts: 23

Joined 2010-01-18

Hi, just for the fun of it I recorded all 4 streams in Reaper using ASIO4ALL, and I have a picture of it where you can see that the delay is in fact 3 samples.
Some of you may find it boring, but I find it kinda fascinating.

I snapped my fingers just to the right of the mic array, and on the first peak you can clearly see the delay between channel 1 and channel 4. Woohoo!

[Image attachment: the four recorded channels in Reaper]

igor1960

Posted: 21 January 2010 05:08 PM

[ # 4 ]

Jr. Member

Total Posts: 49

Joined 2010-01-15

Yes, at 16 kHz you would get a difference of 3 samples.
However, at 44.1 kHz that would become around 8 samples.

Assuming a 65 mm distance between the left and right mics:

65 mm / 1000 / 343 * 44100 ≈ 8.36

where 343 m/s is the speed of sound in air.
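
The same arithmetic for several sample rates, as a small sketch. The function name `max_delay_samples` is made up here; the 65 mm spacing and 343 m/s come from the post above, and the 48 kHz row is included only because that rate was mentioned earlier in the thread.

```python
def max_delay_samples(mic_spacing_m, fs_hz, speed_of_sound=343.0):
    """Largest possible inter-mic delay, in samples, for sound arriving along the mic axis."""
    return mic_spacing_m / speed_of_sound * fs_hz


for fs in (16000, 44100, 48000):
    print(fs, round(max_delay_samples(0.065, fs), 2))
```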

AlexP

Posted: 21 January 2010 06:42 PM

[ # 5 ]

Administrator

Total Posts: 585

Joined 2009-09-17

igor1960, I’m not sure why you think that having a 16 kHz audio signal is a bad thing. As chuck has shown, it works perfectly fine for determining the direction of the audio source.

AlexP

igor1960

Posted: 21 January 2010 07:54 PM

[ # 6 ]

Jr. Member

Total Posts: 49

Joined 2010-01-15

Alex,
No, you misunderstood me: 16 kHz is fine. It gives a 3-point sampling resolution and is the native frequency of the device.
What I was saying is:
- The USB Audio Capturing filter doesn’t expose 4 channels at 16 kHz. Instead it offers a bunch of other frequencies, but none of them with 4 channels. That means the USB Audio Capturing filter somehow processes the native 4 channels at 16 kHz and converts them to at most 2 channels.
As we don’t know the rules of that conversion, I was under the impression that there might be problems calculating delays.

However, as I said, I’ve tried implementing a DirectShow filter that finds the direction, and it works relatively well on the 4-channel 16 kHz stream exposed by the WDM driver, as well as on the 40+ kHz stream provided by the USB Audio Capturing filter.

Another point I was making is the question: what is the native PS3Eye audio format? If it is more than 16 kHz, then we can do better than just finding a “left-center-right” direction and in fact calculate the direction more precisely.

I just remembered that several years ago, if I’m not mistaken, Honeywell won a government research contract for a system that determines the direction of an enemy gunshot.
So, basically, with a higher frequency and/or a larger parallax between opposite mics, we are ready to fire back. LOL

If we had a higher frequency available, another hypothetical application might be (and here I’m generating a patentable idea):
- Imagine that we create a DirectShow filter that compares the left and right channels and only passes through samples where the left and right values are exactly equal (or very close to each other).
Effectively, that would output only samples that came directly from the center of the system. Basically, we would be capturing sound coming from a very narrow geometrical channel.
The problem is how to remove background noise.
But here we have 4 mics, not just 2, so we might think about this (I don’t know how yet).
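
A deliberately naive Python sketch of the gating idea described above, taken literally: keep a sample only where left and right agree within a tolerance, otherwise output silence. The name `center_gate`, the averaging of the two channels, and the tolerance value are assumptions made for illustration. This is not a real beamformer, and it does nothing about the background-noise problem raised above.

```python
def center_gate(left, right, tolerance):
    """Pass the mean of left/right where they agree within tolerance; output 0 elsewhere."""
    return [
        (l + r) / 2 if abs(l - r) <= tolerance else 0
        for l, r in zip(left, right)
    ]


# Samples 0 and 2 agree closely, so they pass; sample 1 is gated out.
print(center_gate([10, 10, -5], [10, 50, -6], tolerance=1))
```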

AlexP

Posted: 21 January 2010 08:34 PM

[ # 7 ]

Administrator

Total Posts: 585

Joined 2009-09-17

Well, it is not true that you can get only a few discrete directions of sound. Even if the signal is sampled at discrete intervals, you can get much finer precision by interpolating (up-sampling) that signal, thus effectively increasing the precision many times. So even only 3 points could give you, say, 30 different positions, or even 120 if you wish. And your idea about masking the noise from the sides and passing only the direct sound is very doable.
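
To make the link between delay and direction concrete, here is a small sketch that converts a delay in samples (possibly fractional) into an arrival angle. It assumes a far-field source, so that delay = spacing × sin(angle) / speed of sound, which is a modelling assumption not stated in the thread. The 65 mm spacing and 343 m/s are the figures quoted earlier, and `angle_from_delay` is a made-up name.

```python
import math


def angle_from_delay(delay_samples, fs_hz, mic_spacing_m=0.065, speed_of_sound=343.0):
    """Arrival angle in degrees (0 = straight ahead) under a far-field assumption."""
    ratio = delay_samples / fs_hz * speed_of_sound / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp values pushed out of range by rounding or noise
    return math.degrees(math.asin(ratio))


# Integer lags at 16 kHz give only a handful of distinct angles...
print([round(angle_from_delay(lag, 16000), 1) for lag in range(-3, 4)])
# ...while fractional lags (for example from 4x up-sampling) fill in the gaps.
print([round(angle_from_delay(lag / 4, 16000), 1) for lag in range(0, 5)])
```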

AlexP

igor1960

Posted: 21 January 2010 11:14 PM

[ # 8 ]

Jr. Member

Total Posts: 49

Joined 2010-01-15

Alex,

Sorry to say this, and maybe I misunderstood you, but I feel you are not right about infinite possible precision.
In this particular case we are talking about a signal digitally sampled in the time domain; that’s what the frequency of the audio signal is.
Obviously, for such cases the Kotelnikov theorem applies: to reconstruct an analog signal of bandwidth B, the sampling frequency must be at least 2B.
Therefore, if you have a signal already digitized at 16 kHz, you can at best restore the original analog signal up to around 8 kHz.
As the parallax between the mics is just ~65 mm, 16 kHz digital sampling gives us just 3 points of resolution. That is obviously not enough to interpolate with, as the analog equivalent is just 1.5 samples, which converted to a discrete array is just 1 analog sample of precision.
1 analog sample might be enough to conclude left-center-right, but not enough to interpolate in between.
That’s one of the reasons I was looking into the possibility of the PS3Eye natively returning audio at more than 16 kHz.

If we assume 40+ kHz coming from the PS3Eye, that’s around twice the highest audio frequency of 20 kHz, so this would allow analog restoration up to 20 kHz. More than 40 kHz would be even better.

AlexP

Posted: 22 January 2010 01:01 AM

[ # 9 ]

Administrator

Total Posts: 585

Joined 2009-09-17

I think you misunderstood me. What you are talking about is the Shannon sampling theorem (also known as the Nyquist–Shannon–Kotelnikov theorem) and the Nyquist rate. But that is not the point here. We are not talking about signal restoration but about signal correlation.
And yes, by up-sampling the signal (for example with a windowed sinc filter) and then cross-correlating the channels, you will find the peak of the correlation function, which corresponds to the angular direction of the sound. So, for example, if you simply up-sample the input signal by 2, your precision will improve and you will be able to detect a fractional position of your sound source.
If you don’t believe me, try it for yourself. Besides, this is how many algorithms in digital signal processing work.
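
A minimal way to try it, as a pure-Python sketch rather than anyone’s actual implementation: a smooth pulse is sampled on two channels with a true offset of 1.5 samples, both channels are up-sampled 4x with a Hann-windowed sinc kernel, and the cross-correlation peak is located on the finer grid. The Hann window, the kernel half-width of 8, the Gaussian test pulse, and the function names are all assumptions chosen for this illustration.

```python
import math


def sinc_upsample(x, factor, half_width=8):
    """Upsample x by an integer factor with a Hann-windowed sinc kernel."""
    out = []
    for m in range(len(x) * factor):
        t = m / factor  # output position measured in input samples
        k0 = math.floor(t)
        acc = 0.0
        for k in range(k0 - half_width + 1, k0 + half_width + 1):
            if 0 <= k < len(x):
                u = t - k
                sinc = 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)
                window = 0.5 * (1.0 + math.cos(math.pi * u / half_width))
                acc += x[k] * sinc * window
        out.append(acc)
    return out


def best_lag(a, b, max_lag):
    """Return the lag that maximizes sum(a[n] * b[n + lag]); positive means b is later."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, value in enumerate(a):
            j = n + lag
            if 0 <= j < len(b):
                score += value * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def pulse(t):
    """Smooth test pulse, narrow in time but with negligible content near Nyquist."""
    return math.exp(-((t - 32.0) / 2.0) ** 2)


left = [pulse(n) for n in range(64)]
right = [pulse(n - 1.5) for n in range(64)]  # same pulse arriving 1.5 samples later

factor = 4
lag = best_lag(sinc_upsample(left, factor), sinc_upsample(right, factor), 4 * factor)
delay = lag / factor
print(delay)  # estimated delay in original samples, resolved to 1/4 sample
```

On the original grid the search can only return whole samples; after up-sampling, the same search resolves quarter-sample steps. No bandwidth is added, which is consistent with the sampling-theorem point made above.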

AlexP

igor1960

Posted: 22 January 2010 01:58 AM

[ # 10 ]

Jr. Member

Total Posts: 49

Joined 2010-01-15

Alex, what you are saying about correlation is OK.

Actually, I might be wrong (which I frequently am) and/or I just don’t know enough, but:

What I’m saying is: a signal digitized at 16 kHz is good enough to represent analog data of up to 8 kHz.
Do you agree with me?
If you disagree: run your 16 kHz data through an FFT in any frequency analyzer. What would it show? It might show a maximum peak anywhere between 0 and 8 kHz.
If you see anything above 8 kHz, that would be harmonics, right?
Now do the upsampling you propose, let’s say by a factor of 2, and run the FFT again: would you see the same picture?
So you multiplied the number of samples, so what? Your precision didn’t increase.
You could upsample by any factor, convert everything to double, implement who knows what algorithms, etc., but the principle stays the same: your original digitized source was 16 kHz, and you will not be able to reliably extract more data from it than it originally contained.

Now back to the audio spectrum:
16 kHz audio would precisely cover the audio spectrum from 0 to 8 kHz, right?
So what about a source with a frequency higher than 8 kHz? Even voice goes up to 12 kHz.
What are you going to do?
You have 2 neighbouring sampling points and you don’t know anything about what happens in between them.
What if between them there is a source analog frequency above 16 kHz? And what if it is of sizeable amplitude? Where would you get those points?

On the other hand, here we have left/right channels where we know the signals are similar, just shifted, so maybe you are right and it’s possible. My brains are just recently composted. LOL

AlexP

Posted: 22 January 2010 03:16 PM

[ # 11 ]

Administrator

Total Posts: 585

Joined 2009-09-17

To respond to your notes:
First of all, we are not trying to increase the bandwidth of the input signal, and as you said (and you are right) we can only get to the theoretical maximum of 8 kHz. Also, what you’re saying about the possibility of a high-amplitude 16 kHz signal between two sampling points is not valid. If it were, it would mean that our analog signal is not continuous, and that would break many other things, including the basic assumptions of signal processing. Besides, such a signal would not exist in our input anyway if our sampling process is any good (which it is, btw): to avoid aliasing, the analog signal is first band-limited (low-pass filtered) to 8 kHz (or below) and then digitized.
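
A short illustration of why that low-pass filter matters, under the assumption of ideal sampling at 16 kHz: without it, a 10 kHz tone produces exactly the same samples as a 6 kHz tone, so content above 8 kHz cannot be told apart from content below it. The helper name `sample_tone` is made up.

```python
import math


def sample_tone(freq_hz, count, fs_hz=16000):
    """Samples of a unit cosine at freq_hz, taken at fs_hz."""
    return [math.cos(2 * math.pi * freq_hz * k / fs_hz) for k in range(count)]


above_nyquist = sample_tone(10000, 32)
alias = sample_tone(6000, 32)
print(max(abs(a - b) for a, b in zip(above_nyquist, alias)))  # difference is only rounding error
```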

Now let’s move to something you are more familiar with: the equivalent of what I’m talking about, but this time in the 2D domain. Let’s take a look at stereo vision and how it is done.
Let’s assume that we have left and right images of the scene, already rectified. To find the depth information, we run a SAD (sum of absolute differences) search across the image blocks to correlate them and find the best match. Once we have that, the relative displacement of the matched blocks gives us the depth. Do you see the similarity here?
Now, since our image resolution is limited and we want to increase our precision, what can we do (we don’t have higher-resolution cameras)?
The solution: by simply up-scaling the images and performing the search on the new, larger images, our search will yield sub-pixel precision. Sounds familiar?
Here is another real-world example. Most of the videos you watch on YouTube, or even on your digital TV, are compressed with modern video codecs that all use the exact same sub-pixel SAD search as part of their motion compensation algorithm. Why would they do that if, as you argue, it doesn’t work? Well, they use it because it does work, and it works really well.
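
A one-dimensional toy version of the sub-pixel SAD search described above, offered as a sketch only. One image row contains a smooth blob, the second row contains the same blob shifted by 2.5 pixels, both rows are up-scaled 2x by linear interpolation, and the SAD search runs on the larger rows. The blob shape, the linear interpolation, and the function names are assumptions for illustration; real stereo matchers and video codecs use 2D blocks and their own interpolation filters.

```python
import math


def upscale_2x(row):
    """Double the resolution of a 1-D row by linear interpolation."""
    out = []
    for i in range(len(row) - 1):
        out.append(row[i])
        out.append((row[i] + row[i + 1]) / 2.0)
    out.append(row[-1])
    return out


def sad_shift(ref, target, max_shift):
    """Return the shift minimizing the sum of |ref[i] - target[i + shift]| over the overlap."""
    best, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        cost = 0.0
        for i, value in enumerate(ref):
            j = i + shift
            if 0 <= j < len(target):
                cost += abs(value - target[j])
        if cost < best_cost:
            best, best_cost = shift, cost
    return best


def blob(x):
    """Smooth intensity bump centered at pixel 18."""
    return 100.0 * math.exp(-((x - 18.0) / 4.0) ** 2)


left_row = [blob(x) for x in range(40)]
right_row = [blob(x - 2.5) for x in range(40)]  # same bump, displaced by 2.5 pixels

shift = sad_shift(upscale_2x(left_row), upscale_2x(right_row), max_shift=10)
print(shift / 2)  # displacement in original pixels, resolved to half a pixel
```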

AlexP

igor1960

Posted: 26 January 2010 12:00 PM

[ # 12 ]

Jr. Member

Total Posts: 49

Joined 2010-01-15

Alex,
I completely agree with what you are saying. However, what I was trying to say is that here we have a very low-parallax system, where the distance between the left and right mics is just around 60 mm.
I was implying that the delay search would only work well for a sound source located very close to the system, so that triangulation gives a significant difference in travel distance to the left and right mics.
However, the longer the distance to the source, the smaller the difference between the left and right sound paths (unless it’s perpendicular), and therefore, while determining the direction might still be possible, its precision will degrade.
Obviously, if the system natively provides a higher sampling rate, it makes sense to use it without upsampling.
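
A quick geometric check of the distance question, as a sketch under assumptions that are not in the thread: two point microphones 65 mm apart, a point source in the same plane, and the angle measured from the perpendicular to the mic axis. In this simple model the path difference settles at spacing × sin(angle) as the source moves away instead of continuing to shrink, which suggests the limiting factor at a distance is the small spacing itself. The function name `path_difference` is made up.

```python
import math

MIC_SPACING_M = 0.065  # spacing quoted earlier in the thread


def path_difference(distance_m, angle_deg, spacing_m=MIC_SPACING_M):
    """Left-minus-right path length for a point source; mics at (-d/2, 0) and (+d/2, 0)."""
    angle = math.radians(angle_deg)
    sx = distance_m * math.sin(angle)
    sy = distance_m * math.cos(angle)
    left = math.hypot(sx + spacing_m / 2, sy)
    right = math.hypot(sx - spacing_m / 2, sy)
    return left - right


far_field = MIC_SPACING_M * math.sin(math.radians(30))
for distance in (0.1, 1.0, 10.0):
    print(distance, path_difference(distance, 30), far_field)
```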

sawyeriii

Posted: 18 March 2010 09:39 PM

[ # 13 ]

New Member

Total Posts: 5

Joined 2010-01-07

Regardless of whether to up-sample or not to up-sample…

Alex, have you had a chance to work on the audio drivers for the CL labs PS3Eye driver?

Signature

Robert L. Sawyer III

SawyerIII @ Assorted
