Tuesday, January 14, 2003
( 10:28 AM ) Matt
Gaze detection continued.
We use stereo image data (from our two eyes) to aid us in a lot of tasks. I just read a paper that uses stereo image data to distinguish between background and foreground. [Head tracking using stereo: D. Russakoff et al., Machine Vision and Applications (2002) 13: 164-173] They use hardware which extracts depth information from two video images. How accurate is this depth information? I did a quick and dirty bit of math, and came up with this: First, assume that our input source is a TV-quality video signal, which gives about 300 to 600 horizontally discernible pixels. A typical field of view for a video camera is about 60 degrees, so one degree of angle covers roughly 5 to 10 pixels. A little bit of trig shows that if our stereo cameras are placed 1 foot apart, and the subject is 2 feet from the cameras, then 1 inch of depth corresponds to about a 1 degree difference in angle between the cameras, which is a disparity of about 5 to 10 pixels. So, if we are able to exactly map the two images to each other pixel for pixel, then we may be able to achieve a depth resolution of 0.2 inches at best.
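The trig above can be checked with a few lines of Python. The baseline, distance, field of view, and pixel counts are all the post's assumed numbers, not measurements from any real camera:

```python
import math

# Back-of-the-envelope check of the depth resolution estimate.
# All numbers are assumptions from the text, not measurements.
baseline_in = 12.0         # cameras 1 foot apart
distance_in = 24.0         # subject 2 feet from the cameras
fov_deg = 60.0             # assumed horizontal field of view
pixel_counts = (300, 600)  # TV-quality horizontal resolution range

def vergence_deg(z_in):
    """Angle between the two cameras' lines of sight to a point z_in inches away."""
    return math.degrees(2 * math.atan((baseline_in / 2) / z_in))

# How much the vergence angle changes per inch of depth, near 2 feet.
deg_per_inch = vergence_deg(distance_in) - vergence_deg(distance_in + 1)

# Best-case depth resolution if we can match the images pixel for pixel.
depth_res_in = {}
for px in pixel_counts:
    px_per_deg = px / fov_deg  # 5 to 10 pixels per degree
    depth_res_in[px] = 1 / (deg_per_inch * px_per_deg)
    print(f"{px} px wide: {px_per_deg:.0f} px/deg, "
          f"depth resolution ~{depth_res_in[px]:.2f} in")
```

Running this gives about 1 degree of vergence change per inch of depth at 2 feet, and a per-pixel depth resolution in the 0.1 to 0.2 inch range, which matches the estimate above.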
In addition to separating figure from ground, as in Russakoff's work, we can also gather some 3D information about the object being viewed. For gaze detection, the angular orientation of the face is critical. If the outer edges of the eyes are 6 inches apart, and we have a depth accuracy of 0.2 inches, then the difference in depth between the two eye corners pins down the face orientation to about 2 degrees. If the computer screen takes up 30 degrees of your field of view, then the accuracy of gaze detection is going to be, at best, 1/15th of the screen. There's also a lot of noise, which isn't going to help. How can we make this better?
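The orientation estimate follows the same pattern; again the eye spacing, depth accuracy, and screen angle are the post's assumed numbers:

```python
import math

# Face orientation accuracy from stereo depth, using the post's numbers:
# outer eye corners 6 inches apart, depth good to 0.2 inches.
eye_span_in = 6.0
depth_res_in = 0.2

# If one eye corner's depth can be off by the full depth resolution
# relative to the other, the recovered head yaw is off by roughly:
angle_err_deg = math.degrees(math.atan(depth_res_in / eye_span_in))

# With the screen spanning ~30 degrees of view, gaze accuracy as a
# fraction of the screen:
screen_deg = 30.0
screen_fraction = angle_err_deg / screen_deg
print(f"orientation error ~{angle_err_deg:.1f} deg, "
      f"roughly 1/{screen_deg / angle_err_deg:.0f} of the screen")
```

The arctangent comes out just under 2 degrees, so the "1/15th of the screen" figure is the right order of magnitude.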
We have a very low spatial resolution to work with, but we are acquiring 30 frames per second from the camera, which is a pretty high temporal resolution. When you move the mouse to a particular place on the screen, you get it in the ballpark pretty quickly, but if it's a small button, you might spend half a second getting it spot on. Half a second is 15 frames. How could we increase the spatial resolution by combining samples from a sequence of temporally spaced frames?
Have you noticed that when you press pause while watching a video, the paused frame never looks quite as good as the moving video? Even on a really nice TV and VCR, you can't make out as much detail. That's because your eye merges information from the previous frames to help interpret the current one. We can have the computer do the same thing.
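Here is a toy sketch of that frame-merging idea, not the paper's method: a stationary point sits at a sub-pixel position, each frame measures it with noise and rounds to a whole pixel, and averaging across the 15 frames of that "half second" recovers extra precision. The position, noise level, and frame count are all made up for illustration:

```python
import random

# Toy temporal super-resolution: average many quantized, noisy
# per-frame measurements of a stationary point to get a sub-pixel
# estimate. A sketch of the principle only, with invented numbers.
random.seed(0)

true_pos_px = 10.3   # hypothetical ground-truth position, in pixels
n_frames = 15        # ~half a second at 30 frames per second

def measure_one_frame():
    # One frame's estimate: noisy, then quantized to the nearest pixel.
    return round(true_pos_px + random.gauss(0, 0.5))

single_frame = measure_one_frame()
merged = sum(measure_one_frame() for _ in range(n_frames)) / n_frames

print(f"single frame: {single_frame} px, "
      f"merged over {n_frames} frames: {merged:.2f} px")
```

A single frame can only ever report a whole pixel, while the merged estimate lands much closer to the true sub-pixel position, for the same reason averaging drives down noise in any repeated measurement.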
More to be continued...