A History of Motion
First, a question. What is 'video' and is it different from a sequence of frames? I would suggest that this question can be understood by beginning with a slide show, where each frame is 100% different from the next (obviously not a video) and progressively reducing the inter-frame difference bit by bit until you have a collection of identical frames.
I would suggest that a collection of frames becomes a video when there is a high degree of inter-frame redundancy; i.e. when it becomes advantageous to treat it like a video. Computer scientists have been dealing with issues of temporal redundancy since the beginning of Computer Vision in the 1960's. However, it was not until the late 1970's that these issues began to split off into a coherent and separate topic of research. One of the key figures in this early work is J. K. Aggarwal of the University of Texas at Austin. In 1979, Aggarwal, with N.I. Badler, held a workshop on "Computer Analysis of Time-Varying Imagery". This was followed in 1980 by a Special Issue on Motion and Time-Varying Imagery of PAMI and, in 1988, by a book called Motion Understanding: Robot and Human Vision. Any new topic goes through frequent changes of name in its early years. In his 1984 summary of the workshop, Aggarwal begins with a rundown of the multiple aliases of the field: "time-varying imagery", "image sequence processing" and "dynamic scene analysis", to which we may add more recent terms such as "spatio-temporal video analysis" (or 'spatial temporal') and the illogical "image motion analysis". Two of the key sub-topics, which have remained issues to this day, are the broad and related problems of motion segmentation and dealing with object occlusion. Segmentation was defined as
This area has since been subdivided into the process of Motion Detection, or motion segmentation from static (quasi-static or noisy) backgrounds, and the Tracking or correspondence problem. A frame sequence becomes video when it contains both an interesting moving foreground and an uninteresting background. This foreground should be trackable using frame-to-frame correspondence (i.e. the frame rate is high enough). The background should be distinguishable from the foreground based on motion. Using this definition of the task at hand, we can now better describe the nature of 'video' for our purpose.
The topic developed slowly through the 1980's and early 1990's and also split into a number of application-oriented sub-topics. Included in this is a spectrum of moving camera applications, from ego-motion to Pan-Tilt-Zoom CCTV cameras to static CCTV with shake to an assumption of perfect stability. The topic of motion detection and tracking in a quasi-stable camera (the scenario of Visual Surveillance) developed in the 1990's. As late as 1991, Pattern Recognition ran a "Glossary of computer vision terms" which contained only three motion-related terms out of a total of 308 ("Time varying imagery", "optical flow", and "structure from motion").
Motion processing in the 1980's was dominated by optical flow calculations using spatial feature detection and iterative refining for noise and ambiguity reduction. As Aggarwal saw it in 1988, the primary paradigms at that time were what he called "feature based approaches" and "optical flow based approaches". Both these methods are foreground based (i.e. they are frame-by-frame operations using instantaneous appearance information only) and differ mainly in terms of scale. By "feature based", Aggarwal means "a set of relatively sparse, but highly discriminatory, two dimensional features" which may be extracted and matched frame to frame (solving the correspondence problem) using constraints such as a rigid body assumption. "Optical flow" relies on a very large number of less discriminatory features, usually point features, i.e. single pixel values. To counter the extreme difficulties and ambiguities of this approach, a number of constraints and assumptions are made. The most important and most primal assumption in all visual tracking is the 'object assumption'. These assumptions can be turned into a set of equations describing the allowable motion of each point. However, simultaneous solution of these equations is costly, and so an iterative approach is usually employed.
Optical flow's strength, the simplicity of its feature points, is also its fundamental weakness. As the features are based solely on point brightness, the method is extremely vulnerable to lighting change. A shift in the direction or intensity of lighting will cause apparent optical flow even when no true motion is present (Nguyen 1996).
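The lighting sensitivity described above can be illustrated with a small NumPy sketch. It assumes the standard brightness-constancy constraint, Ix*u + Iy*v + It = 0, and computes only the "normal flow" (the component of motion along the image gradient). The function name `normal_flow`, the synthetic ramp image and the 10% brightness gain are illustrative assumptions and are not taken from any of the papers cited here.

```python
import numpy as np

def normal_flow(prev, curr, eps=1e-6):
    """Normal flow from the brightness-constancy constraint.

    Assumes Ix*u + Iy*v + It = 0, so the flow component along the
    gradient direction is -It / |grad I|. Illustrative sketch only.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    iy, ix = np.gradient(prev)            # derivatives along rows (y) and columns (x)
    it = curr - prev                      # temporal derivative between the two frames
    grad_mag = np.sqrt(ix ** 2 + iy ** 2)
    return -it / (grad_mag + eps)         # eps avoids division by zero in flat regions

# A static synthetic scene: a horizontal brightness ramp (hypothetical test data).
columns = np.arange(32, dtype=np.float64)
scene = np.tile(columns * 4.0, (32, 1))

# The same scene under 10% brighter lighting; nothing has moved.
brighter = scene * 1.1

flow_static = normal_flow(scene, scene)       # identical frames
flow_relit = normal_flow(scene, brighter)     # lighting change only

print("max |flow|, identical frames:", np.abs(flow_static).max())
print("max |flow|, lighting change :", np.abs(flow_relit).max())
```

With identical frames the estimated flow is zero, while the lighting change alone yields an apparent motion of up to several pixels per frame, even though the scene is static.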
Also during the 1980's, a simple alternative to optical flow, Frame Differencing, was being explored. The technique is based on the simple and cheap method of subtracting the current frame from its neighbour and thresholding the result to produce a binary motion detection mask. Research focused on innovative and adaptive thresholding methods to counter the great difficulties of noise and partial object segmentation. Other work-arounds for noise problems included taking the difference of thresholded edge maps, as these are more robust to illumination changes; however, this made the task of determining the noise level, and thus calculating the threshold level, more difficult. Applications included change detection for video compression (where it is still used to this day in the MPEG4 codec) and visual motion alarms.
Long & Yang's influential 1990 paper set out a number of methods involving the computation of a running average of pixel values in order to achieve 'Stationary Background Generation'. The paper also addresses the issue of an imperfect background (and mentions the effect later known as the 'transient background problem') and proposes a morphological solution. They tested the method on both indoor and outdoor scenes; however, the videos were quite short (54-71 frames), and so it is likely that they didn't experience many problems using the 'mean' as their statistic. Prior to this paper a number of 'frame differencing' approaches were reported, including Lee & Hsieh and Anderson; however, it is easy to understand why the seemingly obvious step of extending differencing to true statistical background modelling arose only in the 1990's by noting the comment from the final page of Long & Yang's paper:
References
Early motion work
Optical flow
Frame differencing
Background modelling
The Prehistory of Motion
All creatures with the ability to see rely on motion acuity to a greater or lesser extent.
In humans the visual task for the organism can be described as the determination of a figure-ground relationship. It has been shown that birds have the ability to define and distinguish patterns and objects using only motion information. Although most birds seem to have high visual acuity, hawks, penguins and insectivorous birds are strictly dependent on motion cues for detecting prey at considerable distances. Hawks can spot their small prey from distances of more than a few hundred meters. Similarly, one is always surprised to see a bird sitting on a branch of a tree suddenly fly to a particular point over a lake or pond, pick up an insect, and return. In neither case would a motionless object at such distances have been discriminated from the background. And the fact that many species that are hunted as potential prey have developed the behavior of 'freezing' as a very successful anti-predator strategy seems evidence in itself for the importance of motion information in figure-ground discrimination. Neurophysiological evidence points in the same direction: it has long been known that there are cells in the pigeon visual system that respond to relative motion between objects and background (e.g., Frost & Nakayama, 1983).
The ability to detect and react to motion was one of the earliest parts of biological vision to evolve. Research suggests that it actually evolved in primitive creatures before the ability to recognise objects based on appearance. How is this possible, if the correspondence problem first needs to be solved before motion can be detected? Further to this, a great deal of information can be gleaned from the curious symptoms of neurological diseases. Damage to the striate cortex (V1) in humans can lead to a condition known as blindsight, in which the patient is not consciously aware of, but can nevertheless locate, a moving object in the affected visual hemifield.
This spared ability in the "blind" area is presumably the result of the processing ability of the collothalamic visual system. In addition, the extrastriate cortex is the site of numerous important visually-responsive areas involved in color (V4), form (IT), and motion perception (MT). In humans the answer lies in the dorsal region of the visual cortex. There lie several million neurons, each of which stands watch over its individual sector of the visual field (its receptive field). Each neuron behaves as a band pass filter, responding to particular speeds and directions. The filter profile is spatio-temporal.
Work on profiling the response of visual neurons, started by Hubel and Wiesel in the 1960's, has been carried out by Young and others over the past 25 years. Initially, only the spatial responses of feature detector neurons were studied, but recent work has moved on to decoding the response of motion detector neurons as well.
Young's publications
Other References
Books
Motion Detection in Computers
There are two primary and opposite approaches to the motion detection problem which are commonly used in the literature: the statistical background modelling approach (Background approach), and the wide class of model-based feature and object tracking techniques (Foreground approach).
Background Modelling
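As a concrete point of reference, the two ideas traced in the history above, frame differencing and a running average of pixel values used as a stationary background, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reconstruction of Long & Yang's method: the exponential form of the running average, the learning rate `alpha`, the fixed `threshold`, and the names `frame_difference` and `RunningAverageBackground` are all hypothetical choices made for the example.

```python
import numpy as np

def frame_difference(prev, curr, threshold=25.0):
    """Binary motion mask from two neighbouring frames: |curr - prev| > threshold."""
    diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64))
    return diff > threshold

class RunningAverageBackground:
    """Per-pixel running-average background model (illustrative sketch).

    The exponential update with rate `alpha` is an assumption made for
    this example; it stands in for a running mean of pixel values.
    """

    def __init__(self, first_frame, alpha=0.05):
        self.alpha = alpha
        self.mean = first_frame.astype(np.float64)

    def apply(self, frame, threshold=25.0):
        frame = frame.astype(np.float64)
        # Foreground: pixels that differ strongly from the current background estimate.
        mask = np.abs(frame - self.mean) > threshold
        # Blend the new frame into the background estimate.
        self.mean = (1.0 - self.alpha) * self.mean + self.alpha * frame
        return mask

# Hypothetical test data: a flat grey background and a bright 2x2 object.
background = np.full((8, 8), 100.0)
with_object = background.copy()
with_object[2:4, 2:4] = 200.0

diff_mask = frame_difference(background, with_object)

model = RunningAverageBackground(background)
object_mask = model.apply(with_object)    # object present
empty_mask = model.apply(background)      # object gone again

print("differencing pixels:", int(diff_mask.sum()))
print("background-model pixels:", int(object_mask.sum()))
print("pixels after object leaves:", int(empty_mask.sum()))
```

The difference between the two approaches lies in what the current frame is compared against: frame differencing uses only the neighbouring frame, whereas the background model compares against an estimate accumulated over many frames, which is why a plain mean becomes harder to rely on as sequences grow longer.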