FACE DETECTION USING NEURAL NETWORK
Aug 2nd, 2007 by admin
This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm with Boosting in Hierarchical Feature Spaces to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000).

(Note: This Paper was presented in ICSIP., Signalspot Please Download the paper
for proper formatting, images,equations and symbols)
KEY WORDS
face detection, boosting, human sensing
-
INTRODUCTION
Face to Face communication is a real time process operating at a time scale. The level of uncertainty at this time scale is considerable, making it necessary for humans & machines to rely on sensory rich perceptual primitives rather than slow symbolic inference process. Because of real time bandwidth & environmental constraints, video processing has to deal with much lower resolution & image quality, when compared photograph processing. Video images can be easily acquired & they can capture the motion of person , so its make possible to track people until they are in a position convenient for recognition.
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid visual detection. A frontal face detection system which achieves detection and false positive rates which are equivalent to the best published results (Sung and Poggio, 1998; Rowley et al., 1998; Osuna et al., 1997a; Schneiderman and Kanade, 2000; Roth et al., 2000). This system achieves high frame rates working only with the information present in a single gray scale image.
There are three main contributions of this face detection framework. Here is introduction of these ideas briefly below and then describe them in detail in subsequent chapters.The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Here use a set of features which are reminiscent of Haar Basis functions In order to compute these features very rapidly at many scales here introduce the integral image representation for images (the integral image is very similar to the summed area table used in computer graphics (Crow, 1984) for texture mapping). The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
The second contribution of this paper is a simple and efficient classifier that is built by selecting a small number of important features from a huge library of potential features using AdaBoost. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classifi-
cation, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Feature selection is achieved using the AdaBoost learning algorithm by constraining each weak classifier to depend on only a single feature & hierarchical feature . As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance.
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image a face might occur. More complex processing is reserved only for these promising regions. The key measure of such an approach is the “false negative” rate of the attentional process. It must be the case that all, or almost all, face instances are selected by the attentional filter. Here describe a process for training an extremely simple and efficient classifier which can be used as a “supervised” focus of attention operator. A face detection attentional operator can be learned which filter out over 50% of the image will while preserving 99% of the faces (as evaluated over a large dataset). This filter is exceedingly efficient; it can be evaluated in 20 simple operations per location/scale (approximately 60 microprocessor instructions).
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree.
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. This increase in speed will enable real-time face detection applications on systems where they were previously infeasible. In applications where rapid frame-rates are not necessary, this system will allow for significant additional post processing and analysis. In addition to this system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors.
-
FEATURE
Face detection procedure classifies images based on the value of simple features. There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature-based system operates much faster than a pixel-based system. The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. (1998).
Fig. 1 Example rectangle features shown relative to the enclosing detection window
The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Fig. 2.1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
The sum of the pixels which lie within the white rectangles are subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature. Unlike the Haar basis, the set of rectangle features is over complete.
2.1. Integral Image
Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image. The integral image at location x, y contains the sum of the pixels above and to the left of x, y, inclusive:
where, ii (x, y) is the integral image and i (x, y) is the original image.
3. LEARNING CLASSIFICATION FUNCTIONS
Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. More recently Roth et al. (2000) have proposed a new and unusual image representation and have used the Winnow learning procedure.
Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. The hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features. In this system a variant of AdaBoost is used, both to select the features and to train the classifier (Freund and Schapire, 1995). In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple learning algorithm (e.g., it might be used to boost the performance of a simple perceptron). It does this by combining a collection of weak classification functions to form a stronger classifier. In the language of boosting the simple learning algorithm is called a weak learner. So, for example the perceptron learning algorithm searches over the set of possible perceptrons and returns the perceptron with the lowest classification error. The learner is called weak because we do not expect even the best classification function to classify the training data well (i.e. for a given problem the best perceptron may only classify the training data correctly 51% of the time). In order for the weak learner to be boosted, it is called upon to solve a sequence of learning problems. After the first round of learning, the examples are re-weighted in order to emphasize those which were incorrectly classified by the previous weak classifier. The final strong classifier takes the form of a perceptron, a weighted combination of weak classifiers followed by a threshold.
The formal guarantees provided by the AdaBoost learning procedure are quite strong. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly. The conventional AdaBoost procedure can be easily interpreted as a greedy feature selection process. Consider the general problem of boosting, in which a large set of classification functions are combined using a weighted majority vote. The challenge is to associate a large weight with each good classification function and a smaller weight with poor functions. AdaBoost is an aggressive mechanism for selecting a small set of good classification functions which nevertheless have significant variety. Drawing an analogy between weak classifiers and features, AdaBoost is an effective procedure for searching out a small number of good “features” which nevertheless have significant variety.
One practical method for completing this analogy is to restrict the weak learner to the set of classification functions each of which depend on a single feature. In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier (h(x, f, p,?)) thus consists of a feature ( f ), a threshold (?) and a polarity (p) indicating the direction of the inequality:
Here x is a 20 × 20 pixel sub-window of an image.
Fig. 2 The boosting algorithm for learning a query online
T hypotheses are constructed each using a single feature. The final hypothesis is a weighted linear combination of the T hypotheses where the weights are inversely proportional to the training errors.
The weak classifiers that we use (thresholded single features) can be viewed as single node decision trees. Such structures have been called decision stumps in the machine learning literature.
The key advantage of AdaBoost as a feature selection mechanism, over competitors such as the wrapper method, is the speed of learning. Using AdaBoost a 200 feature classifier can be learned in O(MNK) or about 1011 operations. One key advantage is that in each round the entire dependence on previously selected features is efficiently and compactly encoded using the example weights. These weights can then be used to evaluate a given weak classifier in constant time. The weak classifier selection algorithm proceeds as follows. For each feature, the examples are sorted based on feature value. The AdaBoost optimal threshold for that feature can then be computed in a single pass over this sorted list. For each element in the sorted list, four sums are maintained and evaluated: the total sum of positive example weights T +, the total sum of negative example weights T -, the sum of positive weights below the current example S+ and the sum of negative weights below the current example S-. The error for a threshold which splits the range between the current and previous example in the sorted list is:
or the minimum of the error of labeling all examples below the current example negative and labeling the examples above positive versus the error of the converse. These sums are easily updated as the search proceeds.The final application demanded a very aggressive process which would discard the vast majority of features.
For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Fig. 3).
Fig.3 The first and second features selected by AdaBoost
The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face.
-
THE ATTENTIONAL CASCADE
This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false
positive rates. Stages in the cascade are constructed by training classifiers using AdaBoost. Starting with a two-feature strong classifier, an effective face filter can be obtained by adjusting the strong classifier threshold to minimize false negatives. The initial AdaBoost threshold, is designed to yield a low error rate on the training data. A lower threshold yields higher detection rates and higher false positive rates. Based on performance measured using a validation training set, the two-feature classifier can be adjusted to detect 100% of the faces with a false positive rate of 50%. The overall form of the detection process is that of a degenerate decision tree, what we call a “cascade” (see Fig. 4).
Fig.4 Schematic depiction of a the detection cascade.
A series of classifiers are applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows have been reduced radically. Further processing can take any form such as additional stages of the cascade (as in detection system) or an alternative detection system.
A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window. The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event. Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are “harder” than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operatingcharacteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.
The process by which each element of the cascade is trained requires some care. The AdaBoost learning procedure attempts only to minimize errors, and is not specifically designed to achieve high detection rates at the expense of large false positive rates. One simple, and very conventional, scheme for trading off these errors is to adjust the threshold of the perceptron produced by AdaBoost. Higher thresholds yield classifiers with fewer false positives and a lower detection rate. Lower thresholds yield classifiers with more false positives and a higher detection rate. It is not clear, at this point, whether adjusting the threshold in this way preserves the training and generalization guarantees provided by AdaBoost. The overall training process involves two types of tradeoffs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which
-
the number of classifier stages,
-
the number of features, ni, of each stage,
-
the threshold of each stage are traded off in order to minimize the expected number of features N given a target for F and D. Unfortunately finding this optimum is a tremendously difficult problem.
In practice a very simple framework is used to produce an effective classifier which is highly efficient. The user selects the maximum acceptable rate for fi and the minimum acceptable rate for di . Each layer of the cascade is trained by AdaBoost (as described in Table 3.1) with the number of features used being increased until the target detection and false positive rates are met for this level. The rates are determined by testing the current detector on a validation set. If the overall target false positive rate is not yet met then another layer is added to the cascade. The negative set for training subsequent layers is obtained by collecting all false detections found by running the current detector on a set of images which do not contain any instances of faces.
This algorithm is given more precisely below.
Algorithm:-
• User selects values for f , the maximum acceptable false positive rate per layer and d, the minimum acceptable detection rate per layer.
• User selects target overall false positive rate, Ftarget .
• P = set of positive examples
• N = set of negative examples
• F0 = 1.0; D0 = 1.0
• i = 0
• while Fi > Ftarget
– i? i + 1
– ni = 0; Fi = Fi-1
– while Fi > f × Fi-1
* ni ?. ni + 1
* Use P and N to train a classifier with ni features using AdaBoost
* Evaluate current cascaded classifier on validation set to determine Fi and Di .
* Decrease threshold for the ith classifier until the current cascaded classifier has a detection rate of at least d × Di-1 (this also affects Fi )
-
N . ? Ø
-
If Fi > Ftarget then evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N
There is a hidden benefit of training a detector as a sequence of classifiers which is that the effective number of negative examples that the final detector sees can be very large. One can imagine training a single large classifier with many features and then trying to speed up its running time by looking at partial sums of features and stopping the computation early if a partial sum is below the appropriate threshold. One drawback of such an approach is that the training set of negative examples would have to be relatively small (on the order of 10,000 to maybe 100,000 examples) to make training feasible. With the cascaded detector, the final layers of the cascade may effectively look through hundreds of millions of negative examples in order to find a set of 10,000 negative examples that the earlier layers of the cascade fail on. So the negative training set is much larger and more focused on the hard examples for a cascaded detector.
5.RESULTS
The training set is given of with frontal faces & are only roughly aligned. This was done by having a person place a bounding box around each face just above the eyebrows and about half-way between the mouth and the chin. This bounding box was then enlarged by 50% and then cropped and scaled to 20 by 20 pixels.
By observing the performance of this face detector on a number of test images we have noticed a few different failure modes. The face detector was trained on frontal, upright faces. The faces were only very roughly aligned so there is some variation in rotation both in plane and out of plane. Informal observation suggests that the face detector can detect faces that are tilted up to about ±15 degrees in plane and about ±45 degrees out of plane (toward a profile view). The detector becomes unreliable with more rotation than this. Also noticed that harsh backlighting in which the faces are very dark while the background is relatively light sometimes causes failures. It is interesting to note that using a nonlinear variance normalization based on robust statistics to remove outliers
improves the detection rate in this situation. The problem with such a normalization is the greatly increased computational cost within in integral image framework. Finally, this face detector fails on significantly occluded faces. If the eyes are occluded for example, the detector will usually fail. The mouth is not as important and so a face with a covered mouth will usually still be detected.
The result are tested without cascading the classifier as
Fig 5. Result of locating single Face within picture
Fig 6 Result of Locating Multiple Faces wirhin picture
Fig 7. Locating Non-Frontal Faces
Fig 8 Result of Locating Faces in Low Light Conditions
7.CONCLUSIONS
Here a presented an approach for face detection which minimizes computation time while achieving high detection accuracy. The approach was used to construct a face detection system which is approximately 15 times faster than any previous approach. Preliminary experiments, which will be described elsewhere, show that highly efficient detectors for other objects, such as pedestrians or automobiles, can also be constructed in this way. Thus it shows the effectiveness algorithm & improvements.
References
Books:-
1. Milan Sonka, Vaclav Hlavac, Roger Boyle
“Image Processing, analysis and Machine Vision”, Tata McGraw-Hill ISBN 981-240-061-3
Web – References:-
1. Rowley, Baluja, and Kanade: “Neural Network-Based Face Detection” IEEE Patt. Anal. Mach. Intell., 20:22–38.
2. Viola & Jones : “Robust Real-Time Face Detection” International Journal of Computer Vision 57(2), 137-154,2004.
Attached Files:



Loading ...
