http://ilab.usc.edu/siagian/Research/Gist/Gist.html
Gist/Context of a Scene
We describe and validate a simple context-based scene recognition algorithm using a multiscale set of early-visual features, which capture the “gist” of the scene into a low-dimensional signature vector. Distinct from previous approaches, the algorithm presents the advantage of being biologically plausible and of having low computational complexity, sharing its low-level features with a model for visual attention that may operate concurrently on a vision system.
We compare classification accuracy using scenes filmed at three outdoor sites on campus (13,965 to 34,711 frames per site). Dividing each site into nine segments, we obtain segment classification rates between 84.21% and 88.62%. Combining scenes from all sites (75,073 frames in total) yields 86.45% correct classification, demonstrating generalization and scalability of the approach.
Index Terms: Gist of a scene, saliency, scene recognition, computational neuroscience, image classification, image statistics, robot vision, robot localization.
Papers
Source Codes and Dataset
The code is integrated into the iLab Neuromorphic Vision C++ Toolkit. In order to gain code access, please follow the download instructions there. Special instructions for accessing the gist code can be found here.
The dataset can be found here.
Introduction
A significant number of mobile-robotics approaches address the fundamental problem of localization by utilizing sonar, laser, or other range sensors [Fox1999, Thrun1998a]. These are particularly effective indoors due to the many spatial and structural regularities such as flat walls and narrow corridors. Outdoors, however, these sensors become less robust given all the protrusions and surface irregularities [Lingemann2004]. For example, a slight change in pose can result in large jumps in range readings because of tree trunks, moving branches, and leaves.

These difficulties with traditional robot sensors have prompted research towards vision. Within computer vision, lighting (especially outdoors), dynamic backgrounds, and view-invariant matching become major hurdles to overcome.
Object-based approaches [Abe1999, Thrun1998b] recognize physical locations by identifying sets of pre-determined landmark objects (and their configuration) known to be present at a location. This typically involves intermediate steps such as segmentation, feature grouping, and object recognition. Such a layered approach is prone to carrying over and amplifying low-level errors along the processing stream.
It should also be pointed out that this approach may be environment-specific, in that the objects are hand-picked because selecting reliable landmarks automatically remains an open problem.
Region-based approaches [Katsura2003, Matsumoto2000, Murrieta-Cid2002] use segmented image regions and their relationships to form a signature of a location. This requires robust segmentation of individual regions, which is hard in unconstrained environments such as a park where vegetation dominates.
Context-based approaches ([Renniger and Malik 2004], [Ulrich and Nourbakhsh 2000], [Oliva and Torralba 2001], [Torralba 2003]), on the other hand, bypass the above traditional processing steps: they consider the input image as a whole and extract a low-dimensional signature that summarizes the image's statistics and/or semantics. One motivation for such an approach is robustness, because random noise, which may catastrophically influence local processing, tends to average out globally.
Despite recent advances in computer vision and robotics, humans still perform orders of magnitude better at outdoor localization and navigation than the best available systems. It is thus inspiring to examine the low-level mechanisms, as well as the system-level computational architecture, according to which human vision is organized (figure 1).
Figure 1. Biological Vision Model
In parallel with attention guidance and mechanisms for saliency computation, humans demonstrate a striking ability to capture the "gist" of a scene; for example, following presentation of a photograph for just a fraction of a second, an observer may report that it is an indoor kitchen scene with numerous colorful objects on the countertop [Potter1975, Biederman82, Tversky1983, Oliva1997]. Such a report from a first glance (brief exposures of 100 ms or below) at an image is remarkable considering that it summarizes the quintessential characteristics of the image, a process previously expected to require much analysis: general semantic attributes (e.g., indoors, outdoors, office, kitchen), recognition of places with a restricted spatial layout [Epstein_Kanwisher00], and a coarse evaluation of distributions of visual features (e.g., highly colorful, grayscale, several large masses, many small objects) [Sanocki_Epstein97, Rensink00].
The idea that saliency and gist computations run in parallel is further strengthened by psychophysics experiments showing that humans can answer specific questions about a scene even when their attention is simultaneously engaged by another concurrent visual discrimination task [Li_etal02]. From the point of view of desired results, gist and saliency appear to be complementary opposites: finding salient locations requires finding those image regions which stand out by significantly differing from their neighbors, while computing gist involves accumulating image statistics over the entire scene. Yet, despite these differences, there is only one visual cortex in the primate brain, which must serve both saliency and gist computations. Part of our contribution is to make the connection between these two crucial components of biological mid-level vision.

To this end, we here explicitly explore whether it is possible to devise a working system where the low-level feature extraction mechanisms - coarsely corresponding to cortical visual areas V1 through V4 and MT - are shared, as opposed to computed separately by two different machine vision modules. The divergence comes at a later stage, in how the low-level vision features are further processed before being utilized. In our neural simulation of posterior parietal cortex along the dorsal or "where" stream of visual processing [Ungerleider_Mishkin82], a saliency map is built through spatial competition of low-level feature responses throughout the visual field. This competition quiets down locations which may initially yield strong local feature responses but resemble their neighbors, while amplifying locations which have distinctive appearances. In contrast, in our neural simulation of inferior temporal cortex along the ventral or "what" stream of visual processing, responses from the low-level feature detectors are combined to produce the gist vector as a holistic low-dimensional signature of the entire input image. The two models, when run in parallel, can help each other and provide a more complete description of the scene in question.
While exploitation of the saliency map has been extensively described previously for a number of vision tasks [Itti_etal98pami, Itti_Koch00vr, Itti_Koch01nrn, Itti04tip], here we describe how our algorithm computes gist in an inexpensive manner by using the same low-level visual front-end as the saliency model. In what follows, we use the term gist in a more specific sense than its broad psychological definition (what observers can gather from a scene over a single glance), formalizing it as a relatively low-dimensional scene representation which is acquired over very short time frames and used to classify scenes as belonging to a given category. We extensively test the gist model in three challenging outdoor environments across multiple days and times of day, where the dominant shadows, vegetation, and other ephemeral phenomena are expected to defeat landmark-based and region-based approaches. Our success in achieving reliable performance in each environment is further generalized by showing that performance does not degrade when combining all three environments. These results support our hypothesis that gist can reliably be extracted at very low computational cost, using very simple visual features shared with an attention system in an overall biologically plausible framework.
Design and Implementation
The core of our present research focuses on the process of extracting the gist of an image using features from several domains, calculating its holistic characteristics but still taking into account coarse spatial information. The starting point for the proposed new model is the existing saliency model of Itti et al. [Itti_etal98pami], freely available on the World-Wide-Web. Please see the iLab Neuromorphic Vision C++ Toolkit for all the source code.
Visual Feature Extraction
In the saliency model, an input image is filtered in a number of low-level visual feature channels - color, intensity, orientation, flicker, and motion - at multiple spatial scales. Some channels, like color, orientation, or motion, have several sub-channels, one for each color type, orientation, or direction of motion. Each sub-channel has a nine-scale pyramidal representation of filter outputs. Within each sub-channel, the model performs center-surround operations between filter outputs at different scales to produce feature maps. The different feature maps for each type allow the system to pick up regions at several scales, with the added benefit of lighting invariance. The intensity channel output for the illustration image in the figure below shows different-sized regions being emphasized according to their respective center-surround parameters.
Figure 2. Gist Model
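To make the center-surround operation concrete, below is a minimal Python/NumPy sketch (not the toolkit's C++ implementation) of across-scale differences over a dyadic intensity pyramid; the nine-level pyramid follows the description above, while the particular center/surround level pairs shown are only illustrative of the six combinations used per sub-channel.

    import numpy as np
    from scipy.ndimage import gaussian_filter, zoom

    def gaussian_pyramid(image, levels=9):
        """Build a dyadic Gaussian pyramid (level 0 is the input image)."""
        pyramid = [image.astype(np.float64)]
        for _ in range(1, levels):
            blurred = gaussian_filter(pyramid[-1], sigma=1.0)
            pyramid.append(blurred[::2, ::2])        # downsample by a factor of 2
        return pyramid

    def center_surround(pyramid, center, surround):
        """|center level - surround level|, with the surround upsampled to the center's size."""
        c = pyramid[center]
        s = pyramid[surround]
        s_up = zoom(s, (c.shape[0] / s.shape[0], c.shape[1] / s.shape[1]), order=1)
        return np.abs(c - s_up)

    # Illustrative six center-surround combinations for the intensity sub-channel
    # (centers at pyramid levels 2-4, surrounds three or four levels coarser).
    frame = np.random.rand(480, 640)                 # stand-in greyscale frame
    pyr = gaussian_pyramid(frame)
    intensity_maps = [center_surround(pyr, c, c + d) for c in (2, 3, 4) for d in (3, 4)]
    print(len(intensity_maps))                       # -> 6 feature maps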
We incorporate information from the orientation channel by applying Gabor filters to the greyscale input image at four different angles and four spatial scales, for a subtotal of sixteen sub-channels. We do not perform center-surround on the Gabor filter outputs because these filters are already differential by nature. The color and intensity channels combine to compose three pairs of color opponents derived from Ewald Hering's color opponency theories [Turner1994]: the color channel's red-green and blue-yellow opponency pairs, along with the intensity channel's dark-bright opponency. Each of the opponent pairs is used to construct six center-surround scale combinations. These eighteen sub-channels, along with the sixteen Gabor sub-channels, add up to a total of thirty-four sub-channels. Because the present gist model is not specific to any domain, other channels such as stereo could be used as well.
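For the orientation channel, the following sketch applies scikit-image's Gabor filter (an assumed stand-in for the toolkit's own filters) at four angles and four scales, and then tallies the sub-channels described above; the specific filter frequencies are illustrative assumptions.

    import numpy as np
    from skimage.filters import gabor

    frame = np.random.rand(480, 640)                           # stand-in greyscale frame

    orientations = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]  # four angles
    frequencies = [0.25, 0.125, 0.0625, 0.03125]               # four illustrative scales

    gabor_maps = []
    for theta in orientations:
        for freq in frequencies:
            real, _imag = gabor(frame, frequency=freq, theta=theta)
            gabor_maps.append(np.abs(real))                    # 4 x 4 = 16 orientation sub-channels

    # 3 opponent pairs (red-green, blue-yellow, dark-bright) x 6 center-surround combinations
    n_color_intensity = 3 * 6
    print(len(gabor_maps) + n_color_intensity)                 # -> 34 sub-channels in total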
Gist Feature Extraction
After the center-surround features are computed, each sub-channel extracts a gist vector from its corresponding feature map. We apply averaging operations (the simplest neurally plausible computation) over a fixed four-by-four grid of sub-regions on the map; see the figure below for a visualization of the process for one sub-channel. This is in contrast with the winner-take-all competition operations used to compute saliency. Hence, saliency and gist emphasize two complementary aspects of the data in the feature maps: saliency focuses on the most salient peaks of activity, while gist estimates overall activation in different image regions.
Figure 3. Gist Extraction
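A minimal sketch of the grid averaging, assuming NumPy feature maps whose dimensions need not be exact multiples of four:

    import numpy as np

    def grid_average(feature_map, grid=4):
        """Average a feature map over a grid x grid set of sub-regions (16 gist values for grid=4)."""
        rows = np.array_split(np.arange(feature_map.shape[0]), grid)
        cols = np.array_split(np.arange(feature_map.shape[1]), grid)
        return np.array([feature_map[np.ix_(r, c)].mean() for r in rows for c in cols])

    # One 16-value gist vector per sub-channel feature map; concatenating
    # all 34 sub-channels yields the 544-dimensional raw gist vector.
    feature_maps = [np.random.rand(120, 160) for _ in range(34)]   # stand-in maps
    raw_gist = np.concatenate([grid_average(m) for m in feature_maps])
    print(raw_gist.shape)                                          # -> (544,)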
PCA/ICA Dimension Reduction
The raw gist feature vector has 544 dimensions: 34 feature maps times 16 regions per map (figure below). We reduce the dimensionality using Principal Component Analysis (PCA) and then Independent Component Analysis (ICA) with FastICA to a more practical number of 80, while still preserving up to 97% of the variance for a set of upwards of 30,000 campus scenes.
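A sketch of this reduction using scikit-learn's PCA and FastICA as stand-ins for the implementation used in the paper; raw_gist_matrix is a hypothetical array holding one 544-dimensional raw gist vector per training frame.

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    raw_gist_matrix = np.random.rand(30000, 544)     # hypothetical training set

    # PCA reduces 544 dimensions to 80 while retaining most of the variance...
    pca = PCA(n_components=80)
    gist_pca = pca.fit_transform(raw_gist_matrix)
    print(pca.explained_variance_ratio_.sum())       # fraction of variance preserved

    # ...then FastICA rotates the 80 retained components toward independent ones.
    ica = FastICA(n_components=80, max_iter=1000)
    gist_features = ica.fit_transform(gist_pca)
    print(gist_features.shape)                       # -> (30000, 80)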
Scene Classification
For scene classification, we use a three-layer neural network (with intermediate layers of 200 and 100 nodes), trained with the back-propagation algorithm. The complete process is illustrated in figure 2.
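As a sketch of the classifier stage, scikit-learn's MLPClassifier can stand in for the back-propagation network (80 gist inputs, hidden layers of 200 and 100 nodes, one output per segment); the training arrays below are hypothetical placeholders.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Hypothetical training data: 80-dimensional gist features and segment labels 0..8.
    X_train = np.random.rand(5000, 80)
    y_train = np.random.randint(0, 9, size=5000)

    # Three-layer network: 80 inputs -> 200 -> 100 -> 9 segment outputs,
    # trained by back-propagation (stochastic gradient descent here).
    clf = MLPClassifier(hidden_layer_sizes=(200, 100), solver="sgd",
                        learning_rate_init=0.01, max_iter=200)
    clf.fit(X_train, y_train)

    X_test = np.random.rand(100, 80)
    predicted_segments = clf.predict(X_test)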
Testing and Results

We test the system using this dataset. The results for each site are shown in Tables 1 to 6, in columnar and confusion-matrix format. Tables 7 and 8 will be explained below. For Tables 1, 3, 5, and 7, the term "False +" (false positive) for segment x means the percentage of incorrect segment-x guesses given that the correct answer is another segment, while "False -" (false negative) is the percentage of incorrect guesses given that the correct answer is segment x.
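For clarity, here is a small sketch of how these per-segment rates can be derived from a confusion matrix (rows = true segment, columns = predicted segment); the matrix values are made up.

    import numpy as np

    # Made-up confusion matrix for three segments (rows: true, columns: predicted).
    conf = np.array([[90.,  5.,  5.],
                     [ 8., 85.,  7.],
                     [ 4.,  6., 90.]])

    total = conf.sum()
    row_sums = conf.sum(axis=1)      # samples whose true label is each segment
    col_sums = conf.sum(axis=0)      # guesses made for each segment
    diag = np.diag(conf)

    # False +: segment-x guesses made when the true segment was another one.
    false_pos = (col_sums - diag) / (total - row_sums)
    # False -: non-x guesses made when the true segment was x.
    false_neg = (row_sums - diag) / row_sums

    print(np.round(false_pos * 100, 2), np.round(false_neg * 100, 2))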
The system is able to classify the ACB segments with an overall 87.96% correctness, while AnF is marginally lower (84.21%). Given the challenges presented by the scenes at the second site (dominated by vegetation), it is quite an accomplishment to lose less than 4 percent in performance with no calibration done in moving from the first environment to the second. An increase in segment length also does not markedly affect the results, as FDF (86.38%), which has the longest segments among the experiments, performs better than AnF. As a performance reference, when we test the system with a set of data taken back-to-back with the training data, the classification rate is about 89 to 91 percent. On the other hand, when the lighting conditions of the testing data are not included in training, the error roughly triples to thirty to forty percent, which suggests that lighting coverage in the training phase is critical.
Ahmanson Center for Biological Science (ACB)
A video of a test run for the Ahmanson Center for Biological Science can be viewed here.
Associate and Founders Park (AnF)
A video of a test run for Associate and Founders Park can be viewed here.
Frederick D. Fagg park (FDF)
A video of a test run for Frederick D. Fagg park can be viewed here.
Combined Sites
As a way to gauge the system's scalability, we combine scenes from all three sites and train the system to classify twenty-seven different segments. We use the same procedure as well as the same training and testing data (175,406 and 75,073 frames, respectively). The only difference is in the neural-network classifier: the output layer now consists of twenty-seven nodes, while the number of input and hidden nodes remains the same. During training we print the confusion matrix periodically to analyze the process and find that the network first converges on inter-site classification before going further and eliminating the intra-site errors. We organize the results into segment-level (Table 7) and site-level (Table 8) statistics. For segment-level classification, the overall success rate is 84.61%, not much worse than in the previous three experiments. Notice also that the success rates for the individual sites change as well. From the site-level confusion matrix (Table 8), we see that the system can reliably pin the scene to the correct site (higher than 94 percent correct). This is encouraging because the classifier can provide output at various levels: when the system is unsure about the actual segment location, it can at least rely on being at the right site.

Model Comparisons
We also compared our model with three other models; the comparisons are reported in the VSS2008 poster.
Discussion
We have shown that the gist features succeed in classifying a large set of images without the help of temporal filtering (one-shot recognition), which would reduce noise significantly [Torralba2003]. In terms of robustness, the features are able to handle translational and angular change. Because they are computed from large image sub-regions, it takes a large translational shift to affect the values. As for angular stability, the natural perturbation of a camera carried along a bumpy road during training appears to aid the demonstrated invariance. In addition, the gist features are also invariant to scale because the majority of the scene (the background) is stationary and the system is trained with all viewing distances. The combined-sites experiment shows that the number of differentiable scenes can be quite high: twenty-seven segments can make up a detailed map of a large area. Lastly, the gist features achieve solid illumination invariance when trained with different lighting conditions.

A drawback of the current system is that it cannot carry out partial background matching for scenes in which large parts are occluded by dynamic foreground objects. As mentioned earlier, the videos are filmed during off-peak hours when few people (or vehicles) are on the road. Nevertheless, such objects can still create problems when moving too close to the camera. In our system, these images could be filtered out using motion cues from the not-yet-incorporated motion channel as a preprocessing step, detecting significant occlusion by thresholding the sum of the motion channel feature maps [Itti04tip]. Furthermore, a wide-angle lens (with software distortion correction) would help to see more of the background scene and, in comparison, decrease the size of the moving foreground objects.
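The motion-based preprocessing filter suggested above could look roughly like the following sketch; the per-frame motion feature maps and the threshold value are hypothetical and would need to be calibrated on real data.

    import numpy as np

    def significant_occlusion(motion_feature_maps, threshold=0.15):
        """Flag a frame as heavily occluded when overall motion energy is too high.

        motion_feature_maps: motion-channel feature maps for one frame, assumed
        resized to a common resolution; threshold is a hand-tuned fraction.
        """
        total = np.zeros_like(motion_feature_maps[0])
        for m in motion_feature_maps:
            total += m / (m.max() + 1e-9)            # normalize each map before summing
        mean_activation = total.mean() / len(motion_feature_maps)
        return mean_activation > threshold

    # Frames flagged by this test would be skipped before gist extraction.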
Conclusion
The current gist model is able to provide high-level context information (a segment within a site) from various large and difficult outdoor environments despite using coarse features. We find that scenes from differing segments contrast in a global manner, and gist automatically exploits these contrasts, thus reducing the need for detailed calibration in which a robot has to rely on the ad-hoc knowledge of the designer for reliable landmarks. And because the raw features can be shared with the saliency model, the system can efficiently increase localization resolution: it can use salient cues to create distinct signatures of individual scenes, finer points of reference within a segment, that may not be differentiable by gist alone. The salient cues can even help guide localization in the areas between segments, which we did not try to classify.

Copyright © 2000 by the University of Southern California, iLab and Prof. Laurent Itti