Video Anomaly Detection

Tackling the problem of automatically detecting unusual activity in video sequences

MERL Researcher: Michael Jones (Computer Vision).
Joint work with: Bharathkumar Ramachandra (North Carolina State University).


This research tackles the problem of automatically detecting unusual activity in video sequences. To solve the problem, an algorithm is first given video sequences from a fixed camera showing normal activity. A model representing normal activity is created and used to evaluate new video sequences from the same fixed camera. Any parts of the testing video that do not match the model formed from normal video are considered anomalous.

Figure 1: Example frame from the Street Scene dataset and an example anomaly detection (red tinted pixels) found by our algorithm (a jaywalker). The blue square represents the ground truth labeled anomaly.

We describe two variations of a novel algorithm for video anomaly detection which we evaluate along with two previously published algorithms on the Street Scene dataset (described later).

The new algorithm is straightforward. It is based on dividing the video into spatio-temporal regions which we call video patches, storing a set of exemplars to represent the variety of video patches occurring in each region, and then using the distance from a testing video patch to the nearest-neighbor exemplar as the anomaly score.

Algorithm Details

First, each video is divided into a grid of spatio-temporal regions of size H x W x T pixels with spatial step size s and temporal step size 1 frame. In the experiments we choose H=40 pixels, W=40 pixels, T=4 or 7 frames, and s = 20 pixels. See Figure 2 for an illustration.
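As an illustration, the sliding-window extraction described above can be sketched as follows. This is a minimal NumPy version for grayscale video; the array layout and the region-index bookkeeping are our own assumptions, not the paper's code.

```python
import numpy as np

def extract_video_patches(video, H=40, W=40, T=4, s=20):
    """Slide an H x W x T window over a video (array of shape
    (n_frames, height, width)) with spatial step s and temporal
    step 1 frame, collecting video patches per spatial region."""
    n_frames, height, width = video.shape
    patches = {}  # (row, col) region index -> list of (T, H, W) patches
    for t in range(n_frames - T + 1):
        for i, y in enumerate(range(0, height - H + 1, s)):
            for j, x in enumerate(range(0, width - W + 1, s)):
                patch = video[t:t + T, y:y + H, x:x + W]
                patches.setdefault((i, j), []).append(patch)
    return patches

# Toy example: a 5-frame, 60 x 80 video gives 2 x 3 overlapping regions,
# each with 2 temporal positions of the 4-frame window.
patches = extract_video_patches(np.zeros((5, 60, 80)), H=40, W=40, T=4, s=20)
```

Note that with s = 20 and H = W = 40 the regions overlap by half their width, matching the overlapping-region setup used in the experiments.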

Figure 2: Illustration of a grid of regions partitioning a video frame and a video patch encompassing 4 frames. This figure shows nonoverlapping regions, but in our experiments we use overlapping regions.

The baseline algorithm has two phases: a training (model-building) phase and a testing (anomaly detection) phase. In the model-building phase, the training (normal) videos are used to find, for each spatial region, a set of video patches (represented by feature vectors, described later) that captures the variety of activity in that region. We call these representative video patches exemplars. In the anomaly detection phase, the testing video is split into the same regions used during training, and for each testing video patch the nearest exemplar from its spatial region is found. The distance to that nearest exemplar serves as the anomaly score.

The only differences between the two variations are the feature vector used to represent each video patch and the distance function used to compare two feature vectors.

The foreground (FG) mask variation uses blurred FG masks for each frame in a video patch as a feature vector. The FG masks are computed using a background (BG) model that is updated as the video is processed (see Figure 3). The BG model used in the experiments is a very simple mean color value per pixel.
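A minimal sketch of this feature, assuming a running per-pixel mean background model with a hypothetical update rate alpha, a hypothetical FG threshold, and a simple box blur standing in for whatever blur the full implementation uses:

```python
import numpy as np

def update_bg_model(bg, frame, alpha=0.05):
    """Running per-pixel mean background model (alpha is an assumed rate)."""
    return (1.0 - alpha) * bg + alpha * frame

def box_blur(img, k=5):
    """Simple k x k box blur (a stand-in for the blur used in the paper)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def fg_mask_feature(bg, frames, thresh=20.0):
    """Blurred FG mask for each frame of a video patch, concatenated and
    vectorized into one feature vector (the threshold is assumed)."""
    masks = [box_blur((np.abs(f - bg) > thresh).astype(float)) for f in frames]
    return np.concatenate([m.ravel() for m in masks])
```

The feature vector length is (number of frames in the patch) x H x W, one blurred mask per frame.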

Figure 3: Example blurred FG masks which are concatenated and vectorized into a feature vector. a) and c) show two video patches consisting of 7 frames cropped around a spatial region. b) and d) show the corresponding blurred FG masks.

The flow-based variation uses optical flow fields computed between consecutive frames in place of FG masks. The flow fields within the region of each video patch frame are concatenated and then vectorized to yield a feature vector twice the length of the feature vector from the FG mask baseline (due to the dx and dy components of the flow field). In our experiments we use the optical flow algorithm of Kroeger et al. (ECCV 2016) to compute flow fields.

In the model-building phase, a distinct set of exemplars is selected to represent normal activity in each spatial region. Our exemplar selection method is straightforward. For a particular spatial region, the exemplar set is initialized to the empty set. We slide a spatio-temporal window (with step size equal to one frame) along the temporal dimension of each training video to give a series of video patches, each represented by either a FG-mask-based or a flow-based feature vector, depending on the algorithm variation described above. Each video patch is compared to the current set of exemplars for that spatial region. If the distance to the nearest exemplar is less than a threshold, the video patch is discarded; otherwise it is added to the set of exemplars.
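The greedy selection loop above can be sketched as follows; the distance function and threshold are passed in, and the threshold value in the example is an arbitrary illustration, not the paper's setting.

```python
import numpy as np

def select_exemplars(feature_vectors, dist_fn, thresh):
    """Greedy exemplar selection for one spatial region: a video patch
    becomes a new exemplar only if its distance to every current
    exemplar is at least thresh; otherwise it is discarded."""
    exemplars = []
    for v in feature_vectors:
        if not exemplars or min(dist_fn(v, e) for e in exemplars) >= thresh:
            exemplars.append(v)
    return exemplars
```

Because near-duplicate patches are discarded, the exemplar set stays small even though every training frame contributes a candidate patch.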

The distance function used to compare two exemplars depends on the feature vector. For blurred FG mask feature vectors, we use L2 distance. For flow-field feature vectors we use normalized L1 distance.

Given a model of normal video which consists of a different set of exemplars for each spatial region of the video, the anomaly detection is simply a series of nearest neighbor lookups. For each spatial region in a sequence of T frames of a testing video, compute the feature vector representing the video patch and then find the nearest neighbor in that region's exemplar set. The distance to the closest exemplar is the anomaly score for that video patch.
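The two distance functions and the nearest-neighbor scoring can be sketched as below. The L2 distance for FG-mask features is as stated; for the flow features, the exact form of the normalization is not spelled out here, so dividing the L1 distance by the vector length is our assumption.

```python
import numpy as np

def l2_dist(a, b):
    """L2 distance, used for blurred FG mask feature vectors."""
    return float(np.linalg.norm(a - b))

def norm_l1_dist(a, b):
    """Normalized L1 distance for flow-field feature vectors; dividing
    by the vector length is an assumed normalization."""
    return float(np.abs(a - b).sum() / a.size)

def anomaly_score(feature, exemplars, dist_fn):
    """Score a testing video patch by the distance to the nearest
    exemplar of its spatial region."""
    return min(dist_fn(feature, e) for e in exemplars)
```

With small exemplar sets per region, this lookup is a plain linear scan; larger sets could use an approximate nearest-neighbor index instead.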

This yields an anomaly score per overlapping video patch. These are used to create a per-pixel anomaly score matrix for each frame.
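The paper gives the details of this step; one plausible sketch, assuming each pixel takes the maximum score over the overlapping patches that cover it (the max aggregation rule is our assumption):

```python
import numpy as np

def pixel_score_map(patch_scores, height, width, H=40, W=40):
    """Turn per-patch anomaly scores into a per-pixel score map.
    patch_scores maps the (y, x) top-left corner of each H x W
    spatial region to that region's anomaly score; each pixel keeps
    the maximum score among the patches covering it."""
    score_map = np.zeros((height, width))
    for (y, x), score in patch_scores.items():
        region = score_map[y:y + H, x:x + W]
        np.maximum(region, score, out=region)
    return score_map
```

Because regions overlap (step 20 with 40 x 40 regions), most pixels are covered by several patches, and the aggregation determines how their scores combine.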

More details of our algorithm can be found in our paper cited below.

New Dataset: Street Scene

In order to evaluate our video anomaly detection algorithm, we have created a new dataset containing video of a street scene in Cambridge, MA.

The Street Scene dataset consists of 46 training video sequences and 35 testing video sequences taken from a static USB camera looking down on a scene of a two-lane street with bike lanes and pedestrian sidewalks. See Figure 1 for a typical frame from the dataset. Videos were collected from the camera at various times during two consecutive summers. All of the videos were taken during the daytime. The dataset is challenging because of the variety of activity taking place such as cars driving, turning, stopping and parking; pedestrians walking, jogging and pushing strollers; and bikers riding in bike lanes. In addition, the videos contain changing shadows, and moving background such as a flag and trees blowing in the wind.

There are a total of 203,257 color video frames (56,847 for training and 146,410 for testing) each of size 1280 x 720 pixels. The frames were extracted from the original videos at 15 frames per second.

The 35 testing sequences have a total of 205 anomalous events consisting of 17 different anomaly types. A complete list of anomaly types and the number of each in the test set can be found in our paper.

Ground truth annotations are provided for each testing video in the form of bounding boxes around each anomalous event in each frame. Each bounding box is also labeled with a track number, meaning each anomalous event is labeled as a track of bounding boxes. Track lengths vary from tens of frames to 5,200 frames, the length of the longest testing sequence. A single frame can have more than one labeled anomaly.

Figure 4: ROC curves for track-based criterion for different methods.

Figures 4 and 5 show ROC curves for our baseline methods as well as a dictionary-based method (C. Lu, J. Shi and J. Jia, "Abnormal event detection at 150 fps in Matlab", ICCV 2013) and an auto-encoder method (M. Hasan, J. Choi, J. Neumann, A. Roy-Chowdhury, and L. Davis, "Learning temporal regularity in video sequences", CVPR 2016) on Street Scene using the newly proposed track-based and region-based criteria. The numbers in parentheses for each method in the figure legends are the areas under the curve for false positive rates from 0 to 1. Clearly, the dictionary and auto-encoder methods perform poorly on Street Scene. Our baseline methods do much better although there is still much room for improvement.

Figure 5: ROC curves for region-based criterion for different methods.


This dataset can be freely downloaded for research purposes from:

Region and Track-based Ground Truth files for other datasets:

We have created ground truth annotation files for the UCSD Ped1 and Ped2 datasets as well as the CUHK Avenue dataset that can be used with the region-based and track-based evaluation criteria proposed in the Street Scene paper. Links to the zip file containing ground truth files for all three datasets can be found here:

If you use or refer to this dataset, please cite the referenced paper.


MERL Publications

  •  Ramachandra, B., Jones, M.J., "Street Scene: A new dataset and evaluation protocol for video anomaly detection", IEEE Winter Conference on Applications of Computer Vision (WACV), DOI: 10.1109/WACV45572.2020.9093457, February 2020, pp. 2569-2578.
    BibTeX TR2020-017 PDF Data
    @inproceedings{Jones2020feb2,
      author = {Ramachandra, Bharathkumar and Jones, Michael J.},
      title = {Street Scene: A new dataset and evaluation protocol for video anomaly detection},
      booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
      year = {2020},
      pages = {2569--2578},
      month = feb,
      doi = {10.1109/WACV45572.2020.9093457}
    }