The xView2 AI Challenge


The xView2 Data Challenge was hosted by the Defense Innovation Unit (DIU) with the goal of identifying buildings and rating the damage they sustained in a natural disaster, using satellite images taken before and after the disaster. Currently, after a natural disaster, satellite and aerial images are annotated for building damage manually, a time-intensive and laborious process that takes weeks. Computer vision and machine learning algorithms could cut this process from weeks to hours, greatly expediting recovery responses.

The Dataset

The data used for the challenge was the xBD dataset, which contains 850,736 annotated buildings and spans 45,362 $\text{km}^{2}$ of satellite imagery. For training models, there are 9,168 pre-disaster/post-disaster $1024 \times 1024$ high-resolution color images.

The images capture 19 natural disasters of 5 different types from all over the world.

In the post-disaster images, each building is assigned a class label based on how badly it was damaged.

One of the reasons this challenge was difficult was the high variability in the data. The dataset as a whole is heavily skewed towards the “no damage” class.

The number of images that contain at least one damaged building also varies greatly from one natural disaster to another.

Another area of variance in the dataset is the number of buildings affected by each disaster.

Further adding to the difficulty of the task, the visual indicators of damage, say between minor damage and major damage, can be quite subtle, making it difficult for models to distinguish between the two.

The Challenge

For each pre/post-disaster image pair, we had to produce two PNG files. The first needed to contain the localization predictions, i.e. where the buildings are in the image: a 1 in each pixel where there is a building in the corresponding pre-disaster image and a 0 where there is none. The second needed to contain the damage classification predictions, where each pixel has an integer value between 0 and 4 reflecting the predicted damage level for the corresponding pixel in the post-disaster image.
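As a rough illustration of these two outputs, the sketch below turns hypothetical model outputs into the two single-channel arrays described above. The function name, the 0.5 threshold, and the tiny input arrays are assumptions made for illustration only, and the resulting arrays would still need to be written to PNG files with an image library.

```python
import numpy as np

def make_prediction_masks(building_prob, damage_level, threshold=0.5):
    """Build the localization and damage arrays for one image pair.

    building_prob: per-pixel building probability (hypothetical model output).
    damage_level:  per-pixel damage level in 1..4 (hypothetical model output).
    """
    # 1 where a building is predicted, 0 everywhere else.
    localization = (building_prob > threshold).astype(np.uint8)
    # Damage level on building pixels, 0 ("no building") elsewhere.
    damage = np.where(localization == 1, damage_level, 0).astype(np.uint8)
    return localization, damage

# Tiny 2x2 stand-in for a 1024x1024 image.
building_prob = np.array([[0.9, 0.2], [0.7, 0.1]])
damage_level = np.array([[3, 1], [1, 4]])
localization, damage = make_prediction_masks(building_prob, damage_level)
```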

The evaluation metric for the challenge was a weighted combination of the $F_{1}$ scores for the localization and damage classification predictions.

The $F_{1}$ score balances the model’s precision against its recall. It was a more appropriate evaluation metric than, say, plain accuracy, since a model that predicted “no building” at every pixel would still be around 80% accurate.
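The difference between accuracy and $F_{1}$ can be seen on a toy example. The sketch below uses the standard definitions of precision, recall and $F_{1}$; the ten-pixel "image" and the function names are made up for illustration, with 20% of the pixels labelled as building to mirror the figure quoted above.

```python
def accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive ("building") class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Ten pixels, two of which are buildings.
y_true = [0] * 8 + [1] * 2
# A model that predicts "no building" everywhere.
all_background = [0] * 10

print(accuracy(y_true, all_background))                # high accuracy
print(precision_recall_f1(y_true, all_background)[2])  # but F1 is zero
```

The all-background model never finds a building, so its recall and therefore its $F_{1}$ score are zero even though most of its pixel labels are correct.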

Our Approach

For this challenge we used the 2.0 release of the TensorFlow machine learning framework, which had only just been released when we began our work. The major differences we noticed in TensorFlow 2.0 versus previous versions are:

  • Eager execution is enabled by default. This provides for a more intuitive workflow that is easier to debug and reason about.
  • Keras is the preferred method for defining models. Keras is a high-level, declarative DSL for defining models that allows for easy and fast prototyping.
  • tf.data is the recommended path for building data pipelines. Using the tf.data API lets TensorFlow optimize data pipelines to improve training performance.

Our main algorithm of choice during the competition was the U-Net [ronneberger15_u] convolutional neural network (CNN) architecture, a popular choice for semantic image segmentation.

The U-Net architecture builds on the fully convolutional network: a series of contracting layers that capture context is followed by a symmetric group of expanding layers that allow the model to learn accurate segmentation boundaries. The contracting layers are called the down-sampler or encoder, and the expanding layers are called the up-sampler or decoder.

Our initial baseline approach was to handle both building localization and damage classification in a single U-Net model. Not too surprisingly, this approach did not perform well, because the model had to express both the coarse-grained task of separating buildings from the background and the finer-grained task of rating building damage. We therefore split the problem into two subproblems: localization and damage classification.


Localization

For the localization problem, our goal was to classify each pixel in the pre-disaster image as either “building” or “no building”.

Data Pipeline

First we used the tf.data and tf.image libraries to create a data pipeline that loads each pair of pre/post-disaster images as a pair of tensors, each of dimension (1024, 1024, 3), containing the RGB values for each pixel. Next, we concatenated the two tensors into one tensor of dimension (1024, 1024, 6). The idea is that even though the post-disaster image can be dramatically different from the pre-disaster image, it still adds information for deciding what is a building and what is not. We then applied several data augmentation techniques at random, such as rotations and reflections, which let us expand the number of training examples available in each epoch.
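The concatenation and augmentation steps can be sketched as follows. This is a NumPy stand-in rather than the tf.data/tf.image pipeline itself; the function names, the restriction to 90-degree rotations, and the 50% flip probability are assumptions made for illustration. The one point the sketch is meant to show is that the image and its label mask must receive the same random transform.

```python
import numpy as np

def concat_pair(pre, post):
    """Stack pre- and post-disaster RGB images along the channel axis."""
    return np.concatenate([pre, post], axis=-1)

def random_augment(image, mask, rng):
    """Apply one random rotation and random flips to image and mask alike."""
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:       # left-right reflection
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:       # up-down reflection
        image, mask = image[::-1], mask[::-1]
    return image, mask

rng = np.random.default_rng(0)
# Small 8x8 stand-ins for the 1024x1024 images.
pre = np.zeros((8, 8, 3), dtype=np.uint8)
post = np.ones((8, 8, 3), dtype=np.uint8)
mask = np.zeros((8, 8), dtype=np.uint8)

pair = concat_pair(pre, post)            # shape (8, 8, 6)
aug_pair, aug_mask = random_augment(pair, mask, rng)
```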

Model Description

The localization model was a single U-Net model set up to do binary semantic image segmentation. The best-performing model had 9 down-sampling layers and 9 up-sampling layers. It was trained using cross entropy as the loss function, and training took approximately 5 days to complete.
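For a binary target such as building versus no building, cross entropy is usually written in the per-pixel form below. This is the standard textbook definition, given here as a reminder. The symbols $N$ (number of pixels), $y_i \in \{0, 1\}$ (ground truth label of pixel $i$) and $p_i$ (predicted building probability for pixel $i$) are notation introduced for this note, and nothing above states whether any class weighting was applied.

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$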

Damage Classification

Data Pipeline

Once again we used the tf.data and tf.image libraries to create tensors of dimension (1024, 1024, 6) representing the concatenated pre/post-disaster images. We then used the output of the localization model to mask each tensor so that all non-building pixels had a value of zero. Next, the tensors were randomly cropped to dimension (256, 256, 6), and only crops in which at least 20% of the pixels were non-zero were used in training.
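A minimal NumPy sketch of the masking and crop filtering steps is shown below. It is not the tf.data implementation; the function names, the retry limit, and the choice to count a pixel as non-zero when any of its six channels is non-zero are assumptions made for illustration.

```python
import numpy as np

def mask_non_buildings(image, building_mask):
    """Zero out every pixel that the localization mask marks as background."""
    return image * building_mask[..., np.newaxis]

def nonzero_fraction(crop):
    """Fraction of pixels with at least one non-zero channel (assumed rule)."""
    return float(np.any(crop != 0, axis=-1).mean())

def sample_crop(image, size, min_fraction, rng, max_tries=100):
    """Draw random square crops until one has enough non-zero pixels."""
    height, width = image.shape[:2]
    for _ in range(max_tries):
        top = int(rng.integers(0, height - size + 1))
        left = int(rng.integers(0, width - size + 1))
        crop = image[top:top + size, left:left + size]
        if nonzero_fraction(crop) >= min_fraction:
            return crop
    return None  # no acceptable crop found within max_tries

rng = np.random.default_rng(0)
# 16x16 stand-in for a 1024x1024x6 tensor; buildings fill the left half.
image = np.ones((16, 16, 6), dtype=np.float32)
building_mask = np.zeros((16, 16), dtype=np.float32)
building_mask[:, :8] = 1.0

masked = mask_non_buildings(image, building_mask)
crop = sample_crop(masked, size=4, min_fraction=0.2, rng=rng)
```

In the setting described above, `size` would be 256 and `min_fraction` would be 0.2.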

Model Description

The problem of classifying building damage has a useful property to exploit, namely that the damage level classes are ordered, as their encoded values show:

Damage Class    Encoded Value
No Building     0
No Damage       1
Minor Damage    2
Major Damage    3
Destroyed       4

Therefore a model should be penalized more the farther the predicted class is from the actual class. For example, a prediction of “minor damage” when the ground truth class is “destroyed” should be penalized more than a prediction of “major damage”. This is an example of an ordinal regression problem, and we explored several techniques for solving it with neural networks, including the one found in [niu16_ordin_regres_multip_output_cnn_age_estim] as well as a version that uses soft labels [diaz19_soft_label_ordin_regres].

Our best solution ended up using an ensemble of U-Net models similar to what is described in [cheng08]. We trained 3 U-Net binary classifiers such that the first model predicts the probability that the class label is greater than 1, i.e. $P(\text{class} > 1)$, the second predicts $P(\text{class} > 2)$, and the third predicts $P(\text{class} > 3)$. Instead of encoding the class labels with the usual one-hot encoder, we encoded the target values into vectors according to the following ordinal scheme:

Damage Class    Original Encoding    One-hot Encoding    Ordinal Encoding
No Building     0                    [1, 0, 0, 0, 0]     [0, 0, 0, 0]
No Damage       1                    [0, 1, 0, 0, 0]     [1, 0, 0, 0]
Minor Damage    2                    [0, 0, 1, 0, 0]     [1, 1, 0, 0]
Major Damage    3                    [0, 0, 0, 1, 0]     [1, 1, 1, 0]
Destroyed       4                    [0, 0, 0, 0, 1]     [1, 1, 1, 1]
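The two encodings in the table can be reproduced with a few lines of Python. The function names are hypothetical and this is only a sketch of the scheme, not the training code: position $k$ of the ordinal vector is 1 exactly when the class label is greater than $k$.

```python
def one_hot_encode(label, num_classes=5):
    """Usual one-hot vector: a single 1 at the position of the label."""
    return [1 if k == label else 0 for k in range(num_classes)]

def ordinal_encode(label, num_classes=5):
    """Ordinal vector: position k is 1 when label > k, for k = 0..num_classes-2."""
    return [1 if label > k else 0 for k in range(num_classes - 1)]

for label in range(5):
    print(label, one_hot_encode(label), ordinal_encode(label))
```

With this encoding, two neighbouring classes differ in a single position, while classes that are far apart differ in several, which is how the ordering of the classes enters the training targets.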

The output layer of the ensemble uses a sigmoid activation function to produce a vector of length 4. To make a prediction from this vector, we scan the values in order and stop when a value falls below a threshold (0.5 in our case) or there are no more values in the vector. The index $i$ of the last value above the threshold, counting from 1, is the predicted damage class; if no value exceeds the threshold, the prediction is class 0.
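The scan described above can be sketched as follows. The function name is hypothetical, and treating a value exactly equal to the threshold as the end of the scan is an assumption, since only values above and below the threshold are discussed.

```python
def ordinal_decode(probabilities, threshold=0.5):
    """Count the leading values above the threshold; the count is the class."""
    predicted_class = 0
    for value in probabilities:
        if value <= threshold:
            break  # stop at the first value that is not above the threshold
        predicted_class += 1
    return predicted_class

print(ordinal_decode([0.9, 0.8, 0.3, 0.7]))  # the 0.7 after the stop is ignored
print(ordinal_decode([0.1, 0.2, 0.1, 0.0]))  # nothing above the threshold
```

Because the scan stops at the first value that fails the threshold, a stray high value later in the vector cannot raise the predicted class.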


Results

As noted previously, the metric used for this competition was a combination of the localization and damage classification $F_{1}$ scores. Our best submission received the following $F_{1}$ scores on the competition validation dataset:

Localization $F_{1}$             0.81
Damage Classification $F_{1}$    0.66
Total $F_{1}$                    0.71
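The total can be checked against the two component scores. The weights used in the sketch below, 0.3 for localization and 0.7 for damage classification, are an assumption based on the published xView2 scoring rules and are not stated in this post; the function name is hypothetical.

```python
def combined_score(localization_f1, damage_f1, w_loc=0.3, w_dmg=0.7):
    """Weighted sum of the two F1 scores (weights are an assumption)."""
    return w_loc * localization_f1 + w_dmg * damage_f1

total = combined_score(0.81, 0.66)
print(round(total, 3))
```

The result is roughly 0.705, which is in line with the reported total of 0.71 given that the two component scores shown here are themselves rounded.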

When looking at our internal testing set, the majority of our localization errors came from images with areas of high building density. On the damage classification problem, our U-Net ensemble model had the most difficulty distinguishing minor damage from major damage.

Bibliography

[gupta19_xbd] Gupta, Hosfelt, Sajeev, Patel, Goodman, Doshi, Heim, Choset, & Gaston, xBD: A Dataset for Assessing Building Damage from Satellite Imagery, arXiv (2019).

[ronneberger15_u] Ronneberger, Fischer, & Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, pp. 234-241, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (2015).

[niu16_ordin_regres_multip_output_cnn_age_estim] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, & Gang Hua, Ordinal Regression with Multiple Output CNN for Age Estimation, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).

[diaz19_soft_label_ordin_regres] Diaz & Marathe, Soft Labels for Ordinal Regression, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[cheng08] Jianlin Cheng, Zheng Wang, & Gianluca Pollastri, A Neural Network Approach to Ordinal Regression, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (2008).

Colin Alstad
Data Scientist, Space Tech and Innovations

I am a data scientist and mathematician currently working on innovation projects related to the commercial space industry.