Predicting Wildfire Size Through Supervised Classification

Andreas Hindman

Oscar Avatare

Sharan Jhangiani

Saurav Kharb

ABSTRACT

Over the past thirty years, the damage caused by fires and the money spent on preventing them have skyrocketed, leading to a need for better prevention and analysis of fires that have already occurred. As seen with the recent fires in California, this need has only become more pressing. Therefore, we examined fires from 2010 in California to assess whether a machine learning model could use weather and geographic data to accurately predict the size of fires in California. Like previous researchers, we found that there was little correlation between the size of a fire and the weather features surrounding it, making this a difficult task. However, we were able to use a random forest model to predict the size class of fires in California in 2010 with an accuracy of 63%.

INTRODUCTION

Recent media coverage seems to echo the nation's growing concern over the impact of wildfires. Cumulatively, the top 10 fires in California have burned millions of acres and damaged thousands of structures.[1] Depending on their location, some of the larger California wildfires have claimed the lives of dozens. In mid-November of 2018, the Camp Fire in northern California killed 85 people.[2,6] The alarming death tolls alone provide enough reason to attempt to predict the potential impact of wildfires. Environmental and economic damages are also a concern for homeowners, residents, insurers, governments, and nearby communities. For these reasons, it is no surprise that efforts to predict the behavior and effects of wildfires have already been made.

A 2016 report by the U.S. Fire Administration shows a general downward trend in fire-related deaths.[3] This could be attributed to advancements in predictive and preventative technology, medical technology and techniques, or safer construction regulations and guidelines. However, previous attempts at predicting the behavior of wildfires have suffered from a lack of funding, which means that the practical applications of the models described in the literature are lacking.[4] Additionally, much of this research (and its associated models) pre-dates the scientific research showing that "wildfire activity increased suddenly and markedly in the mid-1980s, with higher large-wildfire frequency, longer wildfire durations, and longer wildfire seasons." In fact, wildfire activity in Western United States forests is thought to have increased in recent decades due to "increased spring and summer temperatures and an earlier spring snowmelt."[5]

The devastating and unpredictable nature of wildfires puts the lives of thousands of homeowners, firefighters, and responders at risk.

In tandem, an article by FiveThirtyEight in the summer of 2018 reported, "Wildfires In The U.S. Are Getting Bigger." Accompanying the article was a series of graphics, visible below.

These graphics tell a scary story: the size of a fire is becoming more and more important as each fire is more dangerous than the one before.

The purpose of this project is to better predict the size of wildfires in California. Due to the changing climate, the problems caused by wildfires in the western United States, and California specifically, are more pertinent than ever.[5] Accurate prediction models will allow communities to evacuate individuals and families at the appropriate time and place. As visible from the graphics above, it is imperative to solve this issue. This project limits the scope of the models to California wildfires, but the features used by the model are not specific to the region. These models and techniques could be applied to any geographical location that experiences wildfires, assuming similar data is attainable. Models or technologies that provide accurate information regarding where and when to deploy containment efforts have the potential to save hundreds, if not thousands, of lives.

RELATED WORK

Related work to this project includes a University of California Berkeley study (https://forests.berkeley.edu/sites/forests.berkeley.edu/files/Starrs_2018_Environ._Res._Lett._13_034025.pdf) that examined vegetation types in California and used Poisson regression to model annual fire probability between 1950 and 2015. This paper found that climate variables such as average maximum temperature, average annual precipitation, and annual topsoil moisture had relatively little impact on the probability of a fire occurring. However, the authors showed that wildfire probability increased continually across all ownerships, firefighting agencies, reserve statuses, and vegetation types, and that these factors were more relevant in predicting future fires than climate variables, although given the pace of climate change this is expected to be different in future studies.

Another study from the University of Wisconsin-Madison (https://static1.squarespace.com/static/545a90ede4b026480c02c5c7/t/5bbb8d22f9619ae1c47579af/1539018019587/Kramer_2018_WUI_IJWF.pdf) examined the cost of and damage caused by fires in the United States. Over the past 30 years, this has dramatically increased due to the increasing areas of wildland-urban interface, where buildings and wildland vegetation meet. The authors found that most threatened and destroyed buildings in the US were within the wildland-urban interface, but this varied considerably among states. They concluded that, to keep this issue from growing in the future, pre-emptive outreach could improve the likelihood of building survival and reduce the human and financial costs of structure loss.

These studies relate to our work because we were examining fire size and the climate factors that influence it. They show which factors are the most predictive when examining fires, which, to both our surprise and the surprise of the researchers in the Berkeley study, were not climate related. The Wisconsin-Madison study is important because the growing wildland-urban interface increases the damage that fires can cause and provides additional, artificially created area in which a fire can burn.

METHODS

As outlined before, the problem we selected to address was the rapid increase in the size of extreme wildfires. We decided to approach the problem by leveraging past data to try to predict the size of a wildfire.

In order to address the problem at hand, we broke the process down into 5 major steps: Data Selection, Data Cleaning and Preparation, Feature Selection, Insights from Exploration, and Modeling.

Data Selection:

The first step in our process was to find a dataset containing features and values that we felt would be beneficial. Heavy discussion went into deciding which data would be most valuable to a model such as this one. Given the obvious effects of weather and climate on the size, spread, and impact of wildfires, it was evident that we should pair some sort of climate data with historical fire data containing information on size, location, and other factors that are important to predict. However, after struggling to find data that was consistent across multiple locations, we decided that it would be sufficient to use only California data to build the model. We ultimately selected two separate datasets.

Our primary data source for wildfire-related data contains records of 1.88 million U.S. wildfires from the United States Department of Agriculture's Forest Service. This data publication contains a spatial database of wildfires that occurred in the United States from 1992 to 2015. It is the third update of a publication originally generated to support the national Fire Program Analysis (FPA) system. The wildfire records were acquired from the reporting systems of federal, state, and local fire organizations.

To supplement our models, we also used California climate data that was requested from NOAA. This curated dataset contains various features related to temperature, precipitation, humidity, snowfall, and other climate variables for California in 2010.

Data Cleaning and Preparation:

After we had selected our data, we faced the monumental challenge of combining the two datasets and then further preparing the data for the model. To combine the disparate datasets, we leveraged 3 shared columns as join keys: Latitude, Longitude, and Date. However, it was quickly evident that the process would not be as easy as it seemed, because the longitude and latitude coordinates for the fire location and the climate data would never be exactly the same. In order to combat this, we developed a method of finding the longitude and latitude location representing the climate location closest to
the wildfire's location using a buffer on each coordinate to increase the chance of overlap of the coordinates.

Once the two datasets were combined, we were tasked with dealing with a series of missing values. Missing values are scattered (mostly) randomly throughout the dataset, with the exception of certain climate variables, some of which contain mostly missing values. These missing values could impact the statistical models if they mask any correlation. One option we considered was to remove rows with missing values entirely. However, while this would be simple and make the data as pure as possible, it would result in most of our rows being deleted and would not leave sufficient data to develop our model on. As long as only an insignificant portion of the rows contain missing values, the models should not be affected dramatically. In an effort to maximize our effective data and to limit the scope of the project, we used a subset of the dataset that had sufficient data.

A number of the features in the original wildfire dataset could be used as predictor variables, such as latitude/longitude coordinates, county, and time of year, while others were effectively different ways of identifying fires (multiple IDs). We removed the additional identifier columns, as well as columns that did not have pertinent information. Part of our research question is to determine whether there is a correlation between climate variables and fire size. Specifically, we wanted to see how temperature, precipitation, and humidity affect fire size, as well as any other climate features that can be accessed.

Since we were able to get geographical climate data for 2010, and since 2010 had relatively few missing values, we only used the 2010 subset of wildfire data. After that, we used recursive feature elimination (RFE) to determine the most effective set of features. Then, we cross-referenced the top features with the proportion of missing values in each column. Of the top features that were selected by RFE, the highest proportion of missing values was 0.22, so we removed climate features that had more than 25% missing values.

Finally, we were able to handle the rest of the missing values using forward fill. Since the remaining missing values were in continuous climate variables and were scattered (mostly) randomly throughout the dataset without long streaks, this method of handling missing values was appropriate.

Feature Selection:

As touched on previously, in order to adequately manage such a big dataset as well as run our model as efficiently and accurately as possible, we ran a feature selection model to determine which features are most pertinent and relevant. Given the clean dataset, we then ran a recursive feature elimination algorithm using a RandomForestClassifier as the base model to give us a list of the top ten features.

From the results, we empirically selected a couple of features that made the most practical impact from a humanistic standpoint and a model standpoint, and continued on to modeling.

Insights from Exploration:

Once we had cleaned and prepared the data, it was finally time to start implementing our model. Before we did so, there were a couple of important insights to gain from our exploration of the data.

Fire size is presented in two ways in our dataset: FIRE_SIZE describes the continuous size in acreage, while FIRE_SIZE_CLASS categorizes the fire on a scale from 'A' to 'G', where 'A' denotes the smallest fires and 'G' the largest. Here is a breakdown of how wildfires are classified based on acreage:

'A' = '0-0.25 acres'
'B' = '0.26-9.9 acres'
'C' = '10.0-99.9 acres'
'D' = '100-299 acres'
'E' = '300-999 acres'
'F' = '1000-4999 acres'
'G' = '5000+ acres'

For this project, we decided to focus on a classification model to predict the category that a fire might fall into based on pertinent variables. Because the continuous wildfire size data is highly right-skewed (most fires are very small, with a long tail of very large fires), a regression model would be more difficult to interpret. The classification scale takes the roughly logarithmic distribution into account and will lead to a more interpretable model. Below are two separate plots, depicting the visible difference between FIRE_SIZE and FIRE_SIZE_CLASS.
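As a rough illustration of the FIRE_SIZE_CLASS acreage breakdown listed earlier, the lookup can be sketched in a few lines of Python. This is not code from the project: the function name `fire_size_class` is hypothetical, and treating each class boundary as an inclusive upper limit (so that a value falling in the small gap between, say, 9.9 and 10.0 acres lands in the next class up) is an assumption of this sketch. The printed log10 column only shows why the lettered classes roughly track orders of magnitude.

```python
from bisect import bisect_left
import math

# Upper acreage bound for classes A through F; anything larger is class G.
# Bounds follow the breakdown in the text. Treating them as inclusive upper
# limits is an assumption made for this sketch.
CLASS_UPPER_BOUNDS = [0.25, 9.9, 99.9, 299.0, 999.0, 4999.0]
CLASS_LABELS = "ABCDEFG"


def fire_size_class(acres: float) -> str:
    """Map a continuous FIRE_SIZE value (acres) to a FIRE_SIZE_CLASS letter."""
    if acres < 0:
        raise ValueError("acreage cannot be negative")
    # bisect_left returns the index of the first bound that is >= acres,
    # or len(bounds) when acres exceeds every bound (class G).
    return CLASS_LABELS[bisect_left(CLASS_UPPER_BOUNDS, acres)]


if __name__ == "__main__":
    for acres in (0.1, 5.0, 50.0, 150.0, 500.0, 2500.0, 12000.0):
        print(
            f"{acres:>9.1f} acres  log10={math.log10(acres):6.2f}  "
            f"class {fire_size_class(acres)}"
        )
```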

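The coordinate matching described under Data Cleaning and Preparation (joining on Latitude, Longitude, and Date, with a buffer on each coordinate) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the method actually used: the `ClimateRecord` type, the `closest_climate_record` function, the `max_temp_c` field, and the 0.5 degree default buffer are all hypothetical, and distance is measured naively in degrees.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClimateRecord:
    latitude: float
    longitude: float
    date: str          # ISO date string, e.g. "2010-07-14"
    max_temp_c: float  # hypothetical climate feature


def closest_climate_record(
    fire_lat: float,
    fire_lon: float,
    fire_date: str,
    records: Iterable[ClimateRecord],
    buffer_deg: float = 0.5,  # hypothetical buffer size
) -> Optional[ClimateRecord]:
    """Return the same-day record nearest the fire within the buffer, else None."""
    best, best_dist = None, float("inf")
    for rec in records:
        if rec.date != fire_date:
            continue
        d_lat = abs(rec.latitude - fire_lat)
        d_lon = abs(rec.longitude - fire_lon)
        # The buffer relaxes an exact-coordinate join into a small bounding box.
        if d_lat > buffer_deg or d_lon > buffer_deg:
            continue
        # Squared distance in degrees: a crude measure that ignores the
        # narrowing of longitude with latitude, used only to rank candidates.
        dist = d_lat ** 2 + d_lon ** 2
        if dist < best_dist:
            best, best_dist = rec, dist
    return best


records = [
    ClimateRecord(38.5, -121.5, "2010-07-14", 35.0),
    ClimateRecord(38.9, -121.1, "2010-07-14", 33.0),
    ClimateRecord(38.55, -121.45, "2010-07-15", 30.0),
]
match = closest_climate_record(38.6, -121.4, "2010-07-14", records)
print(match)
```

A fire with no same-day record inside the buffer gets `None`, which is one way missing climate values could enter the merged table.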
From these two distributions, we can see that the categorical size is significantly more interpretable because the categories are logarithmic and take the original distribution into account. We can also draw insights from these distributions: most wildfires can be categorized as 'A' or 'B' fires. This means that relatively larger fires (class 'C' and above, in this case) only occur in a fraction of wildfire incidents. If the relatively few larger fires correlate with the environmental variables that we have provided, then an accurate and interpretable model can also be produced.

In addition to climate variables, we predicted that the cause of a fire might be a predictor for fire size and impact. Shown below is a distribution of wildfire causes.

Contrary to our original hypothesis, the most frequent (known) causes of wildfires are equipment use and lightning, while the least frequent causes are structure fires and fireworks. However, we want to know which of these result in the most damaging (or largest) fires.

In addition to the aforementioned feature selection, we used a correlation matrix to visualize which climate and fire variables have a strong correlation with fire size, as seen below.

While many of the climate variables correlate with each other (which is to be expected), there is relatively little correlation with fire size, meaning that our climate variables have a low chance of providing a rigorous model. If this is the case, it will be revealed by the accuracy of our models.

Modeling:

Given these insights, it was finally time for us to implement models to try to predict the size of a wildfire. Our decision to use the Random Forest Classifier and the Gradient Boosting Classifier was guided by the scikit-learn documentation. Following this graphic, we narrowed the choice to SVC or Ensemble Classifiers, and ultimately settled on the Gradient Boosting Classifier and the Random Forest Classifier.

Random Forest Classifier:

We started with the Random Forest classifier. For each model, the process was similar. We began with a simple grid search to determine the best parameters to use in our model and, leveraging the results from the grid search, we trained the model.

Gradient Boosting Classifier:

A similar process was conducted for the Gradient Boosting classifier, where we ran a grid search and leveraged the results to train the model.

RESULTS

Upon finalizing our models and testing their accuracy, we were disappointed to see the outcome. The accuracy for the Random Forest model was a dismal 0.6343, with the Gradient Boosting model scoring a similar 0.6295.
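The workflow described under Feature Selection and Modeling (recursive feature elimination with a random forest, followed by a grid search for each classifier) can be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: the data is synthetic (generated with `make_classification`), the parameter grids are hypothetical and deliberately tiny, and the accuracies it prints say nothing about the 0.6343 and 0.6295 reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the merged fire/climate table (assumption: the real
# features and FIRE_SIZE_CLASS labels are not reproduced here).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Recursive feature elimination with a random forest as the base model,
# keeping the top ten features. Fitted on training data only.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=10,
    step=2,  # drop two features per iteration to keep the sketch fast
)
selector.fit(X_train, y_train)
X_train_top = selector.transform(X_train)
X_test_top = selector.transform(X_test)

# Hypothetical parameter grids for the two classifiers.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, None]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [25, 50], "learning_rate": [0.05, 0.1]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    # With the default refit=True, the best configuration is retrained on
    # the full training split after the search finishes.
    search.fit(X_train_top, y_train)
    accuracy = accuracy_score(y_test, search.predict(X_test_top))
    results[name] = (search.best_params_, accuracy)
    print(f"{name}: best params {search.best_params_}, test accuracy {accuracy:.4f}")
```

Holding out a test split before feature selection keeps the reported accuracy from being inflated by features chosen with knowledge of the test rows.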

DISCUSSION

We were no doubt disappointed by the feeble results from our models, and we conducted a bit of investigation into what could have gone wrong. Ultimately, we went back to the first stage and examined the correlation matrix, which visualizes the strength of relationships between variables, to determine if there was an issue in the data selection process.

While many of the climate variables correlate with each other (which is to be expected), there is relatively little correlation with fire size. This lack of correlation was evident through the accuracy (or lack thereof) of our model. This leads us to recognize that our fundamental flaw was not within the model or the model selection; rather, the features we selected had little correlation with the target, fire size, so the model was unable to accurately predict anything of substance.

FUTURE WORK

Future work on this dataset or topic could either use the fire data to create a tool that predicts future fires based on the weather conditions that we found, or integrate other datasets into the current project to allow for a more in-depth analysis of the factors that cause these fires. For example, we would want to find data that is better correlated with fires and thus be able to create some sort of predictive tool to see where fires are most likely to occur. This would require more granular data about the landscape where the fire occurred, as well as data about state firefighting budgets, the funding associated with individual counties and towns, and the weather data of these individual areas. As we saw from the first study we examined in the related work section, this would require integrating data about firefighting, vegetation management, and land ownership. Land ownership is crucial to understand, since firefighting is divided between federal and non-federal land. Federal agencies typically only fight fires on federal lands, while state agencies focus on non-federal lands, preventing a clear comparison of the possible influence of federal and state firefighting.

Even with the dataset we had, there were ways to examine the data in a more granular or nuanced manner, such as dividing up fires by region or county and seeing if any specific factors in these areas stood out against the fires in the state as a whole, examining different weather factors related to how fires started, and seeing if these factors changed over the course of the year.

REFERENCES

[1] https://www.iii.org/fact-statistic/facts-statistics-wildfires

[2] https://www.theguardian.com/us-news/video/2018/nov/13/camp-fire-deadliest-wildfire-california-history-video-report

[3] https://www.usfa.fema.gov/data/statistics/fire_death_rates.html

[4] https://www.fs.fed.us/rm/pubs_series/int/gtr/int_gtr030.pdf

[5] http://science.sciencemag.org/content/313/5789/94

[6] https://www.usatoday.com/story/news/2018/12/03/camp-fire-death-toll-california-deadliest-wildfire/2199035002/

[7] https://fivethirtyeight.com/features/wildfires-in-the-u-s-are-getting-bigger/