
Cleaning the National UFO Reporting Center Dataset

A few weeks ago, I started playing with the NUFORC dataset: I did some data exploration and built a simple dashboard with Shiny.

As I worked on the data visualization, I realized that the data was loading fairly slowly, and some readers might not be patient enough to wait.

The more I looked at the data, the more I realized that it was a tad, let's say, messy... Case in point, one of the entries for the "duration" field, which represents how long the UFO was visible, was just "you butt holes". Ummmm, not a time value. So I decided to clean up and optimize this UFO dataset.

This is not the sexiest job in the world, but it can be pretty satisfying to get a dataset nice and usable. Here's what I did:

Download Dataset

The first thing that we need to do is acquire the dataset, which is basically what I did last week but with a few tweaks.

library(tidyverse)
library(data.world)

datadir <- sprintf("%s/%s", here::here(), "03.clean.dataset/data")
saved_cfg <- save_config(Sys.getenv("data.world_apikey"))
# API key stored in .Renviron file
set_config(saved_cfg)

nuforc_reports_download <- query(
  qry_sql("SELECT * FROM nuforc_reports"),
  dataset = "https://data.world/timothyrenner/ufo-sightings"
) %>%
  mutate(key = row_number()) %>%
  select(key, occurred = date_time, state, city, duration, shape,
         latitude = city_latitude, longitude = city_longitude,
         text, date_posted = posted)

save(nuforc_reports_download,
     file = sprintf("%s/%s", datadir, "nuforc_reports_download.rdata"))

In addition to saving the dataset, I added a primary key field and cleaned up some of the column names. Saving a copy is nice because it gives me a pristine reference to come back to later (I will be filtering and changing rows).
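
Since every row carries that key, the cleaned data can always be traced back to the raw download. Here's a little sketch of my own (not part of the original pipeline, and it assumes the cleaned nuforc_reports table built later in this post) that lists the rows the cleaning steps end up removing:

# Sketch: anti_join() keeps download rows whose key no longer appears
# in the cleaned table, i.e. everything the cleaning removed
nuforc_reports_download %>%
  anti_join(nuforc_reports, by = "key") %>%
  select(key, state, duration)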

Keep Only Valid US States

While I was working on the dashboard, I found a ton of invalid geographic codes. Some of these were actual places, but I couldn't verify that in any systematic way. And honestly, I didn't expect data from outside the United States. So what I decided to do was filter out any records that were not from locations in the US.

# Use state dataset to only include valid US States

nuforc_reports <-
  nuforc_reports_download %>%
  filter(state %in% state.abb)

Notice that I assigned the result to a new dataset. I did this because I wanted to keep the dataset I modify separate from the dataset I downloaded. state.abb is a built-in R constant containing the two-letter abbreviations of the 50 US states.
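
If you're curious what that filter throws away, a quick check like this (my addition, not part of the cleaning pipeline) tallies the state codes that don't match a US abbreviation:

# Which state codes fall outside the 50 US states, and how common are they?
nuforc_reports_download %>%
  filter(!state %in% state.abb) %>%
  count(state, sort = TRUE)

I'd expect things like Canadian province codes and missing values to land here, which is exactly the stuff I couldn't verify systematically.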

Fix Dates

Just like in my original data exploration, I want to make sure the dates are in the right format. The lubridate package gets this done.

library(lubridate)

nuforc_reports <-
  nuforc_reports %>%
  mutate(occurred = ymd_hms(occurred),
         date_posted = ymd(date_posted))
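
One nice thing about lubridate's parsers is that anything they can't make sense of becomes NA (with a warning), so a quick count shows how many dates failed to parse. A small sanity check of my own:

# ymd_hms() and ymd() return NA on failure, so count the NAs
nuforc_reports %>%
  summarise(bad_occurred = sum(is.na(occurred)),
            bad_posted = sum(is.na(date_posted)))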

Fix the Duration Field

This one was a doozy. NUFORC's data collection didn't appear to have any data validation, and the duration field in particular was all over the place. This field is meant to record how long the UFO contact lasted, but people could write in whatever they wanted.

Most of the entries are usable but written out in English. The best ones read like "15 minutes", others more like "fifthteen minuutes", and some were flat out wrong, like "you butt holes" or "-----------".
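
One trick for getting a handle on a mess like this is the digit mask you'll see as the pattern column in the code below: replace every digit with "#" and suddenly entries group by shape. For example:

# Masking digits exposes the format of each entry
str_replace_all(c("15 minutes", "2-3 mins", "you butt holes"), "\\d", "#")
#> [1] "## minutes"     "#-# mins"       "you butt holes"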

My strategy here was to go step by step, ultimately ending up with two additional columns: one for the numerical value and one to act as a unit category. I want the numerical values so that I can do data analysis (how long are most sightings?). The category levels are: seconds, minutes, hours, and day_or_more. The idea is that I will eventually need a way to harmonize the numerical values. By that, I mean that I will want to transform all the times onto the same scale, so I can compare hours to minutes.
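
I'll save the actual conversion for later, but as a rough sketch it would look something like this, assuming the duration_time and duration_unit columns built below (pegging day_or_more at one day is my own arbitrary choice):

# Sketch of the harmonization step: convert everything to seconds
unit_in_seconds <- c(seconds = 1, minutes = 60, hours = 3600, day_or_more = 86400)

nuforc_reports %>%
  mutate(duration_seconds = duration_time * unname(unit_in_seconds[duration_unit]))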

Anyway, here is the cleaning code. Hopefully, by following the tidyverse methodology, you will be able to follow along. But yeah, it's a little long; this was a harder problem than usual.

# Fix duration field

duration_dataset <-
  nuforc_reports %>%
  select(key, duration) %>%
  mutate(
    pattern = str_replace_all(duration, "\\d", "#"),  # digit mask, handy for eyeballing formats
    numbers = str_extract_all(duration, "\\d+"),      # every run of digits in the entry
    words = str_replace_all(duration, "\\d", "")      # the entry with the digits stripped out
  ) %>%
  # Drop empty entries and known junk values
  filter(
    !is.na(duration),
    !duration %in% c(
      "?",
      "unknown",
      "-------------------------",
      "----",
      "--",
      "-",
      "you butt holes"
    )
  ) %>%
  mutate(
    words = str_to_lower(words),
    words = str_extract_all(words, "[a-z]+"),
    # Assign a unit category by matching each word against lists of known
    # spellings (and many, many misspellings) of each unit
    word_category = map_chr(.x = words, .f = function(x) {

      s <- NA_character_

      for (val in x)
        if (val %in% c("minutes", "mins", "min", "minute",
                       "mintues", "minuts", "minuets", "minutess",
                       "minites", "minuites", "minets", "mimutes", "minues",
                       "miniutes", "miutes", "minits", "mintes", "minuted",
                       "minutea", "minutos", "mnutes", "ninutes", "mintutes",
                       "mim", "mm", "mns", "mint", "minuet", "miinutes",
                       "minnutes", "minures", "minutew", "miuntes", "mniutes",
                       "m", "munites", "mi", "mims", "minitues", "minnute", "minuates",
                       "minuetes", "minuits", "minuntes", "minurwa", "minustes", "minuters",
                       "minutis", "minuutes", "monutes", "ms", "tominutes", "minuit", "minutue",
                       "minu", "miniuts", "minns", "minea", "tomins", "mints", "minutestriangle",
                       "minuite", "imin", "iminute", "menutes", "miin", "miites", "mina", "minents",
                       "minet", "minetes", "miniute", "minonds", "mnute", "muntes", "munutes", "muinte",
                       "miuets", "miunets", "miunute", "miute", "miuts", "mlnutes", "minurtes", "minut",
                       "minuteds", "minutee", "minutees", "minutese", "minuteswhile", "minutets",
                       "minutez", "minuties", "minutres", "minuttes", "minutues", "mintue", "mintute",
                       "minsorlonger", "inutes"))
          s <- "minutes"

      for (val in x)
        if (val %in% c("seconds", "sec", "secs", "second", "secounds",
                       "secconds", "milliseconds", "moment", "secods",
                       "secomds", "secons", "seckonds", "momentary",
                       "fast", "brief", "short", "secondss",
                       "minsec", "sconds", "secinds", "secnds",
                       "secondes", "seonds", "moments", "econds",
                       "millisecond", "aboutseconds", "secodns", "seconda",
                       "secondds", "seconnds", "secpnds", "secunds", "toseconds",
                       "quick", "quickly", "approxsec", "aproxsecs", "seco",
                       "tosecs", "tosec", "ceconds", "deconds", "desonds",
                       "secon", "milisec", "milisecod", "sseconds", "ssecs",
                       "sesconds", "seounds", "segs", "segundos", "senconds", "seocds", "seoonds",
                       "secands", "secants", "seceonds", "secionds", "seconcs", "secondsss",
                       "seconfs", "secopnd", "secounts", "secsonds", "secthen", "secx",
                       "sceonds", "scounds", "secaond", "mseconds", "seeconds", "blink",
                       "secends", "seconts", "instantaneous", "instant", "flash"))
          s <- "seconds"

      for (val in x)
        if (val %in% c("hours", "hour", "hrs", "hr", "hous", "tohour",
                       "hiours", "hm", "horas", "houres", "hourish", "hoursmin",
                       "onehour", "houl", "housr", "nours", "houra", "hrmin"))
          s <- "hours"

      for (val in x)
        if (val %in% c("days", "day", "year", "years", "month", "months",
                       "lifes", "life", "week", "weeks", "summer", "nite",
                       "winter", "daily", "allways", "ongoing", "daytime",
                       "every", "nights", "night", "yrs", "wks"))
          s <- "day_or_more"

      s
    }),
    # Count how many numbers appear in each entry
    numbers_length = map_int(.x = numbers, .f = length)
  ) %>%
  filter(numbers_length <= 2) %>% # entries with long series of numbers were too messed up to use
  mutate(numbers = map_dbl(.x = numbers, .f = function(x) {
    n <- NA_real_
    l <- length(x)
    if (l == 2)
      n <- (as.numeric(x[1]) + as.numeric(x[2])) / 2 # ranges like "5-10" become the midpoint
    if (l == 1)
      n <- as.numeric(x[1])
    n
  })) %>%
  select(key, duration_time = numbers, duration_unit = word_category)

nuforc_reports <-
  nuforc_reports %>%
  left_join(duration_dataset, by = "key")

Yikes, you may need to grab a drink after that code block. Let's take a first look at the result of all this work. I'm going to count the duration categories we created so we can eyeball the kinds of durations these UFO sightings fall into.

nuforc_reports %>%
  count(duration_unit) %>%
  arrange(duration_unit)

This is what we get:

duration_unit     n
<chr>         <int>
1 day_or_more     955
2 hours          7428
3 minutes       68396
4 seconds       32291
5 NA            12762

Most sightings in our reports lasted a very short amount of time (minutes or seconds). The longer sightings are also interesting to note, since they may represent more unusual occurrences. From this preliminary count, we can already guess that when we dig deeper into the data, the average duration will hover close to the smaller units.
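
To eyeball that hunch, here's a quick summary (my own follow-up, not part of the cleaning itself) of the typical numeric value within each unit category:

# Median raw value within each duration unit (sketch)
nuforc_reports %>%
  group_by(duration_unit) %>%
  summarise(median_time = median(duration_time, na.rm = TRUE))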

Wrap It Up

The remaining data seems pretty usable. Code like this is hard to present fully in a blog post; context helps. So if you are interested in following along with this work, feel free to go directly to the UFO/UAP Data Science Project.

Oh, and one more thing: at the very end of all this, I saved the cleaned dataset.

save(nuforc_reports,
     file = sprintf("%s/%s", datadir, "nuforc_reports.rdata"))

When you want to use it in your own project, you do the reverse with the load() function.

load("nuforc_reports.rdata")

Note that you will need to point the load function to the actual location of your dataset.

So that is it for now. With this homework complete, maybe I can move on to some more detailed analysis and visualizations!