Exploring the National UFO Reporting Center Dataset
Well Jacques, hold my beer, because I'm going to take a look at the NUFORC dataset, and I may even run a correlation here and there. Sure, maybe this won't solve the mystery of the UFO/UAP phenomenon, but it will be a fun ride. Who knows, maybe we will find some insight regardless of Jacques' caveat.
What is the NUFORC Dataset?
The National UFO Reporting Center (NUFORC) serves as a clearinghouse for UFO reports across the entire United States and makes this data available on its website. Anyone who witnesses a UFO can contact NUFORC and submit information about their experience. NUFORC has been in operation since 1974.
NUFORC makes summaries of these reports available, but not in a format that can be easily analyzed. Luckily, Tim Renner scrapes the NUFORC website and publishes a tidy version of this data as UFO Sightings on data.world. I will be using an R package named data.world to connect to Tim's dataset in RStudio. Note: RStudio is an open-source tool for analyzing data in R that you can use to explore datasets like this one.
Connect to the NUFORC Dataset in R
In order to connect to datasets hosted on data.world, you will need a free data.world account and an API key that you can get from your data.world profile. In R, the process of using this API is simplified by the "data.world" package.
You will have to install this package directly from GitHub, so there are a few more steps involved. data.world has an article explaining everything, but I will also include the code that I used below.
Install the data.world package using the code below:
## MAKE SURE DEVTOOLS IS INSTALLED
install.packages("devtools")
## INSTALL DATA.WORLD FROM GITHUB
library(devtools)
install_github("datadotworld/data.world-r",
               build_vignettes = TRUE)
When you want to import a dataset from data.world, you will have to supply the API key from your profile on the data.world website and then set the configuration like this:
library(data.world)
api <- "YOUR_API_KEY_HERE" ## PASTE YOUR API KEY BETWEEN THE QUOTES
saved_cfg <- save_config(api)
set_config(saved_cfg)
To download the dataset, we send the data.world API a SQL statement along with the URL of the dataset.
nuforc_reports <- query(
qry_sql("
SELECT * FROM nuforc_reports"),
dataset = "https://data.world/timothyrenner/ufo-sightings"
)
save(nuforc_reports, file="nuforc_reports.rdata")
I also save the dataset immediately, since it takes quite a while to download and I don't want to wait around every time that I have to load this data.
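One way to make that download-then-save pattern reusable is a small caching helper that only calls the API when no local copy exists. This is my own sketch (`load_or_fetch` is not part of the data.world package):

```r
# A small caching helper (my own sketch; not part of the data.world package).
# It calls `fetch()` only when no saved copy exists; otherwise it loads from disk.
load_or_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    e <- new.env()
    load(path, envir = e)          # restore the previously saved object
    get(ls(e)[1], envir = e)
  } else {
    obj <- fetch()                 # the expensive call, e.g. the data.world query
    save(obj, file = path)
    obj
  }
}

## Usage with the query above (uncomment to run against data.world):
# nuforc_reports <- load_or_fetch("nuforc_reports.rdata", function() {
#   query(qry_sql("SELECT * FROM nuforc_reports"),
#         dataset = "https://data.world/timothyrenner/ufo-sightings")
# })
```

On later runs the function never touches the network, which is handy when you rerun an analysis script many times.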
First Look at the NUFORC Dataset
Before doing any detailed analysis or visualization of a dataset, I like to open up the hood and take a look. Sometimes it's easiest to just click open the dataset in RStudio; this gives you a nice Excel-like interface that you can filter and sort.
Another thing I like to do is use the glimpse function from the tidyverse to get a super-condensed view of the data. Here's how to do that.
library(tidyverse)
load("nuforc_reports.rdata")
glimpse(nuforc_reports)
Also note that I load the dataset from the saved file, since I'm assuming this code will be rerun many times after the original download. Here is what you will see:
Rows: 136,937
Columns: 13
$ summary <chr> "Viewed some red lights in the sky appearing to be moving north and slower than an airplane.",…
$ city <chr> "Visalia", "Cincinnati", "Springfield", "Aurora", "Madison", "Bejuma, Edo. Carabobo, Caracas (…
$ state <chr> "CA", "OH", "NJ", "IL", "WI", NA, "AZ", "KY", NA, "QC", "ON", "DE", "WA", "CA", "CA", "IL", "M…
$ date_time <dttm> 2021-12-15 21:45:00, 2021-12-16 09:45:00, 2020-05-13 03:37:00, 2020-04-26 21:23:00, 2020-04-2…
$ shape <chr> "light", "triangle", "light", "light", "unknown", "light", "disk", "light", "formation", "fire…
$ duration <chr> "2 minutes", "14 seconds", "7 seconds", ":45 seconds", "4 to 5 minutes", "4 minutes", "2 minut…
$ stats <chr> "Occurred : 12/15/2021 21:45 (Entered as : 12/15/2021 9:45 PM) Reported: 12/15/2021 10:30:54 …
$ report_link <chr> "http://www.nuforc.org/webreports/165/S165881.html", "http://www.nuforc.org/webreports/165/S16…
$ text <chr> "Viewed some red lights in the sky appearing to be moving north and slower than an airplane. S…
$ posted <date> 2021-12-19, 2021-12-19, 2020-05-15, 2020-06-25, 2020-06-25, 1999-02-16, 2014-09-05, 2014-09-0…
$ city_latitude <dbl> 36.35665, 39.17450, 40.69870, 41.76136, 43.06651, NA, 33.42126, 37.73390, NA, 45.51409, 44.150…
$ city_longitude <dbl> -119.34794, -84.48136, -74.32960, -88.31529, -89.38736, NA, -111.77280, -83.08360, NA, -73.585…
$ city_location <chr> "POINT(-119.34793664122137 36.356650127226466)", "POINT(-84.48136299381886 39.17450311744155)"…
Take a moment and drink this in. We can already learn a lot about this dataset. We know the scope of the reporting (over 136,000 reports). Perhaps more importantly, we can see the kind of data in each column of the dataset.
You can see what kind of data each column represents by looking at what is called the data dictionary. Our dataset's data dictionary is hosted on data.world. I have reproduced key parts of it below for our reference.
NUFORC Data Dictionary
Field | Type | Description |
---|---|---|
summary | string | Summary of the report. Usually the first few sentences. |
city | string | The city of the sighting. |
state | string | The 2 character state code of the sighting. |
date_time | datetime | The date and time of the sighting in ISO 8601 (local time). |
shape | string | The shape of the sighting. |
duration | string | The duration of the sighting in no particular format. |
stats | string | Summary stats about the sighting (when it occurred, when it was posted, etc.). |
report_link | url | A link to the original report on the NUFORC site. |
text | string | The text of the sighting report. |
posted | datetime | When the sighting was posted to the NUFORC site. |
city_latitude | decimal | The latitude of the city in which the sighting occurred. |
city_longitude | decimal | The longitude of the city in which the sighting occurred. |
city_location | geopoint | The geocoded location of the city in which the sighting occurred. |
Once I have an idea of what a dataset contains, I will start to explore the data. You can use visualizations to do this or you can just filter and sort the data. One trick is to ask yourself questions about the data.
For instance, I'm curious about the scope of the reports. When was the first report and how many reports do we see year after year? What types of reports do we see? How detailed are they? This step goes on until you are satisfied and you will come back to it as you think of new ways to use this data.
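Those questions translate almost directly into quick dplyr one-liners. Here's a self-contained sketch with a few made-up rows standing in for nuforc_reports (the column names come from the data dictionary above):

```r
library(dplyr)
library(lubridate)

# A few made-up rows standing in for nuforc_reports, so the sketch runs on its own
reports <- tibble(
  date_time = as.POSIXct(c("1969-01-01 10:00", "1995-07-04 21:00",
                           "1995-08-01 22:00", "2021-12-15 21:45")),
  shape = c("light", "disk", "light", "triangle")
)

reports %>% count(year = year(date_time))  # how many reports per year?
reports %>% count(shape, sort = TRUE)      # which shapes are most common?
```

Swap the toy tibble for the real nuforc_reports and the same two lines answer the first couple of questions on my list.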
Number of UFO Reports by Year
To wrap my head around this dataset, I wanted to take a look at the number of reports over the years. This gives me an idea of the overall shape of the data. To do that, I used the ggplot2 package in R to create an area plot that counts reports over time, based on the date_time column.
library(tidyverse)
load("nuforc_reports.rdata")
# CLEAN DATA
library(lubridate)
nuforc_reports <-
nuforc_reports %>%
mutate(date_time = ymd_hms(date_time),
posted = ymd(posted))
# FREQUENCY BY DATE
library(ggplot2)
nuforc_reports %>%
ggplot(aes(as.Date(date_time))) +
geom_area(stat = "bin", bins=50) +
scale_x_date(date_labels = "%Y",
date_breaks = "10 year") +
labs(x="Date Reported",
y="Number of UFO Reports",
title="NUFORC Reports By Date") +
theme_minimal() +
theme(axis.text = element_text(size = 12),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
title = element_text(size = 20))
In addition to the plot, you can see that I did a little bit of data cleaning just so I can make sure the two date fields are in a standard format. Here is what you get from this area plot.
This shows a large uptick in reports starting in 1995. The earliest report was in 1969 and the most recent was in December 2021. That's not super clear from the plot, but you can use a bit of tidyverse code to get a precise readout.
nuforc_reports %>%
summarise(min = min(date_time, na.rm = TRUE),
median = median(date_time, na.rm = TRUE),
max = max(date_time, na.rm = TRUE)) %>%
rownames_to_column() %>%
gather(var, value, -rowname) %>%
spread(rowname, value)
This procedure will produce the results below where you can see that the first report was in 1969.
1 max 2021-12-19 23:00:00
2 median 2011-06-01 22:30:00
3 min 1969-01-01 10:00:00
This is interesting because, in addition to seeing the first and most recent (min and max) dates, we can see that the reported sightings skew toward the present day: the median (middle point) of all the reported dates is a lot closer to 2022 than to 1969. This is almost certainly because UFOs became easier to report once the web was widely adopted, versus needing to call in reports by phone, so obviously we wouldn't want to read too much into this observation. Still, it might be interesting to overlay the later dates with UFO events that got media attention, such as the famous Tic Tac UAP.
What Kinds of Shapes Are People Seeing?
So what are people reporting that they see? Since we have a shape category, we can count the number of reports for each shape and plot them using code like this:
d <-
nuforc_reports %>%
group_by(shape) %>%
count(shape) %>%
filter(!is.na(shape),
shape != "unknown",
shape != "changed",
n > 20) %>%
arrange(desc(n))
ggplot(d, aes(x = shape, y = n)) +
geom_bar(stat = "identity") +
coord_flip() +
  labs(x="UFO Shape",
       y="Number of Times Reported",
       title="UFO Shapes Observed") +
theme_minimal() +
theme(axis.text = element_text(size = 12),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
title = element_text(size = 20))
After running this code in R you get this handy visualization of UFO shapes.
As you can see, the most common shape is a light. Also note that I did filter this list a little bit, dropping missing values, the "unknown" and "changed" categories, and any shape reported 20 times or fewer. I would be tempted to consolidate some of these. For instance, a disk, circle, and sphere are all essentially the same thing, and I might think of grouping those together. Same for triangle-like shapes. But I'm not 100% sure yet that I would do that.
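If I did consolidate, a case_when() lookup would be one way to do it. The groupings below are my own guesses, not an official NUFORC taxonomy:

```r
library(dplyr)

# Collapse similar shape labels into broader families.
# These groupings are my own guesses, not an official NUFORC taxonomy.
consolidate_shape <- function(shape) {
  case_when(
    shape %in% c("disk", "circle", "sphere", "oval", "egg") ~ "round",
    shape %in% c("triangle", "delta", "chevron")            ~ "triangular",
    TRUE                                                    ~ shape
  )
}

consolidate_shape(c("disk", "sphere", "triangle", "light"))
#> [1] "round"      "round"      "triangular" "light"
```

Applying this with mutate(shape = consolidate_shape(shape)) before the count would merge those bars in the plot.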
Most Detailed Report
Ok, so I have a confession here. I did find the longest report, but I did not anticipate just how long it is, so I can't reproduce it here (it's longer than this entire article). But I can show you how to find it in R:
nuforc_reports %>%
mutate(text_length = str_length(text)) %>%
select(text_length, text) %>%
arrange(desc(text_length)) %>%
head(1)
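If you just want to peek at the opening of a long report instead of printing the whole thing, stringr's str_trunc() will clip it to a fixed width. A self-contained sketch with stand-in text:

```r
library(stringr)

# Stand-in text for a very long report
report <- strrep("I saw a bright light over the ridge. ", 100)

str_length(report)             # total character count
str_trunc(report, width = 60)  # the opening characters plus "..."
```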
Here is the beginning of that very long report:
Wow, so the first deep dive reveals a UFO/Religion connection. I would want to investigate some analysis tools that would allow me to process, cluster, and organize all of this text information.
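As a tiny first step toward mining those text fields, a word-frequency count needs nothing beyond base R (tidytext or quanteda would be the fuller-featured route). A sketch with made-up report summaries:

```r
# Made-up report summaries standing in for the real text column
reports <- c("bright light moving north",
             "triangle of lights moving slowly",
             "red light hovering, then moving away")

# Lowercase, split on anything that isn't a letter, and tally the words
words <- unlist(strsplit(tolower(reports), "[^a-z]+"))
words <- words[words != ""]
head(sort(table(words), decreasing = TRUE), 3)
```

From there, stop-word removal and clustering would be the natural next steps.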
What to do Next?
This article is getting a bit long so I'm going to wrap it here, but I could continue down this route for days. Other than simply looking through this data, I am going to start to identify some fields that could be used to gather deeper insights. Maybe find some ways to efficiently mine the various text fields. Even though Jacques doesn't recommend it, I may use some of these fields to attempt to find correlations and trends. In parallel to that I am going to start building a dashboard of some kind that I can share here. Stay tuned here and I will share anything that would work well for this blog.