Where is the Wild West: Web Scraping, Data Science, and GIS using Jupyter Notebooks

May 14, 2019 — Stephen Hudak

There are a lot of hypocritical people who complain about modern life while benefiting enormously from it every day… I know this is true because I am one of those people.

In a lot of ways people had it better in the past. They could discover new continents, invent the theory of gravity, and drive without seatbelts.

The American expansion into the West was one of those golden periods of time. Cowboys riding around, miles and miles of open range, tuberculosis—what more could anyone ask for?

I am a fan of Western films. Those films captured the West with all the cool shootouts and uniquely western landscapes. Sure, they are romanticized, but that is what I like. While watching a spaghetti western, I wondered where these events were supposed to be taking place.

The question in my head became, “According to Western films, where is the West?” We can answer that. We have the technology. After a few false starts, I got a working process together:

  1. Find a bunch of lists of Western films on Wikipedia
  2. Scrape the film titles
  3. Run those film titles by Wikipedia to see if we can find their plots
  4. Use a natural language processor (NLP) to pull place names from the plots
  5. Run the place names through Wikipedia again to remove junk place names
  6. Geocode the place names and load it all into a spatially enabled dataframe
  7. Map it

Let’s take the example of the 1952 film Cripple Creek. First, we get the title.
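
Here is a minimal sketch of the title-scraping step, using only Python's standard library. It assumes that each film title on a list page is an italicized link (`<i><a ...>Title</a></i>`), which is common on Wikipedia film lists but is not confirmed by the notebook. The `TitleParser` class, the `scrape_titles` function, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of links that appear inside <i>...</i> tags."""

    def __init__(self):
        super().__init__()
        self.italic_depth = 0   # number of <i> tags we are currently inside
        self.in_link = False    # True while inside an <a> nested in <i>
        self.buffer = []        # text pieces of the current link
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic_depth += 1
        elif tag == "a" and self.italic_depth > 0:
            self.in_link = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "i" and self.italic_depth > 0:
            self.italic_depth -= 1
        elif tag == "a" and self.in_link:
            self.in_link = False
            title = "".join(self.buffer).strip()
            if title:
                self.titles.append(title)

    def handle_data(self, data):
        if self.in_link:
            self.buffer.append(data)


def scrape_titles(html):
    """Return the italicized link texts found in an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical fragment of a list page, for illustration only.
sample_html = """
<table>
<tr><td><i><a href="/wiki/Cripple_Creek_(film)">Cripple Creek</a></i></td>
    <td><a href="/wiki/Some_Director">Some Director</a></td></tr>
</table>
"""
titles = scrape_titles(sample_html)
print(titles)
```

Links that are not italicized, such as the director in the sample row, are ignored, which keeps most non-title links out of the result.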

Next, we use the title to find the plot on its Wikipedia page.
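
One way to isolate the plot, sketched here under assumptions: the MediaWiki TextExtracts API can return a page as plain text in which sections are marked with wiki-style headings such as `== Plot ==`. The sketch below assumes that text has already been downloaded and only shows the section extraction. The `extract_plot` function is hypothetical, and the sample text is an invented placeholder, not the film's actual plot.

```python
import re


def extract_plot(page_text):
    """Return the body of the '== Plot ==' section, or None if absent."""
    # Match a level-2 "Plot" heading, then capture lazily up to the next
    # level-2 heading or the end of the text. Level-3 subsections
    # ("=== ... ===") stay inside the captured plot.
    match = re.search(
        r"^==\s*Plot\s*==\s*$(.*?)(?=^==[^=]|\Z)",
        page_text,
        flags=re.MULTILINE | re.DOTALL,
    )
    return match.group(1).strip() if match else None


# Invented placeholder text, for illustration only.
sample_text = (
    "Intro paragraph about the film.\n\n"
    "== Plot ==\n"
    "Placeholder plot text set in Cripple Creek and Texas.\n\n"
    "== Cast ==\n"
    "Placeholder cast list.\n"
)
plot = extract_plot(sample_text)
print(plot)
```

Pages whose plot section uses a different heading (for example "Plot summary" or "Synopsis") return `None` here, which is one plausible reason fewer plots than titles survive this step.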

After running the plot through the NLP, we get the following “GPE” entities.
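
“GPE” (geopolitical entity) is the label that spaCy's English models give to countries, states, and cities. The post does not name the NLP library, so the sketch below assumes spaCy with the `en_core_web_sm` model installed. The helper names `keep_gpe` and `place_names` are hypothetical, and the sample entities are hand-written stand-ins for real model output.

```python
def keep_gpe(entities):
    """Return unique GPE entity texts, in first-seen order.

    `entities` is an iterable of (text, label) pairs.
    """
    seen = set()
    places = []
    for text, label in entities:
        if label == "GPE" and text not in seen:
            seen.add(text)
            places.append(text)
    return places


def place_names(plot, model="en_core_web_sm"):
    """Run spaCy NER over a plot and keep only the GPE entities.

    Assumes spaCy and the named model are installed. The import is local
    so that keep_gpe can be used without spaCy.
    """
    import spacy

    nlp = spacy.load(model)
    doc = nlp(plot)
    return keep_gpe((ent.text, ent.label_) for ent in doc.ents)


# Hand-written (text, label) pairs standing in for real model output.
sample_entities = [
    ("Cripple Creek", "GPE"),
    ("Texas", "GPE"),
    ("1893", "DATE"),
    ("Texas", "GPE"),
]
places = keep_gpe(sample_entities)
print(places)
```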

Next, we geocode these place names to get X and Y coordinates and read them into a pandas dataframe.
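
The geocoding step can be sketched with the geocoder passed in as a function, so the example runs without a network connection. In the notebook this would presumably be a call to a real geocoding service; the phrase "spatially enabled dataframe" suggests the ArcGIS API for Python, but that is an assumption. `geocode_places` and `fake_geocoder` are hypothetical, and the coordinates are rough hand-entered values for illustration only.

```python
def geocode_places(places, geocode_fn):
    """Return one row per place that the geocoder could resolve.

    `geocode_fn` takes a place name and returns (x, y), meaning
    (longitude, latitude), or None when there is no match.
    """
    rows = []
    for place in places:
        result = geocode_fn(place)
        if result is None:
            continue  # skip names the geocoder cannot resolve
        x, y = result
        rows.append({"place": place, "x": x, "y": y})
    return rows


# Stand-in geocoder with rough, hand-entered coordinates (illustration only).
ROUGH_COORDS = {
    "Cripple Creek, CO": (-105.18, 38.75),
    "Texas": (-99.0, 31.0),
}


def fake_geocoder(place):
    return ROUGH_COORDS.get(place)


rows = geocode_places(
    ["Cripple Creek, CO", "Texas", "Nowhereville"], fake_geocoder
)
print(rows)
```

With pandas installed, `pandas.DataFrame(rows)` turns these rows into a dataframe with `place`, `x`, and `y` columns.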

Finally, we map it.

We see Cripple Creek, CO and Texas mapped from the plot of the film Cripple Creek.
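
The notebook draws its own map, and the post does not show how. As a portable alternative (an assumption, not the author's method), geocoded points can be written out as GeoJSON, which most web maps and GIS tools can display. The `to_geojson` function is hypothetical and the coordinates are rough values for illustration.

```python
import json


def to_geojson(rows):
    """Build a GeoJSON FeatureCollection from rows with place, x, y keys."""
    features = [
        {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON order is [longitude, latitude], i.e. [x, y].
                "coordinates": [row["x"], row["y"]],
            },
            "properties": {"place": row["place"]},
        }
        for row in rows
    ]
    return {"type": "FeatureCollection", "features": features}


# Rough, hand-entered coordinates for illustration only.
sample_rows = [
    {"place": "Cripple Creek, CO", "x": -105.18, "y": 38.75},
    {"place": "Texas", "x": -99.0, "y": 31.0},
]
collection = to_geojson(sample_rows)
print(json.dumps(collection, indent=2))
```

The coordinate order matters: swapping latitude and longitude is a common cause of points landing on the wrong continent.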

The notebook is broken up as follows:

  1. Import packages
  2. List of Wiki Lists
  3. Scrape the Wiki Entries for Titles
  4. Search Wiki Entries for Plots
  5. Search Plots with Natural Language Processor for Place Names
  6. Use Wiki Again to Validate Place Names
  7. Geocode Place Names and Load DataFrame
  8. Remove Bad Geocodes
  9. Map the West

This is a map of the West according to our definition and process.

Many of the hotspots seem to come from repeated mentions of states such as Texas, Colorado, Kansas, and California. But a lot of cities and towns made it through as well.

To start, 1,995 titles were collected. From these we found over 1,750 movie plots, which yielded about 3,750 potential place names. Checking their Wikipedia entries filtered those down to around 1,750. Finally, 1,000 or so points made it to the end to be plotted.
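
To see how much each step keeps or adds, the counts above can be compared stage by stage. This is a quick sketch using the approximate figures quoted in the post; the function name is hypothetical.

```python
# Approximate counts quoted in the post, in pipeline order.
stages = [
    ("titles collected", 1995),
    ("plots found", 1750),
    ("candidate place names", 3750),
    ("validated place names", 1750),
    ("points mapped", 1000),
]


def ratio_to_previous(stages):
    """Return (name, count, percent of the previous stage) per stage."""
    out = []
    for (_, prev), (name, count) in zip(stages, stages[1:]):
        out.append((name, count, round(100 * count / prev, 1)))
    return out


results = ratio_to_previous(stages)
for name, count, pct in results:
    print(f"{name}: {count} ({pct}% of previous stage)")
```

Roughly 88% of titles had a usable plot, each plot produced about two candidate place names on average, a little under half of those candidates passed validation, and about 57% of the validated names ended up on the map.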

Plotting the centroid, we find {36.35037924014938, -106.2693241648988}, which sits neatly in north-central New Mexico.
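
The post does not say how the centroid was calculated. The simplest approach, sketched here as an assumption, is the arithmetic mean of the latitudes and longitudes, which is reasonable for points confined to a region like the western United States that does not cross the antimeridian. The `mean_centroid` function is hypothetical and the sample points are rough values for illustration.

```python
def mean_centroid(points):
    """Return (mean latitude, mean longitude) of (lat, lon) pairs."""
    if not points:
        raise ValueError("need at least one point")
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)


# Rough, hand-entered coordinates for illustration only.
sample_points = [(38.75, -105.18), (31.0, -99.0)]
centroid = mean_centroid(sample_points)
print(centroid)
```

Because every mention counts as its own point, a state named in many plots pulls the mean toward it more strongly than a town mentioned once.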

If you ever make it onto Jeopardy and the answer “The centroid of the American West as defined by geocoding place names mentioned in the plots of Western films between 1920 and 1969” comes up, do not respond “Carson National Forest” or you will get it wrong. On Jeopardy, you need to answer in the form of a question.

The Jupyter notebook can be found here.
