Improving the whiskey distillery data set
I have previously used a data set describing the characteristics of whiskeys to draw radar plots. Here, I present how I cleaned and augmented the original data from the University of Strathclyde, resulting in an improved version of the whiskey data set.
Loading the whiskey data set
The original data set can be loaded from the web in the following way:
library(RCurl)
# load data as character
f <- getURL('https://www.datascienceblog.net/data-sets/whiskies.txt')
# read table from text connection
df <- read.csv(textConnection(f), header = TRUE)
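As a minimal sketch of what the parsing step does, the following reads a small inline sample with read.csv(text = ...), which avoids the explicit text connection. The reduced set of columns and the name sample.df are assumptions made for illustration; the two rows reuse values from the first rows of the data set.

```r
# Inline sample mimicking the file layout
# (assumed: comma-separated with a header row; columns shortened for brevity)
csv.text <- paste(
  "RowID,Distillery,Body,Postcode",
  "1,Aberfeldy,2,\tPH15 2EB",
  "2,Aberlour,3,\tAB38 9PJ",
  sep = "\n"
)

# read.csv() does not strip whitespace by default, so the tab stays in 'Postcode'
sample.df <- read.csv(text = csv.text, header = TRUE, stringsAsFactors = FALSE)
str(sample.df)
```

Because leading whitespace is kept when the file is parsed, the post codes need a separate cleaning step.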
Fixing the post codes
The post codes contain stray tab characters and spaces, as the first rows of the data show, so we strip them:
head(df)
##   RowID  Distillery Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey
## 1     1   Aberfeldy    2         2     2         0       0     2     1     2
## 2     2    Aberlour    3         3     1         0       0     4     3     2
## 3     3      AnCnoc    1         3     2         0       0     2     0     0
## 4     4      Ardbeg    4         1     4         4       0     0     2     0
## 5     5     Ardmore    2         2     2         0       0     1     1     1
## 6     6 ArranIsleOf    2         3     1         1       0     1     1     1
##   Nutty Malty Fruity Floral   Postcode Latitude Longitude
## 1     2     2      2      2 \tPH15 2EB   286580    749680
## 2     2     3      3      2 \tAB38 9PJ   326340    842570
## 3     2     2      3      2  \tAB5 5LI   352960    839320
## 4     1     2      1      0 \tPA42 7EB   141560    646220
## 5     2     3      1      1 \tAB54 4NH   355350    829140
## 6     0     1      1      2   KA27 8HJ   194050    649950
# remove every space and tab (this also drops the space inside the post code)
df$Postcode <- gsub("[ \t]+", "", df$Postcode)
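To see the effect of the substitution on a few values, here is a small self-contained check. The first and third input strings are taken from the output above; the leading space on the second one is an assumption added for illustration. Note that the substitution also removes the space inside each post code. If that space should be kept, trimws() is one option, since it only strips leading and trailing whitespace.

```r
# Sample post codes with stray leading tabs and spaces
postcodes <- c("\tPH15 2EB", " \tAB38 9PJ", "KA27 8HJ")

# Remove every space and tab, as the substitution above does
stripped <- gsub("[ \t]+", "", postcodes)

# Alternative: strip only leading/trailing whitespace, keeping the inner space
trimmed <- trimws(postcodes)

print(stripped)
print(trimmed)
```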
Annotating the locations of the distilleries
A blog post by Koki Ando gives a nice overview of how UTM data can be handled. In the following code snippet, we use the raster and sp packages to create a SpatialPoints object from the values in the Latitude and Longitude columns, which are projected grid coordinates in metres rather than degrees. Then, we set the British National Grid as the coordinate reference system by specifying "+init=epsg:27700" (see epsg.io for other reference systems). Finally, we call spTransform with WGS84 ("+init=epsg:4326"), the world geodetic system used by GPS, to obtain coordinates in degrees.
# transform UTM coordinates to longitude/latitude in degrees
geo.df <- df[, c("Latitude", "Longitude")]
colnames(geo.df) <- c("lat", "long") # switch for plotting
library(raster)
# create 'SpatialPoints' object
coordinates(geo.df) <- ~lat + long
# add coordinate reference system (CRS) for UK
proj4string(geo.df) <- CRS("+init=epsg:27700")
# transform to new coordinate system
# NB: getting rgdal working on old systems is tough due to libgdal dependency
library(rgdal)
geo.df <- spTransform(geo.df, CRS("+init=epsg:4326"))
map.df <- data.frame("Distillery" = df[, "Distillery"], geo.df)
df <- cbind(df, map.df[, c("lat", "long")])
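Since rgdal was retired from CRAN in 2023, the same transformation can also be sketched with the sf package. This is an alternative to the code above, not a reproduction of it: it assumes sf is installed, uses three rows copied from the data shown earlier, and introduces the hypothetical names sample.df, pts, pts.wgs84, and deg. It also assumes that the Latitude column holds the x values (eastings) and the Longitude column the y values (northings), which is the order in which the snippet above passes them to coordinates().

```r
library(sf)

# Three rows from the data set; the values are grid coordinates in metres
sample.df <- data.frame(
  Distillery = c("Aberfeldy", "Aberlour", "AnCnoc"),
  Latitude   = c(286580, 326340, 352960),  # assumed to be eastings (x)
  Longitude  = c(749680, 842570, 839320)   # assumed to be northings (y)
)

# Build point geometries in the British National Grid (EPSG:27700)
pts <- st_as_sf(sample.df, coords = c("Latitude", "Longitude"), crs = 27700)

# Reproject to WGS84 (EPSG:4326), i.e. degrees
pts.wgs84 <- st_transform(pts, crs = 4326)

# st_coordinates() returns a matrix with columns X (longitude) and Y (latitude)
deg <- st_coordinates(pts.wgs84)
sample.df$long <- deg[, "X"]
sample.df$lat  <- deg[, "Y"]
print(sample.df)
```

Unlike the snippet above, this sketch names the resulting columns after what they contain in degrees, so long is the first (x) coordinate and lat the second (y).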
Other annotations
To annotate the regions in which the distilleries are situated, I manually assigned regions based on a list of Scottish distilleries available on Wikipedia. I also fixed some spelling errors in the distillery names.
The improved whiskey data set
The improved whiskey data set is available here.