STAT 29000: Project 5 — Fall 2021
Motivation: awk
is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk
.
Context: This is the first project where we introduce awk
. awk
is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
Scope: awk, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
While the UNIX tools we’ve used up to this point are very useful, awk
enables many new capabilities, and can even replace major functionality of other tools.
In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using awk
.
Similarly, we’ve used head
to print the header line. Use awk
to do the same.
Similarly, we’ve used wc
to count the number of lines in the dataset. Use awk
to do the same.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
In a previous question, we used sort
in combination with uniq
to find the stores with the most number of sales.
Use awk
to find the 10 stores with the most number of sales. In a previous solution, our output was minimal — we had a count and a store number. This time, take some time to format the output nicely, and use the store number to find the count (not store name).
Sorting an array by values in |
You can even use the store number to count the number of sales and save the most recent store name for the store number as you go to print the store names with the output. |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Calculate the total number of sales (in USD) by county. Do this using any UNIX commands you have available. Then, do this using only awk
.
gsub(/\$/, "", $2) |
The |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Use awk
and piping to create a new dataset with the following columns, for every store, by month:
-
month_number
: the month number (01-12) -
year
: the year (4-digit year, e.g., 2015) -
store_name
: store name -
volume_sold
: total volume sold -
sold_usd
: total amount sold in USD
Call the new dataset sales_by_store.csv
.
Feel free to use the store name as a key for simplicity. |
Then, you can access the year using |
You can use multiple values as a key in
Here, |
Dr Ward walks you through a method of solution for this problem, in the video
|
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Use awk
to count how many times each store has sold more than $500,000 in a month. Output should be similar to the following. Sort the output from highest count to lowest.
store_name,count
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |