STAT 19000: Project 11 — Fall 2021
Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you’ve learned so far this semester to solve data-driven problems. In previous projects, we’ve directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you’d like.
Context: You’ve learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of YouTube data.
Scope: R
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/youtube/*
Questions
Question 1
In project 5, we used a loop to combine the various countries youtube datasets into a single dataset called yt
(for YouTube).
Now, we’ve provided you with code below to create such a dataset, with a subset of the countries we want to look at.
library(lubridate)
countries <- c('US', 'DE', 'CA', 'FR')
# Choose either the for loop or the sapply function for creating `yt`
# EITHER use a for loop to create the data frame `yt`
yt <- data.frame()
for (c in countries) {
filename <- paste0("/depot/datamine/data/youtube/", c, "videos.csv")
dat <- read.csv(filename)
dat$country_code <- c
yt <- rbind(yt, dat)
}
# OR use an sapply function to create the data frame `yt`
myDFlist <- lapply( countries, function(c) {
dat <- read.csv(paste0("/depot/datamine/data/youtube/", c, "videos.csv"))
dat$country_code <- c
return(dat)} )
yt <- do.call(rbind, myDFlist)
# convert columns to date formats
yt$trending_date <- ydm(yt$trending_date)
yt$publish_time <- ymd_hms(yt$publish_time)
# extract the trending_year and publish_year
yt$trending_year <- year(yt$trending_date)
yt$publish_year <- year(yt$publish_time)
Take a look at the tags
column in our yt
dataset. Create a function called count_tags
that has an argument called tag_vector
. Your count_tags
function should be the count of how many unique tags the vector, tag_vector
contains.
Take a look at the |
You can test your function with the following code.
tag_test <- yt$tags[2]
tag_test
count_tags(tag_test)
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Create a new column in your yt
dataset called n_tags
that contains the number of tags for the corresponding trending video.
Make sure to use your count_tags
function. Which YouTube trending video has the highest number of unique tags for videos that are trending either in the US or Germany (DE)? How many tags does it have?
Make sure to use the |
Begin by creating the new column |
It should be |
Relevant topics: sapply, which.max, indexing, subset
-
Code used to solve this problem.
-
Output from running the code.
-
The title of the YouTube video with the highest number of tags, and the number of tags it has.
Question 3
Is there an association between number of tags in a video and how many views it gets?
Make a scatterplot with number of views
in the x-axis and number of tags (n_tags
) in the y-axis. Based on your plot, write 1-2 sentences about whether you think number of tags and number of views are associated or not.
Hmmm, is a scatterplot a good choice to be able to see an association in this case? If so, explain why. If not, create a better plot for determining this, and explain why your plot is better, and try to explain if you see any association.
|
Relevant topics: sapply, which.max, indexing, subset
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences explaining if you think number of views and number of tags a youtube video has are associated or not, and why.
Question 4
Compare the average number of views and average number of comments that the YouTube trending videos have per trending country.
Is there a different behavior between countries? Are the comparisons fair? To check if we are being fair, take a look at how many youtube trending videos we have per country.
Relevant topics: tapply, mean
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences comparing trending countries based on average number of views and comments.
-
1-2 sentences explaining if you think we are being fair in our comparisons, and why or why not.
Question 5
How would you compare the YouTube trending videos across the different countries?
Make a comparison using plots and/or summary statistics. Explain what variables are you looking at, and why you are analyzing the data the way you are. Have fun with it!
There are no right/wrong answers here. Just dig in a little bit and see what you can find. |
-
Code used to solve this problem.
-
Output from running the code.
-
1-2 sentences explaining your logic.
-
1-2 sentences comparing the countries.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |