Drawing on the program shared by jstout.us in their article "Parsing YouTube History with Beautiful Soup," I have embarked on a journey of self-discovery through the analysis of my video viewing habits since early 2015. This is about 8 years of my personal data!
I have harnessed the power of Google Takeout to extract my YouTube watch history data, converting it into a structured format ready for exploration. Utilizing Python and libraries such as NLTK (Natural Language Toolkit) and Plotly Express, I am well-equipped to conduct a comprehensive analysis to uncover intriguing insights.
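Before any analysis, the Takeout export has to become a flat table. Below is a minimal sketch of that conversion step in the spirit of the jstout.us article; the `content-cell` class and the two-links-plus-timestamp layout are assumptions about how Takeout's `watch-history.html` was structured at the time, and the format does change.
from bs4 import BeautifulSoup
import pandas as pd
# Assumed Takeout layout: each watch entry is a 'content-cell' div holding
# a video link, a channel link, and a trailing timestamp text node.
with open('watch-history.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
rows = []
for cell in soup.find_all('div', class_='content-cell'):
    links = cell.find_all('a')
    if len(links) == 2:  # video link + channel link
        video_a, channel_a = links
        rows.append({
            'channel_id': channel_a['href'].split('/')[-1],
            'channel_title': channel_a.text,
            'video_id': video_a['href'].split('v=')[-1],
            'video_title': video_a.text,
            'watched': cell.contents[-1],  # trailing text node holds the raw timestamp
        })
pd.DataFrame(rows).to_csv('watch-history.csv')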
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.tag import pos_tag
from nltk.sentiment import SentimentIntensityAnalyzer
import plotly.express as px
# Download NLTK resources (only required once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
df = pd.read_csv('watch-history.csv')
df.head()
[nltk_data] Downloading package punkt to /Users/henrybenn/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/henrybenn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/henrybenn/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
| | Unnamed: 0 | channel_id | channel_title | video_id | video_title | watched |
|---|---|---|---|---|---|---|
| 0 | 0 | UCKONdcQCTP3aozARj1ntKhw | warpdotdev | XWQY8LgkiXM | Warp Official Demo \| Everything You Need To Kn... | 2023-08-03T01:57:17+00:00 |
| 1 | 1 | UCDybamfye5An6p-j1t2YMsg | Data With Mo | CmOAXW24y2Y | TABLEAU PORTFOLIO PROJECT \| Add this interacti... | 2023-08-03T01:16:29+00:00 |
| 2 | 2 | UCgrnOgTDK894aZBBY9uytEQ | Mr Fredericks. | 1zGaTE2AmsU | Intelligent Drum & Bass - Selected Works (1994... | 2023-08-03T00:54:29+00:00 |
| 3 | 3 | UCwV_0HmQkRrTcrReaMxPeDw | loltyler1 | kh8ry1yOmZM | TYLER1: IT’S BLUE !!! | 2023-08-02T23:12:19+00:00 |
| 4 | 4 | UCNqkzhl-L-G__QRZDALc93Q | Watashi 81 | 09ZSKE38lTU | Donna Summer - She Works Hard For The Money | 2023-08-02T20:36:25+00:00 |
Looking good so far. I will now remove any unnecessary columns.
# Removing unnecessary columns
df = df.drop(['Unnamed: 0','channel_id','video_id'],axis=1)
df.sample()
| | channel_title | video_title | watched |
|---|---|---|---|
| 2734 | funbags82 | UK Garage - DJ Deekline & MC Hyperactive - Sex... | 2022-05-02T20:53:41+00:00 |
Below I use NLTK (Natural Language Toolkit) to calculate the frequency of words across all the video titles in my YouTube watch history.
watch_history_tokens = df['video_title'].apply(word_tokenize)
# Calculate frequency distribution of words
all_words = [word.lower() for tokens in watch_history_tokens for word in tokens]
fdist1 = FreqDist(all_words)
# Most common words, for a quick peek at the distribution
most_common_words = fdist1.most_common(20)
# Part-of-speech tagging (computed for inspection; not used in the chart below)
pos_tags = [pos_tag(tokens) for tokens in watch_history_tokens]
# Convert FreqDist to a DataFrame
freq_dist_df = pd.DataFrame(list(fdist1.items()), columns=['Word', 'Frequency'])
# Sort the DataFrame by frequency in descending order
freq_dist_df.sort_values(by='Frequency', ascending=False, inplace=True)
# Create a bar chart using Plotly Express
fig = px.bar(freq_dist_df.head(30), x='Word', y='Frequency', labels={'Frequency': 'Frequency Count'})
# Set the plot title and axis labels
fig.update_layout(title_text='Top 30 Most Common Words', xaxis_title='Words', yaxis_title='Frequency')
# Show the plot
fig.show()
At first glance this looks like inconclusive data, but after further analysis it makes more sense. A significant portion of my YouTube watches are music related, and most (if not all) songs on YouTube are titled following the pattern `<artist> - <song> (remix name)`.
The `https`, `//www.youtube.com/watch`, `?`, and `:` tokens are actually all from the same type of video entry. In the data that Google Takeout gives you, if a video has since been deleted then all you get is the original YouTube link to that video. It looks like around 2,900 videos that I have watched since 2015 have been deleted!
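As a quick sanity check on that number, here is a minimal sketch that counts these entries, assuming deleted videos show up as bare watch URLs in `video_title` as described above:
# Titles that are just a raw watch URL (the pattern Takeout appears to use for deleted videos)
deleted_mask = df['video_title'].str.startswith('https://www.youtube.com/watch', na=False)
print(f'Deleted videos: {deleted_mask.sum()}')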
It is not surprising that `official` and `video` sit next to each other in this list, since a large portion of music videos carry the naming suffix `official video`.
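If I wanted a cleaner version of the frequency chart, one option (not applied above) would be to keep only alphabetic tokens before building the distribution:
# Drop URL fragments, punctuation, and other non-word tokens
clean_words = [w for w in all_words if w.isalpha()]
fdist_clean = FreqDist(clean_words)
fdist_clean.most_common(20)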
channel_freq = df.groupby('channel_title').count().sort_values(by='watched', ascending=False).head(30).drop('video_title',axis=1)
# Create a bar chart using Plotly Express
fig = px.bar(channel_freq, x='watched', y=channel_freq.index)
# Set the plot title and axis labels
fig.update_layout(title_text='Top 30 YouTube Channels', xaxis_title='Frequency', yaxis_title='Channel Name', height=800)
# Show the plot
fig.show()
It's fascinating to me that videogamedunkey is still my most-viewed channel. It has been a long time since I've thought about this creator.
I am not surprised by any of the entries here; they are a fair representation of the historic interests of my youth:
vid_freq = df.groupby('video_title').count().sort_values(by='watched', ascending=False).head(30).drop('channel_title',axis=1)
# Create a bar chart using Plotly Express
fig = px.bar(vid_freq, x='watched', y=vid_freq.index)
# Set the plot title and axis labels
fig.update_layout(title_text='Top 30 YouTube Videos', xaxis_title='Frequency', yaxis_title='Video Title', height=800)
# Show the plot
fig.show()
As expected, the most replayed videos are all songs or DJ mixes, which explains the word frequency counts above.
The only outlier here is Defcon Gabber. I can't easily explain why, but this video perfectly captures early Australian festival culture, 2007-2009 vintage. I believe I would often use this video as a talking point to explain the origins of Australian festival culture (pioneered by characters such as Zyzz) and also 'Lad' culture.
I was never much of a big-festival person myself - I would normally keep to house parties and smaller gatherings in my 'DJ era' - but festivals have been an undeniable influence on contemporary Australian culture.
# Parse the watched timestamps (ISO 8601 strings) into timezone-aware datetimes
df['watched'] = pd.to_datetime(df['watched'])
df['watched'].values
array(['2023-08-03T01:57:17.000000000', '2023-08-03T01:16:29.000000000', '2023-08-03T00:54:29.000000000', ..., '2015-07-03T03:47:02.000000000', '2015-07-03T03:36:06.000000000', '2015-07-03T01:32:17.000000000'], dtype='datetime64[ns]')
df['hour_watched'] = df['watched'].dt.tz_convert('Australia/Brisbane').dt.hour
hour_freq = df.groupby('hour_watched').count().sort_values(by='hour_watched').drop(['channel_title','video_title'],axis=1)
hour_freq = hour_freq.reset_index()
# Create a bar chart using Plotly Express
fig = px.line(hour_freq, x='hour_watched', y='watched')
# Set the plot title and axis labels
fig.update_layout(title_text='YouTube activity', xaxis_title='Hour of Day', yaxis_title='Frequency', height=800)
# Show the plot
fig.show()
From this graph I can tell that I was somewhat of a night owl during my college and late-teen years. I'm glad I rarely watched YouTube after 3am. If this graph were remade from my current habits, I believe it would look a little different.
I was watching the most YouTube from around 10am to 3pm, with a break around dinner, and then back at it again at night. I think a lot of this can be attributed to having background music on during prime productivity hours, since that is what a large portion of my YouTube content has been.
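One way to check that attribution would be to compare the hourly distribution of music-like titles against everything else. A rough sketch, assuming the `<artist> - <song>` pattern noted earlier is a reasonable proxy for music:
# Rough proxy: titles containing ' - ' likely follow the '<artist> - <song>' pattern
music_mask = df['video_title'].str.contains(' - ', na=False)
music_hours = df.loc[music_mask, 'hour_watched'].value_counts().sort_index()
other_hours = df.loc[~music_mask, 'hour_watched'].value_counts().sort_index()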
This has been an intriguing exploratory data analysis of 8 years of my YouTube watch history.
With the help of jstout.us and a handful of Python tools, I sorted through all of my video-watching records to understand which videos I watched the most, which topics caught my attention, and even how those videos made me feel.
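The vader_lexicon download at the top is what enables that last, sentiment angle; as a minimal sketch of such a pass over the titles (not shown in the analysis above):
sia = SentimentIntensityAnalyzer()
# VADER compound score per title: -1 (most negative) to +1 (most positive)
df['title_sentiment'] = df['video_title'].apply(lambda t: sia.polarity_scores(t)['compound'])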
Looking at all this data, I realized how my interests changed over time. It was like looking back at a photo album of my favorite moments, but in digital form. I could see what made me laugh, think, and feel happy through the videos I watched.
As I continue to explore YouTube in the future, I'll keep these insights in mind, using them to help me find more videos that I'll enjoy and learn from.