EDA / Statistical Analysis of Engagement for CUBRT on Instagram¶
Prerequisites¶
Instaloader was used to download all posts and their metadata from Instagram. Follower counts at 7-day intervals were back-calculated manually in Excel from the current count and the growth rate reported by Instagram, using the formula:
past_followers = Current Followers / (1 + (Percentage Growth / 100 ))
Date of Analysis: Thursday, 15th June 2023
Current Followers: 1664
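The back-calculation can be sketched in a couple of lines. The 12.3% growth figure below is illustrative (it roughly matches the 90-day change in the measured counts), not a number reported in this notebook:

```python
def past_followers(current_followers, percentage_growth):
    # Back out the follower count at the start of the growth window
    # from the current count and Instagram's reported percentage growth.
    return current_followers / (1 + percentage_growth / 100)

# e.g. 1664 followers today after ~12.3% growth over the window
print(round(past_followers(1664, 12.3)))  # → 1482
```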
The curacingteam folder contains the data downloaded by Instaloader.
!ls
EDA.html curacingteam filename.csv temp.json EDA.ipynb experiments post_analysis.md
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import glob
import lzma
from sklearn.linear_model import LinearRegression
from datetime import timedelta, datetime
Conversion to Pandas DF¶
Instaloader saves each post's metadata as an xz-compressed JSON file. Below, every file is loaded and parsed, and a subset of features is extracted.
global_df = []
def to_xlsx(file):  # despite the name, this returns a dict of selected fields
    info = {}
    with open(file) as f:
        data = json.load(f)
    try:
        info["post_type_name"] = data["node"]["__typename"]
    except KeyError:
        info["post_type_name"] = "null"
    try:
        info["post_text"] = data["node"]["edge_media_to_caption"]["edges"][0]["node"]["text"]
    except (IndexError, KeyError):
        info["post_text"] = "null"
    try:
        info["account_name"] = data["node"]["owner"]["id"]
    except KeyError:
        info["account_name"] = "null"
    try:
        info["shortcode"] = "https://www.instagram.com/p/" + data["node"]["shortcode"]
    except KeyError:
        info["shortcode"] = 0
    try:
        dt = data["node"]["taken_at_timestamp"]
        info["timestamp"] = datetime.fromtimestamp(dt)
    except KeyError:
        info["timestamp"] = 0
    try:
        info["like_count"] = data["node"]["edge_media_preview_like"]["count"]
    except KeyError:
        info["like_count"] = 0
    try:
        info["comment_count"] = data["node"]["edge_media_to_comment"]["count"]
    except KeyError:
        info["comment_count"] = 0
    try:
        info["video_view_count"] = data["node"]["video_view_count"]
    except (IndexError, KeyError):
        info["video_view_count"] = 0
    try:
        info["comments_disabled"] = data["node"]["comments_disabled"]
    except KeyError:
        info["comments_disabled"] = "null"
    # Additional fields
    try:
        info["full_name"] = data["node"]["owner"]["full_name"]
    except KeyError:
        info["full_name"] = "null"
    try:
        info["is_professional_account"] = data["node"]["owner"]["is_professional_account"]
    except KeyError:
        info["is_professional_account"] = "null"
    try:
        info["is_business_account"] = data["node"]["owner"]["is_business_account"]
    except KeyError:
        info["is_business_account"] = "null"
    try:
        info["is_verified"] = data["node"]["owner"]["is_verified"]
    except KeyError:
        info["is_verified"] = "null"
    try:
        info["is_video"] = data["node"]["is_video"]
    except KeyError:
        info["is_video"] = "null"
    try:
        info["category_name"] = data["node"]["owner"]["category_name"]
    except KeyError:
        info["category_name"] = "null"
    return info  # the dict with all selected fields
json_files = glob.glob("./curacingteam/*.json.xz") # Path To JSON FILES
amount = len(json_files)
print(f"Found {amount} posts")
Found 208 posts
for json_file in json_files:
    # Decompress each .json.xz file to a temporary plain-JSON file
    with open("./temp.json", "w") as f:
        f.write(lzma.open(json_file, mode='rt', encoding='utf-8').read())
    info = to_xlsx('./temp.json')  # main extraction function
    df = pd.DataFrame({
        "post_date": info["timestamp"],
        "account_id": info["account_name"],
        "full_name": info["full_name"],
        "text": info["post_text"],
        "post_shortcode": info["shortcode"],
        "like_count": info["like_count"],
        "comment_count": info["comment_count"],
        "is_video": info["is_video"],
        "video_view_count": info["video_view_count"],
        "comments_policy": info["comments_disabled"],
        "is_professional": info["is_professional_account"],
        "is_business": info["is_business_account"],
        "is_verified": info["is_verified"],
        "person_category": info["category_name"],
        "post_type": info["post_type_name"]
    }, index=[1])
    global_df.append(df)
final = pd.concat(global_df)
# final.to_csv("filename.csv", index=False)  # your filename
print("Converted...")
Converted...
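The round trip through `temp.json` can be avoided by decompressing in memory; `lzma.open` in text mode yields a file object that `json.load` accepts directly. A sketch (`load_post` is a hypothetical helper name, not part of the notebook above):

```python
import json
import lzma

def load_post(path):
    # Parse one Instaloader .json.xz metadata file without a temp file
    with lzma.open(path, mode="rt", encoding="utf-8") as f:
        return json.load(f)
```

With this, `to_xlsx` could take the already-parsed dict instead of re-reading a file from disk.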
final
post_date | account_id | full_name | text | post_shortcode | like_count | comment_count | is_video | video_view_count | comments_policy | is_professional | is_business | is_verified | person_category | post_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2022-11-02 21:34:27 | 4312829205 | CU Boulder Racing Team | Attending Atlanta Motorsports Park for the @ch... | https://www.instagram.com/p/CkdvA6oLIAa | 147 | 5 | False | 0 | False | True | True | False | Amateur Sports Team | GraphSidecar |
1 | 2022-01-26 08:33:50 | 4312829205 | CU Boulder Racing Team | Applications are now open for both the Enduran... | https://www.instagram.com/p/CZLXGjaLQ4p | 79 | 1 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
1 | 2021-08-13 04:20:42 | 4312829205 | CU Boulder Racing Team | “Scary as s**t but very very fun” - Ryan after... | https://www.instagram.com/p/CSfeLQVrQzJ | 65 | 0 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
1 | 2023-05-21 19:39:40 | 4312829205 | CU Boulder Racing Team | The 2023 Formula SAE Competition in Michigan i... | https://www.instagram.com/p/Csgg3OWLRIt | 902 | 16 | False | 0 | False | True | True | False | Amateur Sports Team | GraphSidecar |
1 | 2022-05-10 01:19:24 | 4312829205 | CU Boulder Racing Team | Hey folks, we're still running our crowdfundin... | https://www.instagram.com/p/CdWYDkcOA0i | 64 | 0 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1 | 2018-04-24 01:06:23 | 4312829205 | CU Boulder Racing Team | The anticipation before a race 💨\n•\n•\n•\n#cu... | https://www.instagram.com/p/Bh7Md85gSoa | 45 | 2 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
1 | 2020-12-25 10:12:40 | 4312829205 | CU Boulder Racing Team | Merry Christmas and Happy Holidays to all! We ... | https://www.instagram.com/p/CJNS09Frl_U | -1 | 3 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
1 | 2023-01-24 21:06:18 | 4312829205 | CU Boulder Racing Team | As we begin to focus our time on the manufactu... | https://www.instagram.com/p/CnzZxWuLJDL | 87 | 0 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
1 | 2023-02-27 21:28:41 | 4312829205 | CU Boulder Racing Team | Teamwork is a vital part of any group and our ... | https://www.instagram.com/p/CpK_W_prd5j | 96 | 2 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
1 | 2018-10-17 00:48:46 | 4312829205 | CU Boulder Racing Team | The close up! Also, the funding campaign is st... | https://www.instagram.com/p/BpAWW0xgp_v | 29 | 1 | False | 0 | False | True | True | False | Amateur Sports Team | GraphImage |
208 rows × 15 columns
Data Interpolation¶
Since we only have follower counts at 7-day intervals over the past 90 days, the data needs to be interpolated. A quick linear regression model is sufficient here.
# Input dictionary
dic = {0: 1664, 7: 1661, 14: 1657, 21: 1639, 28: 1589, 30: 1568, 37: 1538, 44: 1532, 51: 1534, 58: 1529, 65: 1522, 72: 1515, 79: 1511, 90: 1482, 2636: 10}
# Extract keys and values from the dictionary
keys = np.array(list(dic.keys())).reshape(-1, 1)
values = np.array(list(dic.values()))
# Create a LinearRegression model
model = LinearRegression()
# Fit the model to the data
model.fit(keys, values)
# Predict the number of followers for a new day
new_day = 0
predicted_followers = model.predict([[new_day]])
print("Predicted number of followers for day", new_day, ":", predicted_followers[0])
Predicted number of followers for day 0 : 1593.2997615182705
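Note that the single least-squares line smooths over the weekly measurements: its day-0 prediction is about 1593 versus the actual 1664. If a closer fit at the measured days were wanted, piecewise-linear interpolation with `np.interp` is one alternative sketch (reusing the `dic` defined above):

```python
import numpy as np

# `dic` as defined above: days-before-today -> follower count
dic = {0: 1664, 7: 1661, 14: 1657, 21: 1639, 28: 1589, 30: 1568,
       37: 1538, 44: 1532, 51: 1534, 58: 1529, 65: 1522, 72: 1515,
       79: 1511, 90: 1482, 2636: 10}

days_measured = np.array(sorted(dic))
followers_measured = np.array([dic[d] for d in days_measured])

def followers_on_day(day):
    # Linear interpolation between the two nearest measured points
    return float(np.interp(day, days_measured, followers_measured))

print(followers_on_day(0))  # → 1664.0 (matches the measured value exactly)
```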
df = final.copy()
# Exclude rows whose 'post_date' is an int (the missing-timestamp fallback)
df = df[~df['post_date'].apply(lambda x: isinstance(x, int))].copy()
# Then convert to datetime
df['post_date'] = pd.to_datetime(df['post_date'])
# Let's assume 'today' is the latest date in the dataframe
#today = df['post_date'].max()
today = datetime.strptime('15/06/23 15:25:00', '%d/%m/%y %H:%M:%S')
def interpolate_followers(row):
    # Days between the post and the analysis date, fed to the fitted model
    days = (today - row['post_date']).days
    return model.predict([[days]])[0]
df['followers'] = df.apply(interpolate_followers, axis=1)
# Rename the column to 'followers_at_time'
df = df.rename(columns={'followers': 'followers_at_time'})
Engagement¶
Engagement is calculated as the sum of likes, comments, and video views, divided by the follower count at the time of posting, then multiplied by 100.
# Calculate the engagement_percentage
df['engagement_percentage'] = (df['like_count'] + df['comment_count'] + df['video_view_count']) / df['followers_at_time'] * 100
processed_df = df
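As a quick sanity check, take the first row of the table above (147 likes, 5 comments, 0 video views); the follower count of 1550 is an illustrative round number, not an interpolated value from the model:

```python
# Worked engagement example with illustrative follower count
likes, comments, views = 147, 5, 0
followers_at_time = 1550  # illustrative, not model output
engagement = (likes + comments + views) / followers_at_time * 100
print(round(engagement, 2))  # → 9.81
```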
Analysis¶
By Post Type¶
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
sns.scatterplot(data=processed_df, x="post_date", y="engagement_percentage", hue="post_type", ax=axs[0])
sns.histplot(data=processed_df, x="post_type", hue="post_type", shrink=.8, alpha=.8, legend=False, ax=axs[1])
f.tight_layout()
Even though videos are our best-performing post type, we don't post them often enough.
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
sns.scatterplot(data=processed_df, x="post_date", y="like_count", hue="post_type", ax=axs[0])
sns.histplot(data=processed_df, x="post_type", hue="post_type", shrink=.8, alpha=.8, legend=False, ax=axs[1])
f.tight_layout()
There hasn't been a huge increase in likes per post. Sure, the count is climbing steadily, but we have had this account since 2017. Our posting frequency, though, has definitely improved since 2022!
There is also a big jump driven by the 2023 FSAE Competition; 2022 showed no comparable spike.
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
sns.scatterplot(data=processed_df, x="post_date", y="comment_count", hue="post_type", ax=axs[0])
sns.histplot(data=processed_df, x="post_type", hue="post_type", shrink=.8, alpha=.8, legend=False, ax=axs[1])
f.tight_layout()
We need to figure out how to get people to comment more, since comments factor directly into the engagement calculation.