+ + + + + + + + +

EDA / Statistical Analysis of Engagement for CUBRT on Instagram¶

Prerequisites¶

Instaloader was used to download all the posts along with their metadata from instagram, and followers count at intervals of 7 days were calculated manually using Excel and rate of growth from Instagram using the formula:

past_followers = Current Followers / (1 + (Percentage Growth / 100 ))
+

Date of Analysis: Thursday, 15th June 2023 +Current Followers: 1664

+ +

curacing team folder contains the data downloaded from Instaloader

+ + + + + + + +

!ls
+

+ +

EDA.html         curacingteam     filename.csv     temp.json
+EDA.ipynb        experiments      post_analysis.md
+

+ + + + + + + + +

import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn as sns
+import json
+import glob
+import lzma
+from sklearn.linear_model import LinearRegression
+from datetime import timedelta, datetime
+

+ +

Conversion to Pandas DF¶

Instaloader downloads the metadata as a JSON file and then compresses it. Here, some of the features are selected and all of the files are loaded and parsed

+ + + + + + + +

global_df = []
+
+def to_xlsx(file):
+    info = {}
+
+    with open(file) as f:
+        data = json.load(f)
+        try:
+            info["post_type_name"] = data["node"]["__typename"]
+        except KeyError:
+            info["post_type_name"] = "null"
+        try:
+            info["post_text"] = data["node"]["edge_media_to_caption"]["edges"][0]["node"]["text"]
+        except (IndexError, KeyError):
+            info["post_text"] = "null"
+        try:
+            info["account_name"] = data["node"]["owner"]["id"]
+        except KeyError:
+            info["account_name"] = "null"
+        try:
+            info["shortcode"] = "https://www.instagram.com/p/"+data["node"]["shortcode"]
+        except KeyError:
+            info["shortcode"] = 0
+        try:
+            dt = data["node"]["taken_at_timestamp"]
+            info["timestamp"] = datetime.fromtimestamp(dt)
+        except KeyError:
+            info["timestamp"] = 0
+        try:
+            info["like_count"] = data["node"]["edge_media_preview_like"]["count"]
+        except KeyError:
+            info["like_count"] = 0
+        try:
+            info["comment_count"] = data["node"]["edge_media_to_comment"]["count"]
+        except KeyError:
+            info["comment_count"] = 0
+        try:
+            info["video_view_count"] = data["node"]["video_view_count"]
+        except (IndexError, KeyError):
+            info["video_view_count"] = 0
+        try:
+            info["comments_disabled"] = data["node"]["comments_disabled"]
+        except KeyError:
+            info["comments_disabled"] = "null"
+
+        # AFEGITS
+
+        try:
+            info["full_name"] = data["node"]["owner"]["full_name"]
+        except KeyError:
+            info["full_name"] = "null"
+        try:
+            info["is_professional_account"] = data["node"]["owner"]["is_professional_account"]
+        except KeyError:
+            info["is_professional_account"] = "null"
+        try:
+            info["is_business_account"] = data["node"]["owner"]["is_business_account"]
+        except KeyError:
+            info["is_business_account"] = "null"
+        try:
+            info["is_verified"] = data["node"]["owner"]["is_verified"]
+        except KeyError:
+            info["is_verified"] = "null"
+        try:
+            info["is_video"] = data["node"]["is_video"]
+        except KeyError:
+            info["is_video"] = "null"
+        try:
+            info["category_name"] = data["node"]["owner"]["category_name"]
+        except KeyError:
+            info["category_name"] = "null"
+
+        return info # returning the dict with all data
+

+ +

+ + + + + + +

json_files = glob.glob("./curacingteam/*.json.xz") # Path To JSON FILES
+ammount = len(json_files)
+print(f"Found {ammount} posts")
+

+ +

Found 208 posts
+

+ + + + + + + + +

for json_file in json_files:
+    with open("./temp.json","w") as f:
+        f.write(lzma.open(json_file,mode='rt',encoding='utf-8').read())
+    info = to_xlsx('./temp.json')  # This is the main function
+    
+    df = pd.DataFrame({
+        "post_date": info["timestamp"],
+        "account_id": info["account_name"],
+        "full_name": info["full_name"],
+        "text": info["post_text"],
+        "post_shortcode":  info["shortcode"],
+        "like_count": info["like_count"],
+        "comment_count": info["comment_count"],
+        "is_video": info["is_video"],
+        "video_view_count": info["video_view_count"],
+        "comments_policy": info["comments_disabled"],
+        "is_professional": info["is_professional_account"],
+        "is_business": info["is_business_account"],
+        "is_verified": info["is_verified"],
+        "person_category": info["category_name"],
+        "post_type": info["post_type_name"]
+    }, index=[1])
+
+    global_df.append(df)
+
+final = pd.concat(global_df)
+#final.to_csv("filename.csv", index=False) # Your Filename
+print("Converted...")
+

+ +

Converted...
+

+ + + + + + + + +

final
+

+ +

+ + +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

	post_date	account_id	full_name	text	post_shortcode	like_count	comment_count	is_video	video_view_count	comments_policy	is_professional	is_business	is_verified	person_category	post_type
1	2022-11-02 21:34:27	4312829205	CU Boulder Racing Team	Attending Atlanta Motorsports Park for the @ch...	https://www.instagram.com/p/CkdvA6oLIAa	147	5	False	0	False	True	True	False	Amateur Sports Team	GraphSidecar
1	2022-01-26 08:33:50	4312829205	CU Boulder Racing Team	Applications are now open for both the Enduran...	https://www.instagram.com/p/CZLXGjaLQ4p	79	1	False	0	False	True	True	False	Amateur Sports Team	GraphImage
1	2021-08-13 04:20:42	4312829205	CU Boulder Racing Team	“Scary as s**t but very very fun” - Ryan after...	https://www.instagram.com/p/CSfeLQVrQzJ	65	0	False	0	False	True	True	False	Amateur Sports Team	GraphImage
1	2023-05-21 19:39:40	4312829205	CU Boulder Racing Team	The 2023 Formula SAE Competition in Michigan i...	https://www.instagram.com/p/Csgg3OWLRIt	902	16	False	0	False	True	True	False	Amateur Sports Team	GraphSidecar
1	2022-05-10 01:19:24	4312829205	CU Boulder Racing Team	Hey folks, we're still running our crowdfundin...	https://www.instagram.com/p/CdWYDkcOA0i	64	0	False	0	False	True	True	False	Amateur Sports Team	GraphImage
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1	2018-04-24 01:06:23	4312829205	CU Boulder Racing Team	The anticipation before a race 💨\n•\n•\n•\n#cu...	https://www.instagram.com/p/Bh7Md85gSoa	45	2	False	0	False	True	True	False	Amateur Sports Team	GraphImage
1	2020-12-25 10:12:40	4312829205	CU Boulder Racing Team	Merry Christmas and Happy Holidays to all! We ...	https://www.instagram.com/p/CJNS09Frl_U	-1	3	False	0	False	True	True	False	Amateur Sports Team	GraphImage
1	2023-01-24 21:06:18	4312829205	CU Boulder Racing Team	As we begin to focus our time on the manufactu...	https://www.instagram.com/p/CnzZxWuLJDL	87	0	False	0	False	True	True	False	Amateur Sports Team	GraphImage
1	2023-02-27 21:28:41	4312829205	CU Boulder Racing Team	Teamwork is a vital part of any group and our ...	https://www.instagram.com/p/CpK_W_prd5j	96	2	False	0	False	True	True	False	Amateur Sports Team	GraphImage
1	2018-10-17 00:48:46	4312829205	CU Boulder Racing Team	The close up! Also, the funding campaign is st...	https://www.instagram.com/p/BpAWW0xgp_v	29	1	False	0	False	True	True	False	Amateur Sports Team	GraphImage

208 rows × 15 columns

+ + + +

Data Interpolation¶

Since we only have the followers count for 7 day intervals for the past 90 days, we need to interpolate the data. A quick linear regression model is okay in this case

+ + + + + + + +

# Input dictionary
+dic = {0: 1664, 7: 1661, 14: 1657, 21: 1639, 28: 1589, 30: 1568, 37: 1538, 44: 1532, 51: 1534, 58: 1529, 65: 1522, 72: 1515, 79: 1511, 90: 1482, 2636: 10}
+
+# Extract keys and values from the dictionary
+keys = np.array(list(dic.keys())).reshape(-1, 1)
+values = np.array(list(dic.values()))
+
+# Create a LinearRegression model
+model = LinearRegression()
+
+# Fit the model to the data
+model.fit(keys, values)
+
+# Predict the number of followers for a new day
+new_day = 0
+predicted_followers = model.predict([[new_day]])
+
+print("Predicted number of followers for day", new_day, ":", predicted_followers[0])
+

+ +

Predicted number of followers for day 0 : 1593.2997615182705
+

+ + + + + + + + +

df = final
+
+# Exclude rows with 'post_date' of type int
+df = df[~df['post_date'].apply(lambda x: isinstance(x, int))]
+
+# Then convert to datetime
+df['post_date'] = pd.to_datetime(df['post_date'])
+
+# Let's assume 'today' is the latest date in the dataframe
+#today = df['post_date'].max()
+today = datetime.strptime('15/06/23 15:25:00', '%d/%m/%y %H:%M:%S')
+
+def magic_function(row):
+    
+    days = (today - row['post_date']).days
+    return model.predict([[days]])[0]
+
+"""# Convert the date to the number of days before today
+df['days_before_today'] = (today - df['post_date']).dt.days
+
+# Convert the dictionary to a DataFrame for interpolation
+dic_df = pd.DataFrame(list(dic.items()), columns=['days_before_today', 'followers'])
+
+# Sort the DataFrame based on 'days_before_today'
+dic_df = dic_df.sort_values('days_before_today')
+
+# Interpolate the followers for the days not present in the dictionary
+dic_df = dic_df.set_index('days_before_today').reindex(range(dic_df.days_before_today.min(), dic_df.days_before_today.max() + 1)).interpolate().reset_index()
+
+# Merge the two DataFrames on 'days_before_today'
+df = pd.merge(df, dic_df, on='days_before_today', how='left')
+
+df['followers'] = df['followers'].replace(np.nan, 1000)"""
+
+df['followers'] = df.apply(magic_function, axis=1) #df.apply(lambda row : model.predict([[(today - row['post_date']).dt.days]])[0])
+
+# Rename the column to 'followers_at_time'
+df = df.rename(columns={'followers': 'followers_at_time'})
+

+ +

Engagement¶

Engagement is calculated by taking the sum of the likes, video view counts and comments, divided by followers at that time, and then whole multiplied by 100

+ + + + + + + +

# Calculate the engagement_percentage
+df['engagement_percentage'] = (df['like_count'] + df['comment_count'] + df['video_view_count']) / df['followers_at_time'] * 100
+

+ +

+ + + + + + +

processed_df = df
+

+ +

Analysis¶

By Post Type¶

+ + + + + + +

f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
+sns.scatterplot(data=processed_df, x="post_date", y="engagement_percentage", hue="post_type", ax=axs[0])
+sns.histplot(data=processed_df, x="post_type", hue="post_type", shrink=.8, alpha=.8, legend=False, ax=axs[1])
+f.tight_layout()
+

+ +

+ + + +

Even though videos are our most well performing post type, we don't post enough

+ + + + + + + +

f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
+sns.scatterplot(data=processed_df, x="post_date", y="like_count", hue="post_type", ax=axs[0])
+sns.histplot(data=processed_df, x="post_type", hue="post_type", shrink=.8, alpha=.8, legend=False, ax=axs[1])
+f.tight_layout()
+

+ +

+ + + +

There hasn't been a huge increase in the number of likes we get per post. Sure, it is steadily increasing, but we have had this account since 2017. Although, our frequency has definitely gotten better since 2022!

There also is a big jump that happened because of the 2023 FSAE Competition. 2022 did not show such a big jump.

+ + + + + + + +

f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
+sns.scatterplot(data=processed_df, x="post_date", y="comment_count", hue="post_type", ax=axs[0])
+sns.histplot(data=processed_df, x="post_type", hue="post_type", shrink=.8, alpha=.8, legend=False, ax=axs[1])
+f.tight_layout()
+

+ +

+ + + +

We need to figure out how to get people to comment more, as this is factored in when engagement is calculated

+ + + + + + + +

+ +