author | navanchauhan <navanchauhan@gmail.com> | 2022-05-22 12:03:28 -0600 |
---|---|---|
committer | navanchauhan <navanchauhan@gmail.com> | 2022-05-22 12:03:28 -0600 |
commit | d382b50c111f2f2867a4af0176285d0cea7b591a (patch) | |
tree | afbf871b800e367d2639fef1802381e2676e29ef | |
parent | 885747d29973e3a1b04a68bc40a3e72ca0b711e7 (diff) |
added new post movie recommender
-rw-r--r-- | Content/posts/2022-05-21-Similar-Movies-Recommender.md | 400 |
-rw-r--r-- | Resources/assets/flixrec/filter.png | bin | 0 -> 242231 bytes |
-rw-r--r-- | Resources/assets/flixrec/home.png | bin | 0 -> 160255 bytes |
-rw-r--r-- | Resources/assets/flixrec/multiple.png | bin | 0 -> 251294 bytes |
-rw-r--r-- | Resources/assets/flixrec/results.png | bin | 0 -> 280362 bytes |
-rw-r--r-- | docs/feed.rss | 408 |
-rw-r--r-- | docs/index.html | 195 |
-rw-r--r-- | docs/posts/2022-05-21-Similar-Movies-Recommender.html | 438 |
-rw-r--r-- | docs/posts/index.html | 17 |
9 files changed, 1367 insertions, 91 deletions
diff --git a/Content/posts/2022-05-21-Similar-Movies-Recommender.md b/Content/posts/2022-05-21-Similar-Movies-Recommender.md new file mode 100644 index 0000000..fbc9fdb --- /dev/null +++ b/Content/posts/2022-05-21-Similar-Movies-Recommender.md @@ -0,0 +1,400 @@ +--- +date: 2022-05-21 17:56 +description: Building a Content Based Similar Movies Recommender System +tags: Python, Transformers, Movies, Recommender-System +--- + +# Building a Simple Similar Movies Recommender System + +## Why? + +I recently came across a movie/tv-show recommender, [couchmoney.tv](https://couchmoney.tv/). I loved it. I decided that I wanted to build something similar, so I could tinker with it as much as I wanted. + +I also wanted a recommendation system I could use via a REST API. Although I have not included that part in this post, I did eventually create it. + + +## How? + +By measuring the cosine of the angle between two vectors, you can get a value in the range [-1, 1], where 1 means the vectors point in the same direction and 0 means no similarity. Now, if we find a way to represent information about movies as a vector, we can use cosine similarity as a metric to find similar movies. + +As we are recommending just based on the content of the movies, this is called a content-based recommendation system. + +## Data Collection + +Trakt exposes a nice API to search for movies/tv-shows. To access the API, you first need to get an API key (the Trakt ID you get when you create a new application). + +I decided to use SQLAlchemy with a SQLite backend just to make my life easier if I ever decided to switch to Postgres. + +First, I needed to check the total number of records in Trakt’s database. 
+ +```python +import requests +import os + +trakt_id = os.getenv("TRAKT_ID") + +api_base = "https://api.trakt.tv" + +headers = { + "Content-Type": "application/json", + "trakt-api-version": "2", + "trakt-api-key": trakt_id +} + +params = { + "query": "", + "years": "1900-2021", + "page": "1", + "extended": "full", + "languages": "en" +} + +res = requests.get(f"{api_base}/search/movie",headers=headers,params=params) +total_items = res.headers["x-pagination-item-count"] +print(f"There are {total_items} movies") +``` + +``` +There are 333946 movies +``` + +First, I needed to declare the database schema in (`database.py`): + +```python +import sqlalchemy +from sqlalchemy import create_engine +from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey, PickleType +from sqlalchemy import insert +from sqlalchemy.orm import sessionmaker +from sqlalchemy.exc import IntegrityError + +meta = MetaData() + +movies_table = Table( + "movies", + meta, + Column("trakt_id", Integer, primary_key=True, autoincrement=False), + Column("title", String), + Column("overview", String), + Column("genres", String), + Column("year", Integer), + Column("released", String), + Column("runtime", Integer), + Column("country", String), + Column("language", String), + Column("rating", Integer), + Column("votes", Integer), + Column("comment_count", Integer), + Column("tagline", String), + Column("embeddings", PickleType) + +) + +# Helper function to connect to the db +def init_db_stuff(database_url: str): + engine = create_engine(database_url) + meta.create_all(engine) + Session = sessionmaker(bind=engine) + return engine, Session +``` + +In the end, I could have dropped the embeddings field from the table schema as I never got around to using it. 
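With the record count from the first request in hand, the page size for a fixed request budget is simple arithmetic. The sketch below is only an illustration, not part of the committed script: `page_size` and `pages_needed` are hypothetical helper names. It uses the 333946 records reported above and a budget of 5000 requests, and shows that rounding the page size down covers fewer records than exist, while rounding up covers all of them.

```python
import math

def page_size(total_items: int, max_requests: int, round_up: bool = True) -> int:
    """Items per page so that max_requests pages span total_items (hypothetical helper)."""
    if round_up:
        return math.ceil(total_items / max_requests)
    return total_items // max_requests

def pages_needed(total_items: int, limit: int) -> int:
    """Number of pages required to fetch every item at the given page size."""
    return math.ceil(total_items / limit)

total_items = 333946   # count reported by the x-pagination-item-count header above
max_requests = 5000    # request budget

floor_limit = page_size(total_items, max_requests, round_up=False)
ceil_limit = page_size(total_items, max_requests)

# Rounding down: 5000 pages of floor_limit items each
print(floor_limit, floor_limit * max_requests)
# Rounding up: page size and the number of pages actually needed
print(ceil_limit, pages_needed(total_items, ceil_limit))
```

Rounding up keeps the request count within the budget while still reaching the last record; whether the API accepts that page size is an assumption to check against its documentation.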
+ +### Scripting Time + +```python +from database import * +from tqdm import tqdm +import requests +import os + +trakt_id = os.getenv("TRAKT_ID") + +max_requests = 5000 # How many requests I wanted to wrap everything up in +req_count = 0 # A counter for how many requests I have made + +years = "1900-2021" +page = 1 # The initial page number for the search +extended = "full" # Required to get additional information +limit = "10" # Number of entries per request -- This will be automatically picked based on max_requests +languages = "en" # Limit to English + +api_base = "https://api.trakt.tv" +database_url = "sqlite:///jlm.db" + +headers = { + "Content-Type": "application/json", + "trakt-api-version": "2", + "trakt-api-key": trakt_id +} + +params = { + "query": "", + "years": years, + "page": page, + "extended": extended, + "limit": limit, + "languages": languages +} + +# Helper function to get desirable values from the response +def create_movie_dict(movie: dict): + m = movie["movie"] + movie_dict = { + "title": m["title"], + "overview": m["overview"], + "genres": m["genres"], + "language": m["language"], + "year": int(m["year"]), + "trakt_id": m["ids"]["trakt"], + "released": m["released"], + "runtime": int(m["runtime"]), + "country": m["country"], + "rating": int(m["rating"]), + "votes": int(m["votes"]), + "comment_count": int(m["comment_count"]), + "tagline": m["tagline"] + } + return movie_dict + +# Get total number of items +params["limit"] = 1 +res = requests.get(f"{api_base}/search/movie",headers=headers,params=params) +total_items = res.headers["x-pagination-item-count"] + +engine, Session = init_db_stuff(database_url) + + +for page in tqdm(range(1,max_requests+1)): + params["page"] = page + params["limit"] = int(int(total_items)/max_requests) + movies = [] + res = requests.get(f"{api_base}/search/movie",headers=headers,params=params) + + if res.status_code == 500: + break + elif res.status_code == 200: + pass + else: + print(f"OwO Code {res.status_code}") + + 
for movie in res.json(): + movies.append(create_movie_dict(movie)) + + with engine.connect() as conn: + for movie in movies: + with conn.begin() as trans: + stmt = insert(movies_table).values( + trakt_id=movie["trakt_id"], title=movie["title"], genres=" ".join(movie["genres"]), + language=movie["language"], year=movie["year"], released=movie["released"], + runtime=movie["runtime"], country=movie["country"], overview=movie["overview"], + rating=movie["rating"], votes=movie["votes"], comment_count=movie["comment_count"], + tagline=movie["tagline"]) + try: + result = conn.execute(stmt) + trans.commit() + except IntegrityError: + trans.rollback() + req_count += 1 +``` + +(Note: I was well within the rate-limit so I did not have to slow down or implement any other measures) + +Running this script took me approximately 3 hours, and resulted in an SQLite database of 141.5 MB + +## Embeddings! + +I did not want to put my poor Mac through the estimated 23 hours it would have taken to embed the sentences. I decided to use Google Colab instead. + +Because of the small size of the database file, I was able to just upload the file. + +For the encoding model, I decided to use the pretrained `paraphrase-multilingual-MiniLM-L12-v2` model for SentenceTransformers, a Python framework for SOTA sentence, text and image embeddings. I wanted to use a multilingual model as I personally consume content in various languages (natively, no dubs or subs) and some of the sources for their information do not translate to English. As of writing this post, I did not include any other database except Trakt. 
+ +While deciding how I was going to process the embeddings, I came across multiple solutions: + +* [Milvus](https://milvus.io) - An open-source vector database with similarity search functionality + +* [FAISS](https://faiss.ai) - A library for efficient similarity search + +* [Pinecone](https://pinecone.io) - A fully managed vector database with similarity search functionality + +I did not want to waste time setting up the first two, so I decided to go with Pinecone which offers 1M 768-dim vectors for free with no credit card required (Our embeddings are 384-dim dense). + +Getting started with Pinecone was as easy as: + +* Signing up + +* Specifying the index name and vector dimensions along with the similarity search metric (Cosine Similarity for our use case) + +* Getting the API key + +* Installing the Python module (pinecone-client) + +```python +import pandas as pd +import pinecone +from sentence_transformers import SentenceTransformer +from tqdm import tqdm + +from database import init_db_stuff # helper defined in database.py above + +database_url = "sqlite:///jlm.db" +PINECONE_KEY = "not-this-at-all" +batch_size = 32 + +pinecone.init(api_key=PINECONE_KEY, environment="us-west1-gcp") +index = pinecone.Index("movies") + +model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", device="cuda") +engine, Session = init_db_stuff(database_url) + +df = pd.read_sql("Select * from movies", engine) +df["combined_text"] = df["title"] + ": " + df["overview"].fillna('') + " - " + df["tagline"].fillna('') + " Genres:- " + df["genres"].fillna('') + +# Creating the embedding and inserting it into the database +for x in tqdm(range(0,len(df),batch_size)): + to_send = [] + trakt_ids = df["trakt_id"][x:x+batch_size].tolist() + sentences = df["combined_text"][x:x+batch_size].tolist() + embeddings = model.encode(sentences) + for idx, value in enumerate(trakt_ids): + to_send.append( + ( + str(value), embeddings[idx].tolist() + )) + index.upsert(to_send) +``` + +That's it! 
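To make the cosine similarity metric chosen for the index concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not how Pinecone implements its search: the 3-dimensional toy vectors stand in for the 384-dimensional embeddings, and `cosine_similarity`, `top_k` and the vector IDs are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Brute-force ranking of stored vectors by similarity to the query."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for movie embeddings (made-up values, illustration only)
toy_vectors = {
    "heist": [1.0, 1.0, 0.0],
    "magic": [1.0, 0.0, 0.0],
    "romance": [0.0, 0.0, 1.0],
    "opposite": [-1.0, -1.0, 0.0],
}

query = [1.0, 1.0, 0.0]
for vec_id, score in top_k(query, toy_vectors, k=4):
    print(f"{vec_id}: {score:.3f}")
```

A vector pointing the same way as the query scores close to 1, an orthogonal one scores 0, and one pointing the opposite way scores close to -1. A vector index answers the same question without comparing the query against every stored vector.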
+ +## Interacting with Vectors + +We use the `trakt_id` for the movie as the ID for the vectors and upsert it into the index. + +To find similar items, we will first have to map the name of the movie to its `trakt_id`, get the embeddings we have for that id and then perform a similarity search. It is possible that this additional step of mapping could be avoided by storing information as metadata in the index. + +```python +def get_trakt_id(df, title: str): + rec = df[df["title"].str.lower()==title.lower()] + if len(rec.trakt_id.values.tolist()) > 1: + print(f"multiple values found... {len(rec.trakt_id.values)}") + for x in range(len(rec)): + print(f"[{x}] {rec['title'].tolist()[x]} ({rec['year'].tolist()[x]}) - {rec['overview'].tolist()[x]}") + print("===") + z = int(input("Choose No: ")) + return rec.trakt_id.values[z] + return rec.trakt_id.values[0] + +def get_vector_value(trakt_id: int): + fetch_response = index.fetch(ids=[str(trakt_id)]) + return fetch_response["vectors"][str(trakt_id)]["values"] + +def query_vectors(vector: list, top_k: int = 20, include_values: bool = False, include_metadata: bool = True): + query_response = index.query( + queries=[ + (vector), + ], + top_k=top_k, + include_values=include_values, + include_metadata=include_metadata + ) + return query_response + +def query2ids(query_response): + trakt_ids = [] + for match in query_response["results"][0]["matches"]: + trakt_ids.append(int(match["id"])) + return trakt_ids + +def get_deets_by_trakt_id(df, trakt_id: int): + df = df[df["trakt_id"]==trakt_id] + return { + "title": df.title.values[0], + "overview": df.overview.values[0], + "runtime": df.runtime.values[0], + "year": df.year.values[0] + } +``` + +### Testing it Out + +```python +movie_name = "Now You See Me" + +movie_trakt_id = get_trakt_id(df, movie_name) +print(movie_trakt_id) +movie_vector = get_vector_value(movie_trakt_id) +movie_queries = query_vectors(movie_vector) +movie_ids = query2ids(movie_queries) +print(movie_ids) + +for 
trakt_id in movie_ids: + deets = get_deets_by_trakt_id(df, trakt_id) + print(f"{deets['title']} ({deets['year']}): {deets['overview']}") +``` + +Output: + +``` +55786 +[55786, 18374, 299592, 662622, 6054, 227458, 139687, 303950, 70000, 129307, 70823, 5766, 23950, 137696, 655723, 32842, 413269, 145994, 197990, 373832] +Now You See Me (2013): An FBI agent and an Interpol detective track a team of illusionists who pull off bank heists during their performances and reward their audiences with the money. +Trapped (1949): U.S. Treasury Department agents go after a ring of counterfeiters. +Brute Sanity (2018): An FBI-trained neuropsychologist teams up with a thief to find a reality-altering device while her insane ex-boss unleashes bizarre traps to stop her. +The Chase (2017): Some FBI agents hunt down a criminal +Surveillance (2008): An FBI agent tracks a serial killer with the help of three of his would-be victims - all of whom have wildly different stories to tell. +Marauders (2016): An untraceable group of elite bank robbers is chased by a suicidal FBI agent who uncovers a deeper purpose behind the robbery-homicides. +Miracles for Sale (1939): A maker of illusions for magicians protects an ingenue likely to be murdered. +Deceptors (2005): A Ghostbusters knock-off where a group of con-artists create bogus monsters to scare up some cash. They run for their lives when real spooks attack. +The Outfit (1993): A renegade FBI agent sparks an explosive mob war between gangster crime lords Legs Diamond and Dutch Schultz. +Bank Alarm (1937): A federal agent learns the gangsters he's been investigating have kidnapped his sister. +The Courier (2012): A shady FBI agent recruits a courier to deliver a mysterious package to a vengeful master criminal who has recently resurfaced with a diabolical plan. +After the Sunset (2004): An FBI agent is suspicious of two master thieves, quietly enjoying their retirement near what may - or may not - be the biggest score of their careers. 
+Down Three Dark Streets (1954): An FBI Agent takes on the three unrelated cases of a dead agent to track down his killer. +The Executioner (1970): A British intelligence agent must track down a fellow spy suspected of being a double agent. +Ace of Cactus Range (1924): A Secret Service agent goes undercover to unmask the leader of a gang of diamond thieves. +Firepower (1979): A mercenary is hired by the FBI to track down a powerful recluse criminal, a woman is also trying to track him down for her own personal vendetta. +Heroes & Villains (2018): an FBI agent chases a thug to great tunes +Federal Fugitives (1941): A government agent goes undercover in order to apprehend a saboteur who caused a plane crash. +Hell on Earth (2012): An FBI Agent on the trail of a group of drug traffickers learns that their corruption runs deeper than she ever imagined, and finds herself in a supernatural - and deadly - situation. +Spies (2015): A secret agent must perform a heist without time on his side +``` + +For now, I am happy with the recommendations. + +## Simple UI + +The code for the Flask app can be found on GitHub: [navanchauhan/FlixRec](https://github.com/navanchauhan/FlixRec) or on my [Gitea instance](https://pi4.navan.dev/gitea/navan/FlixRec) + +I quickly whipped up a simple Flask app to deal with multiple movies sharing the same title and with typos in the search query. + +### Home Page + +![Home Page](/assets/flixrec/home.png) + +### Handling Multiple Movies with Same Title + +![Multiple Movies with Same Title](/assets/flixrec/multiple.png) + +### Results Page + +![Results Page](/assets/flixrec/results.png) + +Includes additional filter options + +![Advanced Filtering Options](/assets/flixrec/filter.png) + +Test it out at [https://flixrec.navan.dev](https://flixrec.navan.dev) + +## Current Limitations + +* Does not work well with popular franchises +* No Genre Filter + +## Future Add-ons + +* Include Cast Data + * e.g. 
If it sees a movie with Tom Hanks and Meg Ryan, then it will boost similar movies including them + * e.g. If it sees the movie has been directed by McG, then it will boost similar movies directed by them +* REST API +* TV Shows +* Multilingual database +* Filter based on popularity: The data already exists in the indexed database
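
The popularity filter from the list above could be sketched as a post-processing step over the recommended rows. This is a hypothetical sketch, not code from the FlixRec app: `filter_by_popularity` and its thresholds are assumptions, and the sample rows are made up. It relies only on the `votes` and `rating` columns that the `movies` table already stores.

```python
def filter_by_popularity(matches, min_votes=100, min_rating=0):
    """Keep only rows whose vote count and rating meet the (hypothetical) thresholds."""
    return [
        m for m in matches
        if m["votes"] >= min_votes and m["rating"] >= min_rating
    ]

# Made-up rows shaped like records from the movies table (illustration only)
sample_matches = [
    {"title": "Movie A", "votes": 5000, "rating": 7},
    {"title": "Movie B", "votes": 12, "rating": 8},
    {"title": "Movie C", "votes": 300, "rating": 4},
]

popular = filter_by_popularity(sample_matches, min_votes=100, min_rating=5)
print([m["title"] for m in popular])
```

Filtering after the similarity search keeps the index untouched, at the cost of sometimes returning fewer results than requested.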
\ No newline at end of file diff --git a/Resources/assets/flixrec/filter.png b/Resources/assets/flixrec/filter.png Binary files differnew file mode 100644 index 0000000..c1e4c52 --- /dev/null +++ b/Resources/assets/flixrec/filter.png diff --git a/Resources/assets/flixrec/home.png b/Resources/assets/flixrec/home.png Binary files differnew file mode 100644 index 0000000..2d6fb51 --- /dev/null +++ b/Resources/assets/flixrec/home.png diff --git a/Resources/assets/flixrec/multiple.png b/Resources/assets/flixrec/multiple.png Binary files differnew file mode 100644 index 0000000..f35d342 --- /dev/null +++ b/Resources/assets/flixrec/multiple.png diff --git a/Resources/assets/flixrec/results.png b/Resources/assets/flixrec/results.png Binary files differnew file mode 100644 index 0000000..a239ba4 --- /dev/null +++ b/Resources/assets/flixrec/results.png diff --git a/docs/feed.rss b/docs/feed.rss index 2b53f53..3f65a70 100644 --- a/docs/feed.rss +++ b/docs/feed.rss @@ -4,8 +4,8 @@ <title>Navan's Archive</title> <description>Rare Tips, Tricks and Posts</description> <link>https://web.navan.dev/</link><language>en</language> - <lastBuildDate>Sun, 07 Nov 2021 17:42:49 -0000</lastBuildDate> - <pubDate>Sun, 07 Nov 2021 17:42:49 -0000</pubDate> + <lastBuildDate>Sun, 22 May 2022 11:59:10 -0000</lastBuildDate> + <pubDate>Sun, 22 May 2022 11:59:10 -0000</pubDate> <ttl>250</ttl> <atom:link href="https://web.navan.dev/feed.rss" rel="self" type="application/rss+xml"/> @@ -567,6 +567,410 @@ export BABEL_LIBDIR="/usr/lib/openbabel/3.1.0" <item> <guid isPermaLink="true"> + https://web.navan.dev/posts/2022-05-21-Similar-Movies-Recommender.html + </guid> + <title> + Building a Simple Similar Movies Recommender System + </title> + <description> + Building a Content Based Similar Movies Recommender System + </description> + <link>https://web.navan.dev/posts/2022-05-21-Similar-Movies-Recommender.html</link> + <pubDate>Sat, 21 May 2022 17:56:00 -0000</pubDate> + 
<content:encoded><![CDATA[<h1>Building a Simple Similar Movies Recommender System</h1> + +<h2>Why?</h2> + +<p>I recently came across a movie/tv-show recommender, <a rel="noopener" target="_blank" href="https://couchmoney.tv/">couchmoney.tv</a>. I loved it. I decided that I wanted to build something similar, so I could tinker with it as much as I wanted.</p> + +<p>I also wanted a recommendation system I could use via a REST API. Although I have not included that part in this post, I did eventually create it.</p> + +<h2>How?</h2> + +<p>By measuring the cosine of the angle between two vectors, you can get a value in the range [-1, 1], where 1 means the vectors point in the same direction and 0 means no similarity. Now, if we find a way to represent information about movies as a vector, we can use cosine similarity as a metric to find similar movies.</p> + +<p>As we are recommending just based on the content of the movies, this is called a content-based recommendation system.</p> + +<h2>Data Collection</h2> + +<p>Trakt exposes a nice API to search for movies/tv-shows. To access the API, you first need to get an API key (the Trakt ID you get when you create a new application). </p> + +<p>I decided to use SQLAlchemy with a SQLite backend just to make my life easier if I ever decided to switch to Postgres. 
</p> + +<p>First, I needed to check the total number of records in Trakt’s database.</p> + +<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span> +<span class="kn">import</span> <span class="nn">os</span> + +<span class="n">trakt_id</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"TRAKT_ID"</span><span class="p">)</span> + +<span class="n">api_base</span> <span class="o">=</span> <span class="s2">"https://api.trakt.tv"</span> + +<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"Content-Type"</span><span class="p">:</span> <span class="s2">"application/json"</span><span class="p">,</span> + <span class="s2">"trakt-api-version"</span><span class="p">:</span> <span class="s2">"2"</span><span class="p">,</span> + <span class="s2">"trakt-api-key"</span><span class="p">:</span> <span class="n">trakt_id</span> +<span class="p">}</span> + +<span class="n">params</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"query"</span><span class="p">:</span> <span class="s2">""</span><span class="p">,</span> + <span class="s2">"years"</span><span class="p">:</span> <span class="s2">"1900-2021"</span><span class="p">,</span> + <span class="s2">"page"</span><span class="p">:</span> <span class="s2">"1"</span><span class="p">,</span> + <span class="s2">"extended"</span><span class="p">:</span> <span class="s2">"full"</span><span class="p">,</span> + <span class="s2">"languages"</span><span class="p">:</span> <span class="s2">"en"</span> +<span class="p">}</span> + +<span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">api_base</span><span 
class="si">}</span><span class="s2">/search/movie"</span><span class="p">,</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span><span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span> +<span class="n">total_items</span> <span class="o">=</span> <span class="n">res</span><span class="o">.</span><span class="n">headers</span><span class="p">[</span><span class="s2">"x-pagination-item-count"</span><span class="p">]</span> +<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"There are </span><span class="si">{</span><span class="n">total_items</span><span class="si">}</span><span class="s2"> movies"</span><span class="p">)</span> +</code></pre></div> + +<pre><code>There are 333946 movies +</code></pre> + +<p>First, I needed to declare the database schema in (<code>database.py</code>):</p> + +<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">sqlalchemy</span> +<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span> +<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">Table</span><span class="p">,</span> <span class="n">Column</span><span class="p">,</span> <span class="n">Integer</span><span class="p">,</span> <span class="n">String</span><span class="p">,</span> <span class="n">MetaData</span><span class="p">,</span> <span class="n">ForeignKey</span><span class="p">,</span> <span class="n">PickleType</span> +<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">insert</span> +<span class="kn">from</span> <span class="nn">sqlalchemy.orm</span> <span class="kn">import</span> <span class="n">sessionmaker</span> +<span class="kn">from</span> <span class="nn">sqlalchemy.exc</span> <span 
class="kn">import</span> <span class="n">IntegrityError</span> + +<span class="n">meta</span> <span class="o">=</span> <span class="n">MetaData</span><span class="p">()</span> + +<span class="n">movies_table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">(</span> + <span class="s2">"movies"</span><span class="p">,</span> + <span class="n">meta</span><span class="p">,</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"trakt_id"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">autoincrement</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"title"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"overview"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"genres"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"released"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"runtime"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"country"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span 
class="n">Column</span><span class="p">(</span><span class="s2">"language"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"rating"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"votes"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"comment_count"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"tagline"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"embeddings"</span><span class="p">,</span> <span class="n">PickleType</span><span class="p">)</span> + +<span class="p">)</span> + +<span class="c1"># Helper function to connect to the db</span> +<span class="k">def</span> <span class="nf">init_db_stuff</span><span class="p">(</span><span class="n">database_url</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span> + <span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="n">database_url</span><span class="p">)</span> + <span class="n">meta</span><span class="o">.</span><span class="n">create_all</span><span class="p">(</span><span class="n">engine</span><span class="p">)</span> + <span class="n">Session</span> <span class="o">=</span> <span class="n">sessionmaker</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span> + <span class="k">return</span> <span class="n">engine</span><span class="p">,</span> <span class="n">Session</span> 
+</code></pre></div> + +<p>In the end, I could have dropped the embeddings field from the table schema as I never got around to using it.</p> + +<h3>Scripting Time</h3> + +<div class="codehilite"><pre><span></span><code><span class="kn">from</span> <span class="nn">database</span> <span class="kn">import</span> <span class="o">*</span> +<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> +<span class="kn">import</span> <span class="nn">requests</span> +<span class="kn">import</span> <span class="nn">os</span> + +<span class="n">trakt_id</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"TRAKT_ID"</span><span class="p">)</span> + +<span class="n">max_requests</span> <span class="o">=</span> <span class="mi">5000</span> <span class="c1"># How many requests I wanted to wrap everything up in</span> +<span class="n">req_count</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># A counter for how many requests I have made</span> + +<span class="n">years</span> <span class="o">=</span> <span class="s2">"1900-2021"</span> +<span class="n">page</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># The initial page number for the search</span> +<span class="n">extended</span> <span class="o">=</span> <span class="s2">"full"</span> <span class="c1"># Required to get additional information </span> +<span class="n">limit</span> <span class="o">=</span> <span class="s2">"10"</span> <span class="c1"># No of entires per request -- This will be automatically picked based on max_requests</span> +<span class="n">languages</span> <span class="o">=</span> <span class="s2">"en"</span> <span class="c1"># Limit to English</span> + +<span class="n">api_base</span> <span class="o">=</span> <span class="s2">"https://api.trakt.tv"</span> +<span class="n">database_url</span> <span 
class="o">=</span> <span class="s2">"sqlite:///jlm.db"</span> + +<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"Content-Type"</span><span class="p">:</span> <span class="s2">"application/json"</span><span class="p">,</span> + <span class="s2">"trakt-api-version"</span><span class="p">:</span> <span class="s2">"2"</span><span class="p">,</span> + <span class="s2">"trakt-api-key"</span><span class="p">:</span> <span class="n">trakt_id</span> +<span class="p">}</span> + +<span class="n">params</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"query"</span><span class="p">:</span> <span class="s2">""</span><span class="p">,</span> + <span class="s2">"years"</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span> + <span class="s2">"page"</span><span class="p">:</span> <span class="n">page</span><span class="p">,</span> + <span class="s2">"extended"</span><span class="p">:</span> <span class="n">extended</span><span class="p">,</span> + <span class="s2">"limit"</span><span class="p">:</span> <span class="n">limit</span><span class="p">,</span> + <span class="s2">"languages"</span><span class="p">:</span> <span class="n">languages</span> +<span class="p">}</span> + +<span class="c1"># Helper function to get desirable values from the response</span> +<span class="k">def</span> <span class="nf">create_movie_dict</span><span class="p">(</span><span class="n">movie</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span> + <span class="n">m</span> <span class="o">=</span> <span class="n">movie</span><span class="p">[</span><span class="s2">"movie"</span><span class="p">]</span> + <span class="n">movie_dict</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"title"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"title"</span><span class="p">],</span> + <span 
class="s2">"overview"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"overview"</span><span class="p">],</span> + <span class="s2">"genres"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"genres"</span><span class="p">],</span> + <span class="s2">"language"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"language"</span><span class="p">],</span> + <span class="s2">"year"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"year"</span><span class="p">]),</span> + <span class="s2">"trakt_id"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"ids"</span><span class="p">][</span><span class="s2">"trakt"</span><span class="p">],</span> + <span class="s2">"released"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"released"</span><span class="p">],</span> + <span class="s2">"runtime"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"runtime"</span><span class="p">]),</span> + <span class="s2">"country"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"country"</span><span class="p">],</span> + <span class="s2">"rating"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"rating"</span><span class="p">]),</span> + <span class="s2">"votes"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"votes"</span><span class="p">]),</span> + <span class="s2">"comment_count"</span><span class="p">:</span> <span 
class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"comment_count"</span><span class="p">]),</span> + <span class="s2">"tagline"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"tagline"</span><span class="p">]</span> + <span class="p">}</span> + <span class="k">return</span> <span class="n">movie_dict</span> + +<span class="c1"># Get total number of items</span> +<span class="n">params</span><span class="p">[</span><span class="s2">"limit"</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span> +<span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">api_base</span><span class="si">}</span><span class="s2">/search/movie"</span><span class="p">,</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span><span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span> +<span class="n">total_items</span> <span class="o">=</span> <span class="n">res</span><span class="o">.</span><span class="n">headers</span><span class="p">[</span><span class="s2">"x-pagination-item-count"</span><span class="p">]</span> + +<span class="n">engine</span><span class="p">,</span> <span class="n">Session</span> <span class="o">=</span> <span class="n">init_db_stuff</span><span class="p">(</span><span class="n">database_url</span><span class="p">)</span> + + +<span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">max_requests</span><span class="o">+</span><span 
class="mi">1</span><span class="p">)):</span> + <span class="n">params</span><span class="p">[</span><span class="s2">"page"</span><span class="p">]</span> <span class="o">=</span> <span class="n">page</span> + <span class="n">params</span><span class="p">[</span><span class="s2">"limit"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">total_items</span><span class="p">)</span><span class="o">/</span><span class="n">max_requests</span><span class="p">)</span> + <span class="n">movies</span> <span class="o">=</span> <span class="p">[]</span> + <span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">api_base</span><span class="si">}</span><span class="s2">/search/movie"</span><span class="p">,</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span><span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span> + + <span class="k">if</span> <span class="n">res</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">500</span><span class="p">:</span> + <span class="k">break</span> + <span class="k">elif</span> <span class="n">res</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span> + <span class="kc">None</span> + <span class="k">else</span><span class="p">:</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"OwO Code </span><span class="si">{</span><span class="n">res</span><span class="o">.</span><span class="n">status_code</span><span class="si">}</span><span 
class="s2">"</span><span class="p">)</span> + + <span class="k">for</span> <span class="n">movie</span> <span class="ow">in</span> <span class="n">res</span><span class="o">.</span><span class="n">json</span><span class="p">():</span> + <span class="n">movies</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">create_movie_dict</span><span class="p">(</span><span class="n">movie</span><span class="p">))</span> + + <span class="k">with</span> <span class="n">engine</span><span class="o">.</span><span class="n">connect</span><span class="p">()</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span> + <span class="k">for</span> <span class="n">movie</span> <span class="ow">in</span> <span class="n">movies</span><span class="p">:</span> + <span class="k">with</span> <span class="n">conn</span><span class="o">.</span><span class="n">begin</span><span class="p">()</span> <span class="k">as</span> <span class="n">trans</span><span class="p">:</span> + <span class="n">stmt</span> <span class="o">=</span> <span class="n">insert</span><span class="p">(</span><span class="n">movies_table</span><span class="p">)</span><span class="o">.</span><span class="n">values</span><span class="p">(</span> + <span class="n">trakt_id</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"trakt_id"</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"title"</span><span class="p">],</span> <span class="n">genres</span><span class="o">=</span><span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">movie</span><span class="p">[</span><span class="s2">"genres"</span><span class="p">]),</span> + <span class="n">language</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span 
class="s2">"language"</span><span class="p">],</span> <span class="n">year</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"year"</span><span class="p">],</span> <span class="n">released</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"released"</span><span class="p">],</span> + <span class="n">runtime</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"runtime"</span><span class="p">],</span> <span class="n">country</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"country"</span><span class="p">],</span> <span class="n">overview</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"overview"</span><span class="p">],</span> + <span class="n">rating</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"rating"</span><span class="p">],</span> <span class="n">votes</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"votes"</span><span class="p">],</span> <span class="n">comment_count</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"comment_count"</span><span class="p">],</span> + <span class="n">tagline</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"tagline"</span><span class="p">])</span> + <span class="k">try</span><span class="p">:</span> + <span class="n">result</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">stmt</span><span class="p">)</span> + <span class="n">trans</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span> + <span class="k">except</span> <span class="n">IntegrityError</span><span 
class="p">:</span> + <span class="n">trans</span><span class="o">.</span><span class="n">rollback</span><span class="p">()</span> + <span class="n">req_count</span> <span class="o">+=</span> <span class="mi">1</span> +</code></pre></div> + +<p>(Note: I was well within the rate-limit so I did not have to slow down or implement any other measures)</p> + +<p>Running this script took me approximately 3 hours, and resulted in an SQLite database of 141.5 MB</p> + +<h2>Embeddings!</h2> + +<p>I did not want to put my poor Mac through the estimated 23 hours it would have taken to embed the sentences. I decided to use Google Colab instead.</p> + +<p>Because of the small size of the database file, I was able to just upload the file.</p> + +<p>For the encoding model, I decided to use the pretrained <code>paraphrase-multilingual-MiniLM-L12-v2</code> model for SentenceTransformers, a Python framework for SOTA sentence, text and image embeddings. I wanted to use a multilingual model as I personally consume content in various languages (natively, no dubs or subs) and some of the sources for their information do not translate to English. As of writing this post, I did not include any other database except Trakt. 
</p> + +<p>While deciding how I was going to process the embeddings, I came across multiple solutions:</p> + +<ul> +<li><p><a rel="noopener" target="_blank" href="https://milvus.io">Milvus</a> - An open-source vector database with similar search functionality</p></li> +<li><p><a rel="noopener" target="_blank" href="https://faiss.ai">FAISS</a> - A library for efficient similarity search</p></li> +<li><p><a rel="noopener" target="_blank" href="https://pinecone.io">Pinecone</a> - A fully managed vector database with similar search functionality</p></li> +</ul> + +<p>I did not want to waste time setting up the first two, so I decided to go with Pinecone which offers 1M 768-dim vectors for free with no credit card required (Our embeddings are 384-dim dense).</p> + +<p>Getting started with Pinecone was as easy as:</p> + +<ul> +<li><p>Signing up</p></li> +<li><p>Specifying the index name and vector dimensions along with the similarity search metric (Cosine Similarity for our use case)</p></li> +<li><p>Getting the API key</p></li> +<li><p>Installing the Python module (pinecone-client)</p></li> +</ul> + +<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span> +<span class="kn">import</span> <span class="nn">pinecone</span> +<span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span> +<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> + +<span class="n">database_url</span> <span class="o">=</span> <span class="s2">"sqlite:///jlm.db"</span> +<span class="n">PINECONE_KEY</span> <span class="o">=</span> <span class="s2">"not-this-at-all"</span> +<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">32</span> + +<span class="n">pinecone</span><span class="o">.</span><span class="n">init</span><span 
class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">PINECONE_KEY</span><span class="p">,</span> <span class="n">environment</span><span class="o">=</span><span class="s2">"us-west1-gcp"</span><span class="p">)</span> +<span class="n">index</span> <span class="o">=</span> <span class="n">pinecone</span><span class="o">.</span><span class="n">Index</span><span class="p">(</span><span class="s2">"movies"</span><span class="p">)</span> + +<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s2">"paraphrase-multilingual-MiniLM-L12-v2"</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s2">"cuda"</span><span class="p">)</span> +<span class="n">engine</span><span class="p">,</span> <span class="n">Session</span> <span class="o">=</span> <span class="n">init_db_stuff</span><span class="p">(</span><span class="n">database_url</span><span class="p">)</span> + +<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_sql</span><span class="p">(</span><span class="s2">"Select * from movies"</span><span class="p">,</span> <span class="n">engine</span><span class="p">)</span> +<span class="n">df</span><span class="p">[</span><span class="s2">"combined_text"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"title"</span><span class="p">]</span> <span class="o">+</span> <span class="s2">": "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s2">"overview"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span> <span class="o">+</span> <span class="s2">" - "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span 
class="s2">"tagline"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span> <span class="o">+</span> <span class="s2">" Genres:- "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s2">"genres"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span> + +<span class="c1"># Creating the embedding and inserting it into the database</span> +<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">),</span><span class="n">batch_size</span><span class="p">)):</span> + <span class="n">to_send</span> <span class="o">=</span> <span class="p">[]</span> + <span class="n">trakt_ids</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"trakt_id"</span><span class="p">][</span><span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> + <span class="n">sentences</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"combined_text"</span><span class="p">][</span><span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> + <span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span 
class="n">encode</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span> + <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">trakt_ids</span><span class="p">):</span> + <span class="n">to_send</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> + <span class="p">(</span> + <span class="nb">str</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> + <span class="p">))</span> + <span class="n">index</span><span class="o">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">to_send</span><span class="p">)</span> +</code></pre></div> + +<p>That's it!</p> + +<h2>Interacting with Vectors</h2> + +<p>We use the <code>trakt_id</code> for the movie as the ID for the vectors and upsert it into the index. </p> + +<p>To find similar items, we will first have to map the name of the movie to its trakt_id, get the embeddings we have for that id and then perform a similarity search. 
This additional mapping step could probably be avoided by storing the information as metadata in the index.</p> + +<div class="codehilite"><pre><span></span><code><span class="k">def</span> <span class="nf">get_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">title</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span> + <span class="n">rec</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"title"</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">==</span><span class="n">title</span><span class="o">.</span><span class="n">lower</span><span class="p">()]</span> + <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"multiple values found... 
</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> + <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">rec</span><span class="p">)):</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"[</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s2">] </span><span class="si">{</span><span class="n">rec</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="n">x</span><span class="p">]</span><span class="si">}</span><span class="s2"> (</span><span class="si">{</span><span class="n">rec</span><span class="p">[</span><span class="s1">'year'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="n">x</span><span class="p">]</span><span class="si">}</span><span class="s2">) - </span><span class="si">{</span><span class="n">rec</span><span class="p">[</span><span class="s1">'overview'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="n">x</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> + <span class="nb">print</span><span class="p">(</span><span class="s2">"==="</span><span class="p">)</span> + <span class="n">z</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">input</span><span class="p">(</span><span class="s2">"Choose No: "</span><span class="p">))</span> + <span 
class="k">return</span> <span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="n">z</span><span class="p">]</span> + <span class="k">return</span> <span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> + +<span class="k">def</span> <span class="nf">get_vector_value</span><span class="p">(</span><span class="n">trakt_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> + <span class="n">fetch_response</span> <span class="o">=</span> <span class="n">index</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">ids</span><span class="o">=</span><span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">trakt_id</span><span class="p">)])</span> + <span class="k">return</span> <span class="n">fetch_response</span><span class="p">[</span><span class="s2">"vectors"</span><span class="p">][</span><span class="nb">str</span><span class="p">(</span><span class="n">trakt_id</span><span class="p">)][</span><span class="s2">"values"</span><span class="p">]</span> + +<span class="k">def</span> <span class="nf">query_vectors</span><span class="p">(</span><span class="n">vector</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">20</span><span class="p">,</span> <span class="n">include_values</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span> <span class="n">include_metada</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span 
class="kc">True</span><span class="p">):</span> + <span class="n">query_response</span> <span class="o">=</span> <span class="n">index</span><span class="o">.</span><span class="n">query</span><span class="p">(</span> + <span class="n">queries</span><span class="o">=</span><span class="p">[</span> + <span class="p">(</span><span class="n">vector</span><span class="p">),</span> + <span class="p">],</span> + <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span><span class="p">,</span> + <span class="n">include_values</span><span class="o">=</span><span class="n">include_values</span><span class="p">,</span> + <span class="n">include_metadata</span><span class="o">=</span><span class="n">include_metada</span> + <span class="p">)</span> + <span class="k">return</span> <span class="n">query_response</span> + +<span class="k">def</span> <span class="nf">query2ids</span><span class="p">(</span><span class="n">query_response</span><span class="p">):</span> + <span class="n">trakt_ids</span> <span class="o">=</span> <span class="p">[]</span> + <span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">query_response</span><span class="p">[</span><span class="s2">"results"</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">"matches"</span><span class="p">]:</span> + <span class="n">trakt_ids</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">match</span><span class="p">[</span><span class="s2">"id"</span><span class="p">]))</span> + <span class="k">return</span> <span class="n">trakt_ids</span> + +<span class="k">def</span> <span class="nf">get_deets_by_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">trakt_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> + <span 
class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"trakt_id"</span><span class="p">]</span><span class="o">==</span><span class="n">trakt_id</span><span class="p">]</span> + <span class="k">return</span> <span class="p">{</span> + <span class="s2">"title"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> + <span class="s2">"overview"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">overview</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> + <span class="s2">"runtime"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">runtime</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> + <span class="s2">"year"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> + <span class="p">}</span> +</code></pre></div> + +<h3>Testing it Out</h3> + +<div class="codehilite"><pre><span></span><code><span class="n">movie_name</span> <span class="o">=</span> <span class="s2">"Now You See Me"</span> + +<span class="n">movie_trakt_id</span> <span class="o">=</span> <span class="n">get_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">movie_name</span><span class="p">)</span> +<span class="nb">print</span><span class="p">(</span><span class="n">movie_trakt_id</span><span class="p">)</span> +<span 
class="n">movie_vector</span> <span class="o">=</span> <span class="n">get_vector_value</span><span class="p">(</span><span class="n">movie_trakt_id</span><span class="p">)</span> +<span class="n">movie_queries</span> <span class="o">=</span> <span class="n">query_vectors</span><span class="p">(</span><span class="n">movie_vector</span><span class="p">)</span> +<span class="n">movie_ids</span> <span class="o">=</span> <span class="n">query2ids</span><span class="p">(</span><span class="n">movie_queries</span><span class="p">)</span> +<span class="nb">print</span><span class="p">(</span><span class="n">movie_ids</span><span class="p">)</span> + +<span class="k">for</span> <span class="n">trakt_id</span> <span class="ow">in</span> <span class="n">movie_ids</span><span class="p">:</span> + <span class="n">deets</span> <span class="o">=</span> <span class="n">get_deets_by_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">trakt_id</span><span class="p">)</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">deets</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="si">}</span><span class="s2"> (</span><span class="si">{</span><span class="n">deets</span><span class="p">[</span><span class="s1">'year'</span><span class="p">]</span><span class="si">}</span><span class="s2">): </span><span class="si">{</span><span class="n">deets</span><span class="p">[</span><span class="s1">'overview'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> +</code></pre></div> + +<p>Output:</p> + +<pre><code>[55786, 18374, 299592, 662622, 6054, 227458, 139687, 303950, 70000, 129307, 70823, 5766, 23950, 137696, 655723, 32842, 413269, 145994, 197990, 373832] +Now You See Me (2013): An FBI agent and an Interpol detective track a team of 
illusionists who pull off bank heists during their performances and reward their audiences with the money. +Trapped (1949): U.S. Treasury Department agents go after a ring of counterfeiters. +Brute Sanity (2018): An FBI-trained neuropsychologist teams up with a thief to find a reality-altering device while her insane ex-boss unleashes bizarre traps to stop her. +The Chase (2017): Some FBI agents hunt down a criminal +Surveillance (2008): An FBI agent tracks a serial killer with the help of three of his would-be victims - all of whom have wildly different stories to tell. +Marauders (2016): An untraceable group of elite bank robbers is chased by a suicidal FBI agent who uncovers a deeper purpose behind the robbery-homicides. +Miracles for Sale (1939): A maker of illusions for magicians protects an ingenue likely to be murdered. +Deceptors (2005): A Ghostbusters knock-off where a group of con-artists create bogus monsters to scare up some cash. They run for their lives when real spooks attack. +The Outfit (1993): A renegade FBI agent sparks an explosive mob war between gangster crime lords Legs Diamond and Dutch Schultz. +Bank Alarm (1937): A federal agent learns the gangsters he's been investigating have kidnapped his sister. +The Courier (2012): A shady FBI agent recruits a courier to deliver a mysterious package to a vengeful master criminal who has recently resurfaced with a diabolical plan. +After the Sunset (2004): An FBI agent is suspicious of two master thieves, quietly enjoying their retirement near what may - or may not - be the biggest score of their careers. +Down Three Dark Streets (1954): An FBI Agent takes on the three unrelated cases of a dead agent to track down his killer. +The Executioner (1970): A British intelligence agent must track down a fellow spy suspected of being a double agent. +Ace of Cactus Range (1924): A Secret Service agent goes undercover to unmask the leader of a gang of diamond thieves. 
+Firepower (1979): A mercenary is hired by the FBI to track down a powerful recluse criminal, a woman is also trying to track him down for her own personal vendetta. +Heroes & Villains (2018): an FBI agent chases a thug to great tunes +Federal Fugitives (1941): A government agent goes undercover in order to apprehend a saboteur who caused a plane crash. +Hell on Earth (2012): An FBI Agent on the trail of a group of drug traffickers learns that their corruption runs deeper than she ever imagined, and finds herself in a supernatural - and deadly - situation. +Spies (2015): A secret agent must perform a heist without time on his side +</code></pre> + +<p>For now, I am happy with the recommendations.</p> + +<h2>Simple UI</h2> + +<p>The code for the flask app can be found on GitHub: <a rel="noopener" target="_blank" href="https://github.com/navanchauhan/FlixRec">navanchauhan/FlixRec</a> or on my <a rel="noopener" target="_blank" href="https://pi4.navan.dev/gitea/navan/FlixRec">Gitea instance</a></p> + +<p>I quickly whipped up a simple Flask App to deal with problems of multiple movies sharing the title, and typos in the search query.</p> + +<h3>Home Page</h3> + +<p><img src="/assets/flixrec/home.png" alt="Home Page" /></p> + +<h3>Handling Multiple Movies with Same Title</h3> + +<p><img src="/assets/flixrec/multiple.png" alt="Multiple Movies with Same Title" /></p> + +<h3>Results Page</h3> + +<p><img src="/assets/flixrec/results.png" alt="Results Page" /></p> + +<p>Includes additional filter options</p> + +<p><img src="/assets/flixrec/filter.png" alt="Advanced Filtering Options" /></p> + +<p>Test it out at <a rel="noopener" target="_blank" href="https://flixrec.navan.dev">https://flixrec.navan.dev</a></p> + +<h2>Current Limitations</h2> + +<ul> +<li>Does not work well with popular franchises</li> +<li>No Genre Filter</li> +</ul> + +<h2>Future Addons</h2> + +<ul> +<li>Include Cast Data +<ul> +<li>e.g. 
If it sees a movie with Tom Hanks and Meg Ryan, then it will boost similar movies including them</li> +<li>e.g. If it sees the movie has been directed by McG, then it will boost similar movies directed by them</li> +</ul></li> +<li>REST API</li> +<li>TV Shows</li> +<li>Multilingual database</li> +<li>Filter based on popularity: The data already exists in the indexed database</li> +</ul> +]]></content:encoded> + </item> + + <item> + <guid isPermaLink="true"> https://web.navan.dev/posts/2020-08-01-Natural-Feature-Tracking-ARJS.html </guid> <title> diff --git a/docs/index.html b/docs/index.html index 66eee2a..d55f8ee 100644 --- a/docs/index.html +++ b/docs/index.html @@ -45,17 +45,34 @@ <ul> + <li><a href="/posts/2022-05-21-Similar-Movies-Recommender.html">Building a Simple Similar Movies Recommender System</a></li> + <ul> + <li>Building a Content Based Similar Movies Recommender System</li> + <li>Published On: 2022-05-21 17:56</li> + <li>Tags: + + Python, + + Transformers, + + Movies, + + Recommender-System + + </ul> + + <li><a href="/posts/2021-06-27-Crude-ML-AI-Powered-Chatbot-Swift.html">Making a Crude ML Powered Chatbot in Swift using CoreML</a></li> <ul> <li>Writing a simple Machine-Learning powered Chatbot (or, daresay virtual personal assistant ) in Swift using CoreML.</li> <li>Published On: 2021-06-27 23:26</li> <li>Tags: - Swift, + Swift, - CoreML, + CoreML, - NLP, + NLP </ul> @@ -66,9 +83,9 @@ <li>Published On: 2021-06-26 13:04</li> <li>Tags: - Cheminformatics, + Cheminformatics, - JavaScript, + JavaScript </ul> @@ -79,11 +96,11 @@ <li>Published On: 2021-06-25 16:20</li> <li>Tags: - iOS, + iOS, - Shortcuts, + Shortcuts, - Fun, + Fun </ul> @@ -94,11 +111,11 @@ <li>Published On: 2021-06-25 00:08</li> <li>Tags: - Python, + Python, - Twitter, + Twitter, - Eh, + Eh </ul> @@ -109,13 +126,13 @@ <li>Published On: 2020-12-01 20:52</li> <li>Tags: - Tutorial, + Tutorial, - Code-Snippet, + Code-Snippet, - HTML, + HTML, - JavaScript, + JavaScript </ul> @@ -126,11 
+143,11 @@ <li>Published On: 2020-11-17 15:04</li> <li>Tags: - Tutorial, + Tutorial, - Code-Snippet, + Code-Snippet, - Web-Development, + Web-Development </ul> @@ -141,11 +158,11 @@ <li>Published On: 2020-10-11 16:12</li> <li>Tags: - Tutorial, + Tutorial, - Review, + Review, - Webcam, + Webcam </ul> @@ -156,13 +173,13 @@ <li>Published On: 2020-08-01 15:43</li> <li>Tags: - Tutorial, + Tutorial, - AR.js, + AR.js, - JavaScript, + JavaScript, - Augmented-Reality, + Augmented-Reality </ul> @@ -173,11 +190,11 @@ <li>Published On: 2020-07-01 14:23</li> <li>Tags: - Tutorial, + Tutorial, - Code-Snippet, + Code-Snippet, - Colab, + Colab </ul> @@ -188,15 +205,15 @@ <li>Published On: 2020-06-02 23:23</li> <li>Tags: - iOS, + iOS, - Jailbreak, + Jailbreak, - Cheminformatics, + Cheminformatics, - AutoDock Vina, + AutoDock Vina, - Molecular-Docking, + Molecular-Docking </ul> @@ -207,15 +224,15 @@ <li>Published On: 2020-06-01 13:10</li> <li>Tags: - Code-Snippet, + Code-Snippet, - Molecular-Docking, + Molecular-Docking, - Cheminformatics, + Cheminformatics, - Open-Babel, + Open-Babel, - AutoDock Vina, + AutoDock Vina </ul> @@ -226,13 +243,13 @@ <li>Published On: 2020-05-31 23:30</li> <li>Tags: - iOS, + iOS, - Jailbreak, + Jailbreak, - Cheminformatics, + Cheminformatics, - Open-Babel, + Open-Babel </ul> @@ -243,9 +260,9 @@ <li>Published On: 2020-04-13 11:41</li> <li>Tags: - Molecular-Dynamics, + Molecular-Dynamics, - macOS, + macOS </ul> @@ -256,9 +273,9 @@ <li>Published On: 2020-03-17 17:40</li> <li>Tags: - publication, + publication, - pre-print, + pre-print </ul> @@ -269,9 +286,9 @@ <li>Published On: 2020-03-14 22:23</li> <li>Tags: - publication, + publication, - pre-print, + pre-print </ul> @@ -282,9 +299,9 @@ <li>Published On: 2020-03-08 23:17</li> <li>Tags: - Vaporwave, + Vaporwave, - Music, + Music </ul> @@ -295,9 +312,9 @@ <li>Published On: 2020-03-03 18:37</li> <li>Tags: - Android-TV, + Android-TV, - Android, + Android </ul> @@ -308,13 +325,13 @@ <li>Published On: 2020-01-19 
15:27</li> <li>Tags: - Code-Snippet, + Code-Snippet, - tutorial, + tutorial, - Raspberry-Pi, + Raspberry-Pi, - Linux, + Linux </ul> @@ -325,11 +342,11 @@ <li>Published On: 2020-01-16 10:36</li> <li>Tags: - Tutorial, + Tutorial, - Colab, + Colab, - Turicreate, + Turicreate </ul> @@ -340,13 +357,13 @@ <li>Published On: 2020-01-15 23:36</li> <li>Tags: - Tutorial, + Tutorial, - Colab, + Colab, - Turicreate, + Turicreate, - Kaggle, + Kaggle </ul> @@ -357,9 +374,9 @@ <li>Published On: 2020-01-14 00:10</li> <li>Tags: - Code-Snippet, + Code-Snippet, - Tutorial, + Tutorial </ul> @@ -370,13 +387,13 @@ <li>Published On: 2019-12-22 11:10</li> <li>Tags: - Tutorial, + Tutorial, - Colab, + Colab, - SwiftUI, + SwiftUI, - Turicreate, + Turicreate </ul> @@ -387,11 +404,11 @@ <li>Published On: 2019-12-16 14:16</li> <li>Tags: - Tutorial, + Tutorial, - Tensorflow, + Tensorflow, - Colab, + Colab </ul> @@ -402,11 +419,11 @@ <li>Published On: 2019-12-10 11:10</li> <li>Tags: - Tutorial, + Tutorial, - Tensorflow, + Tensorflow, - Code-Snippet, + Code-Snippet </ul> @@ -417,11 +434,11 @@ <li>Published On: 2019-12-08 14:16</li> <li>Tags: - Tutorial, + Tutorial, - Tensorflow, + Tensorflow, - Colab, + Colab </ul> @@ -432,9 +449,9 @@ <li>Published On: 2019-12-08 13:27</li> <li>Tags: - Code-Snippet, + Code-Snippet, - Tutorial, + Tutorial </ul> @@ -445,7 +462,7 @@ <li>Published On: 2019-12-04 18:23</li> <li>Tags: - Tutorial, + Tutorial </ul> @@ -456,7 +473,7 @@ <li>Published On: 2019-05-14 02:42</li> <li>Tags: - publication, + publication </ul> @@ -467,15 +484,15 @@ <li>Published On: 2019-05-05 12:34</li> <li>Tags: - Tutorial, + Tutorial, - Jailbreak, + Jailbreak, - Designing, + Designing, - Snowboard, + Snowboard, - Anemone, + Anemone </ul> @@ -486,7 +503,7 @@ <li>Published On: 2019-04-16 17:39</li> <li>Tags: - hello-world, + hello-world </ul> @@ -497,7 +514,7 @@ <li>Published On: 2010-01-24 23:43</li> <li>Tags: - Experiment, + Experiment </ul> diff --git 
a/docs/posts/2022-05-21-Similar-Movies-Recommender.html b/docs/posts/2022-05-21-Similar-Movies-Recommender.html new file mode 100644 index 0000000..42b887a --- /dev/null +++ b/docs/posts/2022-05-21-Similar-Movies-Recommender.html @@ -0,0 +1,438 @@ +<!DOCTYPE html> +<html lang="en"> +<head> + + <link rel="stylesheet" href="/assets/main.css" /> + <link rel="stylesheet" href="/assets/sakura.css" /> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <title>Hey - Post - Building a Simple Similar Movies Recommender System</title> + <meta name="og:site_name" content="Navan Chauhan" /> + <link rel="canonical" href="https://web.navan.dev/" /> + <meta name="twitter:url" content="https://web.navan.dev/" /> + <meta name="og:url" content="https://web.navan.dev/" /> + <meta name="twitter:title" content="Hey - Post - Building a Simple Similar Movies Recommender System" /> + <meta name="og:title" content="Hey - Post - Building a Simple Similar Movies Recommender System" /> + <meta name="description" content=" Building a Content Based Similar Movies Recommender System " /> + <meta name="twitter:description" content=" Building a Content Based Similar Movies Recommender System " /> + <meta name="og:description" content=" Building a Content Based Similar Movies Recommender System " /> + <meta name="twitter:card" content=" Building a Content Based Similar Movies Recommender System " /> + <meta name="viewport" content="width=device-width, initial-scale=1.0" /> + <link rel="shortcut icon" href="/images/favicon.png" type="image/png" /> + <link rel="alternate" href="/feed.rss" type="application/rss+xml" title="Subscribe to Navan Chauhan" /> + <meta name="twitter:image" content="https://web.navan.dev/images/logo.png" /> + <meta name="og:image" content="https://web.navan.dev/images/logo.png" /> + <link rel="manifest" href="manifest.json" /> + <meta name="google-site-verification" content="LVeSZxz-QskhbEjHxOi7-BM5dDxTg53x2TwrjFxfL0k" /> + 
<script async src="//gc.zgo.at/count.js" data-goatcounter="https://navanchauhan.goatcounter.com/count"></script> + +</head> +<body> + <nav style="display: block;"> +| +<a href="/">home</a> | +<a href="/about/">about/links</a> | +<a href="/posts/">posts</a> | +<a href="/publications/">publications</a> | +<a href="/repo/">iOS repo</a> | +<a href="/feed.rss">RSS Feed</a> | +</nav> + +<main> + <h1>Building a Simple Similar Movies Recommender System</h1> + +<h2>Why?</h2> + +<p>I recently came across a movie/tv-show recommender, <a rel="noopener" target="_blank" href="https://couchmoney.tv/">couchmoney.tv</a>. I loved it. I decided that I wanted to build something similar, so I could tinker with it as much as I wanted.</p> + +<p>I also wanted a recommendation system I could use via a REST API. Although I have not included that part in this post, I did eventually create it.</p> + +<h2>How?</h2> + +<p>By measuring the cosine of the angle between two vectors, you can get a value in the range [0,1] with 0 meaning no similarity. Now, if we find a way to represent information about movies as a vector, we can use cosine similarity as a metric to find similar movies.</p> + +<p>As we are recommending just based on the content of the movies, this is called a content based recommendation system.</p> + +<h2>Data Collection</h2> + +<p>Trakt exposes a nice API to search for movies/tv-shows. To access the API, you first need to get an API key (the Trakt ID you get when you create a new application). </p> + +<p>I decided to use SQL-Alchemy with a SQLite backend just to make my life easier if I decided on switching to Postgres anytime I felt like. 
</p> + +<p>First, I needed to check the total number of records in Trakt’s database.</p> + +<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span> +<span class="kn">import</span> <span class="nn">os</span> + +<span class="n">trakt_id</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"TRAKT_ID"</span><span class="p">)</span> + +<span class="n">api_base</span> <span class="o">=</span> <span class="s2">"https://api.trakt.tv"</span> + +<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"Content-Type"</span><span class="p">:</span> <span class="s2">"application/json"</span><span class="p">,</span> + <span class="s2">"trakt-api-version"</span><span class="p">:</span> <span class="s2">"2"</span><span class="p">,</span> + <span class="s2">"trakt-api-key"</span><span class="p">:</span> <span class="n">trakt_id</span> +<span class="p">}</span> + +<span class="n">params</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"query"</span><span class="p">:</span> <span class="s2">""</span><span class="p">,</span> + <span class="s2">"years"</span><span class="p">:</span> <span class="s2">"1900-2021"</span><span class="p">,</span> + <span class="s2">"page"</span><span class="p">:</span> <span class="s2">"1"</span><span class="p">,</span> + <span class="s2">"extended"</span><span class="p">:</span> <span class="s2">"full"</span><span class="p">,</span> + <span class="s2">"languages"</span><span class="p">:</span> <span class="s2">"en"</span> +<span class="p">}</span> + +<span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">api_base</span><span 
class="si">}</span><span class="s2">/search/movie"</span><span class="p">,</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span><span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span> +<span class="n">total_items</span> <span class="o">=</span> <span class="n">res</span><span class="o">.</span><span class="n">headers</span><span class="p">[</span><span class="s2">"x-pagination-item-count"</span><span class="p">]</span> +<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"There are </span><span class="si">{</span><span class="n">total_items</span><span class="si">}</span><span class="s2"> movies"</span><span class="p">)</span> +</code></pre></div> + +<pre><code>There are 333946 movies +</code></pre> + +<p>First, I needed to declare the database schema in (<code>database.py</code>):</p> + +<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">sqlalchemy</span> +<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span> +<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">Table</span><span class="p">,</span> <span class="n">Column</span><span class="p">,</span> <span class="n">Integer</span><span class="p">,</span> <span class="n">String</span><span class="p">,</span> <span class="n">MetaData</span><span class="p">,</span> <span class="n">ForeignKey</span><span class="p">,</span> <span class="n">PickleType</span> +<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">insert</span> +<span class="kn">from</span> <span class="nn">sqlalchemy.orm</span> <span class="kn">import</span> <span class="n">sessionmaker</span> +<span class="kn">from</span> <span class="nn">sqlalchemy.exc</span> <span 
class="kn">import</span> <span class="n">IntegrityError</span> + +<span class="n">meta</span> <span class="o">=</span> <span class="n">MetaData</span><span class="p">()</span> + +<span class="n">movies_table</span> <span class="o">=</span> <span class="n">Table</span><span class="p">(</span> + <span class="s2">"movies"</span><span class="p">,</span> + <span class="n">meta</span><span class="p">,</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"trakt_id"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">autoincrement</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"title"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"overview"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"genres"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"year"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"released"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"runtime"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"country"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span 
class="n">Column</span><span class="p">(</span><span class="s2">"language"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"rating"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"votes"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"comment_count"</span><span class="p">,</span> <span class="n">Integer</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"tagline"</span><span class="p">,</span> <span class="n">String</span><span class="p">),</span> + <span class="n">Column</span><span class="p">(</span><span class="s2">"embeddings"</span><span class="p">,</span> <span class="n">PickleType</span><span class="p">)</span> + +<span class="p">)</span> + +<span class="c1"># Helper function to connect to the db</span> +<span class="k">def</span> <span class="nf">init_db_stuff</span><span class="p">(</span><span class="n">database_url</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span> + <span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="n">database_url</span><span class="p">)</span> + <span class="n">meta</span><span class="o">.</span><span class="n">create_all</span><span class="p">(</span><span class="n">engine</span><span class="p">)</span> + <span class="n">Session</span> <span class="o">=</span> <span class="n">sessionmaker</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span> + <span class="k">return</span> <span class="n">engine</span><span class="p">,</span> <span class="n">Session</span> 
+</code></pre></div> + +<p>In the end, I could have dropped the embeddings field from the table schema as I never got around to using it.</p> + +<h3>Scripting Time</h3> + +<div class="codehilite"><pre><span></span><code><span class="kn">from</span> <span class="nn">database</span> <span class="kn">import</span> <span class="o">*</span> +<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> +<span class="kn">import</span> <span class="nn">requests</span> +<span class="kn">import</span> <span class="nn">os</span> + +<span class="n">trakt_id</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getenv</span><span class="p">(</span><span class="s2">"TRAKT_ID"</span><span class="p">)</span> + +<span class="n">max_requests</span> <span class="o">=</span> <span class="mi">5000</span> <span class="c1"># How many requests I wanted to wrap everything up in</span> +<span class="n">req_count</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># A counter for how many requests I have made</span> + +<span class="n">years</span> <span class="o">=</span> <span class="s2">"1900-2021"</span> +<span class="n">page</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># The initial page number for the search</span> +<span class="n">extended</span> <span class="o">=</span> <span class="s2">"full"</span> <span class="c1"># Required to get additional information </span> +<span class="n">limit</span> <span class="o">=</span> <span class="s2">"10"</span> <span class="c1"># No of entires per request -- This will be automatically picked based on max_requests</span> +<span class="n">languages</span> <span class="o">=</span> <span class="s2">"en"</span> <span class="c1"># Limit to English</span> + +<span class="n">api_base</span> <span class="o">=</span> <span class="s2">"https://api.trakt.tv"</span> +<span class="n">database_url</span> <span 
class="o">=</span> <span class="s2">"sqlite:///jlm.db"</span> + +<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"Content-Type"</span><span class="p">:</span> <span class="s2">"application/json"</span><span class="p">,</span> + <span class="s2">"trakt-api-version"</span><span class="p">:</span> <span class="s2">"2"</span><span class="p">,</span> + <span class="s2">"trakt-api-key"</span><span class="p">:</span> <span class="n">trakt_id</span> +<span class="p">}</span> + +<span class="n">params</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"query"</span><span class="p">:</span> <span class="s2">""</span><span class="p">,</span> + <span class="s2">"years"</span><span class="p">:</span> <span class="n">years</span><span class="p">,</span> + <span class="s2">"page"</span><span class="p">:</span> <span class="n">page</span><span class="p">,</span> + <span class="s2">"extended"</span><span class="p">:</span> <span class="n">extended</span><span class="p">,</span> + <span class="s2">"limit"</span><span class="p">:</span> <span class="n">limit</span><span class="p">,</span> + <span class="s2">"languages"</span><span class="p">:</span> <span class="n">languages</span> +<span class="p">}</span> + +<span class="c1"># Helper function to get desirable values from the response</span> +<span class="k">def</span> <span class="nf">create_movie_dict</span><span class="p">(</span><span class="n">movie</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span> + <span class="n">m</span> <span class="o">=</span> <span class="n">movie</span><span class="p">[</span><span class="s2">"movie"</span><span class="p">]</span> + <span class="n">movie_dict</span> <span class="o">=</span> <span class="p">{</span> + <span class="s2">"title"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"title"</span><span class="p">],</span> + <span 
class="s2">"overview"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"overview"</span><span class="p">],</span> + <span class="s2">"genres"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"genres"</span><span class="p">],</span> + <span class="s2">"language"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"language"</span><span class="p">],</span> + <span class="s2">"year"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"year"</span><span class="p">]),</span> + <span class="s2">"trakt_id"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"ids"</span><span class="p">][</span><span class="s2">"trakt"</span><span class="p">],</span> + <span class="s2">"released"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"released"</span><span class="p">],</span> + <span class="s2">"runtime"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"runtime"</span><span class="p">]),</span> + <span class="s2">"country"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"country"</span><span class="p">],</span> + <span class="s2">"rating"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"rating"</span><span class="p">]),</span> + <span class="s2">"votes"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"votes"</span><span class="p">]),</span> + <span class="s2">"comment_count"</span><span class="p">:</span> <span 
class="nb">int</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"comment_count"</span><span class="p">]),</span> + <span class="s2">"tagline"</span><span class="p">:</span> <span class="n">m</span><span class="p">[</span><span class="s2">"tagline"</span><span class="p">]</span> + <span class="p">}</span> + <span class="k">return</span> <span class="n">movie_dict</span> + +<span class="c1"># Get total number of items</span> +<span class="n">params</span><span class="p">[</span><span class="s2">"limit"</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span> +<span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">api_base</span><span class="si">}</span><span class="s2">/search/movie"</span><span class="p">,</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span><span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span> +<span class="n">total_items</span> <span class="o">=</span> <span class="n">res</span><span class="o">.</span><span class="n">headers</span><span class="p">[</span><span class="s2">"x-pagination-item-count"</span><span class="p">]</span> + +<span class="n">engine</span><span class="p">,</span> <span class="n">Session</span> <span class="o">=</span> <span class="n">init_db_stuff</span><span class="p">(</span><span class="n">database_url</span><span class="p">)</span> + + +<span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">max_requests</span><span class="o">+</span><span 
class="mi">1</span><span class="p">)):</span> + <span class="n">params</span><span class="p">[</span><span class="s2">"page"</span><span class="p">]</span> <span class="o">=</span> <span class="n">page</span> + <span class="n">params</span><span class="p">[</span><span class="s2">"limit"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">total_items</span><span class="p">)</span><span class="o">/</span><span class="n">max_requests</span><span class="p">)</span> + <span class="n">movies</span> <span class="o">=</span> <span class="p">[]</span> + <span class="n">res</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">api_base</span><span class="si">}</span><span class="s2">/search/movie"</span><span class="p">,</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span><span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span> + + <span class="k">if</span> <span class="n">res</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">500</span><span class="p">:</span> + <span class="k">break</span> + <span class="k">elif</span> <span class="n">res</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span> + <span class="kc">None</span> + <span class="k">else</span><span class="p">:</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"OwO Code </span><span class="si">{</span><span class="n">res</span><span class="o">.</span><span class="n">status_code</span><span class="si">}</span><span 
class="s2">"</span><span class="p">)</span> + + <span class="k">for</span> <span class="n">movie</span> <span class="ow">in</span> <span class="n">res</span><span class="o">.</span><span class="n">json</span><span class="p">():</span> + <span class="n">movies</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">create_movie_dict</span><span class="p">(</span><span class="n">movie</span><span class="p">))</span> + + <span class="k">with</span> <span class="n">engine</span><span class="o">.</span><span class="n">connect</span><span class="p">()</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span> + <span class="k">for</span> <span class="n">movie</span> <span class="ow">in</span> <span class="n">movies</span><span class="p">:</span> + <span class="k">with</span> <span class="n">conn</span><span class="o">.</span><span class="n">begin</span><span class="p">()</span> <span class="k">as</span> <span class="n">trans</span><span class="p">:</span> + <span class="n">stmt</span> <span class="o">=</span> <span class="n">insert</span><span class="p">(</span><span class="n">movies_table</span><span class="p">)</span><span class="o">.</span><span class="n">values</span><span class="p">(</span> + <span class="n">trakt_id</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"trakt_id"</span><span class="p">],</span> <span class="n">title</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"title"</span><span class="p">],</span> <span class="n">genres</span><span class="o">=</span><span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">movie</span><span class="p">[</span><span class="s2">"genres"</span><span class="p">]),</span> + <span class="n">language</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span 
class="s2">"language"</span><span class="p">],</span> <span class="n">year</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"year"</span><span class="p">],</span> <span class="n">released</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"released"</span><span class="p">],</span> + <span class="n">runtime</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"runtime"</span><span class="p">],</span> <span class="n">country</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"country"</span><span class="p">],</span> <span class="n">overview</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"overview"</span><span class="p">],</span> + <span class="n">rating</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"rating"</span><span class="p">],</span> <span class="n">votes</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"votes"</span><span class="p">],</span> <span class="n">comment_count</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"comment_count"</span><span class="p">],</span> + <span class="n">tagline</span><span class="o">=</span><span class="n">movie</span><span class="p">[</span><span class="s2">"tagline"</span><span class="p">])</span> + <span class="k">try</span><span class="p">:</span> + <span class="n">result</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">stmt</span><span class="p">)</span> + <span class="n">trans</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span> + <span class="k">except</span> <span class="n">IntegrityError</span><span 
class="p">:</span> + <span class="n">trans</span><span class="o">.</span><span class="n">rollback</span><span class="p">()</span> + <span class="n">req_count</span> <span class="o">+=</span> <span class="mi">1</span> +</code></pre></div> + +<p>(Note: I was well within the rate-limit so I did not have to slow down or implement any other measures)</p> + +<p>Running this script took me approximately 3 hours, and resulted in an SQLite database of 141.5 MB</p> + +<h2>Embeddings!</h2> + +<p>I did not want to put my poor Mac through the estimated 23 hours it would have taken to embed the sentences. I decided to use Google Colab instead.</p> + +<p>Because of the small size of the database file, I was able to just upload the file.</p> + +<p>For the encoding model, I decided to use the pretrained <code>paraphrase-multilingual-MiniLM-L12-v2</code> model for SentenceTransformers, a Python framework for SOTA sentence, text and image embeddings. I wanted to use a multilingual model as I personally consume content in various languages (natively, no dubs or subs) and some of the sources for their information do not translate to English. As of writing this post, I did not include any other database except Trakt. 
</p> + +<p>While deciding how I was going to process the embeddings, I came across multiple solutions:</p> + +<ul> +<li><p><a rel="noopener" target="_blank" href="https://milvus.io">Milvus</a> - An open-source vector database with similar search functionality</p></li> +<li><p><a rel="noopener" target="_blank" href="https://faiss.ai">FAISS</a> - A library for efficient similarity search</p></li> +<li><p><a rel="noopener" target="_blank" href="https://pinecone.io">Pinecone</a> - A fully managed vector database with similar search functionality</p></li> +</ul> + +<p>I did not want to waste time setting up the first two, so I decided to go with Pinecone which offers 1M 768-dim vectors for free with no credit card required (Our embeddings are 384-dim dense).</p> + +<p>Getting started with Pinecone was as easy as:</p> + +<ul> +<li><p>Signing up</p></li> +<li><p>Specifying the index name and vector dimensions along with the similarity search metric (Cosine Similarity for our use case)</p></li> +<li><p>Getting the API key</p></li> +<li><p>Installing the Python module (pinecone-client)</p></li> +</ul> + +<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span> +<span class="kn">import</span> <span class="nn">pinecone</span> +<span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span> +<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> + +<span class="n">database_url</span> <span class="o">=</span> <span class="s2">"sqlite:///jlm.db"</span> +<span class="n">PINECONE_KEY</span> <span class="o">=</span> <span class="s2">"not-this-at-all"</span> +<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">32</span> + +<span class="n">pinecone</span><span class="o">.</span><span class="n">init</span><span 
class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">PINECONE_KEY</span><span class="p">,</span> <span class="n">environment</span><span class="o">=</span><span class="s2">"us-west1-gcp"</span><span class="p">)</span> +<span class="n">index</span> <span class="o">=</span> <span class="n">pinecone</span><span class="o">.</span><span class="n">Index</span><span class="p">(</span><span class="s2">"movies"</span><span class="p">)</span> + +<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s2">"paraphrase-multilingual-MiniLM-L12-v2"</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s2">"cuda"</span><span class="p">)</span> +<span class="n">engine</span><span class="p">,</span> <span class="n">Session</span> <span class="o">=</span> <span class="n">init_db_stuff</span><span class="p">(</span><span class="n">database_url</span><span class="p">)</span> + +<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_sql</span><span class="p">(</span><span class="s2">"Select * from movies"</span><span class="p">,</span> <span class="n">engine</span><span class="p">)</span> +<span class="n">df</span><span class="p">[</span><span class="s2">"combined_text"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"title"</span><span class="p">]</span> <span class="o">+</span> <span class="s2">": "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s2">"overview"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span> <span class="o">+</span> <span class="s2">" - "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span 
class="s2">"tagline"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span> <span class="o">+</span> <span class="s2">" Genres:- "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s2">"genres"</span><span class="p">]</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span> + +<span class="c1"># Creating the embedding and inserting it into the database</span> +<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">),</span><span class="n">batch_size</span><span class="p">)):</span> + <span class="n">to_send</span> <span class="o">=</span> <span class="p">[]</span> + <span class="n">trakt_ids</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"trakt_id"</span><span class="p">][</span><span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> + <span class="n">sentences</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"combined_text"</span><span class="p">][</span><span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> + <span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span 
class="n">encode</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span> + <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">trakt_ids</span><span class="p">):</span> + <span class="n">to_send</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> + <span class="p">(</span> + <span class="nb">str</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> + <span class="p">))</span> + <span class="n">index</span><span class="o">.</span><span class="n">upsert</span><span class="p">(</span><span class="n">to_send</span><span class="p">)</span> +</code></pre></div> + +<p>That's it!</p> + +<h2>Interacting with Vectors</h2> + +<p>We use the <code>trakt_id</code> for the movie as the ID for the vectors and upsert it into the index. </p> + +<p>To find similar items, we will first have to map the name of the movie to its trakt_id, get the embeddings we have for that id and then perform a similarity search. 
It is possible that this additional step of mapping could be avoided by storing information as metadata in the index.</p> + +<div class="codehilite"><pre><span></span><code><span class="k">def</span> <span class="nf">get_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">title</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span> + <span class="n">rec</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"title"</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">==</span><span class="n">title</span><span class="o">.</span><span class="n">lower</span><span class="p">()]</span> + <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"multiple values found... 
</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> + <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">rec</span><span class="p">)):</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"[</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s2">] </span><span class="si">{</span><span class="n">rec</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="n">x</span><span class="p">]</span><span class="si">}</span><span class="s2"> (</span><span class="si">{</span><span class="n">rec</span><span class="p">[</span><span class="s1">'year'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="n">x</span><span class="p">]</span><span class="si">}</span><span class="s2">) - </span><span class="si">{</span><span class="n">rec</span><span class="p">[</span><span class="s1">'overview'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="n">x</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> + <span class="nb">print</span><span class="p">(</span><span class="s2">"==="</span><span class="p">)</span> + <span class="n">z</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">input</span><span class="p">(</span><span class="s2">"Choose No: "</span><span class="p">))</span> + <span 
class="k">return</span> <span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="n">z</span><span class="p">]</span> + <span class="k">return</span> <span class="n">rec</span><span class="o">.</span><span class="n">trakt_id</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> + +<span class="k">def</span> <span class="nf">get_vector_value</span><span class="p">(</span><span class="n">trakt_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> + <span class="n">fetch_response</span> <span class="o">=</span> <span class="n">index</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">ids</span><span class="o">=</span><span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">trakt_id</span><span class="p">)])</span> + <span class="k">return</span> <span class="n">fetch_response</span><span class="p">[</span><span class="s2">"vectors"</span><span class="p">][</span><span class="nb">str</span><span class="p">(</span><span class="n">trakt_id</span><span class="p">)][</span><span class="s2">"values"</span><span class="p">]</span> + +<span class="k">def</span> <span class="nf">query_vectors</span><span class="p">(</span><span class="n">vector</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">20</span><span class="p">,</span> <span class="n">include_values</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span> <span class="n">include_metada</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span 
class="kc">True</span><span class="p">):</span> + <span class="n">query_response</span> <span class="o">=</span> <span class="n">index</span><span class="o">.</span><span class="n">query</span><span class="p">(</span> + <span class="n">queries</span><span class="o">=</span><span class="p">[</span> + <span class="p">(</span><span class="n">vector</span><span class="p">),</span> + <span class="p">],</span> + <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span><span class="p">,</span> + <span class="n">include_values</span><span class="o">=</span><span class="n">include_values</span><span class="p">,</span> + <span class="n">include_metadata</span><span class="o">=</span><span class="n">include_metada</span> + <span class="p">)</span> + <span class="k">return</span> <span class="n">query_response</span> + +<span class="k">def</span> <span class="nf">query2ids</span><span class="p">(</span><span class="n">query_response</span><span class="p">):</span> + <span class="n">trakt_ids</span> <span class="o">=</span> <span class="p">[]</span> + <span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">query_response</span><span class="p">[</span><span class="s2">"results"</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">"matches"</span><span class="p">]:</span> + <span class="n">trakt_ids</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">match</span><span class="p">[</span><span class="s2">"id"</span><span class="p">]))</span> + <span class="k">return</span> <span class="n">trakt_ids</span> + +<span class="k">def</span> <span class="nf">get_deets_by_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">trakt_id</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> + <span 
class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"trakt_id"</span><span class="p">]</span><span class="o">==</span><span class="n">trakt_id</span><span class="p">]</span> + <span class="k">return</span> <span class="p">{</span> + <span class="s2">"title"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> + <span class="s2">"overview"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">overview</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> + <span class="s2">"runtime"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">runtime</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> + <span class="s2">"year"</span><span class="p">:</span> <span class="n">df</span><span class="o">.</span><span class="n">year</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> + <span class="p">}</span> +</code></pre></div> + +<h3>Testing it Out</h3> + +<div class="codehilite"><pre><span></span><code><span class="n">movie_name</span> <span class="o">=</span> <span class="s2">"Now You See Me"</span> + +<span class="n">movie_trakt_id</span> <span class="o">=</span> <span class="n">get_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">movie_name</span><span class="p">)</span> +<span class="nb">print</span><span class="p">(</span><span class="n">movie_trakt_id</span><span class="p">)</span> +<span 
class="n">movie_vector</span> <span class="o">=</span> <span class="n">get_vector_value</span><span class="p">(</span><span class="n">movie_trakt_id</span><span class="p">)</span> +<span class="n">movie_queries</span> <span class="o">=</span> <span class="n">query_vectors</span><span class="p">(</span><span class="n">movie_vector</span><span class="p">)</span> +<span class="n">movie_ids</span> <span class="o">=</span> <span class="n">query2ids</span><span class="p">(</span><span class="n">movie_queries</span><span class="p">)</span> +<span class="nb">print</span><span class="p">(</span><span class="n">movie_ids</span><span class="p">)</span> + +<span class="k">for</span> <span class="n">trakt_id</span> <span class="ow">in</span> <span class="n">movie_ids</span><span class="p">:</span> + <span class="n">deets</span> <span class="o">=</span> <span class="n">get_deets_by_trakt_id</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">trakt_id</span><span class="p">)</span> + <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">deets</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="si">}</span><span class="s2"> (</span><span class="si">{</span><span class="n">deets</span><span class="p">[</span><span class="s1">'year'</span><span class="p">]</span><span class="si">}</span><span class="s2">): </span><span class="si">{</span><span class="n">deets</span><span class="p">[</span><span class="s1">'overview'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span> +</code></pre></div> + +<p>Output:</p> + +<pre><code>[55786, 18374, 299592, 662622, 6054, 227458, 139687, 303950, 70000, 129307, 70823, 5766, 23950, 137696, 655723, 32842, 413269, 145994, 197990, 373832] +Now You See Me (2013): An FBI agent and an Interpol detective track a team of 
illusionists who pull off bank heists during their performances and reward their audiences with the money. +Trapped (1949): U.S. Treasury Department agents go after a ring of counterfeiters. +Brute Sanity (2018): An FBI-trained neuropsychologist teams up with a thief to find a reality-altering device while her insane ex-boss unleashes bizarre traps to stop her. +The Chase (2017): Some FBI agents hunt down a criminal +Surveillance (2008): An FBI agent tracks a serial killer with the help of three of his would-be victims - all of whom have wildly different stories to tell. +Marauders (2016): An untraceable group of elite bank robbers is chased by a suicidal FBI agent who uncovers a deeper purpose behind the robbery-homicides. +Miracles for Sale (1939): A maker of illusions for magicians protects an ingenue likely to be murdered. +Deceptors (2005): A Ghostbusters knock-off where a group of con-artists create bogus monsters to scare up some cash. They run for their lives when real spooks attack. +The Outfit (1993): A renegade FBI agent sparks an explosive mob war between gangster crime lords Legs Diamond and Dutch Schultz. +Bank Alarm (1937): A federal agent learns the gangsters he's been investigating have kidnapped his sister. +The Courier (2012): A shady FBI agent recruits a courier to deliver a mysterious package to a vengeful master criminal who has recently resurfaced with a diabolical plan. +After the Sunset (2004): An FBI agent is suspicious of two master thieves, quietly enjoying their retirement near what may - or may not - be the biggest score of their careers. +Down Three Dark Streets (1954): An FBI Agent takes on the three unrelated cases of a dead agent to track down his killer. +The Executioner (1970): A British intelligence agent must track down a fellow spy suspected of being a double agent. +Ace of Cactus Range (1924): A Secret Service agent goes undercover to unmask the leader of a gang of diamond thieves. 
+Firepower (1979): A mercenary is hired by the FBI to track down a powerful recluse criminal, a woman is also trying to track him down for her own personal vendetta. +Heroes & Villains (2018): an FBI agent chases a thug to great tunes +Federal Fugitives (1941): A government agent goes undercover in order to apprehend a saboteur who caused a plane crash. +Hell on Earth (2012): An FBI Agent on the trail of a group of drug traffickers learns that their corruption runs deeper than she ever imagined, and finds herself in a supernatural - and deadly - situation. +Spies (2015): A secret agent must perform a heist without time on his side +</code></pre> + +<p>For now, I am happy with the recommendations.</p> + +<h2>Simple UI</h2> + +<p>The code for the Flask app can be found on GitHub: <a rel="noopener" target="_blank" href="https://github.com/navanchauhan/FlixRec">navanchauhan/FlixRec</a> or on my <a rel="noopener" target="_blank" href="https://pi4.navan.dev/gitea/navan/FlixRec">Gitea instance</a></p> + +<p>I quickly whipped up a simple Flask App to deal with problems of multiple movies sharing the title, and typos in the search query.</p> + +<h3>Home Page</h3> + +<p><img src="/assets/flixrec/home.png" alt="Home Page" /></p> + +<h3>Handling Multiple Movies with Same Title</h3> + +<p><img src="/assets/flixrec/multiple.png" alt="Multiple Movies with Same Title" /></p> + +<h3>Results Page</h3> + +<p><img src="/assets/flixrec/results.png" alt="Results Page" /></p> + +<p>Includes additional filter options</p> + +<p><img src="/assets/flixrec/filter.png" alt="Advance Filtering Options" /></p> + +<p>Test it out at <a rel="noopener" target="_blank" href="https://flixrec.navan.dev">https://flixrec.navan.dev</a></p> + +<h2>Current Limitations</h2> + +<ul> +<li>Does not work well with popular franchises</li> +<li>No Genre Filter</li> +</ul> + +<h2>Future Addons</h2> + +<ul> +<li>Include Cast Data +<ul> +<li>e.g. 
If it sees a movie with Tom Hanks and Meg Ryan, then it will boost similar movies including them</li> +<li>e.g. If it sees the movie has been directed by McG, then it will boost similar movies directed by them</li> +</ul></li> +<li>REST API</li> +<li>TV Shows</li> +<li>Multilingual database</li> +<li>Filter based on popularity: The data already exists in the indexed database</li> +</ul> + +</main> + + +<script src="assets/manup.min.js"></script> +<script src="/pwabuilder-sw-register.js"></script> +</body> +</html>
\ No newline at end of file diff --git a/docs/posts/index.html b/docs/posts/index.html index bb704f8..d1e3bf4 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -48,6 +48,23 @@ <ul> + <li><a href="/posts/2022-05-21-Similar-Movies-Recommender.html">Building a Simple Similar Movies Recommender System</a></li> + <ul> + <li>Building a Content Based Similar Movies Recommender System</li> + <li>Published On: 2022-05-21 17:56</li> + <li>Tags: + + Python, + + Transformers, + + Movies, + + Recommender-System, + + </ul> + + <li><a href="/posts/2021-06-27-Crude-ML-AI-Powered-Chatbot-Swift.html">Making a Crude ML Powered Chatbot in Swift using CoreML</a></li> <ul> <li>Writing a simple Machine-Learning powered Chatbot (or, daresay virtual personal assistant ) in Swift using CoreML.</li> |