diff options
| author | navanchauhan <navanchauhan@gmail.com> | 2022-05-22 12:03:28 -0600 | 
|---|---|---|
| committer | navanchauhan <navanchauhan@gmail.com> | 2022-05-22 12:03:28 -0600 | 
| commit | 702fd6ebc1b9ccf135cef9947f07912659fc752d (patch) | |
| tree | 08e2a5b7639d5963093f3af572d677eeef673997 /Content/posts | |
| parent | 235bb7256ad966de87a3b5ab433598271a3d21d0 (diff) | |
added new post movie recommender
Diffstat (limited to 'Content/posts')
| -rw-r--r-- | Content/posts/2022-05-21-Similar-Movies-Recommender.md | 400 | 
1 files changed, 400 insertions, 0 deletions
| diff --git a/Content/posts/2022-05-21-Similar-Movies-Recommender.md b/Content/posts/2022-05-21-Similar-Movies-Recommender.md new file mode 100644 index 0000000..fbc9fdb --- /dev/null +++ b/Content/posts/2022-05-21-Similar-Movies-Recommender.md @@ -0,0 +1,400 @@ +--- +date: 2022-05-21 17:56 +description: Building a Content Based Similar Movies Recommender System +tags: Python, Transformers, Movies, Recommender-System +--- + +# Building a Simple Similar Movies Recommender System + +## Why? + +I recently came across a movie/tv-show recommender, [couchmoney.tv](https://couchmoney.tv/). I loved it. I decided that I wanted to build something similar, so I could tinker with it as much as I wanted. + +I also wanted a recommendation system I could use via a REST API. Although I have not included that part in this post, I did eventually create it. + + +## How? + +By measuring the cosine of the angle between two vectors, you can get a value in the range [0,1] with 0 meaning no similarity. Now, if we find a way to represent information about movies as a vector, we can use cosine similarity as a metric to find similar movies. + +As we are recommending just based on the content of the movies, this is called a content based recommendation system. + +## Data Collection + +Trakt exposes a nice API to search for movies/tv-shows. To access the API, you first need to get an API key (the Trakt ID you get when you create a new application).  + +I decided to use SQL-Alchemy with a SQLite backend just to make my life easier if I decided on switching to Postgres anytime I felt like.  + +First, I needed to check the total number of records in Trakt’s database. + +```python +import requests +import os + +trakt_id = os.getenv("TRAKT_ID") + +api_base = "https://api.trakt.tv" + +headers = { +	"Content-Type": "application/json", +	"trakt-api-version": "2", +	"trakt-api-key": trakt_id +} + +params = { +	"query": "", +	"years": "1900-2021", +	"page": "1", +	"extended": "full", +	"languages": "en" +} + +res = requests.get(f"{api_base}/search/movie",headers=headers,params=params) +total_items = res.headers["x-pagination-item-count"] +print(f"There are {total_items} movies") +``` + +``` +There are 333946 movies +``` + +First, I needed to declare the database schema in (`database.py`): + +```python +import sqlalchemy +from sqlalchemy import create_engine +from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey, PickleType +from sqlalchemy import insert +from sqlalchemy.orm import sessionmaker +from sqlalchemy.exc import IntegrityError + +meta = MetaData() + +movies_table = Table( +    "movies", +    meta, +    Column("trakt_id", Integer, primary_key=True, autoincrement=False), +    Column("title", String), +    Column("overview", String), +    Column("genres", String), +    Column("year", Integer), +    Column("released", String), +    Column("runtime", Integer), +    Column("country", String), +    Column("language", String), +    Column("rating", Integer), +    Column("votes", Integer), +    Column("comment_count", Integer), +    Column("tagline", String), +    Column("embeddings", PickleType) + +) + +# Helper function to connect to the db +def init_db_stuff(database_url: str): +    engine = create_engine(database_url) +    meta.create_all(engine) +    Session = sessionmaker(bind=engine) +    return engine, Session +``` + +In the end, I could have dropped the embeddings field from the table schema as I never got around to using it. + +### Scripting Time + +```python +from database import * +from tqdm import tqdm +import requests +import os + +trakt_id = os.getenv("TRAKT_ID") + +max_requests = 5000 # How many requests I wanted to wrap everything up in +req_count = 0 # A counter for how many requests I have made + +years = "1900-2021"  +page = 1 # The initial page number for the search +extended = "full" # Required to get additional information  +limit = "10" # No of entires per request -- This will be automatically picked based on max_requests +languages = "en" # Limit to English + +api_base = "https://api.trakt.tv" +database_url = "sqlite:///jlm.db" + +headers = { +	"Content-Type": "application/json", +	"trakt-api-version": "2", +	"trakt-api-key": trakt_id +} + +params = { +	"query": "", +	"years": years, +	"page": page, +	"extended": extended, +	"limit": limit, +	"languages": languages +} + +# Helper function to get desirable values from the response +def create_movie_dict(movie: dict): +	m = movie["movie"] +	movie_dict = { +		"title": m["title"], +		"overview": m["overview"], +		"genres": m["genres"], +		"language": m["language"], +		"year": int(m["year"]), +		"trakt_id": m["ids"]["trakt"], +		"released": m["released"], +		"runtime": int(m["runtime"]), +		"country": m["country"], +		"rating": int(m["rating"]), +		"votes": int(m["votes"]), +		"comment_count": int(m["comment_count"]), +		"tagline": m["tagline"] +	} +	return movie_dict + +# Get total number of items +params["limit"] = 1 +res = requests.get(f"{api_base}/search/movie",headers=headers,params=params) +total_items = res.headers["x-pagination-item-count"] + +engine, Session = init_db_stuff(database_url) + + +for page in tqdm(range(1,max_requests+1)): +	params["page"] = page +	params["limit"] = int(int(total_items)/max_requests) +	movies = [] +	res = requests.get(f"{api_base}/search/movie",headers=headers,params=params) + +	if res.status_code == 500: +		break +	elif res.status_code == 200: +		None +	else: +		print(f"OwO Code {res.status_code}") + +	for movie in res.json(): +		movies.append(create_movie_dict(movie)) + +	with engine.connect() as conn: +		for movie in movies: +			with conn.begin() as trans: +				stmt = insert(movies_table).values( +					trakt_id=movie["trakt_id"], title=movie["title"], genres=" ".join(movie["genres"]), +					language=movie["language"], year=movie["year"], released=movie["released"], +					runtime=movie["runtime"], country=movie["country"], overview=movie["overview"], +					rating=movie["rating"], votes=movie["votes"], comment_count=movie["comment_count"], +					tagline=movie["tagline"]) +				try: +					result = conn.execute(stmt) +					trans.commit() +				except IntegrityError: +					trans.rollback() +	req_count += 1 +``` + +(Note: I was well within the rate-limit so I did not have to slow down or implement any other measures) + +Running this script took me approximately 3 hours, and resulted in an SQLite database of 141.5 MB + +## Embeddings! + +I did not want to put my poor Mac through the estimated 23 hours it would have taken to embed the sentences. I decided to use Google Colab instead. + +Because of the small size of the database file, I was able to just upload the file. + +For the encoding model, I decided to use the pretrained `paraphrase-multilingual-MiniLM-L12-v2` model for SentenceTransformers, a Python framework for SOTA sentence, text and image embeddings. I wanted to use a multilingual model as I personally consume content in various languages (natively, no dubs or subs) and some of the sources for their information do not translate to English. As of writing this post, I did not include any other database except Trakt.  + +While deciding how I was going to process the embeddings, I came across multiple solutions: + +* [Milvus](https://milvus.io) - An open-source vector database with similar search functionality + +* [FAISS](https://faiss.ai) - A library for efficient similarity search + +* [Pinecone](https://pinecone.io) - A fully managed vector database with similar search functionality + +I did not want to waste time setting up the first two, so I decided to go with Pinecone which offers 1M 768-dim vectors for free with no credit card required (Our embeddings are 384-dim dense). + +Getting started with Pinecone was as easy as: + +* Signing up + +* Specifying the index name and vector dimensions along with the similarity search metric (Cosine Similarity for our use case) + +* Getting the API key + +* Installing the Python module (pinecone-client) + +```python +import pandas as pd +import pinecone +from sentence_transformers import SentenceTransformer +from tqdm import tqdm  + +database_url = "sqlite:///jlm.db" +PINECONE_KEY = "not-this-at-all" +batch_size = 32 + +pinecone.init(api_key=PINECONE_KEY, environment="us-west1-gcp") +index = pinecone.Index("movies") + +model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2", device="cuda") +engine, Session = init_db_stuff(database_url) + +df = pd.read_sql("Select * from movies", engine) +df["combined_text"] = df["title"] + ": " + df["overview"].fillna('') + " -  " + df["tagline"].fillna('') + " Genres:-  " + df["genres"].fillna('') + +# Creating the embedding and inserting it into the database +for x in tqdm(range(0,len(df),batch_size)): +	to_send = [] +	trakt_ids = df["trakt_id"][x:x+batch_size].tolist() +	sentences = df["combined_text"][x:x+batch_size].tolist() +	embeddings = model.encode(sentences) +	for idx, value in enumerate(trakt_ids): +		to_send.append( +			( +				str(value), embeddings[idx].tolist() +			)) +	index.upsert(to_send) +``` + +That's it! + +## Interacting with Vectors + +We use the `trakt_id` for the movie as the ID for the vectors and upsert it into the index.  + +To find similar items, we will first have to map the name of the movie to its trakt_id, get the embeddings we have for that id and then perform a similarity search. It is possible that this additional step of mapping could be avoided by storing information as metadata in the index. + +```python +def get_trakt_id(df, title: str): +  rec = df[df["title"].str.lower()==movie_name.lower()] +  if len(rec.trakt_id.values.tolist()) > 1: +    print(f"multiple values found... {len(rec.trakt_id.values)}") +    for x in range(len(rec)): +      print(f"[{x}] {rec['title'].tolist()[x]} ({rec['year'].tolist()[x]}) - {rec['overview'].tolist()}") +      print("===") +      z = int(input("Choose No: ")) +      return rec.trakt_id.values[z] +  return rec.trakt_id.values[0] + +def get_vector_value(trakt_id: int): +  fetch_response = index.fetch(ids=[str(trakt_id)]) +  return fetch_response["vectors"][str(trakt_id)]["values"] + +def query_vectors(vector: list, top_k: int = 20, include_values: bool = False, include_metada: bool = True): +  query_response = index.query( +      queries=[ +          (vector), +      ], +      top_k=top_k, +      include_values=include_values, +      include_metadata=include_metada +  ) +  return query_response + +def query2ids(query_response): +  trakt_ids = [] +  for match in query_response["results"][0]["matches"]: +    trakt_ids.append(int(match["id"])) +  return trakt_ids + +def get_deets_by_trakt_id(df, trakt_id: int): +  df = df[df["trakt_id"]==trakt_id] +  return { +      "title": df.title.values[0], +      "overview": df.overview.values[0], +      "runtime": df.runtime.values[0], +      "year": df.year.values[0] +  } +``` + +### Testing it Out + +```python +movie_name = "Now You See Me" + +movie_trakt_id = get_trakt_id(df, movie_name) +print(movie_trakt_id) +movie_vector = get_vector_value(movie_trakt_id) +movie_queries = query_vectors(movie_vector) +movie_ids = query2ids(movie_queries) +print(movie_ids) + +for trakt_id in movie_ids: +  deets = get_deets_by_trakt_id(df, trakt_id) +  print(f"{deets['title']} ({deets['year']}): {deets['overview']}") +``` + +Output: + +``` +55786 +[55786, 18374, 299592, 662622, 6054, 227458, 139687, 303950, 70000, 129307, 70823, 5766, 23950, 137696, 655723, 32842, 413269, 145994, 197990, 373832] +Now You See Me (2013): An FBI agent and an Interpol detective track a team of illusionists who pull off bank heists during their performances and reward their audiences with the money. +Trapped (1949): U.S. Treasury Department agents go after a ring of counterfeiters. +Brute Sanity (2018): An FBI-trained neuropsychologist teams up with a thief to find a reality-altering device while her insane ex-boss unleashes bizarre traps to stop her. +The Chase (2017): Some FBI agents hunt down a criminal +Surveillance (2008): An FBI agent tracks a serial killer with the help of three of his would-be victims - all of whom have wildly different stories to tell. +Marauders (2016): An untraceable group of elite bank robbers is chased by a suicidal FBI agent who uncovers a deeper purpose behind the robbery-homicides. +Miracles for Sale (1939): A maker of illusions for magicians protects an ingenue likely to be murdered. +Deceptors (2005): A Ghostbusters knock-off where a group of con-artists create bogus monsters to scare up some cash. They run for their lives when real spooks attack. +The Outfit (1993): A renegade FBI agent sparks an explosive mob war between gangster crime lords Legs Diamond and Dutch Schultz. +Bank Alarm (1937): A federal agent learns the gangsters he's been investigating have kidnapped his sister. +The Courier (2012): A shady FBI agent recruits a courier to deliver a mysterious package to a vengeful master criminal who has recently resurfaced with a diabolical plan. +After the Sunset (2004): An FBI agent is suspicious of two master thieves, quietly enjoying their retirement near what may - or may not - be the biggest score of their careers. +Down Three Dark Streets (1954): An FBI Agent takes on the three unrelated cases of a dead agent to track down his killer. +The Executioner (1970): A British intelligence agent must track down a fellow spy suspected of being a double agent. +Ace of Cactus Range (1924): A Secret Service agent goes undercover to unmask the leader of a gang of diamond thieves. +Firepower (1979): A mercenary is hired by the FBI to track down a powerful recluse criminal, a woman is also trying to track him down for her own personal vendetta. +Heroes & Villains (2018): an FBI agent chases a thug to great tunes +Federal Fugitives (1941): A government agent goes undercover in order to apprehend a saboteur who caused a plane crash. +Hell on Earth (2012): An FBI Agent on the trail of a group of drug traffickers learns that their corruption runs deeper than she ever imagined, and finds herself in a supernatural - and deadly - situation. +Spies (2015): A secret agent must perform a heist without time on his side +``` + +For now, I am happy with the recommendations. + +## Simple UI + +The code for the flask app can be found on GitHub: [navanchauhan/FlixRec](https://github.com/navanchauhan/FlixRec) or on my [Gitea instance](https://pi4.navan.dev/gitea/navan/FlixRec) + +I quickly whipped up a simple Flask App to deal with problems of multiple movies sharing the title, and typos in the search query. + +### Home Page + + + +### Handling Multiple Movies with Same Title + + + +### Results Page + + + +Includes additional filter options + + + +Test it out at [https://flixrec.navan.dev](https://flixrec.navan.dev) + +## Current Limittations + +* Does not work well with popular franchises +* No Genre Filter + +## Future Addons + +* Include Cast Data +	* e.g. If it sees a movie with Tom Hanks and Meg Ryan, then it will boost similar movies including them +	* e.g. If it sees the movie has been directed my McG, then it will boost similar movies directed by them +* REST API +* TV Shows +* Multilingual database +* Filter based on popularity: The data already exists in the indexed database
\ No newline at end of file | 
