Multilabel Classification on Genre
A good movie recommendation is one where the characteristics of the movie match what the user wants. If I want to watch a romantic movie, I will look through the romance category. The problem is: what qualifies a movie for the romance category? So the first thing I am going to do is write an algorithm that can detect which genres a movie belongs to.
Method 1 :
X variable used to train: Keywords (contains important words representing the movie)
Y variable: Genres (contains the genres each movie is classified as)
The X variable is cleaned by removing all commas and transformed into a tf-idf vector with TfidfVectorizer. After that, the X variable can be fed to a learning algorithm.
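A minimal sketch of this cleaning and vectorizing step is shown here. The keyword strings are made up for illustration, since the real Keywords column is not reproduced in this write-up, and the default TfidfVectorizer settings are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical keyword strings; the real Keywords column is not shown here.
keywords = [
    "space, alien, future",
    "love, wedding, paris",
    "alien, invasion, war",
]

# Remove the commas so keywords are separated only by whitespace.
cleaned = [k.replace(",", " ") for k in keywords]

# Default settings are an assumption; the original parameters are unknown.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)  # sparse matrix, one row per movie

print(X.shape)                        # (number of movies, vocabulary size)
print(sorted(vectorizer.vocabulary_)) # the learned vocabulary
```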
The Y variable is encoded using MultiLabelBinarizer, which allows multiple labels per instance. An example is shown below.
[["action", "drama", "fantasy"],
 ["fantasy", "action"],
 ["drama"],
 ["sci-fi", "drama"]]
--->
action   drama   fantasy   sci-fi
1        1       1         0
1        0       1         0
0        1       0         0
0        1       0         1
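The encoding in the example above can be reproduced with a few lines. This is a small sketch using the same four genre lists; MultiLabelBinarizer sorts the class names alphabetically, which gives the column order action, drama, fantasy, sci-fi.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# The same genre lists as in the example above.
genres = [
    ["action", "drama", "fantasy"],
    ["fantasy", "action"],
    ["drama"],
    ["sci-fi", "drama"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # one row per movie, one column per genre

print(mlb.classes_)  # column order of the binary matrix
print(Y)
```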
The X and Y variables are now ready for use. I do a basic train/test split and train the model on the training set. The algorithm I used combines Logistic Regression with a One-vs-Rest classifier: OneVsRest turns the multilabel problem into one binary classification per genre. I used the F1 score to measure the quality of genre predictions on the test set and plotted a box plot for 10 instances.
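The training step can be sketched roughly as follows. The toy data, the 80/20 split, the fixed random_state, max_iter, and the micro-averaged F1 are all assumptions made for this illustration; the write-up does not state which settings or which F1 averaging were actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, made up for illustration (repeated so the split has enough rows).
samples = [
    ("space alien future", ["sci-fi", "action"]),
    ("love wedding paris", ["romance"]),
    ("alien invasion war", ["sci-fi", "action"]),
    ("love letter summer", ["romance", "drama"]),
    ("war soldier family", ["action", "drama"]),
    ("family secret love", ["drama", "romance"]),
] * 5
texts = [text for text, _ in samples]
labels = [genres for _, genres in samples]

X = TfidfVectorizer().fit_transform(texts)
Y = MultiLabelBinarizer().fit_transform(labels)

# Split size and random_state are assumptions.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# One binary logistic regression is fitted per genre column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)
# Micro averaging is an assumption; other averaging modes are equally valid.
score = f1_score(Y_test, Y_pred, average="micro")
print(score)
```

Repeating the split with different random_state values and collecting the scores would give the kind of distribution a box plot summarizes.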
Method 2 :
I changed the X variable from the keywords column to the description column. I cleaned the description by removing stop words and punctuation.
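A minimal sketch of this kind of cleaning is shown here. The stop-word set is a tiny hypothetical one and the lowercasing is an added assumption; the write-up does not say which stop-word list was actually used.

```python
import string

# Tiny hypothetical stop-word set; the list actually used is not stated.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "his", "her"}

def clean_description(text):
    """Lowercase the text, strip punctuation, then drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in STOP_WORDS)

print(clean_description("A hacker learns the truth, and joins the rebellion!"))
# -> hacker learns truth joins rebellion
```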
Method 1 and Method 2 yield almost the same prediction accuracy as measured by F1 score.
Movie Recommendations by Rating and Popularity
Although the data already provides an average rating, it can be quite misleading on its own. When comparing two average ratings, the number of votes matters a great deal. Movie 1 can have a rating of 9.0 from only 3 votes, while Movie 2 has a rating of 8.0 from 50 votes. In this situation Movie 1's rating is unreliable because of the small number of voters. To tackle this problem, IMDb has a weighted rating formula.
Weighted Rating = (vR)/(v+m) + (mC)/(v+m)
m = minimum votes required to be listed in the chart (we will assume movies at or above the 85th percentile of vote counts are relevant)
C = mean of average rating for all movies
R = average rating for specific movie
v = number of votes for specific movie
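The formula can be checked on the Movie 1 and Movie 2 example from above. In this sketch the values of m and C are made up for illustration; in the project they come from the data (the 85th percentile of vote counts and the mean rating over all movies).

```python
def weighted_rating(v, R, m, C):
    """Weighted Rating = (vR)/(v+m) + (mC)/(v+m)."""
    return (v * R) / (v + m) + (m * C) / (v + m)

# Made-up values for illustration only.
m = 30   # assumed minimum number of votes
C = 7.0  # assumed mean rating across all movies

movie_1 = weighted_rating(v=3, R=9.0, m=m, C=C)   # 9.0 from only 3 votes
movie_2 = weighted_rating(v=50, R=8.0, m=m, C=C)  # 8.0 from 50 votes

print(movie_1, movie_2)
```

With these assumed values Movie 1 is pulled strongly toward the mean C because it has few votes, so Movie 2 ends up with the higher weighted rating even though its raw average is lower.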
After calculating the weighted rating, I created a function that recommends the top 20 movies by genre.
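Such a function might look like the sketch below. The function name top_movies_by_genre, the column names (title, genres, weighted_rating) and the sample rows are hypothetical; it assumes the weighted rating has already been stored in a column and that genres are stored as lists.

```python
import pandas as pd

# Hypothetical data; column names and values are made up for illustration.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["romance", "drama"], ["action"], ["romance"], ["romance", "comedy"]],
    "weighted_rating": [7.6, 8.1, 7.9, 6.5],
})

def top_movies_by_genre(movies, genre, n=20):
    """Return the n movies of the given genre with the highest weighted rating."""
    in_genre = movies["genres"].apply(lambda genres: genre in genres)
    return (
        movies[in_genre]
        .sort_values("weighted_rating", ascending=False)
        .head(n)
    )

print(top_movies_by_genre(movies, "romance", n=2)["title"].tolist())
```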
I am also interested in some exploratory analysis of genre ratings.
Movie Recommendations by Plot Similarity
Method 1 :
Each movie has a description column containing a plot summary. To recommend a movie based on its description, I need a way to compare these texts and measure their similarity. I vectorized the descriptions with tf-idf just as before and computed cosine similarity scores on the resulting matrix. Cosine similarity gives a numeric value that denotes the level of similarity between two movies.
Lastly, I wrote a function that returns the 10 movies with the highest similarity scores, representing the 10 movies most similar to the last movie watched.
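The idea can be sketched on a toy catalogue. The titles, descriptions, and the function name most_similar are hypothetical, and using the built-in English stop-word list is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and descriptions, made up for illustration.
titles = ["Space War", "Alien Planet", "Paris Love", "Wedding Day"]
descriptions = [
    "a pilot fights alien ships in a space war",
    "explorers land on an alien planet in deep space",
    "two strangers fall in love in paris",
    "a wedding in paris brings old love back",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
sim = cosine_similarity(tfidf)  # sim[i, j] = similarity of movie i and movie j

def most_similar(title, n=10):
    """Return up to n titles ranked by similarity to the given title."""
    idx = titles.index(title)
    order = np.argsort(sim[idx])[::-1]  # highest similarity first
    return [titles[i] for i in order if i != idx][:n]

print(most_similar("Space War", n=1))
```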
Method 2 :
Here I changed the X variable from the description to cast, keywords and genre. Since I am using three columns, I concatenated them into a single string per movie and used CountVectorizer to vectorize the strings. CountVectorizer uses raw term counts rather than down-weighting frequent terms as tf-idf does, so cast members who appear in multiple movies keep their significance.
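A rough sketch of this variant follows. The movie records and the helpers to_token and make_soup are hypothetical, and joining each name into a single token (so that "Jane Doe" does not match every other "Jane") is an added assumption, not something stated in this write-up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical records, made up for illustration.
movies = [
    {"cast": ["Jane Doe", "John Roe"], "keywords": ["heist"], "genres": ["crime"]},
    {"cast": ["Jane Doe"], "keywords": ["robbery"], "genres": ["crime"]},
    {"cast": ["Ann Poe"], "keywords": ["wedding"], "genres": ["romance"]},
]

def to_token(name):
    """Collapse a multi-word name into one lowercase token (an assumption)."""
    return name.lower().replace(" ", "")

def make_soup(movie):
    """Concatenate cast, keywords and genres into one string per movie."""
    parts = movie["cast"] + movie["keywords"] + movie["genres"]
    return " ".join(to_token(p) for p in parts)

soups = [make_soup(m) for m in movies]
counts = CountVectorizer().fit_transform(soups)
sim = cosine_similarity(counts)

print(soups[0])
print(sim[0])
```

In this toy data the first two movies share a cast member and a genre, so they score as more similar to each other than either does to the third.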
Movie Recommendations by User History & Plot Similarity
There is a separate CSV containing users and their ratings of movies. I did some left joins and data manipulation to get all the information I need into one dataframe. getTopMov is a function that returns the top 10 movies rated by a particular user ID.
getTopMov(10) <- user id 10
array(['The Usual Suspects', 'The Matrix', 'Sling Blade', 'My Own Private Idaho', 'Runaway Train'], dtype=object)
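A function like this could be sketched as below. The name get_top_movies is a hypothetical stand-in for getTopMov, and the column names and rating values are made up for illustration; the actual CSV layout is not shown in this write-up.

```python
import pandas as pd

# Hypothetical tables; column names and rating values are made up.
ratings = pd.DataFrame({
    "userId": [10, 10, 10, 11],
    "movieId": [1, 2, 3, 1],
    "rating": [5.0, 3.0, 4.5, 2.0],
})
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["The Matrix", "Sling Blade", "Runaway Train"],
})

# Left join keeps every rating row and attaches the movie title to it.
merged = ratings.merge(movies, on="movieId", how="left")

def get_top_movies(user_id, n=10):
    """Return the titles of the n movies this user rated highest."""
    user_rows = merged[merged["userId"] == user_id]
    ranked = user_rows.sort_values("rating", ascending=False)
    return ranked["title"].head(n).to_numpy()

print(get_top_movies(10, n=2))
```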
Method 1 :
Recommend by finding the 10 highest pairwise similarities across the whole list of movies.
Method 2 :
Recommend by finding the 3 highest pairwise similarities for each movie in the list.
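The two strategies can be compared on a small hypothetical similarity matrix. The function names, the matrix values, and the choices to skip already-watched movies and to drop duplicate candidates are all assumptions made for this sketch.

```python
import numpy as np

# Hypothetical symmetric similarity matrix for 5 movies (indices 0..4).
sim = np.array([
    [1.0, 0.2, 0.9, 0.1, 0.4],
    [0.2, 1.0, 0.3, 0.8, 0.5],
    [0.9, 0.3, 1.0, 0.2, 0.6],
    [0.1, 0.8, 0.2, 1.0, 0.7],
    [0.4, 0.5, 0.6, 0.7, 1.0],
])
watched = [0, 1]  # indices of the user's top-rated movies

def recommend_global(sim, watched, n=10):
    """Method 1: rank all (watched, candidate) pairs, keep the n best candidates."""
    seen = set(watched)
    pairs = [
        (sim[w, c], c)
        for w in watched
        for c in range(sim.shape[0])
        if c not in seen
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    picked = []
    for _, candidate in pairs:
        if candidate not in picked:  # drop duplicates (an assumption)
            picked.append(candidate)
        if len(picked) == n:
            break
    return picked

def recommend_per_movie(sim, watched, k=3):
    """Method 2: for each watched movie, keep its k most similar candidates."""
    seen = set(watched)
    result = {}
    for w in watched:
        order = np.argsort(sim[w])[::-1].tolist()  # highest similarity first
        result[w] = [c for c in order if c not in seen][:k]
    return result

print(recommend_global(sim, watched, n=2))
print(recommend_per_movie(sim, watched, k=1))
```

Method 1 can be dominated by a single watched movie that happens to have many close neighbours, while Method 2 guarantees that every movie in the list contributes some recommendations.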