Movie Recommender: Content-Based Filtering
A practical implementation of a Content-Based Recommendation System, focusing on recommending movies based on similarity metrics derived from plot summaries and metadata.
Project Overview and Methodology
This system uses **Content-Based Filtering**, which suggests items similar to those a user has liked in the past. It relies purely on item features (the "content") rather than user behavior or ratings (which is the basis of Collaborative Filtering).
- Dataset: TMDB Movie Metadata (Titles, Genres, Taglines, Plot Summaries).
- Goal: Generate a list of the 10 most similar movies to a user-selected movie.
- Key Technique: TF-IDF Vectorization and Cosine Similarity.
- Tech Stack: Python, Pandas, Scikit-Learn.
Step 1: Feature Engineering and Text Processing
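The core of content-based recommendation is representing the movie's content (plot summary, genres, keywords) as a numerical vector. We use the Term Frequency-Inverse Document Frequency (TF-IDF) method to weigh the importance of words in the plot summaries: a word scores highly when it is frequent in one summary but rare across the rest of the dataset. The code below vectorizes only the plot summaries (the `overview` column).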
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Load dataset (simulation)
df = pd.read_csv('tmdb_movies.csv')
df = df.dropna(subset=['overview']).head(1000) # Use a subset for performance
df = df.reset_index(drop=True) # Keep index labels aligned with matrix row positions
# Initialize the TF-IDF Vectorizer
# Stop words (like 'the', 'a') are removed to focus on meaningful words
tfidf = TfidfVectorizer(stop_words='english')
# Apply the vectorizer to the 'overview' (plot summary) column
tfidf_matrix = tfidf.fit_transform(df['overview'])
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
# Each row is a movie, each column is a word/token with its importance score.
Step 2: Calculating Similarity
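Once all movies are represented as vectors, we use **Cosine Similarity** to measure how closely any two of them point in the same direction. Cosine similarity is the cosine of the angle between two vectors: a smaller angle gives a value closer to 1, which means higher similarity. Because TF-IDF weights are non-negative, the scores here range from 0 (no shared terms) to 1 (identical direction).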
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity matrix
# This matrix stores the similarity score between every pair of movies.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f"Cosine Similarity Matrix Shape: {cosine_sim.shape}")
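As a small check on the formula behind this step, the sketch below computes cos(θ) = (a · b) / (‖a‖ ‖b‖) by hand and compares it with Scikit-Learn's result. The vectors `a` and `b` are arbitrary illustrative values, not rows from the TF-IDF matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Two hypothetical feature vectors, chosen only to illustrate the formula
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 1.0])

# cos(theta) = (a . b) / (||a|| * ||b||)
manual = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
library = float(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0])
print(manual, library)

# Identical directions score 1; vectors with no shared terms score 0
same_direction = float(cosine_similarity([[1, 1, 0]], [[3, 3, 0]])[0, 0])
no_overlap = float(cosine_similarity([[1, 0, 0]], [[0, 1, 0]])[0, 0])
print(same_direction, no_overlap)

# For rows that already have unit length, the plain dot product gives the same value
a_unit = (a / np.linalg.norm(a)).reshape(1, -1)
b_unit = (b / np.linalg.norm(b)).reshape(1, -1)
dot_only = float(linear_kernel(a_unit, b_unit)[0, 0])
print(dot_only)
```

`TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so for a TF-IDF matrix built with default settings `linear_kernel` returns the same scores as `cosine_similarity` while skipping the normalization step.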
Step 3: Building the Recommender Function
The final step is to create a function that takes a movie title as input, finds its index, retrieves its similarity scores from the matrix, sorts the movies by score, and returns the top 10 most similar movie titles.
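# Map movie titles to their index in the DataFrame
# If a title appears more than once, keep only its first occurrence
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]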
def get_recommendations(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # 1. Get the index of the movie that matches the title
    idx = indices[title]
    # 2. Get the similarity scores for that movie with all other movies
    sim_scores = list(enumerate(cosine_sim[idx]))
    # 3. Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the movie itself (it has a similarity of 1 with itself) and keep the next 10
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]
    # 4. Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # 5. Return the top 10 recommended movie titles
    return df['title'].iloc[movie_indices]
# Example Usage
recommended_movies = get_recommendations('The Dark Knight Rises')
print("Recommended Movies for 'The Dark Knight Rises':")
for i, movie in enumerate(recommended_movies, start=1):
    print(f"{i}. {movie}")
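The example above needs the TMDB CSV file on disk. As a self-contained sketch of the same three steps, the snippet below runs them on a four-movie catalogue and returns an empty list for a title that is not in the catalogue. The titles, overviews, and the names `toy_df`, `toy_sim`, `toy_indices`, and `recommend_or_empty` are all hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical miniature catalogue standing in for tmdb_movies.csv
toy_df = pd.DataFrame({
    'title': ['Space Rescue', 'Galaxy Rescue', 'Courtroom Drama', 'Legal Battle'],
    'overview': [
        'Astronauts attempt a daring rescue on a damaged space station.',
        'A pilot leads a rescue mission to a stranded space crew.',
        'A young lawyer defends an innocent man in a tense trial.',
        'Two lawyers clash in a trial that divides a small town.',
    ],
})

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_df['overview'])
toy_sim = cosine_similarity(toy_matrix)
toy_indices = pd.Series(toy_df.index, index=toy_df['title'])

def recommend_or_empty(title, top_n=2):
    """Return up to top_n similar titles, or an empty list for an unknown title."""
    if title not in toy_indices.index:
        return []
    idx = toy_indices[title]
    # Rank all other movies by similarity, highest first
    ranked = sorted(
        ((i, s) for i, s in enumerate(toy_sim[idx]) if i != idx),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [toy_df['title'].iloc[i] for i, _ in ranked[:top_n]]

print(recommend_or_empty('Space Rescue'))
print(recommend_or_empty('Not In The Catalogue'))
```

Returning an empty list is one possible design choice; raising a descriptive error or suggesting close title matches would work equally well, depending on how the function is used.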
Results and Scalability
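- Effectiveness: The Content-Based method excels at recommending highly relevant, niche items and handles the item-side "cold start" problem well: a brand new movie can be recommended as soon as its plot summary is available, with no ratings required. It still needs at least one movie the user already likes as a starting point.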
- Limitations: The system struggles with recommending diverse items (it only recommends what is similar) and requires deep, high-quality metadata.
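- Future Work: For real-world use, this system would be combined with a Collaborative Filtering approach (creating a Hybrid Recommender) to leverage the benefits of both item-to-item similarity and user behavior. The dense similarity matrix also grows quadratically with the number of movies, so a larger catalogue would need scores computed on demand or an approximate nearest-neighbor index.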
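On the scalability side, one option is to skip the full movie-by-movie similarity matrix and score a single movie against the catalogue only when a recommendation is requested. The sketch below is a minimal illustration of that idea under stated assumptions: the plots, the `matrix` variable, and the `top_k_similar` helper are hypothetical, and it relies on `TfidfVectorizer` producing L2-normalized rows by default so that `linear_kernel` yields cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical plots; in the project this would be df['overview']
plots = [
    'A masked vigilante protects a city from a criminal mastermind.',
    'A vigilante returns to defend his city against a masked terrorist.',
    'A chef opens a small restaurant in Paris.',
    'Two friends start a restaurant and learn to cook.',
]
matrix = TfidfVectorizer(stop_words='english').fit_transform(plots)

def top_k_similar(idx, k=2):
    """Return positions of the k rows most similar to row idx, best first."""
    # Score one sparse row against all rows: one score per movie,
    # instead of storing a score for every pair of movies
    scores = linear_kernel(matrix[idx], matrix).ravel()
    scores[idx] = -np.inf                    # never recommend the query itself
    k = min(k, len(scores) - 1)
    top = np.argpartition(scores, -k)[-k:]   # top k positions, unordered
    return top[np.argsort(scores[top])[::-1]].tolist()

print(top_k_similar(0, k=1))
```

This trades a little latency per request for memory that grows linearly with the catalogue rather than quadratically.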