Movie Recommender:
Content-Based Filtering

Data Science

• June 01, 2025

• Daniel Querales

A practical implementation of a Content-Based Recommendation System, focusing on recommending movies based on similarity metrics derived from plot summaries and metadata.

Project Overview and Methodology

This system uses **Content-Based Filtering**, which suggests items similar to those a user has liked in the past. It relies purely on item features (the "content") rather than user behavior or ratings (which is the basis of Collaborative Filtering).

Dataset: TMDB Movie Metadata (Titles, Genres, Taglines, Plot Summaries).
Goal: Generate a list of the 10 most similar movies to a user-selected movie.
Key Technique: TF-IDF Vectorization and Cosine Similarity.
Tech Stack: Python, Pandas, Scikit-Learn.

Step 1: Feature Engineering and Text Processing

The core of content-based recommendation is representing each movie's content (plot summary, genres, keywords) as a numerical vector. Here we vectorize the plot summaries with the Term Frequency-Inverse Document Frequency (TF-IDF) method, which gives more weight to words that appear often in one summary but rarely across the rest of the collection.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset (simulation)
df = pd.read_csv('tmdb_movies.csv')
# Use a subset for performance; reset the index so row labels match matrix row positions
df = df.dropna(subset=['overview']).head(1000).reset_index(drop=True)

# Initialize the TF-IDF Vectorizer
# Stop words (like 'the', 'a') are removed to focus on meaningful words
tfidf = TfidfVectorizer(stop_words='english')

# Apply the vectorizer to the 'overview' (plot summary) column
tfidf_matrix = tfidf.fit_transform(df['overview'])

print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
# Each row is a movie, each column is a word/token with its importance score.
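The matrix above is built from the `overview` column alone. One possible way to bring in the other metadata fields is to join them into a single text column before vectorizing. The sketch below assumes hypothetical `tagline` and `genres` columns that already hold plain strings; the real file may store these fields in another format and need parsing first.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in DataFrame; column names and contents are assumptions for illustration
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "overview": ["A heist goes wrong.", None],
    "tagline": [None, "Love finds a way."],
    "genres": ["Crime Thriller", "Romance Drama"],
})

text_columns = ["overview", "tagline", "genres"]

# Replace missing values with empty strings, then join the fields with spaces
toy_df["content"] = (
    toy_df[text_columns]
    .fillna("")
    .agg(" ".join, axis=1)
    .str.strip()
)

# Vectorize the combined text instead of the overview alone
content_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["content"])

print(toy_df[["title", "content"]])
print(content_matrix.shape)
```

Filling missing values first means a movie with no overview can still be represented by its tagline and genres, rather than being dropped.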

Step 2: Calculating Similarity

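Once all movies are represented as vectors, we use **Cosine Similarity** to measure how closely they point in the same direction. Cosine similarity is the cosine of the angle between two vectors: the smaller the angle, the closer the score is to 1 and the more similar the movies. Because TF-IDF weights are never negative, scores here range from 0 (no shared terms) to 1 (identical direction).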

from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
# This matrix stores the similarity score between every pair of movies.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(f"Cosine Similarity Matrix Shape: {cosine_sim.shape}")
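For intuition, the same score can be computed by hand as cos(θ) = (u · v) / (‖u‖ ‖v‖). The sketch below uses three small made-up vectors and a hypothetical helper named `cosine`, then checks the result against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

print(cosine(a, b))  # approximately 1: length does not matter, only direction
print(cosine(a, c))  # 0.0: orthogonal vectors

# The pairwise matrix from scikit-learn agrees with the manual formula
sk = cosine_similarity(np.vstack([a, b, c]))
print(np.isclose(sk[0, 1], cosine(a, b)))
```

Because the score ignores vector length, a long plot summary and a short one can still be rated as highly similar if they use the same distinctive words.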

Step 3: Building the Recommender Function

The final step is to create a function that takes a movie title as input, finds its index, retrieves its similarity scores from the matrix, sorts the movies by score, and returns the top 10 most similar movie titles.

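# Map movie titles to their row position in the DataFrame
# (positions rather than index labels, because cosine_sim is addressed by position)
indices = pd.Series(range(len(df)), index=df['title'])
# Keep the first entry for any repeated title so a lookup returns a single position
indices = indices[~indices.index.duplicated(keep='first')]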

def get_recommendations(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # 1. Get the index of the movie that matches the title
    idx = indices[title]

    # 2. Get the similarity scores for that movie with all other movies
    sim_scores = list(enumerate(cosine_sim[idx]))

    # 3. Sort the movies based on the similarity score
    # The slice [1:11] skips the first entry (the movie itself, with a similarity of 1)
    # and keeps the next 10
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]

    # 4. Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # 5. Return the top 10 recommended movie titles
    return df['title'].iloc[movie_indices]

# Example Usage
recommended_movies = get_recommendations('The Dark Knight Rises')

print("Recommended Movies for 'The Dark Knight Rises':")
for i, movie in enumerate(recommended_movies):
    print(f"{i+1}. {movie}")
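The function above raises a `KeyError` for a title that is not in the lookup, and it returns titles without their scores. A possible variant, sketched here on made-up data, returns both and yields an empty result for unknown titles. The name `recommend_with_scores`, its parameters, and the toy movies are all illustrative assumptions, and the sketch assumes each title appears only once in the lookup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles and overviews, used only to keep the example runnable
toy_df = pd.DataFrame({
    "title": ["Space Run", "Star Chase", "Quiet Garden", "Garden Party"],
    "overview": [
        "pilots race spaceships across the galaxy",
        "rival pilots chase each other through the galaxy",
        "a widow restores an old garden",
        "neighbors fight over a garden party",
    ],
})

toy_sim = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(toy_df["overview"])
)
toy_positions = pd.Series(range(len(toy_df)), index=toy_df["title"])


def recommend_with_scores(title, sim, frame, positions, top_n=10):
    """Return the top_n most similar movies as a DataFrame of title and score.

    Returns an empty DataFrame when the title is not in `positions`.
    """
    if title not in positions.index:
        return pd.DataFrame(columns=["title", "score"])

    pos = int(positions[title])
    # Index the scores by row position and remove the query movie explicitly,
    # rather than assuming it always sorts first
    scores = pd.Series(sim[pos]).drop(pos)
    top = scores.nlargest(top_n)
    return pd.DataFrame({
        "title": frame["title"].iloc[top.index].to_numpy(),
        "score": top.to_numpy(),
    })


print(recommend_with_scores("Space Run", toy_sim, toy_df, toy_positions, top_n=2))
print(recommend_with_scores("Not A Movie", toy_sim, toy_df, toy_positions).empty)  # True
```

Returning the scores makes it easier to spot weak recommendations, such as results whose similarity is zero because they share no terms with the selected movie.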

Results and Scalability

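Effectiveness: The Content-Based method can recommend highly relevant, niche items and avoids the item-side "cold start" problem: a brand new movie can be recommended as soon as its metadata is available, with no ratings required. It still needs at least one movie the user likes as a starting point.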
Limitations: The system struggles with recommending diverse items (it only recommends what is similar) and requires deep, high-quality metadata.
Future Work: For real-world scale, this system would be combined with a Collaborative Filtering approach (creating a Hybrid Recommender) to leverage the benefits of both item-to-item similarity and user behavior.
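As a rough idea of what such a hybrid could look like, the sketch below blends two score lists with a weighted average after scaling each to the range 0 to 1. This is only one of several ways to combine the two approaches; the function name `blend_scores`, the `content_weight` parameter, and all the numbers are made up for illustration.

```python
import numpy as np


def blend_scores(content_scores, collab_scores, content_weight=0.5):
    """Weighted average of two score arrays after min-max scaling each to [0, 1]."""
    def scale(values):
        values = np.asarray(values, dtype=float)
        span = values.max() - values.min()
        # A constant array carries no ranking information, so map it to zeros
        return np.zeros_like(values) if span == 0 else (values - values.min()) / span

    return (content_weight * scale(content_scores)
            + (1 - content_weight) * scale(collab_scores))


# Made-up scores for four candidate movies, for illustration only
content = [0.80, 0.10, 0.40, 0.05]  # e.g. cosine similarity to the selected movie
collab = [1.5, 4.5, 4.0, 1.0]       # e.g. predicted ratings from a collaborative model

hybrid = blend_scores(content, collab, content_weight=0.5)
ranking = np.argsort(-hybrid)  # candidate positions, best first
print(ranking)
```

With these numbers the third candidate ranks first in the blend even though the first candidate has the highest content score, which shows how user behavior can reorder purely content-based results.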