The objective of this post is to generate movie recommendations for a user, given
the movies they have already watched and the ratings they gave those movies.
We will do this a few different ways. We'll use Pandas, a data analysis library,
for most of the work.
First, let's download the dataset we'll be using. This is the MovieLens dataset,
which is maintained by the Department of Computer Science at the University of
Minnesota. There are several datasets available in varying sizes. Let's download
the 100K dataset. It has 100K data points; each row is a rating given by one
user to one movie at a particular date and time.
Check out the README that comes with the data to see all the files that are
provided. There are 2 files that we are interested in:
u.data - this has the userId, the movieId, the rating and the date that rating
was given.
u.item - this has a bunch of movie-related details, like the title, genre, IMDb
URL etc. We'll just use this file for the movie titles.
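The README describes the rating timestamps as Unix seconds. Here is a minimal sketch of turning one into a readable date, assuming that unit. The value is only an illustration of the format.

```python
from datetime import datetime, timezone

# Illustrative timestamp in the same format as the last column of u.data.
raw_timestamp = 881250949

# Interpret it as seconds since 1970-01-01 UTC.
rated_at = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(rated_at.isoformat())
```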
Pandas is a Python library for data analysis that works much like data frame
manipulation in R. We can read data from a CSV, write to a CSV, manipulate it
into different shapes, subset the data based on conditions etc.
import pandas as pd

# Path to the ratings file; adjust it to wherever you unpacked the dataset.
dataFile='/Users/swethakolalapudi/Downloads/ml-100k/u.data'
data=pd.read_csv(dataFile,sep="\t",header=None,
                 names=['userId','itemId','rating','timestamp'])
This line reads the data file and treats it as a tab-delimited file, i.e. the
columns (or values) are separated by \t.
There is no header row in the file (this is specified to Pandas by header=None),
so the names list is used as the column names for the data.
Pandas only uses the first column as a row index implicitly when each row has
one more field than there are column names. Here the counts match, so a row
index starting from 0 is assigned.
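To see these options in action without the dataset on disk, here is a small sketch that reads a few sample rows from an in-memory string. The rows are only meant to mimic the layout of u.data, and the variable names are assumptions made for this example.

```python
import io
import pandas as pd

# A few sample rows in the same tab-separated layout as u.data.
sample = io.StringIO("196\t242\t3\t881250949\n"
                     "186\t302\t3\t891717742\n"
                     "22\t377\t1\t878887116\n")

ratings = pd.read_csv(sample, sep="\t", header=None,
                      names=['userId', 'itemId', 'rating', 'timestamp'])

# Four names for four fields, so pandas assigns the default row index.
print(ratings)
print(list(ratings.index))
```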
data.head()
data is a Pandas DataFrame object. There are many ways of indexing, manipulating
and subsetting a DataFrame.
head() returns the first few rows of the DataFrame (5 by default).
# u.item is not UTF-8 encoded, so Python 3 needs an explicit encoding here.
movieInfoFile='/Users/swethakolalapudi/Downloads/ml-100k/u.item'
movieInfo=pd.read_csv(movieInfoFile,sep="|",
                      header=None,index_col=False,
                      names=['itemId','title'],
                      usecols=[0,1],encoding='latin-1')
Here we are reading the movie data. We just care about the itemId (movieId) and
the title, so we only read the first two columns; this is specified in usecols.
We pass the column names explicitly in names. Note that index_col is set to
False. This makes sure that none of the columns in the file are used to create
a row index.
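The same column selection can be tried on a tiny in-memory sample. This is only a sketch: the rows below are hypothetical and have far fewer fields than the real u.item.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited rows with more fields than we need.
sample = io.StringIO("1|Toy Story (1995)|01-Jan-1995|extra\n"
                     "2|GoldenEye (1995)|01-Jan-1995|extra\n")

# usecols keeps only fields 0 and 1; names labels those two columns.
movies = pd.read_csv(sample, sep="|", header=None, index_col=False,
                     names=['itemId', 'title'], usecols=[0, 1])
print(movies)
```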
movieInfo.head()
data=pd.merge(data,movieInfo,left_on='itemId',right_on="itemId")
The result is that a 'title' column is added to our data object. This line is
very much like an SQL join: we specify the column from each table (DataFrame)
to join on.
data.head()
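For readers who think in SQL, here is a small sketch of the same kind of join on two made-up frames. The frame names and values are assumptions for illustration only.

```python
import pandas as pd

ratings = pd.DataFrame({'userId': [1, 1, 2],
                        'itemId': [10, 20, 10],
                        'rating': [5, 3, 4]})
titles = pd.DataFrame({'itemId': [10, 20],
                       'title': ['Movie A', 'Movie B']})

# Roughly: SELECT * FROM ratings JOIN titles USING (itemId)
merged = pd.merge(ratings, titles, on='itemId')
print(merged)
```

When the join column has the same name on both sides, on='itemId' is a shorter way of writing the left_on and right_on pair.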
Let's now see how we can index the data in the DataFrame. All the values in a
column can be selected by the column name.
userIds=data.userId          # a Pandas Series object
userIds2=data[['userId']]    # a Pandas DataFrame object
userIds.head()
userIds2.head()
type(userIds)
type(userIds2)
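The difference between the two forms is easy to check on a toy frame. A minimal sketch, with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({'userId': [1, 2, 3], 'rating': [5, 3, 4]})

as_series = frame['userId']      # single brackets: one-dimensional Series
as_frame = frame[['userId']]     # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```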
loc is an indexer we'll use very heavily. You can give it row and column
labels, or use boolean indexing. Here we give loc a range of row labels and a
list of column names:
data.loc[0:10,['userId']]
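Here is a short sketch of the main ways to call loc, on a made-up frame. One thing to watch for is that label slices include their end label, unlike ordinary Python slices.

```python
import pandas as pd

frame = pd.DataFrame({'userId': [7, 8, 9, 10], 'rating': [5, 3, 4, 2]})

# Label slice: rows 0, 1 and 2 are returned because the end label is included.
first_rows = frame.loc[0:2, ['userId']]

# Explicit lists of row labels and column names.
picked = frame.loc[[0, 3], ['userId', 'rating']]

# Boolean indexing: keep only the rows where the condition is True.
high = frame.loc[frame.rating >= 4, ['userId']]

print(first_rows, picked, high, sep="\n\n")
```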
toyStoryUsers=data[data.title=="Toy Story (1995)"]
This gives us a subset DataFrame with only the rows for ratings of Toy Story,
one row per user who rated it.
toyStoryUsers.head()
You can sort the values in a DataFrame using the sort_values method. It takes
the columns to sort on and whether to sort each one in ascending order.
# Sort by userId descending, then by itemId ascending.
data=data.sort_values(['userId','itemId'],ascending=[False,True])
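A quick sketch of a two-column sort with mixed directions, using made-up values:

```python
import pandas as pd

frame = pd.DataFrame({'userId': [1, 2, 2, 1], 'itemId': [20, 10, 30, 10]})

# userId descending first, then itemId ascending within each user.
ordered = frame.sort_values(['userId', 'itemId'], ascending=[False, True])
print(ordered)
```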
Let's see how many users and how many movies there are.
# Count distinct IDs; max() would only agree if the IDs had no gaps.
numUsers=data.userId.nunique()
numMovies=data.itemId.nunique()
We can also see how many movies were rated by each user, and the number of
users that rated each movie.
moviesPerUser=data.userId.value_counts()
usersPerMovie=data.title.value_counts()
usersPerMovie
numUsers
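The sketch below shows value_counts on a tiny made-up frame, and also why the largest ID and the number of distinct IDs can differ when IDs have gaps.

```python
import pandas as pd

frame = pd.DataFrame({'userId': [1, 1, 2, 5],
                      'title': ['A', 'B', 'A', 'A']})

movies_per_user = frame.userId.value_counts()   # ratings per user
users_per_movie = frame.title.value_counts()    # ratings per title

print(movies_per_user)
print(users_per_movie)

# The largest ID is 5, but only three distinct users appear.
print(frame.userId.max(), frame.userId.nunique())
```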
Let's write a function to find the top N favorite movies of a user. It will:
1. subset the DataFrame to the rows corresponding to the active user
2. sort by the rating in descending order
3. pick the top N rows
def favoriteMovies(activeUser,N):
    topMovies=data[data.userId==activeUser].sort_values(['rating'],ascending=[False])[:N]
    return list(topMovies.title)

print(favoriteMovies(5,3))
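The three steps can be tried on a toy frame. This is a sketch only: top_titles is a hypothetical helper that takes the frame as an argument instead of using the global data object.

```python
import pandas as pd

toy = pd.DataFrame({'userId': [5, 5, 5, 6],
                    'title': ['A', 'B', 'C', 'A'],
                    'rating': [2, 5, 4, 5]})

def top_titles(frame, user, n):
    """Hypothetical helper: the n highest-rated titles for one user."""
    mine = frame[frame.userId == user]                  # step 1: subset
    ranked = mine.sort_values('rating', ascending=False)  # step 2: sort
    return list(ranked.head(n).title)                   # step 3: top n

print(top_titles(toy, 5, 2))
```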
Let's get down to finding some recommendations now!
We'll start by using a neighbour-based collaborative filtering model. The idea
is to find the K nearest neighbours of a user and use their ratings to predict
the active user's ratings for movies they haven't rated.
First we'll represent each user as a vector, where each element is their rating
for one movie. Since there are 1600-odd movies in all, each user will be
represented by a vector that has 1600-odd values. When the user doesn't have a
rating for a movie, the corresponding element will be blank, represented as
NaN. NaN is a value in NumPy that stands for a number that doesn't exist. This
is a little tricky: arithmetic between NaN and any other number gives NaN. So
we'll keep this in mind as we manipulate the vectors.
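A few lines are enough to see how NaN behaves. This sketch uses a made-up ratings vector with one missing entry:

```python
import numpy as np

ratings = np.array([4.0, np.nan, 2.0])

print(ratings + 1)          # the missing entry stays NaN
print(np.mean(ratings))     # nan: one NaN poisons the ordinary mean
print(np.nanmean(ratings))  # 3.0: nanmean skips the missing entry
print(np.nan > 0)           # False: comparisons with NaN are never True
```

The last line is why a test such as rating > 0 can double as a check for "has this movie been rated".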
userItemRatingMatrix=pd.pivot_table(data,values='rating',
                                    index=['userId'],columns=['itemId'])
userItemRatingMatrix.head()
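To make the shape of this matrix concrete, here is the same pivot on three made-up ratings. User 2 has not rated item 20, so that cell comes out as NaN.

```python
import pandas as pd

toy = pd.DataFrame({'userId': [1, 1, 2],
                    'itemId': [10, 20, 10],
                    'rating': [5, 3, 4]})

# One row per user, one column per item, ratings as the cell values.
matrix = pd.pivot_table(toy, values='rating',
                        index=['userId'], columns=['itemId'])
print(matrix)
```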
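Now each user has been represented using their ratings. Let's write a function
to find the similarity between 2 users. We'll use correlation to do so. SciPy's
correlation function returns a distance (1 minus the Pearson correlation), so we
subtract it from 1 to get a similarity where a larger value means the users are
more alike.
import numpy as np
from scipy.spatial.distance import correlation

def similarity(user1,user2):
    # Mean-centre each user's ratings, ignoring movies they haven't rated (NaN).
    user1=np.array(user1)-np.nanmean(user1)
    user2=np.array(user2)-np.nanmean(user2)
    # Keep only the movies that both users have rated.
    commonItemIds=[i for i in range(len(user1))
                   if not np.isnan(user1[i]) and not np.isnan(user2[i])]
    if len(commonItemIds)==0:
        # No movies in common: treat the users as unrelated.
        return 0
    user1=np.array([user1[i] for i in commonItemIds])
    user2=np.array([user2[i] for i in commonItemIds])
    # Convert the correlation distance back into a similarity.
    return 1-correlation(user1,user2)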
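One detail worth verifying is what scipy's correlation actually returns. Here is a minimal sketch on made-up vectors, comparing it with NumPy's Pearson coefficient:

```python
import numpy as np
from scipy.spatial.distance import correlation

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])   # moves with a
c = np.array([4.0, 3.0, 2.0, 1.0])   # moves against a

pearson_ab = np.corrcoef(a, b)[0, 1]

# correlation() is a distance: near 0 for vectors that move together,
# near 2 for vectors that move in opposite directions.
print(correlation(a, b), correlation(a, c))
print(1 - correlation(a, b), pearson_ab)
```

Because it is a distance, a smaller value means more alike. Any code that sorts by this value has to either sort ascending or convert it with 1 - distance first.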
Using this similarity function, let's find the nearest neighbours of the active
user and use their ratings to predict the active user's rating for every movie.
def nearestNeighbourRatings(activeUser,K):
    # Similarity between the active user and every user.
    similarityMatrix=pd.DataFrame(index=userItemRatingMatrix.index,
                                  columns=['Similarity'],dtype=float)
    for i in userItemRatingMatrix.index:
        similarityMatrix.loc[i,'Similarity']=similarity(userItemRatingMatrix.loc[activeUser],
                                                        userItemRatingMatrix.loc[i])
    similarityMatrix=similarityMatrix.sort_values(['Similarity'],ascending=[False])
    # The K most similar users are the nearest neighbours.
    nearestNeighbours=similarityMatrix[:K]
    neighbourItemRatings=userItemRatingMatrix.loc[nearestNeighbours.index]
    # A placeholder for the predicted item ratings. Its row index is the list
    # of itemIds, which is the same as the column index of userItemRatingMatrix.
    predictItemRating=pd.DataFrame(index=userItemRatingMatrix.columns,
                                   columns=['Rating'],dtype=float)
    # Let's fill this up now.
    for i in userItemRatingMatrix.columns:
        # Start from the active user's average rating.
        predictedRating=np.nanmean(userItemRatingMatrix.loc[activeUser])
        for j in neighbourItemRatings.index:
            # Only neighbours who have rated movie i contribute.
            if userItemRatingMatrix.loc[j,i]>0:
                # Add how far above or below their own average the neighbour
                # rated the movie, weighted by their similarity.
                predictedRating+=((userItemRatingMatrix.loc[j,i]
                                   -np.nanmean(userItemRatingMatrix.loc[j]))
                                  *nearestNeighbours.loc[j,'Similarity'])
        predictItemRating.loc[i,'Rating']=predictedRating
    return predictItemRating
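The update inside the inner loop can be traced by hand. Here is a sketch with made-up numbers; the means, ratings and similarities are assumptions for illustration, not values from the dataset.

```python
# The active user's average rating is the starting point.
active_mean = 3.5

# (neighbour's rating for the movie, neighbour's mean rating, similarity)
neighbours = [(5.0, 4.0, 0.9),   # rated it 1 point above their average
              (2.0, 3.0, 0.5)]   # rated it 1 point below their average

predicted = active_mean
for rating, mean, sim in neighbours:
    predicted += (rating - mean) * sim

print(round(predicted, 2))  # 3.9
```

Nothing here divides by the total similarity, so the result is not guaranteed to stay on the 1 to 5 scale. It is probably best read as a score for ranking movies against each other.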
Let's now use these predicted ratings to find the top N recommendations for the
active user.
def topNRecommendations(activeUser,N):
    predictItemRating=nearestNeighbourRatings(activeUser,10)
    # Drop the movies the active user has already rated.
    moviesAlreadyWatched=list(userItemRatingMatrix.loc[activeUser]
                              .loc[userItemRatingMatrix.loc[activeUser]>0].index)
    predictItemRating=predictItemRating.drop(moviesAlreadyWatched)
    # The itemIds of the top N recommendations.
    topRecommendations=predictItemRating.sort_values(['Rating'],ascending=[False])[:N]
    # Find the corresponding movie titles.
    topRecommendationTitles=movieInfo.loc[movieInfo.itemId.isin(topRecommendations.index)]
    return list(topRecommendationTitles.title)
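The drop, sort and title lookup steps can be tried on toy frames. One subtlety is that filtering with isin returns titles in the order of the titles table, not in order of predicted rating. The sketch below uses made-up data and hypothetical variable names to show one way of keeping the ranking order.

```python
import pandas as pd

# Made-up predicted scores, indexed by itemId.
predicted = pd.DataFrame({'Rating': [4.5, 3.1, 4.8, 4.9]},
                         index=[10, 20, 30, 40])
titles = pd.DataFrame({'itemId': [10, 20, 30, 40],
                       'title': ['A', 'B', 'C', 'D']})
already_watched = [40]

# Remove watched items, then keep the two highest scores.
top = (predicted.drop(already_watched)
                .sort_values('Rating', ascending=False)
                .head(2))

# isin keeps the order of the titles table.
by_isin = list(titles.loc[titles.itemId.isin(top.index)].title)

# Looking titles up by the ranked index keeps the ranking order.
ranked = list(titles.set_index('itemId').loc[top.index, 'title'])

print(by_isin, ranked)
```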
Let's take this for a spin.
activeUser=5
print(favoriteMovies(activeUser,5))
print(topNRecommendations(activeUser,3))
The above code prints the active user's favorite movies, followed by the top
recommended movies for that user.
So, using Python, we have created a movie recommendation system.