Thursday, December 8, 2016

Movie Recommendation System with Python



The objective in this post is to generate some movie recommendations for a user, given the movies they have already watched and the ratings they gave those movies.

 We will do this a few different ways. We'll also use Pandas, a data analysis library, for most of the
 data preparation and analysis.

 First let's start by downloading the dataset we'll be using. This is the MovieLens dataset,
 which is maintained by the Department of Computer Science at the University of Minnesota.

 There are several datasets available, of varying sizes. Let's download the 100K dataset.
This has 100K data points; each row is a rating given by one user for one movie at a particular date and time.
Check out the README that comes with the data to see all the files that are provided.

 There are 2 files that we are interested in:
u.data has the userId, the movieId, the rating and the date that rating was given.
u.item has a bunch of movie-related details, like the title, genre, IMDb URL etc. We'll just use this file for the movie titles.

 Pandas is a Python library for data analysis; its DataFrame is similar to the data frame in R. We can read data from a CSV, write to a CSV, reshape the data, subset it based on conditions etc.

import pandas as pd

dataFile='/Users/swethakolalapudi/Downloads/ml-100k/u.data'
data=pd.read_csv(dataFile,sep="\t",header=None,
                 names=['userId','itemId','rating','timestamp'])


 This line reads the data file and treats it as a tab-delimited file, i.e. the columns (or values) are separated by \t.
There is no header row in the file (this is specified to Pandas by header=None).
The names list is used as the column names for the data. Pandas would only use leading columns as the row index if the file had more columns than names; here the counts match, so a default row index starting from 0 is assigned.
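
To see these read_csv options without downloading anything, here is a small sketch that parses a few tab-separated rows from an in-memory string. The three rows are illustrative sample values in the same four-column layout as u.data, and the variable names are made up for this example.

```python
import io
import pandas as pd

# Illustrative rows in the u.data layout: userId, itemId, rating, timestamp
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

ratings = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                      names=['userId', 'itemId', 'rating', 'timestamp'])

print(ratings)
print(list(ratings.index))  # the default row index, starting from 0
```

Because four names are passed for four columns, no column is consumed as the row index.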



 In[2]: data.head()

 data is a pandas DataFrame object. There are many ways of indexing, manipulating and subsetting a DataFrame; we'll look at a few of them below.
 head() returns the first few rows of the DataFrame (5 by default)


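 In[3]: movieInfoFile='/Users/swethakolalapudi/Downloads/ml-100k/u.item'
# u.item is not UTF-8 encoded; on Python 3, encoding='latin-1' avoids a decode error
movieInfo=pd.read_csv(movieInfoFile,sep="|", header=None, index_col=False,
                     names=['itemId','title'], usecols=[0,1], encoding='latin-1')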

Here we are reading the movie data. We just care about the itemId (movieId) and the title, so we only read the first two columns - this is specified in usecols. We pass the column names explicitly in names. Note that index_col is set to False. This makes sure that none of the columns in the file are used to create a row index.
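
A minimal sketch of the same idea on an in-memory string: only the first two of three pipe-separated fields are kept. The rows and variable names below are made up for illustration and are not taken from u.item.

```python
import io
import pandas as pd

# Made-up pipe-delimited rows; only the first two fields are wanted
sample = ("1|Toy Story (1995)|01-Jan-1995\n"
          "2|GoldenEye (1995)|01-Jan-1995\n")

movies = pd.read_csv(io.StringIO(sample), sep="|", header=None, index_col=False,
                     names=['itemId', 'title'], usecols=[0, 1])

print(movies)
```

The third field is dropped by usecols, and the row index is the default 0, 1.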

 In[4]: movieInfo.head()
data=pd.merge(data,movieInfo,left_on='itemId',right_on="itemId")

The result is that a column 'title' is added to our data object. This line is very much like an SQL join: we specify the column from each table (dataframe) to join on.
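
Here is a toy version of that join, with two tiny hand-made dataframes (the values and names are invented for this sketch):

```python
import pandas as pd

ratings = pd.DataFrame({'userId': [1, 1, 2],
                        'itemId': [10, 20, 10],
                        'rating': [5, 3, 4]})
titles = pd.DataFrame({'itemId': [10, 20],
                       'title': ['Movie A', 'Movie B']})

# Roughly: SELECT * FROM ratings JOIN titles USING (itemId)
merged = pd.merge(ratings, titles, left_on='itemId', right_on='itemId')
print(merged)
```

Since both key columns have the same name, the result has a single itemId column; the default join type is an inner join.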

 In[5]: data.head()
 Let's now see how we can index the data in the dataframe.
 All the values in a column can be selected simply by the column name
userIds=data.userId          # a Pandas Series object
userIds2=data[['userId']]    # a Pandas DataFrame object

 
 In[6]:userIds.head()
 In[7]:userIds2.head()
 In[8]:type(userIds)
 In[9]:type(userIds2)
 In[10]: data.loc[0:10,['userId']]

loc is an indexer we'll use very heavily. You can give it row
 and column labels, or use boolean indexing.
Give loc a list (or slice) of row labels and a list of column names, in square brackets. Note that a label slice such as 0:10 includes both endpoints
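
A short sketch of these indexing styles on a made-up dataframe (the column values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'userId': [7, 8, 9, 7],
                   'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B'],
                   'rating': [4, 5, 2, 3]})

col_as_series = df.userId        # attribute access gives a Series
col_as_frame = df[['userId']]    # a list of names gives a DataFrame

# Label slice: rows 0, 1 and 2 (the end label is included)
first_rows = df.loc[0:2, ['userId', 'rating']]

# Boolean indexing: only the rows where the condition is True
movie_a = df.loc[df.title == 'Movie A', ['userId', 'rating']]
print(first_rows)
print(movie_a)
```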

 In[11]:toyStoryUsers=data[data.title=="Toy Story (1995)"]

 This will give us a subset dataframe with only the users who have rated Toy Story
In[17]: toyStoryUsers.head()

 You can sort values in the dataframe using the sort_values method. It takes the columns to sort on and, for each column, whether to sort ascending or not
data=data.sort_values(['userId','itemId'],ascending=[False,True])
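
To make the per-column ascending flags concrete, here is a tiny sketch with invented values: userId sorted descending, and itemId ascending within each user.

```python
import pandas as pd

df = pd.DataFrame({'userId': [1, 2, 2, 1],
                   'itemId': [30, 10, 20, 5]})

ordered = df.sort_values(['userId', 'itemId'], ascending=[False, True])
print(ordered)
```

The rows keep their original index labels, so the index is no longer in order after sorting.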

Let's see how many users and how many movies there are

# ids in this dataset are numbered consecutively from 1, so the largest id is also the count
numUsers=max(data.userId)
numMovies=max(data.itemId)

 We can also see how many movies were rated by each user, and the number of users that rated each movie

In[18]: moviesPerUser=data.userId.value_counts()
usersPerMovie=data.title.value_counts()
usersPerMovie
 numUsers
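
A quick sketch of value_counts on a made-up ratings table. It also uses nunique(), which counts distinct values and so does not depend on the ids being consecutive.

```python
import pandas as pd

df = pd.DataFrame({'userId': [1, 1, 1, 2, 3, 3],
                   'title': ['A', 'B', 'C', 'A', 'A', 'B']})

movies_per_user = df.userId.value_counts()   # ratings per user, largest first
users_per_movie = df.title.value_counts()    # ratings per movie, largest first
num_users = df.userId.nunique()

print(movies_per_user)
print(users_per_movie)
print(num_users)
```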


 Let's write a function to find the top N favorite movies of a user:
     1. subset the dataframe to have the rows corresponding to the active user
     2. sort by the rating in descending order
     3. pick the top N rows

 In[19]: def favoriteMovies(activeUser,N):
    topMovies=data[data.userId==activeUser].sort_values(
        ['rating'],ascending=[False])[:N]
    return list(topMovies.title)

print(favoriteMovies(5,3))
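
The same three steps can be tried on a toy table. This sketch uses a hypothetical variant, favorite_movies, that takes the dataframe as an argument instead of relying on the global data object; the ratings are invented.

```python
import pandas as pd

def favorite_movies(ratings, active_user, n):
    # 1. keep only the active user's rows
    user_rows = ratings[ratings.userId == active_user]
    # 2. sort by rating, highest first, and 3. keep the top n rows
    top = user_rows.sort_values('rating', ascending=False)[:n]
    return list(top.title)

toy = pd.DataFrame({'userId': [5, 5, 5, 6],
                    'title': ['A', 'B', 'C', 'A'],
                    'rating': [2, 5, 4, 1]})

print(favorite_movies(toy, 5, 2))  # ['B', 'C']
```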

 Let's get down to finding some recommendations now!

 We'll start by using a neighbour-based collaborative filtering model. The idea is to find the K nearest neighbours of a user and use their ratings to predict the active user's ratings for movies they haven't rated.

 First we'll represent each user as a vector - each element of the vector will be their rating for 1 movie. Since there are 1600 odd movies in all, each user will be represented by a vector that has 1600 odd values. When the user doesn't have any rating for a movie, the corresponding
 element will be blank (NaN). NaN is a value in numpy that represents numbers that don't exist. This is a little tricky - any arithmetic operation involving NaN will give us NaN, and any comparison with NaN is False. So, we'll keep this in mind as we manipulate the vectors
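
A few lines of numpy show how NaN behaves, using a made-up vector of three ratings with one missing:

```python
import numpy as np

ratings = np.array([4.0, np.nan, 2.0])

plain_mean = ratings.mean()        # NaN: the missing value poisons the result
safe_mean = np.nanmean(ratings)    # 3.0: NaN entries are ignored
nan_is_positive = np.nan > 0       # False: comparisons with NaN are False
rated = ~np.isnan(ratings)         # mask of the entries that hold a rating

print(plain_mean, safe_mean, nan_is_positive, rated)
```

This is why functions like np.nanmean and np.isnan come in handy when working with sparse rating vectors.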

 In[20]: userItemRatingMatrix=pd.pivot_table(data, values='rating',
                                    index=['userId'], columns=['itemId'])

userItemRatingMatrix.head()
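
Here is what the pivot does on a tiny made-up table: the long list of (user, item, rating) rows becomes one row per user and one column per item, with NaN where a user has not rated an item.

```python
import pandas as pd

long_form = pd.DataFrame({'userId': [1, 1, 2],
                          'itemId': [10, 20, 10],
                          'rating': [5, 3, 4]})

matrix = pd.pivot_table(long_form, values='rating',
                        index=['userId'], columns=['itemId'])
print(matrix)
```

User 2 never rated item 20, so that cell is NaN.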

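 Now each user has been represented using their ratings. Let's write a function to find the similarity between 2 users. We'll use the Pearson correlation to do so. Note that correlation() in scipy.spatial.distance returns a distance (1 minus the correlation), so we subtract it from 1 to get a similarity
import numpy as np
from scipy.spatial.distance import correlation

 In[24]: def similarity(user1,user2):
    # mean-center each user's ratings; missing ratings stay NaN
    user1=np.array(user1)-np.nanmean(user1)
    user2=np.array(user2)-np.nanmean(user2)
    # items that both users have rated (neither entry is NaN)
    commonItemIds=[i for i in range(len(user1))
                   if not np.isnan(user1[i]) and not np.isnan(user2[i])]
    if len(commonItemIds)==0:
        return 0
    else:
        user1=np.array([user1[i] for i in commonItemIds])
        user2=np.array([user2[i] for i in commonItemIds])
        # correlation() is a distance, so convert it to a similarity
        return 1-correlation(user1,user2)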
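
To make the similarity measure concrete, here is a numpy-only sketch of the Pearson correlation computed over the movies two users have both rated. Keep in mind that scipy's correlation() returns one minus this value, i.e. a distance where larger means less similar. The function name, the toy users and the choice to return 0 for undefined cases are assumptions made for this example.

```python
import numpy as np

def pearson_on_common_items(u, v):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # keep only the positions where both users have a rating
    common = ~np.isnan(u) & ~np.isnan(v)
    if common.sum() < 2:
        return 0.0  # assumption: too little overlap counts as "no similarity"
    u, v = u[common], v[common]
    u = u - u.mean()
    v = v - v.mean()
    denom = np.sqrt((u ** 2).sum() * (v ** 2).sum())
    if denom == 0:
        return 0.0  # a user with constant ratings has no defined correlation
    return float((u * v).sum() / denom)

alice = [5, 3, np.nan, 1]
bob = [4, 2, 5, np.nan]      # agrees with alice on the two shared movies
carol = [1, 3, np.nan, 5]    # rates the shared movies in the opposite order

print(pearson_on_common_items(alice, bob))
print(pearson_on_common_items(alice, carol))
```

Users with the same tastes score close to 1, users with opposite tastes close to -1.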
   
 Using this similarity function, let's find the nearest neighbours of the active user
def nearestNeighbourRatings(activeUser,K):
    similarityMatrix=pd.DataFrame(index=userItemRatingMatrix.index,
                                  columns=['Similarity'])
    for i in userItemRatingMatrix.index:
        similarityMatrix.loc[i]=similarity(userItemRatingMatrix.loc[activeUser],
                                          userItemRatingMatrix.loc[i])
    similarityMatrix=similarityMatrix.sort_values(['Similarity'],ascending=[False])
    nearestNeighbours=similarityMatrix[:K]
    # The above line will give us the K nearest neighbours. We'll now take the
    # nearest neighbours and use their ratings to predict the active user's
    # rating for every movie
    neighbourItemRatings=userItemRatingMatrix.loc[nearestNeighbours.index]
    # A placeholder for the predicted item ratings. Its row index is the list
    # of itemIds, which is the same as the column index of userItemRatingMatrix
    predictItemRating=pd.DataFrame(index=userItemRatingMatrix.columns, columns=['Rating'])
    # Let's fill this up now
    for i in userItemRatingMatrix.columns:
        # start from the active user's average rating
        predictedRating=np.nanmean(userItemRatingMatrix.loc[activeUser])
        for j in neighbourItemRatings.index:
            # only neighbours who have rated movie i contribute (NaN > 0 is False)
            if userItemRatingMatrix.loc[j,i]>0:
                predictedRating += (userItemRatingMatrix.loc[j,i]
                                    -np.nanmean(userItemRatingMatrix.loc[j]))*nearestNeighbours.loc[j,'Similarity']
        predictItemRating.loc[i,'Rating']=predictedRating
    return predictItemRating
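
The prediction for a single movie can be worked through by hand. The sketch below uses three invented neighbours: the predicted score starts at the active user's mean rating, and each neighbour who rated the movie adds their deviation from their own mean, weighted by their similarity. The second value shows a common variation, not used in the function above, that divides by the sum of the absolute similarities to keep the result on the rating scale.

```python
import numpy as np

active_mean = 3.5

# (neighbour's rating for the movie, neighbour's mean rating, similarity)
neighbours = [(5.0, 4.0, 0.9),
              (2.0, 3.0, 0.5),
              (np.nan, 3.0, 0.8)]   # this neighbour has not rated the movie

weighted_deviation = 0.0
similarity_total = 0.0
for rating, mean, sim in neighbours:
    if not np.isnan(rating):
        weighted_deviation += (rating - mean) * sim
        similarity_total += abs(sim)

predicted = active_mean + weighted_deviation                      # about 3.9
normalized = active_mean + weighted_deviation / similarity_total  # about 3.79

print(predicted, normalized)
```

Without the normalization, the score can drift outside the 1 to 5 range when many neighbours contribute, which is fine as long as it is only used to rank movies.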

 Let's now use these predicted Ratings to find the top N Recommendations for the
 active user

def topNRecommendations(activeUser,N):
    predictItemRating=nearestNeighbourRatings(activeUser,10)
    moviesAlreadyWatched=list(userItemRatingMatrix.loc[activeUser]
                              .loc[userItemRatingMatrix.loc[activeUser]>0].index)
    predictItemRating=predictItemRating.drop(moviesAlreadyWatched)
    topRecommendations=predictItemRating.sort_values(['Rating'],ascending=[False])[:N]
    # This will give us the list of itemIds which are the top recommendations
    # Let's find the corresponding movie titles
    topRecommendationTitles=(movieInfo.loc[movieInfo.itemId.isin(topRecommendations.index)])
    return list(topRecommendationTitles.title)
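
A toy run of the last few steps, with invented predicted ratings. One detail worth knowing: isin() returns the matching titles in movieInfo order, not in order of predicted rating. The set_index lookup at the end is one possible way to keep the ranking, shown here as an optional alternative.

```python
import pandas as pd

predicted = pd.DataFrame({'Rating': [4.8, 3.1, 4.2, 2.0]},
                         index=[10, 20, 30, 40])
titles = pd.DataFrame({'itemId': [10, 20, 30, 40],
                       'title': ['A', 'B', 'C', 'D']})
already_watched = [10]

# drop watched movies, then keep the 2 highest predicted ratings
candidates = predicted.drop(already_watched)
top = candidates.sort_values('Rating', ascending=False)[:2]

# titles in movieInfo order (what isin gives us)
unranked_titles = list(titles.loc[titles.itemId.isin(top.index)].title)

# titles in order of predicted rating
ranked_titles = list(titles.set_index('itemId').loc[top.index, 'title'])

print(unranked_titles, ranked_titles)
```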

 Let's take this for a spin
activeUser=5
print(favoriteMovies(activeUser,5))
print(topNRecommendations(activeUser,3))

The above code prints the active user's 5 favorite movies, followed by the top 3 movies recommended for them.

So, using Python, we have created a movie recommendation system.



4 comments:

  1. who is swethakolalapudi and how was his machine involved in doing this exercise? curious

  2. and i see no evidence of the use of Spark libraries here

  3. This comment has been removed by the author.

  4. will this work if the user hasn't liked the movie but it's been watched by him... this movie shouldn't be recommended ideally ..
