
BOOK POPULARITY PREDICTION

A machine learning project delivering data-driven strategies for movie production companies to select books with high film adaptation potential.

Python • Machine Learning • Data Analysis • pandas • NumPy • scikit-learn • Matplotlib • Seaborn

MOTIVATION

Books have long been a source of inspiration for blockbuster movies, but how do studios decide which ones have cinematic potential? My team and I set out to answer this question using machine learning. We analysed over 18,000 book titles to predict their likelihood of becoming high-revenue films, giving studios a data-driven basis for their decisions.

TECHNOLOGIES USED

To ensure an efficient and scalable workflow, we utilized a range of tools for data handling, model training, and evaluation:

  • Programming Languages: Python
  • Data Processing: Pandas, NumPy
  • Machine Learning: Scikit-learn, TensorFlow
  • Visualization: Matplotlib, Seaborn
  • Workflow & Version Control: Jupyter Notebook, GitHub

DATA PREPROCESSING AND FEATURE SELECTION

We started by diving into the data, cleaning and preprocessing it to ensure its quality. This involved handling invalid ISBNs, standardizing author names, and dealing with missing or abnormal data in book years, user ages, and user countries.
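The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not our exact pipeline: the function names are hypothetical, the ISBN check only covers the standard ISBN-10 checksum, and the plausible age range of 5 to 100 is an assumed cut-off.

```python
import re


def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` passes the standard ISBN-10 checksum."""
    isbn = isbn.replace("-", "").strip().upper()
    if not re.fullmatch(r"\d{9}[\dX]", isbn):
        return False
    digits = [10 if ch == "X" else int(ch) for ch in isbn]
    # Weights run from 10 down to 1; a valid ISBN-10 sums to a multiple of 11.
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0


def standardize_author(name: str) -> str:
    """Collapse extra whitespace and normalise case so spelling variants match."""
    return " ".join(name.split()).title()


def clean_age(age, low=5, high=100):
    """Return the age as an int, or None if it is missing or outside [low, high].

    The bounds are assumptions chosen for illustration.
    """
    if age is None:
        return None
    try:
        value = int(age)
    except (TypeError, ValueError):
        return None
    return value if low <= value <= high else None
```

Simple title-casing will mangle some names (for example "McDonald"), so a real pipeline would likely need a few extra rules on top of this.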

To assess a book's popularity, we needed a metric that went beyond simple average ratings. We considered both the mean rating and the number of people who had rated the book. In our approach, a book's average rating was given higher importance than the number of ratings. Additionally, we recognized that the significance of each additional rating decreases as the total number of ratings increases. For instance, a book with a 10/10 rating from 20 reviewers should be considered more popular than a 10/10 book with only 2 reviews. However, the difference in popularity between a book with 600 ratings and one with 582 ratings is much less pronounced. Our "book popularity score" was designed to capture these nuances.

$$p_i = r_i \cdot \log_{20}(n_i + m)$$

Formula explanation:
  • $m$ is the mean number of ratings across all books.
  • $i$ represents a specific book.
  • $p_i$ is the popularity score for book $i$.
  • $r_i$ is the mean rating for book $i$.
  • $n_i$ is the number of ratings for book $i$.
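To make the formula concrete, here is a small sketch that computes the score for a handful of books. The function names and the toy ratings are made up for illustration; the only thing taken from the formula is the shape mean rating times log base 20 of (number of ratings + m).

```python
import math
from statistics import mean


def popularity_score(mean_rating, n_ratings, m, base=20):
    """Mean rating times log_base(number of ratings + m), as in the formula above."""
    return mean_rating * math.log(n_ratings + m, base)


def score_books(books):
    """`books` maps a title to a (mean rating, number of ratings) pair."""
    # m is the mean number of ratings across all books.
    m = mean(n for _, n in books.values())
    return {title: popularity_score(r, n, m) for title, (r, n) in books.items()}


# Made-up ratings, for illustration only.
toy_books = {"Book A": (10.0, 20), "Book B": (10.0, 2), "Book C": (7.5, 600)}
scores = score_books(toy_books)
```

Because the count sits inside a logarithm, going from 2 to 20 ratings moves the score far more than going from 582 to 600, which is the diminishing-returns behaviour described above.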

To identify the most influential factors in a book's success, we used mutual information scores. We looked at four features: book author, publication year, number of unique countries reviewing the book, and median reviewer age. The author (0.475) and the number of unique reviewer countries (0.214) stood out as the key indicators of popularity, while publication year and median reviewer age scored below 0.01.

| Feature | Normalized MI Score |
|---|---|
| Book Author | 0.47535 |
| Year of Publication | 0.00749 |
| Number of Unique Countries that Have Reviewed the Book | 0.21448 |
| Median Reviewer Age | 0.00708 |

The normalised mutual information scores between book popularity and various other features of a book.
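For readers unfamiliar with the metric, the sketch below shows one common way to compute a normalised mutual information score between two discrete variables. It rests on two assumptions that are not stated on this page: the mutual information is divided by the arithmetic mean of the two entropies, and continuous values such as the popularity score are first cut into equal-width bins. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(x, y):
    """Mutual information of x and y divided by the mean of their entropies."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = sum(
        (c / n) * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )
    denom = (entropy(x) + entropy(y)) / 2
    # If both variables are constant there is nothing to share; return 0.0.
    return mi / denom if denom > 0 else 0.0


def equal_width_bins(values, n_bins=5):
    """Assumed discretisation step: map continuous values to bin indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

A score of 1 means one variable fully determines the other, and a score of 0 means they are independent, so values like 0.007 for publication year indicate almost no shared information with popularity.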

UNSUPERVISED LEARNING

K-means clustering was used to explore reader demographics, seeking patterns based on book popularity, reader age, and global reach. While initial visualizations didn't reveal distinct clusters, further analysis suggested that reader age was the most significant differentiating factor. Ultimately, our clustering analysis indicated a general reader base across all ages, rather than distinct demographic groups with unique reading habits.

Graphs 1 and 2: K-means clustering on three features with k = 3, chosen by the elbow method
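The elbow method works by running K-means for several values of k and looking for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below is a dependency-free illustration of that idea, not the code behind the graphs: the toy points are made up, and the deterministic farthest-first initialisation is a simplification chosen so the example is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the farthest one."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, inertia)."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Made-up points in a 3-feature space forming three tight groups.
toy_points = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0),
    (10, 10, 0), (11, 10, 0), (10, 11, 0),
    (0, 10, 10), (1, 10, 10), (0, 11, 10),
]
inertias = {k: kmeans(toy_points, k)[1] for k in range(1, 6)}
```

On this toy data the inertia collapses at k = 3 and barely improves afterwards, which is the "elbow". On real features with different units, scaling each feature first would usually be needed so that no single one dominates the distance.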

SUPERVISED LEARNING

To quantify the relationship between a book’s popularity and its global reach, we used regression analysis. We explored linear, quadratic, square root, logarithmic, and reciprocal regressions, using 5-fold cross-validation to compare their predictive power. Linear regression emerged as the most accurate and most consistent model, with the lowest mean MSE (1.3496) and the lowest variance across folds (0.0009). Our analysis confirmed a positive, linear relationship: books with a wider global spread tend to be more popular.

| Regression Technique | MSE round 1 | MSE round 2 | MSE round 3 | MSE round 4 | MSE round 5 | Mean MSE | Variance |
|---|---|---|---|---|---|---|---|
| Linear | 1.3418 | 1.3193 | 1.3183 | 1.3878 | 1.3805 | 1.3496 | 0.0009 |
| Quadratic | 1.3507 | 1.3287 | 1.3229 | 1.3963 | 1.6514 | 1.4100 | 0.0152 |
| Square Root | 1.4049 | 1.3798 | 1.3728 | 1.4665 | 1.3667 | 1.3981 | 0.0013 |
| Logarithmic | 1.4936 | 1.4640 | 1.4557 | 1.5772 | 1.4478 | 1.4877 | 0.0022 |
| Reciprocal | 1.6400 | 1.6071 | 1.5911 | 1.7530 | 1.5894 | 1.6361 | 0.0037 |

The mean squared error (MSE) for each round, the mean MSE, and the variance of the MSE for each regression technique over 5 rounds of cross-validation
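The comparison above can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than our original notebook code: each technique is treated as a one-feature least-squares fit of popularity against a transformed country count, folds are formed by interleaving rows, and all names and the toy data are made up. The variance column in the table is consistent with a population variance, so `statistics.pvariance` is used here.

```python
import math
from statistics import mean, pvariance

# Assumed form of each technique: y = a + b * f(x) for a single transform f.
TRANSFORMS = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "square root": math.sqrt,
    "logarithmic": math.log,
    "reciprocal": lambda x: 1.0 / x,
}


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cross_val_mse(xs, ys, transform, k=5):
    """Return the held-out MSE for each of k interleaved folds."""
    fx = [transform(x) for x in xs]
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train = [(f, y) for i, (f, y) in enumerate(zip(fx, ys)) if i not in test_idx]
        test = [(f, y) for i, (f, y) in enumerate(zip(fx, ys)) if i in test_idx]
        a, b = fit_line(*zip(*train))
        scores.append(mean((y - (a + b * f)) ** 2 for f, y in test))
    return scores


# Made-up, exactly linear data, so the linear fit wins by construction.
countries = list(range(1, 21))
popularity = [2.0 * c + 1.0 for c in countries]
results = {
    name: cross_val_mse(countries, popularity, f) for name, f in TRANSFORMS.items()
}
summary = {name: (mean(s), pvariance(s)) for name, s in results.items()}
```

Comparing both the mean and the variance of the fold scores matters: a model can have a low average error while being unstable from fold to fold, as the quadratic fit's fifth round in the table suggests.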

AUTHOR ANALYSIS

Given the strong link between an author’s popularity and a book's success, we delved deeper into author analysis. We found that highly popular authors often have a larger body of work and a broad global audience. For authors with a popularity score below 8, popularity tends to increase as they write more books. Authors with outstanding popularity scores above 8 tend to be “one-hit wonders”.

Graph 3: Regression analysis of author popularity against the number of books the author has written
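The per-author figures behind this kind of analysis can be built with a simple aggregation. The sketch below assumes that an author's popularity is the mean popularity score of their books, which is one plausible definition rather than one stated on this page; the function name and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def author_summary(books):
    """`books` is an iterable of (author, popularity score) pairs, one per book."""
    scores = defaultdict(list)
    for author, score in books:
        scores[author].append(score)
    # Assumed definition: author popularity = mean score of that author's books.
    return {
        author: {"n_books": len(s), "popularity": mean(s)}
        for author, s in scores.items()
    }


# Made-up records, for illustration only.
toy_records = [("Author A", 6.0), ("Author A", 8.0), ("Author B", 9.0)]
summary_by_author = author_summary(toy_records)
# Split at the threshold of 8 discussed above.
above_8 = {a for a, s in summary_by_author.items() if s["popularity"] > 8}
```

Plotting `n_books` against `popularity` for each author, separately for those above and below 8, gives the kind of view shown in Graph 3.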

OUR RECOMMENDATIONS

Based on our analysis, we recommend the following strategies for selecting books with high potential for film adaptation:

  • Focus on global reach: Prioritize books with a wide global spread, even if they don't have a massive number of readers yet.
  • Bet on established authors: Seek out books by well-established, popular authors, or acquire the rights to their works early.
  • Consider prolific authors with global appeal: Alternatively, consider authors who may not be famous yet but have written many books and appeal to a global audience.

By following these data-driven strategies, movie production companies can increase their odds of selecting books that will captivate audiences worldwide and achieve box office success.



Anubhav Jain

Passionate learner. Innovative developer.