My own AI Model for Sentiment Analysis

Rahul Bhattacharya May 09, 2025 07:23 AM

Live Demo GitHub

Synopsis: In this post, I want to walk through how I created an AI-powered IMDb Sentiment Classifier. The goal of this project was simple: take a piece of text (a movie review) and classify it as Positive or Negative. I trained my own model, saved it, and deployed it as a...

In this post, I want to walk through how I created an AI-powered IMDb Sentiment Classifier. The goal of this project was simple: take a piece of text (a movie review) and classify it as Positive or Negative. I trained my own model, saved it, and deployed it as a live demo using Streamlit.

Loading the IMDb Dataset

I started with the IMDb dataset available through Hugging Face Datasets. This dataset contains 50,000 movie reviews labeled with sentiment.

from datasets import load_dataset
import pandas as pd

ds = load_dataset("imdb")

train_df = pd.DataFrame({"text": ds["train"]["text"], "label": ds["train"]["label"]})
test_df  = pd.DataFrame({"text": ds["test"]["text"],  "label": ds["test"]["label"]})

Here I converted the dataset into Pandas DataFrames for easier handling. Each row has a text review and a label (0 = Negative, 1 = Positive).

Text Preprocessing

Movie reviews often have HTML tags or strange punctuation. To make the model more effective, I normalized the text.

import re

def normalize(text: str) -> str:
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^A-Za-z0-9\s']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

train_df["text"] = train_df["text"].apply(normalize)
test_df["text"]  = test_df["text"].apply(normalize)

This step ensures that the reviews are lowercased, stripped of extra characters, and ready for vectorization.

Vectorization with TF-IDF

Next, I converted the raw text into numeric features using TF-IDF. This technique captures which words are important in a review relative to all other reviews.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1,2))
X_train = vectorizer.fit_transform(train_df["text"])
X_test  = vectorizer.transform(test_df["text"])

Here I kept up to 50,000 features and considered both unigrams (single words) and bigrams (two-word phrases).

Training the Classifier

I chose Logistic Regression, a simple yet powerful classifier for text problems.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

clf = LogisticRegression(solver="liblinear", max_iter=1000)
clf.fit(X_train, train_df["label"])

pred = clf.predict(X_test)
acc = accuracy_score(test_df["label"], pred)
print("Accuracy:", acc)
print(classification_report(test_df["label"], pred, digits=3))

This gave me a strong baseline sentiment classifier with accuracy well above random guessing.

Saving the Model

After training, I combined the vectorizer and classifier into a pipeline and saved it as a .pkl file. This makes it easy to reuse later without retraining.

from sklearn.pipeline import Pipeline
import joblib
import os

pipeline = Pipeline([
    ("tfidf", vectorizer),
    ("clf", clf)
])

os.makedirs("artifacts", exist_ok=True)
joblib.dump(pipeline, "artifacts/imdb_tfidf_logreg.pkl")

Now I had a portable model file that I could use in my web app.

Building the Streamlit App

I wanted recruiters to try the model themselves, so I built a small web app using Streamlit.

import joblib
import re
import streamlit as st
from pathlib import Path

MODEL_PATH = Path(__file__).parent / "models" / "imdb_tfidf_logreg.pkl"

@st.cache_resource
def load_model():
    return joblib.load(MODEL_PATH)

def normalize(text: str) -> str:
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^A-Za-z0-9\s']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

st.set_page_config(page_title="IMDb Sentiment Classifier")
st.title("IMDb Sentiment Classifier")

model = load_model()

text = st.text_area("Enter a movie review:")
if st.button("Analyze") and text.strip():
    x = normalize(text)
    pred = model.predict([x])[0]
    label = "Positive" if pred == 1 else "Negative"
    st.subheader(f"Prediction: {label}")

With this app, a user can type or paste a review, click Analyze, and instantly see if the model thinks the review is Positive or Negative.

Packaging Requirements

To make deployment work on Streamlit Cloud, I created a requirements.txt file:

streamlit>=1.37
scikit-learn>=1.3
joblib>=1.3
numpy>=1.23

This ensures that all necessary Python packages are installed automatically.

Deployment

I pushed my code and model file to GitHub, then connected the repository to Streamlit Community Cloud. Streamlit automatically built the environment and launched the app at a free public URL.

Finally, I embedded the app into my GitHub Pages portfolio so recruiters could interact with it directly.

What This Project Demonstrates

By completing this project, I was able to demonstrate:

How to train a machine learning model for sentiment analysis using IMDb data.
How to clean and preprocess text for natural language processing.
How to package and deploy a model as a live demo using Streamlit.
How to integrate an AI project into a professional portfolio.

This project shows end-to-end designing of an AI model: from dataset to model to deployment.