Text Classification Basics Part 3: Text Classification Basics with SciKit-Learn

Chitra's Playground
4 min read · Sep 26, 2024

To begin working with SciKit-Learn, we first need to install the library. You can do this by running the following command in your environment:

pip install -U scikit-learn

In this tutorial, we’ll be using the SPAM SMS Collection dataset, which contains four columns: label, message, length, and punctuation (punct). You can download the dataset from the provided link.

Importing Libraries and Data

Let’s start by importing the necessary libraries and loading the dataset into our environment:

import numpy as np
import pandas as pd

# The file is tab-separated, so pass sep='\t'
df = pd.read_csv('/content/smsspamcollection.tsv', sep='\t')
df.head()

Here, we can see the first five entries of the dataset.

Now, let’s check for any missing values:

df.isnull().sum()

As we can see, there are no missing values in any of the columns, so we don't need to drop or impute anything before modeling.

Checking the Distribution of Labels

Next, let’s check the unique counts of each label (ham and spam):

df['label'].value_counts()

The dataset contains two types of labels: ham (non-spam messages) and spam. We have 4,825 ham messages and 747 spam messages out of 5,572 in total, so roughly 87% of the data is ham and 13% is spam. In other words, the dataset is imbalanced.
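To see the imbalance as proportions rather than raw counts, value_counts accepts normalize=True. The sketch below rebuilds a label column from the counts reported above instead of loading the file, so the `labels` Series is a stand-in for df['label'], not the real data.

```python
import pandas as pd

# Stand-in for df['label'], rebuilt from the reported counts
# (4,825 ham and 747 spam); the real column comes from the TSV file.
labels = pd.Series(['ham'] * 4825 + ['spam'] * 747, name='label')

# normalize=True returns each class's share of the total instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

On the real DataFrame, the equivalent call would be df['label'].value_counts(normalize=True).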

Splitting the Data into Training and Test Sets

Now, let’s split the data into training and test sets using train_test_split from SciKit-Learn. We’ll use length and punct as our features since they are already in numerical form, and label as our target. We will allocate 30% of the data for testing. You can set the random_state to any seed value for reproducibility:

from sklearn.model_selection import train_test_split

# Features: the two numeric columns
X = df[['length','punct']]

# Target: the ham/spam label
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=69)

print('X train', X_train.shape)
print('X test', X_test.shape)
print('y train', y_train.shape)
print('y test', y_test.shape)

This gives us 3,900 samples for training and 1,672 for testing (5,572 rows in total). You’ll notice that the shapes of y_train and y_test, such as (3900,), have no number after the comma because they are one-dimensional arrays.
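Because the labels are imbalanced, a purely random split can leave the training and test sets with slightly different spam ratios. One common option, not used in this tutorial, is to pass stratify=y to train_test_split so both sets keep the original class proportions. The sketch below uses a synthetic DataFrame (`demo`) with the same class counts as the SMS dataset; the feature values are random placeholders, not real message statistics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the SMS dataset, random feature values
rng = np.random.default_rng(0)
n_ham, n_spam = 4825, 747
demo = pd.DataFrame({
    'length': rng.integers(1, 200, size=n_ham + n_spam),
    'punct': rng.integers(0, 20, size=n_ham + n_spam),
    'label': ['ham'] * n_ham + ['spam'] * n_spam,
})

X_demo = demo[['length', 'punct']]
y_demo = demo['label']

# stratify=y_demo keeps the ham/spam ratio the same in both splits
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=69, stratify=y_demo
)

print(Xs_train.shape, Xs_test.shape)
print(ys_train.value_counts(normalize=True).round(3))
print(ys_test.value_counts(normalize=True).round(3))
```

With the real data, the only change to the original call would be adding stratify=y.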

Training a Naïve Bayes Classifier

Now that our data is prepared, let’s train a Naïve Bayes classifier. We’ll use the Multinomial Naïve Bayes algorithm, which is commonly used for text classification tasks. It expects non-negative features such as counts, a requirement that length and punct both satisfy. After training, we can test the model using the test set and check the confusion matrix to evaluate its performance.

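from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Fit the classifier on the training features and labels
MultinomialNB_model = MultinomialNB()
MultinomialNB_model.fit(X_train, y_train)

# Predict labels for the held-out test set
MultinomialNB_predict = MultinomialNB_model.predict(X_test)

# Rows are the actual labels, columns are the predicted labels
MultinomialNB_confusion = pd.DataFrame(metrics.confusion_matrix(y_test, MultinomialNB_predict), index=['ham','spam'], columns=['ham','spam'])
MultinomialNB_confusion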

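It seems that our model struggles to predict spam messages accurately. This is likely due to the imbalance in the dataset, where the majority of messages are labeled as ham. The two features we used, length and punct, may also carry too little information to separate the classes well.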
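When reading a confusion matrix from scikit-learn, rows are the actual labels and columns are the predicted labels. The sketch below shows how per-class recall falls out of the matrix. The labels in `y_true` and `y_pred` are made up for illustration and are not the results of the model above.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration only: 8 ham and 4 spam messages,
# where the hypothetical model catches just 1 of the 4 spam messages.
y_true = np.array(['ham'] * 8 + ['spam'] * 4)
y_pred = np.array(['ham'] * 8 + ['ham'] * 3 + ['spam'] * 1)

# labels= fixes the row/column order: index 0 is ham, index 1 is spam
cm = metrics.confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
print(cm)

# Recall per class = correct predictions on the diagonal / actual count per row
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(['ham', 'spam'], recall_per_class)))
```

A low value in the spam row's diagonal cell relative to the row total is what "struggling with spam" looks like in the matrix.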

Evaluating the Model

Let’s now generate the classification report and check the accuracy score of our Naïve Bayes classifier:

MultinomialNB_classification = metrics.classification_report(y_test, MultinomialNB_predict)
MultinomialNB_accuracy = metrics.accuracy_score(y_test, MultinomialNB_predict)

print(MultinomialNB_classification)
print(f"accuracy: {MultinomialNB_accuracy*100:.2f}%")

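The classification report provides the precision, recall, and F1-score for each label, while the accuracy score shows the overall share of correct predictions. Accuracy can look high here for the wrong reason: because about 87% of the messages are ham, a model that labels every message as ham would already score roughly 86.6% on the full dataset. The recall for the spam label is a better indicator of how well the model actually catches spam.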
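A quick way to put an accuracy score in context is to compare it with a majority-class baseline. The sketch below uses scikit-learn's DummyClassifier with strategy='most_frequent' on labels rebuilt from the reported class counts; `y_all` and `X_dummy` are placeholders, since this baseline ignores the features entirely.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the reported counts; features are zeros because
# the most_frequent strategy never looks at them.
y_all = np.array(['ham'] * 4825 + ['spam'] * 747)
X_dummy = np.zeros((len(y_all), 1))

# Always predicts the most common class seen during fit (here: ham)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_dummy, y_all)
baseline_pred = baseline.predict(X_dummy)

baseline_accuracy = metrics.accuracy_score(y_all, baseline_pred)
spam_recall = metrics.recall_score(y_all, baseline_pred, pos_label='spam')

print(f"baseline accuracy: {baseline_accuracy*100:.2f}%")
print(f"baseline spam recall: {spam_recall:.2f}")
```

Any model worth keeping should beat this baseline on accuracy and, more importantly, achieve a spam recall well above zero.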

In this tutorial, we walked through the process of building a basic classification model using SciKit-Learn, from loading and exploring the data to training a Naïve Bayes classifier and evaluating its performance. While our model achieved a reasonable accuracy, it struggled with predicting spam messages due to the imbalance in the dataset. In real-world applications, it’s essential to address such issues by applying techniques like oversampling, undersampling, or using more advanced algorithms tailored for imbalanced data. Experimenting with different models and evaluation methods can further improve the performance. As you continue exploring machine learning, remember that understanding your data and choosing the right metrics are key to building effective models.
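As one illustration of the oversampling idea mentioned above, the minority class can be resampled with replacement until it matches the majority class. This is a minimal sketch using sklearn.utils.resample on a tiny made-up training frame (`train_demo`); it is not part of the tutorial's pipeline, and in practice it should be applied to the training split only so the test set stays untouched.

```python
import pandas as pd
from sklearn.utils import resample

# Tiny made-up training frame: 8 ham rows and 2 spam rows
train_demo = pd.DataFrame({
    'length': [20, 35, 50, 18, 42, 27, 61, 33, 150, 142],
    'punct':  [1, 2, 3, 0, 2, 1, 4, 2, 9, 7],
    'label':  ['ham'] * 8 + ['spam'] * 2,
})

ham = train_demo[train_demo['label'] == 'ham']
spam = train_demo[train_demo['label'] == 'spam']

# Draw spam rows with replacement until there are as many as ham rows
spam_upsampled = resample(spam, replace=True, n_samples=len(ham), random_state=69)

# Combine and shuffle so the classes are not grouped together
balanced = (
    pd.concat([ham, spam_upsampled])
    .sample(frac=1, random_state=69)
    .reset_index(drop=True)
)

print(balanced['label'].value_counts())
```

Oversampling only duplicates existing spam rows, so it cannot add new information; it mainly changes how much weight the classifier gives to the minority class.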
