Text Classification Basics Part 4: Feature Extraction from Text
In this part of the series, we’ll explore how to perform feature extraction from text. We’ll use the SMSSpamCollection dataset, which you can download via the provided link. Our goal is to predict whether a message is ham (non-spam) or spam. The dataset also includes columns such as text length and punctuation count, but in this post we’ll build our features from the message text itself.
Importing the Dataset
Let’s start by importing the necessary libraries and loading the dataset into our environment:
import numpy as np
import pandas as pd
df = pd.read_csv('/content/smsspamcollection.tsv', sep='\t')
df.head()
The dataset contains the following columns: label, message, length, and punctuation.
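The length and punctuation columns are simple features derived from the raw message. If you needed to recreate them yourself, a minimal sketch could look like the following (the punctuation count here uses string.punctuation and writes to a column named punct for illustration, which may differ slightly from how the original file was generated):
# Illustrative sketch: deriving length/punctuation features from the raw message text
import string
df['length'] = df['message'].apply(len)
df['punct'] = df['message'].apply(lambda msg: sum(ch in string.punctuation for ch in msg))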
Now, let’s check if there are any missing values:
df.isnull().sum()
We have clean data — there are no missing values in any of the columns.
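No action is needed here, but if missing values did show up, a typical (assumed) cleanup step before splitting would simply drop those rows:
# Only needed if isnull().sum() reported missing values
df = df.dropna()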
Exploring the Label Distribution
Next, let’s take a quick look at the label column and check how many ham and spam messages we have:
df['label'].value_counts()
The dataset contains 4,825 ham messages and 747 spam messages.
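Roughly 87% of the messages are ham and 13% are spam, so the classes are imbalanced. A quick way to see the proportions, plus an optional tweak to the upcoming split, is sketched below:
# Class proportions: roughly 0.866 ham vs. 0.134 spam
df['label'].value_counts(normalize=True)
# Optional: pass stratify=y to train_test_split to keep this ratio in both splits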
Splitting the Data into Training and Test Sets
Now, let’s split the data into training and testing sets using train_test_split. We’ll use the message as our feature and label as our target:
from sklearn.model_selection import train_test_split
X = df['message']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
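A quick sanity check confirms the 70/30 split sizes (5,572 messages in total):
# Expect roughly 3,900 training messages and 1,672 test messages
print(X_train.shape, X_test.shape)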
Feature Extraction with CountVectorizer
To transform the text data into numerical features, we’ll start with CountVectorizer. It tokenizes each message and counts how often every word appears, producing a sparse document-term matrix:
# Count vectorization: learn the vocabulary and build the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
The output shows that the training set consists of 3,900 documents and 7,263 features (unique words).
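To see what those features actually are, you can peek at the fitted vocabulary. A small sketch (get_feature_names_out assumes scikit-learn 1.0 or newer):
# A few of the learned tokens, plus the column index assigned to a specific word
print(count_vect.get_feature_names_out()[:10])
print(count_vect.vocabulary_.get('free'))  # None if 'free' did not appear in the training messages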
Improving Feature Extraction with TF-IDF
Counting words is useful, but longer documents may have higher word counts, even if they talk about the same topic. To adjust for this, we use Term Frequency-Inverse Document Frequency (TF-IDF), which scales down the importance of words that appear frequently across many documents.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
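Before moving on, here is a tiny self-contained illustration on a toy corpus (not the SMS data) of how the default TF-IDF weighting, smoothed IDF followed by L2 normalization, down-weights words that occur in many documents:
# Toy example: 'now' appears in every document, so it gets a lower weight than rarer words
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
docs = ["free entry now", "are you coming now", "call now for the prize"]
toy_counts = CountVectorizer().fit_transform(docs)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)  # defaults: smooth_idf=True, norm='l2'
print(toy_tfidf.toarray().round(2))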
We can streamline this process by using TfidfVectorizer, which combines the steps of CountVectorizer and TfidfTransformer:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_train_tfidf.shape
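One important detail: the test data must be transformed with the same fitted vectorizer, using transform rather than fit_transform, so that its columns line up with the training features. A sketch:
# Reuse the vocabulary and IDF weights learned on the training set; never refit on the test set
X_test_tfidf = vectorizer.transform(X_test)
X_test_tfidf.shape  # same number of columns as X_train_tfidf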
Training a Linear SVC Classifier
Next, we’ll use LinearSVC, a linear support vector classifier that handles the sparse, high-dimensional features produced by TF-IDF well and scales to large datasets.
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)  # fit() returns the fitted estimator itself, so no separate variable is needed
clf
We’ll also build a Pipeline to ensure that the same preprocessing steps (TF-IDF transformation) are applied to the test data:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf.fit(X_train,y_train)
Testing the Classifier
Now, let’s make predictions on the test data and evaluate the model using a confusion matrix and classification report:
predictions = text_clf.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
The confusion matrix shows that the model classifies 1,659 of the 1,672 test messages correctly.
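Since confusion_matrix orders the classes alphabetically (ham first, then spam), wrapping it in a labeled DataFrame makes it easier to read off which messages were misclassified. A small sketch:
# Label the confusion matrix rows (true class) and columns (predicted class)
labels = ['ham', 'spam']
cm = confusion_matrix(y_test, predictions, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))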
Checking the Accuracy Score
Finally, let’s check the model’s accuracy:
from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, predictions)*100
print(f'{accuracy:.2f}%')
Our model achieved an impressive 99.22% accuracy.
Testing the Model with Sample Text
Let’s test the model with some sample text to see how well it can predict whether a message is ham or spam:
#ham sample
text_clf.predict(["Machine learning algorithms have made significant strides in recent years, enhancing applications across various fields such as finance, healthcare, and transportation."])
#spam sample
text_clf.predict(["🚨Congratulations!🎉 You’ve been selected to WIN a FREE iPhone 14! 📱 Claim your prize NOW by clicking the link below: [fake-link]. Act fast, this offer expires in 24 hours!🔥"])
Our model was able to correctly predict both the ham and spam messages.
Conclusion
In this post, we explored how to perform feature extraction from text using CountVectorizer and TF-IDF. We then trained a LinearSVC model and achieved a high accuracy of 99.22%. This shows the power of proper text preprocessing combined with an effective classification algorithm. You can experiment with different models and feature extraction methods to further improve performance on your own text classification tasks; a quick sketch of swapping in a different classifier is included below. Happy coding!
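As a starting point for that experimentation, swapping in a different classifier only requires changing one step of the pipeline. Here is a sketch using Multinomial Naive Bayes as an illustrative alternative:
# Illustrative alternative: same TF-IDF features, different classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
nb_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
nb_clf.fit(X_train, y_train)
print(f'{nb_clf.score(X_test, y_test)*100:.2f}%')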