Text Classification Basics Part 4: Feature Extraction from Text
In this part of the series, we’ll explore how to perform feature extraction from text. We’ll use the SMSSpamCollection dataset, which you can download via the provided link. Our goal is to predict whether a message is ham (non-spam) or spam. The dataset also includes columns such as text length and punctuation count, but in this post we’ll build our features from the message text itself.
Importing the Dataset
Let’s start by importing the necessary libraries and loading the dataset into our environment:
import numpy as np
import pandas as pd
df = pd.read_csv('/content/smsspamcollection.tsv', sep='\t')
df.head()
The dataset contains the following columns: label, message, length, and punctuation.
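The length and punctuation columns are simple features derived from the raw message. If you needed to recreate them yourself, a minimal sketch could look like the following (the punctuation count here uses string.punctuation and writes to a column named punct for illustration, which may differ slightly from how the original file was generated):
# Illustrative sketch: deriving length/punctuation features from the raw message text
import string
df['length'] = df['message'].apply(len)
df['punct'] = df['message'].apply(lambda msg: sum(ch in string.punctuation for ch in msg))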
Now, let’s check if there are any missing values:
df.isnull().sum()
We have clean data — there are no missing values in any of the columns.
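No action is needed here, but if missing values did show up, a typical (assumed) cleanup step before splitting would simply drop those rows:
# Only needed if isnull().sum() reported missing values
df = df.dropna()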
Exploring the Label Distribution
Next, let’s take a quick look at the label column and check how many ham and spam messages we have:
df['label'].value_counts()
The dataset contains 4,825 ham messages and 747 spam messages.
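Roughly 87% of the messages are ham and 13% are spam, so the classes are imbalanced. A quick way to see the proportions, plus an optional tweak to the upcoming split, is sketched below:
# Class proportions: roughly 0.866 ham vs. 0.134 spam
df['label'].value_counts(normalize=True)
# Optional: pass stratify=y to train_test_split to keep this ratio in both splits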
Splitting the Data into Training and Test Sets
Now, let’s split the data into training and testing sets using train_test_split. We’ll use the message as our feature and label as our target:
from sklearn.model_selection import train_test_split
X = df['message']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
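A quick sanity check confirms the 70/30 split sizes (5,572 messages in total):
# Expect roughly 3,900 training messages and 1,672 test messages
print(X_train.shape, X_test.shape)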
Feature Extraction with CountVectorizer
To transform the text data into numerical features, we’ll start with CountVectorizer. It tokenizes each message and counts how often every word appears, producing a sparse document-term matrix:
# Count vectorization: learn the vocabulary and build the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
The output shows that the training set consists of 3,900 documents and 7,263 features (unique words).
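To see what those features actually are, you can peek at the fitted vocabulary. A small sketch (get_feature_names_out assumes scikit-learn 1.0 or newer):
# A few of the learned tokens, plus the column index assigned to a specific word
print(count_vect.get_feature_names_out()[:10])
print(count_vect.vocabulary_.get('free'))  # None if 'free' did not appear in the training messages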
Improving Feature Extraction with TF-IDF
Counting words is useful, but longer documents may have higher word counts, even if they talk about the same topic. To adjust for this, we use Term Frequency-Inverse Document Frequency (TF-IDF), which scales down the importance of words that appear frequently across many documents.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
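Before moving on, here is a tiny self-contained illustration on a toy corpus (not the SMS data) of how the default TF-IDF weighting, smoothed IDF followed by L2 normalization, down-weights words that occur in many documents:
# Toy example: 'now' appears in every document, so it gets a lower weight than rarer words
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
docs = ["free entry now", "are you coming now", "call now for the prize"]
toy_counts = CountVectorizer().fit_transform(docs)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)  # defaults: smooth_idf=True, norm='l2'
print(toy_tfidf.toarray().round(2))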
We can streamline this process by using TfidfVectorizer, which combines the steps of CountVectorizer and TfidfTransformer:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_train_tfidf.shape
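One important detail: the test data must be transformed with the same fitted vectorizer, using transform rather than fit_transform, so that its columns line up with the training features. A sketch:
# Reuse the vocabulary and IDF weights learned on the training set; never refit on the test set
X_test_tfidf = vectorizer.transform(X_test)
X_test_tfidf.shape  # same number of columns as X_train_tfidf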
Training a Linear SVC Classifier
Next, we’ll use LinearSVC, a linear support vector classifier that handles the sparse, high-dimensional features produced by TF-IDF well and scales to large datasets.
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)  # fit() returns the fitted estimator itself, so no separate variable is needed
clf
We’ll also build a Pipeline to ensure that the same preprocessing steps (TF-IDF transformation) are applied to the test data:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf.fit(X_train,y_train)
Testing the Classifier
Now, let’s make predictions on the test data and evaluate the model using a confusion matrix and classification report:
predictions = text_clf.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
The confusion matrix shows that the model classifies 1,659 of the 1,672 test messages correctly.
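Since confusion_matrix orders the classes alphabetically (ham first, then spam), wrapping it in a labeled DataFrame makes it easier to read off which messages were misclassified. A small sketch:
# Label the confusion matrix rows (true class) and columns (predicted class)
labels = ['ham', 'spam']
cm = confusion_matrix(y_test, predictions, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))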
Checking the Accuracy Score
Finally, let’s check the model’s accuracy:
from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, predictions)*100
print(f'{accuracy:.2f}%')
Our model achieved an impressive 99.22% accuracy.
Testing the Model with Sample Text
Let’s test the model with some sample text to see how well it can predict whether a message is ham or spam:
#ham sample
text_clf.predict(["Machine learning algorithms have made significant strides in recent years, enhancing applications across various fields such as finance, healthcare, and transportation."])
#spam sample
text_clf.predict(["🚨Congratulations!🎉 You’ve been selected to WIN a FREE iPhone 14! 📱 Claim your prize NOW by clicking the link below: [fake-link]. Act fast, this offer expires in 24 hours!🔥"])
Our model was able to correctly predict both the ham and spam messages.
Conclusion
In this post, we explored how to perform feature extraction from text using CountVectorizer and TF-IDF. We then trained a LinearSVC model and achieved a high accuracy of 99.22%. This shows the power of proper text preprocessing combined with an effective classification algorithm. You can experiment with different models and feature extraction methods to further improve performance on your own text classification tasks; a quick sketch of swapping in a different classifier is included below. Happy coding!
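As a starting point for that experimentation, swapping in a different classifier only requires changing one step of the pipeline. Here is a sketch using Multinomial Naive Bayes as an illustrative alternative:
# Illustrative alternative: same TF-IDF features, different classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
nb_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
nb_clf.fit(X_train, y_train)
print(f'{nb_clf.score(X_test, y_test)*100:.2f}%')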