NLP Part 1: Tokenization
In text preprocessing, the first step is tokenization: splitting a complete text into smaller units called tokens.
"My computer isn't working!"
For example, suppose we want to break down the sentence above. After splitting, it will look like this:
" || My || computer || isn || 't || working || ! || "
Essentially, spaCy's tokenizer works with four kinds of rules:
- Prefix: characters at the start of a word.
- Suffix: characters at the end of a word.
- Infix: characters between the prefix and suffix.
- Exception: a special-case rule that splits a string into several tokens or keeps it from being split when the punctuation rules would otherwise apply (for example, “isn’t” is split into “is” and “n’t”, while “U.K.” stays a single token). A short sketch of adding such a rule follows below.
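To see how an exception works in practice, here is a small sketch that registers a custom special-case rule with spaCy's tokenizer. The word "gimme" is just an illustrative example taken from spaCy's documentation, not part of our text:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# By default, "gimme" stays a single token
print([t.text for t in nlp("gimme that")])   # ['gimme', 'that']

# Register an exception that always splits "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

print([t.text for t in nlp("gimme that")])   # ['gim', 'me', 'that']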
Let’s start with a sentence. You can replace this sentence with any text you prefer.
qotd = (u"Nature\'s music is never over; her silences are pauses, not conclusions.\n-Mary Webb \nmary.webb@email.com \nhttp://www.marywebb.com \nUploaded by Adelle Smiths")
print(qotd)
and the result of the code will look like this (the \n escape sequences become line breaks):
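Nature's music is never over; her silences are pauses, not conclusions.
-Mary Webb
mary.webb@email.com
http://www.marywebb.com
Uploaded by Adelle Smiths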
To identify the part of speech and syntactic dependency of each token, you can use the following code:
docs = nlp(qotd)
for token in docs:
    # Print each token's text, part-of-speech tag, and dependency label
    print(token.text, token.pos_, token.dep_)
As shown in the results, spaCy keeps the email address and the URL as single tokens and tags them appropriately instead of splitting them on punctuation.
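If you want to check this programmatically, each Token carries boolean flags for exactly this; a quick sketch reusing the docs object from above:
# Show which tokens spaCy flags as email addresses or URLs
for token in docs:
    if token.like_email or token.like_url:
        print(token.text, "| like_email:", token.like_email, "| like_url:", token.like_url)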
Now, let’s examine named entities. An entity is a real-world thing referred to by name, such as a person, a country, or a brand. By running this code, you can identify each entity’s type:
for ent in docs.ents:
    # Print the entity text, its label, and a plain-English explanation of the label
    print(ent)
    print(ent.label_)
    print(str(spacy.explain(ent.label_)))
In the example sentence, spaCy recognizes Webb and Adelle Smiths as people.
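If you only need the people mentioned in a text, you can filter on the entity label; a small sketch, again reusing docs (keep in mind that the exact entities found can vary with the model version):
# Collect only the spans labelled as PERSON
people = [ent.text for ent in docs.ents if ent.label_ == "PERSON"]
print(people)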
Next, we can perform chunking. This combines words into groups (noun chunks) that convey more meaning than the individual words do on their own.
for chunk in docs.noun_chunks:
    # Print each noun chunk, its label (NP), and what that label means
    print(chunk)
    print(chunk.label_)
    print(str(spacy.explain(chunk.label_)))
Check out the first result, “Nature’s music.” Normally, “Nature’s” and “music” are separate words: “Nature’s” means something belonging to nature, and “music” is sound made by instruments or voices. Combined as a chunk, “Nature’s music” carries a meaning of its own.
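Noun chunks are Span objects, so you can also see which word anchors each chunk and what role that word plays in the sentence; a short sketch following the standard spaCy pattern:
# For each noun chunk, show its root word and that word's dependency label
for chunk in docs.noun_chunks:
    print(chunk.text, "| root:", chunk.root.text, "| dep:", chunk.root.dep_)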
Now, let’s visualize the tokens. Start by importing the displaCy visualizer and rendering the dependency graph:
from spacy import displacy

# Draw the dependency parse; 'distance' controls the horizontal spacing between tokens
displacy.render(docs, style='dep', jupyter=True, options={'distance': 110})
Once you run the code, you’ll see a graph that shows the part of speech and grammatical structure of each word in the text. Because the example text is long, the rendered graph will be quite wide.
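If you are not in a Jupyter notebook, or you want to keep the graph as a file, displaCy can return the raw SVG markup instead of rendering it inline. A minimal sketch (the file name dependency_plot.svg is just an example):
# Render to an SVG string instead of displaying inline
svg = displacy.render(docs, style='dep', jupyter=False, options={'distance': 110})

# Save the markup so the graph can be opened in a browser later
with open('dependency_plot.svg', 'w', encoding='utf-8') as f:
    f.write(svg)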
Let’s try a different sentence for a clearer example. Run this code to see the results.
doc = nlp(u'My name is Adele. I have $9999 in my bank account. I use Union Bank. I have a U.S.A. bank account even though I live in the U.K.')

# Highlight the named entities in the text
displacy.render(doc, style='ent', jupyter=True)
Check out the results. You’ll see that the entities in the sentence are highlighted, with the entity label shown next to each one.
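You can also limit which entity types get highlighted through the options argument; here is a small sketch that only highlights money amounts and geopolitical entities (MONEY and GPE are standard spaCy entity labels):
# Only highlight monetary values and geopolitical entities (countries, cities, states)
displacy.render(doc, style='ent', jupyter=True, options={'ents': ['MONEY', 'GPE']})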
This is just a taste of what text preprocessing can do! If you’re curious to learn more about the spaCy visualizers, the official spaCy documentation is a great place to start. Happy exploring the world of text data!