POS & NER Part 3: Sentence Segmentation

Chitra's Playground
2 min read · Sep 17, 2024


Now we turn to sentence segmentation: dividing a spaCy Doc object into sentences. This post shows how sentence segmentation works and how to set custom segmentation rules.

Let’s try something new! In this tutorial, we’ll teach spaCy to start a new sentence whenever it sees a semicolon. We’ll use two example texts, stored in the variables ‘docs’ and ‘qotd’, to see how this works.

import spacy
nlp = spacy.load('en_core_web_sm')

docs = nlp(u"Natural language processing (NLP) encompasses a wide range of tasks; these include tokenization, part-of-speech tagging, and named entity recognition. Companies like Google, Microsoft, & Facebook invest heavily in NLP research to improve their AI systems' capabilities in understanding human language. However, challenges such as ambiguity, context, and diverse linguistic structures (e.g., slang, dialects, etc.) make accurate language processing a complex task.")
qotd = nlp(u"Success is not the key to happiness; happiness is the key to success. If you love what you are doing, you will be successful. – Albert Schweitzer")

Let’s check what sentences we have in the ‘docs’ variable.

list(docs.sents)

The output shows 3 sentences in the “docs” variable.
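If you want to look at each sentence more closely, you can loop over the sentences instead of just listing them. The sketch below is a minimal, self-contained version that assumes a blank English pipeline with spaCy's rule-based "sentencizer" in place of en_core_web_sm, so it runs without downloading a model. The names nlp_demo and demo_doc and the short example text are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required (assumption for this sketch).
nlp_demo = spacy.blank("en")
# Rule-based splitter; punct_chars is set explicitly so a semicolon is NOT a boundary here.
nlp_demo.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?"]})

demo_doc = nlp_demo(
    "NLP covers many tasks; these include tagging and parsing. "
    "Companies invest heavily in research. Challenges remain."
)

# Each sentence is a Span; start and end are token offsets into the parent Doc.
sentences = list(demo_doc.sents)
for i, sent in enumerate(sentences):
    print(i, sent.start, sent.end, repr(sent.text))
```

Each sentence is a Span, so you can index into it, read its text, or check where it sits in the original Doc.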

Let’s make a new rule so that every time we see a semicolon, we’ll start a new sentence.

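First, we import Language from spaCy so we can register a custom pipeline component. For this example, the component is called “set_custom_boundaries”; it marks the token right after each semicolon as the start of a new sentence. We add it before the parser, because the parser respects boundaries that are already set, and spaCy refuses to change sentence starts once a Doc has been parsed.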

from spacy.language import Language

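@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Skip the last token: there is no following token to mark.
    for token in doc[:-1]:
        if token.text == ';':
            # The token after a semicolon starts a new sentence.
            doc[token.i+1].is_sent_start = True

    return doc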

nlp.add_pipe("set_custom_boundaries", before='parser')
nlp.pipe_names
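To see how the placement arguments of add_pipe affect the order, here is a small sketch on a blank pipeline. It assumes the built-in "sentencizer" as a stand-in for the parser, and the component name noop_boundaries_sketch is hypothetical (it does nothing; it only exists to show the ordering).

```python
import spacy
from spacy.language import Language


@Language.component("noop_boundaries_sketch")
def noop_boundaries_sketch(doc):
    # Placeholder component for illustration: returns the Doc unchanged.
    return doc


nlp_order = spacy.blank("en")
nlp_order.add_pipe("sentencizer")
# before= inserts the component ahead of the named one;
# add_pipe also accepts after=, first=True and last=True.
nlp_order.add_pipe("noop_boundaries_sketch", before="sentencizer")

print(nlp_order.pipe_names)  # ['noop_boundaries_sketch', 'sentencizer']
```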

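Now let’s check that the segmentation works on “qotd”. The original Doc was created before we added the component, so we first run its text through the updated pipeline again.

qotd = nlp(qotd.text)
list(qotd.sents)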

It works! A new sentence is formed after a semicolon.
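For a side-by-side comparison, the sketch below builds one pipeline without the semicolon rule and one with it, then segments the same quote with both. It is not the exact setup from this post: it assumes a blank English pipeline with the rule-based "sentencizer" instead of en_core_web_sm, and it adds the rule after the sentencizer rather than before a parser. The names semicolon_boundaries_sketch and build_pipeline are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("semicolon_boundaries_sketch")
def semicolon_boundaries_sketch(doc):
    # Mark the token after every semicolon as a sentence start.
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc


def build_pipeline(split_on_semicolon):
    # Blank pipeline plus rule-based splitter (assumption: no trained model).
    nlp_local = spacy.blank("en")
    nlp_local.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?"]})
    if split_on_semicolon:
        # No parser runs here, so boundaries can still be added after the sentencizer.
        nlp_local.add_pipe("semicolon_boundaries_sketch", after="sentencizer")
    return nlp_local


text = (
    "Success is not the key to happiness; happiness is the key to success. "
    "If you love what you are doing, you will be successful."
)

before = [sent.text for sent in build_pipeline(False)(text).sents]
after = [sent.text for sent in build_pipeline(True)(text).sents]

print(len(before), before)
print(len(after), after)
```

The only difference between the two results is the extra split after the semicolon, which makes it easy to confirm that the rule is the thing doing the work.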

By creating a new rule that treats semicolons as sentence boundaries, we can control how spaCy splits text into sentences. The process involves loading a spaCy model, defining example text, creating a custom rule, adding it to the pipeline, and verifying the results.
