POS & NER Part 2: Named Entity Recognition
Today, we’ll explore one of NLP’s fundamental tasks: Named Entity Recognition (NER). NER involves identifying and classifying named entities within text, such as people, organizations, locations, dates, and quantities. These entities provide crucial context and meaning to the text. Our objective for today is to take raw text as input and apply NER techniques to extract and categorize the relevant named entities. Let’s start by importing the spaCy library.
import spacy

# Load the small English pipeline
# (if it's missing, install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
Alright, next up, let’s create a variable to hold the text we’re going to work with. I’m going to use a big chunk of text for this one.
doc = nlp(u'Elon Musk announced that Tesla, Inc. will open a new manufacturing plant in Berlin, Germany, by the end of 2025. The facility is expected to produce over 100,000 electric vehicles annually and create thousands of jobs in the region. Last year, SpaceX, another company founded by Musk, successfully launched its Starship rocket from Cape Canaveral in Florida, marking a major milestone in space exploration. The announcement was made during the 2024 International Automotive Expo in New York City, where several companies, including BMW and Ford, showcased their latest innovations.')
Next, we’ll build a function to display information about each entity.
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(f"{ent.text:{30}} {ent.label_:{10}} {str(spacy.explain(ent.label_))}")
    else:
        print('No entities found')
show_ents(doc)
After you run the code, you’ll see that the text we’re using is packed with entities.
Let’s figure out how many times each entity is mentioned.
from collections import Counter
def count_ents(doc):
    ent_labels = [ent.label_ for ent in doc.ents]
    counts = Counter(ent_labels)
    for label, count in counts.items():
        print(f"{label:{10}} {count}")
count_ents(doc)
Now you can see a list of all the entities and how often they show up. You’ll notice that some of them appear more than once.
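A side note on Counter: its most_common() method sorts the tallies for you, which is handy when a text contains many entity types. Here’s a quick standalone sketch using a hypothetical label list (the labels below are made up for illustration, not taken from our document):

```python
from collections import Counter

# Hypothetical entity labels, as count_ents() would collect them from doc.ents
ent_labels = ['ORG', 'GPE', 'PERSON', 'ORG', 'DATE', 'GPE', 'ORG']

counts = Counter(ent_labels)

# most_common() returns (label, count) pairs sorted by frequency, highest first
for label, count in counts.most_common():
    print(f"{label:{10}} {count}")
```

With most_common(), the busiest entity types float to the top of the printout instead of appearing in insertion order.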
Let’s visualize the entities to see them more clearly. This time, I want to highlight only the GPE and PERSON entities.
from spacy import displacy

colors = {'GPE': 'radial-gradient(#a18cd1, #6a11cb)', 'PERSON': 'radial-gradient(#ff9a9e, #b91d73)'}
options = {'ents': ['GPE', 'PERSON'], 'colors': colors}
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True, options=options)
Here’s the rendered output of the named entities: each sentence is on its own line, with the GPE and PERSON entities highlighted.
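If you’re not working in a Jupyter notebook, displacy.render can return the markup as an HTML string instead, which you can write to a file and open in a browser. As a sketch, here’s that idea using displacy’s manual mode, which renders pre-computed entity spans without needing a loaded model (the example text and spans below are made up for illustration):

```python
from spacy import displacy

# Manual mode: render entity spans we supply ourselves (hypothetical example data)
doc_data = {
    'text': 'Elon Musk announced a new plant in Berlin.',
    'ents': [
        {'start': 0, 'end': 9, 'label': 'PERSON'},
        {'start': 35, 'end': 41, 'label': 'GPE'},
    ],
    'title': None,
}

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(doc_data, style='ent', manual=True, jupyter=False)

with open('entities.html', 'w', encoding='utf-8') as f:
    f.write(html)
```

Opening entities.html in a browser shows the same highlighted view you’d get inline in a notebook.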
There may be times when you need to add entities of your own. Let’s do another example with a shorter text.
docs = nlp(u'She is an avid book reader and spends hours every day immersed in novels. '
           u'The book-reader she bought, similar to the Amazon Kindle, makes reading more convenient on the go.')
show_ents(docs)
As we can see, there are two entities listed in the text.
For this example, I want to add ‘book reader’ and ‘book-reader’ to our list of entities and see what happens. First, let’s check whether these phrases match anything in the text.
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['book reader', 'book-reader']
phrase_pattern = [nlp(text) for text in phrase_list]
matcher.add('br', phrase_pattern)  # in spaCy 3.x, patterns are passed as a list
found_matches = matcher(docs)
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = docs[start:end]
    print(match_id, string_id, start, end, span.text)
We found 2 matches in the text.
Now let’s assign the matched phrases the PRODUCT entity label.
from spacy.tokens import Span
PROD = docs.vocab.strings[u'PRODUCT']
new_entities = [Span(docs, start, end, label=PROD) for match_id, start, end in found_matches]
docs.ents = list(docs.ents) + new_entities
show_ents(docs)
Here’s the updated list of entities. ‘book reader’ and ‘book-reader’ are now listed as PRODUCT entities.
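As an aside, spaCy also ships with an EntityRuler component that can do this kind of pattern-based entity tagging inside the pipeline itself, so the spans are assigned during nlp() rather than patched on afterwards. Here’s a minimal sketch using a blank English pipeline (this is an alternative to the PhraseMatcher approach above, and the example sentence is made up for illustration):

```python
import spacy

# A blank pipeline is enough here: the EntityRuler does the matching itself
nlp = spacy.blank('en')
ruler = nlp.add_pipe('entity_ruler')

# String patterns covering both spellings of our phrase
patterns = [
    {'label': 'PRODUCT', 'pattern': 'book reader'},
    {'label': 'PRODUCT', 'pattern': 'book-reader'},
]
ruler.add_patterns(patterns)

doc = nlp('The book-reader she bought makes reading convenient.')
for ent in doc.ents:
    print(ent.text, ent.label_)
```

The trade-off: the EntityRuler keeps the custom patterns as part of the pipeline configuration, while the Span approach we used lets you patch entities onto an already-processed Doc.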
In today’s exploration of Named Entity Recognition (NER), we’ve successfully demonstrated its ability to extract and categorize crucial information from raw text. By leveraging the spaCy library and applying NER techniques, we were able to identify and analyze various named entities, including people, organizations, locations, and even custom entities like “book reader.” Through our visualizations and analysis, we gained valuable insights into the frequency and distribution of these entities within the text.