NLP Part 6: Assessment
Congratulations! You have learned the basics of NLP: tokenization, stemming, lemmatization, stop words, and phrase matching & vocabulary.
Okay, let’s test your knowledge of everything we’ve learned. We’re going to use a short story called “An Occurrence at Owl Creek Bridge” by Ambrose Bierce. You can find it by clicking on the title. Save it as a text file and upload it to your Google Colab. Before we start, make sure you’ve imported the spaCy module and loaded the English language model.
import spacy
nlp = spacy.load('en_core_web_sm')
Once you’ve uploaded the “owlcreek.txt” file, create a document object from it. This will help us check if the file loaded properly. If it did, you’ll see a sample of the text appear after you run the code.
with open('/content/owlcreek.txt') as f:
    doc = nlp(f.read())

doc[:36]
Alright, let’s see how many tokens are in the document. Keep in mind that spaCy counts punctuation marks as tokens too, so this is more than a plain word count. I’m getting 4,835.
print(len(doc))
Great! Now let’s count the sentences in the document. I got 204.
sents = [sent for sent in doc.sents]
print(len(sents))
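If you only need the count, you can skip building the list and wrap the generator directly. Here’s a minimal, self-contained sketch using spaCy’s rule-based sentencizer on a toy text (a blank pipeline, so no model download is needed; en_core_web_sm splits sentences with its parser instead):

```python
import spacy

# Blank English pipeline with a rule-based sentence splitter --
# no model download required, unlike en_core_web_sm
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("He stood on the bridge. The rope was taut. He closed his eyes.")
# doc.sents is a generator, so wrap it in list() before calling len()
print(len(list(doc.sents)))  # → 3
```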
Okay, let’s find the second sentence in the story. Remember, we start counting from 0.
print(sents[1].text)
Let’s take a closer look at a specific token. We’ll print the token itself, its part of speech, dependencies, and lemma.
for token in sents[1]:
    print(token.text, token.pos_, token.dep_, token.lemma_)
That’s a lot of information at once. Let’s clean it up and make it easier to read.
for token in sents[1]:
    print(f"{token.text:{10}}{token.pos_:{10}}{token.dep_:{10}}{token.lemma_}")
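The `:{10}` part is Python’s format specification: it pads each field to a 10-character column, and strings are left-aligned by default, so the columns line up. A quick standalone sketch with made-up rows standing in for spaCy’s token attributes, no model needed:

```python
# Made-up (text, POS, dep, lemma) rows standing in for spaCy's attributes
rows = [("He", "PRON", "nsubj", "he"),
        ("was", "AUX", "aux", "be"),
        ("swimming", "VERB", "ROOT", "swim")]

for text, pos, dep, lemma in rows:
    # :10 pads each field to 10 characters; strings left-align by default
    print(f"{text:10}{pos:10}{dep:10}{lemma}")
```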
Let’s create a matcher pattern labeled ‘Swimming’ that finds every place the phrase ‘swimming vigorously’ appears in the story. First, we need to import the Matcher class.
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# IS_SPACE with OP '*' lets the pattern match even across a line break in the text
pattern1 = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]
pattern2 = [{'LOWER': 'swimming'}, {'LOWER': 'vigorously'}]
matcher.add('Swimming', [pattern1, pattern2])
found_matches = matcher(doc)
print(found_matches)
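Each match is a tuple of (match_id, start, end): match_id is a hash of the pattern label, and start/end are token indices into the doc. You can look up the label and slice the doc to see the matched text. A self-contained sketch on a toy sentence (spacy.blank avoids the model download; a tokenizer-only pipeline is enough for the Matcher):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer-only pipeline; the Matcher needs no model
matcher = Matcher(nlp.vocab)
matcher.add("Swimming", [[{"LOWER": "swimming"}, {"LOWER": "vigorously"}]])

doc = nlp("He was swimming vigorously against the current.")
for match_id, start, end in matcher(doc):
    # match_id is a hash; recover the label string from the vocab
    print(nlp.vocab.strings[match_id], start, end, doc[start:end].text)
# → Swimming 2 4 swimming vigorously
```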
We found two matches! Now let’s print the words around those matches so we can see them in context.
print(doc[1265:1290])
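The indices 1265 and 1290 were read off the match tuples above. A small hypothetical helper makes the same idea reusable for any match, clamping the slice so it never runs past the start of the doc:

```python
import spacy

def context(doc, start, end, window=5):
    # Show `window` tokens on each side of a (start, end) match span,
    # clamped so we never slice before token 0
    return doc[max(start - window, 0):end + window]

nlp = spacy.blank("en")
doc = nlp("one two three four five six seven eight nine ten")
print(context(doc, 4, 6, window=2).text)  # → three four five six seven eight
```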
I can also show you the entire sentence that contains a match. This code prints the sentence containing the second match; change ‘found_matches[1][1]’ to ‘found_matches[0][1]’ to see the sentence for the first match 😉
for sent in sents:
    if found_matches[1][1] < sent.end:
        print(sent)
        break
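spaCy also offers a shortcut: every Token has a .sent attribute pointing at its containing sentence, as long as sentence boundaries have been set (the parser in en_core_web_sm does this for you). So instead of looping, you can index the doc with a match’s start token. A sketch with the rule-based sentencizer standing in for the parser, and a made-up token index standing in for found_matches[1][1]:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sets sentence boundaries, as the parser would

doc = nlp("He dove in. He was swimming vigorously. He reached the shore.")
start = 6  # pretend this is found_matches[1][1], the match's start token index
print(doc[start].sent.text)  # → He was swimming vigorously.
```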
Well, that was quite the deep dive into the story! We analyzed tokens, sentences, and even searched for specific phrases. You’re getting pretty good at using spaCy for text analysis. Keep practicing and you’ll be a natural in no time!