Natural Language Processing: NER and Stop Word Extraction with spaCy


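spaCy:

Download the English model once from the shell, before loading it:

$ sudo python3 -m spacy download en

Then load it in Python:

import spacy
# 'en' is a shortcut name that works in spaCy v2.x; it was removed in v3,
# where the full model name (for example 'en_core_web_sm') is required.
nlp = spacy.load('en')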

Named Entity Recognition (NER):

text = 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE'
doc = nlp(text)
# Print each recognised entity together with its label
for ent in doc.ents:
    print((ent.text, ent.label_))

Output

('Indian', 'NORP')
('Africa', 'LOC')
('55,000 years ago', 'DATE')
('Indus river', 'LOC')
('9,000 years ago', 'DATE')
('the Indus Valley Civilisation of the third', 'EVENT')
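
A flat list of pairs gets hard to scan once a text yields many entities, so it can help to group them by label. The following is a minimal sketch using only the standard library. The helper name `group_entities_by_label` is hypothetical, and the pairs are copied from the output above so the snippet runs without a model; with a live `doc`, the same pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict


def group_entities_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]}, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the NER output above (not recomputed here).
entities = [
    ('Indian', 'NORP'),
    ('Africa', 'LOC'),
    ('55,000 years ago', 'DATE'),
    ('Indus river', 'LOC'),
    ('9,000 years ago', 'DATE'),
    ('the Indus Valley Civilisation of the third', 'EVENT'),
]

grouped = group_entities_by_label(entities)
for label, texts in grouped.items():
    print(label, texts)
```

Grouping also makes questionable spans easier to spot, such as the EVENT entity here that stops at "third" instead of covering "third millennium BCE".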

Extract stop words from a sentence:

text = 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE'
doc = nlp(text)
# Extract stop words list
print([x for x in doc if x.is_stop])
# Remove all stop words from text
print([x for x in doc if not x.is_stop])

Output: stop words list

[on, the, from, no, than, on, the, in, the, of, the, into, the, of, the, third]

Output: text after removing stop words

[Modern, humans, arrived, Indian, subcontinent, Africa, later, 55,000, years, ago, ., Settled, life, emerged, subcontinent, western, margins, Indus, river, basin, 9,000, years, ago, ,, evolving, gradually, Indus, Valley, Civilisation, millennium, BCE]
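
To see which stop words dominate the sentence, the stop words list can be tallied. This is a small sketch using `collections.Counter` from the standard library. The words are copied from the printed output above so it runs without a model, and the variable names are illustrative. With a live `doc`, the list could instead be built as `[tok.text.lower() for tok in doc if tok.is_stop]`.

```python
from collections import Counter

# Stop words copied from the output above (not recomputed here).
stop_words = ['on', 'the', 'from', 'no', 'than', 'on', 'the', 'in', 'the',
              'of', 'the', 'into', 'the', 'of', 'the', 'third']

counts = Counter(stop_words)

# Show the three most frequent stop words with their counts
for word, n in counts.most_common(3):
    print(word, n)
```

Note that the filtered text above still contains punctuation tokens such as `.` and `,`. If those are unwanted, spaCy tokens also expose an `is_punct` attribute that can be added to the same list comprehension.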

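Limitations:

By default, spaCy refuses to process a text longer than `nlp.max_length` (1,000,000 characters). Passing a longer text to `nlp` raises the following error: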

Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 331671174 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB
of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not
using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can
check whether your inputs are too long by checking `len(text)`
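
Besides raising `nlp.max_length` as the message suggests, another way to stay under the limit is to split the input before handing it to spaCy. The sketch below is one possible approach, not something the library provides. The helper `split_text` and its `max_chars` parameter are hypothetical names. It breaks on the last space inside each window, so a sentence or entity that straddles a boundary can still be cut in two.

```python
def split_text(text, max_chars=1_000_000):
    """Split text into chunks of at most max_chars characters.

    Breaks at the last space inside each window when one exists,
    otherwise falls back to a hard cut at max_chars.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer a whitespace boundary so words are not split
            cut = text.rfind(' ', start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks


# Small demonstration with a tiny limit so the behaviour is visible
sample = 'aaa bbb ccc'
chunks = split_text(sample, max_chars=5)
print(chunks)
print([len(c) for c in chunks])
```

Each chunk can then be processed separately, for example with `for chunk_doc in nlp.pipe(split_text(text)): ...`, and the entities or tokens collected across chunks.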

Python Developer, AWS certified solutions architect associate | CSM | Django | Flask | www.linkedin.com/in/dineshkumarkb
