Natural Language Processing: NER and Stop Word Extraction with spaCy

What is NLP?
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, such as speech and text, by
software. In practice, it combines machine learning with math and statistics to convert text into a format that machine learning algorithms
can understand.
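
To make "a format that the algorithms can understand" concrete, here is a minimal bag-of-words sketch in plain Python. It is only an illustration of turning text into numbers; the helper name bag_of_words and the two sample sentences are made up for this example and are not part of spaCy.

```python
from collections import Counter

def bag_of_words(sentences):
    """Turn sentences into count vectors over a shared, sorted vocabulary."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One number per vocabulary word: how often it occurs in this sentence.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Libraries such as spaCy replace the naive split() with a proper tokenizer and add richer features on top, but the idea of mapping text to numbers stays the same.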

spaCy:

spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research and was designed from day one
to be used in real products. spaCy comes with pre-trained statistical models and word vectors. It currently supports tokenization for 49+ languages.

  1. https://spacy.io/
  2. https://github.com/explosion/spaCy

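Steps to install the spaCy Python library

Install the package, then download the English language model. On the setup used for this article, running the download without sudo threw a segmentation fault. sudo is usually not needed inside a virtual environment or a user-level install.

$ pip3 install spacy
$ sudo python3 -m spacy download en

Once the model is downloaded, load it. The 'en' shortcut works on spaCy v2.x. spaCy v3 removed shortcut names, so use the full model name there, for example 'en_core_web_sm'.

import spacy
nlp = spacy.load('en')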
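
Before loading anything, it can help to check what is actually installed. The sketch below uses only the standard library. It relies on the fact that spaCy models are installed as regular Python packages; first_installed and ENGLISH_MODELS are names invented for this example, and the 'en' shortcut itself is a link rather than a package, so it is not detected this way.

```python
import importlib.util

def first_installed(candidates):
    """Return the first name in `candidates` that is an importable package, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

# Assumed package names of the small, medium and large English models.
ENGLISH_MODELS = ("en_core_web_sm", "en_core_web_md", "en_core_web_lg")

model_name = first_installed(ENGLISH_MODELS)
if importlib.util.find_spec("spacy") is None:
    print("spaCy is not installed; run: pip3 install spacy")
elif model_name is None:
    print("No English model found; run: python3 -m spacy download en_core_web_sm")
else:
    print(f"Ready to call spacy.load({model_name!r})")
```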

Named Entity Recognition (NER):

Named entity recognition (NER) is often the first step in information extraction. It locates named entities in text and classifies them into pre-defined categories such as names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.

text = 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE'
doc = nlp(text)
# Print each detected entity together with its label
for X in doc.ents:
    print((X.text, X.label_))

Output

('Indian', 'NORP')
('Africa', 'LOC')
('55,000 years ago', 'DATE')
('Indus river', 'LOC')
('9,000 years ago', 'DATE')
('the Indus Valley Civilisation of the third', 'EVENT')
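
Once the entities are extracted, grouping them by label makes the result easier to scan. This is a minimal sketch that starts from the (text, label) pairs printed above, so it runs without a model; group_by_label is an illustrative helper, not a spaCy function.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# The pairs from the output above. With a loaded pipeline they could be built as
# pairs = [(ent.text, ent.label_) for ent in doc.ents]
pairs = [
    ('Indian', 'NORP'),
    ('Africa', 'LOC'),
    ('55,000 years ago', 'DATE'),
    ('Indus river', 'LOC'),
    ('9,000 years ago', 'DATE'),
    ('the Indus Valley Civilisation of the third', 'EVENT'),
]

grouped = group_by_label(pairs)
print(grouped['LOC'])   # ['Africa', 'Indus river']
print(grouped['DATE'])  # ['55,000 years ago', '9,000 years ago']
```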

Extract Stop words from a sentence

text = 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE'
doc = nlp(text)
# Extract stop words list
print([x for x in doc if x.is_stop])
# Remove all stop words from text
print([x for x in doc if not x.is_stop])

Output: stop words list

[on, the, from, no, than, on, the, in, the, of, the, into, the, of, the, third]

Output: text after removing stop words

[Modern, humans, arrived, Indian, subcontinent, Africa, later, 55,000, years, ago, ., Settled, life, emerged, subcontinent, western, margins, Indus, river, basin, 9,000, years, ago, ,, evolving, gradually, Indus, Valley, Civilisation, millennium, BCE]
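
A quick way to see how much of a text is made up of stop words is to count them. The sketch below works on the stop word list printed above, so it needs no model; the variable names are chosen for this example only.

```python
from collections import Counter

# Stop words exactly as printed in the output above. With a loaded pipeline the
# list could be built as [tok.text for tok in doc if tok.is_stop]
stop_words = ['on', 'the', 'from', 'no', 'than', 'on', 'the', 'in', 'the',
              'of', 'the', 'into', 'the', 'of', 'the', 'third']

counts = Counter(stop_words)
total = sum(counts.values())

# Show the most frequent stop words with their share of all stop words.
for word, n in counts.most_common(3):
    print(f"{word}: {n} ({n / total:.0%})")
```

Note that 'third' counts as a stop word in this output, which is worth remembering when a stop word filter runs before other processing steps.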

Limitations:

By default, spaCy has a max_length limit of 1,000,000 characters. Any string longer than this raises the following error.

Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 331671174 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB
of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not
using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can
check whether your inputs are too long by checking `len(text)`

https://github.com/boudinfl/pke/issues/68
https://github.com/explosion/spaCy/issues/2508
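
One way to stay under the limit, instead of raising nlp.max_length, is to split the input into smaller pieces and process them one at a time. The following is a minimal sketch in plain Python; chunk_text is a hypothetical helper written for this article, not a spaCy function, and it cuts at spaces rather than at sentence boundaries.

```python
def chunk_text(text, max_chars=1_000_000):
    """Yield pieces of `text` no longer than `max_chars`, cutting at a space when possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space inside the window so words stay whole.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        yield text[start:end]
        start = end

# Small demonstration with a tiny limit; real use would keep the default.
chunks = list(chunk_text("aaa bbb ccc", max_chars=5))
print(chunks)  # ['aaa ', 'bbb ', 'ccc']

# With a loaded pipeline, each piece could then be processed separately, e.g.
# for doc in nlp.pipe(chunk_text(long_text)): ...
```

Because the cut points ignore sentence structure, an entity that straddles two chunks can be missed or split, so a chunk size well below the limit with cuts at paragraph breaks may work better for NER.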

References

https://github.com/explosion/spaCy/issues/4838

https://spacy.io/

Python Back-End Developer, AWS | Django | Flask | Azure | www.linkedin.com/in/dineshkumarkb | https://dock2learn.com

Dinesh Kumar K B
