Natural Language Processing: NER and Stop Word Extraction with spaCy
What is NLP?
Natural Language Processing combines machine learning techniques with math and statistics to convert text
into a format that machine learning algorithms can understand.
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by
software.
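As a rough illustration of what "a format that machine learning algorithms can understand" can mean, the sketch below turns sentences into word-count vectors (a bag-of-words representation). This is only one simple option, chosen here for illustration. The helper names build_vocabulary and vectorize are made up for this example and are not part of spaCy.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Return a sorted list of the unique lowercase words across all sentences."""
    words = set()
    for sentence in sentences:
        words.update(sentence.lower().split())
    return sorted(words)


def vectorize(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Two toy sentences, split on whitespace only (no real tokenizer is used here)
sentences = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(sentences)
vectors = [vectorize(s, vocabulary) for s in sentences]

print(vocabulary)
print(vectors)
```

Each sentence becomes a list of numbers of the same length, which is the kind of input most machine learning algorithms expect.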
spaCy:
spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research and was designed from day one
to be used in real products. spaCy comes with pre-trained statistical models and word vectors. It currently supports tokenization for 49+ languages.
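Steps to install the spaCy Python library
Install the library, then manually download the English language model. In this setup the download had to be run with sudo; without it, a segmentation fault error was thrown.
$ pip3 install spacy
$ sudo python3 -m spacy download en
Once the model is downloaded, load it in Python:
import spacy
nlp = spacy.load('en')
Note: the 'en' shortcut works in spaCy v2.x. From v3 onward the shortcut links were removed, and the model is downloaded and loaded by its full package name, for example en_core_web_sm.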
Named Entity Recognition (NER):
Named entity recognition (NER) is probably the first step towards information extraction. It seeks to locate named entities in text and classify them into pre-defined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages.
text = 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE'
doc = nlp(text)
# Print each detected entity together with its label
for X in doc.ents:
    print((X.text, X.label_))
Output
('Indian', 'NORP')
('Africa', 'LOC')
('55,000 years ago', 'DATE')
('Indus river', 'LOC')
('9,000 years ago', 'DATE')
('the Indus Valley Civilisation of the third', 'EVENT')
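The (text, label) pairs printed above are often easier to work with once they are grouped by label. The sketch below does this in plain Python, using the pairs copied from the output above as input. The function name group_by_label is made up for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the NER output shown above
entities = [
    ('Indian', 'NORP'),
    ('Africa', 'LOC'),
    ('55,000 years ago', 'DATE'),
    ('Indus river', 'LOC'),
    ('9,000 years ago', 'DATE'),
    ('the Indus Valley Civilisation of the third', 'EVENT'),
]


def group_by_label(pairs):
    """Collect entity texts under their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(label, texts)
```

With a real spaCy document, the same pairs can be built with [(ent.text, ent.label_) for ent in doc.ents].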
Extract Stop words from a sentence
text = 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE'
doc = nlp(text)
# Extract stop words list
print([x for x in doc if x.is_stop])
# Remove all stop words from text
print([x for x in doc if not x.is_stop])
Output: stop words list
[on, the, from, no, than, on, the, in, the, of, the, into, the, of, the, third]
Output: text after removing stop words
[Modern, humans, arrived, Indian, subcontinent, Africa, later, 55,000, years, ago, ., Settled, life, emerged, subcontinent, western, margins, Indus, river, basin, 9,000, years, ago, ,, evolving, gradually, Indus, Valley, Civilisation, millennium, BCE]
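Conceptually, stop word filtering comes down to checking each word against a fixed list. The sketch below shows that idea in plain Python, assuming a tiny hand-picked stop list and whitespace splitting. It is not how spaCy is implemented: spaCy ships a much larger list and uses its own tokenizer. The names STOP_WORDS and split_stop_words are made up for this example.

```python
# A small, hand-picked stop list for illustration only;
# spaCy's built-in English list is much larger.
STOP_WORDS = {"on", "the", "from", "no", "than", "in", "of", "into"}


def split_stop_words(sentence, stop_words=STOP_WORDS):
    """Split whitespace-separated words into (stop words, remaining words)."""
    stops, kept = [], []
    for word in sentence.split():
        # Compare in lowercase so "The" and "the" are treated alike
        if word.lower() in stop_words:
            stops.append(word)
        else:
            kept.append(word)
    return stops, kept


stops, kept = split_stop_words("Settled life emerged on the subcontinent")
print(stops)
print(kept)
```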
Limitations:
spaCy has a default max_length limit of 1,000,000 characters. Any string longer than this will raise the following error.
Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 331671174 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB
of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not
using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can
check whether your inputs are too long by checking `len(text)`
https://github.com/boudinfl/pke/issues/68
https://github.com/explosion/spaCy/issues/2508
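One possible way to stay under the limit is to split a long text into smaller pieces before processing it. The sketch below is a plain Python helper written for this post, not something provided by spaCy; the name split_into_chunks and the choice to cut at spaces are assumptions. Cutting at a space can still separate a sentence or an entity across two chunks, so results near chunk boundaries should be treated with care.

```python
def split_into_chunks(text, max_length=1_000_000):
    """Split text into pieces of at most max_length characters,
    cutting at the last space before the limit when there is one."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Prefer to cut at a space so words are not split in half
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks


# Tiny demonstration with an artificially small limit
sample = "aaa bbb ccc"
chunks = split_into_chunks(sample, max_length=5)
print(chunks)
```

Each chunk can then be processed separately, for example with nlp.pipe(chunks). As the error message itself notes, raising nlp.max_length is the other option when the parser and NER are not being used.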