Introduction to Python TextBlob: Simplified Text Processing

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob is a wrapper around the NLTK library, which is widely used for text processing and NLP. NLTK is very in-depth, but TextBlob exposes a simple, convenient API over it for the most common tasks, which makes it well suited to beginners and to experimentation. After learning TextBlob, you can move on to NLTK itself to study intermediate and advanced NLP concepts in depth.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration

Installation

TextBlob can be installed using pip. If you need to install Python or pip first, see this post.

pip install textblob

Or, if you have Python 3 installed:

pip3 install textblob

Introduction

The main class in the textblob package is TextBlob:

from textblob import TextBlob

Let’s create a TextBlob instance from a paragraph of text taken from an example in the TextBlob documentation.

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

# Create a TextBlob object
blob = TextBlob(text)

Tokenization

Like other operations in TextBlob, tokenization is easy. The words and sentences properties of the blob object return the word tokens (with punctuation dropped) and the sentences.

>>> blob.words
WordList(['The', 'titular', 'threat', 'of', 'The', 'Blob', 'has', 'always', 'struck', 'me', 'as', 'the', 'ultimate', 'movie', 'monster', 'an', 'insatiably', 'hungry', 'amoeba-like', 'mass', 'able', 'to', 'penetrate', 'virtually', 'any', 'safeguard', 'capable', 'of', 'as', 'a', 'doomed', 'doctor', 'chillingly', 'describes', 'it', 'assimilating', 'flesh', 'on', 'contact', 'Snide', 'comparisons', 'to', 'gelatin', 'be', 'damned', 'it', "'s", 'a', 'concept', 'with', 'the', 'most', 'devastating', 'of', 'potential', 'consequences', 'not', 'unlike', 'the', 'grey', 'goo', 'scenario', 'proposed', 'by', 'technological', 'theorists', 'fearful', 'of', 'artificial', 'intelligence', 'run', 'rampant'])

>>> blob.sentences
[Sentence("
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact."), Sentence("Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.")]

Various properties and methods of WordList and Sentence will be discussed in a later section of this post.
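
The feature list mentions word frequencies, which follow naturally from tokenization. As a rough, library-independent sketch, the snippet below counts case-insensitive frequencies over a short slice of the blob.words output shown above. The tokens list and the word_frequencies helper are illustrative names and not part of TextBlob; TextBlob itself exposes comparable counts through blob.word_counts.

```python
from collections import Counter

# First 14 tokens copied by hand from the blob.words output above
# (an illustrative sample, not the full list).
tokens = ['The', 'titular', 'threat', 'of', 'The', 'Blob', 'has', 'always',
          'struck', 'me', 'as', 'the', 'ultimate', 'movie']


def word_frequencies(words):
    """Count case-insensitive occurrences of each word (hypothetical helper)."""
    return Counter(word.lower() for word in words)


freqs = word_frequencies(tokens)
print(freqs['the'])          # 'The', 'The' and 'the' are counted together
print(freqs.most_common(1))  # the single most frequent token in the sample
```

Lowercasing before counting is a design choice: without it, 'The' and 'the' would be counted as different words.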

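Lemmatization

Each item in blob.words is a Word object with a lemmatize() method that reduces the word to its dictionary form (lemma). Called without arguments, lemmatize() treats every word as a noun, which is why 'has' becomes 'ha' and 'as' becomes 'a' in the output below, while verbs such as 'struck' are left unchanged. Pass a part of speech, for example word.lemmatize('v'), to lemmatize verbs.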

for word in blob.words:
    print(word.lemmatize())

#OUTPUT
The
titular
threat
of
The
Blob
ha
always
struck
me
a
the
ultimate
movie
monster
an
insatiably
hungry
amoeba-like
mass
able
to
penetrate
virtually
any
safeguard
capable
of
a
a
doomed
doctor
chillingly
describes
it
assimilating
flesh
on
contact
Snide
comparison
to
gelatin
be
damned
it
's
a
concept
with
the
most
devastating
of
potential
consequence
not
unlike
the
grey
goo
scenario
proposed
by
technological
theorist
fearful
of
artificial
intelligence
run
rampant

Part-of-Speech (POS) Tags

>>> blob.tags
[('The', 'DT'), ('titular', 'JJ'), ('threat', 'NN'), ('of', 'IN'), ('The', 'DT'), ('Blob', 'NNP'), ('has', 'VBZ'), ('always', 'RB'), ('struck', 'VBN'), ('me', 'PRP'), ('as', 'IN'), ('the', 'DT'), ('ultimate', 'JJ'), ('movie', 'NN'), ('monster', 'NN'), ('an', 'DT'), ('insatiably', 'RB'), ('hungry', 'JJ'), ('amoeba-like', 'JJ'), ('mass', 'NN'), ('able', 'JJ'), ('to', 'TO'), ('penetrate', 'VB'), ('virtually', 'RB'), ('any', 'DT'), ('safeguard', 'NN'), ('capable', 'JJ'), ('of', 'IN'), ('as', 'IN'), ('a', 'DT'), ('doomed', 'JJ'), ('doctor', 'NN'), ('chillingly', 'RB'), ('describes', 'VBZ'), ('it', 'PRP'), ('assimilating', 'VBG'), ('flesh', 'NN'), ('on', 'IN'), ('contact', 'NN'), ('Snide', 'JJ'), ('comparisons', 'NNS'), ('to', 'TO'), ('gelatin', 'VB'), ('be', 'VB'), ('damned', 'VBN'), ('it', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('concept', 'NN'), ('with', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('devastating', 'JJ'), ('of', 'IN'), ('potential', 'JJ'), ('consequences', 'NNS'), ('not', 'RB'), ('unlike', 'IN'), ('the', 'DT'), ('grey', 'NN'), ('goo', 'NN'), ('scenario', 'NN'), ('proposed', 'VBN'), ('by', 'IN'), ('technological', 'JJ'), ('theorists', 'NNS'), ('fearful', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('run', 'NN'), ('rampant', 'NN')]
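
The tags property returns plain (word, tag) tuples with Penn Treebank style labels, so ordinary list processing is enough to pull out a word class. The sketch below works on a short slice of the output above, copied by hand; words_with_tag_prefix is a hypothetical helper, not a TextBlob function. It relies on the convention that noun tags start with NN and verb tags start with VB.

```python
# The first ten (word, tag) pairs copied from the blob.tags output above.
tags = [('The', 'DT'), ('titular', 'JJ'), ('threat', 'NN'), ('of', 'IN'),
        ('The', 'DT'), ('Blob', 'NNP'), ('has', 'VBZ'), ('always', 'RB'),
        ('struck', 'VBN'), ('me', 'PRP')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with prefix (hypothetical helper)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, 'NN')  # matches NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tags, 'VB')  # matches VB, VBZ, VBN, VBG, ...
print(nouns)
print(verbs)
```

With a real blob, the same helper could be called as words_with_tag_prefix(blob.tags, 'NN').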

Getting Noun Phrases

>>> blob.noun_phrases
WordList(['titular threat', 'blob', 'ultimate movie monster', 'amoeba-like mass', 'snide', 'potential consequences', 'grey goo scenario', 'technological theorists fearful', 'artificial intelligence run rampant'])

Sentiment

>>> blob.sentiment
Sentiment(polarity=-0.1590909090909091, subjectivity=0.6931818181818182)

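blob.sentiment returns a Sentiment named tuple. Its polarity ranges from -1.0 (most negative) to 1.0 (most positive), and its subjectivity from 0.0 (very objective) to 1.0 (very subjective). Each value can be read individually: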

>>> blob.sentiment.polarity
-0.1590909090909091
>>> blob.sentiment.subjectivity
0.6931818181818182

We can also get the sentiment of individual sentences through the sentiment property of each Sentence instance:

>>> for sentence in blob.sentences:
...     print(sentence.sentiment)
...
Sentiment(polarity=0.06000000000000001, subjectivity=0.605)
Sentiment(polarity=-0.34166666666666673, subjectivity=0.7666666666666666)
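
Raw polarity scores are often turned into coarse labels before further use. The following is a minimal sketch of one way to do that, applied to the two sentence polarities printed above. The polarity_label helper and its 0.1 threshold are assumptions made for illustration; TextBlob does not define either, and a suitable cut-off depends on the data.

```python
def polarity_label(polarity, threshold=0.1):
    """Map a polarity score to a coarse label.

    The threshold is an arbitrary, illustrative cut-off: scores within
    [-threshold, threshold] are treated as neutral.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# Sentence polarities copied from the output above.
sentence_polarities = [0.06000000000000001, -0.34166666666666673]
labels = [polarity_label(p) for p in sentence_polarities]
print(labels)
```

With a real blob, the list of polarities could be built as [s.sentiment.polarity for s in blob.sentences].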

Synsets

>>> blob.words[1].synsets
[Synset('titular.a.01'), Synset('titular.a.02'), Synset('titular.a.03'), Synset('titular.a.04'), Synset('nominal.s.06')]

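Synsets are looked up in WordNet. To restrict the lookup to one part of speech, call Word.get_synsets() with one of the constants from textblob.wordnet: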

from textblob.wordnet import VERB
from textblob import Word

>>> Word('run').get_synsets(pos=VERB)
[Synset('run.v.01'), Synset('scat.v.01'), Synset('run.v.03'), Synset('operate.v.01'), Synset('run.v.05'), Synset('run.v.06'), Synset('function.v.01'), Synset('range.v.01'), Synset('campaign.v.01'), Synset('play.v.18'), Synset('run.v.11'), Synset('tend.v.01'), Synset('run.v.13'), Synset('run.v.14'), Synset('run.v.15'), Synset('run.v.16'), Synset('prevail.v.03'), Synset('run.v.18'), Synset('run.v.19'), Synset('carry.v.15'), Synset('run.v.21'), Synset('guide.v.05'), Synset('run.v.23'), Synset('run.v.24'), Synset('run.v.25'), Synset('run.v.26'), Synset('run.v.27'), Synset('run.v.28'), Synset('run.v.29'), Synset('run.v.30'), Synset('run.v.31'), Synset('run.v.32'), Synset('run.v.33'), Synset('run.v.34'), Synset('ply.v.03'), Synset('hunt.v.01'), Synset('race.v.02'), Synset('move.v.13'), Synset('melt.v.01'), Synset('ladder.v.01'), Synset('run.v.41')]

# Getting definitions
>>> Word('car').definitions
['a motor vehicle with four wheels; usually propelled by an internal combustion engine', 'a wheeled vehicle adapted to the rails of railroad', 'the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant', 'where passengers ride up and down', 'a conveyance for passengers or freight on a cable railway']

n-grams

>>> blob.ngrams(2)
[WordList(['The', 'titular']), WordList(['titular', 'threat']), WordList(['threat', 'of']), WordList(['of', 'The']), WordList(['The', 'Blob']), WordList(['Blob', 'has']), WordList(['has', 'always']), WordList(['always', 'struck']), WordList(['struck', 'me']), WordList(['me', 'as']), WordList(['as', 'the']), WordList(['the', 'ultimate']), WordList(['ultimate', 'movie']), WordList(['movie', 'monster']), WordList(['monster', 'an']), WordList(['an', 'insatiably']), WordList(['insatiably', 'hungry']), WordList(['hungry', 'amoeba-like']), WordList(['amoeba-like', 'mass']), WordList(['mass', 'able']), WordList(['able', 'to']), WordList(['to', 'penetrate']), WordList(['penetrate', 'virtually']), WordList(['virtually', 'any']), WordList(['any', 'safeguard']), WordList(['safeguard', 'capable']), WordList(['capable', 'of']), WordList(['of', 'as']), WordList(['as', 'a']), WordList(['a', 'doomed']), WordList(['doomed', 'doctor']), WordList(['doctor', 'chillingly']), WordList(['chillingly', 'describes']), WordList(['describes', 'it']), WordList(['it', 'assimilating']), WordList(['assimilating', 'flesh']), WordList(['flesh', 'on']), WordList(['on', 'contact']), WordList(['contact', 'Snide']), WordList(['Snide', 'comparisons']), WordList(['comparisons', 'to']), WordList(['to', 'gelatin']), WordList(['gelatin', 'be']), WordList(['be', 'damned']), WordList(['damned', 'it']), WordList(['it', "'s"]), WordList(["'s", 'a']), WordList(['a', 'concept']), WordList(['concept', 'with']), WordList(['with', 'the']), WordList(['the', 'most']), WordList(['most', 'devastating']), WordList(['devastating', 'of']), WordList(['of', 'potential']), WordList(['potential', 'consequences']), WordList(['consequences', 'not']), WordList(['not', 'unlike']), WordList(['unlike', 'the']), WordList(['the', 'grey']), WordList(['grey', 'goo']), WordList(['goo', 'scenario']), WordList(['scenario', 'proposed']), WordList(['proposed', 'by']), WordList(['by', 'technological']), WordList(['technological', 
'theorists']), WordList(['theorists', 'fearful']), WordList(['fearful', 'of']), WordList(['of', 'artificial']), WordList(['artificial', 'intelligence']), WordList(['intelligence', 'run']), WordList(['run', 'rampant'])]
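
Conceptually, an n-gram list is a sliding window of n consecutive words. The plain-Python sketch below reproduces the shape of the output above on a short sample; sliding_ngrams is a hypothetical helper written for illustration and is not how TextBlob is necessarily implemented.

```python
def sliding_ngrams(words, n=3):
    """Return every run of n consecutive words as a list (hypothetical helper).

    A list of N words yields N - n + 1 n-grams, or none if N < n.
    """
    if n <= 0:
        return []
    return [words[i:i + n] for i in range(len(words) - n + 1)]


# The first four tokens from the blob.words output shown earlier.
sample = ['The', 'titular', 'threat', 'of']
bigrams = sliding_ngrams(sample, 2)
trigrams = sliding_ngrams(sample)  # default window of three words
print(bigrams)
print(trigrams)
```

Each bigram overlaps the previous one by a single word, which matches the pattern in the blob.ngrams(2) output above.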

Translations

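We can detect the language of a text with detect_language(). Language detection and translation both call the Google Translate web service, so they need an internet connection. These methods were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so the examples in this section only run on older releases.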

>>> blob.detect_language()
'en'

We can also translate the text with translate():

>>> blob.translate(to='fr')
TextBlob("La menace titulaire de The Blob m'a toujours été le film ultime
monstre: une masse insatisfaisante affamée et amibe capable de pénétrer
pratiquement n'importe quelle sauvegarde, capable de - en tant que docteur condamné avec calme
le décrit - "assimilant la chair au contact.
Les comparaisons de Snide à la gélatine seront damnées, c'est un concept avec le plus grand
Dévastatrice de conséquences potentielles, contrairement au scénario gris
proposé par les théoriciens technologiques craignant
l'intelligence artificielle est courante.")

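Note: the to parameter of translate() takes a language code as used by Google Translate, which for most languages is the two-letter ISO 639-1 code (such as 'fr' for French) rather than a three-letter ISO 639-2 code. For the full list of language codes, see this link.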
