4 NLP libraries that are awesome

Many libraries have been created to solve NLP problems. Here are some of the ones that have helped us deliver quality projects to our clients over the years. This list is not a complete overview of all available NLP libraries, just the ones that we think are awesome.

In the past, an NLP project required a lot of great minds working together: you needed mathematicians, machine learning engineers, and linguists. Now, developers can use ready-made tools that simplify text preprocessing so that they can concentrate on building machine learning models.

Why Python?

First of all, our first choice of programming language for NLP projects is Python. Its simple syntax and transparent semantics make it an excellent choice.

But there is another reason why Python is a great language for helping computers cope with natural languages: there is an extensive collection of NLP libraries that handle a great number of tasks, such as sentiment analysis, tokenization, and classification.

NLTK

NLTK (Natural Language Toolkit) is generally the most popular library. This is because of the wide range of applications it supports, such as sentiment analysis, tokenization, and classification. NLTK can also be applied to many languages, including Dutch (which is often not the case with other libraries). It is especially useful for text processing.

The downside is that this library can be quite slow and difficult to use: the learning curve is steep.

NLTK features include:

  • Text classification
  • Part-of-speech tagging
  • Entity extraction
  • Tokenization
  • Parsing
  • Stemming
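  • Semantic reasoning

from nltk.tokenize import word_tokenize

# word_tokenize relies on NLTK's Punkt tokenizer data, which must be
# downloaded once with nltk.download() before first use.
sample_text = "this text needs to be tokenized"
print(word_tokenize(sample_text))

# ----- Expected output -----
# ['this', 'text', 'needs', 'to', 'be', 'tokenized']

from nltk.stem.snowball import SnowballStemmer

# Snowball stemmers are available for several languages, including Dutch
dutch_stemmer = SnowballStemmer("dutch")
print(dutch_stemmer.stem("artikelen"))

# ----- Expected output -----
# artikel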

SpaCy

SpaCy, written in Python for convenience and Cython for speed, is often seen as the next step after NLTK, which is clumsy and slow when it comes to more complex business applications.

We also prefer this library over NLTK because of its speed, which comes from being written in Cython. It’s a relatively young library designed for production use, and that also makes it more accessible than other Python libraries.

SpaCy is good at syntactic analysis, which is handy for aspect-based sentiment analysis and conversational user interface optimization.

SpaCy is also an excellent choice for named-entity recognition. You can use SpaCy for business insights and market research.

It’s a perfect match for comparing customer profiles, product profiles, or text documents.

It includes almost every feature found in competing frameworks:

  • Part-of-speech tagging
  • Dependency parsing
  • Named entity recognition
  • Tokenization
  • Sentence segmentation
  • Rule-based match operations
  • Word vectors

You can also work with word vectors, which are used in, for example, topic modelling. That is a real advantage of this library. Unlike OpenNLP and CoreNLP, SpaCy works with word2vec and doc2vec.

The biggest advantage over the other NLP tools is its API. SpaCy bundles all of its functions in one place, so you don’t need to select modules on your own.

However, there is also a big downside to this tool: it supports fewer languages than the alternatives. But this should improve as its popularity increases.

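import spacy

# Load the small English pipeline (install it first with:
# python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")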
doc = nlp("ML2Grow is a fast growing startup located in Ghent")

for ent in doc.ents:
    print(ent.text, ent.label_)

# ----- Expected output -----
# ML2Grow ORG
# Ghent GPE

ORG: Companies, agencies, institutions
GPE: Geopolitical entities, i.e. countries, cities, states

Gensim

Sometimes you need to extract particular information to discover business insights. Gensim is the perfect tool for such things.

We mainly use Gensim for topic modelling and for finding similarities between text documents. It is a great library for identifying similarities between two documents through vector space modelling and topic modelling. It treats the content of the documents as sequences of vectors and clusters, and then classifies them. Combined with the Python library pyLDAvis, it provides a beautiful visualization of the topics.
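To make the vector-space idea concrete, here is a minimal sketch in plain Python that does not use Gensim at all. It assumes a simple bag-of-words representation and cosine similarity; the helper names bag_of_words and cosine_similarity are ours, purely for illustration, and Gensim's own models are considerably more sophisticated.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a sparse word-count vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[word] * vec_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction, so treat it as dissimilar
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("the invoice was sent to the customer")
doc_b = bag_of_words("the customer received the invoice")
doc_c = bag_of_words("topic models cluster similar documents")

print(cosine_similarity(doc_a, doc_b))  # relatively high: many shared words
print(cosine_similarity(doc_a, doc_c))  # 0.0: no words in common
```

Documents that share vocabulary end up pointing in a similar direction in this space, which is the intuition behind comparing documents by their vectors.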

It is also highly optimized for memory usage and processing speed, which is why it can handle large amounts of text data.

The main Gensim use cases are:

  • Data analysis
  • Semantic search applications
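  • Text generation applications (chatbot, service customization, text summarization, etc.)

from gensim.models import Word2Vec

# Load a pre-trained Word2Vec model
# ("modelName.model" is a placeholder path)
model = Word2Vec.load("modelName.model")

# In Gensim 4.x, similarity queries live on the model's word vectors (model.wv)
print(model.wv.similarity('Complement', 'Compliment'))

# ----- Expected output -----
# 0.961089779453727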

Flair

Flair is a simple NLP library. Its framework builds directly on PyTorch, one of the best deep learning frameworks out there.

Flair is a great library for entity recognition and part-of-speech tagging. It works very well on English text but gives horrible results on Dutch documents. Since the majority of our customers have Dutch-language sources, it makes little sense for us to use this library, but we still love it.

Main NLP tasks:

  • Named-Entity Recognition
  • Part-of-Speech Tagging
  • Text classification
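  • Training Custom Models

from flair.models import TextClassifier
from flair.data import Sentence

# Load the pre-trained English sentiment classifier
classifier = TextClassifier.load('en-sentiment')
sentence = Sentence('NLP libraries are awesome!')

# predict() annotates the sentence in place rather than returning the labels
classifier.predict(sentence)
print(sentence.labels)

# ----- Expected output -----
# [Positive (1.0)]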

Side note

Although the libraries mentioned above are often multilingual, they are still strongly focused on English. Most NLP research likewise concentrates on English and other widely spoken languages such as Chinese or Spanish, which naturally leads to further marginalization of other languages.

Something that is certainly worth mentioning but didn’t make the list is BERT, an open-source neural-network-based technique for NLP developed by Google.

BERT is an acronym for Bidirectional Encoder Representations from Transformers. The term bidirectional means that the context of a word is given both by the words that precede it and by the words that follow it. This makes the model hard to train but very effective.
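The difference between one-directional and bidirectional context can be shown with a toy sketch in plain Python. This is not how BERT is implemented; the function names left_context and bidirectional_context are hypothetical and only illustrate which words are visible when interpreting a given word.

```python
def left_context(tokens, index):
    """Words a left-to-right model can see when interpreting tokens[index]."""
    return tokens[:index]


def bidirectional_context(tokens, index):
    """Words on both sides of tokens[index], as seen by a bidirectional model."""
    return tokens[:index], tokens[index + 1:]


tokens = "the bank of the river was flooded".split()
target = tokens.index("bank")

print(left_context(tokens, target))
print(bidirectional_context(tokens, target))
```

In this sentence, the word that disambiguates "bank" ("river") only appears to its right, so a purely left-to-right view would miss it.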

In 2019, BERT was adopted by Google Search for over 70 languages. Last year, almost every English-language search query was processed by BERT.

In our next and final post of this series, we will delve deeper into the practical side of NLP in business and industry.

Gilles Deweerdt
