15 Natural Language Processing Libraries Worth a Try
With the rise of machine learning, NLP has become accessible to a wider developer community. This post gives an overview of 15 libraries worth a try in 2020.
The list is vaguely sorted by popularity and adoption in academia or industry.
- NLTK – Toolkit for human text analysis.
- spaCy – Opinionated NLP framework, “Ruby on Rails for NLP”.
- scikit-learn – Machine learning library used in NLP tools.
- gensim – Performant library for finding similarities in documents.
- TextBlob – Simplified text processing on top of NLTK.
- Pattern – Web mining tool, includes a text analysis API.
- Polyglot – Basic NLP pipeline for a large number of human languages.
- CoreNLP – Feature-rich NLP library, pre-trained models for sentiment analysis.
- OpenNLP – Standard NLP pipeline, similar to NLTK.
- PyTorch – Machine learning framework suitable for NLP thanks to a vast ecosystem.
- AllenNLP – Deep learning and high-quality models for NLP.
- PyNLPl – Library for various NLP tasks, extensive support for linguistic annotations.
- Stanza – Toolkit for accurate text analysis and efficient model training.
- Quepy – Transforms questions in plain English into database queries.
- textaCy – Adds features on top of spaCy – readability tests, text statistics, etc.
*NLTK itself is licensed for non-commercial use, but commercial licenses are available for some corpora.
**Stanford provides paid licenses for commercial use of CoreNLP and Stanza.
***Use of Quepy requires attribution to its authors.
I tried text summarization with NLTK, spaCy and gensim. See how they compare to each other.
Detailed Overview
NLTK – Natural Language Toolkit
License: Apache 2.0
Commercial use: No*
Purpose: Toolkit for human text analysis.
* NLTK itself is licensed for non-commercial use, but commercial licenses are available for some corpora.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
Your NLP initiation. Source: NLTK
Pros
- Great for learning (and teaching) of core principles
- Suitable for testing of various algorithms
- Large corpora in several languages
- Other tools / libraries use NLTK internally
Cons
- Legacy design makes it slow and tricky to use compared to today’s industry standards
- Steep learning curve, though there are plenty of tutorials and, of course, the one and only handbook!
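To give a flavour of those core principles, here is a minimal sketch using two pieces that ship with NLTK itself, so no corpus downloads are needed; the sentence is an arbitrary example.

```python
# Tokenization and stemming with NLTK's bundled components.
from nltk.stem import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

tokens = TreebankWordTokenizer().tokenize("Cats are running wildly.")
stems = [PorterStemmer().stem(t) for t in tokens]

print(tokens)  # ['Cats', 'are', 'running', 'wildly', '.']
print(stems)   # lower-cased stems, e.g. 'running' -> 'run'
```

Many richer features (POS tagging, the bundled corpora) require a one-off `nltk.download(...)` first.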
spaCy
License: MIT
Commercial use: Yes
Purpose: Opinionated NLP framework – “Ruby on Rails for NLP”.
spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it.
Getting things done with spaCy. Source: spaCy
Pros
- Designed for busy developers
- Performant, suitable for large-scale text analysis
- Integrates well with other libraries
Cons
- Supports fewer human languages compared to other tools
- Opinionated – meaning fewer options for tweaking algorithms (if you care to)
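A quick taste of the API, sketched with a blank English pipeline so no model download is required; for POS tags, parses and entities you would load a pre-trained model such as `en_core_web_sm` instead.

```python
import spacy

# spacy.blank("en") gives a tokenizer-only pipeline with no download.
# For the full pipeline:
#   python -m spacy download en_core_web_sm
#   nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
doc = nlp("spaCy respects your time.")

print([token.text for token in doc])  # ['spaCy', 'respects', 'your', 'time', '.']
```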
scikit-learn
License: BSD
Commercial use: Yes
Purpose: Machine learning library used in NLP tools.
Simple and efficient tools for predictive data analysis. Open source, commercially usable.
I don’t play no games! Source: scikit-learn
Pros
- Versatile, range of models and algorithms
- Solid foundations, built on SciPy and NumPy
- Well documented, with a proven track record of real-life applications
Cons
- Limited support for deep learning and neural networks
- Tricky to use for complex pipelines
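As an illustration of the kind of NLP workflow scikit-learn is used for, here is a sketch of a bag-of-words sentiment classifier; the tiny training set is made up purely for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled data, invented for illustration.
texts = ["great product, loved it", "awful, waste of money",
         "really loved the quality", "money wasted, awful support"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words features feeding a Naive Bayes classifier, as one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["loved the great support"]))  # ['pos']
```

The `Pipeline` abstraction is also where the “tricky for complex pipelines” criticism bites: chaining many custom transformers quickly gets verbose.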
gensim
https://radimrehurek.com/gensim
License: LGPLv2
Commercial use: Yes
Purpose: Performant library for finding similarities in documents.
By now, Gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.
Battle-tested and refined. Source: The author
Pros
- Robust and scalable
- Streamed processing of large documents
- Built for a specific job, does it well
Cons
- As a specialised tool it lacks support for full-fledged NLP pipelines
TextBlob
https://textblob.readthedocs.io/en/dev
License: MIT
Commercial use: Yes
Purpose: Simplified text processing on top of NLTK.
TextBlob stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
Reuse, refine and simplify. Source: TextBlob
Pros
- Pragmatic and easy to use
- Consistent API on top of disparate (underlying) libraries
Cons
- Prone to the same limitations as its foundation, NLTK
Pattern
https://github.com/clips/pattern
License: BSD
Commercial use: Yes
Purpose: Web mining tool, includes text analysis API.
It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser)
Web mining for Python. Source: Pattern
Pros
- Designed for information mining from Twitter, Wikipedia, Google searches etc.
- Excels at finding valuable insights while scraping the web – sentiment, superlatives, opinion mining etc.
- Hands-on, great documentation with examples
Cons
- As a specialised tool, it lacks support for some standard NLP pipeline tasks
Polyglot
https://github.com/aboSamoor/polyglot
License: GPLv3
Commercial use: Yes
Purpose: Basic NLP pipeline on a large number of human languages.
Supports massive multilingual applications … Tokenization (165 Languages) … Language detection (196 Languages) …
True globetrotter. Source: The author
Pros
- Your usual NLP pipeline with an important distinction – it’s multilingual (close to 200 human languages for some tasks)
- Accurate and performant, built on top of NumPy
Cons
- Smaller community compared to other general-purpose libraries (NLTK, spaCy, …)
CoreNLP
https://stanfordnlp.github.io/CoreNLP
License: GPLv3
Commercial use: Yes*
Purpose: Feature-rich NLP library, pre-trained models for sentiment analysis.
* Stanford provides paid licenses for commercial use of CoreNLP and Stanza.
One stop shop for natural language processing in Java!
Yay! Finally something that runs on the JVM… Source: Stanford NLP
Pros
- A rare opportunity to do NLP in a language other than Python
- Reliable and proven, used both in academia and commercially
Cons
- Slow when compared to spaCy
- Somewhat isolated, since most of the ecosystem speaks Python – in fact, there is a Python wrapper for CoreNLP
OpenNLP
License: Apache 2.0
Commercial use: Yes
Purpose: Standard NLP pipeline, similar to NLTK.
… a machine learning-based toolkit for the processing of natural language text.
NLTK in Java. Source: Apache OpenNLP
Pros
- Focuses on elementary NLP tasks and does them well: tokenization, sentence detection or even entity recognition
- Feature-rich tool for model training
Cons
- Lacks advanced features; a transition to CoreNLP is the next logical step if you want to stick to the JVM
PyTorch
License: BSD
Commercial use: Yes
Purpose: Machine learning framework suitable for NLP thanks to a vast ecosystem.
An optimized tensor library for deep learning using GPUs and CPUs.
Go deep and enjoy a rich ecosystem. Source: PyTorch
Pros
- Robust framework, rich in tooling
- Cloud platform and ecosystem make it suitable for production use
Cons
- A general machine learning toolkit; using it for NLP requires in-depth knowledge of core NLP algorithms
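To illustrate that last point: even a minimal text classifier in raw PyTorch means assembling embeddings, a recurrent layer and a classification head yourself. All sizes below are arbitrary illustration values.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Token ids -> embeddings -> LSTM -> class logits."""
    def __init__(self, vocab_size=100, embed_dim=16, hidden=32, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, token_ids):
        _, (h, _) = self.rnn(self.embed(token_ids))
        return self.out(h[-1])  # logits from the final hidden state

# A batch of 4 random "sentences" of 7 token ids each.
logits = TinyClassifier()(torch.randint(0, 100, (4, 7)))
print(logits.shape)  # torch.Size([4, 2])
```

Tokenization, padding, vocabularies and training loops are all still on you – which is exactly why libraries like AllenNLP build on top of PyTorch.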
AllenNLP
https://github.com/allenai/allennlp
License: Apache 2.0
Commercial use: Yes
Purpose: Deep learning and high-quality models for NLP.
… design and evaluate new deep learning models for nearly any NLP problem.
State of the art models. Source: AllenNLP
Pros
- Great for exploration and prototyping using state of the art models
- Built on top of PyTorch
- Used both in academia and commercially
Cons
- Not suitable for large scale projects running in production
PyNLPl
https://github.com/proycon/pynlpl
License: GPLv3
Commercial use: Yes
Purpose: Library for various NLP tasks, extensive support for linguistic annotations.
‘Pineapple’ contains various modules useful for common, and less common, NLP tasks.
Produce models compatible with other NLP tools. Source: PyNLPl
Pros
- Suitable for extraction of n-grams, frequency lists and other basic tasks
- Modular structure
Cons
- Limited documentation and the project seems to be stalled
Stanza
https://stanfordnlp.github.io/stanza
License: Apache 2.0
Commercial use: Yes*
Purpose: Toolkit for accurate text analysis and efficient model training.
* Stanford provides paid licenses for commercial use of CoreNLP and Stanza.
A collection of accurate and efficient tools for many human languages in one place.
More than just a Python wrapper around CoreNLP. Source: Stanford NLP
Pros
- Goes beyond the basic NLP tasks and provides high accuracy
- Performant, supports GPU processing
- Brings CoreNLP to the Python world
Cons
- Still evolving; the community has yet to grow
Quepy
https://github.com/machinalis/quepy
License: custom
Commercial use: Yes*
Purpose: Transforms questions in plain English into database queries.
* Use of Quepy requires attribution to its authors.
With little coding you can build your own system for natural language access to your database.
From human queries to SQL(-like)… Source: Quepy
Pros
- Unique and pragmatic
- Powerful thanks to SPARQL support, e.g. queries across multiple data sources
Cons
- The project seems to be stalled
textaCy
https://github.com/chartbeat-labs/textacy
License: Apache 2.0
Commercial use: Yes
Purpose: Adds features on top of spaCy – readability tests, text statistics, etc.
With the fundamentals delegated to another library, textacy focuses primarily on the tasks that come before and follow after.
Beyond the obvious. Source: textaCy
Pros
- Built on top of spaCy, follows the same philosophy of easy to use API
- Additional features that are hard to get otherwise: document similarity by a variety of metrics, text stats etc.
Cons
- Still evolving, limited documentation