
15 Natural Language Processing Libraries Worth a Try

With the rise of machine learning, NLP has become accessible to a wider developer community. This post gives an overview of 15 libraries worth a try in 2020.


The list is vaguely sorted by popularity and adoption in academia or industry.

NLTK* – Toolkit for human text analysis.
spaCy – Opinionated NLP framework, “Ruby on Rails for NLP”.
scikit-learn – Machine learning library used in NLP tools.
gensim – Performant library for finding similarities in documents.
TextBlob – Simplified text processing on top of NLTK.
Pattern – Web mining tool, includes a text analysis API.
Polyglot – Basic NLP pipeline for a large number of human languages.
CoreNLP** (Java) – Feature-rich NLP library, pre-trained models for sentiment analysis.
OpenNLP (Java) – Standard NLP pipeline, similar to NLTK.
PyTorch – Machine learning framework suitable for NLP thanks to a vast ecosystem.
AllenNLP – Deep learning and high-quality models for NLP.
PyNLPl – Library for various NLP tasks, extensive support for linguistic annotations.
Stanza** – Toolkit for accurate text analysis and efficient model training.
Quepy*** – Transforms questions in plain English into database queries.
textaCy – Adds features on top of spaCy – readability tests, text statistics etc.
15 Most popular NLP libraries – the order is vaguely based on popularity.

*NLTK itself is Apache-licensed, but some of the bundled corpora are restricted to non-commercial use.

**Stanford provides paid licenses for commercial use of CoreNLP and Stanza.

***Use of Quepy requires attribution to its authors.

I tried text summarization with NLTK, spaCy and gensim. See how they compare to each other.

A Closer Look at Each Library

NLTK – Natural Language Toolkit

https://www.nltk.org

License: Apache 2.0

Commercial use: Yes*

Purpose: Toolkit for human text analysis.

* NLTK itself is Apache-licensed, but some of the bundled corpora are restricted to non-commercial use.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

Your NLP initiation. Source: NLTK
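To give a feel for the API, here is a minimal sketch of tokenization and part-of-speech tagging with NLTK (it assumes the punkt and averaged_perceptron_tagger resources are downloaded first):

```python
import nltk

# One-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK is a wonderful tool for teaching computational linguistics."

tokens = nltk.word_tokenize(text)   # split the text into word tokens
tagged = nltk.pos_tag(tokens)       # attach part-of-speech tags

print(tagged[:4])   # e.g. [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('wonderful', 'JJ')]
```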

Pros

  • Great for learning (and teaching) core principles
  • Suitable for testing various algorithms
  • Large corpora in several languages
  • Other tools / libraries use NLTK internally

Cons

  • Its legacy makes it slow and tricky to use compared to today's industry standards
  • Steep learning curve, but there are plenty of tutorials and, of course, the one and only handbook!

spaCy

https://spacy.io

License: MIT

Commercial use: Yes

Purpose: Opinionated NLP framework – “Ruby on Rails for NLP”.

spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it.

Getting things done with spaCy. Source: spaCy
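A minimal sketch of spaCy in action – one call to the pipeline gives you tokens, part-of-speech tags and named entities (assumes the small English model en_core_web_sm is installed):

```python
import spacy

# Assumes the model was installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy was created in Berlin and is used by many companies.")

# Tokens with part-of-speech and dependency labels
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities recognized in the same pass
print([(ent.text, ent.label_) for ent in doc.ents])
```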

Pros

  • Designed for busy developers
  • Performant, suitable for large-scale text analysis
  • Integrates well with other libraries

Cons

  • Supports fewer human languages than other tools
  • Opinionated – meaning fewer options for tweaking algorithms (if you care to)

scikit-learn

https://scikit-learn.org

License: BSD

Commercial use: Yes

Purpose: Machine learning library used in NLP tools.

Simple and efficient tools for predictive data analysis. Open source, commercially usable.

I don’t play no games! Source: scikit-learn
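As an illustration of how scikit-learn typically shows up in NLP work, here is a tiny text classification sketch – TF-IDF features feeding a Naive Bayes classifier (the toy corpus and labels are made up for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus with two topics
texts = [
    "the match ended in a draw",
    "the striker scored twice",
    "parliament passed the new bill",
    "the senate debated the law",
]
labels = ["sport", "sport", "politics", "politics"]

# TF-IDF features feeding a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["the goalkeeper saved a penalty"]))   # -> ['sport']
```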

Pros

  • Versatile, range of models and algorithms
  • Solid foundations, built on SciPy and NumPy
  • Well documented, proven track record of real-life applications

Cons

  • Limited support for deep learning
  • Tricky to use for complex pipelines

gensim

https://radimrehurek.com/gensim

License: LGPLv2

Commercial use: Yes

Purpose: Performant library for finding similarities in documents.

By now, Gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.

Embattled and refined. Source: The author
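A minimal document-similarity sketch in the spirit of the gensim tutorials – build a bag-of-words corpus, weight it with TF-IDF and query a similarity index (the three documents are toy examples):

```python
from gensim import corpora, models, similarities

documents = [
    "human machine interface for lab computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
]

# Tokenize and build a bag-of-words corpus
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# TF-IDF weighting plus a similarity index over the whole corpus
tfidf = models.TfidfModel(corpus)
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

query = dictionary.doc2bow("computer user survey".split())
print(list(index[tfidf[query]]))   # similarity of the query to each document
```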

Pros

  • Robust and scalable
  • Streamed processing of large documents
  • Built for a specific job, does it well

Cons

  • As a specialised tool, it lacks support for full-fledged NLP pipelines

TextBlob

https://textblob.readthedocs.io/en/dev

License: MIT

Commercial use: Yes

Purpose: Simplified text processing on top of NLTK.

TextBlob stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

Reuse, refine and simplify. Source: TextBlob
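A minimal sketch of TextBlob's one-object API (noun phrase extraction additionally requires the corpora installed via python -m textblob.download_corpora):

```python
from textblob import TextBlob

blob = TextBlob("TextBlob is amazingly simple to use. What great fun!")

print(blob.words)          # tokenized words
print(blob.noun_phrases)   # noun phrase extraction
print(blob.sentiment)      # Sentiment(polarity=..., subjectivity=...)
```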

Pros

  • Pragmatic and easy to use
  • Consistent API on top of disparate (underlying) libraries

Cons

  • Prone to the same limitations as its foundation, NLTK

Pattern

https://github.com/clips/pattern

License: BSD

Commercial use: Yes

Purpose: Web mining tool, includes a text analysis API.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser)

Web mining for Python. Source: Pattern
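A rough sketch of the text analysis side of Pattern – sentiment and shallow parsing from the pattern.en module (output formats may differ slightly between versions):

```python
from pattern.en import sentiment, parse

# Opinion mining: returns (polarity, subjectivity)
print(sentiment("The camera is great, but the battery life is disappointing."))

# Shallow parsing: part-of-speech and chunk tags as a tagged string
print(parse("The quick brown fox jumped over the lazy dog."))
```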

Pros

  • Designed for information mining from Twitter, Wikipedia, Google searches etc.
  • Excels at finding valuable insights while scraping the web – sentiment, superlatives, opinion mining etc.
  • Hands-on, great documentation with examples

Cons

  • Being a specialised tool, it lacks support for some NLP pipelines

Polyglot

https://github.com/aboSamoor/polyglot

License: GPLv3

Commercial use: Yes

Purpose: Basic NLP pipeline for a large number of human languages.

Supports massive multilingual applications … Tokenization (165 Languages) … Language detection (196 Languages) …

True globetrotter. Source: The author
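A minimal sketch of Polyglot's pipeline; it assumes the per-language models have been fetched beforehand with the polyglot download command:

```python
from polyglot.text import Text

# Language detection, tokenization and (with the right models) NER
text = Text("Bonjour, le monde entier vous salue.")

print(text.language.code)   # detected language, e.g. 'fr'
print(text.words)           # tokens
print(text.entities)        # named entities (requires the NER model for the language)
```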

Pros

  • Your usual NLP pipeline with an important distinction – it's multilingual (close to 200 human languages for some tasks)
  • Accurate and performant, built on top of NumPy

Cons

  • Smaller community compared to other general-purpose libraries (NLTK, spaCy, ...)

CoreNLP

https://stanfordnlp.github.io/CoreNLP

License: GPLv3

Commercial use: Yes*

Purpose: Feature-rich NLP library, pre-trained models for sentiment analysis.

* Stanford provides paid licenses for commercial use of CoreNLP and Stanza.

One stop shop for natural language processing in Java!

Yay! Finally something that runs on the JVM .. Source: Stanford NLP
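Since there is a Python wrapper, here is a rough sketch using the CoreNLPClient shipped with Stanza; it assumes CoreNLP is installed locally and the CORENLP_HOME environment variable points at it:

```python
from stanza.server import CoreNLPClient

text = "Stanford University is located in California."

# Starts a local CoreNLP server, annotates the text, then shuts the server down
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.ner)
```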

Pros

  • A rare opportunity to use NLP in a language other than Python
  • Reliable and proven, used both in academia and commercially

Cons

  • Slow when compared to spaCy
  • Isolated, because everyone else speaks Python. In fact, there is a Python wrapper for CoreNLP

OpenNLP

https://opennlp.apache.org

License: Apache 2.0

Commercial use: Yes

Purpose: Standard NLP pipeline, similar to NLTK.

.. a machine learning based toolkit for the processing of natural language text.

NLTK in Java. Source: Apache OpenNLP

Pros

  • Focuses on elementary NLP tasks and does them well: tokenization, sentence detection or even entity recognition
  • Feature rich tool for model training

Cons

  • Lacks advanced features; a transition to CoreNLP is the next logical step if you want to stick to the JVM

PyTorch

https://pytorch.org

License: BSD

Commercial use: Yes

Purpose: Machine learning framework suitable for NLP thanks to a vast ecosystem.

An optimized tensor library for deep learning using GPUs and CPUs.

Go deep and enjoy a rich ecosystem. Source: PyTorch
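To show why PyTorch is a natural fit for NLP, here is a deliberately tiny text classifier – embed token ids, average the embeddings and project to two classes (vocabulary size and dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class BagOfEmbeddings(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)    # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)           # average over the sequence
        return self.classifier(pooled)          # (batch, num_classes)

model = BagOfEmbeddings()
dummy_batch = torch.randint(0, 1000, (4, 12))   # 4 "sentences" of 12 token ids
print(model(dummy_batch).shape)                 # torch.Size([4, 2])
```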

Pros

  • Robust framework, rich in tooling
  • Cloud platform and ecosystem make it suitable for production use

Cons

  • A general machine learning toolkit – using it for NLP requires in-depth knowledge of core NLP algorithms

AllenNLP

https://github.com/allenai/allennlp

License: Apache 2.0

Commercial use: Yes

Purpose: Deep learning and high-quality models for NLP.

.. design and evaluate new deep learning models for nearly any NLP problem.

State of the art models. Source: AllenNLP
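A rough sketch of how a pre-trained AllenNLP model is used through the Predictor interface; the archive path below is a placeholder for one of the published model files, and the keyword arguments of predict() depend on the chosen model:

```python
from allennlp.predictors.predictor import Predictor

# "path/to/model.tar.gz" is a placeholder: use a local path or a URL
# to one of the pre-trained model archives published by AllenNLP.
predictor = Predictor.from_path("path/to/model.tar.gz")

result = predictor.predict(sentence="AllenNLP makes prototyping painless.")
print(result.keys())   # the output fields depend on the model
```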

Pros

  • Great for exploration and prototyping using state of the art models
  • Built on top of PyTorch
  • Used both in academia and commercially

Cons

  • Not suitable for large-scale projects running in production

PyNLPl

https://github.com/proycon/pynlpl

License: GPLv3

Commercial use: Yes

Purpose: Library for various NLP tasks, extensive support for linguistic annotations.

‘Pineapple’ contains various modules useful for common, and less common, NLP tasks.

Produce models compatible with other NLP tools. Source: PyNLPl

Pros

  • Suitable for extraction of n-grams, frequency lists and other basic tasks
  • Modular structure

Cons

  • Limited documentation and the project seems to be stalled

Stanza

https://stanfordnlp.github.io/stanza

License: Apache 2.0

Commercial use: Yes*

Purpose: Toolkit for accurate text analysis and efficient model training.

* Stanford provides paid licenses for commercial use of CoreNLP and Stanza.

A collection of accurate and efficient tools for many human languages in one place.

More than just a Python wrapper around CoreNLP. Source: Stanford NLP
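A minimal sketch of a Stanza pipeline – the English models are downloaded once, then a neural pipeline handles tokenization, tagging, lemmatization and NER:

```python
import stanza

# One-time download of the English models
stanza.download("en")

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,ner")
doc = nlp("Stanza was released by the Stanford NLP Group in 2020.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos, word.lemma)

print(doc.ents)   # named entities
```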

Pros

  • Goes beyond the basic NLP tasks and provides high accuracy
  • Performant, supports GPU processing
  • Brings CoreNLP to the Python world

Cons

  • Evolving, the community is yet to grow.

Quepy

https://github.com/machinalis/quepy

License: custom

Commercial use: Yes*

Purpose: Transforms questions in plain English into database queries.

* Use of Quepy requires attribution to its authors.

With little coding you can build your own system for natural language access to your database.

From human queries to SQL(like) .. Source: Quepy
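A rough sketch based on Quepy's own examples; it is a Python 2 era project, and "dbpedia" below refers to the demo application bundled with Quepy that maps questions to SPARQL:

```python
import quepy

# "dbpedia" is the example application shipped with Quepy
dbpedia = quepy.install("dbpedia")

target, query, metadata = dbpedia.get_query("What is a blowtorch?")
print(query)   # the generated SPARQL query
```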

Pros

  • Unique and pragmatic
  • Powerful through the support of SPARQL, e.g. queries across multiple data sources

Cons

  • The project seems to be stalled

textaCy

https://github.com/chartbeat-labs/textacy

License: Apache 2.0

Commercial use: Yes

Purpose: Adds features on top of spaCy – readability tests, text statistics etc.

With the fundamentals delegated to another library, textacy focuses primarily on the tasks that come before and follow after.

Beyond the obvious. Source: textaCy
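A rough sketch of layering textaCy's statistics on top of a spaCy doc; the exact attribute names of TextStats vary between textacy versions, so treat this as illustrative:

```python
import spacy
import textacy.text_stats

# Build a spaCy doc first, then compute extra statistics with textacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Textacy adds readability scores and other statistics on top of spaCy.")

stats = textacy.text_stats.TextStats(doc)
print(stats.n_words, stats.flesch_reading_ease)   # attribute names may differ by version
```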

Pros

  • Built on top of spaCy, follows the same philosophy of easy to use API
  • Additional features that are hard to get otherwise: document similarity by a variety of metrics, text stats etc.

Cons

  • Still evolving, limited documentation
