15 Natural Language Processing Libraries Worth a Try
With the rise of machine learning, NLP has become accessible to a wider developer community. This post gives an overview of 15 libraries worth a try in 2020.
The list is vaguely sorted by popularity and adoption in academia or industry.
- NLTK – Toolkit for human text analysis.
- spaCy – Opinionated NLP framework, “Ruby on Rails for NLP”.
- scikit-learn – Machine learning library used in NLP tools.
- gensim – Performant library for finding similarities in documents.
- TextBlob – Simplified text processing on top of NLTK.
- Pattern – Web mining tool, includes a text analysis API.
- Polyglot – Basic NLP pipeline for a large number of human languages.
- CoreNLP – Feature-rich NLP library, pre-trained models for sentiment analysis.
- OpenNLP – Standard NLP pipeline, similar to NLTK.
- PyTorch – Machine learning framework suitable for NLP thanks to a vast ecosystem.
- AllenNLP – Deep learning and high-quality models for NLP.
- PyNLPl – Library for various NLP tasks, extensive support for linguistic annotations.
- Stanza – Toolkit for accurate text analysis and efficient model training.
- Quepy – Transforms questions in plain English into database queries.
- textaCy – Adds features on top of spaCy – readability tests, text statistics, etc.
*NLTK itself is licensed for non-commercial use, but commercial licenses are available for some corpora.
**Stanford provides paid licenses for commercial use of CoreNLP and Stanza.
***Use of Quepy requires attribution to its authors.
I tried text summarization with NLTK, spaCy and gensim. See how they compare to each other.
Detailed Overview
NLTK – Natural Language Toolkit
License: Apache 2.0
Commercial use: No*
Purpose: Toolkit for human text analysis.
* NLTK itself is licensed for non-commercial use, but commercial licenses are available for some corpora.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
Your NLP initiation. Source: NLTK
Pros
- Great for learning (and teaching) of core principles
- Suitable for testing of various algorithms
- Large corpora in several languages
- Other tools / libraries use NLTK internally
Cons
- Legacy design makes it slow and tricky to use compared to today’s industry standards
- Steep learning curve, though there are plenty of tutorials and, of course, the one and only handbook!
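To give a flavour of those core principles, here is a minimal sketch using two pieces that ship with NLTK itself, so no corpus downloads are needed; the sentence is an arbitrary example.

```python
# Tokenization and stemming with NLTK's bundled components.
from nltk.stem import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

tokens = TreebankWordTokenizer().tokenize("Cats are running wildly.")
stems = [PorterStemmer().stem(t) for t in tokens]

print(tokens)  # ['Cats', 'are', 'running', 'wildly', '.']
print(stems)   # lower-cased stems, e.g. 'running' -> 'run'
```

Many richer features (POS tagging, the bundled corpora) require a one-off `nltk.download(...)` first.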
spaCy
License: MIT
Commercial use: Yes
Purpose: Opinionated NLP framework – “Ruby on Rails for NLP”.
spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it.
Getting things done with spaCy. Source: spaCy
Pros
- Designed for busy developers
- Performant, suitable for large-scale text analysis
- Integrates well with other libraries
Cons
- Supports fewer human languages compared to other tools
- Opinionated – meaning fewer options for tweaking algorithms (if you care to)
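A quick taste of the API, sketched with a blank English pipeline so no model download is required; for POS tags, parses and entities you would load a pre-trained model such as `en_core_web_sm` instead.

```python
import spacy

# spacy.blank("en") gives a tokenizer-only pipeline with no download.
# For the full pipeline:
#   python -m spacy download en_core_web_sm
#   nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
doc = nlp("spaCy respects your time.")

print([token.text for token in doc])  # ['spaCy', 'respects', 'your', 'time', '.']
```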
scikit-learn
License: BSD
Commercial use: Yes
Purpose: Machine learning library used in NLP tools.
Simple and efficient tools for predictive data analysis. Open source, commercially usable.
I don’t play no games! Source: scikit-learn
Pros
- Versatile, range of models and algorithms
- Solid foundations, built on SciPy and NumPy
- Well documented, with a proven track record of real-life applications
Cons
- Limited support for deep learning and neural networks
- Tricky to use for complex pipelines
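As an illustration of the kind of NLP workflow scikit-learn is used for, here is a sketch of a bag-of-words sentiment classifier; the tiny training set is made up purely for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled data, invented for illustration.
texts = ["great product, loved it", "awful, waste of money",
         "really loved the quality", "money wasted, awful support"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words features feeding a Naive Bayes classifier, as one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["loved the great support"]))  # ['pos']
```

The `Pipeline` abstraction is also where the “tricky for complex pipelines” criticism bites: chaining many custom transformers quickly gets verbose.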
gensim
https://radimrehurek.com/gensim
License: LGPLv2
Commercial use: Yes
Purpose: Performant library for finding similarities in documents.
By now, Gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.
Battle-tested and refined. Source: The author
Pros
- Robust and scalable
- Streamed processing of large documents
- Built for a specific job, does it well
Cons
- As a specialised tool it lacks support for full-fledged NLP pipelines
TextBlob
https://textblob.readthedocs.io/en/dev
License: MIT
Commercial use: Yes
Purpose: Simplified text processing on top of NLTK.
TextBlob stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
Reuse, refine and simplify. Source: TextBlob
Pros
- Pragmatic and easy to use
- Consistent API on top of disparate (underlying) libraries
Cons
- Prone to the same limitations as its foundation, NLTK
Pattern
https://github.com/clips/pattern
License: BSD
Commercial use: Yes
Purpose: Web mining tool, includes text analysis API.
It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser)
Web mining for Python. Source: Pattern
Pros
- Designed for information mining from Twitter, Wikipedia, Google searches etc.
- Excels at finding valuable insights while scraping the web – sentiment, superlatives, opinion mining etc.
- Hands-on, great documentation with examples
Cons
- As a specialised tool, it lacks support for some standard NLP pipeline tasks
Polyglot
https://github.com/aboSamoor/polyglot
License: GPLv3
Commercial use: Yes
Purpose: Basic NLP pipeline on a large number of human languages.
Supports massive multilingual applications … Tokenization (165 Languages) … Language detection (196 Languages) …
True globetrotter. Source: The author
Pros
- Your usual NLP pipeline with an important distinction – it’s multilingual (close to 200 human languages for some tasks)
- Accurate and performant, built on top of NumPy
Cons
- Smaller community compared to other general-purpose libraries (NLTK, spaCy, …)
CoreNLP
https://stanfordnlp.github.io/CoreNLP
License: GPLv3
Commercial use: Yes*
Purpose: Feature-rich NLP library, pre-trained models for sentiment analysis.
* Stanford provides paid licenses for commercial use of CoreNLP and Stanza.
One stop shop for natural language processing in Java!
Yay! Finally something that runs on the JVM… Source: Stanford NLP
Pros
- A rare opportunity to do NLP in a language other than Python
- Reliable and proven, used both in academia and commercially
Cons
- Slow when compared to spaCy
- Somewhat isolated, since most of the ecosystem speaks Python – in fact, there is a Python wrapper for CoreNLP
OpenNLP
License: Apache 2.0
Commercial use: Yes
Purpose: Standard NLP pipeline, similar to NLTK.
… a machine learning-based toolkit for the processing of natural language text.
NLTK in Java. Source: Apache OpenNLP
Pros
- Focuses on elementary NLP tasks and does them well: tokenization, sentence detection or even entity recognition
- Feature-rich tool for model training
Cons
- Lacks advanced features; a transition to CoreNLP is the next logical step if you want to stick to the JVM
PyTorch
License: BSD
Commercial use: Yes
Purpose: Machine learning framework suitable for NLP thanks to a vast ecosystem.
An optimized tensor library for deep learning using GPUs and CPUs.
Go deep and enjoy a rich ecosystem. Source: PyTorch
Pros
- Robust framework, rich in tooling
- Cloud platform and ecosystem make it suitable for production use
Cons
- A general machine learning toolkit; using it for NLP requires in-depth knowledge of core NLP algorithms
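To illustrate that last point: even a minimal text classifier in raw PyTorch means assembling embeddings, a recurrent layer and a classification head yourself. All sizes below are arbitrary illustration values.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Token ids -> embeddings -> LSTM -> class logits."""
    def __init__(self, vocab_size=100, embed_dim=16, hidden=32, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, token_ids):
        _, (h, _) = self.rnn(self.embed(token_ids))
        return self.out(h[-1])  # logits from the final hidden state

# A batch of 4 random "sentences" of 7 token ids each.
logits = TinyClassifier()(torch.randint(0, 100, (4, 7)))
print(logits.shape)  # torch.Size([4, 2])
```

Tokenization, padding, vocabularies and training loops are all still on you – which is exactly why libraries like AllenNLP build on top of PyTorch.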
AllenNLP
https://github.com/allenai/allennlp
License: Apache 2.0
Commercial use: Yes
Purpose: Deep learning and high-quality models for NLP.
… design and evaluate new deep learning models for nearly any NLP problem.
State of the art models. Source: AllenNLP
Pros
- Great for exploration and prototyping using state of the art models
- Built on top of PyTorch
- Used both in academia and commercially
Cons
- Not suitable for large scale projects running in production
PyNLPl
https://github.com/proycon/pynlpl
License: GPLv3
Commercial use: Yes
Purpose: Library for various NLP tasks, extensive support for linguistic annotations.
‘Pineapple’ contains various modules useful for common, and less common, NLP tasks.
Produce models compatible with other NLP tools. Source: PyNLPl
Pros
- Suitable for extraction of n-grams, frequency lists and other basic tasks
- Modular structure
Cons
- Limited documentation and the project seems to be stalled
Stanza
https://stanfordnlp.github.io/stanza
License: Apache 2.0
Commercial use: Yes*
Purpose: Toolkit for accurate text analysis and efficient model training.
* Stanford provides paid licenses for commercial use of CoreNLP and Stanza.
A collection of accurate and efficient tools for many human languages in one place.
More than just a Python wrapper around CoreNLP. Source: Stanford NLP
Pros
- Goes beyond the basic NLP tasks and provides high accuracy
- Performant, supports GPU processing
- Brings CoreNLP to the Python world
Cons
- Still evolving; the community has yet to grow
Quepy
https://github.com/machinalis/quepy
License: custom
Commercial use: Yes*
Purpose: Transforms questions in plain English into database queries.
* Use of Quepy requires attribution to its authors.
With little coding you can build your own system for natural language access to your database.
From human queries to SQL(-like)… Source: Quepy
Pros
- Unique and pragmatic
- Powerful thanks to SPARQL support, e.g. queries across multiple data sources
Cons
- The project seems to be stalled
textaCy
https://github.com/chartbeat-labs/textacy
License: Apache 2.0
Commercial use: Yes
Purpose: Adds features on top of spaCy – readability tests, text statistics, etc.
With the fundamentals delegated to another library, textacy focuses primarily on the tasks that come before and follow after.
Beyond the obvious. Source: textaCy
Pros
- Built on top of spaCy, follows the same philosophy of easy to use API
- Additional features that are hard to get otherwise: document similarity by a variety of metrics, text stats etc.
Cons
- Still evolving, limited documentation