Analyse Financial Tweets with Stanford CoreNLP – Part 1
Most of what I’ve done in the NLP space has, for obvious reasons, been coded up in Python. In my day job, however, I primarily work on the JVM. Wouldn’t it be cool to build NLP / machine learning applications in, say, Scala? Well, Stanford CoreNLP has a good reputation and is written in Java – not a bad start.
There is a lot to unpack, and I will delve into the details as we go through this series. In this introductory post I give a bird’s eye view of what CoreNLP is and how it can be useful.
CoreNLP in a Nutshell
CoreNLP is a library for extracting essential linguistic features from a piece of text. It’s a project by a renowned group of researchers at Stanford and, as such, is fairly popular within the NLP community.
The library is written in Java, which is of particular interest to anyone who appreciates the advantages of strongly typed languages, myself included. In the course of this tutorial I will diverge from Java in favour of Scala. For now, let’s just assume we stick to the JVM and enjoy the benefits it has to offer.
Key Concepts
Here is how CoreNLP works in a nutshell.
CoreNLP makes use of linguistic annotations: raw text is enriched with structured information such as part-of-speech tags, parse trees, coreference chains and named entities.
CoreNLP chains individual analytical steps into pipelines. That’s not dissimilar from how other libraries do it. The exact pipeline is determined by configuration. For example, “tokenize, ssplit, parse” means that the raw text will be tokenised, split into sentences and syntactically parsed. The order matters, as annotators (processing steps) further down the line depend on the output of their predecessors. See the full list of annotators and the dependencies between them.
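To make this concrete, here is a minimal Scala sketch of building and running such a pipeline. The tweet-like input string is invented for illustration; `StanfordCoreNLP` and `CoreDocument` come straight from the CoreNLP API.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}

object PipelineDemo extends App {
  // Configure the pipeline; order matters, as parse depends on
  // the output of tokenize and ssplit.
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, parse")
  val pipeline = new StanfordCoreNLP(props)

  // An invented tweet-like input, just for illustration.
  val doc = new CoreDocument("Markets rallied today. $ACME closed up 5%.")
  pipeline.annotate(doc)

  // Inspect the annotations: tokens and a constituency parse per sentence.
  for (sentence <- doc.sentences().asScala) {
    println(sentence.tokens().asScala.map(_.word()).mkString(" "))
    println(sentence.constituencyParse())
  }
}
```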
A parsed document provides access to all annotations and can be efficiently serialised as a Google Protocol Buffers object.
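As a rough sketch (the file name is invented, and the code continues inside the `PipelineDemo` app from the previous snippet), round-tripping an annotated document through Protocol Buffers could look like this, using CoreNLP’s `ProtobufAnnotationSerializer`:

```scala
import java.io.{FileInputStream, FileOutputStream}
import edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

// Reuses the annotated `doc` (a CoreDocument) from the snippet above.
val serializer = new ProtobufAnnotationSerializer()

// Write the underlying Annotation out as a protocol buffer.
val out = new FileOutputStream("tweet.proto")
serializer.write(doc.annotation(), out).close()

// Read it back; read() returns a pair of (Annotation, remaining stream).
val in = new FileInputStream("tweet.proto")
val restored = serializer.read(in).first()
in.close()
```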
The API provides convenience wrappers with the following guarantees:
- Lazy computation – transformations can be queued up before the pipeline actually executes.
- Fast and robust serialisation via Google Protocol Buffers
- Thread safety
- Optional over null – together with lazy computation, the use of Optional guarantees that a function always returns a value.
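These guarantees come with the Simple CoreNLP wrapper (`edu.stanford.nlp.simple`). A minimal sketch, again with an invented example sentence:

```scala
import scala.jdk.CollectionConverters._
import edu.stanford.nlp.simple.Sentence

object SimpleDemo extends App {
  // Constructing a Sentence is cheap: no annotator runs until first access.
  val sentence = new Sentence("Shares of $ACME rallied after strong earnings.")

  // Each accessor lazily triggers exactly the annotators it depends on
  // and, per the guarantees above, never returns null.
  val words = sentence.words().asScala
  val tags  = sentence.posTags().asScala
  println(words.zip(tags).mkString(", "))
}
```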
Comparison with Popular NLP Libraries
Let’s look at licensing and high-level features. One detail worth calling out: CoreNLP is GPL-licensed, so using it inside a proprietary commercial product requires purchasing a separate licence from Stanford. On the other hand, CoreNLP provides pre-trained models for sentiment analysis (general models only!). This makes for an easy start in the upcoming parts of this tutorial.
Feature | CoreNLP | OpenNLP | NLTK | spaCy |
---|---|---|---|---|
API language | Java | Java | Python | Python |
License | GNU GPL | Apache 2.0 | Apache 2.0 | MIT |
Commercial use | Paid | Yes | Yes | Yes |
General pre-trained models | Yes | Yes | Yes | Yes |
Domain specific pre-trained models | No | No | No | Yes |
Pre-trained models for sentiment analysis | Yes | No | No | No |
Training on GPU | No | No | No | Yes |
No. of languages supported out of the box | 6 | 7 | 10+ | 10+ |
Resources:
- Comparing the Functionality of Open Source Natural Language Processing Libraries
- OpenNLP Language Models
- spaCy Models and Languages
- Devopedia – NLTK
A head-to-head comparison with spaCy yields mixed results when it comes to performance – spaCy’s tokeniser is an order of magnitude faster, but CoreNLP holds its own on tagging, parsing and NER accuracy:
Performance Metric | CoreNLP | spaCy |
---|---|---|
Tokenizer | 2 ms | 0.2 ms |
Tagger | 1 ms | 10 ms |
Parser | 19 ms | 49 ms |
NER Accuracy | 0.79 | 0.72 |
If you are looking for a Python alternative to CoreNLP, check out Stanza, released recently (March 2020) by the same Stanford NLP group. It reportedly outperforms spaCy on some of these performance metrics.
Summary
In this post, I gave you a ten-thousand-foot view of CoreNLP, Stanford’s NLP library for Java. I hope it helped you understand how it compares with other popular solutions, and that there are both benefits and drawbacks to using it. My next post will take a detailed look at sentiment analysis – specifically, at what useful metadata we can collect and how to arrive at the sentiment of a larger piece of text.
Thanks for reading. Did you find this post useful? Is there anything in particular you want to know? Please comment in the section below. Thank you and stay tuned for my next post.