Analyse Financial Tweets with Stanford CoreNLP – Part 1
Most of what I’ve done in the NLP space has, for obvious reasons, been coded up in Python. In my day job, however, I primarily work on the JVM. Wouldn’t it be cool to build NLP / machine learning applications in, say, Scala? Well, Stanford CoreNLP has a good reputation and is written in Java – not a bad start.
There is a lot to unpack, and I will delve into the details as we go through this series. In this introductory post I give a bird’s eye view of what CoreNLP is and how it can be useful.
CoreNLP in a Nutshell
CoreNLP is a library for extracting essential linguistic features from a piece of text. It’s a project by a renowned group of researchers at Stanford and, as such, is fairly popular within the NLP community.
The library is written in Java, which is of particular interest to anyone who appreciates the advantages of strongly typed languages, myself included. In the course of this tutorial I will diverge from Java in favour of Scala. For now, let’s just assume we stick to the JVM and enjoy the benefits it has to offer.
Key Concepts
Here is how CoreNLP works in a nutshell.
CoreNLP makes use of linguistic annotations: raw text is enriched with structured information such as part-of-speech tags, parse trees, coreference chains and named entities.
CoreNLP chains individual analytical steps into pipelines. That’s not dissimilar from how other libraries do it. The exact pipeline is determined by configuration. For example, “tokenize, ssplit, parse” means that the raw text will be tokenised, split into sentences and syntactically parsed. The order matters, as annotators (processing steps) further down the line depend on the output of their predecessors. See the full list of annotators and the dependencies between them.
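To make this concrete, here is a minimal Scala sketch of building and running such a pipeline. The tweet-like input string is invented for illustration; `StanfordCoreNLP` and `CoreDocument` come straight from the CoreNLP API.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}

object PipelineDemo extends App {
  // Configure the pipeline; order matters, as parse depends on
  // the output of tokenize and ssplit.
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, parse")
  val pipeline = new StanfordCoreNLP(props)

  // An invented tweet-like input, just for illustration.
  val doc = new CoreDocument("Markets rallied today. $ACME closed up 5%.")
  pipeline.annotate(doc)

  // Inspect the annotations: tokens and a constituency parse per sentence.
  for (sentence <- doc.sentences().asScala) {
    println(sentence.tokens().asScala.map(_.word()).mkString(" "))
    println(sentence.constituencyParse())
  }
}
```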
A parsed document provides access to all annotations and can be efficiently serialised as a Google Protocol Buffers object.
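As a rough sketch (the file name is invented, and the code continues inside the `PipelineDemo` app from the previous snippet), round-tripping an annotated document through Protocol Buffers could look like this, using CoreNLP’s `ProtobufAnnotationSerializer`:

```scala
import java.io.{FileInputStream, FileOutputStream}
import edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

// Reuses the annotated `doc` (a CoreDocument) from the snippet above.
val serializer = new ProtobufAnnotationSerializer()

// Write the underlying Annotation out as a protocol buffer.
val out = new FileOutputStream("tweet.proto")
serializer.write(doc.annotation(), out).close()

// Read it back; read() returns a pair of (Annotation, remaining stream).
val in = new FileInputStream("tweet.proto")
val restored = serializer.read(in).first()
in.close()
```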
The API provides convenience wrappers with the following guarantees:
- Lazy computation – transformations can be queued up before the pipeline actually executes.
- Fast and robust serialisation via Google Protocol Buffers
- Thread safety
- Optional over null – together with lazy computation, the use of Optional guarantees that a function always returns a value.
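These guarantees come with the Simple CoreNLP wrapper (`edu.stanford.nlp.simple`). A minimal sketch, again with an invented example sentence:

```scala
import scala.jdk.CollectionConverters._
import edu.stanford.nlp.simple.Sentence

object SimpleDemo extends App {
  // Constructing a Sentence is cheap: no annotator runs until first access.
  val sentence = new Sentence("Shares of $ACME rallied after strong earnings.")

  // Each accessor lazily triggers exactly the annotators it depends on
  // and, per the guarantees above, never returns null.
  val words = sentence.words().asScala
  val tags  = sentence.posTags().asScala
  println(words.zip(tags).mkString(", "))
}
```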
Comparison with Popular NLP Libraries
Let’s look at licensing and high-level features. One detail worth calling out: CoreNLP is GPL-licensed, so using it inside a proprietary commercial product requires purchasing a separate licence from Stanford. On the other hand, CoreNLP provides pre-trained models for sentiment analysis (general models only!). This makes for an easy start in the upcoming parts of this tutorial.
Feature | CoreNLP | OpenNLP | NLTK | spaCy |
---|---|---|---|---|
API language | Java | Java | Python | Python |
License | GNU GPL | Apache 2.0 | Apache 2.0 | MIT |
Commercial use | Paid | Yes | Yes | Yes |
General pre-trained models | Yes | Yes | Yes | Yes |
Domain specific pre-trained models | No | No | No | Yes |
Pre-trained models for sentiment analysis | Yes | No | No | No |
Training on GPU | No | No | No | Yes |
No. of languages supported out of the box | 6 | 7 | 10+ | 10+ |
Resources:
- Comparing the Functionality of Open Source Natural Language Processing Libraries
- OpenNLP Language Models
- spaCy Models and Languages
- Devopedia – NLTK
A head-to-head comparison with spaCy yields mixed results when it comes to performance – spaCy’s tokeniser is an order of magnitude faster, but CoreNLP holds its own on tagging, parsing and NER accuracy:
Performance Metric | CoreNLP | spaCy |
---|---|---|
Tokenizer | 2 ms | 0.2 ms |
Tagger | 1 ms | 10 ms |
Parser | 19 ms | 49 ms |
NER Accuracy | 0.79 | 0.72 |
If you are looking for a Python alternative to CoreNLP, check out Stanza, released recently (March 2020) by the same Stanford NLP group. It reportedly outperforms spaCy on some of these performance metrics.
Summary
In this post, I gave you a ten-thousand-foot view of CoreNLP, Stanford’s NLP library for Java. I hope it helped you understand how it compares with other popular solutions, and that there are both benefits and drawbacks to using it. My next post will take a detailed look at sentiment analysis – specifically, at what useful metadata we can collect and how to arrive at the sentiment of a larger piece of text.
Thanks for reading. Did you find this post useful? Is there anything in particular you want to know? Please comment in the section below. Thank you and stay tuned for my next post.