Analyse Financial Tweets with Stanford CoreNLP – Part 3

So far I’ve looked at analysing tweets without paying much attention to data management. Today, I will show you how to organise tweets into a structured schema and store them in a database. This post is part of a series, so I encourage you to catch up with part 1 and part 2 if you’ve missed them.

The StockNet dataset tracks tweets and daily prices for 88 selected stocks between 2014 and 2016. Even though it’s historical data, the challenges described in this post apply just as much to an ingest pipeline running in real time.

StockNet Dataset

Stock movement prediction is a challenging problem: the market is highly stochastic, and we make temporally-dependent predictions from chaotic data.

Yumo Xu and Shay B. Cohen. 2018. Stock Movement Prediction from Tweets and Historical Prices.

The dataset comes with both raw and preprocessed data. I’ve chosen to work with the raw format in order to access a richer set of information.

  • Tweets come as JSON, following the standard tweet schema
  • Price records are stored as CSV and track the date, price changes and traded volume

Fetching Data

The data is available on GitHub, but the download isn’t straightforward – unless you are happy to clone the entire repository. In the end, I worked around the limitations by streaming and filtering the repository’s content. Here is a neat script that downloads raw tweets and prices and stores them in the data directory on your local file system.
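Something along these lines does the job: a minimal sketch that streams the repository tarball and extracts only the raw folders. The repository location (github.com/yumoxu/stocknet-dataset), the master branch and GNU tar’s --wildcards option are assumptions on my part.

    #!/usr/bin/env bash
    # Sketch: stream the repository tarball and keep only the raw tweets and prices.
    mkdir -p data
    curl -L https://github.com/yumoxu/stocknet-dataset/archive/refs/heads/master.tar.gz \
      | tar -xzf - -C data --strip-components=1 --wildcards \
            '*/tweet/raw/*' '*/price/raw/*'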

Tackling Granularity

Another challenge lies in data partitioning. Tweets are broken down into daily batches per stock.

To load all of the tweets into the database, we need to merge the per-stock daily data into a single file. That’s what the script below does. It takes a while to run, and when it completes you’ll find all daily tweets (of all stocks) in data/tweets.json.
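Here is a minimal sketch of such a merge, assuming the layout produced by the download step (data/tweet/raw/<SYMBOL>/<YYYY-MM-DD>, one JSON tweet per line); the original script may differ in detail.

    #!/usr/bin/env bash
    # Sketch: concatenate every per-stock, per-day tweet file into one JSON-lines file.
    : > data/tweets.json                    # start from an empty output file
    for day_file in data/tweet/raw/*/*; do
      cat "$day_file" >> data/tweets.json   # append one stock's tweets for one day
    done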

Prices, on the other hand, are easier to handle. There is only a single level of partitioning, by stock.
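If you prefer a single merged file for the bulk load as well, the same pattern works. A sketch, assuming one CSV per stock under data/price/raw/<SYMBOL>.csv with a header row:

    #!/usr/bin/env bash
    # Sketch: merge per-stock price CSVs, prefixing each row with the stock symbol
    # and dropping the per-file header.
    : > data/prices.csv
    for csv in data/price/raw/*.csv; do
      symbol=$(basename "$csv" .csv)
      tail -n +2 "$csv" | sed "s/^/${symbol},/" >> data/prices.csv
    done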

Defining the Schema

Now, with all raw data in place, we are in a position to think about designing the target MySQL database. First of all, let’s get the encoding right. Tweets are full of unusual characters that can easily turn into garbage if we naively go with MySQL’s “utf8” encoding.

MySQL’s “utf8” isn’t UTF-8. The “utf8” encoding only supports three bytes per character. The real UTF-8 encoding — which everybody uses, including you — needs up to four bytes per character.

Adam Hooper: Never use “utf8” in MySQL

Don’t use “utf8”: its design is flawed and it is only kept around for backwards compatibility. Choose “utf8mb4” instead.
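In practice that simply means declaring the character set up front when creating the database. A sketch (the database name stocknet is just a placeholder):

    # Create the target database with full four-byte UTF-8 support.
    mysql -u root -p <<'SQL'
    CREATE DATABASE stocknet
      CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;
    SQL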

Next, consider the actual schema. Of the many attributes a tweet can have, we only need a few.

  • Tweet’s unique identifier and a timestamp
  • Text and language
  • Popularity measured by the number of retweets and likes

In order to reason about changes in stock prices based on tweet sentiment, we need to capture the daily price spread along with some other information, namely:

  • Trading day and traded volume on that day
  • Opening and closing price
  • The lowest and highest price of the day

The resulting schema reflects all of these requirements.
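The post’s actual DDL may differ, but a schema covering the attributes above could look roughly like this (table and column names are illustrative):

    mysql -u root -p stocknet <<'SQL'
    -- Tweets: identifier, timestamp, text, language and popularity counters.
    CREATE TABLE tweets (
      id             BIGINT UNSIGNED PRIMARY KEY,
      created_at     DATETIME NOT NULL,
      text           VARCHAR(1024) NOT NULL,
      lang           VARCHAR(8),
      retweet_count  INT UNSIGNED DEFAULT 0,
      favorite_count INT UNSIGNED DEFAULT 0
    ) CHARACTER SET utf8mb4;

    -- Prices: one row per stock and trading day, with volume and the daily spread.
    CREATE TABLE prices (
      symbol      VARCHAR(16)  NOT NULL,
      trading_day DATE         NOT NULL,
      volume      BIGINT UNSIGNED,
      open_price  DECIMAL(12,4),
      close_price DECIMAL(12,4),
      low_price   DECIMAL(12,4),
      high_price  DECIMAL(12,4),
      PRIMARY KEY (symbol, trading_day)
    );
    SQL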

If you’re following along, now is a good time to install MySQL on your local machine. I use MySQL Workbench for all admin tasks and data queries.

Flattening JSON and Taming Date / Time

The last thing that remains is loading the data into our database. There are a few more outstanding issues, though.

  • Tweets are stored as JSON, whereas MySQL’s bulk import expects CSV
  • Dates and timestamps in the raw data are not in the format MySQL expects

The file format issue (JSON vs CSV) is easily solved by a utility called jq. Go ahead and install it if you don’t have it yet; it’s indispensable. I also recommend this excellent tutorial, which walks through many of the possible data transformations.

Flattening the JSON into CSV is straightforward.
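For instance, with the columns of the illustrative tweets table in mind (the field names assume Twitter’s standard tweet JSON; adjust them to whatever your raw data actually contains):

    # Turn each JSON tweet into a CSV row: id, timestamp, text, language, retweets, likes.
    # Tweets whose text contains literal newlines may need extra care.
    jq -r '[.id_str, .created_at, .text, .lang, .retweet_count, .favorite_count] | @csv' \
      data/tweets.json > data/tweets.csv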

Looking at the data reveals that timestamps require a bit of work.

MySQL expects dates in the format “YYYY-MM-DD hh:mm:ss”. That’s not entirely trivial: months, for example, have to be converted from their names to ordinal numbers. Also, for the sake of performance I don’t want to iterate over the file more than once. I’ve found a satisfactory solution using bash and sed. There is always more than one way to skin a cat, especially when it comes to dates and regular expressions, so feel free to let me know of a better solution in the comment section below. My solution relies on the convenience of chaining multiple transformations within a single sed command. Besides, a streaming editor surely is an appropriate tool for this kind of job.
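Here is a sketch of what such a chained sed command could look like. It assumes Twitter-style timestamps such as “Fri Dec 19 21:09:07 +0000 2014”, quoted as CSV fields by the flattening step above, with zero-padded days; the post’s actual regexes may well differ.

    # One sed invocation, one expression per month: each rewrites the whole quoted
    # timestamp into MySQL's "YYYY-MM-DD hh:mm:ss" format.
    sed -E \
      -e 's/"[A-Za-z]{3} Jan ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-01-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Feb ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-02-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Mar ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-03-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Apr ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-04-\1 \2"/' \
      -e 's/"[A-Za-z]{3} May ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-05-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Jun ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-06-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Jul ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-07-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Aug ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-08-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Sep ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-09-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Oct ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-10-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Nov ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-11-\1 \2"/' \
      -e 's/"[A-Za-z]{3} Dec ([0-9]{2}) ([0-9:]{8}) [+]0000 ([0-9]{4})"/"\3-12-\1 \2"/' \
      data/tweets.csv > data/tweets_mysql.csv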

After all the hassle with regexes, the data are ready to be imported into MySQL.
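With the illustrative schema from earlier, the bulk import could look like this (a sketch; LOAD DATA LOCAL INFILE requires local_infile to be enabled on both client and server). Prices load the same way into the prices table.

    # Sketch: bulk-load the flattened tweets into MySQL.
    mysql -u root -p --local-infile=1 stocknet <<'SQL'
    LOAD DATA LOCAL INFILE 'data/tweets_mysql.csv'
    INTO TABLE tweets
    CHARACTER SET utf8mb4
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (id, created_at, text, lang, retweet_count, favorite_count);
    SQL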

Final Thoughts

Data preparation is hardly ever fun. I hope you’ve found some of the advice and scripts helpful. One thing I don’t like is the number of tedious manual steps; it leaves room for human error. In my next post, I’ll automate most of the steps by building a data pipeline with Apache NiFi. Stay tuned and thanks for reading this far.
