The automated analysis of textual data and its application in business analytics hold great promise for providing decision-makers with information from a seemingly endless stream of news available online. Recent advances in computing have led to exciting new tools in the areas of Natural Language Processing, Sentiment Analysis and Machine Learning that can be used to make sense of an ever-growing number of online sources. In this series of blog posts we will look at the basic concepts of sentiment analysis in general and in the context of cryptocurrency trading, offer a behind-the-scenes view of Derivative Lab's own efforts in building a sentiment engine, introduce a number of case studies and provide a glimpse into the future of machine learning and Artificial Intelligence (AI).
It is widely recognized that new information plays a key role in financial markets, affecting trading volumes, returns and price volatility; consequently, news has always been a key source of investment information.
While decision-makers have always drawn on a portfolio of varied news domains and sources, the growth of the Internet has caused the amount of readily available information to grow exponentially. As major news outlets bolster their online portfolios, ever more articles are published online; Bloomberg alone adds an estimated 1 million news stories a day.
Apart from news produced by these reputable sources, an increasing number of opinionated documents of interest are published online asynchronously, 24/7/365, on blogs, message boards and microblogs (e.g. Twitter or Stocktwits) by large and varied user communities.
The enormity and high variance of this data present an interesting opportunity: harnessing it into a form that allows for specific market predictions. In recent years this information has repeatedly been shown to influence markets dramatically, a phenomenon recently termed "collective intelligence". For a single person (or even a group of people), harnessing this information successfully is increasingly impossible due to its volume and asynchronous character. The need for automated collection, extraction, processing and aggregation of this data has long been recognised, and advances in machine learning techniques have led to exciting new tools for its analysis.
It is widely accepted that the automated extraction of useful information from text is a complex challenge. Apart from technical constraints, word-sense disambiguation remains one of the major challenges in the computerised processing of unstructured textual data. Words frequently change their meaning depending on context, consequently changing the meaning of the surrounding text. In fact, these structures are so complex that mastering them forms one of the latest stages of infant language acquisition, with most of us needing over 2.5 years to master even the simplest applications. For example, one would not consider the word "long" as either exceptionally positive or negative. However, most humans would rate "the laptop's start-up time was long" as negative, while "the laptop's battery life was long" would be considered positive.
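The "long" example above can be made concrete with a small sketch (the lexicons are purely illustrative): a word-level sentiment lexicon has no way to assign "long" a polarity, so both sentences receive the same score, whereas a phrase-aware lookup can capture the context.

```python
# Illustrative only: a naive word-level sentiment lexicon cannot
# distinguish context-dependent words such as "long".
LEXICON = {"long": 0.0}  # word-level polarity is undefined/neutral

def word_score(sentence):
    tokens = sentence.lower().replace("'", " ").split()
    return sum(LEXICON.get(t, 0.0) for t in tokens)

a = "the laptop's start-up time was long"   # humans: negative
b = "the laptop's battery life was long"    # humans: positive

# A bag-of-words scorer assigns both sentences the same score...
assert word_score(a) == word_score(b)

# ...whereas a (hypothetical) phrase-level lexicon captures the context.
PHRASES = {"start-up time was long": -1.0, "battery life was long": +1.0}

def phrase_score(sentence):
    return sum(p for phrase, p in PHRASES.items() if phrase in sentence.lower())

print(phrase_score(a), phrase_score(b))  # -1.0 1.0
```

Hand-written phrase lists of course do not scale; the point is only that polarity can live in the context rather than the word, which is exactly what makes automated sentiment analysis hard.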
Natural Language Processing (NLP), Sentiment Analysis and Machine Learning are widely recognised as the key tools for transforming the plethora of available text into meaningful information.
Sentiment Analysis (also known as Opinion Mining) seeks to identify and categorise opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic is positive, negative or neutral. It builds on NLP, originally referred to as computational linguistics, which at its most basic refers to the use of algorithms that allow computers to process and understand human languages.
While researched since the late 1940s, the field has grown rapidly over the last decade, driven by advances in computing and the explosion of text available online.
Over the last decade, various NLP algorithms have been developed, combined with Machine Learning techniques and applied in numerous (commercial) applications. For example, many popular spam filters include Naive Bayes classifiers trained on NLP-derived features to identify spam email. A similar approach can be used in Sentiment Analysis to classify text into positive, negative and neutral sentiment polarity. Although these models treat words as atomic units, with no notion of word similarity or text structure, researchers and practitioners argue that they frequently outperform more complex models at higher computational efficiency, and consequently are often the most practical choice.
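To make the Naive Bayes idea tangible, here is a minimal sketch of a bag-of-words polarity classifier with Laplace smoothing; the tiny toy corpus and its labels are purely illustrative, not real training data.

```python
# A minimal multinomial Naive Bayes sentiment classifier.
# Bag-of-words features: each word is treated as an atomic unit,
# exactly the simplification discussed above.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns priors, per-label word counts, vocab."""
    priors = Counter(label for _, label in docs)
    counts = defaultdict(Counter)          # label -> word frequency
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        counts[label].update(words)
        vocab.update(words)
    return priors, counts, vocab

def classify(text, priors, counts, vocab):
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / total)   # log prior
        n = sum(counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    ("prices rally on strong earnings", "positive"),
    ("record profit beats forecasts", "positive"),
    ("shares plunge after weak guidance", "negative"),
    ("losses widen as demand falls", "negative"),
]
model = train(docs)
print(classify("weak demand hits shares", *model))   # negative
```

The same scheme scales to real corpora simply by replacing the toy documents; in practice one would also add tokenisation, stop-word handling and n-gram features.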
As Machine Learning classification techniques require large quantities of relevant in-domain data for training, the highly varied and specialised topics in market news present a unique challenge. A recent approach, provided by Google's Word2Vec algorithm, has the potential to allow practitioners to overcome these limitations. It takes a text corpus as input and produces word vectors as output, constructing a vocabulary from a training data set and then learning vector representations of the contained words. The result captures many syntactic and semantic regularities, represented as vectors. For example, 'melancholy' would be closest to 'bittersweet' (sentiment) as opposed to 'thoughtful' and 'warm' (semantic). Properly employed, this approach can effectively aid in the capture of sentiment words not provided during training.
NLP, Sentiment Analysis and Machine Learning are heavily investigated fields, and substantial breakthroughs are to be expected over the next years. It should be noted that in academic settings, more complex algorithms, such as Recurrent Neural Network based language models, have already shown promising results. To date, however, their computational complexity does not permit their use in robust applications relying on near-real-time processing of information.