Process of real-time social news analysis


Every day, millions of news items are published, all of them written by humans in natural language. They contain information that can support decision making, algorithmic trading, and risk management. A single news item may trigger a cryptocurrency price surge, or a hidden trend may be buried in a huge pile of news data. Today this work is usually done by professional analysts, but the task becomes harder and harder as the flow of news grows. That is why we need automated, quantitative news analysis.

Let's first set aside how to automate the analysis and think about what a reader needs to learn from a piece of news. Normally, as a starting point, he or she tries to answer two questions:

1. What is the news about? For example, Bitcoin or Ethereum?

2. In general, is it good or bad?

Named Entity Recognition (NER) answers the first question. More specifically, it quickly determines which items in the text map to proper names, such as people or places. We need to go a step further and determine which cryptocurrency the news involves. We split the task into two parts:

1. Use popular community packages such as NLTK and Stanford NER to narrow down the search.

2. Search for the cryptocurrency name using our crypto synonym database.
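The second step in the list above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual implementation: the synonym table `CRYPTO_SYNONYMS`, its entries, and the function `find_cryptocurrencies` are hypothetical, and a simple regular-expression tokenizer stands in for the NLTK or Stanford NER pass that would normally narrow down the candidates first.

```python
import re

# Hypothetical synonym table: canonical asset name -> lowercase aliases.
# A real synonym database would be much larger and curated.
CRYPTO_SYNONYMS = {
    "Bitcoin": {"bitcoin", "btc", "xbt"},
    "Ethereum": {"ethereum", "eth", "ether"},
}

# Invert the table once so each token can be looked up directly.
ALIAS_TO_NAME = {
    alias: name
    for name, aliases in CRYPTO_SYNONYMS.items()
    for alias in aliases
}


def find_cryptocurrencies(text):
    """Return canonical asset names in order of first mention."""
    # Assumption: plain lowercase word tokens are good enough here;
    # an NER package would supply better candidate terms.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    for token in tokens:
        name = ALIAS_TO_NAME.get(token)
        if name is not None and name not in found:
            found.append(name)
    return found


print(find_cryptocurrencies("BTC rallies while Ether lags behind"))
```

Mapping every alias to one canonical name means the news item can later be filed under a single key, however the asset was spelled in the text.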

After Named Entity Recognition, the news item is filed under the identified cryptocurrency name for delivery or further analysis. Sometimes one news item mentions several cryptocurrencies. In that case, a relevance measure is computed. The relevance measure considers where a term appears in the text. For example, intuitively, a news item is likely more relevant to a cryptocurrency when the token's name occurs in the title.
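One way to turn the position-based idea into numbers is to weight a mention in the title more heavily than a mention in the body and then normalize across assets. The sketch below does exactly that under stated assumptions: the weights `TITLE_WEIGHT` and `BODY_WEIGHT`, the function `relevance_scores`, and the sample synonym table are all hypothetical and chosen only for illustration.

```python
import re

# Assumed weights: a title mention counts three times a body mention.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def _tokens(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance_scores(title, body, synonyms):
    """Return {asset name: share of weighted mentions}, summing to 1."""
    title_tokens = _tokens(title)
    body_tokens = _tokens(body)
    raw = {}
    for name, aliases in synonyms.items():
        title_hits = sum(1 for t in title_tokens if t in aliases)
        body_hits = sum(1 for t in body_tokens if t in aliases)
        score = TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits
        if score > 0:
            raw[name] = score
    total = sum(raw.values())
    # Assets that are never mentioned are left out of the result.
    return {name: score / total for name, score in raw.items()}


# Hypothetical synonym table, for illustration only.
synonyms = {
    "Bitcoin": {"bitcoin", "btc"},
    "Ethereum": {"ethereum", "eth"},
}
scores = relevance_scores(
    "Bitcoin hits new high",
    "Analysts say BTC demand is rising while ETH stays flat.",
    synonyms,
)
print(scores)
```

In this example Bitcoin is named once in the title and once in the body, while Ethereum appears only once in the body, so Bitcoin receives the larger share of the relevance.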

To tell whether a news item is good or bad for a cryptocurrency, a common approach is to search for words expressing emotional states, such as "angry", "sad", and "happy", and count their occurrences. In our case, we first compile a lexicon of such words specialized for the financial community. Next, we count all the words that appear in both the lexicon and the text. We then normalize the counts of positive and negative words into a score, where -100 means that all matched words are negative and 100 means that all are positive. These scores serve as a quantitative measure of sentiment that can be compared across companies and over time.
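The counting and normalization steps can be illustrated with a short Python sketch. The text does not give the exact normalization formula, so the one used here, 100 * (positive - negative) / (positive + negative), is an assumption chosen because it yields -100 when all matched words are negative and 100 when all are positive. The word lists and the function name `sentiment_score` are hypothetical placeholders for a real financial sentiment lexicon.

```python
import re

# Hypothetical mini-lexicon; a real one would be far larger and
# tuned to the vocabulary of the financial community.
POSITIVE_WORDS = {"surge", "rally", "gain", "bullish", "adoption"}
NEGATIVE_WORDS = {"crash", "hack", "loss", "bearish", "ban"}


def sentiment_score(text):
    """Return a score in [-100, 100] from lexicon word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    matched = positive + negative
    if matched == 0:
        # Assumption: text with no lexicon words is treated as neutral.
        return 0.0
    return 100.0 * (positive - negative) / matched


print(sentiment_score("Bitcoin rally continues after exchange hack and ban"))
```

Here one positive word ("rally") and two negative words ("hack", "ban") are matched, so the score is slightly negative, about -33.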

Finally, both Named Entity Recognition and sentiment scoring run on distributed computing clusters so that the results can be delivered and recorded in real time.
