In this blog post, I will show you how to mine opinions about companies from news articles. I will share how I scraped thousands of news articles in a few minutes and how the opinion expressed in their titles can be classified. This information could be used, for example, to keep an eye on a company's competitors or to predict global trends.
What components are needed?
The following components are needed:
- A method for scraping news.
- A method for scraping opinions (ratings, etcetera).
- A sentiment model that answers the following question: “What words/phrases are related to what sentiment?”
First, the news scraper is explained. Then the opinion scraper and the sentiment model are explained, and finally the combination is tested in the real world!
RSS webscraper for news headlines
I used Scrapy for scraping webpages and RSS feeds. The first step is to set up the project using Scrapy. This is done by executing the following code:
Then, move into the project folder (as explained in the message when setting up the project) and create a spider pointing at the domain to scrape the news from.
I ended up with the following RSS spider:
And the following item:
Then, I can execute the spider and export all headlines to a CSV file:
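To illustrate what this step produces, here is a minimal sketch that pulls the headline fields out of an RSS document and writes them as CSV. It deliberately uses only the Python standard library instead of Scrapy, so it is not the spider described above. The sample feed, the column names and the function names are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be downloaded from the news site.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item>
      <title>CompanyX boekt meer winst op dalende omzet</title>
      <link>https://example.com/a</link>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Storing netwerk CompanyX in noorden</title>
      <link>https://example.com/b</link>
      <pubDate>Mon, 01 Jan 2018 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def extract_headlines(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return rows


def to_csv(rows):
    """Serialize the headline dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "link", "published"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


headlines = extract_headlines(SAMPLE_RSS)
csv_text = to_csv(headlines)
print(csv_text)
```

A Scrapy spider does the same extraction per feed item, but also takes care of downloading, scheduling and exporting.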
HTML webscraper for Dutch opinions
Then, I created a web scraper to scrape Dutch opinions. This resulted in the following Python code:
And the following item:
I then executed the code and created a CSV file containing thousands of Dutch opinions (opinions.csv).
Training a sentiment model
After scraping all the data, the sentiment model can be built. I used Chainer to create the neural network. First, opinions.csv is loaded:
After loading the CSV, the data is preprocessed. The preprocessing converts the raw text to character identifiers. The rating (on a scale of 1 – 10) is normalized to the range 0.0 to 1.0 such that the model can work with it. This continuous variable is then replaced by its discrete counterpart, such that there are three classes: negative sentiment (< 0.25), positive sentiment (> 0.75) and neutral sentiment (between 0.25 and 0.75). Accents are also removed, as are characters which are not in the specified alphabet. This results in the following preprocessing code:
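The preprocessing steps described here can be sketched as follows. This is a minimal illustration under assumptions: the alphabet, the linear mapping of 1 to 0.0 and 10 to 1.0, the class numbering and all function names are chosen for the example and are not taken from the original code.

```python
import unicodedata

# Assumed alphabet: lowercase letters, digits, space and basic punctuation.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'-"


def normalize_rating(rating, low=1.0, high=10.0):
    """Map a rating on the 1-10 scale linearly onto 0.0-1.0 (assumed mapping)."""
    return (rating - low) / (high - low)


def rating_to_class(score):
    """Discretize a normalized score: 0 = negative, 1 = neutral, 2 = positive."""
    if score < 0.25:
        return 0
    if score > 0.75:
        return 2
    return 1


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text(text):
    """Lowercase, remove accents, and drop characters outside the alphabet."""
    text = strip_accents(text.lower())
    return "".join(ch for ch in text if ch in ALPHABET)


print(clean_text("Héél goed, écht!"))
print(rating_to_class(normalize_rating(9)))
```

Lowercasing before filtering keeps the alphabet small, which in turn keeps the character vocabulary of the model small.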
Then, the vocabulary (the alphabet) is defined. An “UNK” marker is added for unknown characters (which will in fact never occur due to the preprocessing). Then, the characters are converted to identifiers:
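A vocabulary with an “UNK” marker and the conversion from characters to identifiers could look roughly like this. The alphabet and the names `char_to_id`, `encode` and `decode` are assumptions for illustration.

```python
# Assumed alphabet; the original post does not list the exact characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'-"
UNK = "UNK"

# Identifier 0 is reserved for unknown characters.
vocabulary = [UNK] + list(ALPHABET)
char_to_id = {ch: i for i, ch in enumerate(vocabulary)}


def encode(text):
    """Convert a string to a list of character identifiers."""
    unk_id = char_to_id[UNK]
    return [char_to_id.get(ch, unk_id) for ch in text]


def decode(ids):
    """Convert character identifiers back to text (unknowns become 'UNK')."""
    return "".join(vocabulary[i] for i in ids)


print(encode("goed"))
print(decode(encode("goed")))
```

Because the preprocessing already removes every character outside the alphabet, the fallback to the UNK identifier only acts as a safety net.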
Creating the dataset
The next step is to create an iterator and a converter for the data. The iterator is a dataset object in Chainer. The dataset has a __len__() method, which returns the number of items in the dataset, and a get_example(i) method, which fetches the ith example. The converter is used to generate batches from the dataset iterator. This results in the following code:
This gives the following output:
As you can see, this is a batch consisting of 5 items. All the items have negative sentiment (< 0.25). The character identifiers per opinion are shown.
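The dataset and converter idea can be sketched without Chainer as follows. This is a plain-Python stand-in that only mimics the two methods mentioned above. The class name, the padding value of -1 and the helper `iterate_batches` are assumptions for illustration, not the original code.

```python
class OpinionDataset:
    """Plain-Python stand-in exposing __len__ and get_example like a Chainer dataset."""

    def __init__(self, encoded_texts, labels):
        if len(encoded_texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.encoded_texts = encoded_texts
        self.labels = labels

    def __len__(self):
        return len(self.encoded_texts)

    def get_example(self, i):
        # One example is a pair: (character identifiers, sentiment class).
        return self.encoded_texts[i], self.labels[i]


def convert(batch, pad_id=-1):
    """Pad the id lists of a batch to equal length and collect the labels."""
    max_len = max(len(ids) for ids, _ in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids, _ in batch]
    labels = [label for _, label in batch]
    return padded, labels


def iterate_batches(dataset, batch_size):
    """Yield converted batches in order; the last batch may be smaller."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield convert([dataset.get_example(i) for i in range(start, stop)])


dataset = OpinionDataset([[1, 2, 3], [4], [5, 6]], [0, 2, 1])
batches = list(iterate_batches(dataset, batch_size=2))
print(batches[0])
```

Padding is needed because opinions differ in length, while a batch is usually fed to the model as one rectangular array.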
The model code and training process
Now it is time to define the model and train it! This is done with the following Chainer code:
And now we can train it! The following loss plot was generated:
I ran it until there was no more unseen data available (so for 1 epoch). I could probably have gotten better results if I had collected more data. The noise in the loss could be reduced by using larger batch sizes.
Using the model
Now the model can compute the sentiment for a given sentence. It can also be used to predict the sentiment for a given news headline. The following sentiment is found using the model (on a sample of a few news headlines):
0.81 CompanyX neemt specialist in zakelijk CompanyY in zijn geheel over
0.76 CompanyX boekt meer winst op dalende omzet
0.72 Topman ondanks concurrentie 'uitermate tevreden' met jaarcijfers CompanyX
0.29 CompanyX haalt eigen doelstelling voor uitbreiding snel internet niet
0.23 Storing netwerk CompanyX in noorden houdt tot 3.00 uur 's nachts aan
Translated into English:
0.81 CompanyX acquires specialist for CompanyY in its entirety
0.76 CompanyX is more profitable on declining sales
0.72 Top executive 'extremely satisfied' with CompanyX annual figures despite competition
0.29 CompanyX does not achieve its own objective for expanding fast internet
0.23 Network malfunction of CompanyX in the north will last until 3 AM
As you can see, these scores match what one would expect (1.00 means maximum positive sentiment and 0.00 means maximum negative sentiment). However, there is a major drawback: the sentiment of a sentence does not tell you anything about the sentiment per entity. It would be interesting to classify the sentiment per entity.
With some effort, it is possible to detect sentiment in news articles (in any language). One improvement to the model would be to compute sentiment per entity, but that is left as future work.