Tweelyzer. An Approach to Sentiment Analysis of Tweets

Samariya, Durgesh

Communications - Multimedia, Internet, New Technologies

Summary

The ongoing trend of people using microblogging to express their thoughts on various topics has increased the need for computerised techniques for automatic sentiment analysis of texts that do not exceed 200 characters. Twitter is a "micro-blogging" social networking site with a large and rapidly growing user base. Tweets, the messages posted on Twitter, are limited to 140 characters. Because of this limit it is harder to express sentiment, and classifying tweets is correspondingly difficult. Sentiment analysis can be applied to two kinds of content: emotion and opinion. This research focuses entirely on sentiment analysis of opinions. These opinions can be divided into three classes: positive, negative and neutral (somewhere between positive and negative).
The main goal of this study is to build a model that predicts election trends and provides a sentiment score from Twitter messages (which cannot exceed 140 characters). In this project, the author applies a novel approach that automatically classifies the sentiment and emotions of tweets into positive, negative or neutral classes. First, tweets were retrieved from Twitter and converted into a dataset. After pre-processing the data, the proposed algorithm, named TWEELYZER, was applied to the dataset. Finally, the performance of TWEELYZER was measured in terms of accuracy and recall.
In this project, tweets about movies, brands, actors and actresses were collected from Twitter and then cleaned and analysed according to the proposed algorithm. The tweets were collected using R Studio. Pre-processing involved several steps, and analysing the pre-processed data in R Studio led to several insights.

Excerpt

IMPLEMENTATION

4.1 Creating Twitter Application

4.2 Working with R/RStudio

4.3 Connecting Twitter API to R

4.4 Saving Tweets in Local Drive

4.5 Cleaning Function

4.6 Sentiment Function of TWEELYZER

4.7 Scoring Tweets

4.8 Visualization of Tweets

4.9 Text Analysis

RESULT AND DISCUSSION

CONCLUSION AND FUTURE WORK

APPENDIX 1: KEYWORD LEXICON

REFERENCE

LIST OF TABLES

Table No  Title

4.1  Number of tweets fetched
4.2  Sample Emotion Set
4.3  Sample Positive Emotion set
4.4  Sample negative emotion set
4.5  Sample Positive word set
4.6  Sample Negative word set
4.7  Sample Positive emotion word set
4.8  Sample Negative Emotion word set
5.1  Sample Dataset
5.2  Example

LIST OF FIGURES

Figure No  Title

1.1  General Sentiment Analysis Process
3.1  System Architecture
3.2  Flowchart of Tweelyzer
3.3  Data Collection Process
3.4  Data Pre-Processing Process
3.5  Sample Tweets on Bihar Election
4.1  Link for create Twitter App
4.2  Already created Application List
4.3  Create app screen
4.4  Application creation form
4.5  Developer Agreement
4.6  Twitter Application Details
4.7  Twitter Application Keys/Token
4.8  R download page
4.9  R Console
4.10  RStudio Console
4.11  Twitter Connection
4.12  Emotion Category Visualization Demo
4.13  Wordcloud Demo
5.1  Freq vs word graph for dataset 1
5.2  Wordcloud for dataset 1
5.3  Emotion category graph Dataset 1
5.4  Sentiment Analysis for Dataset 1
5.5  Freq vs word graph for dataset 2
5.6  Wordcloud for dataset 2
5.7  Emotion category graph Dataset 2
5.8  Sentiment Analysis for Dataset 2
5.9  Freq vs word graph for dataset 3
5.10  Wordcloud for dataset 3
5.11  Emotion category graph Dataset 3
5.12  Sentiment Analysis for Dataset 3
5.13  Freq vs word graph for dataset 4
5.14  Wordcloud for dataset 4
5.15  Emotion category graph Dataset 4
5.16  Sentiment Analysis for Dataset 4
5.17  Precision, recall and F-measure of proposed algorithm
5.18  Classification accuracy comparison using different sentiment technique

LIST OF ABBREVIATIONS

ACRONYM  EXPANSION

SA  Sentiment Analysis
WWW  World Wide Web
ML  Machine Learning
SVM  Support Vector Machine
CRF  Conditional Random Field
Naïve Bayes
APP  Application
API  Application Program Interface

1. INTRODUCTION

For many years, surveys have been the way to answer the question "What do people think?". Nowadays many social networking sites, such as Twitter, LinkedIn, Facebook, Instagram, YouTube, Myspace and Google+, have gained great popularity. In the last few years Twitter in particular has become more popular day by day, and it can no longer be ignored. This thesis contributes to the field of Sentiment Analysis (SA), whose main aim is to extract emotions and opinions from text (here, tweets).

1.1 BACKGROUND

Twitter is the most widely used microblogging social network and has a large number of users. Its users can read and write tweets (messages of up to 140 characters). Owing to its popularity, Twitter provides a rich supply of data in the form of tweets. It allows people to share their opinions, thoughts and emotions freely, which makes it a good medium for finding interesting trends. At Twitter's official developer conference in April 2010 [4], the company presented usage statistics: as of April 2010, Twitter had 106 million registered users and 180 million unique visitors per month.

Over the past two decades, Sentiment Analysis (SA) has become a popular research topic. For years, polls were the standard method of measuring emotion and opinion about a product or an individual. These older methods have a disadvantage: they are costly and time consuming [1]. SA can classify text into different categories; one example uses five categories: very negative, negative, neutral, positive and very positive. SA research is closely related to Natural Language Processing (NLP). NLP is concerned with the interaction between humans and computers and with retrieving useful information from natural-language messages [2]. I propose a method that automatically extracts tweets and assigns them a sentiment (very positive, positive, neutral, negative or very negative). This model is useful for consumers, who can use SA to research products and individuals before making a purchase or a decision. Marketers can also use it to gauge public opinion on their company or product, and organizations can use it to gather feedback on their new products [3].

https://www.twitter.com

https://www.linkdin.com

https://www.facebook.com

https://www.instagram.com

https://www.youtube.com

https://www.myspace.com

https://www.plus.google.com

In this thesis I focus mainly on the election in Bihar (India) and on some other trending topics. The model automatically labels tweets (texts) as relating to BJP, Congress, Mahagathbandhan, RJD or JDU. Every day ordinary people, critics and politicians post tweets, and I used those tweets as the dataset. Through this process the model converts verbal information into numbers, and from these numbers it reports a sentiment (very positive, positive, neutral, negative or very negative).

The classification model developed in this project determines positive and negative opinion from the status updates (tweets) that people post on Twitter. This work uses hybrid sentiment analysis methodologies.

1.2 MOTIVATION

In this project I show the predictive power of Twitter with regard to Indian elections. Here I predict the Bihar CM election of 2015. For this prediction I analysed 1 lakh (100,000) tweets. In this thesis I propose a novel approach to sentiment analysis called "TWEELYZER": (TWEE)ts + ana(LYZER). I chose tweets with the hashtags "#BJPBihar", "#Mahagathbandhan", "#RJD", "#JDU" and "#CongressBihar" for analysis. For training data I fetched real-time tweets on these hashtags from January 2015 to September 2015. My results show that Twitter is a well-known social medium that provides a valid source for predicting future events such as elections.

https://en.wikipedia.org/wiki/Natural_language_processing

1.3 BLOGGERS AND MICRO BLOGGERS

Blogging can be described as a platform where people share their hobbies and personal experiences on the World Wide Web (WWW). It has become one of the social phenomena of Web 2.0. Blogs, also known as weblogs, are updated regularly to incorporate the most recent posts. Micro-blogging can be defined as a type of blogging that allows people to share their opinions and activities as short messages at the time of writing. In other words, it fills the gap between instant messaging and blogging. This relatively new type of blogging lets individuals post laconic text updates through a variety of communication channels, ranging from mobile text messages and instant messaging to e-mail and the Web.

The main difference between regular blogging and micro-blogging is the restriction on text size in micro-blog posts. Micro-bloggers are confined to a limited message length. This feature allows micro-blogs to be updated by sending text messages from mobile clients such as mobile phones. Being easily accessible from mobile clients, micro-blogging has become very popular with a wide range of users, including ordinary people, celebrities and commercial organizations. Individual users such as actors, musicians, politicians, academics and students use this type of blogging regularly, each for their own purposes.

Microblogging websites have evolved into a source of varied kinds of information. This is due to the nature of microblogs, on which people post real-time messages giving their opinions on a variety of topics, discuss current issues, complain, and express positive sentiment about products they use in daily life. In fact, companies that manufacture such products have started to poll these microblogs to get a sense of the general sentiment towards their products.

Micro-blogs may indicate what the micro-blogger is doing and thinking. They may also provide information about news, the entertainment sector and good deals. Posts that provide specific information generally refer to an external resource, because their limited size makes it hard to convey the news by themselves. Since broadcasting can be briefly defined as spreading information to a large audience, micro-blogs can serve as a channel for broadcasting information about anything users want to learn about.

Twitter is a "micro-blogging" social networking website that has a large and rapidly growing user base. Its users can write short updates of 140 characters or fewer, called "tweets". Tweets are seen by those who 'follow' the person who 'tweeted'.

Due to the growing popularity of the website, Twitter can provide a rich bank of data in the form of harvested tweets. By its very nature, Twitter allows people to convey their opinions and thoughts openly about any topic, discussion point or product they are interested in. Twitter is therefore a good medium in which to search for potentially interesting trends regarding prominent topics in the news or popular culture. The advantage of this data over previously used datasets is that the tweets are collected in a streaming fashion and therefore represent a true sample of actual tweets in terms of language use and content.

The value of Twitter has increased in recent years as businesses, political groups and curious Internet users alike have started to assess the public's general sentiment towards their products and services from Twitter posts. SA provides a means of tracking opinions and attitudes on the web and determining whether they are positively or negatively received by the public.

1.4 SENTIMENT ANALYSIS (SA)

1.4.1 What is Sentiment Analysis?

SA is mostly used in social media monitoring to gauge public opinion on a certain topic. Social media monitoring tools such as Brandwatch Analytics make that process quicker and easier than ever before. With SA we can distinguish poor content from high-quality content. SA is a discipline that can be defined in terms of a set of structured properties that we want to find inside a text. When we read an opinion, we want to know what it is talking about (the object) and what the characteristics of this object are (its features). For each of these features we want to know the direction of the opinion (positive or negative), and finally we want to combine all these opinion directions into a summary.

1.4.2 Origin of Sentiment Analysis

Since 2001, the amount of research on sentiment analysis has been rising. SA research is divided across different internet platforms: review sites (Dave, Lawrence & Pennock, 2003; Hu & Liu, 2004), web pages (Morinaga, Yamanishi, Tateishi & Fukushima, 2002), web pages and news articles (Nasukawa & Yi, 2003), and stock message boards (Das & Chen, 2007).

Pang and Lee (2008) give three reasons why interest in sentiment analysis is flourishing. The first is the rise of machine learning methods in information retrieval and natural language processing. The second is that many review sites have emerged on the internet, which has resulted in a wide availability of datasets that can be used for machine learning algorithms. Finally, the area offers interesting commercial and intelligence applications.

https://www.brandwatch.com/brandwatchanalytics/Brandwatch . (1st Feb. 2016)

1.4.3 Basics of Sentiment Analysis

To explain the principles of SA, Liu (2012) describes it in six steps. These six steps are explained below using the following example tweet: "I think the Samsung S6 edge's Camera gets lots of love. Using it for the last two days and I really like it."

The first step described by Liu (2012) is entity extraction and categorization. For the example tweet this means that the SA tool should extract the words Samsung S6 edge. Categorization means that synonyms for Samsung should also be extracted and categorized together, or put into clusters.

The second step of SA is aspect extraction and categorization. Here all aspects connected to the Samsung S6 edge should be extracted from the text and linked to the entity. In this case the word Camera should be extracted. In other words, every aspect that says something about the entity should be extracted in this step. Other examples of aspects of the Samsung S6 edge could be 'Battery Life', 'Screen' and 'Picture Quality'.

The third step is opinion holder extraction and categorization. This means recognizing the opinion of the writer of the text, also referred to as the sentiment of the text. An opinion or sentiment is recognized by comparing words with a lexicon of words that have a known sentiment value (Raijmakers, 2013). Words in a lexicon can carry positive sentiment, such as good and beautiful, or negative sentiment, such as useless and bad. In the example tweet, the words that carry sentiment value are 'love' and 'like'.
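The lexicon comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two word sets are tiny made-up samples (the thesis's real keyword lexicon is in its Appendix 1), and the name `lexicon_score` is hypothetical.

```python
import re

# Tiny illustrative lexicons (assumed samples, not the thesis's word lists).
POSITIVE_WORDS = {"good", "beautiful", "love", "like"}
NEGATIVE_WORDS = {"useless", "bad"}


def lexicon_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


tweet = ("I think the Samsung S6 edge's Camera gets lots of love. "
         "Using it for the last two days and I really like it.")
print(lexicon_score(tweet))  # prints 2: one match for "love", one for "like"
```

A score above zero suggests positive sentiment and a score below zero negative sentiment. Plain word matching of this kind ignores negation and sarcasm.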

The fourth step is time extraction and standardization. The sentiment analysis tool should find the time and date at which the message was posted; for the example tweet this is 15:10 on September 10th, 2015. The fifth step is aspect sentiment classification. Here the sentiment of the entire text message is determined, which can be positive, negative or neutral. The example tweet should be classified as positive. The sixth step is to create an overview of all the previously described steps, so that the user of the sentiment analysis tool can see that the tweet is a positive tweet about the camera of the Samsung S6 edge.
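The outcome of the six steps for the example tweet can be pictured as one small record. The Python sketch below is an assumption about how such a record might be stored; the class `OpinionRecord` and its field names are hypothetical and are not taken from Liu (2012) or from the thesis.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OpinionRecord:
    """One analysed tweet: what it is about, how it feels, and when it was posted."""
    entity: str          # step 1: entity extraction
    aspect: str          # step 2: aspect extraction
    sentiment: str       # steps 3 and 5: sentiment of the text
    posted_at: datetime  # step 4: time extraction

    def summary(self) -> str:
        # Step 6: an overview that combines the previous steps.
        return f"{self.sentiment} tweet about the {self.aspect} of the {self.entity}"


record = OpinionRecord(
    entity="Samsung S6 edge",
    aspect="camera",
    sentiment="positive",
    posted_at=datetime(2015, 9, 10, 15, 10),
)
print(record.summary())  # positive tweet about the camera of the Samsung S6 edge
```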

1.5 GENERAL SENTIMENT ANALYSIS PROCESS

Based on previous research, I drew up Fig. 1.1 to summarize the sentiment analysis process.

Figure 1.1 General Sentiment Analysis Process

1.5.1 Data Collection

At this stage of sentiment analysis, unstructured text data is taken as input. The source can be a text file or a database of data retrieved from a micro-blogging site such as Twitter.

1.5.2 Data Preprocessing

Once all the data has been collected, it is pre-processed according to the user's requirements. Pre-processing means turning unstructured data into meaningful data.
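A cleaning step of this kind might look as follows. This is a hedged Python sketch rather than the thesis's R cleaning function; the name `clean_tweet` and the particular rules (dropping links, the retweet marker, mentions, hashtags, digits and punctuation) are assumptions about what typical tweet pre-processing removes.

```python
import re


def clean_tweet(text: str) -> str:
    """Strip common tweet noise and return lower-case words joined by single spaces."""
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"\bRT\b", " ", text)        # remove the retweet marker
    text = re.sub(r"[@#]\w+", " ", text)       # remove mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop digits and punctuation
    return " ".join(text.lower().split())      # normalise case and whitespace


raw = "RT @user: Great rally today! http://t.co/abc #BJPBihar"
print(clean_tweet(raw))  # great rally today
```

Whether hashtags should be dropped or kept as ordinary words is a design choice; they are removed here only to keep the example short.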

1.5.3 Sentiment Identification / Extraction

Sentiment affects the outcome, and the type of sentiment depends on the type of data obtained from Twitter. At this stage I try to extract the different words from the tweets.

1.5.4 Feature Extraction

This stage is used to better understand what people think or feel about a particular feature of a product.

1.6 PROBLEM FORMATION

Even though SA systems claim to detect opinions in online text messages and to measure the sentiment values of those messages, it is unclear how well these systems perform. This research tries to identify how well SA tools work and how well their output reflects actual opinions towards a brand and its products.

It is becoming difficult for any organization to gain insight into what people feel and think about a particular thing; hence the ability to extract insights from social data such as Twitter is very helpful to many organizations.

As social media grows and becomes increasingly advanced, more and more people are starting to use it to predict future trends or, as seen more recently, to predict who is going to win the upcoming Bihar election. SA is therefore extremely useful in social media monitoring, as it allows us to gain an overview of wider public opinion.

SA is an automated process of analysing conversations that are taking place online, and on the basis of it analysts can voice their opinion on who the winner of the election might be.

1.7 AIMS

Any kind of analysis on Twitter first requires the construction of a suitable dataset of tweets. The Twitter API is an interface through which an application can extract tweets from Twitter and load them into a dataset. The aims of this work are threefold:

· To construct a database of tweets on the chosen keywords, built using a Twitter API application.

· To perform a series of analyses on the data in R Studio: knowledge-based techniques, which use a sentiment lexicon (dictionary) to determine the number of positive and negative tweets, and machine learning techniques, which are based on a training set and likewise determine the number of positive and negative tweets.

· To use the results of the knowledge-based techniques together with those of the machine learning techniques to ensure a thorough analysis of the dataset.

1.8 LIMITATION

Performing Twitter sentiment analysis in R runs into some problems. First, only a fixed number of tweets can be retrieved from the Twitter API in R. Second, the number of tweets received for a particular keyword is sometimes smaller than the number requested. Third, older tweets are sometimes not returned for a request.

1.9 ORGANIZATION OF THE REPORT

· Chapter 2, Related Work, describes the literature I reviewed in the field of SA and the various methodologies used.

· Chapter 3, Design and Architecture, describes the overall system and how the layout of TWEELYZER is achieved.

· Chapter 4, Implementation, provides a step-by-step guide to all the processes of the project, including code and output.

· Chapter 5, Results, describes the results (output) obtained after the various methodologies were applied to the dataset.

· Chapter 6, Conclusion, describes the advantages and disadvantages of the project.

2. LITERATURE SURVEY

In recent years much work has been done in the field of SA. Many text mining techniques are used to mine Twitter feeds. In fact, researchers have been working on the topic since the beginning of the century.

Twitter differs from other forms of raw data used for sentiment analysis in that sentiments are conveyed in one- or two-sentence blurbs rather than paragraphs. Twitter is much more informal and less consistent in terms of language. Users cover a wide array of topics that interest them and use many symbols, such as emoticons, to express their views on many aspects of their lives (Agarwal et al. 2011). In human-generated status updates, sentiment is not always obvious; many tweets are ambiguous and may use humour, which makes the opinion harder for a machine learning algorithm to determine (Agarwal et al. 2011). Another consideration when using a dataset generated from Twitter is that a considerable number of tweets convey no sentiment, for example those that merely link to a news article, which can lead to difficulties in data gathering, training and testing (Parikh & Movassate, 2009). SA provides a means of tracking opinions and attitudes on the web and determining whether they are positively or negatively received by the public.

Muhammad Asif Razzaq et al. [6] used SA to assess the predictive power of Twitter. Using several classifiers, they achieved 70% accuracy in predicting positive and negative sentiment. By comparing actual election results with their predictions, they identified some political parties and leaders who have low electability but high popularity.

Geetika Gautam and Divakar Yadav [6] proposed a set of techniques that combine machine learning and semantic analysis to classify sentences and provide product reviews based on Twitter data. They compared their proposed algorithm with existing machine learning techniques such as Naïve Bayes, Maximum Entropy and Support Vector Machine.

Turney et al. [7] predicted reviews using an unsupervised learning algorithm that classifies them as thumbs up or thumbs down. They computed the average semantic orientation of phrases that contain adjectives and adverbs, and from this average they determined whether the text is positive or negative.
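The semantic orientation measure in Turney's approach is usually written as SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"), estimated from search-engine hit counts. The Python sketch below illustrates that calculation with invented counts; the function names, the example numbers and the 0.01 smoothing term are assumptions made for this illustration.

```python
import math


def semantic_orientation(near_excellent: float, near_poor: float,
                         hits_excellent: float, hits_poor: float) -> float:
    """SO = log2( hits(phrase NEAR excellent) * hits(poor)
                  / (hits(phrase NEAR poor) * hits(excellent)) ).

    A small constant (0.01, assumed here) avoids division by zero.
    """
    numerator = (near_excellent + 0.01) * hits_poor
    denominator = (near_poor + 0.01) * hits_excellent
    return math.log2(numerator / denominator)


def classify_review(phrase_counts, hits_excellent, hits_poor):
    """Average the SO of all phrases; a positive average means thumbs up."""
    if not phrase_counts:
        raise ValueError("at least one phrase is required")
    scores = [semantic_orientation(near_exc, near_poor, hits_excellent, hits_poor)
              for near_exc, near_poor in phrase_counts]
    average = sum(scores) / len(scores)
    return "thumbs up" if average > 0 else "thumbs down"


# Invented counts: (hits near "excellent", hits near "poor") for two phrases.
print(classify_review([(200, 10), (50, 40)], hits_excellent=1000, hits_poor=1000))
```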

Nguyen Thien Hai et al. [8] built a model that predicts stock market prices using sentiment from different social media. They extract all relevant data from the text available in messages and then apply their own method to the extracted data. They also show the evolution of SA in stock market prediction on large-scale data. Farhan Hassan Khan et al. [9] proposed a method that predicts the polarity of the words in tweets and classifies them as positive or negative.

Nair, Deepu S. et al. [10] worked on reviews of Malayalam films, applying ML approaches. They mainly compared two statistical methods, SVM (Support Vector Machine) and CRF (Conditional Random Field), and found that SVM is better than CRF for SA.

According to Mejova (2009) [15], SA is usually conducted at two levels: a coarse level and a fine level. Coarse-level SA determines the sentiment of an entire document, while fine-level SA deals with attribute-level sentiment. According to Neethu and Rajasree (2013) [16], sentence-level SA lies between these two. Mejova (2009) [15] notes that SA on Twitter provides a dramatically different dataset in which interesting challenges can arise.

According to Boiy et al. (2007) [17], symbolic and ML techniques are the two basic methodologies used for SA of text. The next two sections deal with these techniques in further detail.

A. Symbolic Techniques

Symbolic techniques in supervised classification models make use of available lexical resources. In his SA work, Turney (2002) [7] used a bag-of-words approach. In this approach the document is treated as a collection of words, and the relationships between words are not considered important. To determine the overall sentiment, the sentiment of every word is given a value and using

Details

Type of Edition: First edition
Publication Year: 2016
ISBN (PDF): 9783960675907
File size: 10.6 MB
Language: English
Institution / College: VIT University
Publication date: 2016 (October)
Grade: A
Keywords: Microblogging, Sentiment Analysis, Sentiment detection, Opinion mining, Tweet, Twitter, Data analysis, Classification, Big Data, Social media, R Studio
Product Safety: Anchor Academic Publishing

Author

Durgesh Samariya (Author)
