
Tweelyzer. An Approach to Sentiment Analysis of Tweets

©2016 Textbook 79 Pages

Summary

The ongoing trend of people using microblogging to express their thoughts on various topics has increased the need for computerised techniques for automatic sentiment analysis of texts that do not exceed 200 characters. Twitter is a "micro-blogging" social networking site with a large and rapidly growing user base. Tweets, Twitter's messages, are limited to 140 characters. Because of this limit, sentiment is harder to express and tweets are harder to classify. Sentiment analysis can be performed on two types of content: emotion and opinion. This research focuses entirely on sentiment analysis of opinions. These opinions can be divided into three classes: positive, negative and neutral (somewhere between positive and negative).
The main goal of this study is to build a model that predicts election movement and provides a sentiment score from Twitter messages (which cannot exceed 140 characters). In this project, the author applies a novel approach that automatically classifies the sentiment and emotions of tweets into positive, negative or neutral classes. First, tweets were retrieved from Twitter and converted into a dataset. After pre-processing the data, the proposed algorithm, named TWEELYZER, was applied to the dataset. Finally, the performance of TWEELYZER was measured in terms of accuracy and recall.
In this project, tweets regarding movies, brands, actors and actresses were collected from Twitter and then cleaned and analysed according to the proposed algorithm. The tweets were collected using R Studio. Pre-processing the tweets involved several steps. After pre-processing, analysing the data in R Studio led to several insights.

Excerpt

Table Of Contents


4 IMPLEMENTATION
4.1 Creating Twitter Application
4.2 Working with R/RStudio
4.3 Connecting Twitter API to R
4.4 Saving Tweets in Local Drive
4.5 Cleaning Function
4.6 Sentiment Function of TWEELYZER
4.7 Scoring Tweets
4.8 Visualization of Tweets
4.9 Text Analysis
5 RESULT AND DISCUSSION
6 CONCLUSION AND FUTURE WORK
APPENDIX 1: KEYWORD LEXICON
REFERENCE

LIST OF TABLES
Table No  Title
4.1 Number of tweets fetched
4.2 Sample Emotion Set
4.3 Sample Positive Emotion set
4.4 Sample negative emotion set
4.5 Sample Positive word set
4.6 Sample Negative word set
4.7 Sample Positive emotion word set
4.8 Sample Negative Emotion word set
5.1 Sample Dataset
5.2 Example Tweet

LIST OF FIGURES
Figure No  Title
1.1 General Sentiment Analysis Process
3.1 System Architecture
3.2 Flowchart of Tweelyzer
3.3 Data Collection Process
3.4 Data Pre-Processing Process
3.5 Sample Tweets on Bihar Election
4.1 Link for create Twitter App
4.2 Already created Application List
4.3 Create app screen
4.4 Application creation form
4.5 Developer Agreement
4.6 Twitter Application Details
4.7 Twitter Application Keys/Token
4.8 R download page
4.9 R Console
4.10 RStudio Console
4.11 Twitter Connection
4.12 Emotion Category Visualization Demo
4.13 Wordcloud Demo
5.1 Freq vs word graph for dataset 1
5.2 Wordcloud for dataset 1
5.3 Emotion category graph Dataset 1
5.4 Sentiment Analysis for Dataset 1
5.5 Freq vs word graph for dataset 2
5.6 Wordcloud for dataset 2
5.7 Emotion category graph Dataset 2
5.8 Sentiment Analysis for Dataset 2
5.9 Freq vs word graph for dataset 3
5.10 Wordcloud for dataset 3
5.11 Emotion category graph Dataset 3
5.12 Sentiment Analysis for Dataset 3
5.13 Freq vs word graph for dataset 4
5.14 Wordcloud for dataset 4
5.15 Emotion category graph Dataset 4
5.16 Sentiment Analysis for Dataset 4
5.17 Precision, recall and F-measure of proposed algorithm
5.18 Classification accuracy comparison using different sentiment technique

LIST OF ABBREVIATIONS
ACRONYM  EXPANSION
SA   Sentiment Analysis
WWW  World Wide Web
ML   Machine Learning
SVM  Support Vector Machine
CRF  Conditional Random Field
NB   Naïve Bayes
APP  Application
API  Application Program Interface

1. INTRODUCTION
For years, surveys have been the answer to the question "What do people think?".
Nowadays, many social networking sites such as Twitter(1), LinkedIn(2), Facebook(3),
Instagram(4), YouTube(5), Myspace(6) and Google+(7) have gained great popularity.
In the last few years the social medium Twitter has become more popular day by day,
and it cannot be ignored. This thesis contributes to the field of Sentiment Analysis (SA),
whose main aim is to extract emotions and opinions from text (tweets).
1.1 BACKGROUND
Twitter is the most widely used microblogging social networking site and has a large
number of users. Twitter users can read and write tweets (messages of up to 140
characters). Owing to its popularity, Twitter provides a rich amount of data in the form
of tweets. It allows people to share their opinions, thoughts and emotions freely,
which makes Twitter a good medium for finding interesting trends. At Twitter's
official developer conference in April 2010 [4], the company presented some usage
statistics: Twitter had 106 million registered users and 180 million unique visitors
per month as of April 2010.
In the past two decades, Sentiment Analysis (SA) has become a popular research
topic. For years, polls were the standard method of measuring emotions and opinions
about a product or an individual. This older method has a disadvantage: it is costly and
time consuming [1]. SA output can be classified into different categories, for example
five categories: very negative, negative, neutral, positive and very positive. The SA
research field is closely related to Natural Language Processing (NLP). Natural
Language Processing(8) is concerned with the interaction between humans and computers,
retrieving useful information from natural language messages [2]. I propose a method
that automatically extracts tweets and determines their sentiment (very positive,
positive, neutral, negative or very negative). This model is very useful for consumers,
because they can use SA to research products and individuals before making a purchase
or decision. Marketers can also use this model to gauge public opinion on their company
or product. Organizations can use this model to gather feedback on their new
products [3].
1 https://www.twitter.com
2 https://www.linkdin.com
3 https://www.facebook.com
4 https://www.instagram.com
5 https://www.youtube.com
6 https://www.myspace.com
7 https://www.plus.google.com
In this thesis I mainly focus on the Bihar (India) election and some other
trending topics. The model automatically labels tweets (texts) as BJP, Congress,
Mahagathbandhan, RJD or JDU. Every day the public, critics and politicians post
tweets; I used those tweets as the dataset. Through this process, the model converts
verbal information into numeric form. From these numbers the model derives the
sentiment (very positive, positive, neutral, negative or very negative).
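The step from numbers to the five sentiment classes can be illustrated with a minimal sketch. It is written in Python for illustration only; the function name `label_sentiment` and the cut-off values are assumptions, since this excerpt does not state the thresholds TWEELYZER actually uses.

```python
def label_sentiment(score: int) -> str:
    """Map a numeric tweet score to one of the five classes named in the text."""
    # Hypothetical cut-offs: the excerpt does not give the real thresholds.
    if score >= 2:
        return "very positive"
    if score == 1:
        return "positive"
    if score == 0:
        return "neutral"
    if score == -1:
        return "negative"
    return "very negative"


for s in (3, 1, 0, -1, -4):
    print(s, label_sentiment(s))
```

Any monotonic set of cut-offs would work the same way; the point is only that a signed score is bucketed into ordered classes.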
The classification model developed in this project determines positive and
negative opinion from the status updates (tweets) that people post on Twitter. This
paper uses hybrid sentiment analysis methodologies.
1.2 MOTIVATION
In this project I show the predictive power of Twitter with regard to the Indian
election. Here I predict the Bihar CM election of 2015. For this prediction I analysed
1 lakh (100,000) tweets. In this thesis I propose a novel approach to sentiment analysis
called "TWEELYZER": (TWEE)ts + ana(LYZER). Tweets with the hashtags
"#BJPBihar", "#Mahagathbandhan", "#RJD", "#JDU" and "#CongressBihar" were chosen
for analysis. For training data I fetched real-time tweets on the given hashtags from
January 2015 to September 2015. My results show that Twitter is a well-known social
medium that provides a valid source for predicting future events such as elections.
8 https://en.wikipedia.org/wiki/Natural_language_processing
1.3 BLOGGERS AND MICRO BLOGGERS
Blogging can be described as a platform where people share their hobbies
and personal experiences on the World Wide Web (WWW). It has become one of the
social phenomena of Web 2.0. Blogs, also known as weblogs, are updated on a
regular basis so that the most recent posts appear alongside archived ones.
Micro-blogging can be defined as a type of blogging that allows people to share their
opinions and actions at the time of writing as short messages. In other words, it fills
the gap between instant messaging and blogging. This relatively new type of blogging
makes it possible for individuals to post laconic text updates through a variety of
communication channels, ranging from mobile phone text messages and instant
messaging to e-mail and the Web.
The main difference between regular blogging and micro-blogging is the
text size restriction on micro-blog posts. Micro-bloggers are confined to a limited
message size. This feature enables micro-blogs to be updated by text messages sent
from mobile clients such as mobile phones. Being easily accessible from mobile
clients, micro-blogging has become very popular with a wide range of users, including
ordinary people, celebrities and commercial organizations. Individual users such as
actors, musicians, politicians, academics and students use this type of blogging
regularly for distinct purposes.
Microblogging websites have evolved to become a source of varied kinds of
information. This is due to the nature of microblogs, on which people post real-time
messages about their opinions on a variety of topics, discuss current issues, complain,
and express positive sentiment about products they use in daily life. In fact, companies
manufacturing such products have started to poll these microblogs to get a sense of
the general sentiment towards their products.

Micro-blogs may indicate what the micro-blogger is doing and thinking.
They may also provide information about news, the entertainment sector and
good deals. Those that provide specific information generally refer to an
external resource, because their limited size makes it hard to convey the news
by themselves. As broadcasting is briefly defined as spreading information to a large
audience, micro-blogs can be used as a source of broadcast information
about anything users want to learn about.
Twitter is a "micro-blogging" social networking website that has a large and
rapidly growing user base. Twitter users can write short updates of 140 characters
or fewer, called "tweets". Tweets are seen by those who 'follow' the person who
'tweeted'.
Due to the growing popularity of the website, Twitter can provide a rich bank
of data in the form of harvested tweets. By its very nature, Twitter allows people to
convey their opinions and thoughts openly about any topic, discussion point or product
they are interested in. Twitter is therefore a good medium for searching for
potentially interesting trends regarding prominent topics in the news or popular
culture. The advantage of this data over previously used datasets is that the
tweets are collected in a streaming fashion and therefore represent a true sample of
actual tweets in terms of language use and content.
The value of Twitter has increased in recent years as businesses, political groups
and curious Internet users alike have started to assess the public's general sentiment
towards their products and services from Twitter posts. SA provides a means of tracking
opinions and attitudes on the web and determines whether they are positively or
negatively received by the public.

1.4 SENTIMENT ANALYSIS (SA)
1.4.1 What is Sentiment Analysis?
SA is mostly used with social media to monitor public opinion on a
certain topic. Social media monitoring tools like Brandwatch(9) Analytics
make that process quicker and easier than ever before. With SA we can
distinguish poor content from high-quality content. SA is a discipline that can
be defined as a set of structured properties that we want to find inside a text.
When we read an opinion we want to know what it is talking about (the object) and
what the characteristics of this object are (the features). For each of these
features we want to know the opinion direction (positive or negative), and finally
we want to combine all these opinion directions into a summary.
1.4.2 Origin of Sentiment Analysis
Since 2001, the amount of research on sentiment analysis has been
rising. SA research is divided across different internet platforms: review
sites (Dave, Lawrence & Pennock, 2003; Hu & Liu, 2004), webpages (Morinaga,
Yamanishi, Tateishi & Fukushima, 2002), webpages and news articles
(Nasukawa & Yi, 2003), and stock message boards (Das & Chen, 2007).
Pang and Lee (2008) give three reasons why interest in sentiment
analysis is flourishing. The first is the rise of machine learning methods
in information retrieval and natural language processing. The second is that
many review sites emerged on the internet, resulting in a wide availability of
datasets that can be used by machine learning algorithms.
Finally, the area offers interesting commercial and intelligence applications.
9 https://www.brandwatch.com/brandwatchanalytics/Brandwatch (1st Feb. 2016)

1.4.3 Basics of Sentiment Analysis
To explain the principles of SA, Liu (2012) describes it in six steps.
These steps are explained below using an example tweet: "I think the Samsung
S6 edge's Camera gets lots of love. Using it for the last two days and I really
like it."
The first step described by Liu (2012) is entity extraction and
categorization. For the example tweet this means that the SA tool
should extract the words Samsung S6 edge. Categorization means
that synonyms referring to the same entity should also be extracted and
categorized together, or put into clusters.
The second step of SA is aspect extraction and categorization. Here all
aspects connected to Samsung S6 edge should be extracted from the
text and linked to the entity. In this case the word Camera should be
extracted. So in the second step every aspect that says something about
the entity should be extracted. Other examples of aspects of the
Samsung S6 edge could be 'Battery Life', 'Screen' and 'Picture Quality'.
The third step is opinion holder extraction and categorization. This
is recognizing the opinion of the writer of the text, also referred to as the
sentiment of the text. Recognizing an opinion or sentiment is done by
comparing words to a lexicon of words that have a known sentiment value
(Raijmakers, 2013). Words in a lexicon can have positive sentiment, such as
good and beautiful, or negative sentiment, such as useless and bad. In the
example tweet the words that carry sentiment value are 'love' and 'like'.
The fourth step is time extraction and standardization. The sentiment
analysis tool should find the time and date at which the message was posted.
For the example this is 15:10 on September 10th, 2015. The fifth step is aspect
sentiment classification. Here the sentiment of the entire text message is
determined, which can be positive, negative or neutral. The example tweet
should be classified as positive. The sixth step is to create
an overview of all the previously described steps. This way the user of the
sentiment analysis tool can see that the tweet is a positive tweet about the camera
of the Samsung S6 edge.
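The six steps can be sketched on the example tweet as follows. This is an illustrative Python sketch, not Liu's or the author's implementation: the `OpinionSummary` record, the `summarise` helper, the candidate entity and aspect lists and the tiny lexicon are all assumptions made for the example, and plain substring matching stands in for real entity and aspect extraction.

```python
import re
from dataclasses import dataclass

# Tiny illustrative lexicon; the real lexicon is not shown in this excerpt.
POSITIVE = {"good", "beautiful", "love", "like"}
NEGATIVE = {"useless", "bad"}


@dataclass
class OpinionSummary:
    entity: str            # step 1: entity
    aspects: list          # step 2: aspects of the entity
    sentiment_words: list  # step 3: words matched against the lexicon
    posted_at: str         # step 4: time of posting
    polarity: str          # step 5: overall classification


def summarise(tweet, entities, aspects, posted_at):
    """Step 6: collect the results of steps 1-5 in one overview record."""
    text = tweet.lower()
    words = re.findall(r"[a-z']+", text)
    entity = next((e for e in entities if e.lower() in text), "")
    found_aspects = [a for a in aspects if a.lower() in text]
    hits = [w for w in words if w in POSITIVE or w in NEGATIVE]
    score = sum(1 if w in POSITIVE else -1 for w in hits)
    polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return OpinionSummary(entity, found_aspects, hits, posted_at, polarity)


tweet = ("I think the Samsung S6 edge's Camera gets lots of love. "
         "Using it for the last two days and I really like it.")
summary = summarise(tweet, ["Samsung S6 edge"],
                    ["Camera", "Battery Life", "Screen"], "2015-09-10 15:10")
print(summary)
```

For the example tweet the record names the Samsung S6 edge as entity, Camera as the only aspect found, and a positive polarity driven by 'love' and 'like'.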
1.5 General Sentiment Analysis Process
Based on previous research, Fig. 1.1 summarises the sentiment analysis
process.
Figure 1.1 General Sentiment Analysis Process

1.5.1 Data Collection
At this stage of sentiment analysis, unstructured text data is taken as
input. The source of the data can be a text file or a database retrieved from a
micro-blogging site such as Twitter.
1.5.2 Data Preprocessing
Once all the data has been collected, it is pre-processed according to the
user's requirements. Pre-processing means turning unstructured data into
meaningful data.
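A pre-processing step of this kind might look like the sketch below. It is in Python for illustration and is not the cleaning function of the thesis, which this excerpt does not show; the function name `clean_tweet`, the specific rules (dropping links, retweet markers, mentions, digits and punctuation) and the sample tweet are assumptions.

```python
import re


def clean_tweet(text: str) -> str:
    """Reduce a raw tweet to lower-case words (illustrative rules only)."""
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"\bRT\b", " ", text)         # drop the retweet marker
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = text.replace("#", " ")               # keep the hashtag word, drop '#'
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop digits and punctuation
    return " ".join(text.lower().split())       # normalise case and whitespace


raw = "RT @user: Great rally today!! #BJPBihar http://t.co/abc123"
print(clean_tweet(raw))
```

Links are removed first so that their fragments are not later mistaken for words.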
1.5.3 Sentiment Identification / Extraction
The sentiment determines the outcome, and the type of sentiment depends
on the type of data obtained from Twitter. At this stage I extract the
individual words from the tweets.
1.5.4 Feature Extraction
This stage is used to better understand what people think or feel about
a particular feature of a product.
1.6 PROBLEM FORMATION
Even though SA systems claim to detect opinions in online text
messages and measure the sentiment values of those messages, it is unclear how well
these systems perform. This research tries to identify how well SA tools work, and how
their output reflects actual opinions towards a brand and its products.
It is becoming difficult for any organization to gain insight into what people
feel and think about a particular thing; hence the ability to extract insights from
social data such as Twitter is very helpful for various organizations.
As social media grows and becomes increasingly advanced, more
and more people are starting to use it as a way to predict future trends or, as seen more
recently, to predict who is going to win the upcoming Bihar election. Hence SA is
extremely useful in social media monitoring, as it allows us to gain an overview of
wider public opinion.
SA is an automated process of analysing conversations that are taking place
online, on the basis of which analysts can voice their opinion on who the winner of
the election might be.
1.7 AIMS
In order to conduct any kind of analysis on Twitter, a suitable dataset of tweets
needs to be built. A Twitter API application extracts tweets from Twitter and loads
them into a dataset. The aims of this paper are threefold. The first is to construct
a database of tweets on the chosen keywords using a Twitter API application.
The second is to perform a series of analyses on the data in R Studio:
knowledge-based techniques, which use a sentiment lexicon dictionary to determine the
number of positive and negative tweets, and machine learning techniques, which are
based on a training set and determine the number of tweets that are positive and negative.
The third is to combine the results of the knowledge-based techniques with those of
the machine learning techniques to ensure a thorough analysis of the dataset.
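The knowledge-based technique mentioned in the aims, counting positive and negative tweets with a sentiment lexicon, can be sketched as follows. The thesis carries this out in R Studio; the Python version below is only an illustration, and the word lists, function names and sample tweets are hypothetical.

```python
import re
from collections import Counter

POSITIVE = {"good", "great", "love", "win"}      # hypothetical sample words
NEGATIVE = {"bad", "useless", "hate", "lose"}    # hypothetical sample words


def tweet_score(tweet: str) -> int:
    """Positive lexicon matches minus negative lexicon matches."""
    words = re.findall(r"[a-z]+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def count_polarities(tweets):
    """Count how many tweets score above, below or exactly zero."""
    counts = Counter()
    for t in tweets:
        s = tweet_score(t)
        counts["positive" if s > 0 else "negative" if s < 0 else "neutral"] += 1
    return counts


tweets = ["Great speech, love it", "Useless promises", "Rally at 5 pm"]
counts = count_polarities(tweets)
print(dict(counts))
```

A machine learning classifier trained on labelled tweets would produce the same kind of per-class counts, which is what makes the two sets of results comparable.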
1.8 LIMITATION
Performing Twitter sentiment analysis in R raises some problems.
Firstly, only a fixed number of tweets can be obtained from the Twitter API in R.
Secondly, the number of tweets received is sometimes smaller than the number
requested for a particular keyword. In addition, older tweets are sometimes not
returned for a request.
1.9 ORGANIZATION OF THE REPORT
· Chapter 2, Related Work, describes the literature that I reviewed in the
field of SA and the various methodologies used.
· Chapter 3, Design and Architecture, describes the overall system and layout of
TWEELYZER.
· Chapter 4, Implementation, provides a step-by-step guide to all the processes of
the project, including code and output.
· Chapter 5, Results, describes the results (output) obtained after the various
methodologies have been applied to the dataset.
· Chapter 6, Conclusion, describes the advantages and disadvantages
of the project.

2. LITERATURE SURVEY
In recent years a great deal of work has been done in the field of SA. Many text
mining techniques are used to mine Twitter feeds. In fact, many researchers have been
working in this area since the beginning of the century.
Twitter differs from other forms of raw data used for sentiment analysis
in that sentiments are conveyed in one- or two-sentence blurbs rather than paragraphs.
Twitter is much more informal and less consistent in terms of language. Users cover a
wide array of topics that interest them and use many symbols such as emoticons to
express their views on many aspects of their lives (Agarwal et al. 2011). In
human-generated status updates, sentiment is not always obvious; many tweets are
ambiguous and use humour, which can obscure the opinion from a machine learning
algorithm (Agarwal et al. 2011). Another consideration when using a dataset generated
from Twitter is that a considerable number of tweets convey no sentiment,
such as those linking to a news article, which can lead to difficulties in data gathering,
training and testing (Parikh & Movassate, 2009). SA provides a means of tracking
opinions and attitudes on the web and determines whether they are positively or
negatively received by the public.
Muhammad Asif Razzaq et al. [6] used SA to assess the predictive power of
Twitter. Using several classifiers they achieved 70% accuracy in predicting
positive and negative sentiments. By comparing actual election results with their
predictions, they identified some political parties and leaders with low electability
but high popularity.
Geetika Gautam and Divakar Yadav [6] proposed a set of techniques that
combine machine learning and semantic analysis to classify sentences
and provide product reviews based on Twitter data. They compared their proposed
algorithm with existing machine learning techniques such as Naïve Bayes, Maximum
Entropy and Support Vector Machine.

Turney et al. [7] predicted reviews using an unsupervised learning algorithm that
classifies reviews as thumbs up or thumbs down. They averaged the semantic orientation
of a sentence's adjective and adverb combinations and from this identified whether the
sentence is positive or negative.
Nguyen Thien Hai et al. [8] built a model that predicts stock market prices using
sentiment from different social media. They extract the relevant data from the text
of messages and then apply their own method to it. They also show the
evolution of SA in stock market prediction using large-scale data. Farhan Hassan Khan
et al. [9] proposed a method to predict the polarity of the words in tweets and
classify them as positive or negative.
Nair, Deepu S. et al. [10] worked on reviews of Malayalam films using ML
approaches. They mainly compared two statistical methods, SVM (Support Vector
Machine) and CRF (Conditional Random Field), and found that SVM performs
better than CRF for SA.
According to Mejova (2009) [15], SA is usually conducted at two levels: a
coarse level and a fine level. Coarse-level SA deals with determining the sentiment of
an entire document, and fine-level SA deals with attribute-level sentiment. According
to Neethu and Rajasree (2013) [16], sentence-level SA comes in between these two.
Mejova (2009) [15] notes that SA on Twitter provides a dramatically different dataset
in which interesting challenges can arise.
According to Boiy et al. (2007) [17], symbolic and ML techniques are the two
basic methodologies used in SA of text. The next two sections deal with these
techniques in further detail.
A. Symbolic Techniques
Symbolic techniques in supervised classification models make use of
available lexical resources. In his SA work, Turney (2002) [7] used a bag-of-words
approach. In this approach the document is treated as a collection of words
in which relationships between words are not considered important. To determine
the overall sentiment, the sentiment of every word is given a value and using

Details

Pages: 79
Type of Edition: First edition
Year: 2016
ISBN (PDF): 9783960675907
File size: 10.6 MB
Language: English
Institution / College: VIT University
Publication date: 2016 (October)
Grade: A
Keywords: Microblogging, Sentiment Analysis, Sentiment detection, Opinion mining, Tweet, Twitter, Data analysis, Classification, Big Data, Social media, R Studio