Your customers are constantly talking about you across a variety of channels, sharing their experiences, opinions, and feedback. It could be on social media, or on your own managed channels such as forums and feedback forms. These are valuable inputs that can help shape your product and service roadmap, identify customer issues and pain points, and improve engagement and loyalty. But how do you synthesize this incredible volume of data and make it meaningful for your business? In this post, we are going to talk about one specific technique, called sentiment analysis, and discuss how it can be used to track your brand’s perception in the market. Sentiment analysis, also called opinion mining, is a useful tool within natural language processing that allows us to identify, quantify, and study subjective information.
In this practical example, we are going to mine recent Twitter feeds for a brand and answer a few specific questions. While there are many sentiment analysis models, we will use a simple but popular package called VADER to illustrate the approach. The entire code is available in my Colab file.
The general outline for our analysis is as follows:
Frame the objective of the analysis / the business problem we are trying to solve
Gather the required data; in our case, extract recent tweets using the Twitter API
Do some data pre-processing and clean unwanted words
Pick a technique/model (we will use VADER in our case)
Apply the model to our data
Summarize / visualize results
Develop a plan of action
1. Objective of the analysis
For this analysis, we want to answer three specific questions. First, what is the general perception of my brand in the market? Second, how does that compare to my competitors? Lastly, what are some pain points and frustrations my customers are experiencing, and how can we address them?
2. Gathering the data
We are going to use the Twitter API to extract recent tweets for my brand and its competitors. The process is quite straightforward. You will need to apply for free developer access on Twitter and generate API credentials, which are then used by tweepy, a Python library, to extract the tweets. Detailed steps are in this well-written article. We will use a popular car brand, Audi, for our analysis.
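As a rough sketch, the extraction step might look like the following. The credentials, the search query, and the `tweets_to_df` helper are illustrative assumptions, not part of the original article; the live tweepy calls are shown as comments since they require your own developer credentials:

```python
import pandas as pd

def tweets_to_df(tweets):
    # Hypothetical helper: flatten tweet objects into a DataFrame of
    # (id, created_at, text) rows for the downstream analysis
    rows = [{'id': t.id, 'created_at': t.created_at, 'text': t.full_text}
            for t in tweets]
    return pd.DataFrame(rows)

# Live extraction (requires your own Twitter developer credentials):
#   import tweepy
#   auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
#   auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
#   api = tweepy.API(auth, wait_on_rate_limit=True)
#   tweets = api.search_tweets(q='Audi -filter:retweets', lang='en',
#                              count=100, tweet_mode='extended')
#   tweet_df = tweets_to_df(tweets)
```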
3. Data pre-processing
For any natural language processing (NLP) technique, the input text needs to be cleaned of unwanted characters, symbols, or stop words that don’t affect the meaning of the underlying data. Usually we can apply simple regex (regular expression) transformations like the ones below:
import re
import numpy as np

def remove_pattern(input_txt, pattern):
    # Remove every match of the given regex pattern from the text
    return re.sub(pattern, '', input_txt)

def clean_tweet(lst):
    # remove retweet handles (RT @xxx:)
    lst = np.vectorize(remove_pattern)(lst, r"RT @[\w]*:")
    # remove twitter handles (@xxx)
    lst = np.vectorize(remove_pattern)(lst, r"@[\w]*")
    # remove URL links (httpxxx)
    lst = np.vectorize(remove_pattern)(lst, r"https?://[A-Za-z0-9./]*")
    # remove hashtags (#xxx)
    lst = np.vectorize(remove_pattern)(lst, r"#[\w]*")
    # remove special characters and punctuation (keep letters, digits, spaces)
    lst = np.vectorize(lambda s: re.sub(r"[^0-9a-zA-Z ]+", ' ', s))(lst)
    return lst
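To make the behaviour concrete, here is the same cleaning logic condensed into a single self-contained function and applied to one raw tweet. The sample text is made up for illustration:

```python
import re

def clean_text(txt):
    # Strip retweet markers, handles, URLs and hashtags, then special characters
    for pattern in (r"RT @[\w]*:", r"@[\w]*", r"https?://[A-Za-z0-9./]*", r"#[\w]*"):
        txt = re.sub(pattern, '', txt)
    return re.sub(r"[^0-9a-zA-Z ]+", ' ', txt)

raw = "RT @caradvice: Loving the new ride! https://t.co/abc123 #Audi"
cleaned = clean_text(raw)
print(cleaned)  # the retweet marker, handle, URL and hashtag are all gone
```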
4. Picking a technique / model
This is probably the most challenging part of the process. There is no perfect technique or model that works for all cases. Broadly, there are three types of techniques. First, lexicon/rule-based models, which score each word in a sentence independently and then determine the overall sentiment of the sentence. Second, machine-learning-based models, which are trained on a labeled dataset of sentences; the labels are either categorical values, such as positive, negative, or neutral, or a numeric range from positive to negative. Last, there are hybrid models that use a combination of rule-based and machine learning models. For this example, we will use a simple lexicon- and rule-based model called VADER (Valence Aware Dictionary and sEntiment Reasoner), which is known to work well on text from social media. But it’s recommended to try a few models and evaluate them for accuracy before deploying them at a larger scale.
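To build intuition for the lexicon/rule-based family, here is a deliberately tiny scorer. This is not VADER itself; the lexicon and the single negation rule are made up for illustration. Each word is looked up in a small hand-made valence dictionary, a preceding "not" flips the sign, and the word scores are averaged into a sentence score:

```python
# Toy lexicon: word -> valence (made up; VADER's real lexicon is far larger
# and its rules handle punctuation, capitalization, intensifiers, etc.)
LEXICON = {'love': 2.0, 'great': 1.5, 'good': 1.0,
           'bad': -1.0, 'terrible': -2.0, 'broken': -1.5}

def toy_sentiment(sentence):
    words = sentence.lower().split()
    scores = []
    for i, w in enumerate(words):
        if w in LEXICON:
            score = LEXICON[w]
            # Simple rule: a preceding "not" flips the valence
            if i > 0 and words[i - 1] == 'not':
                score = -score
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

print(toy_sentiment('the service was great'))      # positive score
print(toy_sentiment('the service was not great'))  # negative score
```

The rule layer on top of the lexicon is what distinguishes VADER from a plain dictionary lookup.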
5. Applying the model to the data
VADER is really simple to use: install the relevant libraries, define a function to convert VADER’s numeric scores into a grade, and apply it to your dataframe of tweets.
1. Install the relevant libraries
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
2. Define a function to convert VADER’s numeric scores into a grade
def vader_sentiment_analyzer_grade(sentence):
    analyser = SentimentIntensityAnalyzer()
    result = analyser.polarity_scores(sentence)
    if (result['compound'] >= 0.05) or (result['pos'] >= 0.5):  # lower positive threshold
        grade = 'positive'
    elif result['compound'] <= -0.05:
        grade = 'negative'
    else:
        grade = 'neutral'
    return grade
3. Apply the function to your dataframe of tweets

tweet_df_cleaned['vader_sentiment'] = tweet_df_cleaned['text'].apply(vader_sentiment_analyzer_grade)
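End to end, the apply step looks like this. Here a keyword-matching stub stands in for the VADER grader so the snippet runs without nltk, and the dataframe contents are made up for illustration:

```python
import pandas as pd

def grade_stub(sentence):
    # Stand-in for vader_sentiment_analyzer_grade (keyword match, illustration only)
    s = sentence.lower()
    if 'love' in s or 'great' in s:
        return 'positive'
    if 'hate' in s or 'broke' in s:
        return 'negative'
    return 'neutral'

tweet_df_cleaned = pd.DataFrame({'text': [
    'I love my new Audi',
    'the dealership service broke my trust',
    'picked up the car today']})

tweet_df_cleaned['vader_sentiment'] = tweet_df_cleaned['text'].apply(grade_stub)
print(tweet_df_cleaned['vader_sentiment'].value_counts())
```

`value_counts()` gives the positive/neutral/negative tallies that feed the pie chart in the next step.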
6. Summarize / Visualize results
Now let’s visualize and interpret the results. Overall, sentiment for our brand looks quite positive; however, looking at the wordcloud, there are clearly a few areas of improvement. We can do the same for our competitors and compare their brand perception with ours.
1. Plot a pie chart with sentiment values
import matplotlib.pyplot as plt

def plot_result(df):
    value_count = df['vader_sentiment'].value_counts()
    # Pie chart
    labels = ['Positive', 'Neutral', 'Negative']
    sizes = [value_count.positive, value_count.neutral, value_count.negative]
    explode = (0, 0.1, 0)
    colors = ['#1db954', '#f9d038', '#f37778']
    fig1, ax1 = plt.subplots()
    ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
            shadow=True, startangle=90, colors=colors)
    # Equal aspect ratio ensures that the pie is drawn as a circle
    ax1.axis('equal')
    plt.tight_layout()
    plt.show()
2. Plot a wordcloud of all negative sentiments
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plot_cloud(text):
    wordcloud = WordCloud(width=3000, height=2000, background_color='black',
                          stopwords=STOPWORDS).generate(str(text))
    fig = plt.figure(figsize=(10, 10), facecolor='k', edgecolor='k')
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
mask = (tweet_df_cleaned['vader_sentiment'] == 'negative')
plot_cloud(tweet_df_cleaned[mask]['text'])
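The masking step generalizes: pandas boolean indexing pulls out just the negative rows, and joining their text gives the single string the wordcloud works from. The toy dataframe below is made up for illustration:

```python
import pandas as pd

tweets = pd.DataFrame({
    'text': ['great drive',
             'engine trouble again',
             'dealership ignored my recall question'],
    'vader_sentiment': ['positive', 'negative', 'negative']})

# Boolean mask selects only the negative rows; join their text for the wordcloud
mask = tweets['vader_sentiment'] == 'negative'
negative_text = ' '.join(tweets.loc[mask, 'text'])
print(negative_text)
```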
7. Developing a plan of action
As seen above, there are some clear areas of improvement. There are issues around service, recalls, fire, accidents, and dealerships. You can easily confirm an incident by looking inside the dataframe for the specific comments and verifying their authenticity, and based on that develop an action plan.
For example, if comments around service, particularly at dealerships, are a pain point identified by customers, we should look into the feedback. We should follow up with the specific Twitter users to get more details, identify whether the problem is isolated to a specific location or systemic across many locations, and then develop clear service guidelines for the dealerships as part of their contract, reviewing them periodically through follow-up surveys to customers.
We could also look at complaints about our competitor brands and feature the contrast prominently as a key differentiator for our brand in our messaging.
There are many strategic actions one can take once we have mined this data, but do make sure you validate the comments before making any major changes.
In conclusion, even without a perfect model or technique, a lot of interesting analysis and decisions can still be made by synthesizing and validating the large corpus of social media comments about our brand and our competitors. Happy exploring!