Twitter post sentiment analysis using Python

Sentiment analysis means working out whether a piece of text is positive or negative (and, in some cases, neutral). We can use Python and various machine learning techniques to classify text this way. You can find the full project here.

How is this different from normal data?

Handling text differs from handling ordinary numeric data in two ways: we first have to convert the text to numbers, and the inputs have variable length, since every tweet is a different length. Luckily, Python packages like NLTK make processing textual data really easy.

Process and coding

The process of predicting sentiment is much the same as for other data problems. The only difference is in the preprocessing step, where we have to convert our data from text/string format to numbers so that the computer can work with it. Let's start coding.

import pandas as pd

data = pd.read_csv("Sentiment.csv")
# count how many tweets carry each label
print(data.sentiment.value_counts())

## Negative    8493
## Neutral     3142
## Positive    2236
## Name: sentiment, dtype: int64

We can see that the data is imbalanced: negative tweets outnumber neutral and positive tweets combined. Let's preprocess the data.
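Because the classes are this uneven, it helps to know what accuracy a trivial model would reach. The sketch below uses only the label counts printed above; the variable names are illustrative and not part of the original project.

```python
# Label counts copied from the value_counts() output above.
counts = {"Negative": 8493, "Neutral": 3142, "Positive": 2236}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most frequent label is the
# simplest baseline any real classifier should beat.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")
print(f"Majority baseline: always '{majority_label}' -> {baseline_accuracy:.1%}")
```

Any accuracy reported further down is worth reading against this baseline of roughly 61%.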

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# needs the NLTK stopword list and tokenizer data (see nltk.download)
tweets = []
stopwords_set = set(stopwords.words("english"))
porterStemmer = PorterStemmer()

for index, row in data.iterrows():
    # lowercase the tweet
    sent = row.text.lower()
    # drop links, hashtags, mentions, stopwords and words of 3 characters or fewer
    words_cleaned = ' '.join([word for word in sent.split()
                              if 'http' not in word
                              and not word.startswith('#')
                              and not word.startswith('@')
                              and word not in stopwords_set
                              and len(word) > 3])
    # tokenize, keep purely alphabetic tokens (drops punctuation) and stem each word
    words_cleaned = [porterStemmer.stem(word) for word in word_tokenize(words_cleaned) if word.isalpha()]
    # join the stems back into one string
    words_joined = ' '.join(words_cleaned)
    tweets.append((words_joined, row.sentiment))

tweets_dataframe = pd.DataFrame(tweets, columns=["text", "sentiment"])

This preprocesses the data. We started by lowercasing each tweet. Then we removed links, hashtags, mentions, stopwords and very short words (three characters or fewer), as they add little to the model. After that we removed the punctuation and stemmed the words. Stemming strips tense and plural endings, reducing a word to its root form; for example, with the Porter stemmer, like, likes and liked are all reduced to 'like'. Finally, we joined the stems back together to make a complete processed sentence.
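To see the filtering step on a single tweet, here is a minimal sketch that uses only the standard library. It assumes a tiny stand-in stopword set and a made-up example tweet, skips stemming, and uses a regular expression as a rough substitute for word_tokenize plus isalpha, so its output can differ from the NLTK pipeline on contractions and similar cases.

```python
import re

# Tiny stand-in stopword list (an assumption); NLTK's English list is much longer.
STOPWORDS = {"i", "the", "a", "is", "was", "this", "that", "with"}

def clean_tokens(tweet):
    """Apply the same token filters as the loop above, without stemming."""
    kept = [
        word for word in tweet.lower().split()
        if "http" not in word            # drop links
        and not word.startswith("#")     # drop hashtags
        and not word.startswith("@")     # drop mentions
        and word not in STOPWORDS        # drop stopwords
        and len(word) > 3                # drop words of 3 characters or fewer
    ]
    # Keep alphabetic runs only, which strips punctuation such as "!".
    return re.findall(r"[a-z]+", " ".join(kept))

# Made-up tweet for illustration.
example = "@user I really loved the debate tonight! #Debate http://t.co/abc"
print(clean_tokens(example))
```

Note that the length filter alone already removes most English stopwords, since many of them have three letters or fewer.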

[Figure: word cloud for the Twitter sentiment analysis]

Now that we have processed the sentences, we can convert them to numbers. To do that we can use vectorizers. A vectorizer converts each sentence into a fixed-length array of numbers, which also takes care of the variable-length problem. For example, take three sentences: 'i am good', 'he is a good person', 'i am a good person'. CountVectorizer (one type of vectorizer) counts how many times each vocabulary word occurs in a sentence and stores those counts in an array, one column per word.

sentence               i  am  good  he  is  a  person
i am good              1  1   1     0   0   0  0
he is a good person    0  0   1     1   1   1  1
i am a good person     1  1   1     0   0   1  1
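The counts above can be reproduced by hand. The following is a minimal from-scratch sketch, not scikit-learn's implementation; the function name count_vectors is made up for illustration, and the columns are kept in order of first appearance to match the example.

```python
def count_vectors(sentences):
    """Return (vocabulary, rows) where each row counts word occurrences."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)   # columns in order of first appearance
    rows = [[sentence.split().count(word) for word in vocabulary]
            for sentence in sentences]
    return vocabulary, rows

sentences = ["i am good", "he is a good person", "i am a good person"]
vocabulary, rows = count_vectors(sentences)
print(vocabulary)
for sentence, row in zip(sentences, rows):
    print(f"{sentence:<22}{row}")
```

With its default settings, scikit-learn's CountVectorizer orders the columns alphabetically and ignores single-character tokens such as 'i' and 'a', so its output for these sentences would have fewer columns than this sketch.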

We can use either a count vectorizer or a TF-IDF vectorizer. The TF-IDF vectorizer also takes the whole dataset into account, down-weighting words that appear in many tweets, so I consider it the better choice.
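To make "takes the whole dataset into account" concrete, here is a small from-scratch sketch of TF-IDF on the three example sentences. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each row to unit length, which is what scikit-learn documents as its default; the helper name tfidf_row is hypothetical.

```python
import math

docs = [s.split() for s in ["i am good", "he is a good person", "i am a good person"]]
n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: in how many sentences does each word occur?
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed IDF: a word found in every sentence gets weight 1, rarer words get more.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit length."""
    weights = [doc.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

for word in ("good", "he"):
    print(word, round(idf[word], 3))
print([round(w, 3) for w in tfidf_row(docs[1])])
```

'good' appears in all three sentences, so it carries little information and gets the lowest weight, while 'he' appears in only one and is weighted more heavily.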

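from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# option 1: raw word counts
cv = CountVectorizer()
X = cv.fit_transform(tweets_dataframe.text).toarray()

# option 2: TF-IDF weights (this overwrites X, so run only one option)
tfdifv = TfidfVectorizer()
X = tfdifv.fit_transform(tweets_dataframe.text).toarray()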

Either vectorizer gives us a usable X. Next, let's encode our targets as integers.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
Y = le.fit_transform(tweets_dataframe.sentiment)
print(le.classes_)
# ['Negative' 'Neutral' 'Positive']
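Label encoding itself is simple: the unique labels are sorted, and each label is replaced by its position in that sorted list. A minimal sketch with a made-up list of labels:

```python
# Made-up labels for illustration.
labels = ["Positive", "Negative", "Neutral", "Negative"]

# Sorted unique labels define the classes; each label becomes its index.
classes = sorted(set(labels))
to_index = {label: i for i, label in enumerate(classes)}
encoded = [to_index[label] for label in labels]

# Decoding is the reverse lookup.
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
print(decoded)
```

This is why the class order shown above is alphabetical: Negative maps to 0, Neutral to 1 and Positive to 2.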

Now create the test-train split.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.15)
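With imbalanced labels, a purely random split can leave the test set with a slightly different class mix from the training set, and the split changes on every run. One option, sketched below from scratch, is a seeded, stratified split that holds out the same fraction of every class. The function stratified_split and the toy labels are illustrative assumptions; in scikit-learn, train_test_split accepts stratify and random_state arguments for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.15, seed=0):
    """Return (train_idx, test_idx), holding out test_size of every class."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with roughly the class balance of the tweet data.
labels = [0] * 60 + [1] * 20 + [2] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print([sum(labels[i] == c for i in test_idx) for c in (0, 1, 2)])
```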

Now we can fit the data. Let's first create a logistic regression model, and then we will create a model in Keras.

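from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# fit on the training split, then score both splits
clf = LogisticRegression().fit(x_train, y_train)
y_train_pred = clf.predict(x_train)
print("Train accuracy = ", accuracy_score(y_train, y_train_pred))
y_test_pred = clf.predict(x_test)
print("Test accuracy = ", accuracy_score(y_test, y_test_pred))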

# Train accuracy =  0.7368956743002545
# Test accuracy =  0.6708313310908217
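A single accuracy figure can hide how the model does on the smaller classes. The sketch below builds a confusion matrix and per-class recall by hand; the labels and predictions are hypothetical and are not the results of the model above. scikit-learn provides confusion_matrix and classification_report in sklearn.metrics for the same job.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Hypothetical labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
print(matrix)
print(accuracy, per_class_recall(matrix))
```

In this toy case the overall accuracy is 0.8, yet only half of the neutral and positive examples are recovered, which is the pattern to watch for when one class dominates.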

Now let's create a model in Keras.
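Two pieces of the network are worth seeing in isolation first: its softmax output layer, and the sparse categorical cross-entropy loss it is trained with. The sketch below is a plain-Python illustration with made-up scores, not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # subtract max for stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability assigned to the integer class label."""
    return -math.log(probs[label])

probs = softmax([2.0, 0.5, -1.0])   # hypothetical scores for the 3 classes
print([round(p, 3) for p in probs])
print(sparse_categorical_crossentropy(probs, 0))   # true class is the top score
print(sparse_categorical_crossentropy(probs, 2))   # true class is the lowest score
```

The "sparse" variant takes integer labels directly, which is why the integer-encoded Y can be passed to the model without one-hot encoding.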

import tensorflow as tf

inshape = X.shape[-1]        # number of features per tweet
outshape = len(le.classes_)  # number of sentiment classes

def get_model():
    model = tf.keras.Sequential()
    # only the first layer needs input_shape; later layers infer it
    model.add(tf.keras.layers.Dense(4192, input_shape=(inshape,), activation='relu'))
    model.add(tf.keras.layers.Dense(2096, activation='relu'))
    model.add(tf.keras.layers.Dense(512, activation='relu'))
    model.add(tf.keras.layers.Dense(128, activation='relu'))
    model.add(tf.keras.layers.Dense(outshape, activation='softmax'))
    return model

model = get_model()
model.compile('adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))

# loss: 0.2261 - acc: 0.9025 - val_loss: 1.9789 - val_acc: 0.6463
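The gap between the training accuracy (about 0.90) and the validation accuracy (about 0.65, slightly below the logistic regression test score) suggests the network is memorizing the training tweets. A rough parameter count helps explain why. The sketch below counts the weights and biases of the dense layers; the helper names are made up, and vocab_size is a hypothetical stand-in because the real input width, X.shape[-1], is not shown in the post.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def total_params(input_dim, layer_sizes):
    """Sum dense_params over consecutive layers, starting from input_dim."""
    total, n_in = 0, input_dim
    for n_out in layer_sizes:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total

layer_sizes = [4192, 2096, 512, 128, 3]   # widths from get_model(), 3 classes
hidden_only = total_params(4192, layer_sizes[1:])
print(f"{hidden_only:,} parameters after the first layer")

vocab_size = 10_000   # hypothetical; the real value is X.shape[-1]
print(f"{total_params(vocab_size, layer_sizes):,} parameters in total")
```

Even without the first layer, that is close to ten million parameters for a dataset of fewer than 14,000 tweets, so a much smaller network is likely to generalize at least as well.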

So this is how we handle textual data and fit a model on it. Thank you for reading this post.