Sentiment analysis means figuring out whether a piece of text is positive or negative (and, in some cases, neutral). We can use Python and various machine learning techniques to classify text by sentiment. You can find the full project here.
How is this different from normal data?
Handling text is different from handling ordinary numeric data: we first have to convert the text to numbers, and the inputs have variable length, since every tweet has a different number of words. Luckily, Python packages like NLTK make processing textual data really easy.
Process and coding
The process of predicting sentiment is much the same as for other data problems. The only difference is in the preprocessing step, where we have to convert our data from text/string format to numbers so that the model can work with it. Let's start coding.
import pandas as pd

# Load the labelled tweets and count how many fall into each sentiment class
data = pd.read_csv("Sentiment.csv")
data.sentiment.value_counts()
## Negative    8493
## Neutral     3142
## Positive    2236
## Name: sentiment, dtype: int64
We can see that our data has far more negative labels than neutral or positive ones. Let's preprocess the data.
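Because the classes are imbalanced, it helps to know what accuracy we would get by always predicting the most common label. A minimal sketch, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {"Negative": 8493, "Neutral": 3142, "Positive": 2236}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the most frequent class
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 3))
# Negative 0.612
```

Any classifier we train should beat roughly 61% accuracy before we can say it has learned more than the class balance.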
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Needs NLTK's stopwords corpus and tokenizer data (installable with nltk.download())
tweets = []
stopwords_set = set(stopwords.words("english"))
porterStemmer = PorterStemmer()
for index, row in data.iterrows():
    # lowercase the sentence
    sent = row.text.lower()
    # remove links, hashtags, mentions, stopwords and words of three characters or fewer
    words_cleaned = ' '.join([word for word in sent.split()
                              if 'http' not in word
                              and not word.startswith('#')
                              and not word.startswith('@')
                              and word not in stopwords_set
                              and len(word) > 3])
    # remove punctuation and stem each word
    words_cleaned = [porterStemmer.stem(word) for word in word_tokenize(words_cleaned) if word.isalpha()]
    # join the words back into a sentence
    words_joined = ' '.join(words_cleaned)
    tweets.append((words_joined, row.sentiment))
tweets_dataframe = pd.DataFrame(tweets, columns=["text", 'sentiment'])
tweets_dataframe.head()
This preprocesses the data. We start by lowercasing each tweet. Then we remove links, hashtags, mentions (tags), stopwords, and words of three characters or fewer, since they add little to the model. Next we remove punctuation and stem the words. Stemming strips tense and plural endings from a word, reducing it to its root form: for example, like, likes and liked all reduce to the same stem, 'like'. Finally, we join the words back together to make a complete processed sentence.
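To see the filtering on a single tweet without installing NLTK, here is a rough standard-library sketch. The small stopword set, the `clean_tweet` helper and the sample tweet are made up for illustration. Punctuation is stripped by hand rather than with word_tokenize, and stemming is left out, so the result only approximates what the code above produces.

```python
# Tiny illustrative stopword set (an assumption; the post uses NLTK's English list)
STOPWORDS = {"this", "that", "with", "have", "from", "were", "about"}

def clean_tweet(text, stopwords=STOPWORDS, min_len=4):
    """Lowercase a tweet and keep only the informative alphabetic words."""
    kept = []
    for word in text.lower().split():
        # Drop links, hashtags and mentions
        if "http" in word or word.startswith(("#", "@")):
            continue
        # Drop stopwords and very short words
        if word in stopwords or len(word) < min_len:
            continue
        # Strip punctuation stuck to the word, then keep purely alphabetic tokens
        word = word.strip(".,!?;:\"'()")
        if word.isalpha():
            kept.append(word)
    return " ".join(kept)

sample = "@JohnDoe I really LOVED the debate tonight! #GOPDebate http://t.co/abc"
print(clean_tweet(sample))
# really loved debate tonight
```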
Now that we have processed the sentences, we can convert them to numbers. To do that we can use vectorizers. A vectorizer converts each sentence into an array of numbers. For example, take three sentences: 'i am good', 'he is a good person', 'i am a good person'. CountVectorizer (one type of vectorizer) counts how many times each vocabulary word occurs in a sentence and stores those counts in an array, one column per word:
|sentence|i|am|good|he|is|a|person|
|i am good|1|1|1|0|0|0|0|
|he is a good person|0|0|1|1|1|1|1|
|i am a good person|1|1|1|0|0|1|1|
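The counts in the table above can be reproduced with a few lines of plain Python. This is a simplified sketch of the idea, not scikit-learn's implementation. The `count_vectorize` helper is hypothetical and keeps words in first-seen order so that the columns match the table.

```python
def count_vectorize(sentences):
    """Return (vocabulary, rows) where each row holds per-word counts."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)  # keep first-seen order

    rows = []
    for sentence in sentences:
        words = sentence.split()
        rows.append([words.count(term) for term in vocabulary])
    return vocabulary, rows

sentences = ["i am good", "he is a good person", "i am a good person"]
vocabulary, rows = count_vectorize(sentences)
print(vocabulary)
for row in rows:
    print(row)
# ['i', 'am', 'good', 'he', 'is', 'a', 'person']
# [1, 1, 1, 0, 0, 0, 0]
# [0, 0, 1, 1, 1, 1, 1]
# [1, 1, 1, 0, 0, 1, 1]
```

Every sentence becomes a vector of the same length regardless of how many words it has, which is how vectorizing deals with variable-length text. Note that scikit-learn's CountVectorizer orders its columns alphabetically and, with default settings, ignores one-character tokens such as 'i' and 'a', so its output for these sentences has fewer columns.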
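We can use either a count vectorizer or a TF-IDF vectorizer. A TF-IDF vectorizer scales each word's count by how rare that word is across the whole dataset, so words that appear in nearly every tweet carry less weight. Because it takes the whole of the data into consideration, I consider it the better choice.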
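To make the weighting concrete, here is a sketch of the plain textbook formula, term count times log(N / document frequency), applied to three short example sentences. The `tf_idf` helper is hypothetical. scikit-learn's TfidfVectorizer uses a smoothed variant and normalizes each row, so its numbers will differ from these.

```python
import math

def tf_idf(sentences):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    docs = [sentence.split() for sentence in sentences]
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)

    # Document frequency: in how many sentences does each word occur?
    doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

    weights = []
    for doc in docs:
        weights.append({
            term: doc.count(term) * math.log(n_docs / doc_freq[term])
            for term in set(doc)
        })
    return weights

sentences = ["i am good", "he is a good person", "i am a good person"]
weights = tf_idf(sentences)
print(round(weights[1]["good"], 3))  # 'good' is in every sentence
print(round(weights[1]["he"], 3))    # 'he' is in only one sentence
# 0.0
# 1.099
```

A word that occurs in every sentence gets a weight of zero, while a word unique to one sentence gets the largest weight.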
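from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Option 1: raw word counts
cv = CountVectorizer()
X = cv.fit_transform(tweets_dataframe.text).toarray()
# Option 2: TF-IDF weights (this assignment replaces the count features in X)
tfidfv = TfidfVectorizer()
X = tfidfv.fit_transform(tweets_dataframe.text).toarray()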
Only one of the two is needed: as written, the TF-IDF features replace the count features in X. After this, let's convert our targets.
from sklearn.preprocessing import LabelEncoder

# Map the sentiment strings to integer class ids
le = LabelEncoder()
Y = le.fit_transform(tweets_dataframe.sentiment)
print(le.classes_)
# ['Negative' 'Neutral' 'Positive']
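LabelEncoder sorts the distinct labels and replaces each one with its index in that sorted list. The idea can be sketched in plain Python (the sample labels and variable names here are made up):

```python
labels = ["Negative", "Positive", "Neutral", "Negative"]

# Sort the distinct labels so the mapping is deterministic
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}

encoded = [class_to_id[name] for name in labels]
decoded = [classes[idx] for idx in encoded]  # map the ids back to strings

print(classes)
print(encoded)
# ['Negative', 'Neutral', 'Positive']
# [0, 2, 1, 0]
```

So in the model's output, class 0 is Negative, class 1 is Neutral and class 2 is Positive.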
Now create the test-train split.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15)
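Conceptually, the split shuffles the row indices and sets aside 15% of them for testing. A rough standard-library sketch, assuming the test share is rounded up and using a fixed seed (the `shuffled_split` helper and the seed value are hypothetical, and 13871 is the total number of labelled tweets counted earlier):

```python
import math
import random

def shuffled_split(n_rows, test_size=0.15, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(13871)
print(len(train_idx), len(test_idx))
# 11790 2081
```

With scikit-learn's train_test_split, passing random_state makes the split reproducible, and passing stratify=Y keeps the class proportions the same in both parts, which can help with imbalanced labels like these.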
Now we can fit the data. Let's first create a logistic regression model, and then we will create a model in Keras.
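from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression()
clf.fit(x_train, y_train)

# Accuracy on the data the model was trained on
y_train_pred = clf.predict(x_train)
print("Train accuracy = ", accuracy_score(y_train, y_train_pred))

# Accuracy on the held-out test data
y_test_pred = clf.predict(x_test)
print("Test accuracy = ", accuracy_score(y_test, y_test_pred))
# Train accuracy = 0.7368956743002545
# Test accuracy = 0.6708313310908217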
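Accuracy here is simply the fraction of predictions that match the true labels. A minimal sketch of that calculation (the `accuracy` helper and the labels are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 0 = Negative, 1 = Neutral, 2 = Positive
y_true = [0, 0, 2, 1, 0]
y_pred = [0, 0, 0, 1, 2]
print(accuracy(y_true, y_pred))
# 0.6
```

The gap between the train accuracy (about 0.74) and the test accuracy (about 0.67) reported above shows how much worse the model does on tweets it has not seen.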
Now let's create a model in Keras.
import tensorflow as tf

inshape = X.shape[-1]          # number of input features (vocabulary size)
outshape = len(le.classes_)    # number of sentiment classes

def get_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(4192, input_shape=(inshape,), activation='relu'))
    model.add(tf.keras.layers.BatchNormalization(momentum=0.2))
    # only the first layer needs input_shape
    model.add(tf.keras.layers.Dense(2096, activation='relu'))
    model.add(tf.keras.layers.BatchNormalization(momentum=0.2))
    model.add(tf.keras.layers.Dense(512, activation='relu'))
    model.add(tf.keras.layers.BatchNormalization(momentum=0.2))
    model.add(tf.keras.layers.Dense(128, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.2))
    # one output probability per sentiment class
    model.add(tf.keras.layers.Dense(outshape, activation='softmax'))
    return model

model = get_model()
model.summary()
model.compile('adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))
# loss: 0.2261 - acc: 0.9025 - val_loss: 1.9789 - val_acc: 0.6463
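The last layer uses softmax to turn the network's raw scores into class probabilities, and the sparse categorical cross-entropy loss penalizes a low probability on the true class, given as an integer id like those from LabelEncoder. A small standard-library sketch of both calculations for one example (the scores are made up, and this is not Keras's implementation):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_class, probabilities):
    """Negative log-probability assigned to the true integer class."""
    return -math.log(probabilities[true_class])

# Made-up scores for the classes [Negative, Neutral, Positive]
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(round(sparse_categorical_crossentropy(0, probs), 3))
# [0.659, 0.242, 0.099]
# 0.417
```

The loss approaches zero as the probability on the true class approaches 1, and grows quickly when the model is confidently wrong.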
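So this is how we handle textual data and fit a model on it. Note that the Keras model reaches about 90% training accuracy but only about 65% validation accuracy, slightly below the logistic regression's 67% test accuracy, which suggests the larger network is overfitting. Thank you for reading this post.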