Commit eca7ff97 authored by Daniel Sammon

Adding part two and data for part two

parent dbc924e8
%% Cell type:markdown id:9fb73dd4 tags:
# Part One: Baseline sentiment analysis system
%% Cell type:markdown id:055cef73 tags:
## Libraries
### Libraries used throughout assignment
%% Cell type:code id:deaad94c tags:
``` python
# Note: `sklearn` on PyPI is a deprecated alias; the actual package is `scikit-learn`
!pip install sklearn
!pip install datasets
import pandas as pd
import numpy as np
import string
import re
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata
#!pip install textblob
from textblob import TextBlob,Word
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
```
%% Output
Requirement already satisfied: sklearn in c:\users\user\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: scikit-learn in c:\users\user\anaconda3\lib\site-packages (from sklearn) (1.0)
Requirement already satisfied: joblib>=0.11 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (1.20.1)
Requirement already satisfied: scipy>=1.1.0 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (1.6.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (2.1.0)
Requirement already satisfied: datasets in c:\users\user\anaconda3\lib\site-packages (1.18.4)
Requirement already satisfied: huggingface-hub<1.0.0,>=0.1.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.4.0)
Requirement already satisfied: numpy>=1.17 in c:\users\user\anaconda3\lib\site-packages (from datasets) (1.20.1)
Requirement already satisfied: responses<0.19 in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.18.0)
Requirement already satisfied: requests>=2.19.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (2.27.1)
Requirement already satisfied: pandas in c:\users\user\anaconda3\lib\site-packages (from datasets) (1.2.4)
Requirement already satisfied: multiprocess in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.70.12.2)
Requirement already satisfied: aiohttp in c:\users\user\anaconda3\lib\site-packages (from datasets) (3.8.1)
Requirement already satisfied: tqdm>=4.62.1 in c:\users\user\anaconda3\lib\site-packages (from datasets) (4.63.0)
Requirement already satisfied: dill in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.3.4)
Requirement already satisfied: packaging in c:\users\user\anaconda3\lib\site-packages (from datasets) (20.9)
Requirement already satisfied: pyarrow!=4.0.0,>=3.0.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (7.0.0)
Requirement already satisfied: xxhash in c:\users\user\anaconda3\lib\site-packages (from datasets) (3.0.0)
Requirement already satisfied: fsspec[http]>=2021.05.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (2022.2.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\user\anaconda3\lib\site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.7.4.3)
Requirement already satisfied: pyyaml in c:\users\user\anaconda3\lib\site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (5.4.1)
Requirement already satisfied: filelock in c:\users\user\anaconda3\lib\site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\user\anaconda3\lib\site-packages (from packaging->datasets) (2.4.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2.0.10)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (1.26.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2020.12.5)
Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm>=4.62.1->datasets) (0.4.4)
Requirement already satisfied: multidict<7.0,>=4.5 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (6.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (1.7.2)
Requirement already satisfied: attrs>=17.3.0 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (20.3.0)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (4.0.2)
Requirement already satisfied: aiosignal>=1.1.2 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (1.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (1.3.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\user\anaconda3\lib\site-packages (from pandas->datasets) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\user\anaconda3\lib\site-packages (from pandas->datasets) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
%% Cell type:markdown id:16fa8a9b tags:
# Given logistic regression model
%% Cell type:markdown id:4f79f0c2 tags:
* The original model is a baseline sentiment analysis system that uses logistic regression.
* The model was trained on 80% of the data (the 25,000-review IMDB training split) and tested on the remaining 20%.
* The model obtained an accuracy of about 77% (0.7708) on the test portion.
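
To make the split and the score concrete, the short sketch below turns them into review counts. It uses only figures printed by the cells that follow (25,000 reviews, `train_size=0.8`, accuracy 0.7708); the variable names are illustrative and not part of the notebook.

```python
# Figures taken from the notebook output: 25,000 reviews, train_size=0.8, accuracy 0.7708.
n_reviews = 25000
train_size = 0.8
accuracy = 0.7708

n_train = round(n_reviews * train_size)  # reviews used to fit the model
n_test = n_reviews - n_train             # held-out reviews used for scoring
n_correct = round(accuracy * n_test)     # accuracy = correct predictions / test reviews

print(n_train, n_test, n_correct)
```

In other words, the baseline classifies roughly 3,854 of the 5,000 held-out reviews correctly, which is the number any improved model needs to beat.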
%% Cell type:code id:acc60530 tags:
``` python
raw_datasets = load_dataset("imdb")
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
%% Cell type:code id:cc80fe9e tags:
``` python
train_dataset = raw_datasets['train'].shuffle(seed=42).select(range(25000))
train_data = []
train_data_labels = []
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))
```
%% Output
Loading cached shuffled indices for dataset at C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow
25000
%% Cell type:code id:76638426 tags:
``` python
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
print(train_data[0])
features = vectorizer.fit_transform(train_data)
features_nd = features.toarray()
print(len(features_nd))
print(len(features_nd[0]))
```
%% Output
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
25000
200
%% Cell type:code id:d5f5111c tags:
``` python
X_train, X_test, y_train, y_test = train_test_split(features_nd,train_data_labels,train_size=0.8,random_state=123)
```
%% Cell type:code id:810eaa1e tags:
``` python
log_model = LogisticRegression()
```
%% Cell type:code id:726eed37 tags:
``` python
log_model = log_model.fit(X=X_train,y=y_train)
```
%% Cell type:code id:e5fd085c tags:
``` python
y_pred = log_model.predict(X_test)
```
%% Cell type:code id:c8d9255b tags:
``` python
print(accuracy_score(y_test,y_pred))
```
%% Output
0.7708
%% Cell type:markdown id:e8e426cc tags:
# My logistic regression model
%% Cell type:markdown id:f385cabc tags:
* My main goal is to improve on the accuracy of the original model.
* The model must still be a logistic regression.
* In addition, I must demonstrate two ideas that I believe will help achieve this improvement.
%% Cell type:code id:9a2d5e26 tags:
``` python
# Loading in the data as a dictionary
dic = load_dataset("imdb")
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
%% Cell type:code id:be7cf4a7 tags:
``` python
#Analysing the dictionary
dic["train"][0]["text"]
```
%% Output
'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.'
%% Cell type:code id:539ee0fb tags:
``` python
# Reading the dictionary into two separate lists
train_dataset = dic['train'].shuffle(seed=42).select(range(25000))
train_data = []
train_data_labels = []
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))
```
%% Output
Loading cached shuffled indices for dataset at C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow
25000
%% Cell type:code id:90b42be7 tags:
``` python
# Converting the lists to a single dataframe
df = pd.DataFrame({
    'text': train_data,
    'label': train_data_labels
})
df.tail()
```
%% Output
text label
24995 The ultimate goal of Big Brother, that we know... 0
24996 After mob boss Vic Moretti (late great Anthony... 0
24997 Anyone who has said that it's better than Host... 0
24998 What do you get if you cross The Matrix with T... 1
24999 I remember watching this movie several times a... 0
%% Cell type:markdown id:d3ce4594 tags:
### Sentiment analysis pipeline
## 1) Removal of unwanted characters
%% Cell type:markdown id:c70ef93c tags:
**Removing punctuation**\
The function below removes punctuation characters such as `!`, `?` and `*` from the text column.

**Removing numbers**\
Numbers generally carry little sentiment, so they are removed as well.
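
As a rough illustration of these two steps (not the notebook's own function, which follows), the sketch below deletes ASCII punctuation and digit runs with the standard `re` and `string` modules. The helper name `remove_punctuation_and_numbers` and the example sentence are assumptions made for this example only.

```python
import re
import string

# Hypothetical helper for illustration; the notebook's own function may differ.
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
_DIGIT_RE = re.compile(r"\d+")


def remove_punctuation_and_numbers(text):
    """Delete ASCII punctuation and digit runs, then collapse extra whitespace."""
    text = _PUNCT_RE.sub("", text)   # drop characters such as ! ? * . ,
    text = _DIGIT_RE.sub("", text)   # drop numbers such as 1967
    return " ".join(text.split())    # tidy the gaps left behind


print(remove_punctuation_and_numbers("Released in 1967... was it any good?!"))
# prints: Released in was it any good
```

On the DataFrame built earlier, a helper like this could be applied with `df['text'].apply(remove_punctuation_and_numbers)`.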