Commit 2cb5a7bd authored by Daniel Sammon

Creating two separate folders, as part two will be in Google Colab

parent 5ce04111
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
%% Cell type:markdown id:055cef73 tags:
## Libraries
%% Cell type:code id:deaad94c tags:
``` python
!pip install sklearn
!pip install datasets
import pandas as pd
import numpy as np
import string
import re
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata
from textblob import TextBlob,Word
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
```
%% Output
Requirement already satisfied: sklearn in c:\users\user\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: scikit-learn in c:\users\user\anaconda3\lib\site-packages (from sklearn) (1.0)
Requirement already satisfied: joblib>=0.11 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (1.20.1)
Requirement already satisfied: scipy>=1.1.0 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (1.6.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->sklearn) (2.1.0)
Requirement already satisfied: datasets in c:\users\user\anaconda3\lib\site-packages (1.18.4)
Requirement already satisfied: huggingface-hub<1.0.0,>=0.1.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.4.0)
Requirement already satisfied: numpy>=1.17 in c:\users\user\anaconda3\lib\site-packages (from datasets) (1.20.1)
Requirement already satisfied: responses<0.19 in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.18.0)
Requirement already satisfied: requests>=2.19.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (2.27.1)
Requirement already satisfied: pandas in c:\users\user\anaconda3\lib\site-packages (from datasets) (1.2.4)
Requirement already satisfied: multiprocess in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.70.12.2)
Requirement already satisfied: aiohttp in c:\users\user\anaconda3\lib\site-packages (from datasets) (3.8.1)
Requirement already satisfied: tqdm>=4.62.1 in c:\users\user\anaconda3\lib\site-packages (from datasets) (4.63.0)
Requirement already satisfied: dill in c:\users\user\anaconda3\lib\site-packages (from datasets) (0.3.4)
Requirement already satisfied: packaging in c:\users\user\anaconda3\lib\site-packages (from datasets) (20.9)
Requirement already satisfied: pyarrow!=4.0.0,>=3.0.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (7.0.0)
Requirement already satisfied: xxhash in c:\users\user\anaconda3\lib\site-packages (from datasets) (3.0.0)
Requirement already satisfied: fsspec[http]>=2021.05.0 in c:\users\user\anaconda3\lib\site-packages (from datasets) (2022.2.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\user\anaconda3\lib\site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.7.4.3)
Requirement already satisfied: pyyaml in c:\users\user\anaconda3\lib\site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (5.4.1)
Requirement already satisfied: filelock in c:\users\user\anaconda3\lib\site-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\user\anaconda3\lib\site-packages (from packaging->datasets) (2.4.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2.0.10)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (1.26.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2020.12.5)
Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm>=4.62.1->datasets) (0.4.4)
Requirement already satisfied: multidict<7.0,>=4.5 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (6.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (1.7.2)
Requirement already satisfied: attrs>=17.3.0 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (20.3.0)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (4.0.2)
Requirement already satisfied: aiosignal>=1.1.2 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (1.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in c:\users\user\anaconda3\lib\site-packages (from aiohttp->datasets) (1.3.0)
Requirement already satisfied: pytz>=2017.3 in c:\users\user\anaconda3\lib\site-packages (from pandas->datasets) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\user\anaconda3\lib\site-packages (from pandas->datasets) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
%% Cell type:markdown id:16fa8a9b tags:
# Given logistic regression model
%% Cell type:markdown id:4f79f0c2 tags:
* The original model is a baseline sentiment analysis system that uses logistic regression.
* The model was trained on 80% of the dataset and tested on the remaining 20%.
* The model obtained an accuracy score of 77%.
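
As a rough illustration of the evaluation protocol described above, the sketch below shuffles a toy dataset, holds out 20% of it for testing and computes accuracy as the fraction of matching labels. The names `split_80_20` and `accuracy` are hypothetical stand-ins for scikit-learn's `train_test_split` and `accuracy_score`, which the notebook itself uses; the toy data is made up.

```python
import random

def split_80_20(items, labels, seed=123):
    """Shuffle indices with a fixed seed, then split them 80/20."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) * 8 // 10
    train = [(items[i], labels[i]) for i in indices[:cut]]
    test = [(items[i], labels[i]) for i in indices[cut:]]
    return train, test

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy data: ten placeholder "reviews" with alternating labels.
reviews = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

train, test = split_80_20(reviews, labels)
print(len(train), len(test))                   # sizes of the two partitions
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))    # three of four predictions match
```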
%% Cell type:code id:acc60530 tags:
``` python
raw_datasets = load_dataset("imdb")
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
%% Cell type:code id:cc80fe9e tags:
``` python
train_dataset = raw_datasets['train'].shuffle(seed=42).select(range(25000))
train_data = []
train_data_labels = []
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))
```
%% Output
Loading cached shuffled indices for dataset at C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow
25000
%% Cell type:code id:76638426 tags:
``` python
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
print(train_data[0])
features = vectorizer.fit_transform(train_data)
features_nd = features.toarray()
print(len(features_nd))
print(len(features_nd[0]))
```
%% Output
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
25000
200
%% Cell type:code id:d5f5111c tags:
``` python
X_train, X_test, y_train, y_test = train_test_split(features_nd,train_data_labels,train_size=0.8,random_state=123)
```
%% Cell type:code id:810eaa1e tags:
``` python
log_model = LogisticRegression()
```
%% Cell type:code id:726eed37 tags:
``` python
log_model = log_model.fit(X=X_train,y=y_train)
```
%% Cell type:code id:e5fd085c tags:
``` python
y_pred = log_model.predict(X_test)
```
%% Cell type:code id:c8d9255b tags:
``` python
print(accuracy_score(y_test,y_pred))
```
%% Output
0.7708
%% Cell type:markdown id:e8e426cc tags:
# My logistic regression model
%% Cell type:markdown id:f385cabc tags:
* My main goal is to improve the accuracy of the original model
* I must implement logistic regression
* As well as this, I must demonstrate two ideas that I believe will aid this improvement
%% Cell type:code id:9a2d5e26 tags:
``` python
# Loading in the data as a dictionary
dic = load_dataset("imdb")
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
%% Cell type:code id:be7cf4a7 tags:
``` python
#Analysing the dictionary
dic["train"][0]["text"]
```
%% Output
'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.'
%% Cell type:code id:539ee0fb tags:
``` python
# Reading the dictionary into two separate lists
train_dataset = dic['train'].shuffle(seed=42).select(range(25000))
train_data = []
train_data_labels = []
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))
```
%% Output
Loading cached shuffled indices for dataset at C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow
25000
%% Cell type:code id:90b42be7 tags:
``` python
# Converting the lists to a single dataframe
df = pd.DataFrame(
    {'text': train_data,
     'label': train_data_labels
    })
df.tail()
```
%% Output
text label
24995 The ultimate goal of Big Brother, that we know... 0
24996 After mob boss Vic Moretti (late great Anthony... 0
24997 Anyone who has said that it's better than Host... 0
24998 What do you get if you cross The Matrix with T... 1
24999 I remember watching this movie several times a... 0
%% Cell type:markdown id:d3ce4594 tags:
### Sentiment analysis pipeline
%% Cell type:markdown id:c70ef93c tags:
**Remove punctuation**\
The function below removes every character in `string.punctuation` (!, ?, * etc.) from the text column.
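
To see what this step does to a review, the sketch below applies the same `str.translate` approach to a made-up sample string. The sample and the optional `strip_breaks` helper are assumptions for illustration and are not part of the notebook.

```python
import re
import string

def remove_punctuations(text):
    # Same approach as the notebook: delete every character in string.punctuation.
    translator = str.maketrans('', '', string.punctuation)
    return str(text).translate(translator)

def strip_breaks(text):
    # Hypothetical extra step: replace HTML line breaks with a space first.
    return re.sub(r'<br\s*/?>', ' ', text)

sample = "Great film!<br /><br />Don't miss it..."

cleaned = remove_punctuations(sample)
print(cleaned)           # the "<br />" tags collapse into stray "br" tokens

cleaned_no_breaks = remove_punctuations(strip_breaks(sample))
print(cleaned_no_breaks.split())
```

Because `<`, `/` and `>` are punctuation, each `<br />` tag is reduced to the letters `br`, which is where tokens such as `Showbr` and `br` in the later outputs come from. Stripping the tags beforehand is one possible variation.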
%% Cell type:code id:4d10f674 tags:
``` python
def remove_punctuations(text):
    translator = str.maketrans('', '', string.punctuation)
    word = str(text).translate(translator)
    return word
df['text'] = df['text'].apply(remove_punctuations)
```
%% Cell type:markdown id:5b902869 tags:
**Removing numbers**\
Numbers in general do not add much to the sentiment so removing them is okay.
%% Cell type:code id:152bd046 tags:
``` python
def remove_numbers(text):
    pat = r'[0-9]'
    nltk_cleaned = re.sub(pat,'',text)
    return nltk_cleaned
df['text'] = df.apply(lambda x: remove_numbers(x['text']),axis =1)
```
%% Cell type:markdown id:8bd6b7ad tags:
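**Removing accented characters**\
The function below normalises the text to NFKD form and drops every non-ASCII character, so accented letters are reduced to their base letter. I noticed that removing these special characters improved the performance.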
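
A small check of what the normalisation step does to sample strings. The function body mirrors the notebook's `remove_accented_chars`; the sample strings are made up for illustration.

```python
import unicodedata

def remove_accented_chars(text):
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining mark.
    normalized = unicodedata.normalize('NFKD', text)
    return normalized.encode('ascii', 'ignore').decode('utf-8', 'ignore')

accented = remove_accented_chars("Amélie is a naïve café regular")
plain = remove_accented_chars("so~so")
print(accented)
print(plain)
```

Plain ASCII symbols such as `~` pass through this function unchanged; in this pipeline they are already removed by the punctuation step.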
%% Cell type:code id:856ce2fe tags:
``` python
def remove_accented_chars(text):
    nltk_accented = unicodedata.normalize('NFKD',text).encode('ascii','ignore').decode('utf-8','ignore')
    return nltk_accented
df['text'] = df.apply(lambda x: remove_accented_chars(x['text']),axis =1)
```
%% Cell type:code id:2d20b5d8 tags:
``` python
def split_cell(text):
    sentence = text.split()
    return sentence
df['text'] = df['text'].apply(split_cell)
df["text"]
```
%% Output
0 [There, is, no, relation, at, all, between, Fo...
1 [This, movie, is, a, great, The, plot, is, ver...
2 [George, P, Cosmatos, Rambo, First, Blood, Par...
3 [In, the, process, of, trying, to, establish, ...
4 [Yeh, I, know, youre, quivering, with, excitem...
...
24995 [The, ultimate, goal, of, Big, Brother, that, ...
24996 [After, mob, boss, Vic, Moretti, late, great, ...
24997 [Anyone, who, has, said, that, its, better, th...
24998 [What, do, you, get, if, you, cross, The, Matr...
24999 [I, remember, watching, this, movie, several, ...
Name: text, Length: 25000, dtype: object
%% Cell type:markdown id:4e4e0155 tags:
**Tokenization**\
Tokenization breaks the raw text into smaller units called tokens, such as words or sentences. These tokens help in understanding the context of the text.\
The NLTK tokenization function below is commented out, as we obtained better results using vectorization instead.
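
The difference between plain whitespace splitting (what `split_cell` does) and a tokenizer that separates punctuation can be sketched as follows. `regex_tokens` is a hypothetical, simplified stand-in for a real tokenizer such as `nltk.word_tokenize`, not a reimplementation of it.

```python
import re

def whitespace_tokens(text):
    # What the notebook's split_cell does: split on runs of whitespace.
    return text.split()

def regex_tokens(text):
    # Hypothetical alternative: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The acting is not superficial at all..."
print(whitespace_tokens(sentence))   # trailing dots stay glued to "all"
print(regex_tokens(sentence))        # each dot becomes its own token
```

Since punctuation has already been removed by the time the text is split, the two approaches give similar tokens in this pipeline.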
%% Cell type:code id:a609201f tags:
``` python
#def get_tokens(text):
# nltk_tokens = nltk.word_tokenize(str(text))
# return nltk_tokens
#df['text'] = df['text'].apply(get_tokens)
#df["text"]
```
%% Cell type:markdown id:94397165 tags:
**Removal of stop words**\
Stop words (such as "the", "is" and "at") occur in abundance in any human language. By removing them, we strip low-information words from the text and give more weight to the informative ones.
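
A minimal sketch of the filtering step, using a tiny hand-picked stop-word set instead of the NLTK list so that it runs without any downloads. The set `STOP_WORDS` and the sample sentence are assumptions for illustration.

```python
# Tiny hand-picked stop-word set for illustration only; the notebook uses
# the much larger list from nltk.corpus.stopwords.words('english').
STOP_WORDS = {"the", "is", "at", "all", "a", "of", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [word for word in tokens if word.lower() not in stop_words]

tokens = "The acting is not superficial at all".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

The toy set above deliberately keeps "not". The NLTK English list also contains negations such as "no" and "not", so the real pipeline removes them, which may matter for sentiment.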
%% Cell type:code id:42c7ad12 tags:
``` python
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    clean_text = [word for word in text if word.lower() not in stop_words]
    return clean_text
df['text'] = df.apply(lambda x: remove_stopwords(x['text']),axis =1)
df["text"]
```
%% Output
0 [relation, Fortier, Profiler, fact, police, se...
1 [movie, great, plot, true, book, classic, writ...
2 [George, P, Cosmatos, Rambo, First, Blood, Par...
3 [process, trying, establish, audiences, empath...
4 [Yeh, know, youre, quivering, excitement, Well...
...
24995 [ultimate, goal, Big, Brother, know, think, th...
24996 [mob, boss, Vic, Moretti, late, great, Anthony...
24997 [Anyone, said, better, Hostel, talking, comple...
24998 [get, cross, Matrix, Truman, Showbr, br, Im, s...
24999 [remember, watching, movie, several, times, yo...
Name: text, Length: 25000, dtype: object
%% Cell type:markdown id:2e059907 tags:
**Lemmatization**\
The goal of lemmatization is similar to that of stemming, but stemming can strip a word down to a form that loses its actual meaning. Lemmatization uses a vocabulary and morphological analysis of words to return the dictionary form of a word, known as the lemma. The function below lemmatizes every token as a verb (`"v"`).
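
The contrast between the two ideas can be sketched with toy versions of each. Both functions below are deliberately crude illustrations: the `LEMMAS` table is a hypothetical stand-in for a real lexicon such as WordNet, and `toy_stem` is not a real stemming algorithm.

```python
# Toy lookup table standing in for a real lexicon such as WordNet (assumption).
LEMMAS = {"was": "be", "is": "be", "watching": "watch", "studies": "study"}

def toy_stem(word):
    # Crude suffix stripping: no dictionary, so results may not be real words.
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary lookup: unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["was", "studies", "watching"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer turns "was" into "wa" and "studies" into "stud", while the lookup returns "be" and "study", which is the kind of difference the paragraph above describes.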
%% Cell type:code id:76878fc7 tags:
``` python
def get_lem(text):
    lem_word = [Word(word).lemmatize("v") for word in text]
    return lem_word
df['text'] = df.apply(lambda x: get_lem(x['text']),axis =1)
```
%% Cell type:markdown id:8002e837 tags:
The vectorizer expects each document as a single string, but the text column now holds lists of tokens, so the function below joins each list back into a string.
%% Cell type:code id:05f1c69f tags:
``` python
def list_to_string(text):
    listToStr = ' '.join([str(elem) for elem in text])
    return listToStr
df['text'] = df.apply(lambda x: list_to_string(x['text']),axis =1)
```
%% Cell type:markdown id:3ce38ad2 tags:
**Count vectorization**\
`CountVectorizer` is one of the simplest techniques used in sentiment analysis and gives reasonably good results. It counts the number of times each word appears in a given document and uses this count as the feature weight. It also tokenizes the text documents and builds a vocabulary of words; here `max_features=200` keeps only the 200 most frequent words.
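
The idea can be sketched in plain Python with `collections.Counter`. This is a simplified illustration, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical names, tokens are split on whitespace only, and details such as the real token pattern and tie-breaking are ignored.

```python
from collections import Counter

def fit_vocabulary(documents, max_features):
    # Keep the max_features most frequent lowercase words across all documents.
    totals = Counter(word for doc in documents for word in doc.lower().split())
    most_common = [word for word, _ in totals.most_common(max_features)]
    return sorted(most_common)

def transform(documents, vocabulary):
    # One row per document, one column per vocabulary word; entries are counts.
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows

docs = ["great plot great cast", "bad plot", "great fun"]
vocab = fit_vocabulary(docs, max_features=2)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Words outside the kept vocabulary ("cast", "bad", "fun" here) contribute nothing to the feature matrix, which is the trade-off made by limiting `max_features`.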
%% Cell type:code id:c6de1fd5 tags:
``` python
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(df["text"])
features_nd = features.toarray()
labels = df["label"].tolist()
print(len(features_nd))
print(len(features_nd[0]))
```
%% Output
25000
200
%% Cell type:code id:6bd971d9 tags:
``` python
X_train, X_test, y_train, y_test = train_test_split(features_nd, labels, train_size=0.8, random_state=123)
log_model = LogisticRegression()
log_model = log_model.fit(X=X_train,y=y_train)
y_pred = log_model.predict(X_test)
print("This is the accuracy score: ", accuracy_score(y_test,y_pred))
```
%% Output
This is the accuracy score: 0.7938
%% Cell type:markdown id:70c56d01 tags:
### Experiment observations
* We were given the task of implementing two ideas in an attempt to improve the original logistic regression model
* I wrote many functions that I believed would contribute to improved results
* I ran many experiments but ultimately the above pipeline and functions gave me the best results
* The accuracy varied from 0.7726 to 0.7938
* My original idea was to remove stop words and punctuation but this only achieved an accuracy score of 0.7786
* After adding the additional functions I achieved an accuracy of 0.7938
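
To put these scores in perspective, the arithmetic below converts the baseline accuracy (0.7708) and the best accuracy (0.7938) into counts of correctly classified test reviews, and adds a rough standard error for a single accuracy estimate. The standard error uses a simple binomial approximation that assumes independent reviews; it is a back-of-the-envelope figure, not a significance test.

```python
import math

n_total = 25000
n_train = n_total * 8 // 10        # train_size=0.8
n_test = n_total - n_train         # reviews held out for testing

baseline_acc, improved_acc = 0.7708, 0.7938

baseline_correct = round(baseline_acc * n_test)
improved_correct = round(improved_acc * n_test)
extra_correct = improved_correct - baseline_correct

# Rough binomial standard error of one accuracy estimate (assumes independent reviews).
std_err = math.sqrt(improved_acc * (1 - improved_acc) / n_test)

print(n_test, baseline_correct, improved_correct, extra_correct)
print(round(std_err, 4))
```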
%% Cell type:code id:88398265 tags:
``` python
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package,
            # not just imported function
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name
imports = list(set(get_imports()))
# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))
for r in requirements:
    print("{}=={}".format(*r))
```
%% Output
textblob==0.17.1
seaborn==0.11.1
scikit-learn==1.0
pandas==1.2.4
numpy==1.20.1
nltk==3.6.1
matplotlib==3.5.1
%% Cell type:code id:5a6aec4f tags:
``` python
```
# Assignment 2
## Environment set up
```
git clone git@gitlab.computing.dcu.ie:sammond4/bert-neural-network.git
cd bert-neural-network
conda create --name bert --file requirements.txt
conda activate bert
jupyter notebook
```
## Part 1
For part 1 we were given a basic logistic regression model that was achieving an accuracy score of 77%. We were given the task of improving this score while still using logistic regression.
The code and comments are recorded in the Logistic regression notebook.
## Part 2
BERT is a neural network language model architecture introduced by Google in 2018 (Devlin
et al. 2018). When training a BERT model, the network is trained not to predict the next
token in a sequence but to predict a masked token as in a cloze test.
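
The cloze-style objective can be illustrated with a few lines of plain Python. This is only a sketch of how a masked training example might be built: the function `mask_tokens`, the whitespace tokenization and the masking probability used here are assumptions for illustration, and real BERT preprocessing works on subword tokens with a more involved replacement scheme.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK and record the hidden words."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token   # what the model would be asked to predict
        else:
            masked.append(token)
    return masked, targets

tokens = "the actors are really good and funny".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)    # the input the model sees
print(targets)   # position -> original token it must recover
```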
### Project status
In progress
### Contributors
Daniel Sammon\
daniel.sammon4@mail.dcu.ie\
18364071.
seaborn==0.11.1
scikit-learn==1.0
pandas==1.2.4
numpy==1.20.1
nltk==3.6.1
matplotlib==3.5.1
\ No newline at end of file