Commit cd7aa95d authored by Daniel Sammon 🏀

Part 1 done needs editing

parent 09b79353
%% Cell type:markdown id:055cef73 tags:
## Libraries
%% Cell type:code id:deaad94c tags:
``` python
# Install dependencies (the PyPI package is named scikit-learn, not sklearn)
!pip install scikit-learn
!pip install datasets
import pandas as pd
import numpy as np
import string
import re
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata
from textblob import TextBlob,Word
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
```
%% Output
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
%% Cell type:markdown id:16fa8a9b tags:
# Given logistic regression model
%% Cell type:markdown id:4f79f0c2 tags:
* The original model is a baseline sentiment analysis system that uses logistic regression.
* The model was trained on 80% of the 25,000 IMDb training reviews and tested on the remaining 20%.
* The model obtained an accuracy of 77%.
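
The baseline's input features are bag-of-words counts, which the notebook builds with `CountVectorizer`. As a rough, dependency-free sketch of that idea (not scikit-learn's actual implementation), the snippet below lowercases each review, extracts words of two or more characters, keeps the most frequent words as the vocabulary, and counts them per review. The helper `bag_of_words` and the toy reviews are hypothetical, and the handling of ties between equally frequent words may differ from scikit-learn.

```python
import re
from collections import Counter

# Words of two or more word characters, matching CountVectorizer's default token pattern
TOKEN = re.compile(r"\b\w\w+\b")

def bag_of_words(docs, max_features):
    """Return (vocabulary, count_rows) for a list of raw text documents."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    # Total frequency of every word across the whole corpus
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the max_features most frequent words, sorted alphabetically
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Hypothetical toy reviews, not taken from the IMDb dataset
reviews = [
    "Great film, great acting, great fun.",
    "A dull film with a dull plot.",
]
vocab, rows = bag_of_words(reviews, max_features=3)
print(vocab)  # ['dull', 'film', 'great']
print(rows)   # [[0, 1, 3], [2, 1, 0]]
```

Each row is one review and each column is one vocabulary word, so the notebook's setting of `max_features=200` gives 200 columns per review.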
%% Cell type:code id:acc60530 tags:
``` python
raw_datasets = load_dataset("imdb")
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
%% Cell type:code id:cc80fe9e tags:
``` python
# Shuffle the IMDb training split and take all 25,000 reviews
train_dataset = raw_datasets['train'].shuffle(seed=42).select(range(25000))
train_data = []
train_data_labels = []
# Collect review texts and labels into parallel lists
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))
```
%% Output
Loading cached shuffled indices for dataset at C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow
25000
%% Cell type:code id:76638426 tags:
``` python
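# Bag-of-words features: counts of the 200 most frequent lowercased words
vectorizer = CountVectorizer(analyzer='word', max_features=200, lowercase=True)
print(train_data[0])
features = vectorizer.fit_transform(train_data)
# Convert the sparse matrix to a dense array with one row per review
features_nd = features.toarray()
print(len(features_nd))
print(len(features_nd[0]))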
```
%% Output
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
25000
200
%% Cell type:code id:d5f5111c tags:
``` python
# Hold out 20% of the reviews for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(features_nd, train_data_labels, train_size=0.8, random_state=123)
```
%% Cell type:code id:810eaa1e tags:
``` python
log_model = LogisticRegression()
```
%% Cell type:code id:726eed37 tags:
``` python
log_model = log_model.fit(X=X_train,y=y_train)
```
%% Cell type:code id:e5fd085c tags:
``` python
y_pred = log_model.predict(X_test)
```
%% Cell type:code id:c8d9255b tags:
``` python
print(accuracy_score(y_test,y_pred))
```
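
For reference, the accuracy printed by the last cell is the fraction of test reviews whose predicted label matches the true label. A minimal sketch of that calculation, using a hypothetical helper `accuracy` and made-up toy labels rather than the notebook's predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical toy labels: 3 of the 5 predictions are correct
y_true_toy = [1, 0, 1, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1]
print(accuracy(y_true_toy, y_pred_toy))  # 0.6
```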