Commit 09b79353 authored by Daniel Sammon

day 1

parent 734ebcf3
%% Cell type:markdown id:055cef73 tags:
## Libraries
%% Cell type:code id:deaad94c tags:
``` python
#!pip install scikit-learn
#!pip install datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datasets import load_dataset
```
%% Cell type:markdown id:16fa8a9b tags:
# Given logistic regression model
%% Cell type:markdown id:4f79f0c2 tags:
* The original model is a baseline sentiment analysis system that uses logistic regression.
* The model was trained on 80% of the 25,000-review IMDB training split and tested on the remaining 20%.
* The model obtained an accuracy of about 77% (0.7708) on the held-out 20%.
%% Cell type:code id:acc60530 tags:
``` python
raw_datasets = load_dataset("imdb")
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
%% Cell type:code id:cc80fe9e tags:
``` python
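# Shuffle the 25,000-review training split with a fixed seed for reproducibility
train_dataset = raw_datasets['train'].shuffle(seed=42).select(range(25000))
train_data = []
train_data_labels = []
# Collect the review texts and their sentiment labels into parallel lists
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))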
```
%% Output
Loading cached shuffled indices for dataset at C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow
25000
%% Cell type:code id:76638426 tags:
``` python
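# Bag-of-words features: counts of the 200 most frequent lowercased words
vectorizer = CountVectorizer(analyzer='word', max_features=200, lowercase=True)
print(train_data[0])
features = vectorizer.fit_transform(train_data)
features_nd = features.toarray()  # dense array of shape (n_reviews, n_features)
print(len(features_nd))     # number of reviews
print(len(features_nd[0]))  # number of features per review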
```
%% Output
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
25000
200
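%% Cell type:markdown tags:
To see what the `max_features` cap does, here is a minimal sketch on a made-up four-sentence corpus. The sentences and the cap of 3 are illustrative assumptions, not taken from the IMDB data. `CountVectorizer` keeps only the terms with the highest total count across the corpus, so every text becomes a count vector of the same fixed length.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely illustrative (not from the IMDB dataset).
toy_corpus = [
    "good film good acting",
    "bad film bad plot",
    "good film",
    "bad film",
]

# Keep only the 3 words with the highest total count across the corpus.
toy_vectorizer = CountVectorizer(analyzer='word', max_features=3, lowercase=True)
toy_features = toy_vectorizer.fit_transform(toy_corpus).toarray()

# The retained vocabulary, in the column order used by the feature matrix.
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)
print(toy_features.shape)
print(toy_features[0])
```

Words outside the retained vocabulary ("acting" and "plot" here) are simply dropped, which is why a small cap such as 200 can discard a lot of sentiment-bearing words.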
%% Cell type:code id:d5f5111c tags:
``` python
X_train, X_test, y_train, y_test = train_test_split(
    features_nd, train_data_labels, train_size=0.8, random_state=123
)
```
%% Cell type:code id:810eaa1e tags:
``` python
log_model = LogisticRegression()
```
%% Cell type:code id:726eed37 tags:
``` python
log_model = log_model.fit(X=X_train,y=y_train)
```
%% Cell type:code id:e5fd085c tags:
``` python
y_pred = log_model.predict(X_test)
```
%% Cell type:code id:c8d9255b tags:
``` python
print(accuracy_score(y_test,y_pred))
```
%% Output
0.7708
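%% Cell type:markdown tags:
The accuracy printed above is the fraction of test reviews whose predicted label matches the true label. A minimal sketch on four made-up labels (the values are assumptions for illustration only) shows that `accuracy_score` agrees with a manual calculation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
toy_true = np.array([1, 0, 1, 1])
toy_pred = np.array([1, 0, 0, 1])

# Accuracy = number of correct predictions / number of predictions.
manual_accuracy = float(np.mean(toy_true == toy_pred))
library_accuracy = float(accuracy_score(toy_true, toy_pred))

print(manual_accuracy, library_accuracy)
```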
%% Cell type:markdown id:e8e426cc tags:
# My logistic regression model
%% Cell type:markdown id:f385cabc tags:
* My main goal is to improve on the accuracy of the original model.
* I must implement logistic regression.
* As well as this, I must demonstrate two ideas that I believe will aid this improvement.
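
The notebook does not yet say which two ideas will be tried. As one possible sketch, assume the ideas were (1) TF-IDF weighting over unigrams and bigrams instead of raw counts, and (2) a larger vocabulary than 200 words together with a higher `max_iter`. The toy reviews, the parameter values, and the choice of ideas are all assumptions, and whether they beat the baseline would have to be measured on the IMDB split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews for illustration; the real input would be the IMDB texts.
toy_texts = [
    "a really good film with great acting",
    "great plot and good characters",
    "good fun from start to finish",
    "a boring film with bad acting",
    "bad plot and dull characters",
    "boring from start to finish",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Assumed idea 1: TF-IDF over unigrams and bigrams.
# Assumed idea 2: a larger vocabulary cap and more solver iterations.
candidate_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=5000, lowercase=True),
    LogisticRegression(max_iter=1000),
)
candidate_model.fit(toy_texts, toy_labels)

toy_predictions = candidate_model.predict(["really good film", "boring plot"])
print(toy_predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned only from the data passed to `fit`, which keeps test reviews out of the feature extraction step.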
%% Cell type:code id:9a2d5e26 tags:
``` python
df = load_dataset("imdb")
df
```
%% Output
Reusing dataset imdb (C:\Users\User\.cache\huggingface\datasets\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})
%% Cell type:code id:07bff52f tags:
``` python
df["unsupervised"]["label"][45000]
```
%% Output
-1
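%% Cell type:markdown tags:
The `-1` above suggests that this row of the `unsupervised` split carries no sentiment label, so such rows cannot be used directly to train or score a classifier. A minimal sketch of filtering them out, using made-up records in place of the real split (the records and the helper name `keep_labelled` are hypothetical):

```python
# Made-up records standing in for dataset rows; not real IMDB reviews.
toy_records = [
    {"text": "great film", "label": 1},
    {"text": "review without a label", "label": -1},
    {"text": "dull film", "label": 0},
]


def keep_labelled(records):
    """Return only the records whose label is 0 or 1."""
    return [record for record in records if record["label"] in (0, 1)]


labelled = keep_labelled(toy_records)
print(len(labelled))
```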