Data model to understand the recovery phase in epilepsy patients

Doctoral thesis by Susana Lara (cand.)

Germany – Barcelona 2021

A data-driven approach

Data Scientist: Dr. Juan Ignacio Barrios, MD MSc

In [1]:
# Importing Python libraries needed for data analysis 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
import math
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
%matplotlib inline
In [2]:
# Reading the dataset and dropping the last column
dataset = pd.read_excel('Análisis Estadístico Datos Doctorado_Susana Lara -JB Version 2.xlsx').iloc[:,:-1]
In [3]:
dataset.shape
Out[3]:
(189, 47)
In [4]:
# Repair column names: forward-fill merged group headers across 'Unnamed' columns,
# then append the sub-header row and strip trailing separators
cols=[]
for col in dataset.columns:
    if col.startswith('Unnamed'):
        cols.append(cols[-1])
    else:
        cols.append(col.strip())
dataset.columns=cols
dataset.columns=[i+'--'+str(z).strip() for i,z in zip(dataset.columns,dataset.iloc[0,:].fillna('').values)]
dataset.columns=[col.strip('--') for col in dataset.columns]
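The header-repair logic above can be exercised on a toy frame that mimics the spreadsheet layout (the column names and values here are hypothetical, for illustration only):

```python
import pandas as pd

# Toy frame mimicking the spreadsheet: a merged group header spills into
# 'Unnamed: n' columns, with sub-headers sitting in the first data row
# (hypothetical names/values, not the patient data)
toy = pd.DataFrame([["MA", "OA", ""], [1, 0, 42]],
                   columns=["Ictal signs", "Unnamed: 1", "Age"])

# Forward-fill the group header across 'Unnamed' columns
cols = []
for col in toy.columns:
    cols.append(cols[-1] if col.startswith("Unnamed") else col.strip())
toy.columns = cols

# Join group header and sub-header, then strip trailing separators
toy.columns = [f"{c}--{str(s).strip()}" for c, s in zip(toy.columns, toy.iloc[0].fillna(""))]
toy.columns = [c.strip("--") for c in toy.columns]
toy = toy.drop(0).reset_index(drop=True)

print(list(toy.columns))  # ['Ictal signs--MA', 'Ictal signs--OA', 'Age']
```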
In [5]:
# Data cleaning: coerce Consciousness Time to numeric, drop the sub-header row and any rows with missing values
dataset['Consciousness Time']=pd.to_numeric(dataset['Consciousness Time'],errors='coerce')
dataset.drop(0,inplace=True)
dataset.dropna(inplace=True)
dataset.reset_index(inplace=True,drop=True)
dataset.columns 
Out[5]:
Index(['Nr', 'Location', 'Sex', 'Age', 'Age onset', 'Years with ES',
       'Seizure Type', 'Laterality', 'Behavior before', 'Same day ES before',
       'ES before', 'Ictal Seconds', 'Ictal signs and symtoms--MA',
       'Ictal signs and symtoms--OA', 'Ictal signs and symtoms--SMA',
       'Ictal signs and symtoms--Laughing',
       'Ictal signs and symtoms--Coughing', 'Ictal signs and symtoms--NRR',
       'Ictal signs and symtoms--NRL', 'Ictal signs and symtoms--Vo',
       'Ictal signs and symtoms--Gaze', 'Ictal signs and symtoms--VA',
       'Ictal signs and symtoms--Hiccup', 'Consciousness Time',
       'Postictal signs and symptoms--MA', 'Postictal signs and symptoms--OA',
       'Postictal signs and symptoms--NRR',
       'Postictal signs and symptoms--NRL',
       'Postictal signs and symptoms--Smacking',
       'Postictal signs and symptoms--Smile',
       'Postictal signs and symptoms--Laughing',
       'Postictal signs and symptoms--Coughing',
       'Postictal signs and symptoms--Vo',
       'Postictal signs and symptoms--Gape',
       'Postictal signs and symptoms--Hipcup',
       'Postictal signs and symptoms--Motor restless',
       'Postictal signs and symptoms--Speaks incomprehensible',
       'Postictal signs and symptoms--Cloni Arm',
       'Postictal signs and symptoms--Stand up', 'Level of Consciousness',
       'Coughing Time seconds--Coughing #1',
       'Coughing Time seconds--Coughing #2',
       'Coughing Time seconds--Coughing #3',
       'Coughing Time seconds--Coughing #4', 'Disnomia seconds', 'Aphasia TT',
       'Aphasia After'],
      dtype='object')

DESCRIPTIVE ANALYTICS

In [6]:
dataset.describe().round()  # summary statistics (result discarded; only .shape is displayed)
dataset.shape
Out[6]:
(184, 47)
In [7]:
# Examining the dataframe 
dataset.head(25) 
Out[7]:
Nr Location Sex Age Age onset Years with ES Seizure Type Laterality Behavior before Same day ES before ... Postictal signs and symptoms--Cloni Arm Postictal signs and symptoms--Stand up Level of Consciousness Coughing Time seconds--Coughing #1 Coughing Time seconds--Coughing #2 Coughing Time seconds--Coughing #3 Coughing Time seconds--Coughing #4 Disnomia seconds Aphasia TT Aphasia After
0 1.0 TR 2.0 20.0 9.0 11.0 4.0 1.0 1.0 2.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
1 2.0 TR 1.0 25.0 22.0 3.0 9.0 1.0 1.0 3.0 0 0 5.0 0 0 0 0 0.0 0.0 0.0
2 3.0 TR 2.0 28.0 6.0 22.0 10.0 1.0 2.0 0.0 0 0 2.0 39 0 0 0 0.0 0.0 0.0
3 4.0 TR 2.0 42.0 1.0 41.0 11.0 1.0 1.0 2.0 0 0 5.0 0 0 0 0 0.0 0.0 0.0
4 5.0 TR 2.0 35.0 10.0 25.0 4.0 1.0 1.0 0.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
5 6.0 TR 2.0 29.0 20.0 9.0 4.0 2.0 2.0 0.0 0 0 4.0 0 0 0 0 0.0 0.0 0.0
6 7.0 TR 1.0 57.0 3.0 54.0 4.0 1.0 1.0 0.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
7 8.0 TR 1.0 24.0 4.0 20.0 2.0 1.0 1.0 2.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
8 9.0 TR 1.0 34.0 7.0 27.0 2.0 1.0 1.0 4.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
9 10.0 TR 2.0 42.0 1.0 41.0 2.0 1.0 1.0 4.0 1 0 5.0 0 0 0 0 0.0 0.0 0.0
10 11.0 TR 1.0 22.0 13.0 9.0 2.0 1.0 2.0 0.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
11 12.0 TR 2.0 20.0 10.0 10.0 2.0 1.0 2.0 1.0 0 0 2.0 2 0 0 0 0.0 0.0 0.0
12 13.0 TR 1.0 19.0 14.0 5.0 2.0 1.0 2.0 1.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
13 14.0 TR 2.0 21.0 10.0 11.0 6.0 2.0 1.0 1.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
14 15.0 TR 1.0 43.0 16.0 27.0 7.0 2.0 1.0 0.0 0 0 4.0 0 0 0 0 0.0 0.0 0.0
15 16.0 TR 2.0 20.0 10.0 10.0 3.0 1.0 1.0 0.0 0 0 10.0 0 0 0 0 0.0 0.0 0.0
16 18.0 TR 1.0 61.0 30.0 31.0 1.0 1.0 2.0 2.0 0 0 2.0 0 0 0 0 46.0 0.0 0.0
17 19.0 TR 2.0 20.0 9.0 11.0 1.0 1.0 1.0 5.0 0 0 5.0 0 0 0 0 0.0 60.0 0.0
18 20.0 TR 1.0 24.0 23.0 1.0 1.0 1.0 2.0 6.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
19 21.0 TR 2.0 30.0 16.0 14.0 1.0 1.0 1.0 1.0 0 0 4.0 0 0 0 0 0.0 0.0 0.0
20 24.0 TR 2.0 18.0 16.0 2.0 1.0 1.0 1.0 1.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
21 25.0 TR 2.0 53.0 51.0 2.0 1.0 1.0 1.0 1.0 0 0 2.0 7 0 0 0 0.0 0.0 0.0
22 26.0 TR 1.0 24.0 22.0 2.0 1.0 1.0 1.0 1.0 0.125 0.166667 2.0 0 0 0 0 0.0 0.0 0.0
23 27.0 TR 2.0 53.0 14.0 39.0 1.0 1.0 2.0 0.0 0 0 2.0 0 0 0 0 0.0 0.0 0.0
24 28.0 TR 1.0 52.0 1.0 51.0 1.0 3.0 1.0 1.0 0 0 4.0 0 0 0 0 0.0 0.0 0.0

25 rows × 47 columns

Distribution analysis of Consciousness Time (CT) with respect to sex and age

In [8]:
# Scatterplot of Consciousness Time vs. age, faceted by sex; categorical codes cast to strings
df1=dataset.loc[:,'Sex':'Ictal Seconds'].copy()
for col in df1.columns[1:-1]:
    df1[col]=df1[col].astype(int).astype(str)
df1['Age']=df1['Age'].astype(int)
df1['Consciousness Time']=pd.to_numeric(dataset['Consciousness Time'])
sns.set(rc={'figure.figsize':(20,20)})
g = sns.FacetGrid(df1, col="Sex", height=8.27, aspect=11.7/8.27)
g.map(sns.scatterplot, 'Consciousness Time', 'Age', alpha=.7)
g.add_legend()
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x2298fe13d08>

Distribution analysis of CT with respect to laterality

In [9]:
g = sns.FacetGrid(df1, col="Laterality", height=4, aspect=.5)
g.map(sns.barplot, "Sex", "Consciousness Time", order=sorted(df1["Sex"].unique()))
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x2299036e308>

Distribution analysis of CT with respect to behavior before

In [10]:
g = sns.FacetGrid(df1, col="Behavior before", height=4, aspect=.5)
g.map(sns.barplot, "Sex", "Consciousness Time", order=sorted(df1["Sex"].unique()))
Out[10]:
<seaborn.axisgrid.FacetGrid at 0x2299047c988>

Distribution analysis of CT with respect to sex

In [11]:
## Distribution of Consciousness Time by sex (KDE)
g = sns.FacetGrid(df1, row="Sex",
                  height=1.7, aspect=4,)
g.map(sns.kdeplot, "Consciousness Time")
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x2299052f7c8>

Correlation comparison for main features

In [37]:
df=dataset.loc[:,'Sex':'Consciousness Time']
In [59]:
corr_df = df.corr()
fig = px.imshow(corr_df)
fig.update_layout(title='Correlation comparison for main features', width=1000, height=1000)
fig.show()

Machine Learning section: supervised training

Predicting Consciousness Time using a Random Forest regressor

In [15]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets

train_features, test_features, train_labels, test_labels = train_test_split(df.iloc[:,:-1], df.iloc[:,-1], test_size = 0.25, random_state = 42)
In [16]:
# data shape for test and training subsets
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (138, 21)
Training Labels Shape: (138,)
Testing Features Shape: (46, 21)
Testing Labels Shape: (46,)
In [17]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);
In [18]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))
Mean Absolute Error: 63.15
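An MAE of 63.15 is easier to judge against a naive baseline that always predicts the training-set mean. A minimal sketch of that comparison on synthetic data (the patient dataset itself is not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative feature plus noise
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = 100 * X[:, 0] + 20 * rng.randn(200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)
model_mae = np.mean(np.abs(rf.predict(X_te) - y_te))

# Naive baseline: always predict the training-set mean
baseline_mae = np.mean(np.abs(y_tr.mean() - y_te))

print(f"model MAE:    {model_mae:.2f}")
print(f"baseline MAE: {baseline_mae:.2f}")
```

If the model's MAE is not clearly below the baseline's, the features carry little predictive signal for the target.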
In [19]:
# Index sort the most important features
sorted_feature_weight_idxes = np.argsort(rf.feature_importances_)[::-1] # Reverse sort

# Get the most important feature names and weights
most_important_features = np.take_along_axis(
    np.array(df.columns[:-1].tolist()),  # feature columns only (the target is the last column)
    sorted_feature_weight_idxes, axis=0)
most_important_weights = np.take_along_axis(
    np.array(rf.feature_importances_), 
    sorted_feature_weight_idxes, axis=0)
In [20]:
## Feature importances for predicting Consciousness Time
list(zip(most_important_features, most_important_weights))
Out[20]:
[('Ictal Seconds', 0.17901448698785813),
 ('Years with ES', 0.17194274218960087),
 ('Seizure Type', 0.12924139683710725),
 ('Age onset', 0.11933391836499692),
 ('Age', 0.11707535473460505),
 ('ES before', 0.09652508599570678),
 ('Ictal signs and symtoms--MA', 0.04263594173515929),
 ('Same day ES before', 0.03886861258907664),
 ('Laterality', 0.03470149205846669),
 ('Ictal signs and symtoms--OA', 0.026127441615677868),
 ('Sex', 0.01968745500065026),
 ('Behavior before', 0.017170088958457962),
 ('Ictal signs and symtoms--Laughing', 0.0023046791365437785),
 ('Ictal signs and symtoms--SMA', 0.00197076115074575),
 ('Ictal signs and symtoms--NRR', 0.0013268295804454912),
 ('Ictal signs and symtoms--Vo', 0.0007526875771582651),
 ('Ictal signs and symtoms--Coughing', 0.000645700670430751),
 ('Ictal signs and symtoms--Gaze', 0.0004542110344298992),
 ('Ictal signs and symtoms--Hiccup', 0.00019804421993007234),
 ('Ictal signs and symtoms--NRL', 2.3069562952340392e-05),
 ('Ictal signs and symtoms--VA', 0.0)]
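Impurity-based importances like those above can overstate high-cardinality features; permutation importance on held-out data is a common cross-check. A sketch on synthetic data (not the patient dataset), where feature 0 carries the signal and feature 1 is pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 2)
y = 10 * X[:, 0] + rng.randn(300)  # only feature 0 is informative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```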

Tree map: hierarchy of main features (location, sex, age)

In [62]:
import plotly.express as px

## Creating a tree map over location, sex and age; the value is the number of patients in each group
treedata=dataset.copy()
for col in ['Sex', 'Age']:
    treedata[col]=treedata[col].astype(int).astype(str)
treedata=treedata.groupby(['Location', 'Sex', 'Age'])['Nr'].count().reset_index().rename(columns={'Nr':'count'})
fig = px.treemap(treedata, path=['Location', 'Sex', 'Age'], values='count')
fig.update_layout(autosize=False,width=800,height=700)
fig.show()
fig.write_html(r'treemap.html')

Unsupervised algorithms: the K-means clustering method

In [22]:
# K-means preparation: select the demographic and ictal features, label-encode Location
from sklearn.preprocessing import LabelEncoder
df=dataset.loc[:,'Location':'Ictal Seconds'].copy()
data2=df
encoder=LabelEncoder()
data2['Location']=encoder.fit_transform(data2['Location'])
data2
Out[22]:
Location Sex Age Age onset Years with ES Seizure Type Laterality Behavior before Same day ES before ES before Ictal Seconds
0 3 2.0 20.0 9.0 11.0 4.0 1.0 1.0 2.0 6.0 91.0
1 3 1.0 25.0 22.0 3.0 9.0 1.0 1.0 3.0 3.0 98.0
2 3 2.0 28.0 6.0 22.0 10.0 1.0 2.0 0.0 0.0 90.0
3 3 2.0 42.0 1.0 41.0 11.0 1.0 1.0 2.0 2.0 197.0
4 3 2.0 35.0 10.0 25.0 4.0 1.0 1.0 0.0 0.0 122.0
... ... ... ... ... ... ... ... ... ... ... ...
179 0 1.0 13.0 5.0 8.0 57.0 1.0 2.0 7.0 8.0 28.0
180 0 2.0 26.0 2.0 24.0 65.0 2.0 2.0 2.0 3.0 122.0
181 0 1.0 33.0 29.0 4.0 74.0 1.0 1.0 3.0 3.0 214.0
182 0 2.0 56.0 7.0 49.0 83.0 1.0 1.0 7.0 23.0 66.0
183 0 1.0 21.0 4.0 17.0 82.0 1.0 1.0 4.0 4.0 54.0

184 rows × 11 columns

In [63]:
# Finding the best K value, i.e. the appropriate number of clusters (elbow method)
inertias = [] 
K = range(1,10) 
for k in K: 
    # Building and fitting the model 
    kmeanModel = KMeans(n_clusters=k).fit(data2) 
    inertias.append(kmeanModel.inertia_) 
    
plt.plot(K, inertias, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Inertia') 
plt.title('The Elbow Method using Inertia') 
Out[63]:
Text(0.5, 1.0, 'The Elbow Method using Inertia')
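The elbow heuristic can be cross-checked with the silhouette score, which peaks at the k that best separates the clusters. A sketch on synthetic blobs (the real feature matrix `data2` would be substituted in practice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with 4 well-separated groups
centers = [[0, 0], [8, 0], [0, 8], [8, 8]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 4 for this synthetic data
```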
In [ ]:
# The elbow appears at k = 4

# Fitting K-Means to the dataset
from sklearn.cluster import KMeans

## The number of clusters is set to 4
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(data2)
y_kmeans1 = y_kmeans + 1  # shift labels to 1..4
# Adding the cluster label to the dataset
dataset['cluster'] = y_kmeans1
data2['cluster'] = y_kmeans1
dataset
In [ ]:
## Printing the IDs of the patients in each cluster
for i in range(1,5):
    print('****ID of patients in cluster {}*****'.format(str(i)))
    print(list(dataset[dataset.cluster==i]['Nr'].values))

Feature importances from the K-means clusters

In [ ]:
print(data2['cluster'].value_counts())
In [ ]:
# Train a classifier
from sklearn.ensemble import RandomForestClassifier
def forestmodel(df):
    clf = RandomForestClassifier(random_state=1)
    clf.fit(df.drop(columns=["Binary Cluster 0","cluster"]).values, df["Binary Cluster 0"].values)

    # Index sort the most important features
    sorted_feature_weight_idxes = np.argsort(clf.feature_importances_)[::-1] # Reverse sort

    # Get the most important feature names and weights
    feature_names = df.drop(columns=["Binary Cluster 0", "cluster"]).columns
    most_important_features = np.take_along_axis(
        np.array(feature_names.tolist()),
        sorted_feature_weight_idxes, axis=0)
    most_important_weights = np.take_along_axis(
        np.array(clf.feature_importances_), 
        sorted_feature_weight_idxes, axis=0)

    # Show
    return list(zip(most_important_features, most_important_weights))
In [ ]:
for i in range(1,5):
    data2['Binary Cluster 0'] = np.where(data2['cluster']==i,1,0)
    print(f'## the feature importances for cluster {i}\n')
    feat_imp=forestmodel(data2)
    print(feat_imp)

Hierarchical clustering

Agglomerative hierarchical clustering differs from k-means in a key way. Rather than choosing a number of clusters and starting from random centroids, we begin with every point in the dataset as its own cluster. The two closest clusters are then merged, and the process repeats, always merging the closest pair, until only a single cluster remains.
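The merge bookkeeping behind the dendrogram can be seen on a toy example: SciPy's `linkage` returns one row per merge step, containing the two cluster indices, the merge distance, and the size of the new cluster:

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Four 1-D points: the closest pair (10.0, 10.4) should merge first
pts = np.array([[0.0], [0.5], [10.0], [10.4]])

# Ward linkage: n-1 merge steps, one row each
Z = sch.linkage(pts, method='ward')
print(Z.shape)  # (3, 4)
```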

In [24]:
# create dendrogram
plt.figure(figsize=(15,8))

dendrogram = sch.dendrogram(sch.linkage(data2, method='ward'))

Now that we know the number of clusters for our dataset, the next step is to group the data points into these four clusters. For this we use the AgglomerativeClustering class from the sklearn.cluster library:

In [25]:
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
cluster.fit_predict(data2)
Out[25]:
array([0, 0, 0, 1, 0, 0, 0, 3, 3, 1, 3, 0, 3, 0, 1, 1, 3, 0, 3, 3, 3, 0,
       3, 3, 0, 0, 0, 3, 3, 3, 0, 3, 0, 3, 0, 3, 3, 3, 0, 0, 3, 1, 3, 3,
       2, 2, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
       3, 3, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 3, 0, 1, 3, 0, 0, 3,
       3, 0, 0, 0, 0, 1, 0, 2, 3, 1, 2, 0, 1, 3, 0, 3, 3, 0, 3, 3, 3, 3,
       0, 0, 0, 3, 0, 0, 0, 3, 0, 3, 3, 0, 0, 0, 3, 0, 3, 0, 3, 0, 3, 3,
       3, 0, 0, 3, 0, 3, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 0, 3, 0, 3, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2], dtype=int64)
In [26]:
dataset['cluster']=cluster.labels_+1
In [27]:
## Printing the IDs of the patients in each cluster obtained by hierarchical clustering
for i in range(1,5):
    print('****ID of patients in cluster {}*****'.format(str(i)))
    print(list(dataset[dataset.cluster==i]['Nr'].values))
****ID of patients in cluster 1*****
[1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 12.0, 14.0, 19.0, 25.0, 28.0, 29.0, 30.0, 34.0, 36.0, 38.0, 42.0, 43.0, 51.0, 52.0, 53.0, 55.0, 72.0, 73.0, 86.0, 89.0, 90.0, 93.0, 94.0, 95.0, 96.0, 98.0, 103.0, 106.0, 109.0, 114.0, 115.0, 116.0, 118.0, 120.0, 121.0, 123.0, 126.0, 127.0, 128.0, 130.0, 132.0, 134.0, 138.0, 139.0, 141.0, 172.0, 174.0, 176.0]
****ID of patients in cluster 2*****
[4.0, 10.0, 15.0, 16.0, 45.0, 84.0, 87.0, 97.0, 101.0, 104.0, 144.0, 146.0, 186.0]
****ID of patients in cluster 3*****
[48.0, 49.0, 50.0, 54.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 99.0, 102.0, 143.0, 145.0, 147.0, 148.0, 149.0, 150.0, 151.0, 152.0, 153.0, 154.0, 155.0, 156.0, 157.0, 158.0, 159.0, 160.0, 161.0, 162.0, 163.0, 164.0, 165.0, 166.0, 167.0, 168.0, 169.0, 177.0, 178.0, 179.0, 180.0, 181.0, 182.0, 183.0, 184.0, 185.0, 187.0, 188.0]
****ID of patients in cluster 4*****
[8.0, 9.0, 11.0, 13.0, 18.0, 20.0, 21.0, 24.0, 26.0, 27.0, 31.0, 32.0, 33.0, 35.0, 37.0, 39.0, 40.0, 41.0, 44.0, 46.0, 47.0, 67.0, 68.0, 69.0, 70.0, 71.0, 85.0, 88.0, 91.0, 92.0, 100.0, 105.0, 107.0, 108.0, 110.0, 111.0, 112.0, 113.0, 117.0, 122.0, 124.0, 125.0, 129.0, 131.0, 133.0, 135.0, 136.0, 137.0, 140.0, 142.0, 170.0, 171.0, 173.0, 175.0]

Using SelectKBest to select the most representative features in the dataset

In [28]:
from sklearn.feature_selection import SelectKBest, chi2, f_regression
## From the dataset we take the 10 best features for predicting Consciousness Time

df=dataset
encoder=LabelEncoder()
df['Location']=encoder.fit_transform(df['Location'])

X=df.drop(columns=['Nr','Consciousness Time'])
y=df['Consciousness Time']

# Create the object for SelectKBest and fit and transform the regression data
X_reg_new=SelectKBest(score_func=f_regression, k=10).fit_transform(X,y)
In [29]:
## The best 10 features to predict Consciousness Time (names recovered by value matching)
X_reg_new=pd.DataFrame(X_reg_new)
for col in X_reg_new.columns:
    for col1 in dataset.columns:
        if all(dataset[col1]==X_reg_new[col]):
            print(col1)
Years with ES
Same day ES before
ES before
Postictal signs and symptoms--OA
Postictal signs and symptoms--NRR
Postictal signs and symptoms--Smacking
Postictal signs and symptoms--Hipcup
Level of Consciousness
Disnomia seconds
Aphasia After
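The value-matching loop above recovers the selected column names by comparing values; `SelectKBest.get_support` gives the same mapping directly. A sketch on synthetic columns (hypothetical names, not the patient data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 4), columns=['a', 'b', 'c', 'd'])
y = 5 * X['b'] + 0.1 * rng.randn(100)  # only 'b' is informative

selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)

# Boolean mask over the original columns -> selected names, no value matching needed
print(list(X.columns[selector.get_support()]))  # ['b']
```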

Calculating the incidence of main feature combinations within clusters

In [30]:
# First, check the incidence of common attribute combinations (age, sex, etc.)

df1=dataset[['cluster','Location', 'Sex', 'Age', 'Age onset', 'Years with ES',
       'Seizure Type']].copy()
df1['group_incidence'] = df1.groupby(['cluster','Location', 'Sex', 'Age', 'Age onset', 'Years with ES'])['cluster'].transform('size') / len(df1)
In [31]:
# The most common combinations, sorted by incidence
## i.e. the most frequently repeated groupings of age, sex and the other attributes
df1.sort_values('group_incidence',ascending=False).head(50)
Out[31]:
cluster Location Sex Age Age onset Years with ES Seizure Type group_incidence
105 1 2 2.0 68.0 9.0 59.0 1.0 0.016304
163 3 2 1.0 47.0 35.0 12.0 82.0 0.016304
146 3 2 1.0 47.0 35.0 12.0 40.0 0.016304
157 3 2 1.0 47.0 35.0 12.0 74.0 0.016304
179 3 0 1.0 13.0 5.0 8.0 57.0 0.016304
85 1 2 2.0 68.0 9.0 59.0 2.0 0.016304
125 1 2 2.0 68.0 9.0 59.0 1.0 0.016304
177 3 0 1.0 13.0 5.0 8.0 49.0 0.016304
176 3 0 1.0 13.0 5.0 8.0 48.0 0.016304
128 4 2 1.0 51.0 1.0 50.0 23.0 0.010870
129 1 2 1.0 45.0 33.0 12.0 14.0 0.010870
130 4 2 2.0 37.0 9.0 28.0 14.0 0.010870
131 4 2 1.0 46.0 6.0 40.0 18.0 0.010870
61 3 3 2.0 53.0 51.0 2.0 82.0 0.010870
132 4 2 2.0 37.0 9.0 28.0 19.0 0.010870
137 4 2 2.0 43.0 6.0 37.0 26.0 0.010870
138 3 2 1.0 54.0 1.0 53.0 30.0 0.010870
55 3 3 2.0 53.0 51.0 2.0 80.0 0.010870
104 4 2 2.0 33.0 4.0 29.0 1.0 0.010870
145 3 2 1.0 54.0 1.0 53.0 39.0 0.010870
126 4 2 2.0 67.0 35.0 32.0 1.0 0.010870
75 3 1 2.0 48.0 5.0 43.0 55.0 0.010870
124 4 2 2.0 37.0 26.0 11.0 1.0 0.010870
114 1 2 2.0 48.0 16.0 32.0 1.0 0.010870
101 4 2 2.0 43.0 6.0 37.0 1.0 0.010870
100 2 2 2.0 45.0 12.0 33.0 1.0 0.010870
111 1 2 2.0 77.0 4.0 73.0 1.0 0.010870
113 4 2 1.0 46.0 6.0 40.0 1.0 0.010870
91 1 2 2.0 77.0 4.0 73.0 4.0 0.010870
90 1 2 1.0 45.0 33.0 12.0 4.0 0.010870
88 4 2 2.0 67.0 35.0 32.0 2.0 0.010870
74 3 1 2.0 48.0 5.0 43.0 54.0 0.010870
87 4 2 2.0 37.0 26.0 11.0 2.0 0.010870
86 1 2 2.0 48.0 16.0 32.0 2.0 0.010870
84 4 2 2.0 33.0 4.0 29.0 4.0 0.010870
83 2 2 2.0 45.0 12.0 33.0 2.0 0.010870
117 4 2 1.0 51.0 1.0 50.0 1.0 0.010870
39 1 3 1.0 57.0 3.0 54.0 1.0 0.010870
42 4 3 2.0 30.0 16.0 14.0 16.0 0.010870
0 1 3 2.0 20.0 9.0 11.0 4.0 0.010870
27 4 3 1.0 22.0 13.0 9.0 1.0 0.010870
17 1 3 2.0 20.0 9.0 11.0 1.0 0.010870
3 2 3 2.0 42.0 1.0 41.0 11.0 0.010870
19 4 3 2.0 30.0 16.0 14.0 1.0 0.010870
10 4 3 1.0 22.0 13.0 9.0 2.0 0.010870
9 2 3 2.0 42.0 1.0 41.0 2.0 0.010870
6 1 3 1.0 57.0 3.0 54.0 4.0 0.010870
171 1 0 1.0 19.0 18.0 1.0 38.0 0.005435
115 1 2 2.0 31.0 1.0 30.0 1.0 0.005435
116 1 2 1.0 50.0 49.0 1.0 1.0 0.005435