2022 ~ Python and Machine Learning Blog

Friday, 7 October 2022

8. Data Analysis - Visualization with Matplotlib

October 07, 2022

Basic line plot

%matplotlib notebook -- to create visualization in jupyter notebook

import matplotlib.pyplot as plt

import numpy as np

X = np.arange(10)

plt.plot(X)

Create figure and Plot in two lines

1. Create Figure

fig = plt.figure()

2. Create Plot and add plot to figure

ax1 = fig.add_subplot(2,2,1) --- 2 rows, 2 columns, and we are selecting the 1st subplot.

Without creating subplot ( plt.plot)

fig = plt.figure()

plt.plot(np.random.rand(50), 'k-') -- draws on the most recently used figure and subplot, creating a subplot if none exists.

Creating figure and axis in the same line.

fig, axes = plt.subplots(2,3)

axes[0,1].hist(X) -- hist needs the data to bin.

Creating same X and Y axis for all subplots.

fig, axes = plt.subplots(2,3, sharex=True, sharey = True)

Remove space between subplots.

fig, axes = plt.subplots(2,3, sharex=True, sharey=True)

fig.subplots_adjust(wspace=0, hspace=0) -- spacing is set on the figure, not in plt.subplots.

Adding color and linestyle.

axes[0,0].plot(X, linestyle="--", color="r") -- style options go to plot(), not to plt.subplots().

axes[0,0].plot(X, "r--") -- short form.

Add Marker

axes[0,0].plot(X, linestyle="--", color="r", marker="o")

Connecting 2 dots

axes[0,0].plot(X, linestyle="--", color="r", marker="o", drawstyle="steps-post") -- other options: steps-pre / steps-mid / steps

Add label

axes[0,0].plot(X, linestyle="--", color="r", marker="o", drawstyle="steps-post", label="line") -- call axes[0,0].legend() to display the label.
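The options in this section split across three calls: the grid and axis sharing go to plt.subplots, the spacing goes to the figure, and the line style goes to plot. A minimal sketch, assuming a headless "Agg" backend and made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

X = np.arange(10)

# Grid layout and shared axes are arguments of plt.subplots
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True)

# Spacing between subplots is a figure-level setting
fig.subplots_adjust(wspace=0, hspace=0)

# Colour, line style, marker, draw style and label belong to the plot call
(line,) = axes[0, 0].plot(
    X, linestyle="--", color="r", marker="o",
    drawstyle="steps-post", label="line",
)
axes[0, 0].legend()  # makes the label visible
```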

X and Y axis label

fig = plt.figure()

ax1= fig.add_subplot(1,1,1)

ticks = ax1.set_xticks([0,10,20,30])

labels = ax1.set_xticklabels(["zero", "ten", "twenty", "thirty"])

ax1.plot()

Change orientation and size of the X labels

fig = plt.figure()

ax1= fig.add_subplot(1,1,1)

ticks = ax1.set_xticks([0,10,20,30])

labels = ax1.set_xticklabels(["zero", "ten", "twenty", "thirty"], rotation=90, fontsize="large")

ax1.plot()

X-axis label, Title

ax1.set_xlabel("Xlabels")

ax1.set_title("Title")

Plots

vertical bar plot

ax1.bar(["Car", "Truck", "Bus", "Auto"], [10,20,30,40]) -- Categorical + numeric data

horizontal bar plot

ax1.barh(["Car", "Truck", "Bus", "Auto"], [10,20,30,40]) -- Categorical + numeric data

histogram

ax1.hist(X, bins=50)

pie chart

ax1.pie([10,20,30], labels=["car", "bus", "truck"])

scatter plot

ax1.scatter(x,y, marker="^", color="g")

Box/Violin plot

ax1.boxplot(X)

ax1.violinplot(X)

1. Python Machine learning - Regularized Linear Model

October 06, 2022

Linear Regression

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

lin_reg.fit(X, y)

print(lin_reg.intercept_, lin_reg.coef_)

lin_reg.predict(X_new)

Polynomial Regression

Transform to polynomial feature

from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree =2, include_bias=False)

X_poly = poly_features.fit_transform(X) --- X_poly will have two terms - degree =1 and degree 2

Now use Linear Regression

lin_reg = LinearRegression()

lin_reg.fit(X_poly, y)

print(lin_reg.intercept_, lin_reg.coef_)

Training and test error

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        # the training subset grows from 1 sample up to len(X_train) - 1
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        # the validation error always uses the full validation set
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    # plot the RMSE of both curves
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-+", linewidth=2, label="validation")

Using Pipeline

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline ([

("poly features", PolynomialFeatures(degree=10, include_bias=False)),

("lin reg", LinearRegression())

])

plot_learning_curves(polynomial_regression, X, y)
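A minimal sketch of the pipeline idea on synthetic data. The noise-free quadratic, the seed and the step names are assumptions for illustration; degree 2 is used here so the recovered coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: y = 0.5*x^2 + x + 2, without noise
rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2

# The pipeline expands x into [x, x^2] and then fits a linear model
model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
model.fit(X, y)

coef = model.named_steps["lin_reg"].coef_            # weights for x and x^2
intercept = model.named_steps["lin_reg"].intercept_  # constant term
```

Because the data is an exact quadratic, the fitted weights should match the generating formula up to floating point error.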

Gradient Descent

Batch Gradient Descent

Stochastic Gradient Descent

Mini Batch Gradient Descent

Ridge Regression

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky")

ridge_reg.fit(X,y)

ridge_reg.predict([[1.5]]) -- predict expects a 2-D input (samples x features)

Lasso Regression ( Least Absolute Shrinkage and Selection Operator Regression )

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha = 0.1)

lasso_reg.fit(X, y)

lasso_reg.predict([[1.5]])

Elastic net

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)

elastic_net.fit(X,y)

elastic_net.predict([[1.5]])
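A minimal sketch comparing the three regularized models on made-up data where only the first of two features matters. The data, seed and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Hypothetical data: y depends only on the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0]

ridge_reg = Ridge(alpha=1, solver="cholesky").fit(X, y)
lasso_reg = Lasso(alpha=0.1).fit(X, y)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# predict takes a 2-D array: one row per sample, one column per feature
X_new = [[1.5, 0.0]]
predictions = {
    "ridge": float(ridge_reg.predict(X_new)[0]),
    "lasso": float(lasso_reg.predict(X_new)[0]),
    "elastic_net": float(elastic_net.predict(X_new)[0]),
}
```

Ridge shrinks the useful weight a little below 3, while Lasso also sets the weight of the irrelevant feature to zero, which is why it doubles as a feature selector.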

7. Data Analysis - Data Aggregation & Grouping

October 04, 2022

Split, Apply, Combine

Mean

X["Values1"].groupby([X["Keys1"]]).mean() -- Mean of "Values1" for each value of "Keys1"
X["Values1"].groupby([X["Keys1"],X["Keys2"]]).mean()
X.groupby("Keys1").mean(numeric_only=True) -- numeric_only skips text columns such as "Keys2"
X.groupby(["Keys1", "Keys2"]).mean()

Count

X.groupby(["Keys1", "Keys2"]).size()
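A minimal sketch of split, apply, combine on a made-up frame (the column values are assumptions for illustration):

```python
import pandas as pd

X = pd.DataFrame({
    "Keys1": ["a", "a", "b", "b", "a"],
    "Keys2": ["one", "two", "one", "two", "one"],
    "Values1": [1.0, 2.0, 3.0, 4.0, 3.0],
})

# Mean of Values1 for each value of Keys1
mean_by_key = X["Values1"].groupby(X["Keys1"]).mean()

# Mean of Values1 for each (Keys1, Keys2) pair
mean_by_two = X.groupby(["Keys1", "Keys2"])["Values1"].mean()

# Number of rows in each (Keys1, Keys2) group
sizes = X.groupby(["Keys1", "Keys2"]).size()
```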

GroupBy Clause with FOR Loop

for name, group in X.groupby("Keys1"):
    print(name)
    print(group)

Accessing each group in dictionary key

dict(list(X.groupby("Keys1"))) -- maps each key value to the dataframe of its group

Extracting based on column

X.groupby(X.dtypes, axis=1) -- columns will be separated based on data types. (axis=1 is deprecated in recent pandas; X.T.groupby(X.dtypes) groups the same columns.)

Column Grouping

Create a mapping from column to grouping

mapColumns={'a':'Group1', 'b':'Group1', 'c':'Group2', 'd':'Group2', 'e':'Group3', 'f':'Group3'}

X.groupby(mapColumns, axis=1).mean()

Passing Lambda function in groupby

X.groupby(lambda x:x<'c', axis =1).mean() -- Two groups of columns are created, labelled True and False.

Aggregate function

def max_min(x):
    return x.max() - x.min()

X.agg(["max", "min", max_min]) -- For each column, max, min and max_min are computed.

Aggregate function with groupby

X1 = X.groupby(["key1"]) -- Creates a GroupBy object with one group per value of "key1"
X1.agg(["min", "max", max_min]) -- Applies each aggregate function to every column within each group.

Aggregate function with custom defined column name

X1.agg([("Maximum", "max"), ("Minimum", "min"), ("Maximum_Minimum", max_min)]) -- a list of (column name, function) pairs

GroupBy with Apply

X.groupby("key1").apply(fName) -- fName is called once for each group
X.groupby(["key1"]).apply(lambda x: x.min()) -- Passing lambda function, User defined function can also be applied.

GroupBy with Apply with passing arguments to function

def minimumFb(X, x):
    return X.max() - X.min() > x

X.groupby(["key1"]).apply(minimumFb, 10) -- extra arguments are passed after the function name
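A minimal sketch of named aggregates and of apply with an extra argument. The frame and the helper name range_exceeds are assumptions for illustration.

```python
import pandas as pd

X = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "Values1": [1, 12, 5, 7],
})

def max_min(x):
    # spread of the values in one group
    return x.max() - x.min()

def range_exceeds(group, threshold):
    # True when the spread of the group is larger than threshold
    return group.max() - group.min() > threshold

grouped = X.groupby("key1")["Values1"]

# (output column name, function) pairs give custom column names
summary = grouped.agg([
    ("Maximum", "max"),
    ("Minimum", "min"),
    ("Maximum_Minimum", max_min),
])

# Arguments after the function are forwarded to it for every group
flags = grouped.apply(range_exceeds, 10)
```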

Apply function with bucket analysis

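quartiles = pd.cut(X.Values1, 2) -- pd.cut splits Values1 into 2 equal-width buckets (pd.qcut(X.Values1, 4) would give true quartiles)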
X.groupby(quartiles).apply(lambda x:x.min())

Pivot Table

X.pivot_table(values="AvgSizeOfTrip" , index = "Gender" , columns="Group" )

Pivot table with aggregation

X.pivot_table(values="AvgSizeOfTrip", index="Gender", columns="Group", aggfunc="sum")
X.pivot_table(values=["Col1", "Col2"], index="Gender", columns="Group", aggfunc={"Col1": "sum", "Col2": "mean"}) -- a different operation for each value column.

Pivot table with count

pd.crosstab(X.gender, X.group) -- This just counts the frequency.
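A minimal sketch of pivot_table and crosstab on a made-up frame (the values are assumptions for illustration):

```python
import pandas as pd

X = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F"],
    "Group": ["g1", "g2", "g1", "g1", "g1"],
    "AvgSizeOfTrip": [2.0, 4.0, 6.0, 8.0, 4.0],
})

# Default aggregation is the mean of each (Gender, Group) cell
means = X.pivot_table(values="AvgSizeOfTrip", index="Gender", columns="Group")

# Same layout, but each cell holds the sum
totals = X.pivot_table(values="AvgSizeOfTrip", index="Gender",
                       columns="Group", aggfunc="sum")

# crosstab only counts how often each combination occurs
counts = pd.crosstab(X["Gender"], X["Group"])
```

A combination that never occurs shows up as NaN in the pivot table and as 0 in the crosstab.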

6. Data Analysis - Data Wrangling

October 04, 2022

Hierarchical Indexing in Pandas Series

Create multi level indexing in pandas series

X = pd.Series(np.random.rand(9), index=[['a','b','c','d','a','b','c','d','a'],[1,2,4,4,5,2,7,8,8]])

Access the values

X['a'] -- using exact index
X[:, 1] -- using slice
X['a':'b', 1] -- using slice

Access Index

X.index -- returns a MultiIndex; each entry is a tuple of (level 1 label, level 2 label)

Changing multi level indexing to row/col

X.unstack() -- to row/col representation.
X.unstack(level = -1) -- first level index will be ROW and second level index will be column
X.unstack(level=0) - first level index will be Column and second level index will be row

Changing dataframe to single pandas series

X.stack() -- to multi level index, row will become 1st level index and column will become second level index.
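A minimal sketch of unstack and stack on a small two-level series. The labels are made up, and each (level 1, level 2) pair is unique, which unstack requires.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=[["a", "a", "b", "b"], [1, 2, 1, 2]])

# Inner level (level=-1, the default) becomes the columns
wide = s.unstack()

# Outer level becomes the columns instead
wide0 = s.unstack(level=0)

# stack goes back to a series with a two-level index
back = wide.stack()
```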

Hierarchical indexing in Pandas Dataframe

Create multi level indexing for both row and columns

X=pd.DataFrame(np.random.rand(9, 6), index=[['a','b','c','d','a','b','c','d','a'],[1,2,4,4,5,2,7,8,8]],
columns=[["col1","col1","col2","col2","col3","col3"],['c1','c2', 'c3', 'c4', 'c5', 'c6']])

Access name of rows and columns

X.index.names
X.columns.names

Access row/column values

X.loc[["row1"], ["col1", "col2"]]

Swap multilevel index

X1.swaplevel("Row1", "Row2") -- Row1 and Row2 will interchange.

Sort index

X1.sort_index(level=0)
X1.sort_index(level=1)

Aggregation of data

X1.groupby(level="Row1").sum() -- Summation for Row1 (the older X1.sum(level="Row1") form was removed in pandas 2.0)
X1.groupby(level="Row2").sum()
X1.T.groupby(level="ColName1").sum().T -- Summation for ColName1
X1.T.groupby(level="ColName2").sum().T

Changing column to index

X.set_index(['b', 'a'])
X.set_index(['b', 'a'], drop=False)

Change index to column

X.reset_index()

Merging data sources

If there is a common column - INNER JOIN

pd.merge(df1, df2) -- joins on the common column; the other columns of both frames are arranged as separate columns. (OR)
pd.merge(df1, df2, how="inner")

If the key columns have different names

pd.merge(df1, df2, left_on="lkey", right_on="lkey1") -- left_on and right_on must be given together

OUTER JOIN

pd.merge(df1, df2, how="outer") -- INNER JOIN + uncommon elements of both dataframes.

LEFT/RIGHT Join

pd.merge(df1, df2, how="left") -- INNER JOIN + uncommon element of left dataframe
pd.merge(df1, df2, how="right") -- INNER JOIN + uncommon element of right dataframe.
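A minimal sketch of the four join types on two made-up frames that share a "key" column (names and values are assumptions for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(df1, df2, how="inner")  # keys present in both
outer = pd.merge(df1, df2, how="outer")  # keys present in either
left = pd.merge(df1, df2, how="left")    # every key of df1
right = pd.merge(df1, df2, how="right")  # every key of df2
```

Rows without a match on the other side get NaN in the columns that come from that side.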

It is possible to have more than one column as key.

pd.merge(df1, df2, on=["lkey1", "lkey2"]) -- both columns together form the join key

Concatenate

pd.concat([X, Y], axis=1) -- merged side by side
pd.concat([X,Y], axis =0) -- stacked top to bottom.
pd.concat([X,Y], ignore_index=True) - stacked top to bottom, discarding the row index of both frames; pandas creates its own index like 0,1,2 etc
pd.concat([X,Y], ignore_index=False) - existing index labels of the original dataframes are retained.

Combining two dataframe to fill missing values

X1.combine_first(X2) -- Value of X2 will be used to fill the NAN value in dataframe = X1.

Pivot and Melt function

X.pivot(index="c1", columns="c2", values="col3") -- 'c1' becomes ROW, 'c2' becomes column and 'col3' becomes value across 'c1' and 'c2'
X.pivot(index="c1", columns="c2", values=["col3","col4"]) -- 'col3' and 'col4' values are stacked side by side in the o/p

Melt function

pd.melt(X, ["col1"]) -- Three columns come out: "col1", variable (names of all the other columns) and value (values of all the other columns).
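A minimal sketch showing that melt reverses pivot. The frame is made up, and var_name / value_name are passed explicitly so the output column names are predictable.

```python
import pandas as pd

long = pd.DataFrame({
    "c1": ["r1", "r1", "r2", "r2"],
    "c2": ["x", "y", "x", "y"],
    "col3": [1, 2, 3, 4],
})

# Long to wide: c1 becomes the rows, c2 the columns, col3 the cell values
wide = long.pivot(index="c1", columns="c2", values="col3")

# Wide back to long: one row per (c1, former column name) pair
melted = pd.melt(wide.reset_index(), id_vars=["c1"],
                 var_name="c2", value_name="col3")
```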

5. Data Analysis - Data Preparation

September 29, 2022

Check Missing data

X.isnull()
X.notnull()

Delete Missing Data

X.dropna() -- delete any row that has missing data
X.dropna(how="all") - delete a row only if all its columns are missing.
X.dropna(axis=1) - will delete a column that has missing data
X.dropna(axis=1, how="all") - will delete a column only if all its values are missing.

Impute missing value

X.fillna(99) - fill all missing value with same constant value.
X.fillna({"col1":99, "col2":10}) -- 'col1' NA value will be imputed with 99 and 'col2' NA value will be imputed with 10.
X.ffill() -- will be imputed with previous value in the same column (fillna(method='ffill') is deprecated in recent pandas).
X.ffill(limit=2) -- at most 2 consecutive missing values are imputed in a column.
X.fillna(X.mean()) -- imputing with statistical values.
X.fillna(X.median())
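A minimal sketch of the imputation options on a made-up frame (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "col1": [1.0, np.nan, np.nan, 4.0],
    "col2": [np.nan, 2.0, 3.0, np.nan],
})

const = X.fillna(99)                          # same constant everywhere
per_col = X.fillna({"col1": 99, "col2": 10})  # one constant per column
forward = X.ffill(limit=1)                    # copy the previous value, at most once in a row
by_mean = X.fillna(X.mean())                  # column mean for each column
```

Forward fill cannot fill a leading NaN because there is no previous value to copy.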

Handling duplicate values

X.duplicated() -- returns boolean array to check if two rows are duplicated.

If two rows are duplicated then 2nd row will show as duplicate.

X.drop_duplicates() -- removes the duplicate.. ( 2nd row will be removed )
X.duplicated(["col1"]) -- checks for duplicates using only 'col1'
X.drop_duplicates(["col1"])
X.drop_duplicates(["col1"], keep="last") -- within each set of duplicates only the last row is kept.

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
X["Floor Group"] = X["Floor Number"].map(mapFunction)

Replace

X.replace(-999, np.nan)
X.replace([-999, -1000], [np.nan, 1000]) --- replace -999 with np.nan and -1000 with 1000
X.replace({-999: np.nan, -1000: 1000}) --- replace -999 with np.nan and -1000 with 1000 using dictionary

Update Row and Column names

X.columns.map(lambda x:x.upper()) -- works on an Index object and returns a new Index
X.index.map(lambda x:x.upper()) -- same for the row index
X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works with dataframe.

Changing the same object ( inplace=True )

X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, inplace=True)

Binning - CUT and QCUT function

salary=[10,20,30,40,50,60]
salarybins=[0,20,40,60]
pd.cut(salary, salarybins) -- create bins 0-20, 20-40, 40-60
pd.cut(salary, 8) -- 8 bins will be created.
pd.qcut(salary, [0,0.1,0.5,0.9,1]) - numbers from 0 to 1 are passed; they are the quantile edges of the bins.
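A minimal sketch contrasting cut (bins by value) with qcut (bins by quantile) on the salary list above. The two-way quantile split is an assumption chosen to keep the result easy to read.

```python
import pandas as pd

salary = [10, 20, 30, 40, 50, 60]
salarybins = [0, 20, 40, 60]

# Bins are (0, 20], (20, 40], (40, 60]; the right edge is included
binned = pd.cut(salary, salarybins)
bin_codes = list(binned.codes)    # bin number of each salary

# Two bins split at the median, so each holds half of the values
halves = pd.qcut(salary, [0, 0.5, 1])
half_codes = list(halves.codes)
```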

Identify outliers

X.describe() -- values far above the 75th percentile (or far below the 25th) can be outliers; a common rule of thumb flags values more than 1.5 x IQR beyond those quartiles.

Taking Samples

data.sample(3) -- get 3 sample from the dataframe
data.sample(8, replace=True) - will create samples with replacement = True

Creating dummy variables

X1= pd.get_dummies(X["col1"], prefix = "dummy")
X.join(X1) -- joining to original dataframe.

4. Data Analysis - Data Ingestion

September 29, 2022

Data Ingestion

pd.read_csv("XYZ.csv")
pd.read_table("XYZ.csv", sep=",")
pd.read_table("XYZ.csv", sep=",", header=None) -- the file has no header row; pandas assigns column names 0, 1, 2, ...
pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) --- provide column names.
pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="a") --- Make one column the row labels; index_col must be one of the column names.
pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["a", "b"]) --- Make two columns the row labels.

Check NULL values in dataframe

X.isnull()

Read a particular character as NULL value from file

pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"]) -- 'd' and "e" character will be read a NULL.
pd.read_csv("XYZ.csv", sep=",", na_values={"Col1": ["d","e"], "Col2": ["a"]}) -- "d" and "e" in "Col1" and "a" in "Col2" will be read as NULL.

Reading large files

pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) - will skip rows 3 and 5 reading

Limiting the number of rows displayed or read

pd.options.display.max_rows = 10
pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) - the whole file is still read, but at most 10 rows are displayed: the first 5 and the last 5.
pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], nrows=5) - Only the first 5 rows are read.

Reading a chunk of file

fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], chunksize=5) -- every chunk will have 5 rows.
for temp_chunk in fileChunk:
    print(temp_chunk)
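A minimal sketch of chunked reading. An in-memory CSV built with io.StringIO stands in for a large file, which is an assumption made so the example needs no file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV with a header and 12 data rows
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(12))

chunk_sizes = []
total_b = 0

# With chunksize, read_csv yields one DataFrame per chunk instead of one big frame
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    chunk_sizes.append(len(chunk))
    total_b += chunk["b"].sum()
```

Only one chunk is held in memory at a time, so running totals like total_b can be computed over files that do not fit in memory.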

Writing to a CSV file

X.to_csv("XYZ.csv") -- NAN value will be empty string
X.to_csv("XYZ.csv", na_rep="NULL") -- NAN value will be written as NULL
X.to_csv("XYZ.csv", na_rep="NULL", index = False, header=False) -- NAN value will be written as NULL, No row and column label will be printed.
X.to_csv("XYZ.csv", na_rep="NULL", index = False, columns=["col1", "col2"]) -- only 2 columns will be printed.

Reading JSON, HTML, Pickle file

pd.read_json("iris.json") - READ JSON file
tables = pd.read_html("XYZ.html") -- returns a list of DataFrames, one per table found in the page
X.to_pickle('file') -- stores to pickle file
Y = pd.read_pickle('file') -- read from pickle file.

3. Data Analysis - Pandas Dataframe

September 28, 2022

Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
X= pd.DataFrame(data1) ------- dataframe creation with all features
X= pd.DataFrame(data1, columns=["State", "Year"]) ---- dataframe creation with 2 features
X= pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- a column that is not in the data ("JUNK") is created and filled with NaN.

Dataframe creation using dictionary with column/index values

data1 = {"State":{'one':"Karnataka", 'two':"Jharkhand"}, "Year":{'one':"2021", 'two':"2022"}, "Name":{'one':'ABC', 'two':'DEF'}}
X=pd.DataFrame(data1)
X=pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
rowIndex=['one', 'two']
X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)

Index and Column Update

Access row/columns

X.columns
X.index

Accessing Column Names

colName = X.columns

Access one particular column

X["State"]
X.State

Assign the same value to every row of a column

X["State"] = "ABC"

Assign a different value to each row of a column

X["State"] = ["AP", "HP", "KA", "TN"] -- the list length must equal the number of rows

Add a column

X['newColumn'] = X['State']>"AP" -- Boolean array will be returned.

Delete a column

del X["newColumn"]

Check if row/column is present in dataframe

"newColumn" in X.columns
"one" in X.index

Index Object are immutable

Row reindexing - Use reindex to Change order of row

X.index[0] = 10 --- Error
X1 = X.reindex([10,20,30,0,1,1,3]) --- Row reindexing . Note that new dataframe object is created.

Column reindexing - Use reindex to Change order of columns

X1 = X.reindex(columns=["Country", "State"]) -- column reindexing

Index value can be repeated

rowIndex=[0,1,1,1,3]

Transpose

Change the Heading of Columns

X.columns.name = "ColumnName"
X.index.name = "IndexName"

Element access from dataframe

Pandas Series Object

X["row"]
X[0]
X[["row1","row3"]]
X[[0,2]]
X[X>50]
X["row1":"row3"]
X[0:2]

Pandas Dataframe

X["col1"] -- will extract a column
X[["col1", "col2"]] - will extract several columns
X[1:3] - a slice inside [] selects rows by position, so rows 1 and 2 are extracted; a single label such as X["col1"] selects a column.

Loc/At syntax

X.loc["row1"]
X.iloc[0]
X.loc["row1", ["col1", "col3"]]
X.iloc[0, [0,2]]
X[X>5]
X.loc[:"row3"]
X.iloc[:3]
X.loc[:"row3", "col1":]
X.iloc[:3, 2:]
X.at["row3", "col2"] ---- single value is extracted through AT command
X.iat[3,2]
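A minimal sketch of label-based (loc, at) versus position-based (iloc, iat) access on a made-up 4 x 3 frame:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["row1", "row2", "row3", "row4"],
    columns=["col1", "col2", "col3"],
)

by_label = X.loc["row1", ["col1", "col3"]]  # labels
by_position = X.iloc[0, [0, 2]]             # integer positions

label_slice = X.loc[:"row3"]   # label slices include the end label
position_slice = X.iloc[:3]    # position slices exclude the end position

single_label = X.at["row3", "col2"]  # one value by label
single_position = X.iat[3, 2]        # one value by position

rows_1_2 = X[1:3]  # a bare slice selects rows by position
```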

Chained indexing

X.iloc[:3, :2].iloc[:, [0,1]] -- accessing in stages.

Arithmetic operation

X1 - X1frame ---- row wise subtraction
X1.sub(X1frame) --- same as X1 - X1frame
X1.rsub(X1frame, axis = "index") ---- same as X1frame - X1
X1.add(X1frame, fill_value = 99) -- where a value is missing in one of the two objects, 99 is used in its place.

Apply

X1.apply(lambda x: x.max()) -- will return max value from each column
X1.apply(lambda x:x.max(), axis="columns") -- max value for each row is calculated.

How to get min/max for each row

def minmax(x):

return pd.Series([x.min(), x.max()], index=["min", "max"])

X1.apply(minmax, axis = "columns") -- will return min/max for each row.

Applymap

X.applymap(lambda x: x*10) -- applymap works for each element of array and in this case multiply each element by 10.

Sorting

X.sort_index() -- sorting of row index
X.sort_index(axis =1 ) -- sorting of column index
X.sort_values() -- will sort based on values. -- for a series object
X.sort_values(by="col2") -- will be sorted based on col='col2' -- for dataframe object.
X.sort_values(by=["col2", "col1"])

Rank

X.rank() -- column wise ranking
X.rank(ascending=True)
X.rank(axis=1) --row wise ranking.

Summarizing data

X.sum()
X.sum(axis=1)
X.describe()
X.corr()
X["col1"].corr(X["col2"])
X.cov()
X["col1"].cov(X["col2"])
Others ( sum, min,max, quantile,mean, median, kurt, skew, cumsum, cummax, cummin etc)

2. Data Analysis - Pandas Series

September 27, 2022

Series Creation

Series creation with default index

X = pd.Series([10,20,30,40]) -- by passing a list
X.index, X.values -- these are attributes, not methods
print(X[0], X[[0,2,3]], X[1:3]) --- Access the series value

Series creation with labeled index

X = pd.Series([10,20,30,40], index = ['l1', 'l2', 'l3', 'l4'])
X['l1'], X[['l1', 'l2']], X['l2':'l4']

Series creation using dictionary

pd.Series(dict1) -- in same order as dictionary
pd.Series(dict1, index=['k2', 'k1']) -- change the order other than what specified in dictionary

Series Filter

X[X<0] -- return all negative elements.

Mathematical operation

obj + obj2

if index is not matching then it will be NAN

check NULL value

X.isnull() -- returns boolean array
X.notnull() -- returns boolean array

change the index

X.index = ['k1', 'k2']
X.index.name = "indexName"

1. Data Analysis - NumPy Operations

September 26, 2022

Numpy Operations

Numpy Array Creation -

X = np.array([10,20,30,40])
X = np.zeros(10)
X = np.ones(10)
X = np.empty(10)
X = np.arange(1, 11) ---- will create a 10 element array with the values 1 to 10

Array Creation using data type

X = np.array([10,20,30,40], dtype = np.float64)

Changing data type

X = X.astype(np.int32)

Arithmetic operation

X = np.array([[10,20,30,40],[50,60,70,80]])

Indexing and slicing

One dimension

X[1] ----- Will return X[1]
X[2:4] ------ will return X[2], X[3]

Multiple dimension

X[0] ------- Will return 0th row
X[0][1] -- Will return 0th row and 1st column.

Propagation and Broadcast

X[2:4] = 100 ---------- X[2], X[3] will be assigned 100.

Row slicing & Column slicing using booleans

Y = np.array(["True", "True", "False", "False", "True"])
X[Y=="True"] -- row slicing
X[:, Y=="True"] -- column slicing

Manipulating all elements

X[X<0] = 0 --- Any element less than 0 will be assigned a value = 0
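A minimal sketch of boolean masks for rows, columns and individual elements. The array and the mask are made up, and a real boolean array is used here instead of strings.

```python
import numpy as np

X = np.array([[1, -2, 3],
              [-4, 5, -6],
              [7, 8, -9]])
Y = np.array([True, False, True])

rows = X[Y]      # keeps rows 0 and 2 (a copy)
cols = X[:, Y]   # keeps columns 0 and 2 (a copy)

# A mask with the same shape as X selects single elements for assignment
X[X < 0] = 0
```

The selections made before the assignment are copies, so they still hold the original negative values.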

Fancy Indexing

Get rows

X[1]
X[[1,3,5]]

Get Columns

X[:, 1:4]
X[:, [1,4,5]]

Get Multiple rows and columns

X[[1,3,4], [2,4,5]] -- returns the three elements at (1,2), (3,4) and (4,5), not a 3x3 block

Fancy Indexing using pipe

X[[1,4,5]][:, [0,1]]
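A minimal sketch of the difference between paired fancy indexing and selecting a block of rows and columns. The 6 x 6 array is made up, and np.ix_ is shown as one alternative to indexing in two steps.

```python
import numpy as np

X = np.arange(36).reshape(6, 6)   # X[i, j] == 6*i + j

# Two index lists are paired up: elements (1,2), (3,4) and (4,5)
pairs = X[[1, 3, 4], [2, 4, 5]]

# Rows first, then columns: the full 3x3 block
block = X[[1, 3, 4]][:, [2, 4, 5]]

# np.ix_ builds the same block in a single indexing step
block_ix = X[np.ix_([1, 3, 4], [2, 4, 5])]
```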

Reshape

X.reshape(8,4)
X.transpose()

Universal Function

One input

np.sqrt(X)
np.exp(X)

Multiple Input

np.maximum(X,Y)
np.add(X,Y)

Finding NAN value

np.isnan(X) -- Output will be [False, False,..., True] -- in this form. -- TRUE indicates that the corresponding value is NAN.

Any element has NAN value

np.isnan(X).any()

Vectorization

Mesh Grid

Xaxis, Yaxis = np.meshgrid(X,Y)

np.where

K = np.where(Z, X, Y)
K = np.where(X>0, Y ,Z )

np.where(X>99, X, 100) -> combination of scalar and numpy array

Statistical Functions

np.mean(X) --- Mean of all element of the array.
np.std(X)
np.sum(X)
X.mean(axis=0) -- Mean of each column
X.mean(axis=1) -- Mean of each row.

Boolean arrays

X.any() - IF any value is TRUE
X.sum() - Count of all TRUE value
X.all() - If all value is TRUE
X>0 -> returns a boolean array.

Sorting

np.sort(X, axis = None) -- sort all element of array and return one dimensional array
np.sort(X, axis = 0) - Column wise sorting
np.sort(X, axis = 1 ) - Row wise sorting

Remove duplicates

np.unique(X) -- removes the duplicates

Check common values between two array

np.in1d(X,Y) -- returns boolean array with TRUE where an element of X is also in Y (np.isin is the newer equivalent)
np.union1d(X,Y) - returns union of X and Y

Saving to a file

Saving one array in one file

np.save("OneDimension", X)
np.load("OneDimension.npy")

Save multiple array using key/value

np.savez("Onedimension", key1=X, key2=Y)
k = np.load("Onedimension.npz") -- savez adds the .npz extension
k["key1"], k["key2"]
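A minimal sketch of a savez / load round trip. A temporary directory is used so nothing is left on disk; that, and the array contents, are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

X = np.arange(5)
Y = np.ones(3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Onedimension")
    np.savez(path, key1=X, key2=Y)        # writes Onedimension.npz

    # The archive behaves like a dictionary keyed by the names given to savez
    with np.load(path + ".npz") as k:
        x_loaded = k["key1"]
        y_loaded = k["key2"]
        stored_keys = sorted(k.files)
```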

Other functionality

np.dot(X,Y)
k= inv(y) ---- from numpy.linalg import inv
