
Friday 7 October 2022

8. Data Analysis - Visualization with Matplotlib

Basic line plot

# %matplotlib notebook renders interactive plots inline in a Jupyter notebook
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

X = np.arange(10)
plt.plot(X)

Create figure and Plot in two lines

1. Create Figure

fig = plt.figure()

2. Create Plot and add plot to figure

ax1 = fig.add_subplot(2, 2, 1)  # 2 rows, 2 columns; select the 1st subplot

Without creating a subplot explicitly (plt.plot)

fig = plt.figure()
plt.plot(np.random.rand(50), 'k-')  # draws on the last active figure and subplot, creating one if necessary

Creating figure and axis in the same line.

fig, axes = plt.subplots(2, 3)
axes[0, 1].hist(np.random.randn(100), bins=20)  # index into the axes array to pick a subplot

Sharing the same X and Y axes across all subplots

fig, axes = plt.subplots(2, 3, sharex=True, sharey=True)

Remove space between subplots.

fig, axes = plt.subplots(2, 3, sharex=True, sharey=True)
plt.subplots_adjust(wspace=0, hspace=0)  # spacing is controlled by subplots_adjust, not by plt.subplots

Adding color and linestyle

Note: linestyle, color, marker, drawstyle, and label are arguments of plot (called here on an axes object), not of plt.subplots.

fig, ax = plt.subplots()
ax.plot(np.random.randn(30), linestyle="--", color="r")
ax.plot(np.random.randn(30), "r--")  # short form

Add marker

ax.plot(np.random.randn(30), linestyle="--", color="r", marker="o")

Connecting the dots (drawstyle)

ax.plot(np.random.randn(30), linestyle="--", color="r", marker="o", drawstyle="steps-post")  # also: steps-pre / steps-mid / steps

Add label

ax.plot(np.random.randn(30), linestyle="--", color="r", marker="o", drawstyle="steps-post", label="line")
ax.legend(loc="best")  # display the label in a legend


X-axis ticks and tick labels

fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
ax1.plot(np.random.randn(40).cumsum())
ticks = ax1.set_xticks([0, 10, 20, 30])
labels = ax1.set_xticklabels(["zero", "ten", "twenty", "thirty"])

Change orientation and size of the X tick labels

fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
ax1.plot(np.random.randn(40).cumsum())
ticks = ax1.set_xticks([0, 10, 20, 30])
labels = ax1.set_xticklabels(["zero", "ten", "twenty", "thirty"], rotation=90, fontsize="large")

X-axis label, Title

ax1.set_xlabel("Xlabels")
ax1.set_title("Title")

Plots

Vertical bar plot

ax1.bar(["Car", "Truck", "Bus", "Auto"], [10, 20, 30, 40])  # categorical + numeric data

Horizontal bar plot

ax1.barh(["Car", "Truck", "Bus", "Auto"], [10, 20, 30, 40])  # categorical + numeric data

Histogram

ax1.hist(X, bins=50)

Pie chart

ax1.pie([10, 20, 30], labels=["car", "bus", "truck"])

Scatter plot

ax1.scatter(x, y, marker="^", color="g")

Box/Violin plot

ax1.boxplot(X)
ax1.violinplot(X)
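
A minimal sketch tying the plot types above together on one figure; the sample data is invented for illustration.

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(10, 6))

# vertical and horizontal bar plots: categorical + numeric data
vehicles, counts = ["Car", "Truck", "Bus", "Auto"], [10, 20, 30, 40]
axes[0, 0].bar(vehicles, counts)
axes[0, 1].barh(vehicles, counts)

# histogram of a normal sample
data = np.random.randn(500)
axes[0, 2].hist(data, bins=50)

# pie chart and scatter plot
axes[1, 0].pie([10, 20, 30], labels=["car", "bus", "truck"])
axes[1, 1].scatter(np.random.rand(30), np.random.rand(30), marker="^", color="g")

# box plot of the same sample
axes[1, 2].boxplot(data)

plt.tight_layout()
plt.show()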



Thursday 6 October 2022

1. Python Machine learning - Regularized Linear Model

Linear Regression

from sklearn.linear_model import LinearRegression     
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)
lin_reg.predict(X_new)
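
The snippet above assumes X, y, and X_new already exist. A minimal end-to-end sketch with synthetic data (the data-generating lines are illustrative assumptions, not part of the original notes):

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: y = 4 + 3x + noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)  # roughly 4 and 3

X_new = np.array([[0.0], [2.0]])
print(lin_reg.predict(X_new))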

Polynomial Regression

Transform to polynomial feature

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)  # X_poly holds two columns: the degree-1 and degree-2 terms

Now use Linear Regression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)
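
To see the transformation concretely, a quick check on a tiny made-up input:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0], [2.0], [3.0]])
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print(X_poly)
# [[1. 1.]
#  [2. 4.]
#  [3. 9.]]  -> the two columns are x and x**2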

Training and test error

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])  # the training-set size grows from 1 to len(X_train)
        y_val_predict = model.predict(X_val)          # always evaluated on the full validation set
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-+", linewidth=2, label="validation")
    plt.legend()
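
A usage sketch, assuming synthetic quadratic data (the data-generating lines are illustrative):

import numpy as np

X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

plot_learning_curves(LinearRegression(), X, y)
plt.show()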

Using Pipeline

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)


Gradient Descent - the main variants (a code sketch follows this list):

Batch Gradient Descent

Stochastic Gradient Descent

Mini-Batch Gradient Descent
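
A hedged sketch of the first two variants: batch gradient descent for linear regression written by hand, and scikit-learn's SGDRegressor for the stochastic one. The learning rate eta, epoch count, and synthetic data are arbitrary illustrative choices, not from the original notes.

import numpy as np
from sklearn.linear_model import SGDRegressor

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

# Batch Gradient Descent: every step uses the gradient over the FULL training set
X_b = np.c_[np.ones((100, 1)), X]   # prepend a bias column of ones
theta = np.random.randn(2)          # random initialization of [intercept, slope]
eta, n_epochs, m = 0.1, 1000, len(X_b)
for _ in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients
print(theta)  # roughly [4, 3]

# Stochastic Gradient Descent: one randomly picked sample per step
# (SGDRegressor applies an l2 penalty by default; tune penalty/alpha as needed)
sgd_reg = SGDRegressor(max_iter=1000, eta0=0.1)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)

Mini-batch gradient descent sits between the two, computing each step's gradient on a small random batch of samples.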

Ridge Regression

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])  # predict expects a 2-D array

Lasso Regression (Least Absolute Shrinkage and Selection Operator)

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

Elastic net

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
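
A quick side-by-side of the three regularized models on the same synthetic data (a sketch; the alpha values follow the snippets above):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

for model in (Ridge(alpha=1, solver="cholesky"),
              Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[1.5]]))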


Tuesday 4 October 2022

7. Data Analysis - Data Aggregation & Grouping

Split, Apply, Combine

Mean

  • X["Values1"].groupby([X["Keys1"]]).mean() -- Mean of "Values1" based on Keys = "Key1"
  • X["Values1"].groupby([X["Keys1"],X["Keys2"]]).mean()
  • X.groupBy["Keys1"].mean()
  • X.groupBy(X["Keys1"], X["Keys2"]).mean()

Count

  • X.groupby(["Keys1", "Keys2"]).size()

GroupBy Clause with FOR Loop

  • for name, group in X.groupby("Key1"):
    • print(name)
    • print(group)

Accessing each group in dictionary key

  • dict(list(X.groupby("Key1")))

Extracting based on column

  • X.groupby(X.dtypes, axis=1) -- columns are separated based on their data types.

Column Grouping

  • Create a mapping from column to grouping
    • mapColumns={'a':'Group1', 'b':'Group1', 'c':'Group2', 'd':'Group2', 'e':'Group3', 'f':'Group3'}
  • X.groupby(mapColumns, axis=1).mean()

Passing Lambda function in groupby

  • X.groupby(lambda x: x < 'c', axis=1).mean() -- two groups, keyed TRUE/FALSE, are created from the column names.

Aggregate function

  • def max_min(x):
    • return x.max() - x.min()
  • X.agg([max, min, max_min]) -- for each column, the max, min, and max_min functions are applied.

Aggregate function with groupby

  • X1 = X.groupby("key1") -- creates a group for each value of "key1"
  • X1.agg([min, max, max_min]) -- applies each aggregate function to every column, per group.

Aggregate function with custom defined column name

  • X1.agg([("Maximum", max), ("Minimum", min), ("Maximum_Minimum", max_min)])

GroupBy with Apply

  • X.groupby("key1").apply(fName)
  • X.groupby("key1").apply(lambda x: x.min()) -- a lambda or a user-defined function can be applied.

GroupBy with Apply, passing arguments to the function

  • def minimumFb(X, x):
    • return X.max() - X.min() > x
  • X.groupby("key1").apply(minimumFb, 10) -- extra positional arguments are forwarded to the function.

Apply function with bucket analysis

  • quartiles = pd.cut(X["Values1"], 2)
  • X.groupby(quartiles).apply(lambda x: x.min())

Pivot Table

  • X.pivot_table(values="AvgSizeOfTrip", index="Gender", columns="Group")

Pivot table with aggregation

  • X.pivot_table(values="AvgSizeOfTrip", index="Gender", columns="Group", aggfunc=np.sum)
  • X.pivot_table(index="Gender", columns="Group", aggfunc={"Col1": np.sum, "Col2": np.mean}) -- different columns with different operations (the dict keys replace values=).

Pivot table with count

  • pd.crosstab(X.gender, X.group)  -- This just counts the frequency.
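
A small end-to-end example of split-apply-combine and pivoting (the DataFrame is invented for illustration):

import numpy as np
import pandas as pd

X = pd.DataFrame({
    "Keys1": ["a", "a", "b", "b"],
    "Keys2": ["x", "y", "x", "y"],
    "Values1": [1.0, 2.0, 3.0, 4.0],
})

print(X["Values1"].groupby(X["Keys1"]).mean())        # split by Keys1, apply mean, combine
print(X.groupby(["Keys1", "Keys2"]).size())           # group sizes
print(X.pivot_table(values="Values1", index="Keys1",
                    columns="Keys2", aggfunc=np.sum)) # pivot with aggregation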


6. Data Analysis - Data Wrangling

Hierarchical Indexing in a Pandas Series

  • Create multi level indexing in pandas series
    • pd.Series(np.random.rand(9), index=[['a','b','c','d','a','b','c','d','a'],[1,2,4,4,5,2,7,8,8]],)
  • Access the values
    • X['a']  -- using exact index
    • X[:, 1]  -- using slice
    • X.loc['a':'b', 1] -- using slices (requires a sorted index)
  • Access Index
    • X.index  -- return tuple
  • Changing multi-level indexing to a row/column representation
    • X.unstack() -- pivots the inner index level into columns.
    • X.unstack(level=-1) -- the first level stays as the rows and the second (innermost) level becomes the columns.
    • X.unstack(level=0) -- the first level becomes the columns and the second level becomes the rows.
  • Changing a dataframe to a single pandas series
    • X.stack() -- back to a multi-level index: the rows become the 1st level and the columns become the 2nd level.
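
A runnable version of the hierarchical-Series idea above (the index labels are illustrative):

import numpy as np
import pandas as pd

X = pd.Series(np.random.rand(9),
              index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(X["a"])       # outer level
print(X[:, 2])      # inner level, across all outer labels
df = X.unstack()    # to a row/column DataFrame
print(df.stack())   # and back to a MultiIndex Series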

Hierarchical indexing in a Pandas Dataframe

  • Create multi-level indexing for both rows and columns
    • X = pd.DataFrame(np.random.rand(9, 6),
    •                  index=[['a','b','c','d','a','b','c','d','a'], [1,2,4,4,5,2,7,8,8]],
    •                  columns=[["col1","col1","col2","col2","col3","col3"], ['c1','c2','c3','c4','c5','c6']])
  • Access the names of the rows and columns
    • X.index.names
    • X.columns.names
  • Access row/column values
    • X.loc[["a"], ["col1", "col2"]]

Swap multilevel index

  • X1.swaplevel("Row1", "Row2") -- the Row1 and Row2 index levels are interchanged.

Sort index

  • X1.sort_index(level=0)
  • X1.sort_index(level=1)

Aggregation of data

  • X1.sum(level='Row1') -- summation over the 'Row1' index level
  • X1.sum(level='Row2')
  • X1.sum(level="ColName1", axis=1) -- summation over the 'ColName1' column level
  • X1.sum(level="ColName2", axis=1)

Changing column to index

  • X.set_index(['b', 'a']) 
  • X.set_index(['b', 'a'], drop=False)

Change index to column

  • X.reset_index()

Merging data sources

If the frames share a common column - INNER JOIN

  • pd.merge(df1, df2) -- joins on the common column; the remaining columns are laid out side by side. (OR)
  • pd.merge(df1, df2, how="inner")

If the key columns have different names on the two sides

  • pd.merge(df1, df2, left_on="lkey", right_on="rkey")

OUTER JOIN

  • pd.merge(df1, df2, how="outer")  -- INNER JOIN + uncomon element.

LEFT/RIGHT Join

  • pd.merge(df1, df2, how="left")  -- INNER JOIN + uncommon element of left dataframe
  • pd.merge(df1, df2, how="right")  -- INNER JOIN + uncommon element of right dataframe.

It is possible to have more than one column as key.

  • pd.merge(df1, df2, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])
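
A compact demonstration of the join types (df1 and df2 are invented; lkey/rkey are assumed key names):

import pandas as pd

df1 = pd.DataFrame({"lkey": ["a", "b", "c"], "v1": [1, 2, 3]})
df2 = pd.DataFrame({"rkey": ["b", "c", "d"], "v2": [4, 5, 6]})

# the keys live in differently named columns, so name both sides
inner = pd.merge(df1, df2, left_on="lkey", right_on="rkey", how="inner")
outer = pd.merge(df1, df2, left_on="lkey", right_on="rkey", how="outer")
left = pd.merge(df1, df2, left_on="lkey", right_on="rkey", how="left")
print(inner, outer, left, sep="\n\n")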

Concatenate

  • pd.concat([X, Y], axis=1) -- merged side by side
  • pd.concat([X, Y], axis=0) -- stacked top to bottom.
  • pd.concat([X, Y], ignore_index=True) -- stacked top to bottom, ignoring the original row indexes; pandas creates its own index (0, 1, 2, ...)
  • pd.concat([X, Y], ignore_index=False) -- the existing indexes of the original dataframes are retained.

Combining two dataframe to fill missing values

  • X1.combine_first(X2) -- values from X2 fill the NaN values in X1.

Pivot and Melt function

  • X.pivot(index="c1", columns="c2", values="col3") -- 'c1' becomes the rows, 'c2' the columns, and 'col3' the values across 'c1' and 'c2'
  • X.pivot(index="c1", columns="c2", values=["col3", "col4"]) -- the 'col3' and 'col4' values are stacked side by side in the output

Melt function

  • pd.melt(X, ["col1"]) -- three columns result: "col1", variable (the names of all other columns), and value (their values).
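
Pivot and melt are near-inverses; a round-trip sketch with invented column names:

import pandas as pd

X = pd.DataFrame({"c1": ["r1", "r1", "r2", "r2"],
                  "c2": ["A", "B", "A", "B"],
                  "col3": [1, 2, 3, 4]})

wide = X.pivot(index="c1", columns="c2", values="col3")  # long -> wide
print(wide)

long = pd.melt(wide.reset_index(), ["c1"])               # wide -> long again
print(long)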

Thursday 29 September 2022

5. Data Analysis - Data Preparation

Check Missing data

  • X.isnull()
  • X.notnull()

Delete Missing Data

  • X.dropna() -- deletes any row that has missing data
  • X.dropna(how="all") -- deletes a row only if all of its columns are missing.
  • X.dropna(axis=1) -- deletes any column that has missing data
  • X.dropna(axis=1, how="all") -- deletes a column only if all of its values are missing.

Impute missing value

  • X.fillna(99) -- fills every missing value with the same constant.
  • X.fillna({"col1": 99, "col2": 10}) -- NA values in 'col1' are imputed with 99 and in 'col2' with 10.
  • X.fillna(method='ffill') -- imputes with the previous value in the same column.
  • X.fillna(method='ffill', limit=2) -- at most 2 values per column are imputed.
  • X.fillna(X.mean()) -- imputing with statistical values.
  • X.fillna(X.median())
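
A short sketch of the imputation options above, on an invented frame with holes:

import numpy as np
import pandas as pd

X = pd.DataFrame({"col1": [1.0, np.nan, 3.0, np.nan],
                  "col2": [np.nan, 20.0, np.nan, 40.0]})

print(X.fillna(99))                        # one constant everywhere
print(X.fillna({"col1": 99, "col2": 10}))  # per-column constants
print(X.fillna(method="ffill"))            # carry the previous value forward
print(X.fillna(X.mean()))                  # column means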

Handling duplicate values

  • X.duplicated() -- returns a boolean array flagging rows that duplicate an earlier row.
    • If two rows are identical, the 2nd one is marked as the duplicate.
  • X.drop_duplicates() -- removes the duplicates (the 2nd occurrence is dropped).
  • X.duplicated(["col1"])
  • X.drop_duplicates(["col1"])
  • X.drop_duplicates(["col1"], keep="last") -- all duplicates are removed except the last one.

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

  • mapFunction = {"Floor1": "Lower Floor", "Floor2": "Lower Floor", "Floor3": "Upper Floor", "Floor4": "Upper Floor"}
  • X["Floor Group"] = X["Floor Number"].map(mapFunction)

Replace

  • X.replace(-999, np.nan)
  • X.replace([-999, -1000], [np.nan, 1000]) -- replace -999 with NaN and -1000 with 1000
  • X.replace({-999: np.nan, -1000: 1000}) -- the same, using a dictionary

Update row and column names

  • X.columns.map(lambda x: x.upper()) -- works on the columns Index object
  • X.index.map(lambda x: x.upper()) -- likewise for the row index
  • X.rename(index={'a': 'A', 'b': 'B'}, columns={'col1': 'COL1', 'col2': 'COL2'}) -- works on the dataframe.

Modifying the same object in place (inplace=True)

  • X.rename(index={'a': 'A', 'b': 'B'}, columns={'col1': 'COL1', 'col2': 'COL2'}, inplace=True)

Binning - CUT and QCUT  function

  • salary=[10,20,30,40,50,60]
  • salarybins=[0,20,40,60]
  • pd.cut(salary, salarybins) -- creates the bins (0, 20], (20, 40], (40, 60]
  • pd.cut(salary, 8) -- 8 equal-width bins are created.
  • pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1]) -- numbers from 0 to 1 are passed; they mark quantile boundaries.
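
Running the binning lines end to end (a sketch using the salary list above):

import pandas as pd

salary = [10, 20, 30, 40, 50, 60]
salarybins = [0, 20, 40, 60]

print(pd.cut(salary, salarybins))              # bins (0, 20], (20, 40], (40, 60]
print(pd.cut(salary, 8))                       # 8 equal-width bins
print(pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1]))  # quantile-based bins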

Identify outliers

  • X.describe() -- values far above the 75th percentile (or far below the 25th) are outlier candidates.

Taking Samples

  • data.sample(3) -- draws 3 random rows from the dataframe
  • data.sample(8, replace=True) -- draws samples with replacement (needed when n exceeds the number of rows).

Creating dummy variables

  • X1 = pd.get_dummies(X["col1"], prefix="dummy")
  • X.join(X1) -- joining the dummies back to the original dataframe.



4. Data Analysis - Data Ingestion

Data Ingestion

  • pd.read_csv("XYZ.csv")
  • pd.read_table("XYZ.csv", sep=",")
  • pd.read_table("XYZ.csv", sep=",", header=None) -- pandas will generate the column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) -- provide the column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="Names") -- make one column the row labels.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["Names1", "Names2"]) -- make two columns the row labels.

Check NULL values in a dataframe

  • X.isnull()

Read particular strings as NULL values from a file

  • pd.read_csv("XYZ.csv", sep=",", na_values=["d", "e"]) -- the strings 'd' and 'e' are read as NULL.
  • pd.read_csv("XYZ.csv", sep=",", na_values={"Col1": ["d", "e"], "Col2": ["a"]}) -- 'd' and 'e' in 'Col1' and 'a' in 'Col2' are read as NULL.

Skipping rows while reading

  • pd.read_csv("XYZ.csv", sep=",", na_values=["d", "e"], skiprows=[3, 5]) -- skips rows 3 and 5 while reading

Limiting how many rows are displayed or read

  • pd.options.display.max_rows = 10
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d", "e"]) -- only 10 rows are displayed: the first 5 and the last 5.
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d", "e"], nrows=5) -- only the first 5 rows are read.

Reading a chunk of file 

  • fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d", "e"], chunksize=5) -- every chunk holds 5 rows.
  • for temp_chunk in fileChunk:
    • print(temp_chunk) -- print each chunk, not the reader object.
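
A common pattern is to aggregate while streaming the chunks, so the whole file never sits in memory. A sketch (the file name follows the examples above; the row count is an assumed aggregation):

import pandas as pd

total_rows = 0
for chunk in pd.read_csv("XYZ.csv", sep=",", chunksize=5):
    total_rows += len(chunk)  # e.g. count rows without loading the whole file
print(total_rows)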

Writing to a CSV file

  • X.to_csv("XYZ.csv") -- NaN values are written as empty strings
  • X.to_csv("XYZ.csv", na_rep="NULL") -- NaN values are written as NULL
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, header=False) -- additionally, no row or column labels are written.
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, columns=["col1", "col2"]) -- only 2 columns are written.

Reading JSON, HTML, Pickle file

  • pd.read_json("iris.json") -- read a JSON file
  • X = pd.read_html("file.html") -- returns a list of dataframes, one per HTML table found
  • X.to_pickle('file') -- store to a pickle file
  • Y = pd.read_pickle('file') -- read from a pickle file.


Wednesday 28 September 2022

3. Data Analysis - Pandas Dataframe

Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

  • data1 = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"], "Name": ["ABC", "DEF"]}
  • X = pd.DataFrame(data1) -- dataframe creation with all features
  • X = pd.DataFrame(data1, columns=["State", "Year"]) -- dataframe creation with 2 features
  • X = pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- a column absent from the data is created but filled with NaN.

Dataframe creation using dictionary with column/index values

  • data1 = {"State": {"one": "Karnataka", "two": "Jharkhand"}, "Year": {"one": "2021", "two": "2022"}, "Name": {"one": "ABC", "two": "DEF"}}
  • X = pd.DataFrame(data1)
  • X = pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

  • data1 = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"], "Name": ["ABC", "DEF"]}
  • rowIndex = ["one", "two"]
  • X = pd.DataFrame(data1, columns=["State", "Year"], index=rowIndex)

Index and Column Update

  • Access row/columns
    • X.columns
    • X.index
  • Accessing Column Names
    • colName = X.columns
  • Access one particular column
    • X["State"]
    • X.State
  • Assign the same value to an entire column
    • X["State"] = "ABC"
  • Assign different values across a column
    • X["State"] = ["AP", "HP", "KA", "TN"]
  • Add a column
    • X["newColumn"] = X["State"] > "AP" -- a boolean array is assigned.
  • Delete a column
    • del X["newColumn"]
  • Check if row/column is present in dataframe
    • "newColumn" in X.columns
    • "one" in X.index

Index objects are immutable

Row reindexing - Use reindex to Change order of row

  • X.index[0] = 10 -- Error: an Index cannot be modified in place
  • X1 = X.reindex([10, 20, 30, 0, 1, 1, 3]) -- row reindexing; note that a new dataframe object is created.

Column reindexing - Use reindex to Change order of columns

  • X1 = X.reindex(columns=["Country", "State"]) -- column reindexing

Index values can be repeated

  • rowIndex = [0, 1, 1, 1, 3]

Transpose

  • X.T

Change the heading of columns and rows

  • X.columns.name = "ColumnName"
  • X.index.name = "IndexName"

Element access from a dataframe

Pandas Series object

  • X["row"]
  • X[0]
  • X[["row1", "row3"]]
  • X[[0, 2]]
  • X[X > 50]
  • X["row1":"row3"]
  • X[0:2]

Pandas Dataframe

  • X["col1"] -- extracts a single column
  • X[["col1", "col2"]] -- a list of labels extracts several columns
  • X[1:3] -- a slice selects ROWS 1 to 2, not columns: single labels select columns, slices select rows.

loc/iloc/at syntax

  • X.loc["row1"]
  • X.iloc[0]
  • X.loc["row1", ["col1", "col3"]]
  • X.iloc[0, [0, 2]]
  • X[X > 5]
  • X.loc[:"row3"]
  • X.iloc[:3]
  • X.loc[:"row3", "col1":]
  • X.iloc[:3, 2:]
  • X.at["row3", "col2"] -- a single value is extracted with at
  • X.iat[3, 2]

Chained (staged) indexing

  • X.iloc[:3, :2].iloc[:, [0, 1]] -- accessing in stages.

Arithmetic operations

  • X1 - X2 -- element-wise, index-aligned subtraction
  • X1.sub(X2) -- same as X1 - X2
  • X1.rsub(X2, axis="index") -- same as X2 - X1
  • X1.add(X2, fill_value=99) -- NaN entries are treated as 99.

Apply

  • X1.apply(lambda x: x.max()) -- returns the max value of each column
  • X1.apply(lambda x: x.max(), axis="columns") -- the max value of each row is calculated.

How to get the min/max of each row

  • def minmax(x):
    • return pd.Series([x.min(), x.max()], index=["min", "max"])
  • X1.apply(minmax, axis="columns") -- returns the min/max of each row.

Applymap

  • X.applymap(lambda x: x * 10) -- applymap works element-wise; here every element is multiplied by 10.

Sorting

  • X.sort_index() -- sorts by the row index
  • X.sort_index(axis=1) -- sorts by the column index
  • X.sort_values() -- sorts a Series by its values
  • X.sort_values(by="col2") -- sorts a dataframe by the 'col2' column
  • X.sort_values(by=["col2", "col1"])

Rank

  • X.rank() -- column-wise ranking
  • X.rank(ascending=True)
  • X.rank(axis=1) -- row-wise ranking.

Summarizing data

  • X.sum()
  • X.sum(axis=1)
  • X.describe()
  • X.corr()
  • X["col1"].corr(X["col2"])
  • X.cov()
  • X["col1"].cov(X["col2"])
  • Others: sum, min, max, quantile, mean, median, kurt, skew, cumsum, cummax, cummin, etc.
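
To close, a small frame exercising the summaries above (the values are random):

import numpy as np
import pandas as pd

X = pd.DataFrame(np.random.randn(6, 3), columns=["col1", "col2", "col3"])

print(X.sum())                    # per-column sums
print(X.sum(axis=1))              # per-row sums
print(X.describe())               # count, mean, std, quartiles
print(X.corr())                   # pairwise correlations
print(X["col1"].corr(X["col2"]))  # one pair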

