Thursday, 29 September 2022

5. Data Analysis - Data Preparation

Check Missing data

  • X.isnull()
  • X.notnull()

Delete Missing Data

  • X.dropna() -- deletes any row that has missing data.
  • X.dropna(how="all") -- deletes a row only if all of its values are missing.
  • X.dropna(axis=1) -- deletes any column that has missing data.
  • X.dropna(axis=1, how="all") -- deletes a column only if all of its values are missing.

Impute missing value

  • X.fillna(99) -- fills every missing value with the same constant.
  • X.fillna({"col1":99, "col2":10}) -- 'col1' NA values are imputed with 99 and 'col2' NA values with 10.
  • X.fillna(method='ffill') -- each missing value is imputed with the previous value in the same column.
  • X.fillna(method='ffill', limit=2) -- at most 2 values per column are imputed.
  • X.fillna(X.mean()) -- imputing with statistical values.
  • X.fillna(X.median())
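A minimal sketch of the imputation options above, on a small made-up frame (column names `col1`/`col2` are illustrative; `.ffill()` is the modern spelling of `fillna(method='ffill')`):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing values.
X = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                  "col2": [np.nan, 5.0, np.nan]})

filled_const = X.fillna({"col1": 99, "col2": 10})  # per-column constants
filled_ffill = X.ffill(limit=2)                    # forward fill, at most 2 per column
filled_mean = X.fillna(X.mean())                   # impute with each column's mean
```

Note that a value with no predecessor in its column (here the first `col2` entry) stays NaN after a forward fill.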

Handling duplicate values

  • X.duplicated() -- returns a boolean Series marking duplicated rows.
    • If two rows are identical, the 2nd row is flagged as the duplicate.
  • X.drop_duplicates() -- removes the duplicates (the 2nd occurrence is dropped).
  • X.duplicated(["col1"]) -- considers only 'col1' when checking for duplicates.
  • X.drop_duplicates(["col1"])
  • X.drop_duplicates(["col1"], keep="last") -- keeps the last occurrence of each duplicate instead of the first.
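The behaviour above can be checked on a tiny frame (the data is made up for illustration):

```python
import pandas as pd

X = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 1, 2]})

mask = X.duplicated()                                 # second identical row is flagged
deduped = X.drop_duplicates()                         # first occurrence survives
last_kept = X.drop_duplicates(["col1"], keep="last")  # keep last row per col1 value
```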

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

  • mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
  • X["Floor Group"] = X["Floor Number"].map(mapFunction)
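A small sketch of the mapping above (the frame and its values are illustrative; the dictionary is renamed `floor_map` here since it is a dict, not a function):

```python
import pandas as pd

X = pd.DataFrame({"Floor Number": ["Floor1", "Floor3", "Floor2"]})
floor_map = {"Floor1": "Lower Floor", "Floor2": "Lower Floor",
             "Floor3": "Upper Floor", "Floor4": "Upper Floor"}

# Each floor label is mapped to its group; unmapped labels would become NaN.
X["Floor Group"] = X["Floor Number"].map(floor_map)
```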

Replace

  • X.replace(-999, np.nan)
  • X.replace([-999, -1000], [np.nan, 1000]) --- replace -999 with np.nan and -1000 with 1000
  • X.replace({-999: np.nan, -1000: 1000}) --- the same replacement expressed as a dictionary
Update Row and Column names
  • X.columns.map(lambda x: x.upper())  -- works on the columns Index object
  • X.index.map(lambda x: x.upper()) -- works on the row Index object
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works on the dataframe.
Changing the same object in place ( inplace=True )
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, inplace=True)

Binning - CUT and QCUT  function

  • salary=[10,20,30,40,50,60]
  • salarybins=[0,20,40,60]
  • pd.cut(salary, salarybins) -- create bins 0-20, 20-40, 40-60
  • pd.cut(salary, 8) -- 8 bins will be created.
  • pd.qcut(salary, [0,0.1,0.5,0.9,1]) -- numbers from 0 to 1 are passed; these are quantile boundaries, so each bin holds that share of the data.
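The cut/qcut calls above, run on the sample salaries from the notes:

```python
import pandas as pd

salary = [10, 20, 30, 40, 50, 60]

bins = pd.cut(salary, [0, 20, 40, 60])                  # explicit edges: (0,20], (20,40], (40,60]
eight_bins = pd.cut(salary, 8)                          # 8 equal-width bins
quantile_bins = pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1])  # edges given as quantiles
```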
Identify outliers
  • X.describe() -- values far above the 75th percentile (for example, more than 1.5 times the IQR beyond it) can be outliers.

Taking Samples

  • data.sample(3) -- draws 3 random rows from the dataframe
  • data.sample(8, replace=True) -- samples with replacement, so the same row may be drawn more than once

Creating dummy variables

  • X1 = pd.get_dummies(X["col1"], prefix="dummy")
  • X = X.join(X1) -- joining the dummy columns back to the original dataframe.
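A minimal dummy-variable sketch (column `col1` and its values are made up):

```python
import pandas as pd

X = pd.DataFrame({"col1": ["a", "b", "a"]})
X1 = pd.get_dummies(X["col1"], prefix="dummy")  # one indicator column per category
X = X.join(X1)                                  # attach back to the original frame
```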



4. Data Analysis - Data Ingestion

Data Ingestion


  • pd.read_csv("XYZ.csv")
  • pd.read_table("XYZ.csv", sep=",")
  • pd.read_table("XYZ.csv", sep=",", header=None) -- pandas will generate default column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) --- provide column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="Names") --- make one column the row labels.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["Names1", "Names2"]) --- make two columns the row labels.
Check NULL values in a dataframe
  • X.isnull()
Read particular characters as NULL values from a file
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"]) -- the characters 'd' and 'e' will be read as NULL.
  • pd.read_csv("XYZ.csv", sep=",", na_values={"Col1":["d","e"], "Col2":["a"]}) -- 'd' and 'e' from 'Col1' and 'a' from 'Col2' will be read as NULL.
Skipping rows while reading large files
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) -- rows 3 and 5 will be skipped while reading.
Limiting rows displayed and rows read
  • pd.options.display.max_rows = 10
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5])  -- the whole file is still read, but only 10 rows are displayed: the first 5 and the last 5.
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], nrows=5)  -- only the first 5 rows are read.

Reading a chunk of file 

  • fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], chunksize=5) -- every chunk will have 5 rows.
  • for temp_chunk in fileChunk:
    • print(temp_chunk)
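Chunked reading can be sketched with an in-memory buffer standing in for XYZ.csv (the data here is made up):

```python
import io
import pandas as pd

# io.StringIO stands in for a real "XYZ.csv" on disk.
csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n11,12\n"

rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    rows += len(chunk)  # each chunk is a DataFrame of up to 5 rows
```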

Writing to a CSV file

  • X.to_csv("XYZ.csv") -- NaN values will be written as empty strings
  • X.to_csv("XYZ.csv", na_rep="NULL") -- NaN values will be written as NULL
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, header=False) -- NaN values written as NULL; no row or column labels are written
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, columns=["col1", "col2"]) -- only the 2 listed columns are written

Reading JSON, HTML, Pickle file

  • pd.read_json("iris.json") -- read a JSON file
  • X = pd.read_html("*.html") -- returns a list of dataframes, one per HTML table
  • X.to_pickle('file') -- stores to a pickle file
  • Y = pd.read_pickle('file') -- reads from a pickle file


Wednesday, 28 September 2022

3. Data Analysis - Pandas Dataframe

 Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • X= pd.DataFrame(data1)  ------- dataframe creation with all features
  • X= pd.DataFrame(data1, columns=["State", "Year"]) ---- dataframe creation with 2 features
  • X= pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- the unknown column is created and filled with NaN.

Dataframe creation using a nested dictionary ( with column/index values )

  • data1 = {"State":{'one':"Karnataka", 'two':"Jharkhand"}, "Year":{'one':"2021", 'two':"2022"}, "Name":{'one':'ABC', 'two':'DEF'}}
  • X=pd.DataFrame(data1)
  • X=pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • rowIndex=['one', 'two']
  • X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)
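The creation patterns above can be combined in one sketch (the data echoes the notes):

```python
import pandas as pd

data1 = {"State": ["Karnataka", "Jharkhand"],
         "Year": ["2021", "2022"],
         "Name": ["ABC", "DEF"]}

# Select two features and supply row labels explicitly.
X = pd.DataFrame(data1, columns=["State", "Year"], index=["one", "two"])
```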

Index and Column Update

  • Access row/columns
    • X.columns
    • X.index
  • Accessing Column Names
    • colName = X.columns
  • Access one particular column
    • X["State"]
    • X.State
  • Assign the same value to an entire column
    • X["State"] = "ABC"
  • Assign different values to a column
    • X["State"] = ["AP", "HP", "KA", "TN"]
  • Add a column
    • X['newColumn'] = X['State']>"AP" -- Boolean array will be returned.
  • Delete a column
    • del X["newColumn"]
  • Check if row/column is present in dataframe
    • "newColumn" in X.columns
    • "one" in X.index

Index objects are immutable

Row reindexing - Use reindex to change the order of rows

  • X.index[0] = 10 --- Error, because Index objects are immutable.
  • X1 = X.reindex([10,20,30,0,1,1,3])  --- Row reindexing. Note that a new dataframe object is created.

Column reindexing - Use reindex to Change order of columns

    • X1 = X.reindex(columns=["Country", "State"]) -- column reindexing

    Index value can be repeated

    • rowIndex=[0,1,1,1,3]

    Transpose

    • X.T

    Change the Heading of Columns 

    • X.columns.name = "ColumnName"
    • X.index.name = "IndexName"

    Element access from dataframe

    Pandas Series Object

    • X["row"]
    • X[0]
    • X[["row1","row3"]]
    • X[[0,2]]
    • X[X>50]
    • X["row1":"row3"]
    • X[0:2]

    Pandas Dataframe

    • X["col1"] -- extracts a column
    • X[["col1", "col2"]] -- extracts multiple columns
    • X[1:3] -- rows 1 to 2 are extracted: slices select rows while single labels select columns, which is the usual source of confusion

    Loc/At syntax

    • X.loc["row1"]
    • X.iloc[0]
    • X.loc["row1", ["col1", "col3"]]
    • X.iloc[0, [0,2]]
    • X[X>5]
    • X.loc[:"row3"]
    • X.iloc[:3]
    • X.loc[:"row3", "col1":]
    • X.iloc[:3, 2:]
    • X.at["row3", "col2"]   ---- a single scalar value is extracted with .at
    • X.iat[3, 2]
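Label-based (`.loc`/`.at`) versus position-based (`.iloc`/`.iat`) access, on a made-up frame:

```python
import pandas as pd

X = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=["row1", "row2"])

by_label = X.loc["row1", "col2"]  # label based
by_pos = X.iloc[0, 1]             # position based, same cell
single = X.at["row2", "col1"]     # fast scalar lookup
```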

    Hierarchical indexing

    • X.iloc[:3, :2].iloc[:, [0, 1]] -- accessing in stages; each stage returns a dataframe that can be indexed again.

    Arithmetic operations

    • X1 - X1frame ---- element-wise subtraction, aligned on row/column labels
    • X1.sub(X1frame) --- same as X1 - X1frame
    • X1.rsub(X1frame, axis="index") ---- reversed operands: same as X1frame - X1
    • X1.add(X1frame, fill_value=99) -- where a value is missing in one frame, 99 is used instead of NaN

    Apply

    • X1.apply(lambda x: x.max()) -- will return max value from each column
    • X1.apply(lambda x:x.max(), axis="columns")  -- max value for each row is calculated.

    How to get min/max for each row

    • def minmax(x):
      • return pd.Series([x.min(), x.max()], index=["min", "max"])
    • X1.apply(minmax, axis = "columns")  -- will return min/max for each row.
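The minmax helper above, run on a small illustrative frame:

```python
import pandas as pd

def minmax(x):
    # Return both statistics as a labelled Series so apply builds a frame.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

X1 = pd.DataFrame({"a": [1, 4], "b": [2, 3]})
result = X1.apply(minmax, axis="columns")  # one min/max pair per row
```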

    Applymap

    • X.applymap(lambda x: x*10)  -- applymap works for each element of array and in this case multiply each element by 10.

    Sorting

    • X.sort_index() -- sorting of row index
    • X.sort_index(axis =1 ) -- sorting of column index 
    • X.sort_values() -- will sort based on values. -- for a series object
    • X.sort_values(by="col2") -- will be sorted based on col='col2' -- for dataframe object.
    • X.sort_values(by=["col2", "col1"])

    Rank

    • X.rank()  -- column wise ranking
    • X.rank(ascending=True)
    • X.rank(axis=1) --row wise ranking.

    Summarizing data

    • X.sum()
    • X.sum(axis=1)
    • X.describe()
    • X.corr()
    • X["col1"].corr(X["col2"])
    • X.cov()
    • X["col1"].cov(X["col2"])
    • Others ( sum, min,max, quantile,mean, median, kurt, skew, cumsum, cummax, cummin etc)



    Tuesday, 27 September 2022

    2. Data Analysis - Pandas Series

     Series Creation

    Series creation with a default index
    • X = pd.Series([10,20,30,40])  -- by passing a list
    • X.index, X.values
    • print(X[0], X[[0,2,3]], X[1:3]) --- access the series values
    Series creation with a labeled index
    • X = pd.Series([10,20,30,40], index=['l1', 'l2', 'l3', 'l4'])
    • X['l1'], X[['l1', 'l2']], X['l2':'l4']
    Series creation using a dictionary
    • pd.Series(dict1) -- in the same order as the dictionary
    • pd.Series(dict1, index=['k2', 'k1'])  -- change the order from what the dictionary specifies
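A quick sketch of Series creation from a dictionary (the keys are illustrative):

```python
import pandas as pd

d = {"k1": 10, "k2": 20}

s = pd.Series(d)                       # keeps the dictionary's key order
s2 = pd.Series(d, index=["k2", "k1"])  # reorders via an explicit index
```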

    Series Filter

    • X[X<0]   -- returns all negative elements.

    Mathematical operation

    • obj + obj2
      • if index is not matching then it will be NAN

    check NULL value

    • X.isnull() -- returns a boolean array
    • X.notnull() -- returns a boolean array

    change the index

    • X.index = ['k1', 'k2']
    • X.index.name = "indexName"


    Monday, 26 September 2022

    1. Data Analysis - NumPy Operations

     Numpy Operations  

    Numpy Array Creation -
    • X = np.array([10,20,30,40])
    • X = np.zeros(10)
    • X = np.ones(10)
    • X = np.empty(10)
    • X = np.arange(1, 11)  ---- will create a 10 element array holding 1 to 10
    Array creation with a data type
    • X = np.array([10,20,30,40], dtype=np.float64)
    Changing data type
    • X = X.astype(np.int32)
    Arithmetic operations
    • X = np.array([[10,20,30,40],[50,60,70,80]]) -- element-wise arithmetic (X+X, X*X) works on equal-shaped arrays
    Indexing and slicing 
    • One dimension
      • X[1] ----- will return the element at index 1
      • X[2:4] ------ will return X[2], X[3]

    • Multiple dimension
      • X[0] ------- Will return 0th row
      • X[0][1] -- Will return 0th row and 1st column.
    Propagation and Broadcast
    • X[2:4] = 100   ---------- X[2], X[3] will be assigned 100.
    Row slicing & Column slicing using booleans
    • Y = np.array([True, True, False, False, True])
    • X[Y]      -- row slicing
    • X[:, Y]     -- column slicing
    Manipulating all elements
    • X[X<0] = 0  --- any element less than 0 will be assigned the value 0
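Boolean-mask assignment in one line (the array values are made up):

```python
import numpy as np

X = np.array([5, -3, 8, -1])
X[X < 0] = 0  # every negative element is replaced in place
```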
    Fancy Indexing
    • Get rows
      • X[1]
      • X[[1,3,5]]

    • Get Columns
      • X[:, 1:4]
      • X[:, [1,4,5]] 

    • Get Multiple rows and columns
      • X[[1,3,4], [2,4,5]]
    Fancy Indexing by chaining
    • X[[1,4,5]][:, [0,1]]
    Reshape
    • X.reshape(8,4)
    • X.transpose()

    Universal Function

    • One input
      • np.sqrt(X)
      • np.exp(X)
    • Multiple Input
      • np.maximum(X,Y)
      • np.add(X,Y)
    • Finding NaN values
      • np.isnan(X)  -- output is a boolean array like [False, False, ..., True]; True marks a NaN value.
    • Check whether any element is NaN
      • np.isnan(X).any()

    Vectorization

    • Mesh Grid
      • Xaxis, Yaxis = np.meshgrid(X,Y)
    • np.where
      • K = np.where(Z, X, Y)
      • K = np.where(X>0, Y ,Z )
        • np.where(X>99, X, 100) -> combination of scaler and numpy array
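The scalar/array mix from the last bullet, on a tiny array:

```python
import numpy as np

X = np.array([120, 50, 200])

# Keep values above 99 as-is; replace the rest with the scalar 100.
K = np.where(X > 99, X, 100)
```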

    Statistical Functions

    • np.mean(X) --- Mean of all element of the array.
    • np.std(X)
    • np.sum()
    • X.mean(axis=0) -- Mean of each column
    • X.mean(axis=1) -- Mean of each row.

    Boolean arrays

    • X.any() - IF any value is TRUE
    • X.sum() - Count of all TRUE value
    • X.all() - If all value is TRUE
    • X>0 -> returns a boolean array.

    Sorting

    • np.sort(X, axis = None) -- sort all element of array and return one dimensional array
    • np.sort(X, axis = 0) - Column wise sorting
    • np.sort(X, axis = 1 ) - Row wise sorting

    Remove duplicates

    • np.unique(X) --  removes the duplicates

    Check common values between two arrays

    • np.in1d(X,Y) -- returns a boolean array with True indicating a common value
    • np.union1d(X,Y) -- returns the union of X and Y

    Saving to a file

    Saving one array in one file

    • np.save("OneDimension", X)
    • np.load("OneDimension.npy")

    Save multiple array using key/value

    • np.savez("Onedimension", key1=X, key2=Y) -- written as Onedimension.npz
    • k = np.load("Onedimension.npz")
    • k["key1"], k["key2"]
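Saving and reloading keyed arrays, written to a temporary directory so the sketch is self-contained:

```python
import os
import tempfile

import numpy as np

X = np.array([1, 2, 3])
Y = np.array([4, 5, 6])

path = os.path.join(tempfile.mkdtemp(), "Onedimension.npz")
np.savez(path, key1=X, key2=Y)   # one .npz file holding both arrays

k = np.load(path)
restored = k["key1"]             # access by the keyword used at save time
```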

    Other functionality

    • np.dot(X,Y)
    • k = inv(Y) ---- from numpy.linalg import inv



