3. Data Analysis - Pandas Dataframe ~ Python and Machine Learning Blog

Wednesday, 28 September 2022

3. Data Analysis - Pandas Dataframe


Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

import pandas as pd

data1 = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"], "Name": ["ABC", "DEF"]}
X = pd.DataFrame(data1)  # DataFrame with all columns
X = pd.DataFrame(data1, columns=["State", "Year"])  # DataFrame with two of the columns
X = pd.DataFrame(data1, columns=["State", "Year", "JUNK"])  # "JUNK" is not a key in data1, so that column is filled with NaN
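The three constructor calls above can be checked with a short sketch. It reuses the sample dictionary from this section; the variable names all_cols, two_cols and with_junk are only illustrative.

```python
import pandas as pd

# Sample data from this section
data1 = {
    "State": ["Karnataka", "Jharkhand"],
    "Year": ["2021", "2022"],
    "Name": ["ABC", "DEF"],
}

all_cols = pd.DataFrame(data1)                                      # every key becomes a column
two_cols = pd.DataFrame(data1, columns=["State", "Year"])           # only the listed columns
with_junk = pd.DataFrame(data1, columns=["State", "Year", "JUNK"])  # unknown key

print(all_cols.shape)
print(list(two_cols.columns))
print(with_junk["JUNK"].isna().all())  # the unknown column holds only NaN
```

Asking for a column that is not in the dictionary does not raise an error; pandas simply creates it with missing values.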

Dataframe creation using dictionary with column/index values

data1 = {"State": {"one": "Karnataka", "two": "Jharkhand"}, "Year": {"one": "2021", "two": "2022"}, "Name": {"one": "ABC", "two": "DEF"}}  # outer keys become columns, inner keys become the row index
X=pd.DataFrame(data1)
X=pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

data1 = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"], "Name": ["ABC", "DEF"]}
rowIndex=['one', 'two']
X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)
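As a rough comparison of the two ways of supplying row labels, the sketch below builds the same frame from a nested dictionary and from a flat dictionary plus an index argument. The names nested, flat, from_nested and from_flat are made up for the example.

```python
import pandas as pd

# Nested dict: outer keys are columns, inner keys are row labels
nested = {
    "State": {"one": "Karnataka", "two": "Jharkhand"},
    "Year": {"one": "2021", "two": "2022"},
}
from_nested = pd.DataFrame(nested)

# Flat dict of lists, with the row labels passed separately
flat = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"]}
from_flat = pd.DataFrame(flat, index=["one", "two"])

print(from_nested)
print(from_flat)
```

Both frames should end up with the row labels one and two and the same cell values.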

Index and Column Update

Access row/columns

X.columns
X.index

Accessing Column Names

colName = X.columns

Access one particular column

X["State"]
X.State

Assign the same value to every row of a column

X["State"] = "ABC"

Assign a different value to each row of a column

X["State"] = ["AP", "HP", "KA", "TN"]  # the list length must equal the number of rows in X

Add a column

X["newColumn"] = X["State"] > "AP"  # the comparison returns a boolean Series, which is stored as the new column

Delete a column

del X["newColumn"]
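The column operations above can be tried end to end on a small frame. This is a minimal sketch with a hypothetical two-row frame; the Country column and the helper names flags, length_error and had_column are not from the original.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
X = pd.DataFrame({"State": ["KA", "JH"], "Year": ["2021", "2022"]},
                 index=["one", "two"])

X["Country"] = "India"              # a scalar is broadcast to every row
X["State"] = ["AP", "HP"]           # a list supplies one value per row
X["newColumn"] = X["State"] > "AP"  # boolean Series stored as a new column
flags = X["newColumn"].tolist()     # [False, True]

try:
    X["State"] = ["AP", "HP", "KA", "TN"]  # four values for two rows
except ValueError as err:
    length_error = str(err)         # pandas rejects the length mismatch

had_column = "newColumn" in X.columns
del X["newColumn"]                  # removes the column in place
```

The failed assignment leaves the State column unchanged, so a list of the wrong length is caught early rather than silently truncated.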

Check if row/column is present in dataframe

"newColumn" in X.columns
"one" in X.index

Index objects are immutable

X.index[0] = 10  # raises TypeError: an Index cannot be modified in place

Row reindexing - use reindex to change the order of rows

X1 = X.reindex([10, 20, 30, 0, 1, 1, 3])  # row reindexing; returns a new DataFrame, and labels not present in X produce rows of NaN

Column reindexing - Use reindex to Change order of columns

X1 = X.reindex(columns=["Country", "State"])  # column reindexing; a label not present in X produces a column of NaN
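A small sketch of index immutability and reindexing, assuming a made-up three-row frame with integer labels 0, 1 and 3. The names immutable, X1 and X2 are illustrative only.

```python
import pandas as pd

X = pd.DataFrame({"State": ["KA", "JH", "TN"], "Year": [2020, 2021, 2022]},
                 index=[0, 1, 3])

# Index objects cannot be changed element by element
try:
    X.index[0] = 10
    immutable = False
except TypeError:
    immutable = True

X1 = X.reindex([3, 0, 10])                   # reorder rows; label 10 is new, so its row is NaN
X2 = X.reindex(columns=["Year", "Country"])  # reorder columns; "Country" is new, so it is NaN

print(X1)
print(X2)
```

reindex returns a new object each time, so X itself keeps its original labels and order.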

Index value can be repeated

rowIndex=[0,1,1,1,3]

Transpose

X.T

Name the column and index axes

X.columns.name = "ColumnName"
X.index.name = "IndexName"

Element access from dataframe

Pandas Series Object

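X["row"]  # single element by label
X[0]  # single element by position; prefer X.iloc[0], since positional lookup with [] on a labeled Series is deprecated in recent pandas
X[["row1","row3"]]  # several elements by label
X[[0,2]]  # several elements by position; prefer X.iloc[[0, 2]]
X[X>50]  # boolean mask: elements greater than 50
X["row1":"row3"]  # label slice: the end label is included
X[0:2]  # position slice: the end position is excluded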
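The Series access patterns above can be sketched as follows, using a hypothetical three-element Series named s. Positional access goes through iloc here, which avoids the ambiguity of integer keys inside plain brackets.

```python
import pandas as pd

s = pd.Series([10, 60, 70], index=["row1", "row2", "row3"])

first = s["row1"]               # single value by label
picked = s[["row1", "row3"]]    # several values by label
large = s[s > 50]               # boolean mask keeps row2 and row3
by_label = s["row1":"row3"]     # label slice, end label included
by_pos = s.iloc[0:2]            # position slice, end position excluded
```

The label slice returns three elements while the position slice returns two, which is the main difference to keep in mind.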

Pandas Dataframe

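X["col1"]  # a single label extracts one column as a Series
X.loc[:, "col1":"col2"]  # extracts columns col1 through col2; X["col1":"col2"] would slice rows by label instead
X[1:3]  # a slice inside [] always selects rows: here the rows at positions 1 and 2 (position 3 is excluded)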
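The row-versus-column behaviour of plain brackets is easy to mix up, so here is a minimal sketch on a made-up 4x3 frame. The labels row1..row4 and col1..col3 follow the naming used in this section; the values are invented.

```python
import pandas as pd

X = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40], "col3": [100, 200, 300, 400]},
    index=["row1", "row2", "row3", "row4"],
)

one_col = X["col1"]                 # single label: a column (Series)
rows_pos = X[1:3]                   # integer slice: rows at positions 1 and 2
rows_lab = X["row1":"row3"]         # label slice: rows row1 through row3, inclusive
cols_lab = X.loc[:, "col1":"col2"]  # column slice needs loc
```

In short, a single key or a list of keys in brackets refers to columns, while any slice in brackets refers to rows.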

Loc/At syntax

X.loc["row1"]
X.iloc[0]
X.loc["row1", ["col1", "col3"]]
X.iloc[0, [0,2]]
X[X>5]
X.loc[:"row3"]  # label slice: rows up to and including "row3"
X.iloc[:3]  # position slice: rows at positions 0, 1 and 2
X.loc[:"row3", "col1":]
X.iloc[:3, 2:]
X.at["row3", "col2"]  # at extracts a single value by row and column label
X.iat[3,2]  # iat extracts a single value by row and column position
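To compare label-based and position-based access side by side, the sketch below runs the loc/iloc/at/iat calls on an invented 4x3 frame. The values are illustrative only.

```python
import pandas as pd

X = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40], "col3": [100, 200, 300, 400]},
    index=["row1", "row2", "row3", "row4"],
)

by_label = X.loc["row1", ["col1", "col3"]]   # row by label, two columns by label
by_pos = X.iloc[0, [0, 2]]                   # the same cells by position
upto_label = X.loc[:"row3", "col1":]         # 3 rows (end label included), all columns
upto_pos = X.iloc[:3, 2:]                    # 3 rows (end position excluded), last column
single_label = X.at["row3", "col2"]          # one scalar by label
single_pos = X.iat[3, 2]                     # one scalar by position
```

at and iat only accept a single row and a single column, which is why they return a scalar rather than a Series or DataFrame.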

Chained indexing

X.iloc[:3, :2].iloc[:, [0, 1]]  # accessing in stages: first 3 rows and 2 columns, then columns 0 and 1 of that result

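Arithmetic operations

X1 - X1frame  # element-wise subtraction aligned on labels; if X1frame is a Series, it is matched on column labels and subtracted from every row
X1.sub(X1frame)  # same as X1 - X1frame
X1.rsub(X1frame, axis="index")  # reversed operands: same as X1frame - X1; with a Series, axis="index" matches on row labels
X1.add(X1frame, fill_value=99)  # where only one operand has a value, the missing one is treated as 99; if both are missing the result stays NaN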
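Label alignment and fill_value are easiest to see on two frames that only partly overlap. In this sketch X1frame is assumed to be a DataFrame (the original does not say whether it is a frame or a Series), and the labels and numbers are invented.

```python
import pandas as pd

X1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
X1frame = pd.DataFrame({"a": [10, 20], "c": [30, 40]}, index=["r1", "r3"])

diff = X1 - X1frame                      # only ("r1", "a") exists in both frames
reverse = X1.rsub(X1frame)               # same as X1frame - X1
filled = X1.add(X1frame, fill_value=99)  # 99 stands in when one side is missing

print(diff)
print(filled)
```

The result covers the union of row and column labels. Cells missing from both inputs, such as ("r2", "c"), stay NaN even with fill_value.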

Apply

X1.apply(lambda x: x.max())  # returns the max value of each column
X1.apply(lambda x: x.max(), axis="columns")  # returns the max value of each row

How to get min/max for each row

def minmax(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

X1.apply(minmax, axis="columns")  # returns a DataFrame with the min and max of each row
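The minmax helper can be exercised on a small numeric frame. The data below is invented; per_row and col_max are illustrative names.

```python
import pandas as pd

X1 = pd.DataFrame({"a": [1, 5], "b": [7, 2], "c": [4, 9]}, index=["r1", "r2"])

def minmax(x):
    # x is one row (axis="columns") or one column (default axis)
    return pd.Series([x.min(), x.max()], index=["min", "max"])

per_row = X1.apply(minmax, axis="columns")  # one row of output per input row
col_max = X1.apply(lambda x: x.max())       # one value per column

print(per_row)
```

Because the function returns a Series, apply assembles the results into a DataFrame whose columns are min and max. For these two statistics alone, X1.min(axis=1) and X1.max(axis=1) give the same numbers without a Python-level function call.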

Applymap

X.applymap(lambda x: x*10)  # applies the function to each element, here multiplying every element by 10; deprecated since pandas 2.1 in favor of X.map

Sorting

X.sort_index()  # sort by row index
X.sort_index(axis=1)  # sort by column labels
X.sort_values()  # sort by values; valid only for a Series, since a DataFrame requires "by"
X.sort_values(by="col2")  # DataFrame sorted by column "col2"
X.sort_values(by=["col2", "col1"])  # sorted by col2, with ties broken by col1
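A short sketch of the sorting calls on a made-up frame whose rows and columns start out unsorted. The tie in col2 shows how the second sort key is used.

```python
import pandas as pd

X = pd.DataFrame({"col2": [2, 1, 2], "col1": [9, 8, 7]}, index=["b", "c", "a"])

by_index = X.sort_index()                        # rows ordered a, b, c
by_columns = X.sort_index(axis=1)                # columns ordered col1, col2
by_one = X.sort_values(by="col2")                # smallest col2 first
by_two = X.sort_values(by=["col2", "col1"])      # ties in col2 resolved by col1
descending = X["col1"].sort_values(ascending=False)  # Series sort needs no "by"
```

All of these return new objects and leave X in its original order.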

Rank

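X.rank()  # rank within each column; tied values share the average rank by default
X.rank(ascending=True)  # ascending=True is the default; pass False to rank from largest to smallest
X.rank(axis=1)  # rank within each row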
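Ranking is clearer with a tie in the data. The frame below is invented; with the default method, tied values share the average of the ranks they would occupy.

```python
import pandas as pd

X = pd.DataFrame({"col1": [30, 10, 30], "col2": [1, 2, 3]})

col_rank = X.rank()                  # col1 -> 2.5, 1.0, 2.5 (the two 30s share ranks 2 and 3)
desc_rank = X.rank(ascending=False)  # col1 -> 1.5, 3.0, 1.5
row_rank = X.rank(axis=1)            # compares col1 and col2 within each row
```

Ranks are returned as floats precisely because ties can produce fractional values such as 2.5.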

Summarizing data

X.sum()
X.sum(axis=1)
X.describe()
X.corr()
X["col1"].corr(X["col2"])
X.cov()
X["col1"].cov(X["col2"])
Others (sum, min, max, quantile, mean, median, kurt, skew, cumsum, cummax, cummin, etc.)
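The summary methods can be checked by hand on a tiny frame. In this sketch col2 is exactly twice col1, a made-up relationship chosen so that the correlation and covariance are easy to verify.

```python
import pandas as pd

X = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [2, 4, 6, 8]})

col_totals = X.sum()                # one total per column
row_totals = X.sum(axis=1)          # one total per row
corr = X["col1"].corr(X["col2"])    # Pearson correlation of two columns
cov = X["col1"].cov(X["col2"])      # sample covariance (divides by n - 1)
running = X["col1"].cumsum()        # cumulative sum down the column

print(X.describe())                 # count, mean, std, min, quartiles, max
```

Since col2 is a positive multiple of col1, the correlation is 1 up to rounding, and the covariance is twice the sample variance of col1.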
