3. Data Analysis - Pandas DataFrame
Pandas Dataframe creation
DataFrame creation using a dictionary (with only column values)
- import pandas as pd
- data1 = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"], "Name": ["ABC", "DEF"]}
- X = pd.DataFrame(data1) ------- DataFrame creation with all features (columns)
- X = pd.DataFrame(data1, columns=["State", "Year"]) ---- DataFrame creation with 2 features
- X = pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- a column name that is not in the dictionary ("JUNK") does not raise an error; the column is created and filled with NaN.
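The three calls above can be tried together. This is a minimal sketch; the variable names all_features, two_features and with_junk are made up for illustration.

```python
import pandas as pd

data1 = {
    "State": ["Karnataka", "Jharkhand"],
    "Year": ["2021", "2022"],
    "Name": ["ABC", "DEF"],
}

# Every dictionary key becomes a column.
all_features = pd.DataFrame(data1)

# Only the listed keys are kept, in the listed order.
two_features = pd.DataFrame(data1, columns=["State", "Year"])

# "JUNK" is not a key of data1: the column is created and filled with NaN.
with_junk = pd.DataFrame(data1, columns=["State", "Year", "JUNK"])

print(all_features)
print(two_features)
print(with_junk)
```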
Dataframe creation using dictionary with column/index values
- data1 = {"State": {"one": "Karnataka", "two": "Jharkhand"}, "Year": {"one": "2021", "two": "2022"}, "Name": {"one": "ABC", "two": "DEF"}} -- outer keys become columns, inner keys become row labels
- X = pd.DataFrame(data1)
- X = pd.DataFrame(data1, columns=["State", "Year"])
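A short sketch of the nested-dictionary form, assuming the inner keys "one" and "two" are meant as row labels:

```python
import pandas as pd

# Outer keys -> column names, inner keys -> row index labels.
data1 = {
    "State": {"one": "Karnataka", "two": "Jharkhand"},
    "Year": {"one": "2021", "two": "2022"},
    "Name": {"one": "ABC", "two": "DEF"},
}

X = pd.DataFrame(data1)
X_sub = pd.DataFrame(data1, columns=["State", "Year"])

print(X)
print(X.loc["two", "State"])  # value at row "two", column "State"
```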
By passing row index separately
- data1 = {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"], "Name": ["ABC", "DEF"]}
- rowIndex=['one', 'two']
- X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)
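The same result can be sketched with a separate row index. The index list must have one label per row of data.

```python
import pandas as pd

data1 = {
    "State": ["Karnataka", "Jharkhand"],
    "Year": ["2021", "2022"],
    "Name": ["ABC", "DEF"],
}
rowIndex = ["one", "two"]

# Row labels come from rowIndex instead of the default 0, 1, ...
X = pd.DataFrame(data1, columns=["State", "Year"], index=rowIndex)
print(X)
```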
Index and Column Update
- Access rows/columns
- Access the column names
- X.columns
- Access one particular column
- X["State"]
- Assign the same value to every row of a column
- X["State"] = "KA"
- Assign a different value to each row of a column (the list length must match the number of rows)
- X["State"] = ["AP", "HP", "KA", "TN"]
- Add a column
- X['newColumn'] = X['State'] > "AP" -- the comparison returns a boolean Series, which is stored as the new column.
- Delete a column
- del X['newColumn']
- Check if a row/column is present in the DataFrame
- "newColumn" in X.columns
- "one" in X.index
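A runnable sketch of the operations in this list, assuming a four-row frame so that the four-element list assignment fits. The "Country" column and the sample values are illustrative only.

```python
import pandas as pd

X = pd.DataFrame(
    {"State": ["KA", "JH", "MH", "GJ"], "Year": [2019, 2020, 2021, 2022]},
    index=["one", "two", "three", "four"],
)

print(X.columns.tolist())  # column names
print(X["State"])          # one column, returned as a Series

X["Country"] = "India"                   # same value in every row
X["State"] = ["AP", "HP", "KA", "TN"]    # one value per row
X["newColumn"] = X["State"] > "AP"       # boolean Series (string comparison)

del X["Country"]                         # delete a column

has_column = "newColumn" in X.columns    # column membership
has_row = "one" in X.index               # row-label membership
print(has_column, has_row)
```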
Index objects are immutable
- X.index[0] = 10 --- Error (TypeError): a single index label cannot be changed in place.
Row reindexing - use reindex to change the order of rows
- X1 = X.reindex([10,20,30,0,1,1,3]) --- Row reindexing. Note that a new DataFrame object is created; labels that are not present in X produce rows of NaN.
Column reindexing - use reindex to change the order of columns
- X1 = X.reindex(columns=["Country", "State"]) -- column reindexing
Index values can be repeated
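The points about immutability, reindexing and repeated labels can be checked with a small sketch. The frame contents and the names X1, X2 and dup are assumptions for illustration.

```python
import pandas as pd

X = pd.DataFrame(
    {"State": ["KA", "JH", "MH"], "Country": ["India", "India", "India"]},
    index=[0, 1, 3],
)

# Index objects do not support item assignment.
try:
    X.index[0] = 10
    index_is_immutable = False
except TypeError:
    index_is_immutable = True

# reindex returns a new object; label 10 is absent from X, so its row is NaN.
# A label may appear more than once in the requested order (label 1 here).
X1 = X.reindex([10, 3, 1, 1, 0])

# Column reindexing changes the column order.
X2 = X.reindex(columns=["Country", "State"])

# An index may hold repeated labels; .loc then returns every matching row.
dup = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "a", "b"])
print(dup.loc["a"])
```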
Transpose
- X.T
Name the column axis and the row index
- X.columns.name = "ColumnName"
- X.index.name = "IndexName"
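A brief sketch showing that the axis names travel with the axes when the frame is transposed. The sample data is illustrative.

```python
import pandas as pd

X = pd.DataFrame(
    {"State": ["Karnataka", "Jharkhand"], "Year": ["2021", "2022"]},
    index=["one", "two"],
)

X.columns.name = "ColumnName"  # label for the column axis
X.index.name = "IndexName"     # label for the row axis

XT = X.T  # rows become columns and columns become rows
print(XT)
```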
Element access from dataframe
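Pandas Series object
- X["row"] -- access by label
- X[0] -- access by position; on a Series with a non-integer index this form is deprecated in recent pandas versions, so prefer X.iloc[0]
- X[["row1","row3"]] -- several labels
- X[[0,2]] -- several positions; prefer X.iloc[[0,2]]
- X[X>50] -- boolean filter: keeps the values greater than 50
- X["row1":"row3"] -- label slice; the end label is included
- X[0:2] -- positional slice; the end position is excluded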
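A sketch of Series access with made-up values. Positional access is written with iloc here, since plain integer keys on a string-labelled Series are deprecated in recent pandas versions.

```python
import pandas as pd

X = pd.Series([10, 60, 30, 80], index=["row1", "row2", "row3", "row4"])

by_label = X["row1"]               # single label
by_position = X.iloc[0]            # single position
some_labels = X[["row1", "row3"]]  # list of labels
some_positions = X.iloc[[0, 2]]    # list of positions
big = X[X > 50]                    # boolean filter
label_slice = X["row1":"row3"]     # end label included
pos_slice = X.iloc[0:2]            # end position excluded

print(big)
print(label_slice)
```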
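Pandas DataFrame
- X["col1"] -- extracts a column
- X["col1":"col2"] -- does not extract columns: a slice inside [] always selects rows. To slice columns by label, use X.loc[:, "col1":"col2"]
- X[1:3] -- extracts the rows at positions 1 and 2 (the end position is excluded). The rule behind the confusion: a single label or a list of labels inside [] selects columns, while a slice selects rows.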
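The behaviour of plain [] on a DataFrame can be sketched as follows; the frame and its labels are assumptions for illustration.

```python
import pandas as pd

X = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40], "col3": [100, 200, 300, 400]},
    index=["row1", "row2", "row3", "row4"],
)

one_col = X["col1"]             # label -> one column (Series)
two_cols = X[["col1", "col3"]]  # list of labels -> columns (DataFrame)
rows = X[1:3]                   # slice -> rows at positions 1 and 2

# Column slicing by label goes through .loc; the end label is included.
col_slice = X.loc[:, "col1":"col2"]

print(rows)
print(col_slice)
```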
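loc/iloc/at/iat syntax
- X.loc["row1"] -- row by label
- X.iloc[0] -- row by position
- X.loc["row1", ["col1", "col3"]]
- X.iloc[0, [0,2]]
- X[X>5] -- boolean mask; entries that fail the condition become NaN
- X.loc[:"row3"] -- label slice, "row3" is included
- X.iloc[:3] -- positional slice, position 3 is excluded
- X.loc[:"row3", "col1":]
- X.iloc[:3, 2:]
- X.at["row3", "col2"] ---- a single value is extracted by label with at
- X.iat[3,2] ---- a single value is extracted by position with iat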
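A runnable sketch of the label-based and position-based accessors side by side, using an illustrative 4x3 frame:

```python
import pandas as pd

X = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40], "col3": [100, 200, 300, 400]},
    index=["row1", "row2", "row3", "row4"],
)

row_by_label = X.loc["row1"]
row_by_pos = X.iloc[0]

picked_by_label = X.loc["row1", ["col1", "col3"]]
picked_by_pos = X.iloc[0, [0, 2]]

upto_label = X.loc[:"row3"]        # 3 rows: the end label is included
upto_pos = X.iloc[:3]              # 3 rows: the end position is excluded
block_label = X.loc[:"row3", "col1":]
block_pos = X.iloc[:3, 2:]

single_by_label = X.at["row3", "col2"]  # scalar access by label
single_by_pos = X.iat[3, 2]             # scalar access by position

masked = X[X > 5]  # entries that are not greater than 5 become NaN
print(masked)
```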
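Chained indexing (access in stages)
- X.iloc[:3, :2].iloc[:, [0,1]] -- accessing in stages: the second stage also needs .iloc, because plain [] on a DataFrame does not accept a (row, column) pair.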
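A sketch of staged access compared with a single equivalent call. Chaining is fine for reading, but assigning through a chain may not modify the original frame, so a single .iloc or .loc call is the safer choice for writes.

```python
import pandas as pd

X = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40], "col3": [100, 200, 300, 400]},
    index=["row1", "row2", "row3", "row4"],
)

# Stage 1: first three rows, first two columns.
# Stage 2: pick columns 0 and 1 of that intermediate result.
staged = X.iloc[:3, :2].iloc[:, [0, 1]]

# The same selection in one step.
direct = X.iloc[:3, [0, 1]]

print(staged)
```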
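Arithmetic operations
- X1 - X1frame ---- element-wise subtraction, aligned on row and column labels; labels present in only one frame give NaN
- X1.sub(X1frame) --- same as X1 - X1frame
- X1.rsub(X1frame, axis="index") ---- same as X1frame - X1
- X1.add(X1frame, fill_value=99) -- where a value is missing in one of the two frames, 99 is used in its place; where it is missing in both, the result is still NaN.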
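Label alignment and fill_value are easiest to see with two frames that only partly overlap. The frames below are made up for that purpose.

```python
import pandas as pd

X1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
X1frame = pd.DataFrame({"a": [10, 20], "c": [30, 40]}, index=["r1", "r3"])

# Only (r1, a) exists in both frames; every other cell of the result is NaN.
diff = X1 - X1frame
same_diff = X1.sub(X1frame)

# rsub reverses the operands: X1frame - X1.
reversed_diff = X1.rsub(X1frame, axis="index")

# fill_value replaces a value missing on one side before adding.
# Cells missing on both sides, such as (r3, b), stay NaN.
filled = X1.add(X1frame, fill_value=99)

print(diff)
print(filled)
```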
Apply
- X1.apply(lambda x: x.max()) -- will return max value from each column
- X1.apply(lambda x:x.max(), axis="columns") -- max value for each row is calculated.
How to get min/max for each row
- def minmax(x):
-     return pd.Series([x.min(), x.max()], index=["min", "max"])
- X1.apply(minmax, axis="columns") -- returns a DataFrame with the min and max of each row.
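The column-wise and row-wise apply calls can be sketched together on a small illustrative frame:

```python
import pandas as pd

X1 = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]}, index=["r1", "r2", "r3"])

# Default axis: the function receives one column at a time.
col_max = X1.apply(lambda x: x.max())

# axis="columns": the function receives one row at a time.
row_max = X1.apply(lambda x: x.max(), axis="columns")

def minmax(x):
    # Returning a Series makes apply build one output column per entry.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

row_minmax = X1.apply(minmax, axis="columns")
print(row_minmax)
```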
Applymap
- X.applymap(lambda x: x*10) -- applymap calls the function on each element of the DataFrame; in this case every element is multiplied by 10. In pandas 2.1 and later, applymap is deprecated in favour of X.map.
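A sketch of an element-wise function that works on both older and newer pandas versions. The hasattr check is an illustrative workaround, not something the original post uses.

```python
import pandas as pd

X = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def times_ten(value):
    # Called once per element.
    return value * 10

# DataFrame.map replaces applymap from pandas 2.1 onwards.
if hasattr(X, "map"):
    scaled = X.map(times_ten)
else:
    scaled = X.applymap(times_ten)

print(scaled)
```

For simple arithmetic such as this, the vectorised expression X * 10 gives the same result and is usually faster.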
Sorting
- X.sort_index() -- sorting of row index
- X.sort_index(axis=1) -- sorting of column index
- X.sort_values() -- sorts by values; no argument is needed for a Series object
- X.sort_values(by="col2") -- sorts the rows by the values in column "col2"; by is required for a DataFrame object
- X.sort_values(by=["col2", "col1"])
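A small sketch of the sorting calls with an illustrative frame. With a list in by, ties in the first column are broken by the second.

```python
import pandas as pd

X = pd.DataFrame({"col2": [2, 1, 2], "col1": [9, 8, 7]}, index=["c", "a", "b"])

by_row_label = X.sort_index()        # rows ordered a, b, c
by_col_label = X.sort_index(axis=1)  # columns ordered col1, col2

by_one = X.sort_values(by="col2")            # smallest col2 first
by_two = X.sort_values(by=["col2", "col1"])  # col2 first, then col1 for ties

series_sorted = X["col1"].sort_values()  # a Series needs no "by"

print(by_two)
```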
Rank
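- X.rank() -- column-wise ranking; tied values get the average of their ranks
- X.rank(ascending=True) -- ascending=True is the default; use ascending=False to give rank 1 to the largest value
- X.rank(axis=1) -- row-wise ranking.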
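A sketch of ranking with made-up values, including a tie in col2 to show the default averaging:

```python
import pandas as pd

X = pd.DataFrame({"col1": [30, 10, 20], "col2": [1, 3, 3]})

col_rank = X.rank()                  # rank within each column
desc_rank = X.rank(ascending=False)  # largest value gets rank 1
row_rank = X.rank(axis=1)            # rank within each row

print(col_rank)
```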
Summarizing data
- X.sum()
- X.sum(axis=1)
- X.describe()
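- X.corr() -- correlation matrix of all numeric columns
- X["col1"].corr(X["col2"]) -- correlation between two columns
- X.cov() -- covariance matrix of all numeric columns
- X["col1"].cov(X["col2"]) -- covariance between two columns
- Others (sum, min, max, quantile, mean, median, kurt, skew, cumsum, cummax, cummin, etc.)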
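A sketch of the summary methods on an illustrative frame in which col2 is exactly twice col1, so the correlation is 1:

```python
import pandas as pd

X = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [2, 4, 6, 8]})

col_sums = X.sum()        # one total per column
row_sums = X.sum(axis=1)  # one total per row
summary = X.describe()    # count, mean, std, min, quartiles, max

corr_matrix = X.corr()
corr_pair = X["col1"].corr(X["col2"])

cov_matrix = X.cov()
cov_pair = X["col1"].cov(X["col2"])

running_total = X["col1"].cumsum()  # cumulative sum down the column

print(summary)
```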