Thursday, 29 September 2022

4. Data Analysis - Data Ingestion

Data Ingestion

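  • pd.read_csv("XYZ.csv") -- read a comma-separated file; the first line is used as the header.
  • pd.read_table("XYZ.csv", sep=",") -- same result; read_table defaults to a tab separator, so sep must be given.
  • pd.read_table("XYZ.csv", sep=",", header=None) -- the file has no header row; pandas assigns default integer column names (0, 1, 2, ...).
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) -- provide the column names yourself.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="a") -- use one column as the row labels. The column given to index_col must be one of the column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["a", "b"]) -- use two columns as hierarchical row labels.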
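
The header and index options above can be tried without a file on disk. This is a minimal sketch that assumes a small made-up, header-less dataset held in a string (the contents of raw are illustrative, not from XYZ.csv):

```python
import io
import pandas as pd

# Hypothetical in-memory data with no header row (stands in for "XYZ.csv").
raw = "1,2,3,4,hello\n5,6,7,8,world\n"
cols = ["a", "b", "c", "d", "e"]

# header=None: pandas assigns integer column labels 0..4.
df_default = pd.read_csv(io.StringIO(raw), header=None)

# names=...: supply the column labels yourself.
df_named = pd.read_csv(io.StringIO(raw), names=cols)

# index_col must refer to one of the column labels (here "e").
df_indexed = pd.read_csv(io.StringIO(raw), names=cols, index_col="e")

# Two index columns produce a hierarchical (MultiIndex) row index.
df_multi = pd.read_csv(io.StringIO(raw), names=cols, index_col=["a", "b"])

print(df_indexed)
print(df_multi)
```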
Check NULL values in a DataFrame
  • X.isnull() -- returns a DataFrame of booleans, True wherever a value is missing.
Read particular values as NULL from a file
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"]) -- "d" and "e" will be read as NULL in every column.
  • pd.read_csv("XYZ.csv", sep=",", na_values={"Col1": ["d","e"], "Col2": ["a"]}) -- "d" and "e" in "Col1" and "a" in "Col2" will be read as NULL. Per-column values must be passed as a dict.
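
To see the difference between the list form and the dict form of na_values, here is a small sketch. The column names and cell values are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical data; the letters are ordinary strings until marked as NA.
raw = "Col1,Col2,Col3\nd,a,a\nx,b,d\ne,c,e\n"

# List form: "d" and "e" become NaN in every column.
df_all = pd.read_csv(io.StringIO(raw), na_values=["d", "e"])

# Dict form: each list of markers applies only to the named column.
df_per_col = pd.read_csv(
    io.StringIO(raw),
    na_values={"Col1": ["d", "e"], "Col2": ["a"]},
)

# isnull() gives a boolean mask; sum() counts missing values per column.
print(df_all.isnull().sum())
print(df_per_col.isnull().sum())
```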
Reading large files
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) -- skips lines 3 and 5 of the file while reading. Line numbers are 0-based, so the header is line 0.
Limiting the number of rows displayed or read
  • pd.options.display.max_rows = 10 -- a display setting only; it does not limit how many rows are read.
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) -- all remaining rows are still read, but at most 10 are displayed: the first 5 and the last 5.
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], nrows=5) -- only the first 5 rows will be read.
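
A short sketch of how skiprows, nrows and display.max_rows behave, assuming a made-up file of one header line plus 20 data rows. Note that display.max_rows changes only what is printed, not what is loaded:

```python
import io
import pandas as pd

# Hypothetical file: a header line followed by 20 data rows (n = 0..19).
raw = "n,sq\n" + "".join(f"{i},{i * i}\n" for i in range(20))

# skiprows takes 0-based line numbers; line 0 is the header here,
# so lines 3 and 5 are the data rows with n == 2 and n == 4.
df_skip = pd.read_csv(io.StringIO(raw), skiprows=[3, 5])

# nrows limits how many data rows are parsed.
df_head = pd.read_csv(io.StringIO(raw), nrows=5)

# display.max_rows affects printing only; the whole file is still loaded.
with pd.option_context("display.max_rows", 10):
    df_full = pd.read_csv(io.StringIO(raw))
    print(df_full)  # truncated view: first 5 and last 5 rows

print(len(df_skip), len(df_head), len(df_full))
```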

Reading a file in chunks

  • fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], chunksize=5) -- returns an iterator; every chunk is a DataFrame with up to 5 rows.
  • for temp_chunk in fileChunk:
    • print(temp_chunk)
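
Chunked reading is mostly useful for aggregating over a file that does not fit in memory. A minimal sketch, assuming a made-up 12-row dataset and a running sum as the aggregation:

```python
import io
import pandas as pd

# Hypothetical file: a header line followed by 12 data rows.
raw = "n,sq\n" + "".join(f"{i},{i * i}\n" for i in range(12))

chunk_sizes = []
total = 0

# chunksize makes read_csv return an iterator; each item is a DataFrame
# holding up to 5 rows, so only one chunk is in memory at a time.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=5):
    chunk_sizes.append(len(chunk))
    total += int(chunk["sq"].sum())

print(chunk_sizes, total)
```

The last chunk is shorter when the row count is not a multiple of chunksize.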

Writing to a CSV file

  • X.to_csv("XYZ.csv") -- NaN values will be written as empty strings.
  • X.to_csv("XYZ.csv", na_rep="NULL") -- NaN values will be written as NULL.
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, header=False) -- NaN values will be written as NULL; no row or column labels will be written.
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, columns=["col1", "col2"]) -- only the two listed columns will be written.
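
The effect of each to_csv option is easy to inspect by leaving out the path, in which case to_csv returns the CSV text as a string. This sketch assumes a small made-up DataFrame X with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in col1.
X = pd.DataFrame(
    {"col1": [1.0, np.nan], "col2": ["x", "y"], "col3": [True, False]}
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
default_csv = X.to_csv()                 # NaN written as an empty field
null_csv = X.to_csv(na_rep="NULL")       # NaN written as NULL
bare_csv = X.to_csv(na_rep="NULL", index=False, header=False)
two_cols = X.to_csv(na_rep="NULL", index=False, columns=["col1", "col2"])

print(default_csv)
print(bare_csv)
print(two_cols)
```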

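Reading JSON, HTML, and pickle files

  • pd.read_json("iris.json") -- read a JSON file.
  • X = pd.read_html("XYZ.html")[0] -- read_html returns a list of DataFrames, one per HTML table found; [0] takes the first. Wildcards in the file name are not expanded.
  • X.to_pickle('file') -- stores the DataFrame in a pickle file.
  • Y = pd.read_pickle('file') -- reads it back from the pickle file.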
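
A round trip is a quick way to check that JSON and pickle preserve the data. This is a sketch under stated assumptions: the DataFrame contents are made up, the JSON goes through an in-memory buffer, and the pickle goes to a temporary file named frame.pkl. read_html is left out because it needs an extra HTML parser package to be installed.

```python
import io
import os
import tempfile
import pandas as pd

# Hypothetical frame, loosely modelled on iris-style data.
X = pd.DataFrame({"species": ["setosa", "virginica"], "petal_len": [1.4, 5.1]})

# JSON round trip through an in-memory buffer.
json_text = X.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# Pickle round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    X.to_pickle(path)
    Y = pd.read_pickle(path)

print(from_json)
print(Y.equals(X))
```

Pickle keeps dtypes and the index exactly, while JSON may need them re-specified on reading. Only unpickle files from sources you trust.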
