A comprehensive guide to optimizing Pandas memory usage, covering data types, chunked processing, categorical variables, and efficient techniques for working with large datasets.
Pandas Performance Optimization: Mastering Memory Usage Reduction
Pandas is a powerful Python library for data analysis that provides flexible data structures and analysis tools. However, when working with large datasets, memory usage can become a major bottleneck, hurting performance or even crashing your program. This comprehensive guide explores a range of techniques for optimizing Pandas memory usage so you can process large datasets efficiently and effectively.
Understanding Pandas Memory Usage
Before diving into optimization techniques, it is important to understand how Pandas stores data in memory. Pandas primarily uses NumPy arrays to store data inside DataFrame and Series objects. Each column's data type has a large impact on its memory footprint: an int64 column, for example, consumes twice the memory of an int32 column.
You can check a DataFrame's memory usage with the .memory_usage() method.
import pandas as pd
data = {
'col1': [1, 2, 3, 4, 5],
'col2': ['A', 'B', 'C', 'D', 'E'],
'col3': [1.1, 2.2, 3.3, 4.4, 5.5]
}
df = pd.DataFrame(data)
memory_usage = df.memory_usage(deep=True)
print(memory_usage)
The deep=True argument is essential for accurately measuring the memory usage of object (string) columns.
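To see the difference, compare the shallow and deep figures for a string column; the shallow number counts only the 8-byte object pointers, not the strings themselves (the data below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Amsterdam', 'Berlin', 'Copenhagen'] * 1000})

shallow = df.memory_usage()['city']        # pointer array only
deep = df.memory_usage(deep=True)['city']  # adds the Python string objects

print(f'shallow: {shallow} bytes, deep: {deep} bytes')
```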
Techniques for Reducing Memory Usage
1. Choose Appropriate Data Types
Choosing an appropriate data type for each column is the most fundamental step in reducing memory usage. Pandas infers data types automatically, but it often assigns defaults that consume more memory than necessary. For example, a column containing integers from 0 to 100 may be assigned int64 even though int8 or uint8 would suffice.
Example: Downcasting Numeric Types
You can downcast numeric columns to smaller representations with the downcast parameter of pd.to_numeric(), or by checking each column's value range explicitly, as the helper function below does.
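In isolation, the pd.to_numeric() route looks like this (the values are illustrative); downcast='unsigned' picks the smallest unsigned integer type that can hold the data:

```python
import pandas as pd

s = pd.Series([0, 42, 100])               # inferred as int64 by default
small = pd.to_numeric(s, downcast='unsigned')

print(s.dtype, '->', small.dtype)         # all values fit in uint8 (0..255)
```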
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.

    Note: float16 keeps only about 3 significant decimal digits; remove that
    branch if your analysis is sensitive to rounding.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object':
            continue  # Skip strings; handle them separately (e.g. as category)
        if np.issubdtype(col_type, np.integer):
            c_min = df[col].min()
            c_max = df[col].max()
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            else:
                df[col] = df[col].astype(np.int64)
        elif np.issubdtype(col_type, np.floating):
            c_min = df[col].min()
            c_max = df[col].max()
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
Example: Converting Strings to the Categorical Type
If a column contains a limited set of unique string values, converting it to the categorical type can dramatically reduce memory usage. The categorical type stores each unique value only once and represents every element in the column as an integer code referencing those unique values.
df['col2'] = df['col2'].astype('category')
Consider a customer transaction dataset from a global e-commerce platform. The "Country" column may contain only a few hundred unique country names, yet the dataset holds millions of transactions. Converting "Country" to the categorical type can cut its memory consumption dramatically.
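A quick sanity check of the saving, on synthetic data shaped like the country example (a few unique values repeated many times):

```python
import pandas as pd

countries = pd.Series(['US', 'DE', 'JP', 'BR'] * 25_000)

as_object = countries.memory_usage(deep=True)
as_category = countries.astype('category').memory_usage(deep=True)

# The categorical version stores 4 strings plus one small integer code per row.
print(f'object: {as_object} bytes, category: {as_category} bytes')
```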
2. Chunked Processing and Iteration
When a dataset is too large to fit in memory, you can process it in chunks with the chunksize parameter of pd.read_csv() (several other readers, such as pd.read_sql(), accept it too; note that pd.read_excel() does not). This lets you load and process the data in smaller, more manageable pieces.
for chunk in pd.read_csv('large_dataset.csv', chunksize=100000):
# Process the chunk (e.g., perform calculations, filtering, aggregation)
print(f"Processing chunk with {len(chunk)} rows")
# Optionally, append results to a file or database.
Example: Processing Large Log Files
Imagine processing large log files from a global network infrastructure. A log file may be too big to fit in memory. With chunked processing you can iterate over the file, analyze each chunk for specific events or patterns, and aggregate the results without ever exceeding your memory limit.
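The usual pattern is to reduce each chunk to a small partial result and merge those, so only running totals live in memory. A minimal sketch, using an in-memory buffer to stand in for a large log file (the status column is illustrative):

```python
import io

import pandas as pd

# Stand-in for a large on-disk log file.
log_csv = io.StringIO('status\n' + '\n'.join(['ok', 'error', 'ok', 'ok'] * 1000))

# Keep only per-status running totals; never the whole file.
totals = pd.Series(dtype='int64')
for chunk in pd.read_csv(log_csv, chunksize=500):
    totals = totals.add(chunk['status'].value_counts(), fill_value=0)

print(totals.astype(int).to_dict())
```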
3. Load Only the Columns You Need
Datasets often contain columns that are irrelevant to the analysis. Loading only the columns you need can significantly reduce memory usage. Use the usecols parameter of pd.read_csv() to specify the columns of interest.
df = pd.read_csv('large_dataset.csv', usecols=['col1', 'col2', 'col3'])
Example: Analyzing Sales Data
When analyzing sales data to identify the top-selling products, you may only need the product ID, quantity sold, and revenue columns. Loading just those columns consumes far less memory than loading the entire dataset, which might also include customer demographics, shipping addresses, and other irrelevant information.
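usecols also combines well with explicit dtypes, so the columns arrive compact on load. A sketch with illustrative column names:

```python
import io

import pandas as pd

sales_csv = io.StringIO(
    'product_id,qty,revenue,customer_segment,region\n'
    '101,3,29.97,retail,EU\n'
    '102,1,9.99,wholesale,US\n'
)

# Load only the three columns the analysis needs, already downcast.
df = pd.read_csv(
    sales_csv,
    usecols=['product_id', 'qty', 'revenue'],
    dtype={'product_id': 'int32', 'qty': 'int16', 'revenue': 'float32'},
)

print(df.dtypes.to_dict())
```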
4. Use Sparse Data Structures
If a DataFrame contains many missing values (NaN) or zeros, sparse data structures can represent the data much more efficiently. A sparse DataFrame stores only the values that are not missing or zero, greatly reducing memory usage for sparse data.
sparse_series = df['col1'].astype('Sparse[float]')
sparse_df = sparse_series.to_frame()
Example: Analyzing Customer Ratings
Consider a dataset of customer ratings for a large number of products. Because most customers rate only a handful of products, the ratings form a sparse matrix. Storing this data in a sparse DataFrame consumes far less memory than a dense one.
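A sketch of the saving on synthetic ratings-like data, where only 1 element in 1,000 is present:

```python
import numpy as np
import pandas as pd

values = np.full(100_000, np.nan)
values[::1000] = 5.0                      # 100 ratings, 99,900 missing

dense = pd.Series(values)
sparse = dense.astype('Sparse[float]')    # stores only the 100 present values

print(dense.memory_usage(), sparse.memory_usage())
print(f'density: {sparse.sparse.density:.4f}')
```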
5. Avoid Copying Data
Pandas operations often create copies of a DataFrame, which increases memory usage. Where possible, you can avoid unnecessary copies by modifying a DataFrame in place.
For example, instead of:
df = df[df['col1'] > 10]
consider using:
df.drop(df[df['col1'] <= 10].index, inplace=True)
The inplace=True argument modifies the DataFrame without binding a second copy to a new name. Note, however, that many pandas operations still allocate an intermediate copy internally even with inplace=True, so profile before relying on this for large savings.
6. Optimize String Storage
String columns can consume considerable memory, especially when they contain long strings or many unique values. Converting strings to the categorical type, as described earlier, is one effective technique; another is to use a smaller string representation where possible.
Example: Reducing String Length
If a column contains identifiers stored as strings that could be represented as integers, converting them to integers saves memory. For example, product IDs currently stored as strings like "PROD-1234" can be mapped to integer IDs.
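One way to do the mapping, assuming the IDs share a fixed "PROD-" prefix (the prefix and values are illustrative; str.removeprefix needs pandas 1.4+). When IDs have no numeric structure, pd.factorize() produces dense integer codes instead:

```python
import pandas as pd

ids = pd.Series(['PROD-1234', 'PROD-0042', 'PROD-1234'])

# Strip the constant prefix and keep only the numeric part as a small int.
numeric_ids = ids.str.removeprefix('PROD-').astype('int32')

print(numeric_ids.tolist(), numeric_ids.dtype)
```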
7. Use Dask for Larger-than-Memory Datasets
For datasets so large that they do not fit in memory even with chunked processing, consider Dask. Dask is a parallel computing library that integrates well with Pandas and NumPy. It splits a dataset into smaller chunks and processes them in parallel across multiple cores or multiple machines, making it possible to work with larger-than-memory datasets.
import dask.dataframe as dd
ddf = dd.read_csv('large_dataset.csv')
# Perform operations on the Dask DataFrame (e.g., filtering, aggregation)
result = ddf[ddf['col1'] > 10].groupby('col2').mean().compute()
The compute() method triggers the actual computation and returns a Pandas DataFrame containing the result.
Best Practices and Considerations
- Profile your code: Use profiling tools to identify memory bottlenecks and focus your optimization effort on the areas with the biggest impact.
- Test different techniques: The best memory-reduction technique depends on the specific characteristics of your dataset. Try several approaches to find the solution that works best for your use case.
- Monitor memory usage: Track memory usage during data processing to confirm that your optimizations are effective and to prevent out-of-memory errors.
- Understand your data: A deep understanding of your data is essential for choosing the most appropriate data types and optimization techniques.
- Consider the trade-offs: Some memory optimization techniques come with a slight performance overhead. Weigh the memory savings against the potential performance impact.
- Document your optimizations: Clearly document the memory optimization techniques you implement, so your code stays maintainable and understandable to others.
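For the monitoring point, pandas gives a quick built-in summary: df.info(memory_usage='deep') reports an accurate total that includes the string objects, matching the deep=True behaviour of .memory_usage():

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(1000),
    'label': ['alpha', 'beta'] * 500,
})

# 'deep' forces accurate accounting for object columns.
df.info(memory_usage='deep')

deep_total = df.memory_usage(deep=True).sum()
shallow_total = df.memory_usage().sum()
print(f'deep: {deep_total} bytes, shallow: {shallow_total} bytes')
```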
Conclusion
Optimizing Pandas memory usage is essential for working with large datasets efficiently and effectively. By understanding how Pandas stores data, choosing appropriate data types, using chunked processing, and adopting the other optimization techniques covered here, you can significantly reduce memory consumption and improve the performance of your data analysis workflows. This guide has provided a comprehensive overview of the key techniques and best practices for mastering memory usage reduction in Pandas. Remember to profile your code, test different techniques, and monitor memory usage to achieve the best results for your specific use case. By applying these principles, you can unlock the full potential of Pandas and tackle even the most demanding data analysis challenges.
By mastering these techniques, data scientists and analysts around the world can handle larger datasets, improve processing speed, and extract deeper insights from their data. This leads to more efficient research, better-informed business decisions, and ultimately contributes to a more data-driven world.