A comprehensive guide to Scikit-learn's feature selection techniques for dimensionality reduction, helping data science practitioners around the world build more efficient and robust models.
Feature Selection with Scikit-learn: Mastering Dimensionality Reduction for Global Datasets
In a world of ever-expanding data, the sheer number of features can overwhelm even the most sophisticated machine learning models. This phenomenon, often called the "curse of dimensionality", can lead to higher computational costs, lower model accuracy, and reduced interpretability. Fortunately, feature selection and dimensionality reduction techniques offer powerful solutions. Scikit-learn, a cornerstone of the Python machine learning ecosystem, provides a rich suite of tools for tackling these challenges effectively, making it an indispensable resource for data scientists worldwide.
This comprehensive guide delves into the intricacies of Scikit-learn's feature selection capabilities, with a focus on dimensionality reduction. We explore the principles underlying the various methodologies, their practical implementation through code examples, and considerations for diverse global datasets. Our goal is to equip a global audience of aspiring and seasoned data practitioners with the knowledge to make informed feature selection decisions, leading to more efficient, accurate, and interpretable machine learning models.
Understanding Dimensionality Reduction
Before diving into Scikit-learn's specific tools, it is important to understand the fundamental concept of dimensionality reduction. The process transforms data from a high-dimensional space into a lower-dimensional one while retaining as much of the important information as possible. The benefits are manifold:
- Less overfitting: fewer features make for simpler models that are less likely to learn the noise in the training data.
- Shorter training times: models with fewer features train significantly faster.
- Better interpretability: relationships among a smaller set of features are easier to understand.
- Lower storage requirements: fewer dimensions mean less memory is needed.
- Noise reduction: eliminating irrelevant or redundant features can yield cleaner data.
Dimensionality reduction approaches fall into two broad categories.
1. Feature Selection
This approach selects the subset of the original features most relevant to the problem at hand. The original features are retained; only their number is reduced. It is like identifying the most influential ingredients in a recipe and discarding the rest.
2. Feature Extraction
This approach transforms the original features into a new, smaller set of features. These new features are combinations or projections of the original ones, designed to capture the most important variance or information in the data. It is akin to distilling the original ingredients into a concentrated essence.
Scikit-learn provides powerful tools for both approaches. We focus on techniques that achieve dimensionality reduction through feature selection or extraction.
Scikit-learn's Feature Selection Methods
Scikit-learn offers several ways to perform feature selection, which can be grouped into three broad categories.
1. Filter Methods
Filter methods evaluate the relevance of features based on their intrinsic properties, independently of any specific machine learning model. They are generally fast and computationally cheap, making them ideal for initial data exploration or for very large datasets. Common metrics include correlation, mutual information, and statistical tests.
a) Correlation-Based Feature Selection
Features that are highly correlated with the target variable are considered important. Conversely, features that are highly correlated with one another (multicollinearity) are likely to be redundant, and removing them is worth considering. Scikit-learn's feature_selection module provides tools that support this.
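As a quick illustration (the 0.9 cutoff, the random frame, and the column names are assumptions of this sketch rather than anything in Scikit-learn's API), mutually redundant features can be pruned with Pandas before any model is involved:
import pandas as pd
import numpy as np
# Hypothetical numeric frame; f4 is engineered to nearly duplicate f1
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 4)), columns=['f1', 'f2', 'f3', 'f4'])
df['f4'] = df['f1'] * 0.95 + rng.random(100) * 0.05
# Absolute pairwise correlations; keep only the upper triangle to avoid double counting
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop any column correlated above 0.9 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(f"Dropped redundant features: {to_drop}")
df_reduced = df.drop(columns=to_drop)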
Example: Variance Threshold
Features with very low variance are unlikely to carry much discriminative power. The VarianceThreshold class removes features whose variance does not meet a given threshold. This is particularly useful for numerical features.
from sklearn.feature_selection import VarianceThreshold
import numpy as np
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector = VarianceThreshold(threshold=0.0)
selector.fit_transform(X)
# Output: array([[2, 0], [1, 4], [1, 1]])
In this example, the first feature (all zeros) and the last feature (constant 3) both have zero variance and are removed. This is a basic but effective way to discard constant or near-constant features that carry no predictive power.
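The threshold can also encode a rule of thumb. For boolean features the variance of a Bernoulli variable is p(1 - p), so removing columns that hold the same value in more than 80% of samples looks like the sketch below, which follows the pattern given in the Scikit-learn documentation:
from sklearn.feature_selection import VarianceThreshold
# Boolean features: drop any column that is constant in more than 80% of samples,
# since Var[X] = p * (1 - p) for a Bernoulli variable
X_bool = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
print(selector.fit_transform(X_bool))  # the first, mostly-zero column is removed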
Example: Correlation with the Target (using Pandas and SciPy)
Scikit-learn has no high-level function that directly computes the correlation of every feature type with the target, but this common preprocessing step is easy to carry out with Pandas and SciPy.
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
# Sample data
data = {
'feature1': np.random.rand(100),
'feature2': np.random.rand(100) * 2,
'feature3': np.random.rand(100) - 1,
'target': np.random.randint(0, 2, 100)
}
df = pd.DataFrame(data)
# Calculate the Pearson correlation of each feature with the target using SciPy
correlations = {col: pearsonr(df[col], df['target'])[0]
                for col in df.columns.drop('target')}
# Select features with absolute correlation above a certain threshold (e.g., 0.2)
selected_features = [col for col, r in correlations.items() if abs(r) > 0.2]
print(f"Features correlated with target: {selected_features}")
This snippet shows how to identify features that have a linear relationship with the target variable. For a binary target, the point-biserial correlation is appropriate; for categorical targets, other statistical tests are a better fit.
b) Statistical Tests
Filter methods can use statistical tests to measure the dependency between features and the target variable. These are particularly useful for categorical features, or when specific assumptions about the data distribution can be made.
Scikit-learn's feature_selection module provides:
- f_classif: ANOVA F-value between label and features for classification tasks. Assumes numerical features and a categorical target.
- f_regression: F-value between label and features for regression tasks. Assumes numerical features and a numerical target.
- mutual_info_classif: mutual information for a discrete target variable. Can capture non-linear relationships.
- mutual_info_regression: mutual information for a continuous target variable.
- chi2: chi-squared statistic of non-negative features for classification tasks. Used for categorical features.
Example: Using `f_classif` with `SelectKBest`
SelectKBest is a meta-transformer that selects features according to a chosen scoring function (such as f_classif).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
iris = load_iris()
X, y = iris.data, iris.target
# Select the top 2 features using f_classif
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_new.shape}")
# To see which features were selected:
selected_indices = selector.get_support(indices=True)
print(f"Selected feature indices: {selected_indices}")
print(f"Selected feature names: {[iris.feature_names[i] for i in selected_indices]}")
This example shows how to select the "k" best features for classification based on statistical significance. The F-value computed by f_classif essentially measures the variance between groups (classes) relative to the variance within groups: a higher F-value indicates a stronger relationship between the feature and the target.
Global consideration: when working with datasets from different regions (for example, sensor data from different climates or financial data from different economic systems), the statistical properties of features can differ substantially. It is important to understand the assumptions behind these statistical tests (such as normality for ANOVA); non-parametric measures like mutual information can be more robust across diverse scenarios, as sketched below.
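As a minimal sketch of that alternative, mutual_info_classif drops into the SelectKBest workflow from the previous example with no other changes:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
X, y = load_iris(return_X_y=True)
# Mutual information makes no linearity or normality assumptions
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_new = selector.fit_transform(X, y)
print(f"Reduced shape: {X_new.shape}")
print(f"Mutual information scores: {selector.scores_}")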
2. Wrapper Methods
Wrapper methods use a specific machine learning model to evaluate the quality of feature subsets. They "wrap" the model training process inside a search strategy to find the best-performing feature set. They are generally more accurate than filter methods, but far more computationally expensive because of the repeated model training.
a) Recursive Feature Elimination (RFE)
RFE works by recursively removing features. It first trains a model on the full feature set, then removes the least important features based on the model's coefficients or feature importances. The process repeats until the desired number of features remains.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Generate synthetic data
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=5, random_state=42)
# Use a Logistic Regression model (can be any model that supports coef_ or feature_importances_)
estimator = LogisticRegression(solver='liblinear')
# Initialize RFE to select top 5 features
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
X_new = selector.transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_new.shape}")
# To see which features were selected:
selected_indices = selector.get_support(indices=True)
print(f"Selected feature indices: {selected_indices}")
RFE is powerful because it accounts for interactions between features as evaluated by the chosen model. The `step` parameter controls how many features are removed at each iteration.
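If you would rather not fix the number of features up front, Scikit-learn also provides RFECV, which chooses the subset size by cross-validation; a minimal sketch on the same kind of synthetic data:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=5, random_state=42)
# RFECV runs RFE inside cross-validation and keeps the best-scoring subset size
selector = RFECV(LogisticRegression(solver='liblinear'), step=1, cv=5)
selector.fit(X, y)
print(f"Optimal number of features: {selector.n_features_}")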
b) Sequential Feature Selection (SFS)
Sequential feature selection either grows a feature set through forward selection (starting from an empty set and adding features one at a time) or shrinks it through backward elimination (starting from all features and removing them one at a time). Scikit-learn implements both directions with the `SequentialFeatureSelector` class in `sklearn.feature_selection`.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=5, random_state=42)
estimator = LogisticRegression(solver='liblinear')
# Forward selection: add features until desired number is reached
sfs_forward = SequentialFeatureSelector(
estimator, n_features_to_select=10, direction='forward', cv=5)
sfs_forward.fit(X, y)
X_new_forward = sfs_forward.transform(X)
print(f"Forward Selection - Reduced shape: {X_new_forward.shape}")
# Backward selection: start with all features and remove
sfs_backward = SequentialFeatureSelector(
estimator, n_features_to_select=10, direction='backward', cv=5)
sfs_backward.fit(X, y)
X_new_backward = sfs_backward.transform(X)
print(f"Backward Selection - Reduced shape: {X_new_backward.shape}")
The `cv` parameter of `SequentialFeatureSelector` enables cross-validation, which makes the feature selection more robust and less prone to overfitting the training data. This matters greatly when models are applied globally, where data quality and distributions can vary widely.
3. Embedded Methods
Embedded methods perform feature selection as part of the model training process itself. They have the advantage of accounting for feature interactions while being computationally cheaper than wrapper methods. Many regularized models fall into this category.
a) L1 Regularization (Lasso)
Linear models such as `Lasso` (Least Absolute Shrinkage and Selection Operator) use L1 regularization, which adds a penalty on the absolute values of the coefficients. This can drive some coefficients to exactly zero, and features with zero coefficients are effectively removed.
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
import numpy as np
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=20, n_informative=10, random_state=42, noise=10)
# Lasso with alpha (regularization strength)
# A higher alpha leads to more regularization and potentially more zero coefficients
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X, y)
# Get the number of non-zero coefficients (selected features)
non_zero_features = np.sum(lasso.coef_ != 0)
print(f"Number of features selected by Lasso: {non_zero_features}")
# To get the actual selected features:
selected_features_mask = lasso.coef_ != 0
X_new = X[:, selected_features_mask]
print(f"Reduced shape: {X_new.shape}")
With `LassoCV`, the optimal alpha can be found automatically through cross-validation.
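A minimal sketch of that, reusing the synthetic regression data from above:
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
import numpy as np
X, y = make_regression(n_samples=100, n_features=20, n_informative=10, random_state=42, noise=10)
# LassoCV searches a path of alpha values with 5-fold cross-validation
lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Features selected: {np.sum(lasso_cv.coef_ != 0)}")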
b) Tree-Based Feature Importances
Ensemble methods such as `RandomForestClassifier`, `GradientBoostingClassifier`, and `ExtraTreesClassifier` provide feature importances out of the box. These are computed from how much each feature contributes to reducing impurity or error across the trees in the ensemble. Low-importance features can then be dropped.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
# Get feature importances
importances = model.feature_importances_
# Sort features by importance
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for f in range(X.shape[1]):
print(f"{f + 1}. feature {indices[f]} ({cancer.feature_names[indices[f]]}) - {importances[indices[f]]:.4f}")
# Select top N features (e.g., top 10)
N = 10
selected_features_mask = np.zeros(X.shape[1], dtype=bool)
selected_features_mask[indices[:N]] = True
X_new = X[:, selected_features_mask]
print(f"Reduced shape after selecting top {N} features: {X_new.shape}")
Tree-based methods are powerful because they capture non-linear relationships and feature interactions. They apply broadly across domains, from medical diagnosis (as in the example) to financial fraud detection in markets around the world.
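Rather than building the mask by hand as above, the thresholding step can be delegated to Scikit-learn's SelectFromModel; a short sketch using the median importance as the cutoff (the 'median' threshold is an illustrative choice):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Keep only features whose importance exceeds the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median')
X_new = selector.fit_transform(X, y)
print(f"Reduced shape: {X_new.shape}")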
Feature Extraction for Dimensionality Reduction
While feature selection keeps the original features, feature extraction creates a new, reduced feature set. It is especially useful when the original features are highly correlated, or when projecting the data into a lower-dimensional space that captures the maximum variance.
1. Principal Component Analysis (PCA)
PCA is a linear transformation technique that finds a set of orthogonal axes (principal components) capturing the maximum variance in the data. The first principal component captures the largest variance, the second (orthogonal to the first) captures the next largest, and so on. Keeping only the first "k" principal components achieves the dimensionality reduction.
Important note: PCA is sensitive to feature scales. It is important to scale the data (for example with `StandardScaler`) before applying PCA.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X, y = wine.data, wine.target
# Scale the data
X_scaled = StandardScaler().fit_transform(X)
# Initialize PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Original shape: {X.shape}")
print(f"Reduced shape after PCA: {X_pca.shape}")
# The explained variance ratio shows how much variance each component captures
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {np.sum(pca.explained_variance_ratio_):.4f}")
PCA is excellent for reducing high-dimensional data to two or three dimensions for visualization. It is a fundamental technique in exploratory data analysis and can substantially speed up subsequent modeling steps. Its effectiveness has been observed in domains such as image processing and genomics.
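Instead of fixing the number of components, you can pass a float between 0 and 1 as n_components, and PCA keeps however many components are needed to explain that share of the variance; a short sketch targeting 95%:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
# A float n_components keeps the fewest components reaching that variance share
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Components kept for 95% of the variance: {pca.n_components_}")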
2. Linear Discriminant Analysis (LDA)
Unlike PCA, which is unsupervised and aims to maximize variance, LDA is a supervised technique that seeks a lower-dimensional representation maximizing the separation between classes. It is used primarily for classification tasks.
Important note: like PCA, LDA benefits from scaled features. In addition, the number of LDA components is capped at `n_classes - 1`.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# Scale the data
X_scaled = StandardScaler().fit_transform(X)
# Initialize LDA. Number of components cannot exceed n_classes - 1 (which is 2 for Iris)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
print(f"Original shape: {X.shape}")
print(f"Reduced shape after LDA: {X_lda.shape}")
# LDA also exposes explained_variance_ratio_, here reflecting class separability
print(f"Explained variance ratio (class separability): {lda.explained_variance_ratio_}")
LDA is particularly useful when the goal is to build a classifier that cleanly separates the different categories in the data, a common requirement in global applications such as customer segmentation and disease classification.
3. t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique used primarily for visualizing high-dimensional datasets. It maps high-dimensional data points into a low-dimensional space (usually 2D or 3D) such that similar points end up at similar distances in the low-dimensional space. It excels at revealing local structure and clusters in the data.
Important note: t-SNE is computationally expensive and is usually used for visualization rather than as a preprocessing step for model training. Results can vary with the random initialization and parameter settings.
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
digits = load_digits()
X, y = digits.data, digits.target
# For demonstration, we'll use a subset of the data as t-SNE can be slow
subset_indices = np.random.choice(len(X), 1000, replace=False)
X_subset = X[subset_indices]
y_subset = y[subset_indices]
# Initialize t-SNE with 2 components
# perplexity relates to the number of nearest neighbors considered (30 is a common choice)
# n_iter sets the number of optimization iterations (renamed max_iter in newer scikit-learn releases)
tsne = TSNE(n_components=2, perplexity=30, n_iter=300, random_state=42)
X_tsne = tsne.fit_transform(X_subset)
print(f"Original subset shape: {X_subset.shape}")
print(f"Reduced shape after t-SNE: {X_tsne.shape}")
# Plotting the results (optional, for visualization)
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset, cmap='viridis', alpha=0.7)
plt.title('t-SNE visualization of Digits dataset')
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.legend(*scatter.legend_elements(), title='Classes')
plt.show()
t-SNE is invaluable for understanding the inherent structure of the complex, high-dimensional data encountered in fields such as genomics and social network analysis, offering visual insight into patterns that might otherwise stay hidden.
Choosing the Right Technique for Global Datasets
Choosing the appropriate feature selection or extraction method is not a one-size-fits-all decision. Several factors, which matter especially for global datasets, influence the choice:
- Nature of the data: Is it numerical, categorical, or mixed? Are there known distributions? For example, chi2 suits non-negative categorical features, while f_classif suits numerical features with a categorical target.
- Model type: linear models can benefit from L1 regularization, while tree-based models provide importances naturally.
- Computational resources: filter methods are the fastest, followed by embedded methods, then wrapper methods and t-SNE.
- Interpretability requirements: if explaining *why* a prediction is made is paramount, feature selection methods that keep the original features (such as RFE or L1) are often preferred over feature extraction methods that create abstract components (such as PCA).
- Linearity versus non-linearity: PCA and linear models assume linear relationships, whereas t-SNE and tree-based methods can capture non-linear patterns.
- Supervised versus unsupervised: LDA is supervised (it uses the target variable), while PCA is unsupervised.
- Scales and units: feature scaling is essential for PCA and LDA. Keep in mind scale differences in data collected from different regions; for example, currency values or sensor readings can sit on vastly different scales depending on the country or sensor type.
- Cultural and regional nuances: with datasets involving human behavior, demographics, or sentiment from different cultural contexts, feature interpretation can become complicated. A feature that is highly predictive in one region may be irrelevant or misleading in another because of differing social norms, economic conditions, or data collection practices. Always bring in domain expertise when evaluating feature importance across diverse populations.
Actionable insights:
- Start simple: begin with filter methods (for example, variance thresholds and statistical tests) for a quick assessment and to remove obvious noise.
- Iterate and evaluate: try different methods and assess their impact on model performance using appropriate metrics and cross-validation.
- Visualize: use techniques such as PCA and t-SNE to view the data in low dimensions; this can reveal the underlying structure and inform your feature selection strategy.
- Value domain expertise: collaborate with domain experts to understand the meaning and relevance of features, especially when working with complex global data.
- Consider ensemble approaches: combining several feature selection techniques can sometimes give better results than relying on a single method; see the sketch after this list.
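One hedged way to combine selectors is to keep only the features that two independent methods agree on; the intersection rule below is an illustrative convention, not a built-in Scikit-learn API:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif
X, y = load_breast_cancer(return_X_y=True)
# Boolean masks from two independent selectors
mask_filter = SelectKBest(f_classif, k=15).fit(X, y).get_support()
mask_model = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median').fit(X, y).get_support()
# Keep only the features both methods agree on
mask_both = mask_filter & mask_model
print(f"Features kept by consensus: {np.sum(mask_both)} of {X.shape[1]}")
X_new = X[:, mask_both]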
Scikit-learn Pipelines for Integrated Workflows
Scikit-learn's Pipeline object is extremely useful for integrating preprocessing steps, including feature selection and extraction, with model training. It ensures that feature selection is performed consistently within each cross-validation fold, preventing data leakage and yielding more reliable results. This is especially important when building models to be deployed across diverse global markets.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
import numpy as np
bc = load_breast_cancer()
X, y = bc.data, bc.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a pipeline that first scales, then selects features, then trains a classifier
pipe = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_classif, k=10)),
('classifier', LogisticRegression(solver='liblinear'))
])
# Train the pipeline
pipe.fit(X_train, y_train)
# Evaluate the pipeline using cross-validation
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Average CV score: {np.mean(cv_scores):.4f}")
# Make predictions on the test set
accuracy = pipe.score(X_test, y_test)
print(f"Test set accuracy: {accuracy:.4f}")
Using a pipeline lets the entire process, from scaling through feature selection to classification, be treated as a single entity. This is a best practice for robust model development, and it applies especially to models intended for global deployment, where consistent performance across different data distributions is critical.
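A further benefit, sketched here using the pipe and training split defined above (the parameter grids are illustrative), is that the selector's hyperparameters can be tuned jointly with the model via GridSearchCV, using the pipeline's step-name double-underscore convention:
from sklearn.model_selection import GridSearchCV
# Tune the number of selected features and the regularization strength together
param_grid = {
    'selector__k': [5, 10, 15, 20],
    'classifier__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")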
Conclusion
Dimensionality reduction through feature selection and extraction is an indispensable step in building efficient, robust, and interpretable machine learning models. Scikit-learn supports data scientists around the world with a comprehensive toolkit for these challenges. Understanding the various methodologies (filter, wrapper, and embedded methods, along with feature extraction techniques such as PCA and LDA) lets you make informed decisions tailored to your specific dataset and goals.
For a global audience, the considerations extend beyond algorithm choice. It is important to understand the provenance of your data, the biases that feature collection in different regions may introduce, and the specific interpretability needs of regional stakeholders. Tools such as Scikit-learn's Pipeline guarantee structured, reproducible workflows and are essential for deploying reliable AI solutions in diverse international contexts.
As you navigate the complexities of modern data science, mastering Scikit-learn's feature selection capabilities will undoubtedly be a significant asset, letting you unlock the full potential of your data, whatever its origin.