
Classification Models: Variable Selection

CloudDeveloper / 1425 reads

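Summary: The importance coefficients reflect each feature's influence. The larger the coefficient, the greater the role that feature plays in classification.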

import numpy as np  
import scipy as sp  
import pandas as pd
import matplotlib.pyplot as plt
Split train and test
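from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
# Features are every column except the last; the last column is the target (.ix was removed from pandas, so use .iloc).
# Run only one of the two splits below: the second overwrites the first.
x_train, x_test, y_train, y_test = train_test_split(customer.iloc[:, :-1], customer.iloc[:, -1], test_size=0.2)
x_train, x_test, y_train, y_test = train_test_split(order.iloc[:, :-1], order.iloc[:, -1], test_size=0.2)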
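The `customer` and `order` tables are not shown in this post, so the split cannot be run as is. The following is a minimal sketch on a made-up DataFrame (`toy`, with hypothetical columns `f0`..`f3` and a target column `tar` in the last position). The `stratify` and `random_state` arguments are additions for this sketch, not part of the original code; they keep the class ratio the same in both parts and make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: four numeric features plus a binary target in the last column
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f0", "f1", "f2", "f3"])
toy["tar"] = rng.integers(0, 2, size=100)

X = toy.iloc[:, :-1]   # all columns except the last
y = toy.iloc[:, -1]    # the last column is the target

# 80/20 split; stratify=y preserves the class proportions in train and test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(x_train.shape, x_test.shape)
```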
Pearson Correlation for Order
from scipy.stats import pearsonr

# Pearson r and p-value between each feature and the target (last column)
prr = []
for i in range(order.columns.size - 1):
    r, p = pearsonr(order.iloc[:, i], order.iloc[:, -1])
    prr.append((r, p))

# Use only the feature names (not the target) so the rows line up with prr
result = pd.concat([pd.DataFrame(order.columns[:-1].tolist()), pd.DataFrame(prr)], axis=1)
result.columns = ["Features", "Pearson", "Pvalue"]
result
result.to_csv("result.csv", index=True, header=True)
Pearson Correlation for Customer
from scipy.stats import pearsonr

# Pearson r and p-value between each feature and the target (last column)
prr = []
for i in range(customer.columns.size - 1):
    r, p = pearsonr(customer.iloc[:, i], customer.iloc[:, -1])
    prr.append((r, p))

# Use only the feature names (not the target) so the rows line up with prr
result = pd.concat([pd.DataFrame(customer.columns[:-1].tolist()), pd.DataFrame(prr)], axis=1)
result.columns = ["Features", "Pearson", "Pvalue"]
result
result.to_csv("result.csv", index=True, header=True)
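One caveat when reading the Pearson table: the coefficient measures only linear association, so a feature can matter a lot and still score close to zero. The sketch below illustrates this on synthetic data (the arrays `x`, `linear` and `quadratic` are invented for the example, not taken from the customer or order tables).

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)

# A linear dependence on x and a purely quadratic one, both with a little noise
linear = 2 * x + rng.normal(0, 0.1, 2000)
quadratic = x ** 2 + rng.normal(0, 0.05, 2000)

r_lin, p_lin = pearsonr(x, linear)
r_quad, p_quad = pearsonr(x, quadratic)

print("linear:    r = %.3f" % r_lin)    # close to 1
print("quadratic: r = %.3f" % r_quad)   # close to 0 although y depends strongly on x
```

This is one reason to look at a nonlinear measure such as MIC, or at model-based importances, alongside the Pearson scores.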
Random forest
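# Regression forest, for a continuous target
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(x_train, y_train)

# Classification forest, for a categorical target; this rebinds clf, so later uses of clf refer to this model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=100)  # n_jobs is the number of parallel workers, not the number of trees
clf.fit(x_train, y_train)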
MIC
from minepy import MINE

m = MINE()
mic = []
for i in range(customer.columns.size - 1):
    # compute_score() returns nothing; read the score afterwards with m.mic()
    m.compute_score(customer.iloc[:, i], customer.iloc[:, 34])  # column 34 is the target column here
    mic.append(m.mic())
result = pd.concat([pd.DataFrame(customer.columns[:-1].tolist()), pd.DataFrame(mic)], axis=1)
result.columns = ["Features", "MIC"]
result.to_csv("result.csv", index=True, header=True)
Feature Correlation
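# Pairwise correlation matrix of all columns
corr = customer.corr()
corr.to_csv("result.csv", index = True, header = True)

# Correlation of every column with the target column "tar", computed in two equivalent ways
tar_corr = lambda col: col.corr(cus_call["tar"])
cus_call.apply(tar_corr)
cus_call.corrwith(cus_call.tar)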
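The `cus_call` table is not shown here, so below is a small sketch on invented data (a hypothetical frame `df` with columns `a`, `b`, `c` and a binary `tar`). It checks that the column-wise `apply` and `corrwith` give the same per-feature correlations with the target.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" drives the target, "b" and "c" are pure noise
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["tar"] = (df["a"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Option 1: apply a function to each column (each column arrives as a Series)
via_apply = df.apply(lambda col: col.corr(df["tar"]))

# Option 2: corrwith does the same in a single call
via_corrwith = df.corrwith(df["tar"])

print(via_corrwith.sort_values(ascending=False))
```

Both results include the target's correlation with itself (1.0), which can be dropped with `.drop("tar")` before ranking.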
Feature Importance

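The importance coefficients reflect each feature's influence. The larger the coefficient, the greater the role that feature plays in classification.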

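# Sort by importance, descending; sorting the raw (name, importance) pairs would order them by feature name instead
importances = pd.DataFrame(sorted(zip(x_train.columns, map(lambda x: round(x, 4), clf.feature_importances_)), key=lambda pair: pair[1], reverse=True))
importances.columns = ["Features", "Importance"]
importances.to_csv("result.csv", index = True, header = True)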
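To see the ranking on data where the answer is known, here is a minimal self-contained sketch. It assumes synthetic data from `make_classification` as a stand-in for the customer table; the column names `f0`..`f7` and the settings `n_estimators=100` and `random_state=0` are choices made for this example only. With `shuffle=False`, the first three columns are the informative ones and the rest are noise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: f0..f2 carry signal, f3..f7 are noise
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X, columns=["f%d" % i for i in range(8)])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importances sum to 1; sort so the most influential feature comes first
ranking = (
    pd.DataFrame({"Features": X.columns, "Importance": clf.feature_importances_})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Because the scores are normalized to sum to 1, they are best read as relative rankings within one model rather than as absolute effect sizes.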

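Copyright of this article belongs to the author. Do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite the address of this article: https://www.ucloud.cn/yun/44567.html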
