python - scikit-learn: query data dimension must match training data dimension
I'm trying to use the code from the scikit-learn site:
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
I'm using my own data. The problem is that I have a lot more than 2 features. If I try to "expand" the features from 2 to 3 or 4, I get:

"query data dimension must match training data dimension"
Here is my code (imports and the missing `liste`/`a` assignments restored, identifier casing fixed):

    import csv
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn.cross_validation import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.lda import LDA
    from sklearn.qda import QDA

    def machine():
        # Read tab-separated data, skipping the header row and any row
        # with empty fields.
        liste = []
        with open("test.txt", 'r') as csvr:
            reader = csv.reader(csvr, delimiter='\t')
            for i, row in enumerate(reader):
                if i == 0:
                    pass
                elif '' in row[2:]:
                    pass
                else:
                    liste.append(map(float, row[2:]))
        a = np.array(liste)

        h = .02  # step size in the mesh
        names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
                 "Random Forest", "AdaBoost", "Naive Bayes", "LDA", "QDA"]
        classifiers = [
            KNeighborsClassifier(1),
            SVC(kernel="linear", C=0.025),
            SVC(gamma=2, C=1),
            DecisionTreeClassifier(max_depth=5),
            RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
            AdaBoostClassifier(),
            GaussianNB(),
            LDA(),
            QDA()]

        X = a[:, :3]          # three features instead of two
        y = np.ravel(a[:, 13])
        linearly_separable = (X, y)
        datasets = [linearly_separable]

        figure = plt.figure(figsize=(27, 9))
        i = 1
        for ds in datasets:
            X, y = ds
            X = StandardScaler().fit_transform(X)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
            x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
            y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
            xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                                 np.arange(y_min, y_max, h))

            cm = plt.cm.RdBu
            cm_bright = ListedColormap(['#FF0000', '#0000FF'])
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                       alpha=0.6)
            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            i += 1

            for name, clf in zip(names, classifiers):
                ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
                print clf.fit(X_train, y_train)
                score = clf.score(X_test, y_test)
                print y.shape, X.shape
                if hasattr(clf, "decision_function"):
                    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
                    print Z
                else:
                    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
                Z = Z.reshape(xx.shape)
                ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
                ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
                ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                           alpha=0.6)
                ax.set_xlim(xx.min(), xx.max())
                ax.set_ylim(yy.min(), yy.max())
                ax.set_xticks(())
                ax.set_yticks(())
                ax.set_title(name)
                ax.text(xx.max() - .3, yy.min() + .3,
                        ('%.2f' % score).lstrip('0'),
                        size=15, horizontalalignment='right')
                i += 1

        figure.subplots_adjust(left=.02, right=.98)
        plt.show()
In this case I use 3 features. What am I doing wrong in the code, with the X_train and X_test data? With 2 features everything is OK.
My X value:
    (array([[ 1.,  1.,  0.],
            [ 1.,  0.,  0.],
            [ 1.,  0.,  0.],
            [ 1.,  0.,  0.],
            [ 1.,  1.,  0.],
            [ 1.,  0.,  0.],
            [ 1.,  0.,  0.],
            [ 3.,  3.,  0.],
            [ 1.,  1.,  0.],
            [ 1.,  1.,  0.],
            [ 0.,  0.,  0.],
            [ 0.,  0.,  0.],
            [ 0.,  0.,  0.],
            [ 0.,  0.,  0.],
            [ 0.,  0.,  0.],
            [ 0.,  0.,  0.],
            [ 4.,  4.,  2.],
            [ 0.,  0.,  0.],
            [ 6.,  3.,  0.],
            [ 5.,  3.,  2.],
            [ 2.,  2.,  0.],
            [ 4.,  4.,  2.],
            [ 2.,  1.,  0.],
            [ 2.,  2.,  0.]]),
     array([ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,
             1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.]))
The first array is the X array, and the second array is the y (target) array.
I'm sorry for the bad formatting. Here is the error:
    Traceback (most recent call last):
      File "allm.py", line 144, in <module>
        mainplot(nameplot, 1, 2)
      File "allm.py", line 117, in mainplot
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
      File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py", line 191, in predict_proba
        neigh_dist, neigh_ind = self.kneighbors(X)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 332, in kneighbors
        return_distance=return_distance)
      File "binary_tree.pxi", line 1298, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10433)
    ValueError: query data dimension must match training data dimension
And here is the X array before it is put into the dataset "ds":
    [[ 1.  1.  0.]
     [ 1.  0.  0.]
     [ 1.  0.  0.]
     [ 1.  0.  0.]
     [ 1.  1.  0.]
     [ 1.  0.  0.]
     [ 1.  0.  0.]
     [ 3.  3.  0.]
     [ 1.  1.  0.]
     [ 1.  1.  0.]
     [ 0.  0.  0.]
     [ 0.  0.  0.]
     [ 0.  0.  0.]
     [ 0.  0.  0.]
     [ 0.  0.  0.]
     [ 0.  0.  0.]
     [ 4.  4.  2.]
     [ 0.  0.  0.]
     [ 6.  3.  0.]
     [ 5.  3.  2.]
     [ 2.  2.  0.]
     [ 4.  4.  2.]
     [ 2.  1.  0.]
     [ 2.  2.  0.]]
This is happening because clf.predict_proba() requires an array in which each row has the same number of elements as the rows of the training data -- in other words, an input with shape (num_rows, 3).
When you were working with two-dimensional exemplars this worked, because the result of np.c_[xx.ravel(), yy.ravel()] is an array of two-element rows:

    print np.c_[xx.ravel(), yy.ravel()].shape
    (45738, 2)
These exemplars have 2 elements because they're created by np.meshgrid, which the sample code uses to build a set of inputs covering the two-dimensional space that is being plotted. Pass an array of three-element rows to clf.predict_proba, and things should work fine.
If you want to reproduce this specific piece of the sample code, you'll have to create a 3D meshgrid, as described in this question on SO. You'll also have to plot your results in 3D, for which mplot3d would serve as a starting point, though based on the (admittedly brief) look I gave the plotting in the sample code, I suspect that may be more trouble than it's worth. I'm not sure what the 3D analog of those plots would even look like.
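A sketch of that 3D meshgrid idea (the per-axis ranges here are hypothetical; in the asker's code they would come from the min/max of X[:, 0], X[:, 1], and X[:, 2]): np.meshgrid accepts three coordinate vectors, and flattening the result gives the (num_points, 3) rows a 3-feature classifier expects.

```python
import numpy as np

# Coarser step than the 2D example's h = .02, since a 3D grid grows cubically.
h = 0.5
xx, yy, zz = np.meshgrid(np.arange(-1, 1, h),
                         np.arange(-1, 1, h),
                         np.arange(-1, 1, h))

# Flatten into three-element rows, matching the training data dimension,
# so this array could be passed straight to clf.predict_proba.
grid = np.c_[xx.ravel(), yy.ravel(), zz.ravel()]
print(grid.shape)  # (64, 3): 4 points per axis, 4**3 grid points
```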