scala - java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL


I'm hitting a strange problem when trying to load a JDBC DataFrame into Spark SQL.

I've tried several Spark clusters: YARN, a standalone cluster, and pseudo-distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs both in spark-shell and when executing the code with spark-submit. I've tried the MySQL and MS SQL JDBC drivers without success.

Consider the following sample:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"

val t1 = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "t1",
    "partitionColumn" -> "id",
    "lowerBound" -> "0",
    "upperBound" -> "100",
    "numPartitions" -> "50"
  ))
}

So far so good; the schema gets resolved properly:

t1: org.apache.spark.sql.DataFrame = [id: int, name: string]

But when I evaluate the DataFrame:

t1.take(1) 

the following exception occurs:

15/04/29 01:56:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.42): java.sql.SQLException: No suitable driver found for jdbc:mysql://<hostname>:3306/test
    at java.sql.DriverManager.getConnection(DriverManager.java:689)
    at java.sql.DriverManager.getConnection(DriverManager.java:270)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
    at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

When I try to open a JDBC connection on an executor:

import java.sql.DriverManager

sc.parallelize(0 until 2, 2).map { i =>
  // Load the driver class explicitly, then connect from inside the task
  Class.forName(driver)
  val conn = DriverManager.getConnection(url)
  conn.close()
  i
}.collect()

it works perfectly:

res1: Array[Int] = Array(0, 1)

When I run the same code on local Spark, it works too:

scala> t1.take(1)
...
res0: Array[org.apache.spark.sql.Row] = Array([1,one])

I'm using Spark pre-built with Hadoop 2.4 support.

The easiest way to reproduce the problem is to start Spark in pseudo-distributed mode with the start-all.sh script and run the following command:

/path/to/spark-shell --master spark://<hostname>:7077 --jars /path/to/mysql-connector-java-5.1.35.jar --driver-class-path /path/to/mysql-connector-java-5.1.35.jar 

Is there a way to work around this? It looks like a severe problem, so it's strange that googling doesn't help here.

Apparently the issue has already been reported:

https://issues.apache.org/jira/browse/SPARK-6913

The problem is that java.sql.DriverManager doesn't see drivers loaded by classloaders other than the bootstrap classloader.
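
To make the lookup behind that error message more concrete, here is a minimal sketch that uses only the JDK and no Spark. It lists the drivers that DriverManager exposes to the calling code and captures the error raised when no registered driver accepts a URL. The object name DriverLookupDemo, the helper names, and the jdbc:nosuchdriver URL are hypothetical and exist only for illustration; the sketch does not reproduce the classloader issue itself, it only shows the mechanism that fails on the executors.

```scala
import java.sql.{DriverManager, SQLException}
import scala.collection.mutable.ListBuffer

object DriverLookupDemo {
  // Class names of the JDBC drivers that DriverManager exposes to this caller.
  def visibleDrivers(): List[String] = {
    val drivers = DriverManager.getDrivers
    val names = ListBuffer.empty[String]
    while (drivers.hasMoreElements) {
      names += drivers.nextElement().getClass.getName
    }
    names.toList
  }

  // Some(message) if the connection attempt fails with an SQLException,
  // None if a connection could be opened (and was closed again).
  def connectError(url: String): Option[String] =
    try {
      DriverManager.getConnection(url).close()
      None
    } catch {
      case e: SQLException => Some(e.getMessage)
    }

  def main(args: Array[String]): Unit = {
    println("Visible drivers: " + visibleDrivers().mkString(", "))
    // Hypothetical URL scheme that no driver is expected to accept.
    println("Connect error: " + connectError("jdbc:nosuchdriver://localhost/test"))
  }
}
```

Calling DriverLookupDemo.visibleDrivers() from inside a Spark task, in the same way as the connection test in the question, might help confirm whether the MySQL driver is visible to code running on an executor.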

As a temporary workaround, it's possible to add the required drivers to the boot classpath of the executors.
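
One possible way to approximate this workaround in an application launched with spark-submit is sketched below. It assumes that putting the connector JAR on the executor JVM classpath through the spark.executor.extraClassPath setting is sufficient, which is not spelled out above and may differ between deployments. The object name, the application name, and the JAR location are hypothetical, and the JAR has to exist at that path on every worker node.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JdbcDriverWorkaround {
  def main(args: Array[String]): Unit = {
    // Hypothetical location; the file must already be present on each worker node.
    val driverJar = "/path/to/mysql-connector-java-5.1.35.jar"

    val conf = new SparkConf()
      .setAppName("jdbc-driver-workaround") // hypothetical application name
      // Assumption: adding the JAR to the executor JVM classpath makes the
      // driver visible to DriverManager inside tasks.
      .set("spark.executor.extraClassPath", driverJar)

    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(conf)
    try {
      // Build the SQLContext and load the JDBC DataFrame here, as in the question.
    } finally {
      sc.stop()
    }
  }
}
```

For spark-shell, the same property can be passed on the command line with --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.35.jar, again assuming the JAR is available at that path on the workers.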

Update: this pull request fixes the problem: https://github.com/apache/spark/pull/5782

Update 2: the fix was merged into Spark 1.4.

