pyspark - Get CSV to Spark dataframe


I'm using Python on Spark and would like to get a CSV into a dataframe.

The Spark SQL documentation strangely provides no explanation of CSV as a source.

I have found spark-csv, but I have issues with two parts of the documentation:

  • "this package can added spark using --jars command line option. example, include when starting spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3" need add argument everytime launch pyspark or spark-submit? seems inelegant. isn't there way import in python rather redownloading each time?

  • df = sqlContext.load(source="com.databricks.spark.csv", header="true", path="cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? And how do I load a local file on Linux, say "/spark_hadoop/spark-1.3.1-bin-cdh4/cars.csv"? (See the second sketch after this list.)
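On the first point: the package is downloaded once and cached locally (under ~/.ivy2), so the flag does not re-download it on every launch, but you can also make it permanent. A minimal sketch, assuming a Spark build whose conf/spark-defaults.conf honors the spark.jars.packages property (the config-file equivalent of --packages):

# conf/spark-defaults.conf -- picked up automatically by pyspark and spark-submit
spark.jars.packages  com.databricks:spark-csv_2.10:1.0.3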
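On the second point: source names the data source implementation Spark SQL should use to read the file; "com.databricks.spark.csv" is the name under which the spark-csv package registers itself. For a local file on Linux, an absolute path works, and the file:// scheme makes it unambiguous when a Hadoop configuration is also present. A minimal sketch using the path from the question:

# "source" selects the data source implementation; spark-csv registers
# itself as "com.databricks.spark.csv".
df = sqlContext.load(source="com.databricks.spark.csv",
                     header="true",
                     path="file:///spark_hadoop/spark-1.3.1-bin-cdh4/cars.csv")
df.show()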

Read the CSV file into an RDD and then generate a RowRDD from the original RDD.

Create the schema represented by a StructType matching the structure of the Rows in the RDD created in step 1.

Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

from pyspark.sql.types import StructField, StringType, StructType

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = sqlContext.createDataFrame(people, schema)
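The same three steps carry over to the CSV case. A minimal sketch adapted to the asker's file, assuming a header row and treating every column as a string (note the naive split(",") does not handle quoted commas, which is exactly what spark-csv handles for you):

from pyspark.sql.types import StructField, StringType, StructType

lines = sc.textFile("file:///spark_hadoop/spark-1.3.1-bin-cdh4/cars.csv")

# Separate the header row from the data rows.
header = lines.first()
rows = lines.filter(lambda l: l != header).map(
    lambda l: tuple(field.strip() for field in l.split(",")))

# Build the schema from the header; every column is a string here.
schema = StructType([StructField(name.strip(), StringType(), True)
                     for name in header.split(",")])

cars = sqlContext.createDataFrame(rows, schema)
cars.show()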

Source: Spark Programming Guide

