pyspark - Get CSV to Spark dataframe
I'm using Python on Spark and would like to get a CSV into a DataFrame.
The Spark SQL documentation strangely doesn't provide explanations for CSV as a source.
I have found spark-csv, but I have issues with two parts of the documentation:
"this package can added spark using --jars command line option. example, include when starting spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in Python rather than redownloading it each time?

df = sqlContext.load(source="com.databricks.spark.csv", header="true", path="cars.csv")

Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/spark_hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
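For reference, a minimal sketch of how the two pieces fit together, assuming the shell is launched with the package exactly as in the quoted documentation. Here "source" names the data-source implementation to use, and the "file://" prefix is the usual way to point Spark at the local filesystem rather than HDFS:

# Shell started with: bin/pyspark --packages com.databricks:spark-csv_2.10:1.0.3
df = sqlContext.load(source="com.databricks.spark.csv",
                     header="true",
                     path="file:///spark_hadoop/spark-1.3.1-bin-cdh4/cars.csv")
df.show()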
A DataFrame can also be created programmatically, in three steps:

1. Read the CSV file into an RDD and then generate a RowRDD from the original RDD.
2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructField, StringType, StructType

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
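The same pattern adapts to the CSV case from the question. A sketch, assuming cars.csv has no header row, that each line splits into exactly as many fields as the schema declares, and that the column names below are hypothetical placeholders:

from pyspark.sql.types import StructField, StringType, StructType

lines = sc.textFile("file:///spark_hadoop/spark-1.3.1-bin-cdh4/cars.csv")
# Split each line on commas and strip whitespace from every field.
parts = lines.map(lambda l: [c.strip() for c in l.split(",")])

# Hypothetical column names; replace with the actual columns of cars.csv.
fields = [StructField(name, StringType(), True) for name in ["make", "model", "year"]]
schema = StructType(fields)

carsDF = sqlContext.createDataFrame(parts, schema)
carsDF.show()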
Source: Spark Programming Guide