The data source API makes it quick and easy to read from different sources (JSON, Parquet, an RDBMS via JDBC),
apply mixed processing (e.g. joining JSON data with Parquet data), and write the result back in a chosen format (JSON, Parquet) to a target system (HDFS, S3).
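A minimal sketch of such a pipeline, assuming hypothetical inputs people.json and users.parquet that share an id column, and a hypothetical HDFS output path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MixApp").getOrCreate()
val jsonDF = spark.read.format("json").load("people.json")          // hypothetical JSON input
val parquetDF = spark.read.format("parquet").load("users.parquet")  // hypothetical Parquet input
val joined = jsonDF.join(parquetDF, "id")                           // json join parquet
joined.write.format("parquet").save("hdfs://namenode:8020/output/") // write back to HDFS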
spark.read.format(format)
format
built-in: json, parquet, jdbc, csv (built in since 2.x)
packages: external formats not bundled with Spark, published at spark-packages.org (see the sketch below)
df.write.format(format).save(path)
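For an external format, the package is pulled in when launching Spark; a sketch using the Databricks Avro package for Spark 2.x (the coordinates are an example, check spark-packages.org for the current version):

spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

val avroDF = spark.read.format("com.databricks.spark.avro").load("users.avro")  // hypothetical Avro file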
Working with Parquet
import org.apache.spark.sql.SparkSession

object ParquetApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetApp").master("local[2]").getOrCreate()
    // users.parquet: the sample file shipped with Spark under examples/src/main/resources
    val df = spark.read.format("parquet").load("users.parquet")
    df.printSchema()
    val newDF = df.select("name", "favorite_color")  // columns from the sample file's schema
    newDF.write.format("json").save("file:///tmp/parquet_as_json")  // placeholder output path
    spark.stop()
  }
}
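The built-in formats also have shorthand read/write methods; the following is equivalent to the format/load calls above:

val df = spark.read.parquet("users.parquet")
df.select("name", "favorite_color").write.json("file:///tmp/parquet_as_json")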
Parquet is the default data source format (spark.sql.sources.default), so the format can be omitted for Parquet data.
In the spark-sql CLI:
CREATE TEMPORARY VIEW parquetTable
USING parquet
OPTIONS (
  path "..."
);
SELECT * FROM parquetTable;
Equivalent in code (when the path is passed as an option, load() takes no argument): spark.read.format("parquet").option("path", "...").load()
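The same DDL can also be issued from code via spark.sql; a sketch assuming users.parquet sits in the working directory:

spark.sql(
  """CREATE TEMPORARY VIEW parquetTable
    |USING parquet
    |OPTIONS (path "users.parquet")""".stripMargin)
spark.sql("SELECT * FROM parquetTable").show()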
Working with Hive
Writing a DataFrame to a Hive table
df.write.saveAsTable(tableName)  (the method is saveAsTable, not writeAsTable)
spark.sql.shuffle.partitions sets the number of partitions used when shuffling for joins and aggregations; the default is 200.
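A sketch combining the two, assuming a Hive-enabled Spark build and hypothetical table names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveApp")
  .enableHiveSupport()  // requires Spark built with Hive support
  .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "10")  // lower the default 200 for small datasets
val df = spark.table("source_table")   // hypothetical Hive source table
df.write.saveAsTable("result_table")   // hypothetical target Hive table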