Apache Spark SQL - RDD In-Memory Data Skew
I'm trying to cache a Hive table in memory using CACHE TABLE tablename;
After this command, the table gets cached, but I noticed a skew in the way the RDD is partitioned in memory.
Here's what I see in the "Storage" tab on the application master:
Block Name   Storage Level                       Size in Memory   Size on Disk   Executors
rdd_71_1     Memory Deserialized 1x Replicated   1264.7 MB        0.0 B          node4:38759
rdd_71_10    Memory Deserialized 1x Replicated   11.6 MB          0.0 B          node1:58115
rdd_71_11    Memory Deserialized 1x Replicated   25.7 MB          0.0 B          node1:53968
rdd_71_2     Memory Deserialized 1x Replicated   72.6 MB          0.0 B          node4:54133
rdd_71_4     Memory Deserialized 1x Replicated   1260.9 MB        0.0 B          node2:33179
rdd_71_5     Memory Deserialized 1x Replicated   56.8 MB          0.0 B          node2:54222
rdd_71_7     Memory Deserialized 1x Replicated   54.5 MB          0.0 B          node4:34149
rdd_71_8     Memory Deserialized 1x Replicated   1277.8 MB        0.0 B          node1:43572
rdd_71_9     Memory Deserialized 1x Replicated   1255.8 MB        0.0 B          node1:58518
Notice that some partitions are in the range of 11 MB to 72 MB, while the other partitions are in the range of ~1200 MB.
Even when I'm not caching the table and am processing it from disk, I see some tasks complete earlier than others, which further confirms my guess about skewness.
What's going on here? How can I avoid this data skew?
PS: The table is stored in ORC format.
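
To put a number on the imbalance, here is a minimal sketch in plain Python that computes the total cached size and the ratio of the largest block to the mean block size. The sizes are copied from the Storage tab listing; the helper name skew_ratio is made up for this illustration and is not a Spark API.

```python
# Sizes in MB of the cached blocks, copied from the Storage tab listing.
sizes_mb = [1264.7, 11.6, 25.7, 72.6, 1260.9, 56.8, 54.5, 1277.8, 1255.8]


def skew_ratio(sizes):
    """Return the largest partition size divided by the mean partition size.

    A value close to 1.0 means the partitions are evenly sized; larger
    values mean a few partitions hold most of the data.
    (Hypothetical helper, not part of Spark.)
    """
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


total_mb = sum(sizes_mb)
ratio = skew_ratio(sizes_mb)
print(f"total: {total_mb:.1f} MB, max/mean: {ratio:.2f}")
```

A ratio well above 1 here matches what the listing suggests: four of the nine listed blocks hold almost all of the cached data.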
I don't know why the data is skewed when it is read directly from disk. However, I find it useful to repartition the data to balance the size of the partitions and avoid being held up by a single long-lasting task. I recommend reading the last part of https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html (the "Data Partitioning (Advanced)" section), which offers nice tips :)
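
As a rough illustration of the repartitioning idea, here is a minimal PySpark-style sketch. It assumes a Spark 2.x or later SparkSession is passed in as spark (the question may well be on an older HiveContext setup, where the calls differ). The function names, the "_balanced" view suffix, and the 128 MB target partition size are all assumptions made for this example, not recommendations from the question or from Spark.

```python
import math


def target_partition_count(total_mb, target_partition_mb=128.0):
    """Pick a partition count so each partition holds roughly target_partition_mb.

    The 128 MB default is only an assumed rule of thumb; tune it for your cluster.
    """
    return max(1, math.ceil(total_mb / target_partition_mb))


def cache_balanced(spark, table_name, num_partitions):
    """Repartition a table evenly, then cache the repartitioned copy.

    `spark` is assumed to be a SparkSession; the "_balanced" view name
    is hypothetical and chosen only for this sketch.
    """
    view_name = table_name + "_balanced"
    # repartition(n) does a full shuffle and spreads rows round-robin
    # across n partitions, which evens out partition sizes.
    df = spark.table(table_name).repartition(num_partitions)
    df.createOrReplaceTempView(view_name)
    # Cache the balanced view rather than the original skewed table.
    spark.sql("CACHE TABLE " + view_name)
    return view_name


# The nine blocks listed in the Storage tab add up to about 5280.4 MB.
num_partitions = target_partition_count(5280.4)
```

With a live session this would be called as cache_balanced(spark, "tablename", num_partitions), and queries would then read from the cached tablename_balanced view. The shuffle costs time up front, so it pays off mainly when the cached data is reused many times.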