Apache Spark SQL - RDD In-Memory Data Skew


I'm trying to cache a Hive table in memory using CACHE TABLE tablename;

After this command the table gets cached, but I noticed skew in the way the RDD is partitioned in memory.

Here's what I see in the "Storage" tab on the Application Master:

Block Name   Storage Level                       Size in Memory   Size on Disk   Executor
rdd_71_1     Memory Deserialized 1x Replicated   1264.7 MB        0.0 B          node4:38759
rdd_71_10    Memory Deserialized 1x Replicated   11.6 MB          0.0 B          node1:58115
rdd_71_11    Memory Deserialized 1x Replicated   25.7 MB          0.0 B          node1:53968
rdd_71_2     Memory Deserialized 1x Replicated   72.6 MB          0.0 B          node4:54133
rdd_71_4     Memory Deserialized 1x Replicated   1260.9 MB        0.0 B          node2:33179
rdd_71_5     Memory Deserialized 1x Replicated   56.8 MB          0.0 B          node2:54222
rdd_71_7     Memory Deserialized 1x Replicated   54.5 MB          0.0 B          node4:34149
rdd_71_8     Memory Deserialized 1x Replicated   1277.8 MB        0.0 B          node1:43572
rdd_71_9     Memory Deserialized 1x Replicated   1255.8 MB        0.0 B          node1:58518

Notice that some partitions are in the range of 11 MB to 72 MB, while the other partitions are around 1200 MB each.

Even when I'm not caching the table and am processing it from disk, I see some tasks complete much earlier than others, which further confirms my guess about skew.

What's going on here? How can I avoid this data skew?

PS: the table is stored in ORC format.

I don't know why the data is skewed when it is read directly from disk. However, I find it useful to repartition the data to balance the size of the partitions and avoid being held up by a single long-lasting task. I recommend reading the last part of https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html (the "Data Partitioning (Advanced)" section), which offers some nice tips :)
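As a rough illustration of the repartitioning suggestion, here is a minimal Python sketch. It takes the cached partition sizes from the Storage tab listing in the question, measures how uneven they are, and estimates a partition count to aim for. The helper names (skew_ratio, target_partitions, rebalance) and the 128 MB target are assumptions made for this example, not anything prescribed by Spark; the only Spark call used is DataFrame.repartition, and df is assumed to be a DataFrame loaded from the Hive table.

```python
import math

# Cached partition sizes in MB, copied from the Storage tab listing in the question.
sizes_mb = [1264.7, 11.6, 25.7, 72.6, 1260.9, 56.8, 54.5, 1277.8, 1255.8]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


def target_partitions(sizes, target_mb=128.0):
    """Partitions needed so each holds roughly target_mb (128 MB is an assumed target)."""
    return max(1, math.ceil(sum(sizes) / target_mb))


def rebalance(df, num_partitions):
    """Redistribute rows across num_partitions; df is assumed to be a Spark DataFrame."""
    return df.repartition(num_partitions)


print("skew ratio:", round(skew_ratio(sizes_mb), 2))
print("suggested partitions:", target_partitions(sizes_mb))
```

With a live session this might be used as rebalance(spark.table("tablename"), target_partitions(sizes_mb)).cache(), assuming spark is an existing SparkSession. Keep in mind that repartition performs a full shuffle, so it trades a one-time cost for more evenly sized tasks afterwards, and that the sizes above are deserialized in-memory sizes, so the right partition count for your cluster may differ.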

