hadoop - Does Impala make effective use of buckets in a Hive bucketed table?


I'm in the process of improving the performance of a table.

Say this is the table:

create table user_info_bucketed(user_id bigint, firstname string, lastname string) comment 'a bucketed copy of user_info' partitioned by (year int, month int) stored as parquet;

I'm planning to apply bucketing on user_id, since the queries usually involve user_id as a clause.

Like this:

create table user_info_bucketed(user_id bigint, firstname string, lastname string) comment 'a bucketed copy of user_info' partitioned by (year int, month int) clustered by (user_id) into 256 buckets stored as parquet;

This table will be created and loaded with Hive, and queried from Impala.
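The Hive-load, Impala-query workflow could look roughly like the sketch below. It assumes a source table named user_info (suggested by the table comment) that carries year and month columns. The user_id value in the final query is made up for illustration.

```sql
-- Hive session: populate the bucketed table.
-- Assumption: user_info exists and has the same columns plus year and month.
SET hive.enforce.bucketing = true;                 -- needed before Hive 2.0; later versions always enforce bucketing
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;  -- allow all partition columns to be dynamic

INSERT OVERWRITE TABLE user_info_bucketed PARTITION (year, month)
SELECT user_id, firstname, lastname, year, month
FROM user_info;

-- Impala session: pick up the table that was created through Hive, then query it.
INVALIDATE METADATA user_info_bucketed;

SELECT firstname, lastname
FROM user_info_bucketed
WHERE user_id = 12345;   -- hypothetical user_id
```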

What I wanted to know is whether bucketing this table will improve the performance of Impala queries. I'm not sure how Impala works with buckets.

I tried creating a bucketed and a non-bucketed version of the table through Hive (the table is 6 GB in size).

I tried benchmarking the results from both. There is slight or no difference.

I also tried analyzing the profiles of both queries, which didn't show any difference.
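A comparison like this can be reproduced in impala-shell with something along the following lines. This is only a sketch: the filter values are hypothetical, and the same statements would be run against each version of the table. In the plan, the scan node reports the partitions and files it reads. If bucketing is not used, a filter on user_id should leave those counts unchanged.

```sql
-- impala-shell session; filter values are hypothetical.
SET EXPLAIN_LEVEL=2;   -- ask for a more detailed plan

EXPLAIN
SELECT firstname, lastname
FROM user_info_bucketed
WHERE year = 2015 AND month = 6 AND user_id = 12345;

-- Run the query itself, then print the runtime profile of that last statement.
SELECT firstname, lastname
FROM user_info_bucketed
WHERE year = 2015 AND month = 6 AND user_id = 12345;

PROFILE;   -- impala-shell command, not an SQL statement
```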

So the answer is: Impala doesn't know whether a table is bucketed or not, so it doesn't take advantage of it (IMPALA-1990). The only way it becomes aware of the partitions and files in the table is through COMPUTE STATS.
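For reference, gathering and inspecting statistics in Impala looks roughly like this. The statements are standard Impala SQL, and the table name is the one from the question.

```sql
-- Impala session: collect table and column statistics for the planner.
COMPUTE STATS user_info_bucketed;

-- Inspect what Impala now knows: per-partition row counts, file counts and sizes.
SHOW TABLE STATS user_info_bucketed;

-- Per-column statistics such as the number of distinct values.
SHOW COLUMN STATS user_info_bucketed;
```

None of this output mentions buckets, which is consistent with Impala treating the bucketed table like any other partitioned Parquet table.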

By the way, bucketing tables that are used by Impala is not wasteful. If we have to limit the number of small files in the table, we can bucket it and switch on Hive transactions (available from Hive 0.13.0).

