hadoop - Does Impala make effective use of buckets in a Hive bucketed table?
I'm in the process of improving the performance of a table.
Say this is the table:
create table user_info_bucketed(user_id bigint, firstname string, lastname string) comment 'a bucketed copy of user_info' partitioned by(year int, month int) stored as parquet;
I'm planning to apply bucketing by user_id, since the queries usually involve user_id in a clause.
Like this:
create table user_info_bucketed(user_id bigint, firstname string, lastname string) comment 'a bucketed copy of user_info' partitioned by(year int, month int) clustered by(user_id) into 256 buckets stored as parquet;
This table will be created and loaded with Hive, and queried from Impala.
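For reference, loading the bucketed table from Hive might look like the sketch below. It assumes a hypothetical source table named user_info that has the same columns plus year and month, and a Hive version older than 2.0, where bucketing has to be enforced explicitly (later versions enforce it by default).

```sql
-- Assumption: Hive < 2.0, where inserts only honor the bucket layout
-- when this property is enabled.
SET hive.enforce.bucketing = true;

-- Let Hive derive both partition values from the selected rows.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Assumption: user_info is a hypothetical unbucketed source table.
-- The partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (year, month)
SELECT user_id, firstname, lastname, year, month
FROM user_info;
```

With 256 buckets, each partition written this way should end up with up to 256 files, one per bucket.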
What I wanted to know is whether bucketing this table will improve the performance of Impala queries. I'm not sure how Impala works with buckets.
I tried creating a bucketed and a non-bucketed table through Hive (the table is 6 GB in size).
I tried benchmarking the results from both. There is slight or no difference.
I also tried analyzing the profiles of both queries, and that didn't show any difference either.
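A comparison of this kind could be run from impala-shell roughly as follows. This is only a sketch: the user_id value is made up, and the same statements would be repeated against the non-bucketed copy of the table.

```sql
-- Show the plan Impala chooses before running the query.
EXPLAIN SELECT firstname, lastname
FROM user_info_bucketed
WHERE user_id = 12345;  -- hypothetical example value

-- Run the query itself.
SELECT firstname, lastname
FROM user_info_bucketed
WHERE user_id = 12345;

-- impala-shell commands that report on the most recent query:
SUMMARY;   -- per-operator timings and row counts
PROFILE;   -- full runtime profile
```

If Impala used the buckets, the scan for the bucketed table would be expected to read noticeably fewer files or bytes than the scan for the non-bucketed one.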
So the answer is: Impala doesn't know whether a table is bucketed or not, so it doesn't take advantage of it (IMPALA-1990). The only way it becomes aware of the partitions and files in the table is with COMPUTE STATS.
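A minimal sketch of gathering and inspecting those statistics from Impala, assuming the table was created in Hive under the name used in the question:

```sql
-- Make the Hive-created table visible to Impala first.
INVALIDATE METADATA user_info_bucketed;

-- Gather table, partition, and column statistics.
COMPUTE STATS user_info_bucketed;

-- Inspect what Impala now knows: rows, file counts, and sizes per partition.
SHOW TABLE STATS user_info_bucketed;

-- Per-column statistics such as the number of distinct user_id values.
SHOW COLUMN STATS user_info_bucketed;
```

The output lists partitions and files, but nothing in it describes the bucket layout.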
By the way, bucketing tables that are used by Impala is not wasteful. If you have to limit the number of small files in a table, you can bucket it and switch on Hive transactions (available from Hive 0.13.0).