At a high level, Hive's main purpose is to query and analyze large datasets stored in HDFS.
As we know, Hadoop uses MapReduce to process data. With MapReduce alone, users had to write long, complex Java code, and not everyone was well-versed in Java or other programming languages. Most users, however, were comfortable writing queries in SQL (Structured Query Language) and wanted a language similar to it. Enter HiveQL. The idea was to incorporate the concepts of tables and columns, just like SQL.
Hive is a data warehouse system used to query and analyze large datasets stored in HDFS.
Hive uses a query language called HiveQL, which is similar to SQL.
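To see how close the two languages are, here is a short HiveQL query that would look nearly identical in standard SQL. The table and column names (employees, department, hire_year) are illustrative, not from any real schema:

```sql
-- Count employees per department, for staff hired from 2015 onward.
-- Table and column names are hypothetical examples.
SELECT department, COUNT(*) AS num_employees
FROM employees
WHERE hire_year >= 2015
GROUP BY department;
```

Anyone familiar with SQL can read and write this without learning MapReduce; Hive translates the query into jobs on the cluster behind the scenes.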
Thrift is a software framework for scalable, cross-language service development. Hive is built on Thrift, so the Hive client layer supports applications written in different languages, such as Thrift, JDBC, and ODBC clients, to perform queries against Hive.
Fig: Architecture of Hive
The metastore is a repository for Hive metadata. It stores the metadata for Hive tables, so you can think of it as your schema. By default it is backed by the embedded Apache Derby DB.
The data flows in the following sequence:

1. We execute a query, which goes to the driver.
2. The driver asks the compiler for a plan, which refers to the query execution.
3. The compiler requests the metadata from the metastore.
4. The metastore responds with the metadata.
5. The compiler gathers this information and sends the plan back to the driver.
6. The driver sends the execution plan to the execution engine.
7. The execution engine acts as a bridge between Hive and Hadoop to process the query.
8. In addition, the execution engine communicates bidirectionally with the metastore to perform operations such as creating and dropping tables.
9. Finally, there is bidirectional communication to fetch results and send them back to the client.
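You can inspect the plan the compiler produces in this flow by prefixing any query with EXPLAIN. The table and column names below are hypothetical:

```sql
-- EXPLAIN prints the stage graph of the execution plan
-- instead of running the query.
EXPLAIN
SELECT department, COUNT(*)
FROM employees
GROUP BY department;
```

The output lists the plan's stages, which is a convenient way to confirm how the driver, compiler, and execution engine will process your query before it runs.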
Partitions - Tables are organized into partitions, grouping similar types of data based on a partition key (for example, a date or country column)
Buckets - Data present in partitions can be further divided into buckets for efficient querying
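The two concepts combine in a single CREATE TABLE statement. This is a minimal sketch; the table name, columns, bucket count, and file format are all illustrative choices:

```sql
-- Partition rows by country; within each partition,
-- hash rows into 4 buckets by user_id for efficient querying.
CREATE TABLE users (
  user_id BIGINT,
  name    STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;
```

Each partition becomes its own directory in HDFS, and each bucket becomes a file within it, so queries that filter on the partition key or join on the bucketing column can skip most of the data.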