ATTENTION DATA SCIENCE ASPIRANTS: Click the link below to download a proven 90-day roadmap to become a Data Scientist in 90 days: https://www.bigdataelearning.com/the-...
Apache Spark Courses: https://www.bigdataelearning.com/courses
Official Website: https://bigdataelearning.com

In this video, let's understand why we need to partition data in Spark by looking at an example. Then we will also see what hash partitioning and range partitioning are.

Say an RDD called "deptRdd" with (UserId, DeptName) pairs has 10 elements, and another RDD called "salaryRdd" with (UserId, Salary) pairs also has 10 elements. Now we need to join these two RDDs using Spark's join operation to get all three values: UserId, DeptName, and Salary.

As you know, by default the dataset is split based on the block size, and the different blocks are stored on different nodes of the cluster. When Spark's join operation runs, it has no way of knowing which node holds which key, so it ends up checking multiple nodes of the cluster. This increases network shuffling and thereby hurts performance.

Now, if we partition the datasets into 3 partitions by taking the key modulo 3, the records are split evenly and stored in the appropriate partitions on different nodes of the cluster. When the join operation runs, Spark now knows which partition holds the values for which key. This avoids a lot of unnecessary network shuffling and improves performance.

Hash Partitioning

When we hash the key and then take the hash modulo the number of partitions, the process is called hash partitioning. This can be done by applying the partitionBy transformation on deptRdd and passing a HashPartitioner object to it, as shown below.

deptRdd.partitionBy(new HashPartitioner(3)).persist()

Here, the '3' passed to the HashPartitioner constructor represents the number of partitions.

Range Partitioning

When the keys can be ordered, we can partition the dataset based on ranges of keys; this is called range partitioning.
Here, keys belonging to a specific range will land in a specific partition, and hence on a specific node. This can be done by passing a RangePartitioner, which takes the number of partitions and the RDD to sample for the range boundaries:

deptRdd.partitionBy(new RangePartitioner(3, deptRdd)).persist()
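Putting the pieces together, a minimal sketch of the example might look like the following. The sample records and the local master setting are assumptions for illustration (the ten-element datasets described above are abbreviated to three pairs each); the partitioner calls themselves are the ones shown in the text.

```scala
// Minimal sketch, assuming a local Spark installation; sample data is made up.
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner, RangePartitioner}

object PartitionJoinDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-demo").setMaster("local[*]"))

    // (UserId, DeptName) and (UserId, Salary) pairs, as in the example.
    val deptRdd   = sc.parallelize(Seq((1, "Sales"), (2, "HR"), (3, "Eng")))
    val salaryRdd = sc.parallelize(Seq((1, 50000), (2, 45000), (3, 70000)))

    // Hash-partition deptRdd into 3 partitions and cache the result so the
    // partitioning is not recomputed (and re-shuffled) on every use.
    val hashedDept = deptRdd.partitionBy(new HashPartitioner(3)).persist()

    // Because hashedDept carries a known partitioner, the join only needs to
    // shuffle salaryRdd's records to the matching partitions.
    val joined = hashedDept.join(salaryRdd) // (UserId, (DeptName, Salary))
    joined.collect().foreach(println)

    // Range partitioning instead: keys in a given range land in a given
    // partition. RangePartitioner samples the RDD to pick range boundaries.
    val rangedDept = deptRdd.partitionBy(new RangePartitioner(3, deptRdd)).persist()

    sc.stop()
  }
}
```

Note that persist() matters here: without it, the partitioned RDD would be re-shuffled each time it is reused, defeating the purpose of pre-partitioning.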