1 © Hortonworks Inc. 2011–2018. All rights reserved
High throughput data replication over
RAFT
Mukul Kumar Singh, Staff Software Engineer, Hortonworks
Lokesh Jain, Software Engineer, Hortonworks
• msingh@apache.org
• Staff Software Engineer, Hortonworks
• ASF
• Committer for Apache Hadoop
• Committer for Apache Ratis
• MS from Carnegie Mellon University,
Pittsburgh
• ljain@apache.org
• Software Engineer, Hortonworks
• ASF
• Committer for Apache Ratis
• BE(Hons) Computer Science & M.Sc.
(Hons) Mathematics from BITS Pilani
Mukul Kumar Singh Lokesh Jain
Speakers
Raft
• Raft is a consensus algorithm
• Works as long as a majority of the nodes in the cluster are alive
• i.e. it can tolerate the loss of a minority of the nodes
• “In Search of an Understandable Consensus Algorithm”
• by Diego Ongaro and John Ousterhout
• USENIX ATC’14, https://raft.github.io
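The majority requirement above can be made concrete with a small sketch (plain illustrative Java, not Ratis code): a cluster of N servers needs floor(N/2) + 1 votes to make progress, so it survives the failure of floor((N-1)/2) servers.

```java
// Sketch of Raft's majority arithmetic (illustrative, not Ratis code).
class RaftQuorum {
    // Minimum number of servers that must agree for the cluster to make progress.
    static int majoritySize(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Maximum number of simultaneous server failures the cluster can tolerate.
    static int tolerableFailures(int clusterSize) {
        return (clusterSize - 1) / 2;
    }
}
```

For example, a 3-node cluster needs 2 votes and tolerates 1 failure; a 5-node cluster needs 3 votes and tolerates 2 failures.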
Raft Library
• Our Motivations
• Use Raft in Ozone
• “In Search of a Usable Raft Library”
• A long list of Raft implementations is available
• None of them is a general library ready to be consumed by other projects.
• Most of them are tied to another project or a part of another project.
• We need a Raft library!
Raft Basics
• Leader Election
• Servers start as Followers
• A Follower randomly times out to become a Candidate and start a leader election
• Candidate sends requestVote to other servers
• It becomes the leader once it gets a majority of the votes.
• Append Entries
• Clients send requests to the Leader
• Leader forwards the requests to the Followers
• Leader sends appendEntries to Followers
• When there are no client requests, the Leader also sends empty appendEntries
(heartbeats) to Followers to maintain leadership
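The randomized timeout that drives leader election can be sketched as follows (a hypothetical helper, not Ratis's actual implementation). Each follower picks a timeout uniformly in [min, max), so two followers rarely become candidates at the same instant, which avoids split votes.

```java
import java.util.Random;

// Illustrative sketch of a randomized election timeout (not Ratis code).
class ElectionTimeout {
    // Pick a timeout uniformly at random in [minMillis, maxMillis).
    static long next(Random rng, long minMillis, long maxMillis) {
        return minMillis + (long) (rng.nextDouble() * (maxMillis - minMillis));
    }
}
```

A follower that reaches this timeout without hearing a heartbeat from the leader becomes a candidate and requests votes.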
Apache Ratis
Data Intensive Applications
• In Raft,
• All transactions and their data are written to the log
• This is not suitable for data intensive applications
• In Ratis
• An application can choose not to write all of its data to the log
• State machine data and log data can be managed separately
• See the FileStore example in ratis-example
• See the ContainerStateMachine as an implementation in Apache Hadoop Ozone.
Ratis: Standard Raft Features
• Leader Election + Log Replication
• Automatically elect a leader among the servers in a Raft group
• Randomized timeout for avoiding split votes
• Log is replicated in the Raft group
• Membership Changes
• Members in a Raft group can be reconfigured at runtime
• Replication factor can be changed at runtime
• Log Compaction
• Snapshot is taken periodically
• Send snapshot instead of a long log history.
Ratis: Pluggability
• Pluggable state machine
• Application must define its state machine
• Example: a key-value map
• Pluggable RPC
• Users may provide their own RPC implementation
• Default implementations: gRPC, Netty, Hadoop RPC
• gRPC allows implementation of a native client
• Pluggable Raft log
• Users may provide their own log implementation
• The default implementation stores log in local files
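The key-value example above can be sketched with a minimal, self-contained state machine (the interface and class names here are hypothetical simplifications, much simpler than the real org.apache.ratis.statemachine API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a pluggable state machine (hypothetical interface,
// far simpler than Ratis's actual StateMachine API).
interface SimpleStateMachine {
    String apply(String command); // invoked once a log entry is committed
}

// A key-value map as the application state, as in the slide's example.
class KeyValueStateMachine implements SimpleStateMachine {
    private final Map<String, String> state = new HashMap<>();

    @Override
    public String apply(String command) {
        // Commands: "put <key> <value>" or "get <key>".
        String[] parts = command.split(" ", 3);
        switch (parts[0]) {
            case "put":
                state.put(parts[1], parts[2]);
                return "ok";
            case "get":
                return state.getOrDefault(parts[1], "<null>");
            default:
                return "unknown command";
        }
    }
}
```

Because every replica applies the same committed log entries in the same order, every replica's map converges to the same contents.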
Ratis: Asynchronous/Synchronous APIs
• Using the gRPC bi-directional stream API
• Netty and Hadoop RPC can support async, but this is not yet implemented
• Server-to-server
• Asynchronous append entries
• Client-to-server
• Asynchronous client requests
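The asynchronous client pattern can be sketched with CompletableFuture: requests are submitted without blocking, and replies are collected later. This is plain java.util.concurrent code shown only to illustrate the call pattern, not the Ratis client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of pipelined async requests (illustrative, not the Ratis client).
class AsyncPipeline {
    // Pretend "send" replicates a message and eventually completes with an ack.
    static CompletableFuture<String> sendAsync(String msg) {
        return CompletableFuture.supplyAsync(() -> "ack:" + msg);
    }

    // Submit many requests without waiting for each reply in turn.
    static List<String> sendAll(List<String> msgs) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String m : msgs) {
            futures.add(sendAsync(m)); // no blocking between submissions
        }
        List<String> acks = new ArrayList<>();
        for (CompletableFuture<String> f : futures) {
            acks.add(f.join()); // collect replies only at the end
        }
        return acks;
    }
}
```

Keeping many requests in flight is what lets the pipeline saturate the network and disks instead of paying one round trip per request.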
General Ratis Use Cases
• You want to:
• (1) replicate the server log/states to multiple machines
• The replication number/cluster membership can be changed at runtime
• It can tolerate server failures.
• or
• (2) have an HA (highly available) service
• When a server fails, another server will automatically take over.
• Clients automatically failover to the new server.
• Apache Ratis is for you!
API
• Client Side APIs
• Send/SendReadOnly
• SendReadOnly is for read-only commands, which do not change the state of the Raft server.
• Async versions also available (sendAsync, sendReadOnlyAsync)
• Server Side APIs
• applyTransaction
• Applies the transaction to the statemachine
• writeStateMachineData
• An optimization to avoid double write penalty for data intensive
applications.
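The writeStateMachineData optimization can be illustrated with a toy write path (all names here are hypothetical, not the Ratis API): bulk data goes straight to state-machine storage, while only a small metadata record enters the replicated log, so the data itself is written to disk once rather than twice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the writeStateMachineData split (not Ratis code):
// bulk data bypasses the Raft log; only metadata is logged.
class SplitWritePath {
    final List<String> raftLog = new ArrayList<>();        // small entries only
    final Map<String, byte[]> dataStore = new HashMap<>(); // bulk data

    void write(String blockId, byte[] data) {
        dataStore.put(blockId, data);                            // "writeStateMachineData"
        raftLog.add("block=" + blockId + " len=" + data.length); // metadata only
    }
}
```
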
High Throughput Data Pipeline
Building a high performance data pipeline
• Requirements
• High data write throughput
• Parallelism/async interface
• Large number of transactions per second
• Configurable parameters
• Support for security
Building a high performance data pipeline
• Optimizations
• Separate user data from the raft log
– Avoids double write penalty for data
• Efficient batching of raft log entries
– High write performance during local disk write
– Efficient network replication
• Async processing of operations
– Client ops
– Append entries to followers
– StateMachine implementation
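The batching optimization above can be sketched as a size-bounded buffer that flushes many log entries together, so one disk sync and one network send cover a whole batch (a hypothetical helper, not Ratis's log implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of batching raft log entries before a flush (illustrative only).
class LogBatcher {
    private final int maxBatchBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private int pendingBytes = 0;
    int flushCount = 0; // how many disk syncs / network sends happened

    LogBatcher(int maxBatchBytes) {
        this.maxBatchBytes = maxBatchBytes;
    }

    void append(byte[] entry) {
        pending.add(entry);
        pendingBytes += entry.length;
        if (pendingBytes >= maxBatchBytes) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) return;
        flushCount++; // one sync covers all pending entries
        pending.clear();
        pendingBytes = 0;
    }
}
```

With a 100-byte batch limit, ten 30-byte entries cost only three flushes instead of ten, which is where the local disk and network efficiency comes from.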
FileStoreStateMachine
• Located at org.apache.ratis.examples.filestore
• Simple state machine implementation to write bytes to a file
• Separates file data from raft log.
• File data written is persisted to disk
• Client generates random bytes of the specified file size
• Client uses writeAsync
Performance Benchmarking
• Setup: 3 nodes, each with
• Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
• 256GiB System memory
• 10 Gigabit Network Connection
• 4 HGST (HUS726060AL4210) HDD of 5.5TB each
Performance – Write Throughput
[Chart: write throughput for 1 GB of data; data throughput in MB/s (0–300) plotted against file size in KB (100 to 128,000)]
Performance – Transactions per second
[Chart: number of transactions per second (0–12,000) plotted against file size in bytes (10 to 100,000), writing 100,000 files]
Ozone
[Diagram: the Ozone client gets a block from the Ozone Master, gets the container location (a list of DNs) from the Storage Container Manager, and writes data to the datanodes, which replicate it over Ratis]
Terminology
• OM – Ozone Master
• Namespace manager inside Ozone; manages the key name to block ID mapping
• Also manages volume, bucket, and key namespaces
• SCM – Storage Container Manager
• Block manager; manages cluster membership, container location
information, and containers
• Datanode
• Stores user data; a Ratis server is spawned inside the datanode
• Ozone datanodes persist containers; blocks are allocated out of containers.
Storage Container
• Hadoop Distributed Data Storage (HDDS) introduces Storage Containers
• Provides generic data storage functionality.
• Configurable Size (2GB - 16GB+)
• Unit of management and replication in SCM.
• Blocks are allocated from a container
• BID = CID + LocalID
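The BID = CID + LocalID composition can be sketched as a simple pair of IDs; the string rendering below is only for illustration, and the actual Ozone BlockID encoding may differ:

```java
// Illustrative sketch of composing a block ID (BID) from a container ID
// and a local ID within that container (actual Ozone encoding may differ).
class BlockId {
    final long containerId; // CID: which storage container holds the block
    final long localId;     // LocalID: the block's ID within that container

    BlockId(long containerId, long localId) {
        this.containerId = containerId;
        this.localId = localId;
    }

    @Override
    public String toString() {
        return containerId + ":" + localId; // BID = CID + LocalID
    }
}
```

Because blocks are addressed through their container, SCM only has to track container locations, not the location of every individual block.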
Use of Ratis in Ozone
• Replicating data in open containers
• Replication of user data using Ratis
• Support HA in Storage Container Manager
• Work in Progress
• Support HA in Ozone Manager
• Work in Progress
Ozone Ratis Commands
• The Ozone data pipeline involves interaction between the client and the
datanode.
• Commands are marked as readonly if they do not change the state
of the datanode.
• GetKey, ReadChunk, ReadContainer, or
• WriteChunk, PutKey, CreateContainer, etc.
• The Ozone client sends container commands to the leader datanode
using the Ratis protocol (gRPC as the underlying RPC)
Command Replication on Containers
[Diagram: the client sends a Write Chunk command to the Leader datanode; the Leader's ContainerStateMachine (CSM) replicates it to the two Followers and returns a response]
Open Container Replication using Ratis
• Ratis is used for replication of data being written to Ozone Datanodes.
• Ratis replicates container commands on open containers.
• Ozone Datanode provides its own state machine implementation
• This implementation handles various datanode commands (write chunk, put key, create
container)
• Performance optimizations
• To avoid writing the data to disk twice, the state machine implementation separates user
data from block/chunk metadata.
• Multiple chunks are written in parallel.
• Append requests from the Leader to Followers are made async, allowing multiple appends in
parallel.
• Raft-journal in separate disk – fast contiguous writes without seeking
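The parallel chunk-write optimization can be sketched with plain java.util.concurrent: the chunks of a block are written concurrently and the caller waits for all of them. This is an illustrative sketch, not the ContainerStateMachine itself, and the byte counting stands in for real disk writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of writing multiple chunks in parallel (illustrative only).
class ParallelChunkWriter {
    static int writeChunks(List<byte[]> chunks) {
        AtomicInteger bytesWritten = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (byte[] chunk : chunks) {
            // A real implementation would write the chunk to disk here.
            futures.add(CompletableFuture.runAsync(
                () -> bytesWritten.addAndGet(chunk.length)));
        }
        futures.forEach(CompletableFuture::join); // wait for all chunk writes
        return bytesWritten.get();
    }
}
```
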
Ozone Data Write Performance
• The performance numbers were taken for different key sizes, with 10 clients writing
in parallel.
• Measures the end-to-end throughput numbers
• Key allocation in OM and block allocation in SCM also account for the total throughput.
• Ozone Client
• Uses sync apis to write data to the datanodes
• ContainerStateMachine implementation
• Parallelize write chunk operations
Key Size:           10 MB   100 MB
Throughput (MB/s):  81.3    110.3
Summary
• Ratis is a Java-based implementation of the Raft protocol
• It essentially constitutes a replicated state machine.
• Suitable for data intensive applications.
• Features
• Sync/Async client apis
• Pluggable StateMachine
• Pluggable Raft Log Implementation
• Performance
• Write throughput: 250–300 MB/s
• IOPS: 10,000 txns/s
Contributors
• A big thanks to all the contributors for Apache Ratis, Apache Hadoop
and Ozone
• Animesh Trivedi, Anu Engineer, Arpit Agarwal, Brent,
• Chen Liang, Chris Nauroth, Devaraj Das, Enis Soztutar,
• garvit, Hanisha Koneru, Hugo Louro, Jakob Homan,
• Jian He, Jing Chen, Jing Zhao, Jitendra Pandey, Junping Du,
• kaiyangzhang, Karl Heinz Marbaise, Li Lu, Lokesh Jain,
• Marton Elek, Mayank Bansal, Mingliang Liu,
• Mukul Kumar Singh, Sen Zhang, Shashikant Banerjee, Sriharsha
Chintalapani, Tsz Wo Nicholas Sze,
• Uma Maheswara Rao G, Venkat Ranganathan, Wangda Tan,
• Weiqing Yang, Will Xu, Xiaobing Zhou, Xiaoyu Yao, Yubo Xu,
• yue liu, Zhiyuan Yang
Apache Ratis & Apache Hadoop Ozone
• Contributions are welcome!
• Ratis
• http://ratis.incubator.apache.org
• dev@ratis.incubator.apache.org
• Ozone
• http://hadoop.apache.org
• hdfs-dev@hadoop.apache.org
Questions?
Thank you
