Recently I had the opportunity to work with Tachyon, an in-memory file system. I used Tachyon as a caching layer for ETL output that is consumed downstream for two purposes:
- Low-latency ad hoc queries using Spark SQL
- Analytics and algorithms downstream
The writeup below is a consolidation of what I learnt about Tachyon.
Tachyon – In Memory Data Exchange Layer
Tachyon is an in-memory distributed file system backed by HDFS or another under-layer file system. It has resilience built in through lineage and survives Spark JVM restarts. It allows fine-tuning of performance and can act as a cache for warehouse tables, which can be faster than an in-process cache because the data lives outside the JVM heap and so avoids garbage collection pauses. It can provide efficient in-memory columnar storage with compression. It is written in Java and currently works on Linux and Mac.
In environments with high amounts of memory or multiple applications, Spark's experimental OFF_HEAP storage mode, which keeps cached data in Tachyon, has several advantages:
- It allows multiple apps / executors to share the same pool of memory
- It significantly reduces garbage collection costs
- Cached data is not lost if individual executors crash
- Cached data also survives if the Spark Context itself crashes
Tachyon is Hadoop compatible: it implements the Hadoop FileSystem interface, so existing Spark and Hadoop MapReduce programs can run on top of it without any code change.
Pluggable under-layer file system: To provide fault tolerance, Tachyon checkpoints in-memory data to the under-layer file system. It has a generic interface that makes plugging in different under-layer file systems easy. It currently supports HDFS, S3, GlusterFS, and single-node local file systems.
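To make the "generic interface" idea concrete, here is a minimal sketch of what a pluggable backend could look like. It is not Tachyon's actual API: the names UnderFS, LocalUnderFS, DictUnderFS and checkpoint are hypothetical, and DictUnderFS is only an in-process stand-in for a remote store such as HDFS or S3.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class UnderFS(ABC):
    """Hypothetical minimal contract an under-layer store must satisfy."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class LocalUnderFS(UnderFS):
    """Single-node local file system backend rooted at a directory."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def write(self, path: str, data: bytes) -> None:
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def read(self, path: str) -> bytes:
        with open(self._full(path), "rb") as f:
            return f.read()


class DictUnderFS(UnderFS):
    """Stand-in for a remote store (assumption: no real network calls)."""

    def __init__(self) -> None:
        self.blobs = {}

    def write(self, path: str, data: bytes) -> None:
        self.blobs[path] = bytes(data)

    def read(self, path: str) -> bytes:
        return self.blobs[path]


def checkpoint(in_memory_files: dict, ufs: UnderFS) -> None:
    """Persist every in-memory file to whichever backend was plugged in."""
    for path, data in in_memory_files.items():
        ufs.write(path, data)


# The same checkpoint call works against either backend.
memory = {"/etl/part-0": b"a,b,c\n1,2,3\n"}
local = LocalUnderFS(tempfile.mkdtemp())
remote = DictUnderFS()
for backend in (local, remote):
    checkpoint(memory, backend)
```

The point of the design is that the checkpointing code only knows about the small interface, so adding another storage system means adding one more class and nothing else.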
Native support for raw tables: Tables with hundreds of columns or more are common in data warehouses. Tachyon provides native support for multi-column data, with the option to put only hot columns in memory to save space.
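The hot-column idea can be illustrated with a toy model. This is only a sketch under my own assumptions: ColumnarTable and its methods are hypothetical names, and a plain dict stands in for the under-layer storage that holds every column.

```python
class ColumnarTable:
    """Toy table that pins only the hot columns in memory."""

    def __init__(self, columns: dict, hot: list) -> None:
        # Every column is available from the (simulated) under-layer store.
        self._cold_store = dict(columns)
        # Only the columns named in `hot` are copied into memory.
        self._memory = {name: list(columns[name]) for name in hot}
        self.cold_reads = 0

    def read_column(self, name: str) -> list:
        if name in self._memory:
            return self._memory[name]
        # Cold column: fall back to the slower backing store.
        self.cold_reads += 1
        return self._cold_store[name]

    def memory_cells(self) -> int:
        """Number of values currently held in memory."""
        return sum(len(values) for values in self._memory.values())


table = ColumnarTable(
    columns={
        "price": [10, 20, 30],
        "region": ["eu", "us", "eu"],
        "notes": ["", "promo", ""],
    },
    hot=["price"],
)
prices = table.read_column("price")  # served from memory
notes = table.read_column("notes")   # served from the backing store
```

Here only three of the nine values occupy memory, at the cost of a slower read whenever a query touches a cold column.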
What happens if the data set does not fit in memory
Depending on the system setup, Tachyon may leverage local SSDs and HDDs. It keeps hot data in Tachyon and cold data in the under-filesystem.
Fault Tolerance in Tachyon is based upon a multi-master approach where multiple master processes are run. One of these processes is elected the leader and is used by all workers and clients as the primary point of contact. The other masters act as standbys using the shared journal to ensure that they maintain the same file system metadata as the leader and can rapidly take over in the event of the leader failing.
If the leader fails, a new leader is automatically selected from the available standby masters and Tachyon proceeds as usual.
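The failover described above can be sketched as a small simulation. Everything here is illustrative: the Master and Cluster classes are hypothetical, the journal is just a shared Python list, and "lowest id among the live masters wins" is a stand-in rule I chose for the election, not the mechanism Tachyon uses.

```python
class Master:
    def __init__(self, master_id: int, journal: list) -> None:
        self.master_id = master_id
        self.journal = journal  # shared, append-only list of (path, size)
        self.metadata = {}      # path -> size, rebuilt from the journal
        self.applied = 0        # how many journal entries have been replayed
        self.alive = True

    def catch_up(self) -> None:
        """Replay journal entries this master has not applied yet."""
        for path, size in self.journal[self.applied:]:
            self.metadata[path] = size
        self.applied = len(self.journal)

    def create_file(self, path: str, size: int) -> None:
        """Leader-side write: journal first, then update local metadata."""
        self.journal.append((path, size))
        self.catch_up()


class Cluster:
    def __init__(self, master_count: int) -> None:
        self.journal = []
        self.masters = [Master(i, self.journal) for i in range(master_count)]

    def leader(self) -> Master:
        """Pick the live master with the lowest id (stand-in election rule)."""
        alive = [m for m in self.masters if m.alive]
        if not alive:
            raise RuntimeError("no master available")
        chosen = min(alive, key=lambda m: m.master_id)
        chosen.catch_up()  # a standby replays the journal before serving
        return chosen


cluster = Cluster(3)
cluster.leader().create_file("/etl/day1", 128)
cluster.leader().alive = False  # the current leader (master 0) fails
standby = cluster.leader()      # master 1 takes over
standby.create_file("/etl/day2", 256)
```

Because the standby rebuilds its metadata from the shared journal, it knows about /etl/day1 even though that file was created while another master was leading.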
Tachyon as a Tiered Storage
Storage in Tachyon can be modelled as a set of tiers, where each tier can be a different storage medium
- Eviction policy: only LRU for now
- Directories and their sizes are configured for each tier separately
- When a storage tier becomes full, data is spilled to the next tier according to the eviction policy
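A rough way to picture the spill behaviour is the simulation below. It is a sketch under my own assumptions rather than Tachyon's implementation: TieredStore is a hypothetical class, capacities are counted in blocks instead of bytes, and a block evicted from the last tier is simply dropped on the assumption that the under-filesystem still holds a copy.

```python
from collections import OrderedDict


class TieredStore:
    def __init__(self, tiers: list) -> None:
        # tiers: list of (name, capacity_in_blocks), fastest tier first
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        # Each OrderedDict keeps blocks from least to most recently used.
        self.blocks = [OrderedDict() for _ in tiers]

    def put(self, block_id: str, data: bytes) -> None:
        """Write a block into the top tier, spilling LRU blocks downwards."""
        for tier in self.blocks:
            tier.pop(block_id, None)  # drop any stale copy first
        self._insert(block_id, data, level=0)

    def _insert(self, block_id: str, data: bytes, level: int) -> None:
        if level >= len(self.blocks):
            return  # fell off the last tier (assumed to remain in the under-FS)
        tier = self.blocks[level]
        tier[block_id] = data
        if len(tier) > self.capacity[level]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            self._insert(victim_id, victim, level + 1)

    def get(self, block_id: str):
        """Read a block and mark it as recently used within its tier."""
        for tier in self.blocks:
            if block_id in tier:
                tier.move_to_end(block_id)
                return tier[block_id]
        return None

    def locate(self, block_id: str):
        """Return the name of the tier holding the block, or None."""
        for name, tier in zip(self.names, self.blocks):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for i in range(4):
    store.put(f"b{i}", b"x")  # b0 and b1 spill from MEM to SSD
store.get("b2")               # b2 becomes the most recently used in MEM
store.put("b4", b"x")         # b3 spills to SSD, which pushes b0 to HDD
```

After these calls the recently touched b2 stays in memory while the older b3 is pushed down, which is the behaviour you want for hot versus cold data.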
Where Tachyon makes MOST sense
In an enterprise setting, with multiple jobs and applications running together, there are some variables that you cannot always control: a JVM may simply crash, Spark can run out of memory, or an app may affect memory in some unforeseen way. In any of these cases, customers' jobs should be restartable without losing their in-memory datasets (RDDs), and the overall system must respond gracefully. This is where Tachyon comes in; it survives JVM crashes so the show does indeed go on.
Moreover, for long-running Spark jobs, Tachyon outperforms the Spark cache because garbage collection kicks in sooner in the Spark JVM, whereas Tachyon and its off-heap memory storage are not affected.
Where Tachyon Does NOT make sense
Tachyon provides high I/O performance, but if a task is primarily CPU bound, it will not see significant performance gains.
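A quick back-of-the-envelope calculation shows why. The sketch below applies Amdahl's law to a job where only the I/O share of the runtime gets faster. The fractions and the 10x I/O speedup are made-up illustrative numbers, not measurements, and overall_speedup is just a helper name I picked.

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Amdahl's law: only the I/O share of the runtime gets faster."""
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# Hypothetical jobs: one spends 10% of its time on I/O, the other 90%.
cpu_bound = overall_speedup(io_fraction=0.10, io_speedup=10.0)
io_bound = overall_speedup(io_fraction=0.90, io_speedup=10.0)
print(f"CPU-bound job: {cpu_bound:.2f}x")  # 1.10x
print(f"I/O-bound job: {io_bound:.2f}x")   # 5.26x
```

With the same 10x faster I/O, the CPU-bound job barely moves while the I/O-bound job runs about five times faster.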
Tachyon's Use Cases
Memory storage for serialized blocks
Caching layer for predictable performance
Where is Tachyon currently being used
Tachyon can be used as a fast analytic query engine server by hooking up Spark SQL to a BI / visualization tool
Start Here – http://tachyon-project.org/downloads/
Tachyon Locally – https://github.com/amplab/tachyon/wiki/Running-Tachyon-Locally
Running Spark with Tachyon – http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
Tachyon on AWS – http://tachyon-project.org/documentation/Deploy-Module.html
Developer Docs – http://tachyon-project.org/documentation/#developer-documentation
Tachyon Performance Benchmarks compared to HDFS
Tachyon Git Repository – https://github.com/amplab/tachyon
Companies Using Tachyon
Atigeo – Company Slides – http://www.slideshare.net/ClaudiuBarbura/tachyon-meetup-san-francisco-oct-2014
H2O is actively using Tachyon