Summary of Tachyon InMemory File System

Recently I had the opportunity to work with Tachyon an In Memory File System. I used Tachyon as a Caching Layer for ETL Output to be used downstream for 2 purposes

  1. Low Latency Adhoc Query using Spark SQL
  2. Used for Analytics and Algorithms downstream

The writeup below is a consolidation of what I learnt about Tachyon.

Tachyon – In Memory Data Exchange Layer

Tachyon is an in-memory distributed file system with HDFS / any file system backup. It has resilience built into it through lineage and survives Spark JVM restarts. It allows for fine tuning performance and can act as a cache for Warehouse table – which is faster than in-process cache due to delayed GC. It can provide efficient in-memory columnar storage with compression. It is written in Java and currently works on Linux and Mac.

Image1

In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

  • It allows multiple apps / executors to share the same pool of memory
  • It significantly reduces garbage collection costs
  • Cached data is not lost if individual executors crash
  • Spark Context might crash

Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change.

Tachyon implements the Hadoop FileSystem interface. Therefore, Hadoop MapReduce and Spark can run with Tachyon without modification.

Pluggable under-layer file system: To provide fault-tolerance, Tachyon checkpoints in-memory data to the under-layer file system. It has a generic interface to make plugging different under-layer file systems easy. Currently support HDFS, S3, GlusterFS, and single-node local file systems.

Native support for raw tables: Table data with over hundreds of columns is common in data warehouses. Tachyon provides native support for multi-columned data, with the option to put only hot columns in memory to save space.

What happens if data set does not fit in memory

Depends on the system setup, Tachyon may leverage local SSD and HDD. It keeps hot data in Tachyon, and cold data in Under-Filesystem

Fault Tolerance in Tachyon is based upon a multi-master approach where multiple master processes are run. One of these processes is elected the leader and is used by all workers and clients as the primary point of contact. The other masters act as standbys using the shared journal to ensure that they maintain the same file system metadata as the leader and can rapidly take over in the event of the leader failing.

If the leader fails a new leader is automatically selected from the available standby masters and Tachyon proceeds as usual.

Image2

Tachyon as a Tiered Storage

The under-file system in Tachyon can be modelled as a Tiered Layer – where each layer can be a different storage

Image3

  • Eviction policy – Only LRU for now
    ● Directories and their sizes configured for each tier separately
    ● When storage tier became full data is spilled to next level by eviction policy

Where Tachyon makes MOST sense

In an enterprise setting, with multiple jobs and applications running together, there are some variables that you cannot always control for:  A JVM may simply crash.  Spark can run out of memory.  An app may impact memory in some unforeseen way.  But in any of these cases, customer’s jobs can be restarted without losing their in-memory datasets (RDDs) and the overall system must respond gracefully.  This is where Tachyon comes in; it survives JVM crashes so the show does indeed go on.

Moreover, for long-running Spark jobs, Tachyon outperforms the Spark Cache as garbage collection kicks in sooner in the Spark JVM, whereas Tachyon and its off-heap memory storage is not affected.

Where Tachyon Does NOT make sense

Tachyon provides high I/O performance, but if task is primarily CPU bound, will not be able to get significant performance gains.

Tachyons Use Cases

Memory storage for serialized blocks

Image4

Caching layer for predictable performance

Image5

Where is Tachyon currently being used

Image6

Tachyon can be used as a Fast Analytic Query Engine Server by hooking up SparkSQL to the BI /Visualization Tool

Image7

References

Start Here – http://tachyon-project.org/downloads/

Next – http://ampcamp.berkeley.edu/5/exercises/tachyon.html

Tachyon Locally –              https://github.com/amplab/tachyon/wiki/Running-Tachyon-Locally

Tachyon HDFS – https://github.com/amplab/tachyon/wiki/Running-Hadoop-MapReduce-on-Tachyon

Running Spark with Tachyon – http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html

Tachyon Examples – https://github.com/amplab/tachyon/tree/master/examples/src/main/java/tachyon/examples

Tachyon on AWS – http://tachyon-project.org/documentation/Deploy-Module.html

Developer Docs – http://tachyon-project.org/documentation/#developer-documentation

Tachyon Performance Benchmarks compared to HDFS

http://www.datanami.com/2014/08/14/amplabs-tachyon-promises-solidify-memory-analytics/

Tachyon Git Repository – https://github.com/amplab/tachyon

Documents

http://files.meetup.com/14452042/Tachyon_Meetup_2014_8_25.pdf

http://files.meetup.com/14452042/Tachyon_Meetup_2015_5_28-1-Baidu.pdf

http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16-Strata.pdf

https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-14-Zoomdata-Alarcon.pdf

Tachyon JIRA – https://tachyon.atlassian.net/projects/TACHYON/issues/TACHYON-780?filter=allopenissues

Companies Using Tachyon

http://www.tachyonnexus.com/

Atigeo – Company Slides – http://www.slideshare.net/ClaudiuBarbura/tachyon-meetup-san-francisco-oct-2014

H20 is actively using Tachyon

Baidu

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s