Data Types
JSON
JSON stands for
JavaScript Object NotationJSON is a lightweight format for storing and transporting data
JSON is often used when data is sent from a server to a web page
JSON is “self-describing” and easy to understand
JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
YAML
Yet Another Markup LanguageYAML is a digestible data serialization language that is often utilized to create configuration files and works in concurrence with any programming language.
YAML is a data serialization language designed for human interaction. It’s a strict superset of JSON, another data serialization language.
But because it’s a strict superset, it can do everything that JSON can and more.
One major difference is that newlines and indentation actually mean something in YAML, as opposed to JSON, which uses brackets and braces.
Parquet
Open-source columnar file format that provides I/O optimization such as data compression, which gives quick access to data columns for querying.
Columnar storage is designed to bring efficiency compared to row-based files like CSV. When querying, columnar storage you can skip over the non-relevant data very quickly.
With columnar storage, aggregation queries are less time consuming compared to row-oriented databases.
This way of storage has translated into hardware savings and minimized latency for accessing data.
Advantages
Parquet has helped reduce storage requirements by at least one-third on large datasets, in addition, it greatly improved scan and deserialization time which ultimately reduces the overall costs.
Parquet is a columnar file format, which means that it can be used to store data in a more efficient way than CSV.
Dynamic Allocation
The Spark driver can request more or fewer compute resources as the demand of large workflows. If workflows vary in their demands for compute capacity then Dynamic Allocation could be an option (e.g. streaming data).
Example of settings:
spark.dynamicAllocation.enabled truespark.dynamicAllocation.minExecutors 2
Must configure spark executors’ memory allocation
Default:
Execution memory: 60%
Storage memory: 40%
Reserved Memory 300 MB (safe-gaurd OOM)
The default settings work in most cases. - can be tricky to tune %
Storage memory is primarily for caching data structures/partitions from DataFrames.