Zoroaster: Realistic synthetic workload

This project is no longer active. Information is still available below.

One of the biggest challenges for system architects and performance analysts is identifying and obtaining real workloads to evaluate and compare system designs. CERN, for instance, collects approximately 86 terabytes of data each day from its Large Hadron Collider (LHC), 90 percent of which is filtered out. The remaining 8 terabytes or so are stored as traces to create real workloads that represent the system environment used by the worldwide users of the CERN compute systems. Not having to store all these traces every day would allow CERN to reassign that compute power to other tasks.
We propose ZOROASTER, a suite of machine learning tools to generate and validate realistic workloads that allow users to make storage decisions without the privacy, cost, and management overheads of storing raw I/O traces. The project addresses three research objectives:
 1. Automating workload generation will reduce the cost of ownership of systems. We will test generative models that leverage a neural network's ability to learn from realistic sets of traces to generate workload samples, as opposed to using probability distributions to generate synthetic data. We will also generate hybrid workloads, which will eliminate the need for multiple workloads to test different aspects of a hybrid system. Generated samples will also be tested over scalable system architectures.
 2. A design guide for Generative Adversarial Networks (GANs) will facilitate an exploration of unorthodox applications of these models. We propose to identify a set of design rules, not only for GANs but also for other generative models, that will make building them more generalizable instead of limiting them to a single kind of problem. Inconsistencies in real workloads will be studied to understand how unexpected inequalities in time-series distributions can cause issues such as overfitting, diminished gradients, and mode collapse in generative models.
 3. Generating realistic traces will help maintain the privacy of sensitive user and business information. Data privacy laws restrict the sharing of real data, and anonymization falls short for high-dimensional, highly correlated data. We propose to show that generative models can produce samples that are realistic and representative, thereby eliminating the need for users to share their sensitive information.
This project's analysis of generative model architectures for different trace types will contribute to the understanding of the roles that layers and nodes play in the configuration of multilayer neural networks. It will pave the way for interdisciplinary research in statistics, physics, systems, and performance testing. The research will reduce provisioning cost in cases where storage systems need to scale to petabyte workloads. Performance prediction of future systems is not limited to enterprise storage systems; it can also be used by databases, the cloud, distributed systems, and the gaming community. The project's support for anonymity of sensitive company and user information will facilitate trusted collaboration between system vendors and consumers. This is also useful for the healthcare industry, whose high privacy standards make it almost impossible to obtain representative real workloads.
Key Words: Synthetic workload generation; Generative Adversarial Networks


Our proposed research includes data management to collect and clean up the real workloads; feature extraction to find the features and workload classes that the discriminator will use for training; training and generation of workload samples; and statistical analysis of the generated data.

Data Management:
We used traces from the CERN EOS logs to train our preliminary GAN prototype. EOS is an open-source storage software solution that CERN uses to manage multi-petabyte storage for the Large Hadron Collider (LHC). EOS records fields from the metadata of each file, which describe where and how the file is stored. EOS also records additional file operations such as vector reads and third-party copy transfers. The EOS log files have a total of 57 fields to record file creation and updates. We use data points from the Compact Muon Solenoid (CMS) experiment. This dataset contains random variations caused by inconsistencies in system accesses, which may interfere with the performance of other workloads. To reduce these variations, we apply a moving-average trend analysis before training and prediction. This isolates the inconsistent trends in the data and helps improve prediction accuracy. Additionally, we derive target features such as throughput and latency, which are not always recorded directly in the dataset, from features that are recorded, such as the start and end timestamps of events pertaining to a file. The current version of the CERN EOS access log we are using contains a subset of 21 features that describe file accesses. It tracks the time taken to complete actions using open and close timestamps and read/write times. To help identify potential bottlenecks on the system, it keeps track of the number of read/write calls. It also tracks changes in file sizes using the bytes read from and written to files.
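
The smoothing and derived-feature steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the record keys (`open_ts`, `close_ts`, `bytes_read`), the window size, and the choice of a trailing simple moving average are all assumptions made for the example.

```python
def moving_average(values, window=3):
    """Trailing simple moving average.

    The first few points average over however many samples are available.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the sample that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def derive_targets(record):
    """Compute latency (seconds) and read throughput (bytes/second).

    The keys used here are hypothetical stand-ins for EOS log fields.
    """
    latency = record["close_ts"] - record["open_ts"]
    throughput = record["bytes_read"] / latency if latency > 0 else 0.0
    return {"latency_s": latency, "read_throughput_Bps": throughput}


# Toy records, invented for illustration only.
records = [
    {"open_ts": 0.0, "close_ts": 2.0, "bytes_read": 400},
    {"open_ts": 5.0, "close_ts": 6.0, "bytes_read": 900},
    {"open_ts": 9.0, "close_ts": 13.0, "bytes_read": 400},
    {"open_ts": 20.0, "close_ts": 22.0, "bytes_read": 600},
]
throughputs = [derive_targets(r)["read_throughput_Bps"] for r in records]
smoothed = moving_average(throughputs, window=2)
```

In this toy input the second record has a throughput spike; the smoothed series damps it before the data would reach a model.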

Evolution of the GAN architecture:
When we first started working with the CERN dataset, we decided to work with only four fields: the number of bytes read and written, and the time taken to read and write those bytes. At that time, we had not yet come up with the WinnowML design. Since the discriminator of the GAN was meant to be a classifier that matched generated samples to trace indexes, our first step was to build a prototype of the discriminator. Our first version was a binary classifier that sorted normalized disk read and write speeds, derived from the CERN logs, into two classes: high (above 1) and low (below 1). The classifier reached a validation accuracy of 96.7 percent, which we attribute to the relatively small number of features we were working with.
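
To make the two classes concrete, the sketch below derives a speed from bytes and time and labels it against the threshold of 1. The text does not say how the speeds were normalized; dividing by the mean is an assumption here, as are the function names, the toy values, and the choice to treat a value of exactly 1 as low. The classifier itself is not shown.

```python
def normalize_by_mean(speeds):
    """Scale speeds so their mean is 1 (an assumed normalization)."""
    mean = sum(speeds) / len(speeds)
    if mean == 0:
        raise ValueError("cannot normalize speeds with zero mean")
    return [s / mean for s in speeds]


def label_speed(normalized_speed, threshold=1.0):
    """Return 'high' above the threshold, otherwise 'low'."""
    return "high" if normalized_speed > threshold else "low"


# Toy values, invented for illustration only.
bytes_read = [100, 300, 200, 600]
read_time_s = [1.0, 1.0, 2.0, 2.0]

speeds = [b / t for b, t in zip(bytes_read, read_time_s)]
normalized = normalize_by_mean(speeds)
labels = [label_speed(x) for x in normalized]
```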

We developed a preliminary GAN prototype consisting of a discriminator and a generator, both of which are sequential multilayer perceptrons. Each contains four dense layers alternating with LeakyReLU layers, which accelerate the convergence of stochastic gradient descent and in turn reduce the computational burden on our model. We converted the fields to time series data and normalized them as part of input pre-processing. We trained both models on data in this format, but the generator currently produces only random noise that does not look like a workload. Our intuition is that a dense neural network does not provide a good enough representation of time series data to serve as feedback to the GAN's generator, because the layers of a dense network are not connected in a way that captures the correlation between past, present, and future events.
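
The shape of such a prototype can be sketched in plain Python as below. This shows only an untrained forward pass through two four-dense-layer perceptrons with LeakyReLU between the layers. The layer widths, noise dimension, window length, LeakyReLU slope, and the final tanh and sigmoid activations are assumptions for illustration, and no training loop is included.

```python
import math
import random


def leaky_relu(x, alpha=0.2):
    """LeakyReLU: pass positives through, scale negatives by alpha."""
    return x if x > 0 else alpha * x


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_dense(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Apply one dense layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def make_mlp(sizes, rng):
    """Build len(sizes) - 1 dense layers from a list of layer widths."""
    return [make_dense(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(mlp, x, final_activation):
    """LeakyReLU after every dense layer except the last one."""
    for layer in mlp[:-1]:
        x = [leaky_relu(v) for v in dense(layer, x)]
    return [final_activation(v) for v in dense(mlp[-1], x)]


rng = random.Random(0)
NOISE_DIM, WINDOW = 8, 16  # hypothetical sizes

generator = make_mlp([NOISE_DIM, 32, 32, 32, WINDOW], rng)  # four dense layers
discriminator = make_mlp([WINDOW, 32, 32, 32, 1], rng)      # four dense layers

noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
fake_window = forward(generator, noise, math.tanh)           # one synthetic window
score = forward(discriminator, fake_window, sigmoid)[0]      # "realness" in (0, 1)
```

Each layer here sees the whole window as an unordered vector of inputs, which is one way to picture why a purely dense network has no built-in notion of temporal order.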

Since the generation of realistic time series using GANs has not been sufficiently studied, we need to evaluate which neural network architecture is most suitable for this kind of sample generation. We will evaluate different combinations of generators and discriminators, with architectures ranging from vanilla feed-forward neural networks (FFNNs) to Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory networks (LSTMs). CNNs are useful when working with a large number of features, since the output parameters are defined per convolution rather than per unit, which significantly restricts the number of features that need to be selected. However, since WinnowML already restricts the number of features we use to train our GAN, using CNNs will probably not have an added impact on the performance of the GAN.
In FFNNs, each layer is defined as a function of the output of previous layers. RNNs additionally feed the output of a layer back into the network at the next time step, which allows them to model the dependence of present events on past ones. LSTMs, in turn, carry a cell state along a path to which no activation function is applied, allowing the same signal to flow through the network for longer than in plain RNNs. This helps LSTMs represent time series better, since the interaction between past and present events is sensitive to both very distant and recent events.
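
The difference in how long a signal survives can be illustrated with a scalar sketch. The code below feeds a single input spike followed by zeros into a simple recurrent step and into a standard LSTM step. All weights are hand-picked for illustration rather than learned, and the parameter layout (`PARAMS`, with one `(w_h, w_x, bias)` tuple per gate) is a convention invented for this example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """Simple recurrent step: the previous state is squashed by tanh each step."""
    return math.tanh(w_h * h_prev + w_x * x)


def lstm_step(h_prev, c_prev, x, params):
    """One scalar LSTM step; params maps a gate name to (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre("forget"))
    i = sigmoid(pre("input"))
    o = sigmoid(pre("output"))
    g = math.tanh(pre("candidate"))
    c = f * c_prev + i * g  # additive update: no activation applied to c_prev
    h = o * math.tanh(c)
    return h, c


# Hand-picked weights (assumptions): forget gate stays near 1,
# input gate opens only when x is large.
PARAMS = {
    "forget": (0.0, 0.0, 6.0),
    "input": (0.0, 10.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (0.0, 2.0, 0.0),
}

inputs = [1.0] + [0.0] * 20  # one spike, then silence

h_rnn = 0.0
h_lstm, c_lstm = 0.0, 0.0
for x in inputs:
    h_rnn = rnn_step(h_rnn, x)
    h_lstm, c_lstm = lstm_step(h_lstm, c_lstm, x, PARAMS)
```

With these weights the recurrent state decays toward zero within a few steps, while the LSTM cell state still holds most of the spike after 20 silent steps.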

Finally, it will also be interesting to examine which of the above architectures achieves the best accuracy for different kinds of datasets. Across time series data, cross-sectional data, and pooled data (a combination of time series and cross-sectional data), the models that capture patterns most efficiently should vary.


Last modified 25 Jan 2024