Generating Synthetic Financial Data With Generative Adversarial Neural Networks (GANs) and Transformers


Introduction

Imagine you are working at a major investment bank like J.P. Morgan or Goldman Sachs, or maybe at a hedge fund, where modeling how the market behaves is critical, whether for trading strategies, risk controls, or stress scenario simulations. In these types of environments, being able to generate realistic synthetic financial time-series data isn’t just a nice-to-have – it is necessary.

Financial market data is messy: it’s high-dimensional, noisy, non-stationary, and full of seasonality and calendar effects (like month-end anomalies). Classic forecasting models often fall short here, especially when faced with outliers or structural changes in behavior.

To tackle this, I built a generative model – a GAN-based architecture specifically designed for financial time-series data:

  • The generator is a Transformer that learns to generate the next day’s features from a short rolling window of previous days.
  • The discriminator is a feedforward network that tries to tell apart real vs. generated next-day data.
  • The generator loss blends Binary Cross Entropy, MSE, and a custom volume-aware term to better handle the scale mismatch between price and volume.
  • A Gaussian noise layer is added to the generator, which helps avoid mode collapse and encourages more diverse outputs.
  • Ticker (stock name) embeddings allow the model to generalize across multiple stocks while still capturing each one's unique behavior.
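The blended generator loss can be sketched as follows. This is a minimal numpy illustration, not the post's exact code: the weighting coefficients and the log-scale volume comparison are assumptions made to show how a volume-aware term can offset the scale mismatch between price and volume features.

```python
import numpy as np

def generator_loss(disc_out_fake, fake_next, real_next, vol_idx,
                   w_adv=1.0, w_mse=1.0, w_vol=0.5, eps=1e-8):
    """Blend of adversarial BCE, feature MSE, and a volume-aware term.

    disc_out_fake : discriminator probabilities for generated samples
    fake_next, real_next : (batch, n_features) next-day feature vectors
    vol_idx : column index of the volume feature (assumed layout)
    """
    # Adversarial part: the generator wants the discriminator
    # to output "real" (label 1) for generated samples
    bce = -np.mean(np.log(disc_out_fake + eps))
    # Reconstruction part: plain MSE over all features
    mse = np.mean((fake_next - real_next) ** 2)
    # Volume-aware part: compare volumes on a log scale so their
    # large magnitude does not dominate the price features
    vol = np.mean((np.log1p(np.abs(fake_next[:, vol_idx]))
                   - np.log1p(np.abs(real_next[:, vol_idx]))) ** 2)
    return w_adv * bce + w_mse * mse + w_vol * vol
```

With identical real and generated features only the adversarial term remains, and any mismatch in the volume column is penalized on a log scale rather than in raw units.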
Continue reading

Support Vector Machine (SVM)

Support Vector Machine is a supervised learning algorithm whose goal is to find the optimal hyperplane separating clusters of data. With two-dimensional data this separating hyperplane is a line, and with three-dimensional data it is a plane; only in higher dimensions do we actually call it a hyperplane. Suppose we have a dataset with some positive and some negative examples. How do we separate the positive examples from the negative ones?
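The idea can be illustrated with a minimal sketch: a linear SVM trained by sub-gradient descent on the hinge loss. This is a simplification of how SVMs are usually solved (real solvers use quadratic programming or SMO), and the toy data and hyperparameters are assumptions for illustration only.

```python
import numpy as np

# Toy 2D data: positive examples around (2, 2), negatives around (-2, -2)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# Linear SVM via sub-gradient descent on the regularized hinge loss:
# minimize  lam/2 * ||w||^2 + mean(max(0, 1 - y_i (w.x_i + b)))
w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for epoch in range(200):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) < 1:          # point inside the margin
            w += lr * (yi * xi - lam * w)
            b += lr * yi
        else:                               # correctly classified, only shrink w
            w -= lr * lam * w

# The learned hyperplane is w.x + b = 0; sign(w.x + b) classifies new points
```

On linearly separable clusters like these, the learned line ends up between the two groups with all points classified correctly.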

Continue reading

Linear Regression

Regression models are among the most studied and best understood models in statistics. Regression and correlation are closely related, and both were invented by Francis Galton, the famous Victorian statistician.

In linear regression, we model the relationship between two variables – a dependent and an independent variable – and we represent the data by drawing a line through it. Before fitting the data and deciding whether to use linear regression or something else, we need to check whether there is a relationship between the variables. For this, a scatterplot can be helpful.
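The fit itself can be sketched in a few lines of numpy; the synthetic data here is an assumption for illustration, with a true slope of 2 and intercept of 1.

```python
import numpy as np

# Synthetic data with a roughly linear relationship: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)

# Correlation is a quick numeric stand-in for the scatterplot check:
# a value near +1 or -1 suggests a linear relationship
r = np.corrcoef(x, y)[0, 1]

# Fit a line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, 1)
```

The recovered slope and intercept land close to the true values 2 and 1, and the correlation coefficient is close to 1.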

Continue reading

Topological Data Analysis (TDA)

Introduction

More data has been created in the last two years than in the entire history of the human race, and the amount of data keeps growing at an unprecedented rate. Computational power is becoming more available and more powerful, which allows us to collect data very quickly. This data is often very noisy, high-dimensional, and full of missing information, which reduces our ability to visualize it. Usually, only a small part of the data provided is usable, and determining which part is relevant is not an easy task. It is important first to gain a general understanding of the data we are dealing with, and then to develop quantitative methods to analyze it. This is where topology, a branch of mathematics, can be very useful.

Continue reading

Backward Elimination

In the previous post I talked about multiple linear regression, preparing the data, and model predictions.

Now we can build the optimal model using backward elimination. But why is the model we have built so far not optimal? Because so far we have been using all of the independent variables. Some of them are highly statistically significant, which means they have a big impact on the dependent variable “Profit”, but others are not statistically significant – they have little or no impact on the dependent variable. That means we can remove the variables that are not statistically significant from the model and still get good predictions. The goal is to find the set of independent variables in which each variable is highly statistically significant and strongly affects the dependent variable.
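The elimination loop can be sketched in pure numpy. This is a simplified version of the classical p-value procedure: as an assumption, it uses the t-statistic threshold |t| >= 2 as a rough proxy for the usual p < 0.05 significance level, and all variable names are hypothetical.

```python
import numpy as np

def backward_elimination(X, y, names, t_min=2.0):
    """Repeatedly drop the least significant column until every remaining
    coefficient has |t| >= t_min (roughly p < 0.05 for large samples).
    X is assumed to include a leading column of ones for the intercept."""
    names = list(names)
    while X.shape[1] > 1:
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        dof = X.shape[0] - X.shape[1]
        sigma2 = resid @ resid / dof
        # Standard errors from the diagonal of sigma^2 * (X'X)^-1
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        t = np.abs(beta / se)
        worst = int(np.argmin(t[1:])) + 1   # never drop the intercept
        if t[worst] >= t_min:
            break                            # every variable is significant
        X = np.delete(X, worst, axis=1)      # remove the weakest variable
        names.pop(worst)
    return names
```

On data where only one predictor truly drives the response, the loop keeps that predictor (and the intercept) while the noise variables tend to be eliminated.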

Continue reading

Multiple Linear Regression

Problem description

We have a CSV file containing information about 50 companies. Your job as a data scientist is to help venture capitalists decide which company they should invest in, using the information provided in the CSV file. There are five columns, stating how much each company spent during the year on Research & Development, Administration, and Marketing, in which state the company is located, and the profit of the company.
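One preparatory step such data typically needs is encoding the categorical state column before regression. A minimal pandas sketch, in which the row values and column names are hypothetical placeholders, not the actual file contents:

```python
import pandas as pd

# Hypothetical rows mirroring the five columns described above
df = pd.DataFrame({
    "R&D Spend":      [160000.0, 150000.0, 120000.0, 100000.0],
    "Administration": [130000.0, 120000.0, 140000.0, 110000.0],
    "Marketing":      [470000.0, 440000.0, 300000.0, 250000.0],
    "State":          ["New York", "California", "Florida", "New York"],
    "Profit":         [190000.0, 180000.0, 150000.0, 130000.0],
})

# The State column is categorical, so it must be one-hot encoded;
# drop_first avoids the dummy-variable trap (perfect collinearity)
X = pd.get_dummies(df.drop(columns="Profit"), columns=["State"], drop_first=True)
y = df["Profit"]
```

With three states and `drop_first=True`, the three numeric columns plus two dummy columns give a five-column design matrix ready for a multiple linear regression.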

Continue reading

Lax Pair and the Nonlinear Schrödinger (NLS) Equation (Part 1)


The idea behind the Lax representation (Lax pair) is that a nonlinear PDE can be expressed as the compatibility condition of two linear equations.

If we have a system of two partial differential equations

\begin{array}{ccc} \left( i \partial_x + U(x, \lambda) \right) \Psi & = &0 \\  \left( i \partial_t + V(x, \lambda) \right) \Psi & = & 0 \end{array}

on a domain \Omega \subset \mathbb{R}^2, we can define their compatibility condition as

i V_x - iU_t + [U, V] = 0.
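As a short check, this condition comes from requiring the mixed partial derivatives of \Psi to agree. The two linear equations give \Psi_x = iU\Psi and \Psi_t = iV\Psi, so

\begin{array}{ccc} \Psi_{xt} & = & iU_t \Psi - UV \Psi \\ \Psi_{tx} & = & iV_x \Psi - VU \Psi \end{array}

and setting \Psi_{xt} = \Psi_{tx} yields exactly i V_x - iU_t + [U, V] = 0.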

Continue reading

Lax Pair and the Nonlinear Schrödinger (NLS) Equation (Part 2)


In general, the operator L has both a continuous and a discrete spectrum, so we can separate solutions into two groups:

  • Solutions parametrized by data on the continuous spectrum only
  • Solutions parametrized by the data on the discrete spectrum, which are known as soliton solutions

Continue reading