PowerExpect

Posted on October 13, 2025 • 26 min read • 5,468 words

A fast and simple app for viewing forecast and actual electricity prices for French and German/Luxembourg bidding zones.


1. Motivation  

The initial goal of this project was twofold: on the one hand, to learn a framework specialized in frontend development, and on the other hand, to test new methods/architectures for time series forecasting applied to electricity price forecasting.

Python frontend libraries are often limited and/or slow. I therefore wanted to test a framework specialized in frontend development for web applications. I chose Vue.js, which seemed to be the most accessible option.

I also wanted a comprehensive tool that would allow me to easily test new methods and models for time series forecasting, particularly electricity price forecasting, in real-world conditions. When working on this type of project in industry, you generally prefer to use what you know already works well, following Occam’s razor and under time constraints. This often leaves little room for in-depth exploration of new things. However, there is a wealth of literature on the subject, making me very curious.

2. Introduction  

Electricity Price Forecasting (EPF) is a well-known application of Time Series Forecasting (TSF) methods. This question has been explored for over 30 years now, since the wave of electricity market liberalisation in the 1990s, and is attracting increasing interest in the literature [1]. Firstly because TSF itself is attracting growing interest [2] and secondly because EPF plays a fundamental role in energy markets, enabling market players to optimize their trading strategies, reduce financial risks, and maintain grid stability [3]. However, this is a complex (and increasingly so) exercise because short-term price fluctuations are highly sensitive to renewable energy generation forecasts, regulatory and geopolitical announcements, and unforeseen disruptions such as generation outages or grid constraints. This results in prices with high volatility, non-linearity, and sudden spikes [3]. There are several EPF methods, but only those involving Machine Learning (ML) are of interest here.

The history of EPF has closely followed the history of TSF. TSF began with statistical methods based on moving averages developed in the 1950s to 1970s (exponential smoothing/ARIMA). It then continued with the first ML methods such as decision trees in the 1980s, followed by GBM (Gradient Boosting Machines) in 1999/2001. At the same time, deep learning (DL) models appeared with RNNs and LSTMs in the 1990s [2].

These fundamental building blocks were subsequently expanded, improved, and combined to create the wide variety of architectures we have today. Popular implementations of GBM such as XGBoost and LightGBM, which were not specifically designed for time series forecasting, appeared in the 2010s and quickly became standards in the field thanks to their performance and interpretability. Deep learning models have received increasing interest, with architectures based on MLP (Multi-Layer Perceptron) such as NBEATS and TIDE, and transformer-based architectures such as PatchTST and PETformer. Following the rise of foundation models for natural language processing, foundation models for time series forecasting have emerged, such as TimeGPT and Lag-Llama. Advances have also been made in other architectures, such as Mamba models, and in related areas such as feature engineering. Overall, TSF is a particularly dynamic and fruitful area of research, where many ideas show great potential [2].

The aim of EPF (and more generally of TSF) is to find, according to certain metrics, a sufficiently good approximation of the function:

$$f : Z = \Big(\pmb{y}_{t-L+1:t},\;\pmb{X}^{(n)}_{t-L+1:t},\;\pmb{X}^{(m)}_{t+1:t+H}\Big) \longmapsto f(Z) = \pmb{\hat{y}}_{t+1:t+H}$$

With:

  • $L$ = the past observations window size
  • $H$ = the forecast horizon
  • $n$ = the number of features
  • $m$ = the number of features available during the forecast ($m \le n$)
  • $\pmb{y}_{t-L+1:t} \in \mathbb{R}^{L}$ = the vector of the $L$ past observations
  • $\pmb{X}^{(n)}_{t-L+1:t} \in \mathbb{R}^{L \times n}$ = the matrix of the $n$ explanatory variables over the $L$ past time steps
  • $\pmb{X}^{(m)}_{t+1:t+H} \in \mathbb{R}^{H \times m}$ = the matrix of the $m$ features available over the horizon $H$
  • $\pmb{\hat{y}}_{t+1:t+H} \in \mathbb{R}^{H}$ = the vector of the target variable forecast over the horizon $H$

The choice of $f$ is an architectural choice. It is based on several factors, such as our knowledge of the field of application, our belief in the architecture’s ability to model the relationships between the features and the target variable for this problem, and our understanding of the literature.

In this project, the choice of $f$ is based primarily on my curiosity after reading a paper on EPF or TSF.

3. Workflow  

PowerExpect workflow

The PowerExpect application has its backend on AWS and its frontend on Vercel. For this project, I wanted an application that updates automatically every day, following the pace of the day-ahead auctions. This meant I needed a machine that would start every day, execute the code, and save the results in a storage location accessible to the frontend via an API. I chose AWS, which I was already familiar with. The CI/CD pipeline is handled by CodePipeline and CodeBuild, which automatically retrieve the code from my GitHub repository, build the Docker image, and store it in ECR. An EC2 instance is launched every day by a Lambda function triggered by an EventBridge schedule; the instance retrieves the image from ECR when it starts up and executes it. All updated inputs and outputs are saved in two separate private S3 buckets. The instance also emails me the logs so that I don’t have to log in to AWS every time, which is a bit of a pain with MFA. The files on S3 can be large and exceed AWS service limits for serving them through the API directly, so the frontend downloads them straight from S3 via pre-signed URLs, valid for one second, served by an API.
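
As a rough illustration of that last point, here is a minimal boto3 sketch of how such an endpoint can hand out short-lived download links; the bucket name, key, and function name are placeholders, not the actual PowerExpect code.

```python
import boto3

s3 = boto3.client("s3")

def get_download_url(key: str) -> str:
    """Return a pre-signed URL so the frontend can fetch a result file
    straight from the private S3 bucket, bypassing API payload limits."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "powerexpect-outputs", "Key": key},  # placeholder bucket/key
        ExpiresIn=1,  # the URL is only valid for one second
    )
```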

4. Data  

The day-ahead (or SPOT) price for each 15-minute period of the following day is determined by an auction on each European market, which yields 96 prices per day for the French zone and 96 for the German/Luxembourg zone. Participants have until noon to submit the volumes they want to buy or sell as a function of price. The price at time t is therefore given by the intersection of the aggregate demand and supply curves at t, built from all participants’ orders [4].

SPOT price supply-demand curve

SPOT prices are therefore driven by supply (generation) and demand (consumption), which are themselves driven by numerous factors. The structure of the market in question, the composition of the electricity mix, the weather, balancing markets, operational constraints, the availability of power plants, seasonality/calendar (time, day, season, etc.), fuel prices, and CO2 quota prices all influence supply. Similarly, demand is influenced by seasonality/calendar and temperature effects, but also by other factors such as consumption habits [5].

CAL+1 prices, i.e. the series of prices for calendar contracts traded daily on exchanges in year Y for delivery in year Y+1, are currently being added.

Like any time series, prices are strongly correlated with their previous values. An autocorrelation and partial autocorrelation study on both markets makes it easy to select relevant lags; I chose lags common to both markets to simplify the pipelines, along the lines of the sketch below.
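
For reference, lag selection can be done roughly as follows with statsmodels; `prices` is assumed to be a pandas Series of 15-minute prices, and the threshold is only a rough significance band, not the exact procedure used here.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# `prices` is assumed to be a pandas Series of 15-minute day-ahead prices.
# Look at up to a week of lags to spot the daily (96) and weekly (672) cycles.
autocorrelations = acf(prices, nlags=7 * 96)
partial_autocorrelations = pacf(prices, nlags=2 * 96)

# Keep lags whose partial autocorrelation clearly exceeds a rough 95% band;
# the same shortlist is then applied to both bidding zones.
threshold = 1.96 / np.sqrt(len(prices))
candidate_lags = [
    lag for lag, value in enumerate(partial_autocorrelations)
    if lag > 0 and abs(value) > threshold
]
```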

For the time being, only the main and easily available influencing factors, such as generation and total demand, are used in order to limit the complexity of the models. Inevitably, this also limits their predictive power.

Data is primarily sourced from the great ENTSO-E Transparency Platform, accessible free of charge via an API. Models are trained on post-energy-crisis data, from 2022 onwards. The forecast horizon is currently one day, i.e. 96 points.
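
As an indication, the data pull can look like the sketch below using the entsoe-py client; the exact queries, zones, and date range used by PowerExpect may differ.

```python
import pandas as pd
from entsoe import EntsoePandasClient

# The API key is issued free of charge by the ENTSO-E Transparency Platform.
client = EntsoePandasClient(api_key="YOUR_API_KEY")

start = pd.Timestamp("2022-01-01", tz="Europe/Paris")
end = pd.Timestamp("2025-10-01", tz="Europe/Paris")

# Day-ahead prices and the main supply/demand drivers for the French zone;
# the same calls with "DE_LU" cover the German/Luxembourg zone.
prices_fr = client.query_day_ahead_prices("FR", start=start, end=end)
load_forecast_fr = client.query_load_forecast("FR", start=start, end=end)
wind_solar_fr = client.query_wind_and_solar_forecast("FR", start=start, end=end)
```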

5. NHITS  

a. Architecture  

N-HiTS, which stands for “Neural Hierarchical Interpolation for Time Series” [6], falls into the category of MLP-based DL architectures, which have now proven to be among the best options available. This model was introduced to overcome two typical problems in long-term forecasting: complexity, i.e. memory costs that explode with the size of the horizon, and forecast volatility, i.e. the difficulty of obtaining reliable and stable long-term forecasts.

To do this, the researchers started with the NBEATS model introduced three years earlier [7] and then improved it using two techniques: “Multi-rate Signal Sampling” and “Hierarchical Interpolation” [6].

NHITS architecture

The figure above illustrates the univariate version of the N-HiTS architecture, which simplifies the equation of the problem posed in the introduction but is sufficient to understand how the model works (see NBEATSx for more information on incorporating exogenous variables into the architecture [8]). Thus, this version only takes as input the vector of past observations, denoted here as $\pmb{y}_{t-L:t}$ with $L$ being the number of lags. We can see that N-HiTS is based on a 3-layer architecture with $S$ stacks, each composed of $B$ blocks. The highest-level layer corresponds to the overview with the arrangement of stacks, the next to the arrangement of blocks within each stack, and the last to the components within each block [6].

1st layer

N-HiTS is composed of $S$ stacks, each of which models different frequencies of the time series. A stack takes as input the residual of the previous stack $\pmb{y}_{t-L:t,s-1}$ (except for the first stack, of course, which takes $\pmb{y}_{t-L:t}$ directly) and returns its forecast $\pmb{\hat{y}}_{t+1:t+H,s}$ and its residual $\pmb{y}_{t-L:t,s}$. The forecasts of each stack are added together to form the final forecast (the model also returns the residual of the last stack, $\pmb{y}_{t-L:t,S}$) [6]:

$$\pmb{\hat{y}}_{t+1:t+H} = \sum_{s=1}^S \pmb{\hat{y}}_{t+1:t+H,s}$$

2nd layer

Similar to stacks (and comparable to boosting models), the $B$ blocks of a stack $s$ are assembled hierarchically, where the input of block $l$ is the residual of block $l-1$. In fact, the backcast output of a block $l$, denoted $\pmb{\tilde{y}}_{t-L:t,l}$, is used to clean the input $\pmb{y}_{t-L:t,l+1}$ of the next block. This allows the blocks to focus their attention on the parts of the series that are not yet understood. The forecasts of each block are added together to form the final forecast of the stack [6]:

$$\pmb{y}_{t-L:t,l+1} = \pmb{y}_{t-L:t,l} - \pmb{\tilde{y}}_{t-L:t,l}$$

$$\pmb{\hat{y}}_{t+1:t+H,s} = \sum_{l=1}^B \pmb{\hat{y}}_{t+1:t+H,l}$$

3rd layer

Each block begins with a max pooling layer with filter size $k_l$, followed by an MLP. The pooling step allows the block to focus on a specific frequency of the input series. A large $k_l$ will filter out high frequencies, forcing the MLP to model a low-frequency series, and conversely a small $k_l$ will force the MLP to model a high-frequency series. This is called “multi-rate signal sampling”. This step has two advantages: first, it allows the model to focus on low-frequency aspects of the time series, which produces more stable and consistent long-term forecasts; second, it reduces the size of the MLP input, thereby decreasing the computational complexity (and thus improving training speed) and the risk of overfitting. With $\pmb{y}_{t-L:t,l}$ as the input vector of block $l$, we have [6]:

$$\pmb{y}^{(p)}_{t-L:t,l} = \mathrm{MaxPool}(\pmb{y}_{t-L:t,l},\, k_l)$$


 ~

Block $l$ will then use the reduced input to estimate the backward interpolation parameters $\theta^b_l$ and forward interpolation parameters $\theta^f_l$, by estimating a hidden vector $h_l$ with an MLP and then projecting it linearly [6]:

$$h_l = \mathrm{MLP}_l(\pmb{y}^{(p)}_{t-L:t,l})$$

$$\theta^f_l = \mathrm{LINEAR}^f(h_l)$$

$$\theta^b_l = \mathrm{LINEAR}^b(h_l)$$


 ~

Finally, the output vectors of block $l$, $\pmb{\tilde{y}}_{t-L:t,l}$ and $\pmb{\hat{y}}_{t+1:t+H,l}$, are computed by “hierarchical interpolation” [6].

Once the coefficients $\theta^b_l$ and $\theta^f_l$ have been computed, it is necessary to restore the initial sampling rate or granularity (modified by the max pooling/multi-rate signal sampling operation) in order to predict all $H$ points in the forecast horizon. To do this, N-HiTS performs “temporal interpolation” using the interpolation function $g$ [6]:

$$\forall \tau \in \{t+1, \dots, t+H\},\ \hat{y}_{\tau,l} = g(\tau,\theta^f_l)$$

$$\forall \tau \in \{t-L, \dots, t\},\ \tilde{y}_{\tau,l} = g(\tau,\theta^b_l)$$

 ~

$g$ can take the form of a nearest-neighbour, piecewise-linear, or cubic interpolation function. The authors of N-HiTS indicate that, on average across all tested datasets, prediction accuracy and computational performance favour the linear method, where $g$ is expressed as [6]:

$$g(\tau,\theta) = \theta[t_1] + \frac{\theta[t_2] - \theta[t_1]}{t_2 - t_1}(\tau - t_1)$$

With:

  • $t_2 = t_1 + 1/r_l$
  • $t_1 = \arg\min_{t \in T,\, t \leq \tau} (\tau - t)$

The time partition $T$ used to compute $t_1$ is defined as $T = \{t + 1,\ t + 1 + 1/r_l,\ \dots,\ t + H - 1/r_l,\ t + H\}$, where $r_l$ is called the “expressivity ratio” and controls the number of parameters per unit of output time: $|\theta^f_l| = \lceil r_l H \rceil$ (the smallest integer greater than or equal to $r_l H$). Using this ratio and temporal interpolation avoids the explosion in memory cost and computational complexity that occurs as the horizon increases, a recurring problem in other architectures such as transformers.

N-HiTS forces the expressivity ratio $r_l$ to synchronize with the size of the filter $k_l$ used in multi-rate signal sampling, and this is where the power of the architecture lies. Considering a top-down approach (which proves to be more effective than a bottom-up approach on the datasets tested in the paper), i.e. choosing increasing $r_l$ and therefore decreasing $k_l$, the blocks learn both to observe and to generate signals of increasingly fine granularity/frequency. This approach, which prioritizes the specialization of each block, and therefore each stack, in a specific frequency, is called “hierarchical interpolation”. By cleverly choosing the $r_l$, we can match the frequency of the blocks with known cycles in the time series, such as months, weeks, days, etc. [6]
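
To make these three layers concrete, here is a minimal PyTorch sketch of a single block (pooling, MLP, $\theta$ projections, linear interpolation). The class name, sizes, and the use of `F.interpolate` for the linear interpolation are my own illustrative assumptions, not the reference N-HiTS implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NHITSBlock(nn.Module):
    """Sketch of one block: multi-rate sampling (max pooling), an MLP,
    linear projections to the interpolation coefficients theta, and linear
    temporal interpolation back to backcast/forecast resolution."""

    def __init__(self, input_size, horizon, pool_size, expressivity_ratio, hidden_size=256):
        super().__init__()
        self.input_size, self.horizon = input_size, horizon
        self.pool = nn.MaxPool1d(kernel_size=pool_size, stride=pool_size, ceil_mode=True)
        pooled_size = math.ceil(input_size / pool_size)
        self.mlp = nn.Sequential(
            nn.Linear(pooled_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        # |theta_f| = ceil(r_l * H): few coefficients per unit of output time
        self.theta_f = nn.Linear(hidden_size, max(math.ceil(expressivity_ratio * horizon), 2))
        self.theta_b = nn.Linear(hidden_size, max(math.ceil(expressivity_ratio * input_size), 2))

    @staticmethod
    def _interpolate(theta, length):
        # linear temporal interpolation of the coefficients up to `length` points
        return F.interpolate(theta.unsqueeze(1), size=length,
                             mode="linear", align_corners=True).squeeze(1)

    def forward(self, y):                                # y: (batch, input_size)
        pooled = self.pool(y.unsqueeze(1)).squeeze(1)    # multi-rate signal sampling
        h = self.mlp(pooled)
        forecast = self._interpolate(self.theta_f(h), self.horizon)
        backcast = self._interpolate(self.theta_b(h), self.input_size)
        return backcast, forecast                        # residual = y - backcast

# Example: a coarse block for a one-week lookback and a one-day horizon.
block = NHITSBlock(input_size=7 * 96, horizon=96, pool_size=16, expressivity_ratio=1 / 24)
backcast, forecast = block(torch.randn(8, 7 * 96))
```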

In short, N-HiTS is a generalization of N-BEATS that offers accurate, interpretable, stable forecasts across all horizons, while minimizing memory costs and training time. N-HiTS greatly enhances the power of MLP-based architectures.

With the recent switch in October 2025 from hourly to 15-minute intervals for European day-ahead markets, thereby multiplying the forecasting horizon by 4, the use of models designed for long-term forecasting such as N-HiTS has undoubtedly become more relevant.
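
In practice, an existing implementation can be used rather than re-coding the architecture. Below is a usage sketch with the neuralforecast library; the lookback, pooling kernels, downsampling factors, and training settings are illustrative assumptions rather than the configuration actually deployed in PowerExpect, and `prices_df` is assumed to hold the `unique_id`, `ds`, and `y` columns the library expects.

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

H = 96       # one day of 15-minute day-ahead prices
L = 7 * 96   # assumed lookback of one week

model = NHITS(
    h=H,
    input_size=L,
    n_pool_kernel_size=[16, 4, 1],  # coarse -> fine multi-rate signal sampling
    n_freq_downsample=[24, 4, 1],   # top-down hierarchical interpolation
    max_steps=1000,
)

nf = NeuralForecast(models=[model], freq="15min")
nf.fit(df=prices_df)          # prices_df: columns unique_id, ds, y
forecast_df = nf.predict()    # 96-step-ahead forecast per unique_id
```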

b. Hyperparameters  

Hyperparameters are fine-tuned using the tree-structured Parzen estimator (TPE) approach [9] implemented in hyperopt (see the “Built With” section), roughly as sketched below.
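
A minimal sketch of the search loop, assuming a hypothetical `train_and_score` helper that trains the model for a given configuration and returns a validation loss; the search space shown is illustrative, not the one actually used.

```python
from hyperopt import Trials, fmin, hp, tpe

# Illustrative search space over a few N-HiTS hyperparameters.
space = {
    "input_size": hp.choice("input_size", [2 * 96, 5 * 96, 7 * 96]),
    "learning_rate": hp.loguniform("learning_rate", -9, -4),
    "max_steps": hp.choice("max_steps", [500, 1000, 2000]),
}

def objective(params: dict) -> float:
    # `train_and_score` is an assumed helper returning e.g. the validation MAE.
    return train_and_score(**params)

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
```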


c. Training  


6. HSL  

a. Architecture  

Hybrid or ensemble models, which combine several models and methods, are a solid choice for EPF/TSF [3]. They are particularly effective at capturing the complex, nonlinear, and volatile dynamics of electricity markets by leveraging the strengths of multiple architectures. The paper presenting the following model provides an example of such a combination applied specifically to EPF [10].

HSL stands for “Hybrid Stacking Lasso” and is a hybrid model combining six base models: XGBoost, CatBoost, LightGBM, LSTM, GRU, and NODE. These base models are combined by stacking, using Lasso regression as the meta-model [10].

Each base model has its strengths: XGBoost is robust on structured data and has a good ability to capture nonlinear interactions. LightGBM is effective on large datasets, enabling fast training and good temporal feature extraction. CatBoost performs well in its handling of categorical variables and its resistance to overfitting. LSTM and GRU are included for their ability to model long-term temporal patterns. NODE, by combining decision trees and neural networks, can capture complex and nonlinear interactions on tabular data [10].

HSL architecture

The forecasts of each base model are concatenated and then normalized to $[0,1]$ using min-max scaling:

$$\forall i \in \{0,\dots,5\},\ \hat{y}^{scaled}_{t+1:t+H,i} = \frac{\hat{y}_{t+1:t+H,i} - \min(\hat{y}_{t+1:t+H,i})}{\max(\hat{y}_{t+1:t+H,i}) - \min(\hat{y}_{t+1:t+H,i})}$$

 ~

Next, the matrix of scaled forecasts is augmented with temporal variables (hours, days, weeks, etc.) to form the meta-feature matrix $X^{meta}_{t+1:t+H}$ used as input for the meta-model. The article introducing HSL does not provide any information on the strategy used to train the meta-model, which leaves us free to test different things while respecting the philosophy of the original paper. I therefore followed some of the recommendations in Jason Brownlee’s cool article on stacking algorithms [11]. For example, he points out that adding other explanatory variables to the meta-model can provide it with additional context on how best to combine the forecasts from the base models, and tests confirmed this. Furthermore, increasing the input size is not a concern when using Lasso regression as the meta-learner, since unnecessary variables will be dropped. The final forecast is written as:

$$\hat{y}_{t+1:t+H} = X^{meta}_{t+1:t+H} \cdot W$$

 ~

With $W$ the vector of meta-model weights (one weight per meta-feature). This vector is found by minimizing the objective function $J$ over the training period $\tau$, where the Lasso term appears:

$$J(W) = L(y_{\tau}, \hat{y}_{\tau}) + \alpha \lVert W \rVert_1$$

With:

  • $L$ = the loss function
  • $\alpha$ = the regularisation parameter
  • $\lVert W \rVert_1 = \sum_i |w_i|$ = the L1 norm applied to the model weights

Lasso/L1 regularization drives weights to zero along directions to which the cost function is not sensitive. It thus tends to zero out parameters that matter little, which makes it possible to handle large input matrices and promotes a minimalist solution.
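
As a rough illustration of the stacking step, here is a sketch with scikit-learn; `base_preds`, `calendar_feats`, and `y_true` are assumed arrays of out-of-sample base forecasts, temporal features, and observed prices, and the column-wise min-max scaling is a simplification of the per-window scaling described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import MinMaxScaler

# base_preds: (n_samples, 6) out-of-sample forecasts of the six base models
# calendar_feats: (n_samples, n_cal) temporal features (hour, day of week, ...)
# y_true: (n_samples,) observed day-ahead prices
scaled_preds = MinMaxScaler().fit_transform(base_preds)
X_meta = np.hstack([scaled_preds, calendar_feats])

# Lasso meta-model: the L1 penalty drops uninformative meta-features,
# so enlarging X_meta with extra context stays cheap.
meta_model = LassoCV(cv=5).fit(X_meta, y_true)
y_hat = meta_model.predict(X_meta)
```

In production, the meta-model would of course be fitted on a held-out period and then applied to new forecast windows. Let’s now describe each base model in detail.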

b. XGBoost  

Gradient Boosting Machine/Tree architecture

XGBoost, which stands for “Extreme Gradient Boosting”, is a machine learning architecture based on the GBMs mentioned in the introduction and developed at the beginning of the century. It is an ensemble model that uses CARTs (“Classification And Regression Trees”) called “weak learners” as base models. CARTs are similar to simple decision trees, except that a score is associated with each of the leaves/nodes, allowing them to handle regression problems with continuous variables.

As a reminder, a decision tree is a non-parametric supervised learning method (there are no underlying assumptions about the distribution of the errors or the data; the model is based solely on the observed data). Decision trees predict the value of a target variable by learning simple decision rules inferred from the data features. They are highly interpretable, as they directly and visually illustrate the decision-making process; this interpretability is lost with most ensemble methods.

XGBoost combines trees through boosting, where the model’s forecast is equal to the sum of each tree’s forecast for each time step [12]:

$$\forall i \in \{t+1, \dots, t+H\},\ \hat{y}_{i} = cst + \epsilon\sum^K_{k=1} f_k(X^{(m)}_{i})$$

With:

  • $K$ = the number of trees.
  • $f_k$ = the structure of the $k$-th regression tree.
  • $X^{(m)}_{i}$ = the vector of the $m$ features at index $i$.
  • $\epsilon$ = the learning rate. The smaller it is, the less each tree contributes, so more trees are needed to fit the residuals, which helps prevent overfitting.
  • $cst$ = the default forecast the model would make without any trees.

The functions $f_1$ to $f_K$ are learnt by boosting, adding one tree after another. Hence [12]:

$$\forall k \in \{1, \dots, K\},\ \forall i \in \{1, \dots, n\},\ \hat{y}^{(k)}_{i} = \hat{y}^{(k-1)}_{i} + \epsilon \, f_k(X^{(m)}_{i})$$

 ~

Thus, the model grows as trees are added, and each tree learns to predict the residual left by the previous trees. The objective function to be minimised in order to train the $k$-th tree is written as [12]:

$$J^{(k)} = \sum_{i=1}^n l(y_i, \hat{y}^{(k)}_i) + \Omega (f_k) = \sum_{i=1}^n l\big(y_i, \hat{y}^{(k-1)}_i + \epsilon \, f_k(X^{(m)}_{i})\big) + \Omega (f_k)$$

With:

  • $l$ = the loss function.
  • $\Omega$ = the regularisation function.

Let $w \in \mathbb{R}^T$ be the vector of weights of the leaves $j \in \{1, \dots, T\}$ (where $T$ is the number of leaves of the $k$-th tree) and $q: \mathbb{R}^m \longrightarrow \{1, \dots, T\}$ the tree structure that maps a feature vector to a leaf $j$. We then have $f_k(X) = w_{q(X)}$ [12].

Furthermore, by considering ridge regularisation plus a term $\gamma T$ to encourage simple trees, we can write:

$$\Omega (f_k) = \gamma T + \frac{1}{2} \lambda \sum^T_{j=1} w_j^2$$

 ~

$\lambda$ reduces the sensitivity of the weights $w_j$ to the observed values, which limits overfitting.

Thus, by approximating $J^{(k)}$ with a second-order Taylor expansion and simplifying (removing the sum of losses, which is a constant, and setting $\epsilon = 1$), we get [12]:

$$\tilde{J}^{(k)} = \sum^T_{j=1} \Big[ w_j \sum_{i \in I_j} g_i^{(k)} + \frac{1}{2} \Big(\sum_{i \in I_j} h_i^{(k)} + \lambda\Big) w_j^2 \Big] + \gamma T$$

With:

  • $I_j = \{i \mid q(X^{(m)}_{i})=j \}$ = the set of indices of the observations whose features fall into the $j$-th leaf.
  • $g_i^{(k)} = \partial_{\hat{y}^{(k-1)}_i} l(y_i, \hat{y}^{(k-1)}_i)$ = the first-order gradient of the $i$-th example at the $k$-th tree or boosting iteration.
  • $h_i^{(k)} = \partial^2_{\hat{y}^{(k-1)}_i} l(y_i, \hat{y}^{(k-1)}_i)$ = the second-order gradient of the $i$-th example at the $k$-th boosting iteration.

For a given structure $q$, the optimal weight of the $j$-th leaf $w_j^*$ is then written as [12]:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i^{(k)}}{\sum_{i \in I_j} h_i^{(k)} + \lambda}$$

And the optimal value $\tilde{J}^{(k)^*}(q)$:

$$\tilde{J}^{(k)^*}(q) = -\frac{1}{2}\sum^T_{j=1}\underbrace{\frac{\big(\sum_{i \in I_j} g_i^{(k)}\big)^2}{\sum_{i \in I_j} h_i^{(k)} + \lambda}}_{\text{cost of the } j\text{-th leaf}} + \gamma T$$

The cost $\tilde{J}^{(k)^*}(q)$ measures the performance of the tree structure $q$ by comparing the regularisation term on the number of leaves with the cost of each leaf. The smaller $\tilde{J}^{(k)^*}(q)$ is, the better the tree structure performs [12].

If we used the mean squared error as the loss function, we would have $g_i^{(k)} = \hat{y}^{(k-1)}_i - y_i$ and $h_i^{(k)} = 1$. So:

$$w_j^* = -\frac{\sum_{i \in I_j} (\hat{y}^{(k-1)}_i - y_i)}{|I_j| + \lambda} = \frac{\text{sum of residuals from the previous trees}}{\text{number of residuals} + \lambda}$$

and:

$$\tilde{J}^{(k)^*}(q) = -\frac{1}{2}\sum^T_{j=1}\underbrace{\frac{\big(\sum_{i \in I_j} (\hat{y}^{(k-1)}_i - y_i)\big)^2}{|I_j| + \lambda}}_{\text{cost of the } j\text{-th leaf}} + \gamma T = -\frac{1}{2}\sum^T_{j=1}\frac{\text{squared sum of residuals from the previous trees}}{\text{number of residuals} + \lambda} + \gamma T$$
 ~

Now, we need to test every possible structure, i.e. every possible split combination (one parent or root node and two child leaves), and keep the one that performs best (the smallest possible $\tilde{J}^{(k)^*}(q)$). This search is costly in terms of complexity, and potentially also in terms of memory, since the cost of each candidate leaf, and therefore its gradients ($O(T \cdot m \cdot (n-1))$), must be computed.

XGBoost’s first approach is to use an exact algorithm (“Exact Greedy Algorithm”) that starts from a leaf and iteratively adds the branches of the best splits. It runs through all the sorted values of the features and accumulates the gradients, which allows it to lower the memory cost. To select the best split, the algorithm will choose the one that maximises the following score [12]:

$$score_{split} = \frac{1}{2}\Big[\text{cost of the left leaf} + \text{cost of the right leaf} - \text{cost of the root}\Big] - \gamma$$

 ~

The higher the score, the more the split lowers the cost of the tree.

Once the structure has been built, XGBoost optimises it by removing splits that have a negative score. This tree pruning is controlled by the regularisation parameter $\gamma$, which reflects a greater or lesser willingness to grow deep trees, i.e. a complex solution that is likely to overfit [12]. Other parameters can be used to manage the complexity of the model, such as max_depth, which limits the maximum depth of the trees.

Usually, XGBoost will use variants of the exact algorithm to find splits in a more optimised way. This is because the exact algorithm is computationally expensive and difficult to scale, as it tests all possible splits. For example, if the input data is too large, XGBoost will use an algorithm (“Approximate Algorithm”) that reduces the number of splits tested to well-chosen percentiles [12].

In summary, XGBoost makes a forecast by summing the forecasts of each tree at each time step. During training, the model is built by adding trees one after the other in a cumulative way. Each tree is built by starting with a root and then adding leaves level by level. Leaf splits are chosen to maximise a score that represents the leaf’s contribution to error reduction. XGBoost uses several regularisation terms to avoid overfitting and limit model complexity, such as $\gamma$, the threshold below which a split is not considered to contribute enough to error reduction and is pruned.

As XGBoost is based on trees in which specific features are chosen for the splits, we can deduce the weight of each feature in the forecast, which makes the model easier to interpret than other methods such as those based on neural networks. In addition to the founding paper [12], I recommend the StatQuest video series for more details on XGBoost.
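
To connect the notation above to practice, here is a hypothetical configuration with the xgboost Python package; the values are illustrative and `X_train`, `y_train`, and `X_future` are assumed to come from the feature pipeline.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,              # K, the number of trees
    learning_rate=0.05,            # epsilon, the shrinkage applied to each tree
    max_depth=6,                   # caps tree depth, limiting model complexity
    gamma=1.0,                     # minimum gain required to keep a split (pruning)
    reg_lambda=1.0,                # lambda, the L2 penalty on leaf weights
    tree_method="hist",            # approximate split finding on binned features
    objective="reg:squarederror",  # squared-error loss, as in the derivation above
)
model.fit(X_train, y_train)
y_pred = model.predict(X_future)
```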

c. LightGBM  

LightGBM, which stands for “Light Gradient Boosting Machine”, offers an optimised implementation of GBM similar to XGBoost (see previous section). This algorithm takes performance optimisation further than XGBoost by introducing two improvements called “Gradient-based One-Side Sampling” (GOSS) and “Exclusive Feature Bundling” (EFB) [13].

The challenge with GBM implementations is always the same: how to optimise split finding while maintaining maximum performance? The natural approach is to reduce the number of candidate splits (by reducing the number of splits to be tested or by subsampling the observed data) and the number of features tested. LightGBM uses the GOSS algorithm, which keeps all data points with significant gradients (above a predefined threshold) and randomly samples those with low gradients. In addition to GOSS, LightGBM introduces the EFB algorithm, which bundles certain features together, thereby reducing the search space for splits [13].

Another important feature of LightGBM is the way it builds its trees. Unlike other implementations such as XGBoost, which do so level by level, LightGBM builds its trees leaf by leaf, choosing (according to a criterion) the best next leaf to process. With an equal number of leaves, this approach tends to yield better results. See the library documentation (built with section) for more information on this subject [13].

LightGBM Tree Growth
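
A hypothetical configuration is sketched below; parameter names follow the LightGBM documentation as I understand it (they vary slightly across versions), and the sampling rates are illustrative.

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,                # leaf-wise growth is bounded by leaves, not depth
    data_sample_strategy="goss",  # GOSS (older versions: boosting_type="goss")
    top_rate=0.2,                 # keep the 20% of samples with the largest gradients
    other_rate=0.1,               # randomly sample 10% of the remaining ones
)
model.fit(X_train, y_train)       # X_train, y_train assumed from the feature pipeline
```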

d. CatBoost  

CatBoost, for “Categorical Boosting”, is, like XGBoost and LightGBM, an implementation of a GBM (see the XGBoost section for notation). CatBoost stands out for its handling of categorical variables and its approach to boosting, designed to avoid target leakage [14].

Categorical Features

A categorical feature consists of a discrete set of distinct values called categories. For example in TSF, we sometimes consider times of day (morning, afternoon, evening, etc.). These categorical variables can provide a substantial gain in information. However, they require a crucial encoding step, i.e. transforming the categories into numerical values that can be understood by the model with minimal loss of information [14].

There are several methods for encoding categorical features. A classic one is “one-hot encoding”, which converts each category into a binary column vector (1 at the indices where the category appears and 0 otherwise). Obviously, this method quickly becomes limited by the number of categories to encode. Another popular method is “target encoding”, which replaces categories with statistics of the target variable. For example, we can use the average of the target values for each category. This simple approach should be avoided, as it leads to data leakage: the test data will contain information from the training data. Several methods can mitigate this problem, such as partitioning the training data [14].

CatBoost uses a target encoding strategy described as more effective in the founding paper. Simply put, CatBoost encodes categories based solely on previously observed values: category encoding is applied sequentially, thus avoiding any target leakage. Since the order of the observations matters, this approach is called “ordered target statistics” or “ordered target encoding”. The numerical value of a category is then calculated as the average of past observations for that category, smoothed by a constant $p$ (the prior), such that the value of category $k$ at index $i$ is written as [14]:

$$x_i^k=\frac{\sum_{j\in I_i^k}y_j + p}{|I_i^k| + 1}$$

 ~

With $I_i^k$ the set of indices preceding index $i$ where $y$ equals 1 for category $k$ (when the target is continuous, statistical intervals/bins are simply used instead).

During inference, the categories are encoded by taking all the previously encoded values.
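
The mechanics of ordered target statistics can be mimicked in a few lines of pandas, as in the toy sketch below; this is only an illustration of the formula above (with a single permutation, the row order, and prior $p$), not CatBoost’s internal code.

```python
import pandas as pd

p = 0.5  # prior
df = pd.DataFrame({"cat": ["a", "b", "a", "a", "b"], "y": [1, 0, 0, 1, 1]})

def ordered_target_encode(y: pd.Series) -> pd.Series:
    """Encode each row using only the rows that precede it in its category."""
    past_sum = y.shift(1).fillna(0).cumsum()               # sum of past targets
    past_count = pd.Series(range(len(y)), index=y.index)   # number of past rows
    return (past_sum + p) / (past_count + 1)

df["cat_encoded"] = df.groupby("cat")["y"].transform(ordered_target_encode)
```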

Ordered Boosting

Just like ordered encoding for categorical variables, ordered boosting is motivated by the presence of target leakage during training. Indeed, the authors of the original CatBoost paper show that, in a simple case where the same dataset is used to train each tree, a bias inversely proportional to the size of the dataset shifts the prediction. This stems from using the same target values both for gradient estimation and for tree training: we compute $g_i^{(k)}$ using the target $y_i$ and the prediction $f_{k-1}(x_i, y_i)$, whereas the tree $f_{k-1}$ was itself built using $y_i$. This target leakage causes a shift in the conditional distribution of the gradients $g_i^{(k)}(x_i,y_i) \mid x_i$ (and therefore of $f_k(x_i,y_i) \mid x_i$) relative to that on the test data, and ultimately leads to a biased prediction and a model that generalises less well. To solve this, CatBoost uses ordered boosting [14].

Impact of ordered boosting on gradient computation

Ordered boosting proposes to solve the leakage issue by using only the previous examples $(x_{0:i-1}, y_{0:i-1})$ to train $f_{k-1}$. Thus, $f_{k-1}(x_i, y_i)$ has never seen the example $(x_i, y_i)$, and $g_i^{(k)}$ is not estimated with a biased $f_{k-1}$. This is the same approach as ordered encoding, processing the data sequentially, row by row [14].

The naive or brute-force implementation of ordered boosting would consist of training $n$ models $M_1, \dots, M_n$ (where $n$ is the number of examples/observations, $i \in \{1, \dots, n\}$), such that model $M_i$ is trained only on the first $i$ observations of a random permutation of the dataset. Model $M_{i-1}$ is then used to compute the gradient $g_i$. However, this naive solution is too complex: training $n$ models would be too costly in time and memory. To reduce the complexity of the implementation while still using ordered boosting, CatBoost proceeds as follows [14].

First, CatBoost generates a set of random permutations of the dataset, on which it performs ordered encoding and then trains each tree. Thus, the gradients $g_i^{(k)}$ and the categorical encodings are calculated using a different history at each boosting step $k$. This still ensures that no model has seen the example for which the gradient is calculated, but reduces the instability caused by the fact that an example at the beginning of the permutation would have very little preceding data [14].

Next, CatBoost trains its weak learners, which are symmetric or oblivious decision trees, i.e. trees in which all nodes at a given level share the same split criterion. These trees are less prone to overfitting and significantly speed up execution at test time. The trees are trained sequentially, where the gradient $g_i$ is calculated using the prediction $M_{r,j}(i)$ of the $i$-th example by the model trained on the first $j$ examples of permutation $r$. In practice, instead of keeping all the predictions $M_{r,j}(i)$, CatBoost only keeps intermediate predictions at geometric intervals (at indices $2^k$), which is possible thanks to the symmetric structure of the trees. Thus, if there are $s$ permutations, the complexity of updating the models $M_{r,j}$ and computing the gradients drops from $O(s n^2)$ to $O(s n)$ [14].
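
In code, both mechanisms are enabled through the catboost package roughly as follows; the hyperparameter values and column names are illustrative, and `X_train` is assumed to be a DataFrame containing the listed categorical columns.

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    boosting_type="Ordered",   # ordered boosting (vs. "Plain")
    verbose=False,
)
# Ordered target statistics are applied to the declared categorical columns.
model.fit(X_train, y_train, cat_features=["hour_of_day", "day_of_week"])
```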

e. NODE  


f. LSTM  


g. GRU  


h. Hyperparameters  


i. Training  


7. Confidence Bands  


8. SHAP  


9. Evaluation  


10. Roadmap  


11. Built With  

Thanks to all the developers who made these cool open source tools <3

Python Backend


Vue.js Frontend


12. Sources  

[1] Chai, S., Li, Q., Abedin, M. Z., & Lucey, B. M. (2024). Forecasting electricity prices from the state-of-the-art modeling technology and the price determinant perspectives. Research in International Business and Finance, 67, 102132. 

[2] Kim, J., Kim, H., Kim, H., Lee, D., & Yoon, S. (2025). A comprehensive survey of deep learning for time series forecasting: architectural diversity and open challenges. Artificial Intelligence Review, 58(7), 1-95. 

[3] O’Connor, C., Bahloul, M., Prestwich, S., & Visentin, A. (2025). A Review of Electricity Price Forecasting Models in the Day-Ahead, Intra-Day, and Balancing Markets. Energies, 18(12), 3097. 

[4] EPEX SPOT. (2025). Basics of the power market. 

[5] Geissmann, T., & Obrist, A. (2018). Fundamental Price Drivers on Continental European Day-Ahead Power Markets. Available at SSRN 3211339. 

[6] Challu, C., Olivares, K. G., Oreshkin, B. N., Ramirez, F. G., Canseco, M. M., & Dubrawski, A. (2023, June). Nhits: Neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI conference on artificial intelligence (Vol. 37, No. 6, pp. 6989-6997). 

[7] Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437. 

[8] Olivares, K. G., Challu, C., Marcjasz, G., Weron, R., & Dubrawski, A. (2023). Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx. International Journal of Forecasting, 39(2), 884-900. 

[9] Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24. 

[10] Chen, J., Xiao, J., & Xu, W. (2024, August). A hybrid stacking method for short-term price forecasting in electricity trading market. In 2024 8th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE) (pp. 1-5). IEEE. 

[11] Brownlee, J. (2021). Stacking Ensemble Machine Learning With Python. 

[12] Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). 

[13] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., … & Liu, T. Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30. 

[14] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, 31. 
