Exploring regression models to enable monitoring capability of local energy communities for self‐management in low‐voltage distribution networks


equipment to resolve the problems, ensuring safe operation of the network. This study focusses on the congestions of the MV/ LV distribution transformers.
However, several burdens prevent the DSOs from adequately monitoring and managing the congestions in their LV distribution networks. For congestion monitoring and management, the DSOs have to install advanced sensors at the MV/LV power substations [6], develop the communication infrastructure between their systems and customers' premises, and build data management systems to handle the gathered data [7]. These developments are costly and cannot be easily implemented, given the sheer scale of LV networks. The recent emergence of local energy communities (LECs), along with the increasing availability of residential smart meters (SMs), offers a solution to congestion monitoring and management in LV networks. The LECs intend to actively participate in RES projects in their regions [8], which can improve the controllability and energy efficiency of the local networks [9]. While the LECs can access customers' SMs, they generally do not have information on the network topology. Therefore, by cooperating actively with the DSOs, the LECs can provide essential operational support to the utility grids [9]. In contrast, if the collection and management of all customers' SM data, as well as the control of customers' RES and consumption, were performed by the DSOs, this would impose an additional burden on their work, especially considering the high volume of SMs and the wide geographical dispersal of RES [10]. This, in turn, can negatively affect congestion management as well as network operation. In this regard, data-driven approaches (i.e. leveraging the residential SM data collected by the LECs) are considered promising for supporting network operation while preserving customers' rights.
While recognising the added value of SM data from residential houses for LV network monitoring, most previous works focussed on physics-based approaches, which limits their applicability in diversified, complex LV networks [11][12][13][14][15][16][17]. In [11], a field demonstrator with residential SMs for LV network monitoring is presented; dedicated monitoring systems, however, are still required at the substation's secondary side. An approach based on a synchronous SM dataset is investigated in [12] to monitor LV distribution networks. In [13], the aggregated SM readings from LV end users are utilised to calculate transformer load profiles. Similarly, based on the aggregated SM data, a state estimation method for LV systems is described in [14], while a state estimation method for MV systems is discussed in [15]. Distribution system state estimation enables the DSOs to monitor their grid variables, that is voltages, currents, and powers, allowing network congestion monitoring. The authors in [16] propose an approach to process SM data in LV grids, including phase identification, data aggregation and uncertainty evaluation, to determine nodal powers in MV grids, that is LV transformer loading. However, the practical application of the methods discussed in [11][12][13][14][15][16] to monitor LV transformer loading faces some significant impediments. For instance, not all customers are equipped with SMs [16], so summing the SM readings of the connected customers yields an inaccurate estimate of the distribution transformer loading. Besides, the volume of aggregated SM data grows rapidly with the increasing number of customers, which likely creates difficulties in collecting and processing the data [18]. In this regard, a method that can monitor the congestions using a reduced amount of SM data is desirable.
The works in [17] propose methods to select representative meters for voltage monitoring in LV networks, which can be extended to realise congestion monitoring. However, a common issue with the applications discussed in [17] is that active and reactive power data from residential SMs are used. Since consumption data contains highly private information about households, their privacy and security are at risk of violation [19]. Such privacy concerns have caused customers in certain countries to raise objections against the use of SM readings. In contrast, the voltage measurements taken by residential SMs within the LEC context raise no comparable privacy concerns, as they represent the network state. Our recent study in [20] shows that there exists a strong correlation between the nodal voltage magnitudes, as measured by SMs, and the active power flows through distribution transformers. A limited (critical) set of these measurements, leveraged by advanced machine learning (ML) techniques, can be utilised to estimate transformer loading and then facilitate congestion management, for example, by controlling RES.
Among the ML techniques for power systems, regression models are commonly used because they are easy to implement and interpret [19]. Regression models mathematically formulate the correlation between one output variable and one or multiple input variables, allowing the output variable to be estimated from the inputs. Regression models are widely employed for load forecasting [19,21,22], RES generation forecasting [23,24], power system state estimation [25,26], and supporting network operation [27,28]. However, to the authors' best knowledge, there have been no studies on regression models for transformer congestion monitoring that take into account the measurements from end-users' SMs.
Based on the discussion presented above, this article proposes a data-driven approach implemented by the LECs to monitor transformer congestions in LV distribution networks. The proposed approach involves adopting different regression models and leveraging data measured from residential SMs. Four ML algorithms for realising regression models, namely (a) Ridge Regression (RR), (b) Support Vector Regression (SVR), (c) Random Forest Regression (RFR), and (d) eXtreme Gradient Boosting Regression (XGBR), are compared to select the best one using a cross-validation (CV) approach. The motivation for adopting these ML algorithms is elaborated in the next sections. The main contributions of this article can be summarised as follows:
• A thorough analysis based on summary statistics and graphical visualisation is implemented to discover the relationships between the transformer loading and the voltage magnitudes measured at the points of connection (POC) of the houses.
• A regression-based method is introduced, in combination with a limited series of residential SM data, for transformer loading estimation. To this end, a data-driven monitoring framework is provided to implement and validate the models. A comprehensive comparison of various ML algorithms is also conducted to select the best-performing regression model.
The rest of the article is organised as follows: Section 2 presents the problem definition. Section 3 elaborates on the data pre-processing procedure. Next, Section 4 describes the procedure of training and evaluating various learning algorithms for calculating transformer loading. Afterwards, the numerical results and discussion presenting the proposed method's efficacy are given in Section 5. Finally, concluding remarks are drawn in Section 6.

| PROBLEM DEFINITION
To provide the background for the study, this section discusses the monitoring of transformer congestion followed by the application of residential SM data. The data-driven monitoring framework for this study is then presented.

| Transformer congestion monitoring
This study aims to estimate the distribution transformer's loading, thereby enabling the monitoring of transformer congestions. The loading of the transformer is defined as the instantaneous apparent power of the transformer, S_trans, which is due to either household power consumption or reverse power flows from residential RES units [3]. This work introduces a procedure to monitor the congestions stemming from both cases. Although the transformer can withstand loads beyond a specified limit, prolonged overloading impairs the insulation of the transformer windings [29], subsequently degrading the transformer's lifetime. Congestion monitoring, therefore, is essential for a coordinated control to be activated to restrict residential consumption or curtail RES generation.
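The loading definition above can be captured in a few lines. The following is a minimal sketch; the function names are illustrative, and the 250 kVA rating used as the default threshold matches the test network transformer described later in the article:

```python
import math

def transformer_loading_kva(p_kw, q_kvar):
    """Instantaneous apparent power S_trans = sqrt(P^2 + Q^2), in kVA.

    The sign of P does not matter: household consumption and reverse
    PV flows both contribute to the loading magnitude.
    """
    return math.hypot(p_kw, q_kvar)

def is_congested(s_kva, rated_kva=250.0):
    """Flag loading beyond the transformer rating (illustrative limit)."""
    return s_kva > rated_kva

# Example: heavy consumption on the transformer.
s = transformer_loading_kva(240.0, 80.0)   # sqrt(240^2 + 80^2) ≈ 253 kVA
congested = is_congested(s)                # 253 kVA > 250 kVA
```

In practice the congestion threshold and any permissible short-term overload margin would come from the transformer's thermal rating, not a fixed constant.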

| Residential smart meter data
Beyond the consumption measurement for billing, SMs can record voltage, current, and power quality information [30]. The information can be recorded at preset time intervals (typically 15, 30 or 60 min) and with high accuracy. These advanced features have allowed residential SMs to be employed for management and planning in distribution networks, such as detecting energy theft, forecasting load, and verifying network topology. A detailed analysis of SM applications in distribution networks can be found in [31,32]. For determining the transformer loading discussed in this work, the voltage magnitude measured by residential SMs is used as an input variable. For the sake of simplicity, residential SMs are termed SMs in the rest of this article.

| Data-driven monitoring framework
A framework is introduced in this study to realise the data-driven approach for transformer congestion monitoring, which is schematically presented in Figure 1. This framework includes two main building blocks as follows:
• Model preparation: a comprehensive procedure for transformer loading estimation is executed, involving data pre-processing, and ML model fitting and evaluation. The main outputs of the procedure are a reduced set of SMs identified as input data sources and the best-performing ML algorithm for the regression model. Individual steps in the procedure are elaborated in Sections 3 and 4.
• Model deployment: data from the selected SMs is collected and fed into the tuned, trained model given by the model preparation stage to estimate the transformer loading. Transformer congestion, subsequently, can be identified. The model performance is tested, with the results discussed in Section 5.

| DATA PRE-PROCESSING PROCEDURE
This section explains the procedure adopted to pre-process data in the framework for calculating transformer loading depicted in Figure 1. Implementation of the procedure for a specific test case is also described.

| Training data generation
A representative training dataset plays a decisive role in obtaining promising results with any data-driven technique. In this article, to prepare a reasonable training dataset, stochastic samples are generated by solving a substantial number of power flows and logging the corresponding results. For a given LV network and PV penetration level, the variations in household load consumption and PV production need to be considered.

| Simulation approach
A simulation approach to generate the stochastic training dataset is developed, based on the method in [33]. The main stages are briefly described in the following:
• Firstly, it is assumed that PV units have the same phase connection as the houses.
• Secondly, one-year time-series power flows with size |P| = 35,040 are solved iteratively. A data sample (X, Y) is recorded at each execution. Here, X is a set of input variables, regarded as the SM data, including voltage magnitudes and net active and reactive power measured at the POC of all houses, that is V_i, P_i^net, and Q_i^net with ∀i ∈ N, respectively. Y is the output variable, that is the transformer loading, S_trans.
• Thirdly, the obtained dataset is split into two parts: 80% of the dataset is used to simulate the historical data for model preparation, and the remaining dataset simulates future data for model deployment. These sets are then stored for later processing steps.
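The three stages above can be sketched as follows. Random placeholders stand in for the power-flow results, since the actual samples come from iterative power-flow solutions of the test network; all names, shapes, and distributions here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the one-year, 15-min power-flow results:
# |P| = 35,040 samples, each with voltage magnitudes at N houses (X)
# and the corresponding transformer loading (Y).
n_samples, n_houses = 35_040, 21
X = rng.normal(1.0, 0.01, size=(n_samples, n_houses))   # p.u. voltages
Y = rng.uniform(0.0, 250.0, size=n_samples)             # kVA loading

# Chronological 80/20 split: the first 80% simulates historical data
# for model preparation, the rest simulates future data for deployment.
split = int(0.8 * n_samples)
X_hist, Y_hist = X[:split], Y[:split]
X_future, Y_future = X[split:], Y[split:]
```

A chronological (rather than shuffled) split is used in the sketch so that the "future" portion genuinely follows the "historical" portion in time.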

| Test network description
The test network is based on the IEEE European LV distribution network [34]. This three-phase, radial LV network is energised from a 250 kVA, 11/0.4 kV distribution transformer with a secondary-side voltage setting of 1.03 p.u. The remaining features are retained as in the original IEEE European LV distribution network. A geographical single-line diagram of the IEEE European LV test network is displayed in Figure 2. The test network is an unbalanced grid supplying 55 residential users with single-phase connections: 21 users on phase A (i.e. N_A = 21), 19 users on phase B, and 15 users on phase C. Every household is assumed to own a PV system. The rated capacities of the PV systems in the test network are randomly chosen between 4.06 and 6.27 kWp [33], derived from information on real residential PV installations in the Netherlands. The test network is assumed to have a LEC that can access the residential SM data and control RES at the customer premises. For demonstration purposes, an overview of the household load characteristics and installed PV capacities for phase A is shown in Table 1. Furthermore, in this study it is assumed that the topology of the test network remains constant.

| Load consumption and photovoltaic generation profiles
The household load consumption used in the training data generation is modelled based on real one-year, 15-min historical SM measurements of active power from residential consumers in the Netherlands. To consider reactive power consumption, all loads are assumed to have a power factor of 0.97 (inductive). The synthetic PV generation profiles are built on a PV model, which utilises meteorological data and the specified PV installations to calculate the active power output as in Equation (1) [35], where P_PV is the active power output (kW), I_tr is the irradiance incident transmitted to the PV module (W/m²), I_r is the reference irradiance at test conditions, that is 1000 W/m², P_0 is the rated power of the PV unit (kW), γ is the temperature coefficient provided by the PV panel manufacturers (%/°C), T_c is the actual cell temperature (°C), and T_r is the reference temperature at test conditions, that is 25°C. In this article, the value of γ is selected to be 0.35 %/°C, corresponding to a premium type of PV module [35]. Moreover, a set of real historical meteorological data of I_tr and T_c with a 15-min resolution, provided by the Royal Netherlands Meteorological Institute, is adopted [36]. Because of the short geographical distances between the houses within the test network, the exposure to irradiance and ambient temperature is assumed to be similar for all PV systems. However, to account for the modest variations of the meteorological data and other uncertainties, the computed active power output for each individual PV unit is multiplied by a random sample drawn from a uniform distribution between 0.95 and 1.05 [37].
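A common form of the irradiance/temperature PV model matching the symbols listed above is sketched below. The exact sign convention of Equation (1) in [35] may differ, so this is a hedged illustration rather than the paper's exact model; the function name and defaults are assumptions:

```python
import numpy as np

def pv_power_kw(p0_kw, i_tr, t_c, gamma_pct=0.35, i_r=1000.0, t_r=25.0):
    """Active power output of a PV unit (kW).

    One common irradiance/temperature PV model:
        P_PV = P_0 * (I_tr / I_r) * (1 - gamma/100 * (T_c - T_r)),
    with gamma given as a positive magnitude in %/degC, so output
    drops as cell temperature rises above the 25 degC reference.
    """
    return p0_kw * (i_tr / i_r) * (1.0 - gamma_pct / 100.0 * (t_c - t_r))

# At reference test conditions the unit delivers its rated power.
p_ref = pv_power_kw(p0_kw=5.0, i_tr=1000.0, t_c=25.0)

# Per-unit random factor in [0.95, 1.05] reflecting the modest
# meteorological variation between nearby houses [37].
rng = np.random.default_rng(0)
p_actual = p_ref * rng.uniform(0.95, 1.05)
```

At 1000 W/m² and 25°C the model returns exactly P_0; halving the irradiance halves the output, and each degree above 25°C removes 0.35% of it.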
The proposed process for training data generation has been implemented in Python programming language and OpenDSS software. Simulations using a computer (3.4 GHz Intel CPU and 28 GB RAM) were executed in around 27 min in total to solve a large number of power flows and process the input and output data.
Having the generated training data, the remaining steps in the methodological framework are implemented using Python. It should be noted that because the test network is an unbalanced three-phase LV grid with all single-phase connected loads, data-driven monitoring is developed for each phase of the transformer. For clarity, only the monitoring for phase A of the transformer is presented for illustration in the rest of the article. The proposed method, nevertheless, can be readily applied to the other phases.

| Exploratory data analysis
Exploratory Data Analysis (EDA) is a crucial step in the deployment of any data-driven technique. EDA is a process of initial investigation of the data from various angles to discover interesting features [38]. The discovered features relate to existing patterns, relationships among input variables, relationships between input and output variables, and anomalies [39]. The insights gained from the data assist in checking initial assumptions and in the preliminary selection of applicable models [39].
In this study, the EDA process is performed on the training data obtained from Section 3.1 utilising two methods: summary statistics and graphical visualisation. A pair plot is computed for the transformer loading on phase A (S_trans^A) and the voltage magnitudes at the POC of all houses connected to the same phase (V_i with ∀i ∈ N_A) in the dataset. A pair plot comprises histogram plots along the diagonal, Pearson correlation coefficients below the diagonal, and scatter plots above it [21], which together constitute a matrix of relationships between the variables. The resulting Pearson correlation coefficients illustrate the extent of linear dependence either between S_trans^A and V_i with ∀i ∈ N_A or among V_i with ∀i ∈ N_A, calculated utilising Equation (2):

R = E[(v − μ_v)(u − μ_u)] / (σ_v σ_u),   (2)

where R denotes the Pearson correlation coefficient between two variables v and u, E represents the expectation, σ_v and σ_u are the standard deviations, and μ_v and μ_u are the means. As can be seen from the subplots below the diagonal in Figure 3 in this section and in Figures A1 and A2, the resulting coefficients R between S_trans^A and V_i with ∀i ∈ N_A have significant positive values above 0.9, indicating strong correlation between them. Increasing voltage magnitudes thus correlate with higher transformer loading, so the voltage levels can serve as explanatory variables for calculating the transformer loading (i.e. the response variable). The relationship between S_trans^A and V_i with ∀i ∈ N_A can also be validated by the computed scatter plots in Figure 3 in this section and in Figures A1 and A2. (Figure 3 shows a pair plot for the transformer phase A loading S_trans^A and the voltage magnitudes V at the POC of houses H3, H1, H31, H29, and H48 connected to the same phase, with histograms along the diagonal, Pearson correlation coefficients below it, and scatter plots above it.)
In these figures, pairs of numerical data are plotted to visualise the relationships between the variables. Interestingly, the scatter plots indicate clear V-shaped patterns, highlighting the potential for applying regression models to compute the power flow through the transformer from the nodal voltage magnitudes. Thus, the statistical and graphical analyses show that combining voltage measurements from SMs within a particular LEC with regression models provides a possible solution to the transformer loading estimation.
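The Pearson coefficient underpinning this analysis, R = E[(v − μ_v)(u − μ_u)] / (σ_v σ_u), can be computed directly. A minimal numpy sketch on synthetic data (illustrative only, not the paper's measurements):

```python
import numpy as np

def pearson_r(v, u):
    """Pearson correlation coefficient between two series:
    R = E[(v - mu_v)(u - mu_u)] / (sigma_v * sigma_u)."""
    v = np.asarray(v, dtype=float)
    u = np.asarray(u, dtype=float)
    return np.mean((v - v.mean()) * (u - u.mean())) / (v.std() * u.std())

# A perfectly linear pair gives R = 1.
v = np.array([1.0, 2.0, 3.0, 4.0])
assert abs(pearson_r(v, 2 * v + 1) - 1.0) < 1e-12
```

In a full pair-plot workflow the same coefficient would be evaluated for every variable pair (e.g. with `numpy.corrcoef` or a plotting library such as seaborn).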
It is important to mention that, although net (active and reactive) power measurements are also available from household SMs, using such measurements to estimate transformer loading will violate the privacy rights of customers. Hence, active and reactive power data from SMs are deemed unsuitable input candidates for data-driven monitoring of transformer loading.

| Feature selection
Since the dataset contains a high number of features (variables), feature selection (variable elimination) is important to select a subset of relevant features directly from the original set [40]. The reason for this is that some features are irrelevant, redundant, or noisy for model construction, which likely degrades the performance of the model [41]. Additionally, high-dimensional data results in more computational burden and increased execution time for model training, thus reducing computational efficiency [42]. Feature selection aims to reduce the dimensionality of the data, subsequently decreasing computational requirements, accelerating the learning process, and enhancing the model performance [41,42].
Feature selection methods can be categorised as supervised, semi-supervised and unsupervised according to the label information [40]. Solving regression problems generally involves applying supervised feature selection that measures the importance of an individual feature via its correlation with the regression output [43]. On the other hand, methods of feature selection can be broadly classified into three groups: filter, wrapper and embedded according to the searching strategy [40]. Because it has the merits of both filter and wrapper methods, the embedded method is preferable to realise feature selection [43]. The embedded method leverages the inherent structure of a learning algorithm to incorporate feature selection into the learning process [43]. Among available models for embedded method-based feature selection, regularisation models, for example least absolute shrinkage and selection operator (LASSO), are most widely utilised [43].
In this article, the LASSO approach is adopted to perform supervised feature selection on the training dataset. This choice prevents bias when comparing the ML algorithms used for the regression problem (as discussed in the next steps), as the LASSO model is not in the list of ML algorithms to be compared. The importance value of each individual feature (i.e. V_i with ∀i ∈ N_A) to the regression output (i.e. S_trans^A), calculated on the training dataset using LASSO, is illustrated in Figure 4. It is observed that the houses are ranked in order of importance, with their values varying largely. Several houses have small or even zero importance values, while others have high values. This follows from the working principle of LASSO, which minimises the fitting errors whilst shrinking the magnitude of the feature coefficients [43]. An importance threshold ω is then required to define the relevant features as those whose importance values exceed the threshold. Therefore, with ω = 2.5, the voltage levels at the POC of five houses, namely H3, H1, H31, H29, and H48, are selected from a total of 21 houses on phase A to be the explanatory variables of the regression model. Choosing this value of ω for feature selection entails no loss of generality because the main focus of this study is to compare the relative performance of various ML algorithms. The feature selection based on LASSO is conducted as a separate task before the training process; thus, the same subset of features is selected for every algorithm fitting. Using the same feature subset (i.e. the same inputs) for all algorithms, along with tuning each algorithm's hyperparameters and comparing the generalisation and stability of their performance (as discussed in the next section), facilitates a fair comparison of the algorithms. In addition, selecting the features based on this ω value is adequate to exemplify the model fitting and validation procedure.
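The LASSO-based thresholding described above can be sketched with Scikit-learn (which the study also uses). The synthetic data, the `alpha` value, and which features end up relevant are all illustrative assumptions, not the paper's results:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Synthetic stand-in: 21 "voltage" features, of which only a few
# actually drive the target, mimicking how LASSO shrinks
# uninformative coefficients to (near) zero.
n, d = 500, 21
X = rng.normal(size=(n, d))
true_coef = np.zeros(d)
true_coef[[0, 2, 7]] = [5.0, 3.0, 4.0]
y = X @ true_coef + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
importance = np.abs(lasso.coef_)

# Keep features whose importance exceeds the threshold omega,
# analogous to omega = 2.5 in the paper.
omega = 2.5
selected = np.flatnonzero(importance > omega)
```

On this synthetic data the three informative columns survive the threshold while the 18 noise columns are shrunk away, which mirrors the behaviour seen in Figure 4.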
Hence, the dataset corresponding to the selected features can be formed from the training set for constructing the regression model in a later stage.

| MACHINE LEARNING MODEL FITTING AND EVALUATION PROCEDURE
Given the provided input data, this section focusses on the procedure of training and evaluating several ML algorithms for calculating transformer loading, as depicted in Figure 1. (Figure 4: feature importance and selection with the least absolute shrinkage and selection operator approach. H_i with ∀i ∈ N_A denotes the house number, as listed in Table 1; the vertical red dashed line represents the importance threshold ω = 2.5. POC, point of connection.) It is important to note that properly evaluating and comparing different ML algorithms necessitates addressing two matters. Firstly, the hyperparameters of each algorithm should be tuned before the evaluation stage, as their values highly impact the algorithm's performance. Secondly, the comparison of various algorithms should be based on the generalisation [7] and the stability of their estimated performance [44]. In this section, these two essential concerns are thoroughly addressed using nested CV, aiming to facilitate a fair comparison between the studied ML algorithms. The hyperparameter tuning task is presented in Section 4.2, and the algorithm comparison task is elaborated in Section 4.3.

| Machine learning algorithm list preparation
As there are several different ML algorithms for the regression problem, this section prepares a list of ML algorithms selected to compare their performance for transformer loading estimation. Four ML algorithms are added to the list as follows:
• Ridge Regression: this regression model can be formulated in matrix form as [45]

β̂(α) = (X^T X + α I)^(-1) X^T Y,   (3)

where β̂(α) is the RR estimator, α is the penalty parameter, X and Y are the explanatory and response variables, respectively, and I is the identity matrix.
• Support Vector Regression: SVR attempts to find the coefficients that minimise the function given by [46]

(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (λ_i − λ_i*)(λ_j − λ_j*) κ(x_i, x_j) + ε Σ_{i=1}^{N} (λ_i + λ_i*) − Σ_{i=1}^{N} y_i (λ_i − λ_i*),   (4)

subject to

Σ_{i=1}^{N} (λ_i − λ_i*) = 0 and 0 ≤ λ_i, λ_i* ≤ C, ∀i,   (5)

while the function approximation for SVR is written as

f̂(x) = Σ_{i=1}^{N} (λ_i − λ_i*) κ(x_i, x) + b,   (6)

where ε denotes the epsilon parameter, that is the tolerance for error, N denotes the size of the data, λ_i, λ_i*, λ_j, λ_j* denote the Lagrange multipliers with non-negative values, κ(·,·) denotes the kernel function, C denotes the regularisation strength, b denotes the bias term, and x and y denote the explanatory and response variables, respectively.
• Random Forest Regression: as a tree-based ensemble algorithm, the function approximation of RFR (i.e. a collection of T_r randomised regression trees) for a training sample D_N = (X, Y) can be formulated as [47]

f̂(x) = (1/T_r) Σ_{i=1}^{T_r} f̂_N(x; Θ_i),   (7)

where T_r is the number of randomised regression trees in the forest, N is the size of the data, Θ_1, …, Θ_{T_r} are random variables (i.e. independent of the sample D_N), f̂_N(x; Θ_i) denotes the predicted value at point x for the single i-th tree (i.e. constructed using the bootstrapped samples [48]), and X and Y are the explanatory and response variables, respectively.
• eXtreme Gradient Boosting Regression: XGBR is an ensemble tree algorithm that aims to minimise the objective function given by [49]

L^(t) = Σ_{i=1}^{N} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t),   (8)

where N is the size of the data, ŷ_i^(t−1) denotes the prediction output at the (t − 1)-th (i.e. previous) iteration for the i-th instance, y_i denotes the observation, f_t denotes the independent tree structure at the t-th iteration, l denotes the loss function measuring the difference between y_i, ŷ_i^(t−1), and f_t, and Ω(f_t) denotes the penalty term for the model complexity. The term Ω(f) can be calculated as

Ω(f) = γ L + (1/2) α ‖ω‖²,   (9)

where L is the number of leaves in the tree, ω is the vector of leaf scores, and γ and α are regularisation coefficients [50].

The motivation behind the preliminary use of these four algorithms is their capability to handle non-linear regression problems and continuous variables (SVR [22] and RFR [51]). Besides, the capability to analyse data suffering from multicollinearity, which refers to linear relationships among the explanatory variables, is considered (RFR [51] and RR [52]). This follows from the fact that non-linearity and multicollinearity exist in the obtained training data, as can be seen in Figures 3, A1 and A2. Additionally, an algorithm with state-of-the-art advantages in ML research that is increasingly used in power engineering is included (XGBR [27]). The simulation set-up of the listed ML algorithms for the next steps is implemented using the Scikit-learn package in Python [53], following the principles of the ML algorithms represented by Equations (3)-(9). This set-up process is discussed in the next section.
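Of the four algorithms, RR has a closed-form solution, so Equation (3) can be verified in a few lines of numpy. The toy data below is illustrative only:

```python
import numpy as np

def ridge_estimator(X, Y, alpha):
    """Closed-form Ridge Regression estimator:
    beta_hat(alpha) = (X^T X + alpha * I)^(-1) X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# With alpha -> 0 the estimator reduces to ordinary least squares,
# so on noiseless data it recovers the generating coefficients.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = X @ np.array([2.0, -1.0])
beta = ridge_estimator(X, Y, alpha=1e-12)
```

Using `np.linalg.solve` instead of explicitly inverting X^T X + αI is the numerically preferred way to evaluate Equation (3).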

| Model evaluation and selection
Model evaluation aims to estimate how well the performance of the specified model generalises to unseen data, the so-called generalisation error [44]. To this end, model fitting and evaluation must be executed on independent datasets, that is, training and validation data. Otherwise, the performance estimation can be biased due to overfitting [44].
Model selection refers to the process of tuning the hyperparameters of a specific ML algorithm to the settings corresponding to the best-performing model [54]. An ML algorithm is generally designed with several hyperparameters, that is, tuning parameters with adjustable values that have decisive effects on the performance of the algorithm [44]. Given a specific ML approach, applying different hyperparameters will lead to different models. Thus, it is desirable to select the model with the best estimation accuracy by tuning the hyperparameters to their optimal values [44].
To perform model evaluation and selection (as well as algorithm evaluation and selection), nested CV is well suited as it demonstrates several advantages [44]. This advanced method is a nesting of two k-fold CV loops, the inner and the outer CV loop, which form a k × k set-up. The k-fold CV of the outer loop splits the entire historical dataset into k random, non-overlapping subsets: one subset serves as the testing data for the algorithm selection and the remaining (k − 1) subsets are merged to form the training data used in the inner loop [55]. The inner loop, based on k-fold CV, implements model evaluation and selection by crossing over the training and validation steps in an iterative manner, that is, over the original training data derived from the outer loop k times. For model evaluation purposes, the k-fold CV of the inner loop further splits the provided training data into k random, independent subsets. The k subsets are then interchangeably used so that one subset serves as the validation data and the remaining (k − 1) subsets are merged and serve as the training data [55]. This split prevents the leaking of test data during the training process, thus avoiding overfitting and reducing the bias of the models' performance estimates [54]. For model selection purposes, the inner loop runs a specific ML algorithm with each fixed hyperparameter configuration over the training data k times in a loop. Each iteration of the loop uses a distinct subset of training and validation data; therefore, when the loop finishes, each hyperparameter configuration is associated with k different models and k different performance estimates [44]. Later, the multiple resulting estimates of every model are averaged to produce the ultimate generalisation error and ranked against each other to select the best model, that is, the one with the lowest error.
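The nested CV logic described above can be sketched end to end. In this minimal sketch, a closed-form ridge fit stands in for the ML algorithm, a 5 × 5 fold set-up stands in for the paper's 10 × 10, and all data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)

def kfold_indices(n, k, rng):
    """Split n sample indices into k random, non-overlapping folds."""
    return np.array_split(rng.permutation(n), k)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fit_ridge(X, Y, alpha):
    # Toy stand-in for "fit the model with hyperparameter h".
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Synthetic data; in the paper X/Y are SM voltages / transformer loading.
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

alphas = [0.01, 0.1, 1.0, 10.0]              # candidate hyperparameters H
outer = kfold_indices(len(Y), k=5, rng=rng)  # outer CV folds

outer_scores = []
for i, test_idx in enumerate(outer):
    train_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])
    X_tr, Y_tr = X[train_idx], Y[train_idx]

    # Inner loop: pick the alpha with the lowest mean validation RMSE.
    inner = kfold_indices(len(Y_tr), k=5, rng=rng)
    def inner_score(a):
        errs = []
        for j, val_idx in enumerate(inner):
            fit_idx = np.concatenate(
                [f for m, f in enumerate(inner) if m != j])
            beta = fit_ridge(X_tr[fit_idx], Y_tr[fit_idx], a)
            errs.append(rmse(Y_tr[val_idx], X_tr[val_idx] @ beta))
        return float(np.mean(errs))
    best_alpha = min(alphas, key=inner_score)

    # Outer loop: refit with the tuned alpha, score on the held-out fold.
    beta = fit_ridge(X_tr, Y_tr, best_alpha)
    outer_scores.append(rmse(Y[test_idx], X[test_idx] @ beta))

generalisation_error = float(np.mean(outer_scores))
```

Note how the outer test fold never participates in hyperparameter selection, which is exactly the leakage-prevention property the paragraph above describes.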
This loop is essential because the algorithm cannot learn the hyperparameter settings by itself during the fitting process, meaning these settings need to be specified a priori and then evaluated separately [44]. More details on the nested CV method can be found in [44,54].
This study involves adopting nested CV with a 10 × 10 set-up for the model evaluation and selection for the ML algorithm list (prepared in Section 4.1). The k-fold CV outer loop with k = 10 partitions the whole dataset into 10 parts utilised for the inner loop. For each algorithm, a range of settings is initially specified as the bounded domain of hyperparameters. Then, from that domain, a set H, with size |H| = 15, of combinations of hyperparameter settings is randomly sampled using the randomised search technique [44]. To facilitate the random sampling, the hyperparameters' settings are defined as uniform distributions of values, as shown in Table 2. The k-fold CV inner loop with k = 10 in the initiated nested CV is later applied for each combination of hyperparameter values h ∈ H. The goal is to find the hyperparameter combination that achieves the lowest generalisation error among all the investigated combinations in H. For this, a set S with |S| = 10 of various subsets of training and validation data is generated. In total, 6000 simulations are required for the list of four ML algorithms (4 algorithms × 10 outer folds × 15 hyperparameter combinations × 10 inner folds). The performance metric used for the model evaluation and selection is the root mean square error (RMSE), which quantifies the average magnitude of the error between the real values and the computed values, as calculated by [56]

RMSE = sqrt( (1/T) Σ_{t=1}^{T} (y_t − ŷ_t)² ),   (10)

where y_t and ŷ_t denote the real and computed values at time step t, respectively, and T denotes the total number of steps of the time-series validation data. The best hyperparameters are chosen as those with the lowest RMSE. The process is schematically presented in Figure 5. For demonstration purposes, the numerical results in terms of the optimal hyperparameter configurations, associated with the averaged RMSE as a generalisation error, for the SVR algorithm are summarised in Table 3.
It is worth mentioning that since the inner loop operates over 10 distinct subsets of the simulated historical data, formed by the outer loop, 10 combinations of the best hyperparameters, along with their estimated performance, are obtained. It is observed that the hyperparameters are properly customised with varying settings across the subsets. With the optimal hyperparameters from the inner loop, each algorithm is trained and evaluated in the outer loop, as presented in the next section.
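The nested CV procedure above can be sketched with scikit-learn, a minimal illustration only: the SVR hyperparameter ranges and the synthetic data below are placeholders, not the actual settings of Table 2 or the paper's simulated historical data.

```python
# Sketch of the 10 x 10 nested CV: an outer 10-fold loop, and an inner
# 10-fold randomised search over |H| = 15 sampled hyperparameter settings,
# scored by (negative) RMSE. Ranges and data are illustrative assumptions.
import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # stand-in voltage features
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)   # stand-in transformer loading

param_dist = {"C": uniform(0.1, 100), "epsilon": uniform(0.01, 1.0)}
outer = KFold(n_splits=10, shuffle=True, random_state=0)

best_per_fold = []
for train_idx, test_idx in outer.split(X):
    # Inner loop: tune on the outer-loop training part only.
    search = RandomizedSearchCV(
        SVR(), param_dist, n_iter=15, cv=10,
        scoring="neg_root_mean_squared_error", random_state=0,
    )
    search.fit(X[train_idx], y[train_idx])
    best_per_fold.append(search.best_params_)

# One best hyperparameter set per outer fold, i.e. 10 in total.
print(len(best_per_fold))
```

As in the text, each outer fold yields its own best hyperparameter combination, so the settings may legitimately differ between folds.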

| Machine learning algorithm selection
The ML algorithm selection involves finding the algorithm that has the lowest generalisation error and the best stability among all the compared algorithms. The generalisation error represents the performance estimate of an algorithm on new, unseen data that are not in the training set. Meanwhile, the stability quantifies how well the algorithm performs on multiple different test datasets [44]. The k-fold CV enables both estimation of the generalisation performance and testing of the stability of the algorithm. The rationale is that this method retains an independent test dataset for the performance estimate, and it iterates the performance testing k times on different datasets. These features make the k-fold CV a common approach for algorithm selection [44].
Owing to the k-fold CV in the outer loop, nested CV is applicable to ML algorithm selection. Accordingly, the nested CV with a 10 × 10 set-up established in Section 4.2 is employed in this section. Specifically, in every iteration k of the outer loop, each algorithm fits a model with its optimal hyperparameters determined from the inner loop on the training data, and the model performance is then evaluated on the withheld test data. The performance metric is the RMSE given by Equation (10). Similar to the inner loop, the 10 resulting generalisation errors of each algorithm are afterwards averaged as the ultimate generalisation error for comparing the algorithms. The RMSE results across the folds of the outer loop for the individual ML algorithms are displayed in the box plot in Figure 6. The numerical values of RMSE and the computational times are listed in Table 4. As can be observed from both the figure and the table, RFR and XGBR clearly outperform SVR and RR, as the first two algorithms have mean RMSE values around 3.5 to 4.5 times lower than those of the last two. Moreover, the figure shows that the error spread of RFR and XGBR decreases significantly compared with SVR and RR. Between RFR and XGBR, the RMSE results are relatively comparable, as both models are based on ensemble tree algorithms. In terms of computational time, SVR is the slowest, followed by RFR and XGBR, while RR is the fastest algorithm. This is because the SVR model is built by solving a quadratic programming problem and selecting the best support vectors from the dataset, which is typically computationally expensive [57]. RFR consists of a large number of regression trees, each grown from a set of bootstrapped samples. This makes the training of RFR relatively time-consuming, though faster than SVR because the trees are grown in parallel [48].
XGBR has a significantly lower computational time than RFR because XGBR adopts an improved tree learning algorithm (for managing sparse data) and a quantile sketch scheme (for handling instance weights) [49]. By only solving linear equations, RR requires the lowest simulation time [58], but at the expense of significantly reduced accuracy. Hence, these findings suggest that the RFR and XGBR algorithms are worth further study on the defined regression problem.
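The outer-loop comparison described above can be sketched as follows; this is a toy illustration with default hyperparameters and synthetic data, not a reproduction of the paper's Table 4 set-up (XGBR and SVR are omitted for brevity, and RR is instantiated as scikit-learn's `Ridge`).

```python
# Minimal sketch of the algorithm comparison: collect each algorithm's
# per-fold RMSE over the 10 outer folds and average it; the lowest mean
# RMSE identifies the better algorithm on this (non-linear) toy target.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X ** 2).sum(axis=1) + rng.normal(scale=0.1, size=300)  # non-linear target

outer = KFold(n_splits=10, shuffle=True, random_state=1)
results = {}
for name, model in [("RR", Ridge()),
                    ("RFR", RandomForestRegressor(random_state=1))]:
    scores = cross_val_score(model, X, y, cv=outer,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()   # mean RMSE across the 10 outer folds

print(results)
```

On a non-linear target like this one, the tree-ensemble model should achieve a lower mean RMSE than the linear RR model, mirroring the paper's finding.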

| MODEL DEPLOYMENT FOR TRANSFORMER LOADING ESTIMATION
Given the ML algorithms selected in the previous section, this section focusses on the procedure of deploying the algorithms to estimate the transformer loading, as depicted in Figure 1. The evaluation of the algorithm performance is also analysed. It is worth mentioning that the algorithm evaluation completed in Section 4.3 involved small test datasets, with only RMSE serving as the performance metric. Therefore, it is desirable to carry out further tests of the selected algorithms on a larger dataset and with several performance metrics.

| Model set up
The model set-up stage prepares the input data for the ML algorithms and implements the model evaluation and selection. Firstly, the input data preparation requires gathering the voltage measurements of the selected household SMs, as specified in Sections 3.2 and 3.3, from the operational network data. In this study, the operational data during model deployment are simulated using 20% of the stochastic training dataset withheld from the data generation in Section 3.1.1. As a result, a set of 7008 time-series voltage measurements with 15-min time resolution at the POC of five houses (H3, H1, H31, H29, and H48) is obtained. Secondly, the model evaluation and selection adopts k-fold CV with k = 10, similar to the inner loops of nested CV explained in Section 4.2, to optimise the model hyperparameters. To this end, the randomised search technique is again utilised to randomly sample a set H, with size |H| = 15, of hyperparameter combinations from the predefined ranges listed in Table 2.
A difference from Section 4.2, however, is that the model selection process in this section excludes the k-fold CV outer loop with k = 10 of nested CV. This means that the model hyperparameters are tuned on the entire set of simulated historical data at once, instead of being iterated over several folds. A total of 300 simulations are thus required for the two ML algorithms, RFR and XGBR. The resulting optimal hyperparameters, along with the estimated model performance based on RMSE, are listed in Table 5.
With these optimal hyperparameters, the RFR and XGBR models are trained on the entire historical data for the later performance estimates.
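The deployment-stage tuning just described, a single 10-fold randomised search over the full historical set followed by retraining on all of it, can be sketched as below. The hyperparameter ranges and data are illustrative assumptions, not Table 2's actual values; only RFR is shown, the XGBR model being tuned identically.

```python
# Deployment-stage sketch: one 10-fold randomised search (|H| = 15) over
# the entire simulated historical dataset (no outer loop), then a final
# refit on all historical data for later deployment.
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X_hist = rng.normal(size=(400, 5))      # stand-in for historical SM voltages
y_hist = X_hist @ np.array([1.0, 0.5, -0.3, 0.8, 0.2]) \
         + rng.normal(scale=0.05, size=400)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=2),
    {"n_estimators": randint(20, 100), "max_depth": randint(3, 15)},
    n_iter=15, cv=10, scoring="neg_root_mean_squared_error", random_state=2,
)
search.fit(X_hist, y_hist)

# Retrain on the entire historical dataset with the best setting found.
final_model = search.best_estimator_.fit(X_hist, y_hist)
```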

| Model performance analysis
After being trained with the optimised hyperparameters, the RFR and XGBR algorithms use the gathered input data of household voltage measurements to calculate the transformer loading. To analyse the model performance, three performance metrics are computed. First, the RMSE determined by Equation (2) provides quantitative insights into the generalisation performance of the model [56]. Secondly, the Pearson correlation coefficient R, as defined by Equation (10), is used to gauge the degree of linear dependence between the real and estimated values. Thirdly, the Kolmogorov-Smirnov (KS) statistical test is performed to assess the statistical significance of the model results [59]. The KS test is adopted in this study because it operates well with non-normally distributed data; the non-normal distribution of V_i for ∀i ∈ N_A is shown in the histogram plots in Figures 3, A1 and A2. In the KS test, the K statistic quantifies the proximity of the compared data distributions; hence, a lower K statistic is expected for higher estimation accuracy. Moreover, to compare the RFR and XGBR models with a non-ML model, the ordinary least squares LR model is applied as a benchmark for transformer loading calculation, using the same input data and performance metrics.
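The three metrics can be computed directly with NumPy and SciPy, as sketched below on toy measured-versus-estimated loading series (the gamma-distributed "loading" is purely a stand-in for non-normal data).

```python
# Sketch of the three performance metrics: RMSE, Pearson correlation
# coefficient R, and the two-sample Kolmogorov-Smirnov K statistic.
import numpy as np
from scipy.stats import pearsonr, ks_2samp

rng = np.random.default_rng(3)
y_true = rng.gamma(shape=2.0, scale=5.0, size=1000)   # non-normal "loading"
y_pred = y_true + rng.normal(scale=0.5, size=1000)    # close estimate

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))       # generalisation error
r, _ = pearsonr(y_true, y_pred)                       # linear dependence
k_stat, p_value = ks_2samp(y_true, y_pred)            # lower K = closer PDFs

print(f"RMSE={rmse:.2f}  R={r:.3f}  K={k_stat:.3f}")
```

The KS test compares the empirical distributions of the two series without any normality assumption, which is why it suits the skewed voltage and loading data here.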
The results of the three performance metrics are listed in Table 6. As can be seen, the ML models (i.e. RFR and XGBR) significantly outperform the non-ML model (i.e. LR). The ML models achieve a coefficient R around 5.2 times higher and an RMSE around 7.8 times lower than the non-ML model, while their K statistic from the KS test is around 3.4 times lower. Owing to the features of the tree-based ensemble algorithm, RFR and XGBR are comparable in estimation accuracy, as their differences in the three calculated performance metrics are insignificant. These results show the efficacy of the regression-based methods for estimating the transformer loading.
Furthermore, the probability density function (PDF) based on the kernel density estimate is adopted to visualise the performance of the regression models based on the RFR, XGBR, and LR algorithms on the transformer loading calculation. A comparison of the PDF of the database of transformer loading measurements (i.e. true values) and the values calculated by the RFR, XGBR, and LR models is shown in Figure 7. Further, a comparison of the PDF of the errors between the measured and calculated values is demonstrated in Figure 8.
It is observed from Figure 7 that the RFR and XGBR profiles have similar patterns and statistical properties (i.e. the peak and the tail of the kernel density estimate) to the true data. In contrast, the LR profile shows a large difference from the true data. As can be noticed from Figure 8, the RFR and XGBR profiles have comparable statistical properties, for example the peak at 0 kVA and the short tails between approximately −9 and 5 kVA. All these indicate that the regression models based on RFR and XGBR appear equally effective in accurately estimating the transformer loading when using the voltage measurements from SMs. The main driver of these similarities is that both models are based on ensemble tree algorithms.

To further compare these two models, in addition to the computation time during model fitting and evaluation, the computation time during the deployment phase is measured and displayed in Table 7. In this phase, the computation time covers re-tuning the model hyperparameters and re-evaluating the model performance (as explained in Section 5.1). The table shows that the XGBR-based regression model runs considerably faster than the RFR-based one, taking only one-third of the latter's simulation time. Hence, considering its high effectiveness and lower execution time, the XGBR algorithm provides the best solution to the regression problem of transformer loading calculation.
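The kernel-density comparison behind Figures 7 and 8 can be sketched with SciPy's `gaussian_kde`; the toy measured and estimated series below are stand-ins for the actual transformer loading data.

```python
# Sketch of the KDE-based PDF comparison: estimate the PDFs of the
# measured and model-estimated loading (Figure 7 analogue) and of their
# error (Figure 8 analogue), then quantify how close the two PDFs are.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
measured = rng.gamma(shape=2.0, scale=5.0, size=2000)   # "true" loading
estimated = measured + rng.normal(scale=0.5, size=2000) # accurate model output

grid = np.linspace(measured.min(), measured.max(), 200)
pdf_true = gaussian_kde(measured)(grid)      # PDF of measured loading
pdf_est = gaussian_kde(estimated)(grid)      # PDF of estimated loading
pdf_err = gaussian_kde(estimated - measured) # error PDF, peaked near 0

# A small mean absolute gap between the PDFs indicates matching
# statistical properties, as observed for RFR/XGBR versus the true data.
gap = np.mean(np.abs(pdf_true - pdf_est))
print(f"mean PDF gap: {gap:.4f}")
```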

| CONCLUSION
This article presented an innovative alternative for implementing transformer congestion monitoring in LV distribution networks by developing a data-driven approach implemented by the LECs. The proposed approach adopts different regression models and leverages SM data from residential customers. A thorough analysis was conducted to discover the relationships between the transformer loading (i.e. instantaneous apparent power) and residential SM readings (active and reactive power, and voltage magnitude). A comprehensive framework was provided to implement, validate and compare four different ML algorithms for transformer loading estimation from the LECs' perspective: RR, SVR, RFR, and XGBR. The presented procedure covers all the essential stages in deploying a data-driven technique, consisting of training data generation, exploratory data analysis, feature selection, model selection and algorithm selection. In this respect, state-of-the-art advances in ML research have been integrated, such as the XGBR algorithm and the nested CV technique.
The obtained simulation results indicate that combining the regression models with voltage magnitude measurements from residential SMs can effectively estimate the transformer loading: the Pearson correlation coefficient R and the RMSE calculated between the real and estimated values are around 0.98 and 0.87, respectively, using only a reduced set of SM data (5 out of 21 SMs). Simultaneously, the proposed approach preserves customers' privacy rights. Among the four ML algorithms examined, XGBR appears to be the best method, as it achieves adequate accuracy in estimating the transformer loading while spending significantly less time on execution (in total, one-third of the execution time of RFR). Hence, the introduced approach provides a solution for implementing transformer congestion monitoring in LV distribution networks. The approach can eliminate the need for sensors and communications dedicated to transformer congestion management, owing to the use of available residential SMs. Furthermore, the approach enables the congestion monitoring capability of the LECs. Thus, the LECs can control their consumption or generation profiles to solve congestion problems, enhancing their self-management in the regional network for safe operation.
Although successful, the data-driven approach comes with its own limitations. Applying the proposed approach hinges on the existence of the LECs and their ability to access the residential SMs. The LECs use the SM readings as historical and real-time operational data for the regression model's preparation and deployment. In this respect, modifications of the existing regulatory framework regarding the LECs' business model and access to residential SMs will be required. In addition, the proposed approach is not adaptive to grid topology changes or increasing PV installation, which are inherent in grid operation and evolution. Such changes will induce variations in the correlation between the transformer power and nodal voltage levels, which need to be re-captured by the regression model. Periodically assessing the regression model's performance and retraining the model afterwards are therefore essential.
In this research, a typical LV distribution network with a radial configuration is considered. For LV distribution networks with a mesh topology, the application of the proposed approach can face challenges, as the relationships between the transformer power and the nodal voltage magnitudes can become too complicated for the regression models. Thus, further research is required to handle mesh networks more effectively.

F I G U R E 8 Kernel density estimate-based probability density function of the errors between the measurements and the estimations by the models. LR, linear regression; RFR, random forest regression; XGBR, eXtreme gradient boosting regression