Document Type

Conference Paper


Available under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International Licence


Statistics, Probability, Computer Sciences

Publication Details

Presented at Credit Scoring and Credit Control XII Conference, The University of Edinburgh Management School: Edinburgh.


In this paper we propose a framework to generate artificial data that can be used to simulate credit risk scenarios. Artificial data is useful in the credit scoring domain for two reasons. Firstly, the use of artificial data allows for the introduction and control of variability that can realistically be expected to occur, but has yet to materialise in practice. The ability to control parameters allows for a thorough exploration of the performance of classification models under different conditions. Secondly, due to non-disclosure agreements and commercial sensitivities, obtaining real credit scoring data is a problematic and time-consuming task. Providing publicly available artificial data opens credit scoring to the wider data mining community. This in turn could encourage greater participation, promote replicable experimental findings, and give rise to proposed solutions to outstanding credit scoring problems. To ensure that our framework is sufficiently grounded in reality, data distributions are generated using a troika of sources: demographic information from the Central Statistics Office, Ireland; housing statistics published by the Irish Government Department of the Environment, Heritage and Local Government; and a profile of loan defaulters developed using a recent report published by a credit rating agency. By engaging with a credit scoring expert we select characteristics that are typical of most application scorecard models including, amongst others: age, income, loan value, and occupation. Through user-controlled settings the conditional prior probabilities of the characteristics can be adjusted over time to simulate differing scenarios. In order to assign class labels to the generated data, a credit risk score is estimated based on the non-linear interactions between various characteristics. Based on the desired number of defaulters, a cut-off score is placed on this monotonic ordering of credit scores to distinguish between those likely to repay and those likely to default on their financial obligation. The classification complexity is controlled by adding user-defined random Gaussian noise. After discussing the desirable characteristics of artificial data, we describe a pseudo-random data generator for credit scoring and provide illustrations of how the framework can be used to generate population drift.
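The labelling procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's generator: the distributions, the occupation categories, the form of the risk score, the convention that a higher score means higher risk, and all names (`sample_applicant`, `risk_score`, `generate`, `drift`) are assumptions made purely for illustration. The sketch draws characteristics from a prior that can shift over time, computes a score from non-linear interactions, adds Gaussian noise, and places the cut-off so that the desired number of defaulters is obtained.

```python
import random

# Hypothetical occupation categories; the paper's actual categories are not given here.
OCCUPATIONS = ["professional", "manual", "unemployed"]
BASE_INCOME = {"professional": 55000, "manual": 32000, "unemployed": 12000}


def sample_applicant(rng, t=0, drift=0.0):
    """Draw one applicant. `drift` raises the prior of 'unemployed' per time step t."""
    p_unemployed = min(0.9, 0.10 + drift * t)
    rest = 1.0 - p_unemployed
    weights = [rest * 0.5, rest * 0.5, p_unemployed]
    occupation = rng.choices(OCCUPATIONS, weights=weights, k=1)[0]
    # Income is conditional on occupation (assumed Gaussian, floored at 5000).
    income = max(5000.0, rng.gauss(BASE_INCOME[occupation], 8000))
    return {
        "age": rng.randint(18, 70),
        "income": income,
        "loan_value": rng.uniform(1000, 40000),
        "occupation": occupation,
    }


def risk_score(a):
    """Assumed score (higher = riskier) with non-linear interactions."""
    loan_to_income = a["loan_value"] / a["income"]
    young = 1.0 if a["age"] < 25 else 0.0
    unemployed = 0.8 if a["occupation"] == "unemployed" else 0.0
    return loan_to_income ** 2 + 0.5 * young * loan_to_income + unemployed


def generate(n, default_rate, noise_sd, t=0, drift=0.0, seed=0):
    """Return (applicants, labels), where label 1 marks a defaulter."""
    rng = random.Random(seed)
    applicants = [sample_applicant(rng, t, drift) for _ in range(n)]
    # Gaussian noise blurs the class boundary and so controls classification difficulty.
    noisy = [risk_score(a) + rng.gauss(0.0, noise_sd) for a in applicants]
    n_default = round(n * default_rate)
    # Rank by noisy score; the cut-off falls after the n_default riskiest applicants.
    order = sorted(range(n), key=lambda i: noisy[i], reverse=True)
    labels = [0] * n
    for i in order[:n_default]:
        labels[i] = 1
    return applicants, labels


if __name__ == "__main__":
    for t in (0, 10):
        apps, labels = generate(1000, default_rate=0.1, noise_sd=0.2, t=t, drift=0.05)
        share = sum(a["occupation"] == "unemployed" for a in apps) / len(apps)
        print(f"t={t}: defaulters={sum(labels)}, unemployed share={share:.2f}")
```

Increasing `noise_sd` makes the two classes harder to separate, while a non-zero `drift` changes the applicant population between time steps without changing the labelling rule, which is one simple way to mimic population drift.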


Kennedy_Credit_Scoring.arff (1434 kB)
credit scoring artificial dataset