Build Yourself a Digital Bank

The future of brick and mortar banks

There’s plenty of lazy money to be made in banking, either retail banking or commercial banking. There, I said it!

Why should you create a digital bank, and why should you be terrified if you are an existing brick and mortar bank? Here’s why.

Let’s say we have two banks “Brick & Mortar” and “Savvy Digital”.

“Brick & Mortar” is a brick and mortar bank with heaps of staff. All loan applications are processed manually by diligent staff. You have to step into a branch to apply for a loan, and it takes a week or so to have the application processed (assuming all is well with the paperwork).

“Savvy Digital” is a new bank. It has very few staff and no branches; instead it has a few data scientists and developers. You apply for a loan with “Savvy Digital” from your phone, you are assessed by their algorithms in seconds, and you have the money in your account in half an hour.

What we will find over time is the better quality customers (generally people with jobs, middle class etc) will leave “Brick & Mortar” and become customers of “Savvy Digital”. These people just won’t have the time or energy to deal with “Brick & Mortar”. However, any customers rejected by “Savvy Digital” will apply for loans with “Brick & Mortar” as a last resort.

So what is going to happen to the loan book and customer quality of “Brick & Mortar”? It will deteriorate! They will find their cost of business rise sharply with their lower quality through the door customers. Lower quality customers means higher arrears and write offs, higher collections costs, higher costs of underwriting. Because loans are manually processed they will be a vector for fraud from falsified application documentation. They will be forced to lay off staff, close operations and branches, and ultimately the brick and mortar banks will fail. This is going to happen sooner than they realise.

This is the fate of the brick and mortar banks. There are four major banks in Australia. If someone else is seeing what I am seeing and has more in the way of savings than I do (say a few million dollars more, enough to finance a team for a few years), then I reckon the big four banks are going to get hammered. And the regional banks? Well, once the old farmers with their passbooks in their back pockets can’t shuffle in any longer, these guys too will feel the pain of that new way of getting customers and doing business called “the interweb” (or something like that).

Here’s how you build a digital bank.

Step 1: Bank statements are the ultimate source of credit information

Bank statements are the most powerful source of information for credit decisions. It amazes me that all the banks like “Brick & Mortar” are sitting on this information but mostly not incorporating it in their credit decisioning systems. Instead they rely on the information users fill in on the application form. In the best case the user doesn’t know how much they spend on groceries in a week; in the worst case they lie and say they have zero credit cards when they actually have ten.

How do you categorize bank statements? What is your training set?

Option 1) Sit a bunch of people down and get them to manually categorize millions of transactions from bank statements to create a training set.

Well, I mean you could but it would be painful for the people doing it. It would be expensive and they would probably get it wrong heaps. So you’d have a lot of conflicting data.

“Gloria Jeans Coffee Shop”

  • Is it “Clothing and Fashion” or is it “Restaurants and Cafes”?

This example highlights why a keyword lookup approach to bank statements will fail too. I mean depending on the hierarchy of the tree you will get incorrect results consistently. Concretely, if the mapping of Jeans -> “Clothing and Fashion” appears before “Coffee” -> “Restaurants and Cafes” this example will always be categorised incorrectly.
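To make the ordering problem concrete, here is a minimal Python sketch of a first-match keyword lookup. The rule list and the function name are made up for illustration; they are not from any real categorisation product.

```python
# Hypothetical ordered keyword rules: the first keyword found in the text wins.
KEYWORD_RULES = [
    ("jeans", "Clothing and Fashion"),
    ("coffee", "Restaurants and Cafes"),
]


def keyword_category(description, rules=KEYWORD_RULES):
    """Return the category of the first rule whose keyword appears in the text."""
    text = description.lower()
    for keyword, category in rules:
        if keyword in text:
            return category
    return "Uncategorised"


# "jeans" is checked before "coffee", so the coffee shop lands in clothing.
print(keyword_category("Gloria Jeans Coffee Shop"))
# Reversing the rule order fixes this case, but only by breaking others.
print(keyword_category("Gloria Jeans Coffee Shop", rules=KEYWORD_RULES[::-1]))
```

Whatever order you pick, some business name will hit the wrong keyword first, which is why a trained model is the better bet.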

Option 2) Use an independent source of truth to train the bank statements

There are several you could use, but ultimately all you would want is a mapping of businesses to category. How you’d then rollup and group the categories depends on your application. But as an example:

  • Yellow Pages has businesses mapped to categories

  • Business names mapped to ANZSIC codes

Go with Option 2).

Step 2: Build an engine to categorize each transaction in the bank statement

Obviously this includes cleaning the corpus of text, but you have some tough choices to make:

  • Do you strip punctuation, or is punctuation meaningful here, for example “7-Eleven”?

  • Do you convert text to lowercase? I did.

  • How do you allow for edge cases? Coles is a big supermarket chain in Australia, but it is also a suburb in Australia
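One possible answer to the first two questions, sketched in Python. The rules here (lowercase everything, keep a hyphen only when it sits between two letters or digits) are my assumptions, not a tested cleaning spec, and the function name is hypothetical. It does nothing about the Coles-the-suburb problem, which needs context rather than cleaning.

```python
import re


def clean_description(text):
    """Lowercase a statement line and drop punctuation, keeping in-word hyphens."""
    text = text.lower()
    # Replace anything that is not a letter, digit, hyphen or whitespace with a space.
    text = re.sub(r"[^a-z0-9\-\s]", " ", text)
    # Drop hyphens that are not sandwiched between two letters/digits.
    text = re.sub(r"(?<![a-z0-9])-|-(?![a-z0-9])", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


print(clean_description("7-Eleven  #1234, SYDNEY"))
print(clean_description("COLES - BONDI"))
```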

How sophisticated do you want the model to be?

  • You can build a simple model, like a Maximum Entropy classifier, which is really just a multinomial logistic regression model with the bag of words as features, but you would want to create bigrams or trigrams (2 or 3 word combinations) for greater context.

  • You can try a deep learning approach which would eliminate the need for bigram or trigram features.
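Here is a rough sketch of the simple option: a multinomial logistic regression over unigram and bigram counts, written in plain Python so nothing needs installing. The training pairs are invented for illustration, and the learning rate and epoch count are untuned guesses. In practice you would reach for a proper library and a far bigger training set.

```python
import math
from collections import defaultdict

# Hypothetical (business name, category) pairs standing in for a real training set.
TRAINING = [
    ("gloria jeans coffee", "Restaurants and Cafes"),
    ("coffee club", "Restaurants and Cafes"),
    ("just jeans", "Clothing and Fashion"),
    ("jeans west", "Clothing and Fashion"),
    ("coles supermarket", "Groceries"),
    ("woolworths supermarket", "Groceries"),
]


def ngram_features(text, max_n=2):
    """Bag of unigrams plus bigrams for one cleaned description."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def predict_proba(weights, text):
    """Softmax over the summed feature weights of each category."""
    feats = ngram_features(text)
    scores = {label: sum(w.get(f, 0.0) for f in feats) for label, w in weights.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def train_maxent(examples, epochs=300, lr=0.1):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({label for _, label in examples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for text, gold in examples:
            feats = ngram_features(text)
            probs = predict_proba(weights, text)
            for label in labels:
                # Gradient per feature: observed (1 or 0) minus predicted probability.
                step = lr * ((1.0 if label == gold else 0.0) - probs[label])
                for f in feats:
                    weights[label][f] += step
    return weights


weights = train_maxent(TRAINING)
probs = predict_proba(weights, "gloria jeans coffee")
print(max(probs, key=probs.get))
```

The bigrams “gloria jeans” and “jeans coffee” give the model the context a bare “jeans” keyword lacks.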

There will be new transactions popping up all the time, how do you deal with that?

  • There will be new online gambling sites opening up, new credit providers etc. So you need a system of picking them up quickly. As these new examples would not have been seen in the training data they would likely be more volatile, that is they would be classified in a random way flipping between categories.

  • Topic modelling would help, for instance you’d expect online gambling sites to be clustered together.
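One cheap way to spot the flipping behaviour, sketched in Python: log every prediction per merchant and flag merchants where no single category dominates. This is not topic modelling, just a monitoring heuristic. The thresholds and the merchant names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def unstable_merchants(predictions, min_share=0.8, min_count=3):
    """Flag merchants whose predicted category keeps changing.

    predictions: iterable of (merchant_key, predicted_category) pairs.
    A merchant is flagged when it has at least `min_count` predictions and
    its most common category accounts for less than `min_share` of them.
    """
    by_merchant = defaultdict(Counter)
    for merchant, category in predictions:
        by_merchant[merchant][category] += 1

    flagged = {}
    for merchant, counts in by_merchant.items():
        total = sum(counts.values())
        _, top_count = counts.most_common(1)[0]
        if total >= min_count and top_count / total < min_share:
            flagged[merchant] = dict(counts)
    return flagged


# Hypothetical prediction log: an established chain and a brand new gambling site.
log = [("coles", "Groceries")] * 5 + [
    ("luckyspin", "Gambling"),
    ("luckyspin", "Entertainment"),
    ("luckyspin", "Groceries"),
    ("luckyspin", "Gambling"),
]
print(unstable_merchants(log))
```

Anything flagged goes into a review queue so it can be labelled and added to the training data.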

Step 3: Group up similar transactions

The grouping of transactions will be difficult, but it is necessary to determine the quality of the income streams, the regularity of the expense items and understanding those massive one-off expenses that hit people like having to fix the washing machine or the car.

You will often see transactions like this:

  • “WAGES ABC PTY LTD JUN”

  • “WAGES ABC PTY LTD JUL”

  • “WAGES ABC PTY LTD AUG”

So, a SQL “group by” won’t cut it here. You need something that will group up similar, but not identical, line items in the bank statement. I’d recommend Jaro-Winkler as the measure of string similarity.

The results of the Jaro Winkler algorithm tend to be bimodal. You usually see a clear threshold for grouping and for items that shouldn’t be grouped. For instance if we compare the above to “WAGES ABC PTY LTD SEP”

  • “WAGES ABC PTY LTD JUN” 0.9047619

  • “WAGES ABC PTY LTD JUL” 0.9047619

  • “WAGES ABC PTY LTD AUG” 0.9047619

  • “UBER SOME RANDOM PLACE” 0.5533911

  • “ICE CREAM SHOP” 0.4920635

Setting a grouping threshold at around 0.85 to 0.9 should work pretty well, obviously worth checking with lots of data.
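Here is a small pure-Python sketch of the grouping idea. The greedy “compare against the first member of each group” strategy and the function names are my assumptions, not a tuned design. One thing worth knowing: the 0.9047619 quoted above is what you get with the Winkler prefix weight set to 0 (plain Jaro). With the usual prefix weight of 0.1 the wage lines score a little higher, which only helps the grouping.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are within this many positions.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matches1 = []
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matches1.append(ch)
                break
    m = len(matches1)
    if m == 0:
        return 0.0
    matches2 = [s2[j] for j in range(len(s2)) if used[j]]
    # Transpositions: half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matches1, matches2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def group_similar(descriptions, threshold=0.85):
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []
    for d in descriptions:
        for g in groups:
            if jaro_winkler(d, g[0]) >= threshold:
                g.append(d)
                break
        else:
            groups.append([d])
    return groups


lines = [
    "WAGES ABC PTY LTD JUN",
    "WAGES ABC PTY LTD JUL",
    "WAGES ABC PTY LTD AUG",
    "UBER SOME RANDOM PLACE",
    "ICE CREAM SHOP",
]
for group in group_similar(lines):
    print(group)
```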

What this then allows you to do is create new features in your risk models around the regularity of the transactions, by working out the days between each similar transaction. You can also measure the variability of the transaction, using the coefficient of variation (standard deviation divided by mean).
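A minimal Python sketch of those two features, assuming you already have the dates and amounts for one group of similar transactions. The wage dates and amounts are invented, and using the population standard deviation is just one reasonable choice.

```python
from datetime import date
from statistics import mean, pstdev


def days_between(dates):
    """Gaps in days between consecutive transactions, oldest first."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; lower means more regular."""
    avg = mean(values)
    if avg == 0:
        return float("nan")
    return pstdev(values) / abs(avg)


# Hypothetical wage payments for one grouped income stream.
wage_dates = [date(2023, 6, 15), date(2023, 7, 15), date(2023, 8, 15)]
wage_amounts = [3200.0, 3200.0, 3350.0]

gaps = days_between(wage_dates)
print(gaps, coefficient_of_variation(gaps), coefficient_of_variation(wage_amounts))
```

A steady salary shows small values on both the gaps and the amounts; irregular gig income shows up as large ones.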

Step 4: Feed these features into a real time credit risk scorecard

You’d access bank statements using Yodlee or some other provider allowing you to make credit risk decisions in real time.

It’s then not that tricky to build a credit risk scorecard. Most banks tend to have your basic train/ test split with a few variables from the application form (which we know are unreliable, incorrect, or just false) and a few credit bureau variables (which we know are from the past, sometimes distant past). They don’t tune any of the model parameters at all usually. Others don’t even have their own risk models and just pass some variables from the application form to a credit bureau and hope that something good is returned as a score.

Most credit risk scorecards are not trained using cross-validation and regularization is not used. So, what happens is most of them require significant rebuilds way more often than they should.
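For what it’s worth, here is a bare-bones Python sketch of what cross-validation plus regularization looks like for a logistic scorecard: k-fold cross-validation picks the L2 penalty with the lowest held-out log-loss. The data is synthetic, the two features are stand-ins for things like wage regularity, and the penalty grid is an arbitrary assumption. A real build would use a proper modelling library.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, l2, epochs=300, lr=0.1):
    """Batch gradient descent on log-loss plus an L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def log_loss(X, y, w, b):
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices once and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_l2(X, y, grid=(0.0, 0.01, 0.1, 1.0), k=5):
    """Return the penalty with the lowest mean held-out log-loss, plus all scores."""
    folds = k_fold_indices(len(X), k)
    scores = {}
    for l2 in grid:
        losses = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            w, b = fit_logistic([X[i] for i in train], [y[i] for i in train], l2)
            losses.append(log_loss([X[i] for i in held_out], [y[i] for i in held_out], w, b))
        scores[l2] = sum(losses) / k
    return min(scores, key=scores.get), scores


# Synthetic data: two made-up features, default flag driven mostly by the first.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
y = [1 if x[0] + 0.5 * rng.gauss(0, 1) > 0 else 0 for x in X]
best_l2, cv_scores = cross_validated_l2(X, y)
print(best_l2, cv_scores)
```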

Bank statements are the ultimate data source, the king of lending data, swamping the power of the credit bureau and the application form. Very few lenders have the ability to incorporate this data into their credit models, which is totally nuts in my opinion.

Beyond credit scoring there is a responsible lending element to using the bank statements. Hygiene checks such as the percentage of income spent on gambling, the percentage of income that comes from government payments, ATM withdrawals, and the amount spent on superfluous items like take out and cabs can help to assess the individual in a way basic credit scoring cannot. There is also an education piece, rolling off the back of a detailed, bank statement based credit assessment, to help the client get back on track with their finances.
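Once transactions are categorised, the hygiene checks are simple ratios. Here is a Python sketch, assuming each transaction is a (category, amount) pair with income positive and spending negative. The category names and figures are illustrative only.

```python
def share_of_income(transactions, category):
    """Total size of one category as a fraction of all income."""
    income = sum(amount for _, amount in transactions if amount > 0)
    if income == 0:
        return 0.0
    in_category = sum(abs(amount) for cat, amount in transactions if cat == category)
    return in_category / income


# Hypothetical month of categorised transactions.
month = [
    ("Wages", 3000.0),
    ("Government Benefits", 1000.0),
    ("Gambling", -400.0),
    ("ATM Withdrawal", -200.0),
    ("Groceries", -600.0),
]

for category in ("Gambling", "Government Benefits", "ATM Withdrawal"):
    print(category, round(share_of_income(month, category), 3))
```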


The applications of what I’d call next generation credit risk modelling are huge, epoch changing for banking, finance, fintech, everything!

1. Classification of bank statement transaction items as a service

This could be offered to banks and financial institutions as a service. I can really see something like this replacing/ disrupting credit bureaus. Surely it is more interesting to detect job severance pay the day before someone applies for a credit card than to see a forgotten payment on a personal loan from 3 years ago. Calling this service could even be like calling a traditional credit bureau, except I think the relevance and power of the data would be far greater.

2. The next generation of Personal Financial Management Apps could use this service.

The holy grail is to be able to project out income and expenses into the future. This is possible only with excellent bank statement categorization and sensible grouping, determining a likely future of the client. The idea being if you can see the road ahead you could offer a real solution with directed messages to help the user reach their saving goals. “Last year you were short $500 for Christmas, this year based on projections of your income and expenses it looks like you will fall short $300, but if we set aside $6 per week into a separate Christmas fund you will be apples Champ!”
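The arithmetic behind that Christmas message is tiny once you have the projection. Here is a sketch in Python, assuming the projected shortfall is already known and there are about 50 weeks to go; both function names are hypothetical.

```python
import math


def projected_shortfall(projected_income, projected_expenses):
    """How far projected expenses exceed projected income (never negative)."""
    return max(0.0, sum(projected_expenses) - sum(projected_income))


def weekly_set_aside(shortfall, weeks_until_goal):
    """Weekly amount needed to cover the shortfall, rounded up to the cent."""
    if weeks_until_goal <= 0:
        raise ValueError("weeks_until_goal must be positive")
    return math.ceil(shortfall / weeks_until_goal * 100) / 100


# A $300 projected shortfall spread over 50 weeks comes to $6 a week.
print(weekly_set_aside(300, 50))
```

The hard part is the projection itself, which depends on the categorisation and grouping steps being right.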

3. Real time credit scoring.

Credit bureau scores are either event based (like a default) or are updated monthly, whereas something like this would give a real time credit score. At worst the credit score could be updated daily. Real time credit scoring would give individuals a deeper understanding of the actions they take and their impact on their credit health. Eg if you had a bender and spent hundreds of dollars on booze and uber rides, you’d see a dip in your credit score and your savings goals would stretch out that bit further. Then it would be bringing lunch to work for a while to get your score back to where it was. Such a system would keep users on track and financially responsible through better information.

4. Socially responsible lending

Credit cards are an evil instrument. They just are. If you make minimum payments you might still have $30k of principal sitting there waiting for you. That’s just horrible. So, the next generation of digital banks would offer products only when they were absolutely confident the client could repay based on extensive and detailed history. These loan products would have high minimum payments, the idea being at any point in time you’d be able to pay the thing off in say 2 years by making minimum payments. This isn’t just nice, this is socially responsible (read this as not evil!). You’d be able to be a bank and also not be evil because you would have much fewer staff, no spending on branches etc.
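To put a number on “high minimum payments”, the standard amortisation formula gives the fixed monthly payment that clears a balance in a set number of months. Here is a Python sketch; the $5,000 balance and 15% annual rate are made-up inputs, and 24 months is the “say 2 years” from above.

```python
def minimum_payment(balance, annual_rate, months=24):
    """Fixed monthly payment that clears `balance` in `months` payments."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return balance / months
    return balance * r / (1 - (1 + r) ** -months)


def remaining_balance(balance, annual_rate, payment, months):
    """Balance left after making `payment` each month for `months` months."""
    r = annual_rate / 12
    for _ in range(months):
        balance = balance * (1 + r) - payment
    return balance


payment = minimum_payment(5000, 0.15)
print(round(payment, 2), round(remaining_balance(5000, 0.15, payment, 24), 6))
```

Recompute the payment whenever the balance changes and the product can never leave someone stuck on interest-only repayments.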

5. Better fraud assessment

The falsification of PDFs is a trivial thing to do. I mean it could even be a school project for kids. It would be much, much harder to falsify transactions in a bank statement, you’d have to actually have some kind of real bank statement history with transactions to apply for a loan. So, it is possible for someone to open an account, create a history of a year or so of transactions, and then apply with fraudulent purposes, but it creates a barrier.

For corporate lending such as invoice finance all the risk is loaded in the first invoice, in the same way the risk of fraud is in the first loan for consumer lending. There is also a fine line between extreme credit risk and fraud in both commercial and consumer lending. They can look pretty similar, by that I mean the small business who is desperate to get cash from anywhere to keep things ticking over sometimes looks pretty similar to the fraudster who is desperate to get cash from anywhere.

If you can’t see a wage or salary, that’s something to be concerned about. So too if you see a heap of cash withdrawals.

6. Kindness to those in collections

Some people will have unfortunate things happen to them, they might not be able to make payments on their loan due to job loss, separation with a partner, death of primary breadwinner. I mean the list goes on, some people just struggle with budgeting but hopefully the PFM and real time credit score would help them to keep on track.

Now, here’s the problem. In collections any model will tell you the best way to get money out of people is to harass them as much as you can within the limits of the law. In fact you have to add a weighting to balance for the number of contacts into a collections model to come up with a non-trivial solution (kind of along the lines of “if I have only one action to take, what should my next action be?”). But this is still harassing people, which is powerfully evil and uncool.

What if you could dip into their bank statements, see what they can actually afford based on data and suggest 3 or 4 different repayment plans for them in a message to their mobile? What if it was a chatbot? What if it required no human intervention at all? It would be less invasive for people who fall behind on payments. I mean it’s better than < 2% contact rates and money walking out the door. I’m a big believer in people being generally good and that they will do the right thing. I also think some banks can be complete bastards when it comes to collections.
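A rough Python sketch of the “3 or 4 repayment plans” idea: work out what is left after essentials and offer plans at a few fractions of that surplus. The fractions, the inputs and the function name are all assumptions; a real system would derive income and essentials from the categorised statements.

```python
import math


def repayment_plans(arrears, monthly_income, monthly_essentials, shares=(0.25, 0.5, 0.75)):
    """Offer plans that each use a different share of the monthly surplus."""
    surplus = monthly_income - monthly_essentials
    if surplus <= 0:
        return []  # nothing affordable right now; this needs a different conversation
    plans = []
    for share in shares:
        payment = round(surplus * share, 2)
        plans.append({"monthly_payment": payment, "months": math.ceil(arrears / payment)})
    return plans


# Hypothetical customer: $1,200 behind, $3,000 income, $2,600 of essentials.
for plan in repayment_plans(1200, 3000, 2600):
    print(plan)
```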

7. Auto-approval of invoices for invoice finance/ supply chain finance

Invoice finance is horribly manual (even in the so-called tech companies). Their idea of automation is often just cracking the whip on those doing manual assessment, their idea of scale is generally just hiring more people and buying more whips.

What you’d like to see is a solution that hooks into the company bank account, looks at the transactions, classifies them and then automatically offers rates on specific invoice streams to suppliers for invoice finance / supply chain finance. That is, the AI system would be able to work out the regularity, the schedule length, the price and the terms. You could also have a suite of checks depending on how risky the invoice is.

Another option would be to manually assess the first invoice, but to then automate 2nd and subsequent invoices from the same supplier/ debtor combination. The fraud risk is really loaded on that first invoice.
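That second option is just a lookup on the supplier/debtor pair. Here is a minimal Python sketch, with hypothetical names and an in-memory set standing in for whatever store you would really use.

```python
def route_invoice(supplier, debtor, approved_pairs):
    """Send an invoice to manual review unless this pair has been approved before."""
    return "auto" if (supplier, debtor) in approved_pairs else "manual"


def record_manual_approval(supplier, debtor, approved_pairs):
    """Remember that the first invoice for this pair passed manual assessment."""
    approved_pairs.add((supplier, debtor))


approved = set()
print(route_invoice("Acme Supplies", "Big Retailer", approved))  # first invoice
record_manual_approval("Acme Supplies", "Big Retailer", approved)
print(route_invoice("Acme Supplies", "Big Retailer", approved))  # later invoices
```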

Caveats and stuff to work out

  • ATM withdrawals represent a black hole, it is impossible to see where this money is being spent. But, people who tend to use cash a lot are either really old school (so won’t bank with a digital bank) or are really dodgy.

  • Each bank has its own weird and wacky way of displaying transaction items in the bank statement. With time and resources you’d need to take a bank-centric approach to cleaning of the data to feed into the categorisation engine.

  • Bank statement categorization is a long tail problem. You can cover most of the big supermarket chains, retail shops etc pretty easily, but it is those weird and wacky businesses selling ivory back scratchers, pug cufflinks etc that cause this long tail. Different businesses come and go, so you need to set up a process of new transaction item discovery. This is similar to the “cold start” problem in online advertising when you might have 3 different ads starting and you don’t know which one users will prefer. In the same way you might have 3 different possible categorisations and you will have to work out which one is the correct one pretty quickly.

  • You will have to ask users to verify some transactions, but you’d have to nail it for 95% of transactions. If I upload my bank statements and you ask me to classify 100 different transactions I just won’t do it. In the same way you need the right balance between granularity and sensible transactions. At the extreme you could have a category for each unique line item or you could just have income and expenses as the two categories. So the mapping of categories is important (Groceries, Petrol, Taxis/ Ubers, Gambling etc).

  • You would need a team to do this, not really just a guy with a pug. Those people would need to be paid. If you were doing this as one person it’s a pain.

  • Access to some of the data can be a pain. Open data isn’t really open if it isn’t you know open? So you have to be a government agency to access this data. Web Scraping is definitely against the terms and conditions of the WEGA page, but I mean, you know churning through the pdfs might be the only real way to access a mapping of business name to ANZSIC code.

Check out line 1 and line 3 of the example WEGA report pdf. That’s the mapping of business name to business description that we are after. There are thousands of docs in pdf you’d need to scrape - possible but very painful. Definitely worth it though to put together a training dataset to kick start the categorisation engine.


Actually build the thing, if anyone has the team and resources to do it just go right ahead and do it. This one is bigger than one person can handle. You’d really need a team to make this happen. Hopefully this blueprint will help.

Here is the output of a simple jaro winkler check in R for the following:

library(dplyr)
library(stringdist)

ref <- c('Wages ABC Pty Ltd SEP')
words <- c('Wages ABC Pty Ltd JUN', 'Wages ABC Pty Ltd JUL', 'Wages ABC Pty Ltd AUG',
           'UBER Some Random Town 56437', 'Ice Cream Shop random digits 23456', 'Ice Cream Shop random digits 23456')

# upper-case both vectors so differences in case don't affect the score
ref <- toupper(ref)
words <- toupper(words)

# every word paired with the reference string
jw_data_input <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

# stringdist returns a distance, so 1 - distance gives a match score
jw_data_output <- jw_data_input %>%
  group_by(words) %>%
  mutate(match_score = 1 - stringdist(words, ref, method = "jw"))

# typical bimodal distribution
hist(jw_data_output$match_score, main = "Jaro Winkler Scores", col = "pink", xlab = "match score")

This distribution is typically what you see in these applications, once you change the reference and words vectors to upper case. Sketching ideas like this in R is quick and easy since you are really just pulling down packages, but for the real thing you’d go with a Python implementation. You could have a Flask or Django web app running the thing under the hood, but you’d probably need to incorporate React to bounce some borderline transaction line items back to the user for re-categorization.

Here’s a bit of Python code to extract the Legal Name and ANZSIC code from the WEGA PDF reports:

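import re
import PyPDF2

# PdfReader, .pages and .extract_text() are the current PyPDF2 names for the
# older PdfFileReader, getPage() and extractText()
with open('tempPublicReport_sthkgzi7ad.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    pageObj = pdfReader.pages[1]  # second page of the report (zero-indexed)
    page = pageObj.extract_text()

# the legal name sits between "Legal name" and "ABN"
legalName = re.search(r'Legal name(.*?)ABN', page).group(1)

# the ANZSIC division, code and description run up to the first "/"
ANZSIC = re.search(r'ANZSIC(.*?)/', page).group(1)

print("Legal Name: {} , ANZSIC Code: {}".format(legalName, ANZSIC))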

'Legal Name: Woolworths Ltd , ANZSIC Code: G Retail Trade4110 Supermarket and Grocery StoresBusiness'

Anyway, if anyone with a medium sized team wants to have a crack at it feel free and let me know how you go. If you want a second opinion I’m here to help.

All the best,