Can we detect fraud in the blockchain using machine learning? | by Noah Mukhtar | January 2023
An elaborate guide on how to catch scammers using an Ethereum dataset and machine learning
Since the advent of blockchain, it has never been easier for companies, banks and customers to trade goods and transfer money. In this new era of e-commerce, blockchain has served as an attractive alternative that bypasses traditional intermediaries. With it, however, criminals are discovering new ways to commit financial crime, and with the vast amount of data available today, we need to develop new ways to stop them.
Is fraud changing?
Fraudsters are constantly looking for new channels through which to commit crimes, and with the advent of the blockchain they have found a new way to launder money and commit fraud.
Bad actors cover their tracks through one of the community’s most widely recognized tokens: Ethereum.
Can Ethereum be exploited?
Ethereum’s blockchain technology has exploded in popularity over the past two years, despite having protocols that are “uniquely vulnerable to hacking” due to their open-source code, the large amount of assets they hold, and rapid growth that may have led to lapses in security best practices.
Is there an increase in crime?
A staggering $1.9 billion worth of cryptocurrency was stolen in the first seven months of 2022, 60% more than in the same period of the previous year.
“Decentralized finance” (DeFi) protocols (i.e. including Ethereum) were responsible for 17% of all funds sent from illicit wallets, and the ability to switch quickly between different types of cryptocurrencies has only helped money launderers.
Why do we need data science?
It is important to find hidden patterns in data to prevent fraudulent transactions from happening in the first place. This can be as simple as detecting transaction patterns that are unusual relative to common consumer behavior, or as complex as detecting when a hacker attempts to alter a block in the blockchain (i.e. tampering with a transaction and its corresponding hashes).
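To make the tampering case concrete, here is a minimal sketch of a hash-linked chain. It assumes a toy block layout: the field names `prev_hash` and `transactions` and the helpers `block_hash`, `build_chain` and `first_invalid_block` are illustrative and do not reflect Ethereum's actual block format. Changing a transaction changes that block's hash, which then no longer matches the `prev_hash` stored in the next block.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's contents deterministically with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_chain(transaction_batches):
    """Link blocks by storing each predecessor's hash in the next block."""
    chain = []
    prev = "0" * 64  # placeholder hash for the first block
    for batch in transaction_batches:
        block = {"prev_hash": prev, "transactions": batch}
        chain.append(block)
        prev = block_hash(block)
    return chain


def first_invalid_block(chain):
    """Return the index of the first block whose prev_hash does not match, or None."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return i
    return None


chain = build_chain([["a->b:5"], ["b->c:2"], ["c->a:1"]])
print(first_invalid_block(chain))          # None: untouched chain

chain[1]["transactions"][0] = "b->c:200"   # tamper with block 1
print(first_invalid_block(chain))          # 2: block 2 no longer links to block 1
```

The check only compares stored and recomputed hashes, so it detects that something changed and where the chain breaks, not what the original value was.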
The following steps explain the approach in data construction:
Our dataset is taken from Ethereum blockchain records and contains 9,841 rows, of which 7,662 (i.e. ~78%) are legitimate and the remaining 2,179 are fraudulent.
Problem: Imbalanced data set
Our dataset is highly imbalanced. A model trained on it sees far more legitimate transactions than fraudulent ones, so it becomes better at identifying legitimate transactions and less effective at identifying new fraud cases.
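The arithmetic behind this problem can be sketched with the row counts given for the dataset. The baseline below is hypothetical: it labels every row as legitimate, which yields a respectable-looking accuracy while catching no fraud at all. This is why accuracy alone can be misleading on imbalanced data.

```python
total_rows = 9841        # rows in the dataset (from the article)
legitimate_rows = 7662   # rows labelled legitimate (from the article)
fraud_rows = total_rows - legitimate_rows

# Hypothetical baseline: predict "legitimate" for every row.
baseline_accuracy = legitimate_rows / total_rows
baseline_recall = 0 / fraud_rows  # no fraudulent row is ever flagged

print(fraud_rows)                    # 2179
print(round(baseline_accuracy, 3))   # 0.779
print(baseline_recall)               # 0.0
```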
Trade-off: Recall vs. precision
Our goal is to maximize recall at the expense of some precision, since flagging a legitimate transaction as “fraud” is less financially damaging than missing an actual fraud.
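As a small illustration of this trade-off, the sketch below computes precision and recall at two decision thresholds. The labels and scores are made-up toy values, and `precision_recall` is a hypothetical helper rather than part of the original analysis. Lowering the threshold flags more transactions, which raises recall at the cost of precision.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when scores >= threshold are flagged as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(labels, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(labels, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(labels, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 1 = fraud, 0 = legitimate; scores stand in for predicted fraud probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.3):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.3: precision=0.57, recall=1.00
```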
Solution:
Balance the classes by upsampling the minority class (fraudulent transactions) so that it has the same frequency as the majority (non-fraud) class.
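A minimal sketch of this kind of upsampling, using only the standard library, might look as follows. The `is_fraud` key and the `upsample_minority` helper are assumptions made for illustration; the article does not show the actual column name or code. Minority rows are drawn at random with replacement until both classes are the same size.

```python
import random
from collections import Counter


def upsample_minority(rows, label_key="is_fraud", seed=42):
    """Duplicate minority-class rows at random until both classes are the same size."""
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([fraud, legit], key=len)
    # Sample with replacement to fill the gap between the two classes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


# Toy rows: 2 fraudulent, 8 legitimate.
rows = [{"id": i, "is_fraud": 1 if i < 2 else 0} for i in range(10)]
balanced = upsample_minority(rows)
print(Counter(r["is_fraud"] for r in balanced))  # Counter({0: 8, 1: 8})
```

In a scikit-learn workflow, `sklearn.utils.resample` with `replace=True` is commonly used for the same purpose. One common caution is to upsample only the training portion of the data, so that duplicated rows cannot appear in both the training and test sets.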
Classification models
The dataset was split into training and test sets, to train our models and objectively measure their performance.
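A split of this kind can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the random placeholder features and the use of `stratify` are assumptions for illustration; the article does not state which settings were used. Stratifying keeps the share of fraud cases the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # placeholder features
y = np.array([1] * 200 + [0] * 800)    # placeholder labels, 20% fraud

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)     # (800, 5) (200, 5)
print(y_train.mean(), y_test.mean())   # 0.2 0.2
```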
A number of different algorithms were trained to classify each transaction as fraudulent or legitimate.
The models run were Logistic Regression, Random Forest, LGBM Classifier, Multi-layer Perceptron (MLP), XGB, KNN, SVM and AdaBoost.
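Comparing several classifiers on the same split can be sketched as a simple loop. This example uses synthetic data from `make_classification` as a stand-in for the Ethereum dataset and only three of the listed models; LightGBM and XGBoost are separate packages and are left out here to keep the sketch dependent on scikit-learn alone. The settings shown are illustrative, not the author's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with roughly an 80/20 class balance.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and record recall on the held-out test set.
recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    recalls[name] = recall_score(y_test, model.predict(X_test))

for name, score in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: recall = {score:.3f}")
```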
The LGBM classifier performed best on this classification task, with high accuracy on both the training and test sets. To improve performance further, we used hyperparameter tuning, a technique that adjusts the model’s settings to reduce over- and under-fitting.
Using randomized search, we found the best-performing parameters for our LGBM classifier, which increased accuracy from 98.6% to 99.03%.
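A randomized search can be sketched with scikit-learn's `RandomizedSearchCV`. To keep the example dependent on scikit-learn alone, it swaps in `GradientBoostingClassifier` as a stand-in for the LGBM classifier and runs on synthetic data; LightGBM's `LGBMClassifier` follows the scikit-learn estimator interface, so it could be substituted with its own parameter names. The parameter ranges below are illustrative guesses, not the values used in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Illustrative search space; not the author's actual ranges.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # try 5 random combinations instead of all 36
    scoring="accuracy",
    cv=3,                # 3-fold cross-validation per combination
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Randomized search samples a fixed number of combinations rather than trying every one, which keeps the cost predictable when the search space is large.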
Feature importance
In this study, we aimed to understand how much each feature contributes to identifying fraudulent transactions, using the best model we developed.
To achieve this, we produced a feature importance visualization, which gave us insight into the relative importance of each feature in the model.
The visualization revealed that the two most important features for identifying fraudulent transactions are:
(1) “Time Diff between first and last (Mins)”: Time difference, in minutes, between the account’s first and last transaction.
(2) “Unique received from addresses”: Total unique addresses from which the account received transactions.
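Ranking features by importance can be sketched as follows. This example uses a random forest on synthetic data in which the label depends only on the first column, so the expected ranking is known in advance. The feature names are hypothetical placeholders. Tree ensembles in scikit-learn expose a `feature_importances_` attribute after fitting, and LightGBM's `LGBMClassifier` exposes an attribute of the same name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["time_diff_mins", "unique_received_from", "noise"]  # hypothetical names
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)   # synthetic label driven only by the first column

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The same ranking can be passed to a bar chart to produce a visualization like the one described here.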
(1) “Time Diff between first and last (Mins)”
“Time Diff between first and last (Mins)” can be a good indication of fraud on the blockchain because it can help detect suspicious activity that happens within a short time. For example, if a large number of transactions are made within a very short time frame, it may indicate that the transactions are being made by a bot or automated script rather than by a human.
Additionally, it could be a sign of a coordinated attack in which many transactions are made at the same time to flood the network with fake transactions.
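A feature like this can be derived from an account's transaction timestamps. The sketch below assumes the timestamps are already available as `datetime` objects; `time_diff_minutes` and the two example accounts are hypothetical.

```python
from datetime import datetime


def time_diff_minutes(timestamps):
    """Minutes between an account's earliest and latest transaction."""
    if not timestamps:
        return 0.0
    return (max(timestamps) - min(timestamps)).total_seconds() / 60


# Hypothetical timestamps for two accounts.
burst_account = [
    datetime(2023, 1, 1, 12, 0),
    datetime(2023, 1, 1, 12, 3),
    datetime(2023, 1, 1, 12, 7),
]
long_lived_account = [datetime(2022, 1, 1, 9, 0), datetime(2023, 1, 1, 9, 0)]

print(time_diff_minutes(burst_account))       # 7.0
print(time_diff_minutes(long_lived_account))  # 525600.0
```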
(2) “Unique received from addresses”
“Unique received from addresses” can be a good indication of fraud on the blockchain because it can help detect suspicious activity involving multiple addresses.
For example, if a single account receives transactions from many different addresses, it may indicate that the funds are being moved by someone trying to evade detection. It can also indicate a group of individuals working together to commit fraud, or a possible money laundering operation.
Also, having many different “from addresses” as sources of funds can be a sign that transactions are being executed by an entity that may not have the proper authorization, or by an entity attempting to anonymize its identity.
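Counting unique sender addresses per receiving account can be sketched with a dictionary of sets. The `(sender, receiver)` pair format, the placeholder addresses and the `unique_senders_per_account` helper are assumptions for illustration.

```python
from collections import defaultdict


def unique_senders_per_account(transfers):
    """Map each receiving address to the number of distinct senders."""
    senders = defaultdict(set)
    for sender, receiver in transfers:
        senders[receiver].add(sender)
    return {receiver: len(froms) for receiver, froms in senders.items()}


# Hypothetical (sender, receiver) pairs; the addresses are placeholders.
transfers = [
    ("0xA", "0xHUB"),
    ("0xB", "0xHUB"),
    ("0xC", "0xHUB"),
    ("0xA", "0xHUB"),   # repeat sender, counted once
    ("0xA", "0xB"),
]
print(unique_senders_per_account(transfers))  # {'0xHUB': 3, '0xB': 1}
```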
These findings can help organizations focus their resources on these specific attributes during transaction monitoring, ultimately leading to more efficient and effective fraud detection.
Furthermore, this kind of feature importance visualization can be useful for other researchers and practitioners in fraud detection, and provides a valuable starting point for further research and development.
https://www.linkedin.com/in/nmukhtar/
GitHub code
https://github.com/NoahMMA