Detecting & Classifying Fraudulent Ethereum Accounts

Project Overview

This project develops a unified machine-learning framework for identifying fraudulent activity on the Ethereum blockchain. By analyzing on-chain transaction patterns and network relationships, it combines unsupervised anomaly detection with supervised classification to flag suspicious accounts. The end-to-end solution is exposed via an interactive Streamlit application, allowing users to explore anomalies, model predictions, and network graphs in real time.

System Architecture

Data Extraction Layer

Web3.py: Connects to Ethereum nodes (e.g., via Infura or Alchemy) to fetch transaction data.
Etherscan API: Supplements on-chain data with metadata (e.g., internal transactions, contract events).
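
As a rough illustration of the extraction step, the sketch below flattens the transactions of one block into plain records that can later be loaded into a DataFrame. The helper names (flatten_transaction, fetch_block_records) and the chosen fields are assumptions made for this example, not the project's actual code; the only Web3.py call relied on is w3.eth.get_block with full_transactions=True.

```python
from typing import Any, Dict, List, Mapping

WEI_PER_ETHER = 10**18


def flatten_transaction(tx: Mapping[str, Any], block_timestamp: int) -> Dict[str, Any]:
    """Turn one transaction mapping into a flat record (hypothetical helper)."""
    tx_hash = tx["hash"]
    return {
        # Web3.py returns hashes as bytes-like objects; store them as hex text
        "hash": tx_hash.hex() if isinstance(tx_hash, (bytes, bytearray)) else str(tx_hash),
        "from": tx["from"],
        "to": tx.get("to"),  # None for contract-creation transactions
        "value_eth": tx["value"] / WEI_PER_ETHER,  # on-chain values are in wei
        "gas": tx["gas"],
        "timestamp": block_timestamp,  # transactions inherit the block's timestamp
    }


def fetch_block_records(w3: Any, block_number: int) -> List[Dict[str, Any]]:
    """Fetch one block with full transaction objects and flatten each of them."""
    block = w3.eth.get_block(block_number, full_transactions=True)
    return [flatten_transaction(tx, block["timestamp"]) for tx in block["transactions"]]
```

With a connected Web3 instance, pd.DataFrame(fetch_block_records(w3, block_number)) would give one row per transaction. Rate limits and retries are left out of the sketch.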

Feature Engineering Module

Pandas & NumPy: Cleans and aggregates transaction histories.
NetworkX: Computes graph-based metrics (centrality, clustering) to capture network effects.
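
One way the aggregation and graph metrics could fit together is sketched below. It assumes a transaction table with "from", "to", and "value" columns; the function name address_features and the specific aggregates are illustrative choices rather than the project's confirmed feature set.

```python
import networkx as nx
import pandas as pd


def address_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Build one feature row per address from a transaction table (illustrative)."""
    # Contract-creation rows have no recipient; drop them for this sketch
    tx = tx.dropna(subset=["from", "to"])

    # Per-address activity as sender and as receiver
    sent = tx.groupby("from").agg(sent_count=("value", "size"), sent_total=("value", "sum"))
    received = tx.groupby("to").agg(recv_count=("value", "size"), recv_total=("value", "sum"))
    feats = sent.join(received, how="outer").fillna(0.0)

    # Directed transfer graph: an edge points from sender to recipient
    graph = nx.from_pandas_edgelist(tx, source="from", target="to", create_using=nx.DiGraph)

    # Graph metrics are dicts keyed by address, so they align on the index
    feats["degree_centrality"] = pd.Series(nx.degree_centrality(graph))
    feats["clustering"] = pd.Series(nx.clustering(graph.to_undirected()))
    feats.index.name = "address"
    return feats
```

The clustering coefficient is computed on the undirected view of the graph here, which is a simplification; a directed variant may suit some analyses better.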

Modeling Pipeline

Isolation Forest & Autoencoder: Unsupervised models to detect anomalous transaction patterns.
Random Forest & XGBoost: Supervised classifiers trained on labeled fraud samples to assign fraud scores.
Ensemble Framework: Merges unsupervised anomaly scores with classifier outputs for robust predictions.
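
The write-up does not say how the two score types are merged, so the following is only a minimal sketch of one plausible approach: rescale the Isolation Forest score to the 0-1 range and take a weighted average with the classifier probability. The function name ensemble_score and the weight parameter are assumptions.

```python
import numpy as np


def ensemble_score(anomaly_score, fraud_prob, weight=0.5):
    """Blend an anomaly score with a classifier probability (assumed scheme).

    anomaly_score follows the scikit-learn decision_function convention,
    where lower values mean more anomalous.
    """
    anomaly_score = np.asarray(anomaly_score, dtype=float)
    fraud_prob = np.asarray(fraud_prob, dtype=float)

    # Flip the sign so that higher means more suspicious, then min-max scale
    flipped = -anomaly_score
    span = flipped.max() - flipped.min()
    if span > 0:
        anomaly_norm = (flipped - flipped.min()) / span
    else:
        anomaly_norm = np.zeros_like(flipped)  # all scores identical

    # Weighted average of the two signals, both now on a 0-1 scale
    return weight * anomaly_norm + (1.0 - weight) * fraud_prob
```

Min-max scaling depends on the batch being scored, so in practice the scaling bounds would likely be fixed on a reference set rather than recomputed per call.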

Deployment Interface

Streamlit: Hosts the interactive dashboard, enabling dynamic filtering, threshold adjustments, and network visualizations.

Key Challenges Solved

Pseudonymous Data
Extracted rich behavioral features from address-level activity despite the lack of identity labels.
Scalability
Processed over half a million transactions through vectorized Pandas pipelines and batch inference.
False-Positive Control
Tuned ensemble thresholds to keep false alarms below 5% while preserving recall.
Model Integration
Combined unsupervised and supervised approaches in a single evaluation pipeline.
Interactive Reporting
Delivered real-time insights via a user-friendly web app, accelerating investigation workflows.
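
The false-positive control described above could be approached along the following lines. This sketch assumes that "false alarms" refers to the false-positive rate measured on a labeled validation set, and that an account is flagged when its score is strictly above the threshold; the function name threshold_for_fpr is hypothetical.

```python
import numpy as np


def threshold_for_fpr(scores, labels, max_fpr=0.05):
    """Return (threshold, fpr, recall) for the lowest threshold within the FPR cap."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    negatives = scores[~labels]
    positives = scores[labels]
    if negatives.size == 0 or positives.size == 0:
        raise ValueError("need at least one fraud and one non-fraud example")

    # np.unique sorts ascending, so the first threshold that satisfies the
    # cap is the lowest one, which keeps recall as high as possible
    for t in np.unique(scores):
        fpr = np.mean(negatives > t)
        if fpr <= max_fpr:
            recall = np.mean(positives > t)
            return float(t), float(fpr), float(recall)

    # Not reachable: at the maximum score nothing is flagged, so fpr is 0
    raise RuntimeError("no threshold satisfied the cap")
```

Choosing the threshold on a held-out set, rather than on the training data, keeps the measured false-positive rate closer to what would be seen on new accounts.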

Implementation Details

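from web3 import Web3
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # scikit-learn-compatible alternative to the random forest
import streamlit as st

# Connect to an Ethereum node (replace <KEY> with your Infura API key)
w3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io/v3/<KEY>"))
tx = w3.eth.get_transaction("0x...")  # look up a single transaction by hash

# Feature engineering example
df = pd.DataFrame([...])  # transaction records
df['hour'] = pd.to_datetime(df['timestamp'], unit='s').dt.hour
# 'degree_centrality' is assumed to be precomputed in the NetworkX step
X = df[['value', 'hour', 'gas', 'degree_centrality']]

# Unsupervised detection (lower decision_function values are more anomalous)
iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
df['anomaly_score'] = iso.decision_function(X)

# Supervised classification
# 'is_fraud' is an assumed name for the 0/1 label column
X_train, X_test, y_train, y_test = train_test_split(
    X, df['is_fraud'], test_size=0.2, stratify=df['is_fraud'], random_state=42
)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
# Scores cover all rows for display; evaluate the model on X_test only
df['fraud_prob'] = rf.predict_proba(X)[:, 1]

# Streamlit app
st.title("Ethereum Fraud Detection Dashboard")
st.dataframe(df[['from', 'to', 'value', 'fraud_prob']])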