About Phishing Detection & Education

Overview

The Project

ImageAware+ addresses a specific gap in traditional phishing detection: most existing tools analyse URLs and email headers, but completely miss phishing content embedded inside images. A fake Geek Squad invoice or a DocuSign impersonation email that delivers its content as a graphic bypasses text-based detectors entirely.

This system combines Optical Character Recognition (OCR) to read image content, email header analysis to detect spoofing and authentication failures, and a rule-based scoring engine with 29 indicators to produce an explainable risk verdict for any submitted file.

Unlike a black-box ML classifier, the system makes every point in the final score traceable to a specific indicator, which suits forensic analysis where an analyst must understand and justify each verdict.
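To make the header analysis concrete, here is a minimal sketch using only Python's standard library. It is an illustration, not the project's actual code: the function name `header_findings` and the wording of the findings are assumptions, and it covers only two of the header checks (reply-to mismatch and SPF/DKIM/DMARC failures reported in `Authentication-Results`).

```python
from email import message_from_string
from email.utils import parseaddr


def header_findings(raw_email: str) -> list[str]:
    """Return human-readable findings for two common header red flags.

    Hypothetical helper for illustration; not taken from ImageAware+.
    """
    msg = message_from_string(raw_email)
    findings = []

    # Reply-To mismatch: replies would go to a different domain than the sender's.
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    from_domain = from_addr.rpartition("@")[2].lower()
    reply_domain = reply_addr.rpartition("@")[2].lower()
    if reply_addr and from_domain != reply_domain:
        findings.append(f"reply-to mismatch: {from_domain} vs {reply_domain}")

    # Authentication failures as recorded by the receiving mail server.
    auth = " ".join(msg.get_all("Authentication-Results", [])).lower()
    for mechanism in ("spf", "dkim", "dmarc"):
        if f"{mechanism}=fail" in auth:
            findings.append(f"{mechanism} failed")
    return findings
```

Each finding is a plain string, so it can be shown to an analyst next to the points it contributed.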

Architecture

Analysis Pipeline

When a file is submitted, it passes through five stages regardless of whether it's an image or email file.

⬆

Input

Image or .eml file uploaded

🔍

Extract

OCR, href URLs, QR codes, email headers

🌐

Threat Intel

VirusTotal, URLScan, PhishTank

📊

Score

29 indicators across 8 categories

📄

Report

JSON + PDF forensic output

For email files (.eml), the pipeline always scores from body text first, then additionally analyses any embedded images, taking the higher of the two scores. This ensures no signal is lost regardless of how the phishing content is delivered.
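The "higher of the two scores" rule can be sketched in a few lines. This is an assumed shape rather than the project's implementation: the function name `combined_email_score` and the `source` label are hypothetical, and how the real pipeline handles several embedded images is not stated, so the sketch simply takes the maximum over all of them.

```python
def combined_email_score(body_score: int, image_scores: list[int]) -> tuple[int, str]:
    """Return (score, source), where source names the input that produced the score.

    Hypothetical helper: the body is listed first, so it wins ties.
    """
    candidates = [("body", body_score)]
    candidates += [(f"image[{i}]", s) for i, s in enumerate(image_scores)]
    source, score = max(candidates, key=lambda c: c[1])
    return score, source
```

Returning the source alongside the score keeps the verdict explainable: the report can say whether the body text or an embedded image drove the result.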

Scoring Engine

29 Detection Indicators

The scoring engine evaluates evidence across eight categories. Scores range from 0–100: Low (0–34), Medium (35–69), High (70–100). Every contribution is fully traceable.

Threat Intelligence

VirusTotal malicious detections
VirusTotal suspicious detections

URL Signals

URLs present in content
Multiple URLs detected
QR code found
QR code contains URL
Credential harvesting URL paths

Domain Intelligence

Suspicious TLDs (.xyz, .top)
Lookalike/typosquat domains
Newly registered domains

Content Layout

Invoice layout vocabulary
Structured invoice table

Social Engineering

Urgency indicators
Legal threat / DMCA language
Sextortion indicators
Delivery scam indicators
Job scam indicators

Brand & Identity

Brand impersonation
Display name spoofing

Email Security

SPF / DKIM / DMARC failures
Reply-to mismatch

Financial / Credential

Financial lure vocabulary
Credential harvesting terms
Tech support scam terms
Banking keyword cluster
BEC indicators

Technical

Phone numbers detected
Hidden hyperlinks (OCR miss)
Low OCR confidence
Keyword density hits
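A rule-based engine of this kind can be sketched as a table of weighted indicators plus a list of contributions kept for the report. The Low/Medium/High thresholds below come from the text above; everything else is an assumption made for illustration: the indicator keys, the weights, the `Contribution` type, and the decision to cap the total at 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    indicator: str
    category: str
    points: int


# Hypothetical weights for a handful of indicators; the real values are not published here.
WEIGHTS = {
    "urgency": ("Social Engineering", 10),
    "brand_impersonation": ("Brand & Identity", 15),
    "spf_dkim_dmarc_failure": ("Email Security", 20),
    "suspicious_tld": ("Domain Intelligence", 15),
    "virustotal_malicious": ("Threat Intelligence", 40),
}


def score(fired: set[str]) -> tuple[int, str, list[Contribution]]:
    """Return (total, band, contributions) for the indicators that fired."""
    contributions = [
        Contribution(name, *WEIGHTS[name]) for name in sorted(fired) if name in WEIGHTS
    ]
    # Assumption: the total is capped so it stays within the 0-100 range.
    total = min(100, sum(c.points for c in contributions))
    if total >= 70:
        band = "High"
    elif total >= 35:
        band = "Medium"
    else:
        band = "Low"
    return total, band, contributions
```

Because the contributions list is returned with the total, every point in the verdict can be attributed to a named indicator and category.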

Evaluation

Evaluation Results

The system was formally evaluated on 300 labelled samples: 150 phishing emails from the Nazario 2025 corpus and 150 legitimate emails from the TREC 2007 ham corpus.

Pipeline	Threshold	Precision	Recall	F1	False Positive Rate	TP / FP / TN / FN
Email (.eml)	High ≥70	n/a	0.00%	0.000	0.00%	0 / 0 / 150 / 150
Email (.eml)	Medium ≥35	80.95%	11.33%	0.199	2.67%	17 / 4 / 146 / 133
Image (.png)	High ≥70	n/a	0.00%	0.000	0.00%	0 / 0 / 10 / 12
Image (.png)	Medium ≥35	100%	58.33%	0.737	0.00%	7 / 0 / 10 / 5
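The figures in the table follow directly from the confusion counts. The sketch below recomputes them; the function name `metrics` is an assumption, and precision is returned as `None` when nothing was flagged (TP + FP = 0), since the ratio is undefined in that case.

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined if nothing was flagged
    recall = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    # F1 written in terms of counts, so it is 0 rather than undefined when TP = 0.
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else None
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Email pipeline at the Medium threshold: TP=17, FP=4, TN=146, FN=133.
email_medium = metrics(17, 4, 146, 133)
```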

Understanding the Results

At the Medium threshold, the image pipeline achieved 100% precision with a 0% false positive rate on its 22 test images (12 phishing, 10 legitimate): every image flagged as phishing was genuinely phishing, and no legitimate image was incorrectly flagged. Recall of 58.33% reflects the system's dependency on OCR. Invoice-style phishing with readable text scored consistently in the Medium-High range, while image-only graphics where OCR extracts nothing scored near zero.

For the email pipeline, 80.95% precision at the Medium threshold with a 2.67% false positive rate demonstrates the system's precision-first design. Analysis of the 133 false negatives revealed four systematic failure patterns: attachment-based phishing (payload in PDF/Word files the system cannot read), image-rendered HTML emails with near-empty plain text bodies, BEC-style sparse-content emails designed to evade vocabulary-based detection, and emails scoring just below the Medium threshold.

Evaluation Datasets

Nazario 2025 Phishing Corpus

150 samples phishing

Hand-classified real phishing emails collected from the personal inbox of security researcher Jose Nazario. Widely cited in academic phishing detection research. Available at monkey.org/~jose/phishing/.

TREC 2007 Ham Corpus

150 samples benign

Legitimate emails from the TREC 2007 spam track evaluation. Because the corpus dates from 2007, its emails carry SPF/DKIM context that older corpora lack. It is a standard academic baseline for spam/ham classification.

Technology

Tech Stack

🐍

Python 3.11

Core application language

🌶️

Flask

Web framework & REST API

👁️

Tesseract OCR

Multi-pass text extraction

🖼️

OpenCV

Image preprocessing & analysis

🛡️

VirusTotal API

URL threat intelligence

🔍

URLScan.io

URL scanning & history

📄

ReportLab

PDF forensic report generation

🐳

Docker / Render

Containerised cloud deployment

Honest Assessment

Limitations

No detection system is perfect. These are the known limitations of ImageAware+ and the areas identified for future improvement.

OCR Dependency

When phishing content is embedded entirely as a graphic with no readable text, OCR returns nothing and content-based indicators cannot fire. Image-only attacks with no visible text score near zero.

Attachment Analysis

Phishing delivered entirely within PDF or Word attachments is undetectable. The system analyses email body content and embedded images, not file attachments.

API Rate Limits

VirusTotal's free tier allows 500 requests/day. High-volume scanning will exhaust this quota. Results for unknown or recently registered domains may be inconclusive.
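One way to avoid exhausting the quota mid-scan is to count lookups locally and skip threat-intelligence calls once the daily budget is spent. The sketch below is an assumption about how this could be done, not a description of the project's code: the `DailyQuota` class is hypothetical, and it resets on the local calendar date, which may not match when the provider's quota actually resets.

```python
from datetime import date


class DailyQuota:
    """Track how many API calls have been made today (hypothetical helper)."""

    def __init__(self, limit: int = 500, today=date.today):
        self.limit = limit
        self._today = today  # injectable clock, which makes the class easy to test
        self._day = today()
        self._used = 0

    def try_acquire(self) -> bool:
        """Reserve one call; return False if today's budget is already spent."""
        current = self._today()
        if current != self._day:
            # A new day has started: reset the counter.
            self._day, self._used = current, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

A caller would check `try_acquire()` before each lookup and, when it returns `False`, score the sample without the threat-intelligence indicators rather than fail the whole analysis.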

Unvalidated Weights

Indicator weights were set by reasoning rather than empirical optimisation. A larger labelled dataset would enable data-driven calibration of the scoring thresholds.

English Language Focus

Keyword indicators are primarily in English. Phishing in Irish, Spanish, Portuguese, or other languages may score lower due to vocabulary mismatch.

Sparse BEC Detection

Business Email Compromise emails are deliberately understated. Without rich vocabulary to trigger indicators, BEC attacks may score below the Medium threshold even when genuinely malicious.

Author

About the Project

ImageAware+ was developed by Lorcan Kelly Zazera as a Final Year Project for the BSc (Hons) in Cybercrime & IT Security at SETU Carlow, 2026.

The project source code is available on GitHub. Issues, suggestions, and contributions are welcome.


References

Sources & Citations

[1] Anti-Phishing Working Group (APWG). Phishing Activity Trends Report, Q1 2025. Available at: apwg.org

[2] Verizon. 2023 Data Breach Investigations Report (DBIR). Available at: verizon.com/business/resources/reports/dbir

[3] FBI Internet Crime Complaint Center (IC3). 2023 Internet Crime Report. Available at: ic3.gov

[4] Nazario, J. Phishing Corpus 2025. Available at: monkey.org/~jose/phishing/

[5] TREC 2007 Spam Track. TREC 2007 Spam Track Public Corpus. Available at: trec.nist.gov

[6] Tesseract OCR. Tesseract Open Source OCR Engine. Available at: github.com/tesseract-ocr/tesseract

[7] VirusTotal. VirusTotal Public API v3. Available at: virustotal.com

[8] URLScan.io. URLScan.io API. Available at: urlscan.io
