About ImageAware+

A hybrid phishing detection system combining OCR, computer vision, email analysis, and threat intelligence, built as a Final Year Project at SETU Carlow.

The Project

ImageAware+ addresses a specific gap in traditional phishing detection: most existing tools analyse URLs and email headers, but completely miss phishing content embedded inside images. A fake Geek Squad invoice or a DocuSign impersonation email that delivers its content as a graphic bypasses text-based detectors entirely.

This system combines Optical Character Recognition (OCR) to read image content, email header analysis to detect spoofing and authentication failures, and a rule-based scoring engine with 29 indicators to produce an explainable risk verdict for any submitted file.

Unlike black-box ML classifiers, every point in the final score is traceable to a specific indicator, making the system suitable for forensic analysis where an analyst needs to understand and justify each verdict.
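A minimal sketch of what a traceable rule-based score could look like. The indicator names, weights, and the `score_evidence` helper are hypothetical and not taken from the ImageAware+ source; the sketch only illustrates that each point in the total can be attributed to a named indicator and the evidence that triggered it.

```python
from dataclasses import dataclass

# Hypothetical indicator weights, for illustration only.
WEIGHTS = {
    "urgency_language": 10,
    "brand_impersonation": 20,
    "spf_fail": 15,
}


@dataclass(frozen=True)
class Contribution:
    indicator: str  # which rule fired
    points: int     # how much it added to the score
    evidence: str   # what in the sample triggered it


def score_evidence(findings: dict[str, str]) -> tuple[int, list[Contribution]]:
    """Return a total capped at 100 and the per-indicator breakdown behind it."""
    contributions = [
        Contribution(name, WEIGHTS[name], evidence)
        for name, evidence in findings.items()
        if name in WEIGHTS
    ]
    total = min(100, sum(c.points for c in contributions))
    return total, contributions


total, trail = score_evidence({
    "urgency_language": "matched 'act within 24 hours'",
    "spf_fail": "Authentication-Results: spf=fail",
})
for c in trail:
    print(f"+{c.points:>2}  {c.indicator}: {c.evidence}")
print("total:", total)
```

Because the breakdown is returned alongside the total, a report generator can list each contribution without recomputing anything.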

Analysis Pipeline

When a file is submitted, it passes through five stages regardless of whether it's an image or email file.

  • Input: image or .eml file uploaded
  • Extract: OCR, href URLs, QR codes, email headers
  • Threat Intel: VirusTotal, URLScan, PhishTank
  • Score: 29 indicators across 8 categories
  • Report: JSON + PDF forensic output

For email files (.eml), the pipeline always scores from body text first, then additionally analyses any embedded images, taking the higher of the two scores. This ensures no signal is lost regardless of how the phishing content is delivered.
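The "higher of the two scores" rule can be sketched in a few lines. The function name and the choice to take the best score when an email embeds several images are assumptions made for illustration, not details confirmed by the project.

```python
def combined_email_score(body_score: int, image_scores: list[int]) -> int:
    """Return the higher of the body-text score and the best embedded-image score.

    Assumption: when an email embeds several images, the highest image score
    is the one compared against the body score.
    """
    return max([body_score, *image_scores])


# The body text looks benign, but one embedded image carries the lure.
print(combined_email_score(12, [40, 5]))
# No embedded images: the body score stands on its own.
print(combined_email_score(30, []))
```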

29 Detection Indicators

The scoring engine evaluates evidence across eight categories. Scores range from 0 to 100 and map to three bands: Low (0–34), Medium (35–69), High (70–100). Every contribution is fully traceable.
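The band boundaries above translate directly into a small lookup. The function name `risk_band` is hypothetical; only the thresholds come from the text.

```python
def risk_band(score: int) -> str:
    """Map a 0-100 score to the Low / Medium / High bands described above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 70:
        return "High"
    if score >= 35:
        return "Medium"
    return "Low"


for s in (0, 34, 35, 69, 70, 100):
    print(s, risk_band(s))
```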

Threat Intelligence

  • VirusTotal malicious detections
  • VirusTotal suspicious detections

URL Signals

  • URLs present in content
  • Multiple URLs detected
  • QR code found
  • QR code contains URL
  • Credential harvesting URL paths
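As a rough illustration of the URL signals, the sketch below pulls URLs out of extracted text and checks their paths for credential-related tokens. The regular expression, the token list, and the signal names are assumptions for this example; the project's actual patterns are not documented here.

```python
import re
from urllib.parse import urlsplit

# Simple URL matcher; stops at whitespace, quotes, and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Hypothetical path tokens that suggest a credential-harvesting page.
CREDENTIAL_PATH_TOKENS = ("login", "signin", "verify", "password", "account")


def extract_urls(text: str) -> list[str]:
    """Find http(s) URLs and trim trailing sentence punctuation."""
    return [match.rstrip(".,;:)!?") for match in URL_PATTERN.findall(text)]


def has_credential_path(url: str) -> bool:
    """True if the URL path contains any credential-related token."""
    path = urlsplit(url).path.lower()
    return any(token in path for token in CREDENTIAL_PATH_TOKENS)


text = "Your invoice is ready. Confirm at http://billing.example.top/account/verify.php today."
urls = extract_urls(text)
signals = {
    "urls_present": bool(urls),
    "multiple_urls": len(urls) > 1,
    "credential_path": any(has_credential_path(u) for u in urls),
}
print(urls, signals)
```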

Domain Intelligence

  • Suspicious TLDs (.xyz, .top)
  • Lookalike/typosquat domains
  • Newly registered domains

Content Layout

  • Invoice layout vocabulary
  • Structured invoice table

Social Engineering

  • Urgency indicators
  • Legal threat / DMCA language
  • Sextortion indicators
  • Delivery scam indicators
  • Job scam indicators

Brand & Identity

  • Brand impersonation
  • Display name spoofing

Email Security

  • SPF / DKIM / DMARC failures
  • Reply-to mismatch
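Both email security checks can be sketched with Python's standard `email` package. The sample message, the regular expression, and the decision to treat `fail` and `softfail` as failures are assumptions for illustration. The sketch reads the `Authentication-Results` header written by the receiving server and does not perform SPF, DKIM, or DMARC validation itself.

```python
import re
from email import message_from_string
from email.message import Message
from email.utils import parseaddr

# Matches "spf=fail", "dkim=pass", "dmarc=none", and so on.
AUTH_RESULT = re.compile(r"\b(spf|dkim|dmarc)=(\w+)", re.IGNORECASE)
FAILURE_RESULTS = {"fail", "softfail"}  # assumption: which results count as failures


def auth_failures(msg: Message) -> list[str]:
    """List the mechanisms reported as failed in Authentication-Results headers."""
    failures = []
    for header in msg.get_all("Authentication-Results", []):
        for mechanism, result in AUTH_RESULT.findall(header):
            if result.lower() in FAILURE_RESULTS:
                failures.append(mechanism.lower())
    return failures


def reply_to_mismatch(msg: Message) -> bool:
    """True if Reply-To points at a different domain than From."""
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    if not from_addr or not reply_addr:
        return False
    from_domain = from_addr.rsplit("@", 1)[-1].lower()
    reply_domain = reply_addr.rsplit("@", 1)[-1].lower()
    return from_domain != reply_domain


RAW = (
    "Authentication-Results: mx.example.net; spf=fail smtp.mailfrom=billing@example.top;"
    " dkim=none; dmarc=fail header.from=example.com\n"
    "From: Support Team <support@example.com>\n"
    "Reply-To: helpdesk@example.top\n"
    "Subject: Invoice\n"
    "\n"
    "Body\n"
)
msg = message_from_string(RAW)
print(auth_failures(msg), reply_to_mismatch(msg))
```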

Financial / Credential

  • Financial lure vocabulary
  • Credential harvesting terms
  • Tech support scam terms
  • Banking keyword cluster
  • BEC indicators

Technical

  • Phone numbers detected
  • Hidden hyperlinks (OCR miss)
  • Low OCR confidence
  • Keyword density hits
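One plausible reading of the hidden hyperlink indicator is a link target that exists in the HTML markup but never appears in the text OCR can see. The sketch below follows that interpretation, which is an assumption, using only the standard library. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def hidden_hyperlinks(html: str, visible_text: str) -> list[str]:
    """Return link targets present in the markup but absent from the visible (OCR) text."""
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if href not in visible_text]


html = '<a href="http://example.top/login"><img src="cid:invoice.png"></a>'
ocr_text = "Your subscription renews today. Call us to cancel."
print(hidden_hyperlinks(html, ocr_text))
```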

Evaluation Results

The system was formally evaluated on 300 labelled samples: 150 phishing emails from the Nazario 2025 corpus and 150 legitimate emails from the TREC 2007 ham corpus.

Pipeline       Threshold     Precision   Recall   F1      False Positive Rate   TP / FP / TN / FN
Email (.eml)   High ≥70      n/a         0.00%    0.000   0.00%                 0 / 0 / 150 / 150
Email (.eml)   Medium ≥35    80.95%      11.33%   0.199   2.67%                 17 / 4 / 146 / 133
Image (.png)   High ≥70      n/a         0.00%    0.000   0.00%                 0 / 0 / 10 / 12
Image (.png)   Medium ≥35    100%        58.33%   0.737   0.00%                 7 / 0 / 10 / 5

Understanding the Results

The image pipeline achieved 100% precision with a 0% false positive rate: every image flagged as phishing was genuinely phishing, and no legitimate image was incorrectly flagged. Recall of 58.33% reflects the system's dependency on OCR: invoice-style phishing with readable text scored consistently in the Medium-High range, while image-only graphics where OCR extracts nothing scored near zero.

For the email pipeline, 80.95% precision at the Medium threshold with a 2.67% false positive rate demonstrates the system's precision-first design. Analysis of the 133 false negatives revealed four systematic failure patterns: attachment-based phishing (payload in PDF/Word files the system cannot read), image-rendered HTML emails with near-empty plain text bodies, BEC-style sparse-content emails designed to evade vocabulary-based detection, and emails scoring just below the Medium threshold.

Evaluation Datasets

Nazario 2025 Phishing Corpus

150 samples, phishing

Hand-classified real phishing emails collected from the personal inbox of security researcher Jose Nazario. Widely cited in academic phishing detection research. Available at monkey.org/~jose/phishing/.

TREC 2007 Ham Corpus

150 samples, benign

Legitimate emails from the TREC 2007 spam track evaluation. The emails date from 2007, so they carry SPF/DKIM-era header context that older corpora lack. Standard academic baseline for spam/ham classification.

Tech Stack

  • Python 3.11: core application language
  • Flask: web framework and REST API
  • Tesseract OCR: multi-pass text extraction
  • OpenCV: image preprocessing and analysis
  • VirusTotal API: URL threat intelligence
  • URLScan.io: URL scanning and history
  • ReportLab: PDF forensic report generation
  • Docker / Render: containerised cloud deployment

Limitations

No detection system is perfect. These are the known limitations of ImageAware+ and the areas identified for future improvement.

OCR Dependency

When phishing content is embedded entirely as a graphic with no readable text, OCR returns nothing and content-based indicators cannot fire. Image-only attacks with no visible text score near zero.

Attachment Analysis

Phishing delivered entirely within PDF or Word attachments goes undetected. The system analyses email body content and embedded images, not file attachments.

API Rate Limits

VirusTotal's free tier allows 500 requests/day. High-volume scanning will exhaust this quota. Results for unknown or recently registered domains may be inconclusive.

Unvalidated Weights

Indicator weights were set by reasoning rather than empirical optimisation. A larger labelled dataset would enable data-driven calibration of the scoring thresholds.

English Language Focus

Keyword indicators are primarily in English. Phishing in Irish, Spanish, Portuguese, or other languages may score lower due to vocabulary mismatch.

Sparse BEC Detection

Business Email Compromise emails are deliberately understated. Without rich vocabulary to trigger indicators, BEC attacks may score below the Medium threshold even when genuinely malicious.

About the Project

ImageAware+ was developed by Lorcan Kelly Zazera as a Final Year Project for the BSc (Hons) in Cybercrime & IT Security at SETU Carlow, 2026.

The project source code is available on GitHub. Issues, suggestions, and contributions are welcome.


Sources & Citations

[1] Anti-Phishing Working Group (APWG). Phishing Activity Trends Report, Q1 2025. Available at: apwg.org

[2] Verizon. 2023 Data Breach Investigations Report (DBIR). Available at: verizon.com/business/resources/reports/dbir

[3] FBI Internet Crime Complaint Center (IC3). 2023 Internet Crime Report. Available at: ic3.gov

[4] Nazario, J. Phishing Corpus 2025. Available at: monkey.org/~jose/phishing/

[5] TREC 2007 Spam Track. TREC 2007 Spam Track Public Corpus. Available at: trec.nist.gov

[6] Tesseract OCR. Tesseract Open Source OCR Engine. Available at: github.com/tesseract-ocr/tesseract

[7] VirusTotal. VirusTotal Public API v3. Available at: virustotal.com

[8] URLScan.io. URLScan.io API. Available at: urlscan.io