A hybrid phishing detection system combining OCR, computer vision, email analysis, and threat intelligence, built as a Final Year Project at SETU Carlow.
ImageAware+ addresses a specific gap in traditional phishing detection: most existing tools analyse URLs and email headers, but completely miss phishing content embedded inside images. A fake Geek Squad invoice or a DocuSign impersonation email that delivers its content as a graphic bypasses text-based detectors entirely.
This system combines Optical Character Recognition (OCR) to read image content, email header analysis to detect spoofing and authentication failures, and a rule-based scoring engine with 29 indicators to produce an explainable risk verdict for any submitted file.
Unlike black-box ML classifiers, every point in the final score is traceable to a specific indicator, making the system suitable for forensic analysis where an analyst needs to understand and justify each verdict.
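A minimal sketch of what a traceable rule-based scorer can look like is shown here. The indicator names, weights, and keyword checks are hypothetical placeholders chosen for illustration; they are not the project's actual 29 indicators.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Indicator:
    name: str                      # hypothetical indicator name
    category: str                  # hypothetical category label
    weight: int                    # hypothetical points added when it fires
    fires: Callable[[str], bool]   # rule evaluated against the extracted text


# Illustrative rules only; the real system defines 29 indicators.
INDICATORS = [
    Indicator("urgency_language", "content", 15, lambda t: "immediately" in t.lower()),
    Indicator("payment_request", "content", 20, lambda t: "invoice" in t.lower()),
    Indicator("phone_callback", "content", 10, lambda t: "call" in t.lower()),
]


def score(text: str) -> dict:
    """Return a capped total plus one trace entry per indicator that fired."""
    trace = [
        {"indicator": i.name, "category": i.category, "points": i.weight}
        for i in INDICATORS
        if i.fires(text)
    ]
    total = min(100, sum(entry["points"] for entry in trace))
    return {"score": total, "trace": trace}


result = score("Your invoice is overdue. Call us immediately.")
for entry in result["trace"]:
    print(f'{entry["points"]:>3}  {entry["category"]}/{entry["indicator"]}')
print("total:", result["score"])
```

Because the total is just the sum of the trace entries (capped at 100), an analyst can account for every point in the verdict.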
When a file is submitted, it passes through five stages regardless of whether it's an image or email file.
Image or .eml file uploaded
OCR, href URLs, QR codes, email headers
VirusTotal, URLScan, PhishTank
29 indicators across 8 categories
JSON + PDF forensic output
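As a rough illustration of the extraction stage for .eml input, the sketch below uses only the Python standard library to pull href URLs and authentication headers out of a message. The function name, the returned keys, and the sample message are assumptions made for this example; OCR and QR decoding need third-party libraries and are left out.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def extract_signals(raw_eml: str) -> dict:
    """Hypothetical helper: parse an .eml string and return raw evidence."""
    msg = message_from_string(raw_eml, policy=policy.default)
    html_part = msg.get_body(preferencelist=("html",))
    html = html_part.get_content() if html_part is not None else ""
    collector = HrefCollector()
    collector.feed(html)
    return {
        "from": str(msg.get("From", "")),
        "authentication_results": str(msg.get("Authentication-Results", "")),
        "hrefs": collector.hrefs,
    }


# Made-up sample message for demonstration purposes only.
SAMPLE = (
    "From: Billing <billing@example.com>\n"
    "Subject: Invoice\n"
    "Authentication-Results: mx.example.net; spf=fail; dkim=none\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    '<p>Please <a href="http://example.test/pay">pay now</a></p>\n'
)

signals = extract_signals(SAMPLE)
print(signals["hrefs"])
print(signals["authentication_results"])
```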
For email files (.eml), the pipeline always scores from body text first, then additionally analyses any embedded images, taking the higher of the two scores. This ensures no signal is lost regardless of how the phishing content is delivered.
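The "take the higher score" rule can be sketched as follows. `toy_score` is a hypothetical stand-in for the real scoring engine, and `score_email` and its return keys are names invented for this example.

```python
from typing import Callable, Sequence


def toy_score(text: str) -> int:
    """Hypothetical stand-in scorer: 25 points per keyword hit, capped at 100."""
    keywords = ("invoice", "renewal", "call")
    lowered = text.lower()
    return min(100, 25 * sum(k in lowered for k in keywords))


def score_email(
    body_text: str,
    image_texts: Sequence[str],
    score_text: Callable[[str], int] = toy_score,
) -> dict:
    """Score the body first, then each image's OCR text, and keep the maximum."""
    candidates = [("body", score_text(body_text))]
    candidates += [
        (f"image[{i}]", score_text(text)) for i, text in enumerate(image_texts)
    ]
    source, best = max(candidates, key=lambda c: c[1])
    return {"score": best, "source": source, "all_scores": dict(candidates)}


verdict = score_email(
    body_text="Please see attached.",
    image_texts=["Geek Squad renewal invoice. Call now to cancel."],
)
print(verdict)
```

Recording which source produced the winning score keeps the verdict explainable even when the body text alone looks harmless.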
The scoring engine evaluates evidence across eight categories. Scores range from 0–100: Low (0–34), Medium (35–69), High (70–100). Every contribution is fully traceable.
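The band boundaries stated above map to a small lookup. The function name `risk_band` is an assumption; the thresholds are the ones given in the text.

```python
def risk_band(score: int) -> str:
    """Map a 0-100 score to the Low / Medium / High bands described above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0..100, got {score}")
    if score >= 70:
        return "High"
    if score >= 35:
        return "Medium"
    return "Low"


for sample in (0, 34, 35, 69, 70, 100):
    print(sample, risk_band(sample))
```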
The system was formally evaluated on 300 labelled samples: 150 phishing emails from the Nazario 2025 corpus and 150 legitimate emails from the TREC 2007 ham corpus.
| Pipeline | Threshold | Precision | Recall | F1 | False Positive Rate | TP / FP / TN / FN |
|---|---|---|---|---|---|---|
| Email (.eml) | High ≥70 | N/A | 0.00% | 0.000 | 0.00% | 0 / 0 / 150 / 150 |
| Email (.eml) | Medium ≥35 | 80.95% | 11.33% | 0.199 | 2.67% | 17 / 4 / 146 / 133 |
| Image (.png) | High ≥70 | N/A | 0.00% | 0.000 | 0.00% | 0 / 0 / 10 / 12 |
| Image (.png) | Medium ≥35 | 100% | 58.33% | 0.737 | 0.00% | 7 / 0 / 10 / 5 |
The image pipeline achieved 100% precision with a 0% false positive rate: every image flagged as phishing was genuinely phishing, and no legitimate image was incorrectly flagged. Recall of 58.33% reflects the system's dependency on OCR: invoice-style phishing with readable text scored consistently in the Medium-High range, while image-only graphics where OCR extracts nothing scored near zero.
For the email pipeline, 80.95% precision at the Medium threshold with a 2.67% false positive rate demonstrates the system's precision-first design. Analysis of the 133 false negatives revealed four systematic failure patterns: attachment-based phishing (payload in PDF/Word files the system cannot read), image-rendered HTML emails with near-empty plain text bodies, BEC-style sparse-content emails designed to evade vocabulary-based detection, and emails scoring just below the Medium threshold.
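The reported figures follow directly from the confusion counts in the table. The helper below is a minimal sketch (the function name and return keys are assumptions); it returns `None` for precision when nothing was flagged, since the ratio is undefined when TP + FP = 0.

```python
from typing import Optional


def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1 and false positive rate from raw counts."""
    precision: Optional[float] = tp / (tp + fp) if (tp + fp) else None
    recall: Optional[float] = tp / (tp + fn) if (tp + fn) else None
    # F1 written in terms of counts so it stays defined when precision is not.
    f1_denominator = 2 * tp + fp + fn
    f1: Optional[float] = 2 * tp / f1_denominator if f1_denominator else None
    fpr: Optional[float] = fp / (fp + tn) if (fp + tn) else None
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


email_medium = metrics(tp=17, fp=4, tn=146, fn=133)
image_medium = metrics(tp=7, fp=0, tn=10, fn=5)
email_high = metrics(tp=0, fp=0, tn=150, fn=150)

print(email_medium)
print(image_medium)
print(email_high)
```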
Hand-classified real phishing emails collected from the personal inbox of security researcher Jose Nazario. Widely cited in academic phishing detection research. Available at monkey.org/~jose/phishing/.
Legitimate emails from the TREC 2007 spam track evaluation. Because the corpus dates from 2007, its emails carry SPF/DKIM context that older corpora lack. Standard academic baseline for spam/ham classification.
Core application language
Web framework & REST API
Multi-pass text extraction
Image preprocessing & analysis
URL threat intelligence
URL scanning & history
PDF forensic report generation
Containerised cloud deployment
No detection system is perfect. These are the known limitations of ImageAware+ and the areas identified for future improvement.
When phishing content is embedded entirely as a graphic with no readable text, OCR returns nothing and content-based indicators cannot fire. Image-only attacks with no visible text score near zero.
Phishing delivered entirely within PDF or Word attachments is undetectable. The system analyses email body content and embedded images, not file attachments.
VirusTotal's free tier allows 500 requests/day. High-volume scanning will exhaust this quota. Results for unknown or recently registered domains may be inconclusive.
Indicator weights were set by reasoning rather than empirical optimisation. A larger labelled dataset would enable data-driven calibration of the scoring thresholds.
Keyword indicators are primarily in English. Phishing in Irish, Spanish, Portuguese, or other languages may score lower due to vocabulary mismatch.
Business Email Compromise emails are deliberately understated. Without rich vocabulary to trigger indicators, BEC attacks may score below the Medium threshold even when genuinely malicious.
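One way to live with the VirusTotal daily quota noted above is to track lookups locally and skip the call once the limit is reached. The class below is a hypothetical sketch, not part of ImageAware+; the default limit of 500 comes from the free-tier figure stated in the text.

```python
from datetime import date
from typing import Optional


class DailyQuota:
    """Hypothetical in-memory counter that resets when the calendar day changes."""

    def __init__(self, limit: int = 500):
        self.limit = limit
        self._day: Optional[date] = None
        self._used = 0

    def try_acquire(self, today: Optional[date] = None) -> bool:
        """Return True and consume one request if quota remains for today."""
        today = today or date.today()
        if today != self._day:
            # First request of a new day: reset the counter.
            self._day, self._used = today, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True


quota = DailyQuota(limit=2)
day_one = date(2026, 1, 1)
day_two = date(2026, 1, 2)
outcomes = [
    quota.try_acquire(day_one),
    quota.try_acquire(day_one),
    quota.try_acquire(day_one),  # quota exhausted for day one
    quota.try_acquire(day_two),  # counter resets on the next day
]
print(outcomes)
```

When `try_acquire` returns `False`, a caller could mark the threat intelligence result as unavailable rather than treating the URL as clean.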
ImageAware+ was developed by Lorcan Kelly Zazera as a Final Year Project for the BSc (Hons) in Cybercrime & IT Security at SETU Carlow, 2026.
The project source code is available on GitHub. Issues, suggestions, and contributions are welcome.
[1] Anti-Phishing Working Group (APWG). Phishing Activity Trends Report, Q1 2025. Available at: apwg.org
[2] Verizon. 2023 Data Breach Investigations Report (DBIR). Available at: verizon.com/business/resources/reports/dbir
[3] FBI Internet Crime Complaint Center (IC3). 2023 Internet Crime Report. Available at: ic3.gov
[4] Nazario, J. Phishing Corpus 2025. Available at: monkey.org/~jose/phishing/
[5] TREC 2007 Spam Track. TREC 2007 Spam Track Public Corpus. Available at: trec.nist.gov
[6] Tesseract OCR. Tesseract Open Source OCR Engine. Available at: github.com/tesseract-ocr/tesseract
[7] VirusTotal. VirusTotal Public API v3. Available at: virustotal.com
[8] URLScan.io. URLScan.io API. Available at: urlscan.io