I Turned 50 Cybersecurity Books Into a Searchable Brain

The Problem With Security Books

I have a lot of cybersecurity books. PDFs from Humble Bundles, O’Reilly downloads, books I’ve bought but never finished, reference material I collected “just in case.” And, like most people’s collections, they lived in a folder I rarely opened.

The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I’d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.

That’s not a knowledge base. That’s a graveyard.

So I built something better: a local semantic search engine over all of them, powered by PostgreSQL, pgvector, and OpenAI embeddings. Now I ask questions in plain English and get back the exact passages — with the book and chapter — that answer them. The whole thing runs locally on my machine.

Here’s how I built it, and why it’s become one of the most useful tools in my PAI (Personal AI Infrastructure) stack.


What Semantic Search Actually Means

Traditional search is keyword matching. You type “SQL injection” and it finds documents containing those exact words.

Semantic search is different. It converts your query and your documents into vectors — lists of numbers that represent meaning in high-dimensional space. Similar concepts cluster together regardless of exact wording. Ask “how to bypass database input validation” and you’ll surface the same SQL injection content, even though you never typed “SQL injection.”

This matters enormously for a security knowledge base. Security concepts have dozens of names. “Privilege escalation,” “privesc,” “root access,” “vertical privilege abuse” — these all mean the same thing. Semantic search finds all of them.
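The clustering idea boils down to cosine similarity, the same measure pgvector uses for vector comparison. Here is a toy sketch with hand-made 3-dimensional vectors standing in for real 1536-dimensional embeddings (the vector values are invented for illustration):

```typescript
// Cosine similarity: 1.0 = same direction (same meaning), near 0 = unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors: in practice the embedding model places "privesc" and
// "privilege escalation" near each other, and unrelated topics far away.
const privesc     = [0.90, 0.10, 0.20];
const privEscLong = [0.85, 0.15, 0.25];
const cooking     = [0.10, 0.90, 0.10];

console.log(cosineSimilarity(privesc, privEscLong).toFixed(3)); // close to 1
console.log(cosineSimilarity(privesc, cooking).toFixed(3));     // much lower
```

The search engine just embeds the query and ranks stored chunks by this score.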


The Stack

  • PostgreSQL 17 — the database
  • pgvector 0.8.2 — vector similarity search extension for Postgres
  • OpenAI text-embedding-3-small — converts text chunks to 1536-dimensional vectors
  • CyberSecKB.ts — a custom Bun/TypeScript CLI I built to tie it all together

Everything runs locally. The only external calls are to OpenAI’s embedding API: in bulk at ingest time to embed the book chunks, plus one small call per search to embed the query itself.
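For reference, a minimal pgvector schema for this kind of setup looks roughly like this (the table and column names here are my illustration, not necessarily what CyberSecKB.ts actually creates):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  book      text NOT NULL,
  chapter   text,
  topics    text[],
  content   text NOT NULL,
  embedding vector(1536)  -- text-embedding-3-small output dimension
);

-- Approximate-nearest-neighbor index using cosine distance
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```

The HNSW index keeps nearest-neighbor queries fast as the chunk count grows into the tens of thousands.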


The Pipeline: From PDF to Searchable Knowledge

Step 1: Convert PDFs to Markdown

Raw PDFs are terrible for text processing. I convert everything to Markdown first using a pdf2md Python tool:

cd ~/projects/pdf-to-markdown
source venv/bin/activate

# Text-based PDFs (most books):
python pdf2md input/mybook.pdf output/mybook.md

# Image-based or scanned PDFs (use OCR first):
ocrmypdf --force-ocr input/mybook.pdf /tmp/ocr.pdf
python pdf2md /tmp/ocr.pdf output/mybook.md

# Move to library:
mv output/mybook.md ~/projects/cybersecurity-library/books/

Step 2: Ingest into the Database

TOOL=~/.claude/skills/PAI/USER/KNOWLEDGE/CYBERSECURITY/Tools/CyberSecKB.ts

# Single book with topics tagged:
bun $TOOL ingest \
  --file ~/projects/cybersecurity-library/books/mybook.md \
  --title "My Book Title" \
  --topics web,network,linux

# Or load everything at once:
bun $TOOL ingest --batch ~/projects/cybersecurity-library/books/

The ingest process:

  1. Reads the Markdown file
  2. Splits it into ~800-token chunks, preserving chapter headings
  3. Sends chunks to OpenAI’s embedding API in batches
  4. Stores chunks + their vector embeddings in PostgreSQL
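The chunking step (2) can be sketched like this. The real CyberSecKB.ts logic may differ; here I approximate token counts as characters / 4 and split on Markdown headings:

```typescript
interface Chunk { heading: string; text: string; }

// Rough approximation: OpenAI tokens average about 4 characters of English.
const MAX_TOKENS = 800;
const approxTokens = (s: string) => Math.ceil(s.length / 4);

// Split a Markdown book into ~800-token chunks, each tagged with the
// chapter heading it falls under so search results can cite the chapter.
// (A single line longer than the limit stays whole in this sketch.)
function chunkMarkdown(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "(front matter)";
  let buf = "";
  const flush = () => {
    if (buf.trim()) chunks.push({ heading, text: buf.trim() });
    buf = "";
  };
  for (const line of markdown.split("\n")) {
    if (/^#{1,3} /.test(line)) {                      // new chapter/section
      flush();
      heading = line.replace(/^#+ /, "");
    } else if (approxTokens(buf + line) > MAX_TOKENS) {
      flush();                                        // chunk full: start fresh
    }
    buf += line + "\n";
  }
  flush();
  return chunks;
}
```

Keeping the heading with each chunk is what makes the search output show “Book → Chapter” instead of an anonymous blob of text.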

Step 3: Search the Knowledge Base

# Plain English query:
bun $TOOL search "how do attackers bypass WAF rules for SQL injection"

# Filter by topic:
bun $TOOL search "privilege escalation" --topics linux --limit 5

# Check what's in the KB:
bun $TOOL list
bun $TOOL stats

What It Looks Like in Practice

Here’s a real query. I asked:

bun $TOOL search "SQL injection bypass techniques" --limit 3

Result:

━━━ [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
    The `;` metacharacter in a SQL statement is used similarly to how it's used
    in command injection to combine multiple queries on the same line...

━━━ [62.5%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
    If user input is used without prior validation, and it is concatenated
    directly into a SQL query, a user can inject different data...

━━━ [60.4%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
    Input taken from cookies, input forms, and URL variables is used to build
    SQL statements that are passed back to the database...

Each result shows the similarity score, book title, chapter, and a preview. I can immediately tell which book to go deeper in.
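Under the hood, that ranked list comes from a pgvector top-k query along these lines (`<=>` is pgvector’s cosine distance operator, and `1 - distance` gives the similarity percentage shown above; the table and column names are my assumption):

```sql
SELECT book, chapter,
       1 - (embedding <=> $1) AS similarity,  -- $1 = embedded query vector
       left(content, 120)     AS preview
FROM chunks
ORDER BY embedding <=> $1                     -- nearest vectors first
LIMIT 3;
```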

Another query — privilege escalation:

bun $TOOL search "privilege escalation linux" --limit 3
━━━ [66.1%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
    Most systems are built using the least privilege concept — users are
    purposefully given the least privileges they need to perform their work...

━━━ [65.9%] Kali Linux Cookbook → Privilege Escalation
    CVE-2015-1328: overlayfs vulnerability affecting Ubuntu where it does not
    do proper checking of file creation in the upper filesystem area...

━━━ [65.8%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
    On Linux, vertical escalation allows attackers to have root privileges
    that enable them to modify systems and programs...

This is the power of the system: I asked about a concept, not a keyword, and got specific, sourced, actionable results from three different books.


The Current State of the KB

After the initial batch ingest:

  • 50 books indexed
  • 11,757 chunks stored and embedded
  • Coverage spans: penetration testing, malware analysis, forensics, identity and access, cloud security, social engineering, cryptography, threat modeling, and more

Some of what’s in there:

  • Practical Malware Analysis (620 chunks)
  • Cybersecurity Threats, Malware Trends and Strategies (552 chunks)
  • Cybersecurity Attack and Defense Strategies (460 chunks)
  • Security Chaos Engineering (387 chunks)
  • Hardware Hacking Handbook (378 chunks)
  • Modern Data Protection (338 chunks)

Why This Fits Into PAI

This knowledge base is part of my PAI system — Personal AI Infrastructure. The idea behind PAI is to build infrastructure that amplifies what I can do with AI, rather than using AI one prompt at a time.

The Security KB is a perfect example. It’s not about asking ChatGPT “explain SQL injection.” It’s about having my own curated library, chunked, embedded, and ready to surface exactly the passage I need — from books I trust, with sources I can trace back.

When I’m working through a security challenge or studying for a certification, I can query the KB directly. Luna (my PAI assistant) can also query it as part of a larger workflow — search the KB, pull context into the prompt, and answer questions grounded in my actual library rather than generic training data.
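That retrieval-then-answer workflow is standard RAG. A sketch of the prompt-assembly step (the result shape here is assumed, not the actual CyberSecKB.ts output format):

```typescript
interface SearchHit { book: string; chapter: string; text: string; score: number; }

// Build a grounded prompt: the model answers from retrieved passages and
// cites the book/chapter each passage came from.
function buildGroundedPrompt(question: string, hits: SearchHit[]): string {
  const context = hits
    .map((h, i) => `[${i + 1}] ${h.book} → ${h.chapter}\n${h.text}`)
    .join("\n\n");
  return `Answer using only the sources below. Cite them as [n].\n\n` +
         `${context}\n\nQuestion: ${question}`;
}
```

The point is that the assistant’s answer is anchored to traceable passages from my own library, not to whatever the base model half-remembers.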


Building It With Claude Code

The entire CyberSecKB tool was built using Claude Code through PAI. The process:

  1. Described what I wanted: ingest markdown books, chunk by section, embed with OpenAI, store in pgvector
  2. Claude Code scaffolded the TypeScript CLI
  3. We hit a few real-world issues along the way:
    • The OpenAI project key needed embedding model access enabled separately
    • Batch size of 2048 hit the 300k token/request limit — tuned down to 200
    • The 1M tokens/minute rate limit required adding a 15-second delay between batches
    • A SQL type error in the search function when no topics filter was passed
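The batch-size and rate-limit fixes from that list can be sketched as follows. The endpoint and model name are real OpenAI API values, but the surrounding code is my reconstruction, not the actual CyberSecKB.ts source:

```typescript
const BATCH_SIZE = 200;        // keeps each request under the 300k-token cap
const BATCH_DELAY_MS = 15_000; // stays under the 1M tokens/minute rate limit

// Split an array into fixed-size batches.
function toBatches<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embed every chunk, batch by batch, backing off between requests.
async function embedAll(chunks: string[], apiKey: string): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, BATCH_SIZE)) {
    const res = await fetch("https://api.openai.com/v1/embeddings", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "text-embedding-3-small", input: batch }),
    });
    const json = await res.json();
    vectors.push(...json.data.map((d: { embedding: number[] }) => d.embedding));
    await new Promise((r) => setTimeout(r, BATCH_DELAY_MS)); // rate-limit backoff
  }
  return vectors;
}
```

At 11,757 chunks and 200 per batch, that is about 59 requests, so the 15-second delay adds roughly 15 minutes to a full ingest: slow, but it only happens once.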

Each issue was diagnosed and fixed in the same conversation. The tool went from concept to 50 books indexed in a single session.


What’s Next

A few things I want to add:

  • Tag all books with proper topics — the batch ingest skipped topic assignment; I’ll tag each book so --topics web or --topics linux filters actually work
  • Tier 1 topic files — condensed 5-15KB reference files for the most-used topics (SQLi, XSS, privilege escalation, etc.) that load directly into context
  • AI Security KB integration — the AI Security research KB shares the same database; queries cross both domains automatically

The knowledge base is live. The friction is gone. Now the books actually get used.


Built with PAI, Claude Code, PostgreSQL, pgvector, and OpenAI embeddings. All processing runs locally except the embedding API calls — bulk at ingest time, plus one small call per query.