<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Embeddings on </title>
    <link>https://augmentedresilience.com/tags/embeddings/</link>
    <description>Recent content in Embeddings on </description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sat, 21 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://augmentedresilience.com/tags/embeddings/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>I Turned 50 Cybersecurity Books Into a Searchable Brain</title>
      <link>https://augmentedresilience.com/posts/augmented-resilience-posts/i-turned-50-cybersecurity-books-into-a-searchable-brain/</link>
      <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://augmentedresilience.com/posts/augmented-resilience-posts/i-turned-50-cybersecurity-books-into-a-searchable-brain/</guid>
      <description>&lt;h2 id=&#34;the-problem-with-security-books&#34;&gt;The Problem With Security Books&lt;/h2&gt;
&lt;p&gt;I have a lot of cybersecurity books. PDFs from Humble Bundles, O&amp;rsquo;Reilly downloads, books I&amp;rsquo;ve bought and never finished, reference material I collected &amp;ldquo;just in case.&amp;rdquo; Like most people&amp;rsquo;s collections, mine lived in a folder I rarely opened.&lt;/p&gt;
&lt;p&gt;The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I&amp;rsquo;d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.&lt;/p&gt;</description>
      <content>&lt;h2 id=&#34;the-problem-with-security-books&#34;&gt;The Problem With Security Books&lt;/h2&gt;
&lt;p&gt;I have a lot of cybersecurity books. PDFs from Humble Bundles, O&amp;rsquo;Reilly downloads, books I&amp;rsquo;ve bought and never finished, reference material I collected &amp;ldquo;just in case.&amp;rdquo; Like most people&amp;rsquo;s collections, mine lived in a folder I rarely opened.&lt;/p&gt;
&lt;p&gt;The reason is friction. When I needed to look something up — say, how SQL injection payloads work, or the steps for privilege escalation on Linux — I&amp;rsquo;d have to remember which book covered it, open it, and search inside. Or just Google it and hope Stack Overflow had something decent.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s not a knowledge base. That&amp;rsquo;s a graveyard.&lt;/p&gt;
&lt;p&gt;So I built something better: a local semantic search engine over all of them, powered by PostgreSQL, pgvector, and OpenAI embeddings. Now I ask questions in plain English and get back the exact passages that answer them, each labeled with its book and chapter. Apart from calls to the embedding API, the whole thing runs on my machine.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s how I built it, and why it&amp;rsquo;s become one of the most useful tools in my PAI (Personal AI Infrastructure) stack.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;what-semantic-search-actually-means&#34;&gt;What Semantic Search Actually Means&lt;/h2&gt;
&lt;p&gt;Traditional search is keyword matching. You type &amp;ldquo;SQL injection&amp;rdquo; and it finds documents containing those exact words.&lt;/p&gt;
&lt;p&gt;Semantic search is different. It converts your query and your documents into vectors — lists of numbers that represent &lt;em&gt;meaning&lt;/em&gt; in high-dimensional space. Similar concepts cluster together regardless of exact wording. Ask &amp;ldquo;how to bypass database input validation&amp;rdquo; and you&amp;rsquo;ll surface the same SQL injection content, even though you never typed &amp;ldquo;SQL injection.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This matters enormously for a security knowledge base, because security concepts go by many names. &amp;ldquo;Privilege escalation,&amp;rdquo; &amp;ldquo;privesc,&amp;rdquo; &amp;ldquo;root access,&amp;rdquo; and &amp;ldquo;vertical privilege abuse&amp;rdquo; all point at the same idea, and semantic search can surface passages that use any of them.&lt;/p&gt;
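&lt;p&gt;To make &amp;ldquo;similar concepts cluster together&amp;rdquo; concrete, here is a minimal TypeScript sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors are made-up toy values rather than real model output, and &lt;code&gt;cosineSimilarity&lt;/code&gt; is a hypothetical helper, not a function taken from CyberSecKB.&lt;/p&gt;

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
// Values near 1 mean the vectors point the same way; values near 0 mean unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, i) => {
    dot += value * b[i];
    normA += value * value;
    normB += b[i] * b[i];
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" (illustrative values only).
const queryVector = [0.9, 0.1, 0.0];        // "bypass database input validation"
const sqlInjectionVector = [0.8, 0.2, 0.1]; // a passage about SQL injection
const backupVector = [0.0, 0.1, 0.9];       // a passage about backup rotation

const related = cosineSimilarity(queryVector, sqlInjectionVector);
const unrelated = cosineSimilarity(queryVector, backupVector);
console.log(related > unrelated); // true
```

&lt;p&gt;With these toy values the SQL injection passage scores about 0.98 against the query and the backup passage about 0.01, which is the kind of gap a ranked search relies on.&lt;/p&gt;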
&lt;hr&gt;
&lt;h2 id=&#34;the-stack&#34;&gt;The Stack&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL 17&lt;/strong&gt; — the database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pgvector 0.8.2&lt;/strong&gt; — vector similarity search extension for Postgres&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI text-embedding-3-small&lt;/strong&gt; — converts text chunks to 1536-dimensional vectors&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CyberSecKB.ts&lt;/strong&gt; — a custom Bun/TypeScript CLI I built to tie it all together&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Everything runs locally except one external service: OpenAI&amp;rsquo;s embedding API. Each chunk is embedded once at ingest time; at query time, only the short search string is sent to be embedded.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;the-pipeline-from-pdf-to-searchable-knowledge&#34;&gt;The Pipeline: From PDF to Searchable Knowledge&lt;/h2&gt;
&lt;h3 id=&#34;step-1-convert-pdfs-to-markdown&#34;&gt;Step 1: Convert PDFs to Markdown&lt;/h3&gt;
&lt;p&gt;Raw PDFs are terrible for text processing. I convert everything to Markdown first using a &lt;code&gt;pdf2md&lt;/code&gt; Python tool:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cd ~/projects/pdf-to-markdown
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;source venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Text-based PDFs (most books):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;python pdf2md input/mybook.pdf
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Image-based or scanned PDFs (use OCR first):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ocrmypdf --force-ocr input/mybook.pdf /tmp/ocr.pdf
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;python pdf2md /tmp/ocr.pdf output/mybook.md
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Move to library:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;mv output/mybook.md ~/projects/cybersecurity-library/books/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;step-2-ingest-into-the-database&#34;&gt;Step 2: Ingest into the Database&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;TOOL&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;~/.claude/skills/PAI/USER/KNOWLEDGE/CYBERSECURITY/Tools/CyberSecKB.ts
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Single book with topics tagged:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bun $TOOL ingest &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --file ~/projects/cybersecurity-library/books/mybook.md &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --title &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;My Book Title&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --topics web,network,linux
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Or load everything at once:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bun $TOOL ingest --batch ~/projects/cybersecurity-library/books/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The ingest process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reads the Markdown file&lt;/li&gt;
&lt;li&gt;Splits it into ~800-token chunks, preserving chapter headings&lt;/li&gt;
&lt;li&gt;Sends chunks to OpenAI&amp;rsquo;s embedding API in batches&lt;/li&gt;
&lt;li&gt;Stores chunks + their vector embeddings in PostgreSQL&lt;/li&gt;
&lt;/ol&gt;
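&lt;p&gt;The chunking step can be sketched roughly as follows. This is an illustration under stated assumptions, not the CyberSecKB source: it treats level 1 and level 2 Markdown headings as chapter boundaries, expects headings to be separated from body text by blank lines, and estimates tokens at about four characters each instead of running a real tokenizer. &lt;code&gt;chunkMarkdown&lt;/code&gt;, &lt;code&gt;approxTokens&lt;/code&gt;, and the &lt;code&gt;Chunk&lt;/code&gt; shape are hypothetical names.&lt;/p&gt;

```typescript
// One stored chunk: the chapter heading it falls under plus the text itself.
interface Chunk {
  chapter: string;
  text: string;
}

// Rough heuristic (assumption): about 4 characters per token for English text.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Group paragraphs into chunks of roughly maxTokens tokens, never mixing chapters.
// A single paragraph larger than maxTokens is kept whole as its own chunk.
function chunkMarkdown(markdown: string, maxTokens = 800): Chunk[] {
  const chunks: Chunk[] = [];
  let chapter = "Front matter";
  let buffer: string[] = [];

  // Emit whatever is buffered as one chunk under the current chapter.
  const flush = () => {
    const text = buffer.join("\n\n").trim();
    if (text.length > 0) chunks.push({ chapter, text });
    buffer = [];
  };

  for (const rawBlock of markdown.split(/\n\s*\n/)) {
    const block = rawBlock.trim();
    if (block.length === 0) continue;

    const heading = block.match(/^#{1,2}\s+(.+)$/);
    if (heading) {
      flush(); // close the last chunk of the previous chapter
      chapter = heading[1].trim();
      continue;
    }

    // Start a new chunk if adding this paragraph would exceed the budget.
    if (buffer.length > 0) {
      const candidate = [...buffer, block].join("\n\n");
      if (approxTokens(candidate) > maxTokens) flush();
    }
    buffer.push(block);
  }
  flush();
  return chunks;
}

const sample =
  "# Injection Flaws\n\nFirst paragraph.\n\nSecond paragraph.\n\n# Privilege Escalation\n\nThird paragraph.";
console.log(chunkMarkdown(sample).map((c) => c.chapter));
// [ "Injection Flaws", "Privilege Escalation" ]
```

&lt;p&gt;Carrying the chapter name on every chunk is what lets a search result point back to a specific book section instead of just a block of text.&lt;/p&gt;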
&lt;h3 id=&#34;step-3-search&#34;&gt;Step 3: Search&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Plain English query:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bun $TOOL search &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;how do attackers bypass WAF rules for SQL injection&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Filter by topic:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bun $TOOL search &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;privilege escalation&amp;#34;&lt;/span&gt; --topics linux --limit &lt;span style=&#34;color:#ae81ff&#34;&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Check what&amp;#39;s in the KB:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bun $TOOL list
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;bun $TOOL stats
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id=&#34;what-it-looks-like-in-practice&#34;&gt;What It Looks Like in Practice&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s a real query. I asked:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;bun $TOOL search &amp;#34;SQL injection bypass techniques&amp;#34; --limit 3
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Result:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;━━━ [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
    The `;` metacharacter in a SQL statement is used similarly to how it&amp;#39;s used
    in command injection to combine multiple queries on the same line...

━━━ [62.5%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
    If user input is used without prior validation, and it is concatenated
    directly into a SQL query, a user can inject different data...

━━━ [60.4%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
    Input taken from cookies, input forms, and URL variables is used to build
    SQL statements that are passed back to the database...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Each result shows the similarity score, book title, chapter, and a preview. I can immediately tell which book to go deeper in.&lt;/p&gt;
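&lt;p&gt;The header line of each result can be reproduced with a few lines of TypeScript. This sketch assumes the score is cosine similarity computed as one minus a cosine distance, the kind of value pgvector&amp;rsquo;s cosine distance operator returns; the exact formula CyberSecKB uses is not shown here. The &lt;code&gt;Hit&lt;/code&gt; shape and function names are hypothetical, and the sample distances are back-computed from the percentages above.&lt;/p&gt;

```typescript
// Hypothetical shape of one search hit; field names are illustrative.
interface Hit {
  book: string;
  chapter: string;
  distance: number; // cosine distance, where 0 means identical direction
}

// Turn a cosine distance into the percentage shown in the result header.
const toPercent = (distance: number): string =>
  `${((1 - distance) * 100).toFixed(1)}%`;

// Sort by ascending distance (most similar first) and keep the top `limit` hits.
function formatResults(hits: Hit[], limit: number): string[] {
  return [...hits]
    .sort((a, b) => a.distance - b.distance)
    .slice(0, limit)
    .map((h) => `[${toPercent(h.distance)}] ${h.book} → ${h.chapter}`);
}

const book = "Web Penetration Testing With Kali Linux";
const chapter = "Detecting and Exploiting Injection-Based Flaws";
const hits: Hit[] = [
  { book, chapter, distance: 0.396 },
  { book, chapter, distance: 0.367 },
  { book, chapter, distance: 0.375 },
];

const lines = formatResults(hits, 3);
console.log(lines[0]);
// [63.3%] Web Penetration Testing With Kali Linux → Detecting and Exploiting Injection-Based Flaws
```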
&lt;p&gt;Another query — privilege escalation:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;bun $TOOL search &amp;#34;privilege escalation linux&amp;#34; --limit 3
&lt;/code&gt;&lt;/pre&gt;&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;━━━ [66.1%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
    Most systems are built using the least privilege concept — users are
    purposefully given the least privileges they need to perform their work...

━━━ [65.9%] Kali Linux Cookbook → Privilege Escalation
    CVE-2015-1328: overlayfs vulnerability affecting Ubuntu where it does not
    do proper checking of file creation in the upper filesystem area...

━━━ [65.8%] Cybersecurity Attack And Defense Strategies → Privilege Escalation
    On Linux, vertical escalation allows attackers to have root privileges
    that enable them to modify systems and programs...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This is the power of the system: I asked about a concept, not a keyword, and got specific, sourced, actionable results from two different books.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;the-current-state-of-the-kb&#34;&gt;The Current State of the KB&lt;/h2&gt;
&lt;p&gt;After the initial batch ingest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;50 books&lt;/strong&gt; indexed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;11,757 chunks&lt;/strong&gt; stored and embedded&lt;/li&gt;
&lt;li&gt;Coverage spans: penetration testing, malware analysis, forensics, identity and access, cloud security, social engineering, cryptography, threat modeling, and more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of what&amp;rsquo;s in there:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Practical Malware Analysis&lt;/em&gt; (620 chunks)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Cybersecurity Threats, Malware Trends and Strategies&lt;/em&gt; (552 chunks)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Cybersecurity Attack and Defense Strategies&lt;/em&gt; (460 chunks)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Security Chaos Engineering&lt;/em&gt; (387 chunks)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Hardware Hacking Handbook&lt;/em&gt; (378 chunks)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Modern Data Protection&lt;/em&gt; (338 chunks)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-this-fits-into-pai&#34;&gt;Why This Fits Into PAI&lt;/h2&gt;
&lt;p&gt;This knowledge base is part of my PAI system — Personal AI Infrastructure. The idea behind PAI is to build infrastructure that &lt;em&gt;amplifies&lt;/em&gt; what I can do with AI, rather than using AI one prompt at a time.&lt;/p&gt;
&lt;p&gt;The Security KB is a perfect example. It&amp;rsquo;s not about asking ChatGPT &amp;ldquo;explain SQL injection.&amp;rdquo; It&amp;rsquo;s about having my own curated library, chunked, embedded, and ready to surface exactly the passage I need — from books I trust, with sources I can trace back.&lt;/p&gt;
&lt;p&gt;When I&amp;rsquo;m working through a security challenge or studying for a certification, I can query the KB directly. Luna (my PAI assistant) can also query it as part of a larger workflow — search the KB, pull context into the prompt, and answer questions grounded in my actual library rather than generic training data.&lt;/p&gt;
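&lt;p&gt;One way to picture the &amp;ldquo;pull context into the prompt&amp;rdquo; step is a small function that turns search hits into a numbered context block. This is a sketch, not how Luna is actually wired: the &lt;code&gt;Passage&lt;/code&gt; shape, &lt;code&gt;buildPrompt&lt;/code&gt;, and the instruction wording are all assumptions.&lt;/p&gt;

```typescript
// Hypothetical shape of a passage returned by a KB search.
interface Passage {
  book: string;
  chapter: string;
  text: string;
}

// Build a prompt that asks the model to answer only from the numbered passages.
function buildPrompt(question: string, passages: Passage[]): string {
  const context = passages
    .map((p, i) => `[${i + 1}] ${p.book}, ${p.chapter}\n${p.text}`)
    .join("\n\n");
  return [
    "Answer using only the passages below and cite them by number.",
    "",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const prompt = buildPrompt("How does privilege escalation work on Linux?", [
  {
    book: "Kali Linux Cookbook",
    chapter: "Privilege Escalation",
    text: "CVE-2015-1328: overlayfs vulnerability affecting Ubuntu...",
  },
]);
console.log(prompt);
```

&lt;p&gt;Numbering the passages keeps the answer traceable: each claim can cite a bracketed source that maps back to a book and chapter.&lt;/p&gt;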
&lt;hr&gt;
&lt;h2 id=&#34;building-it-with-claude-code&#34;&gt;Building It With Claude Code&lt;/h2&gt;
&lt;p&gt;The entire CyberSecKB tool was built using Claude Code through PAI. The process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I described what I wanted: ingest Markdown books, chunk by section, embed with OpenAI, store in pgvector&lt;/li&gt;
&lt;li&gt;Claude Code scaffolded the TypeScript CLI&lt;/li&gt;
&lt;li&gt;We hit a few real-world issues along the way:
&lt;ul&gt;
&lt;li&gt;The OpenAI project key needed embedding model access enabled separately&lt;/li&gt;
&lt;li&gt;A batch size of 2048 chunks exceeded the 300k tokens-per-request limit, so it was tuned down to 200&lt;/li&gt;
&lt;li&gt;The 1M tokens-per-minute rate limit required a 15-second delay between batches&lt;/li&gt;
&lt;li&gt;A SQL type error in the search function when no topics filter was passed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
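&lt;p&gt;The batch-size and delay numbers above can be sanity-checked with some arithmetic. The sketch assumes every chunk is close to the 800-token target from the ingest step and that all 11,757 chunks pass through a single queue; &lt;code&gt;toBatches&lt;/code&gt; and the constant names are hypothetical.&lt;/p&gt;

```typescript
// Figures quoted in the post; 800 tokens per chunk is the approximate target size.
const TOKENS_PER_CHUNK = 800;
const MAX_TOKENS_PER_REQUEST = 300_000;
const MAX_TOKENS_PER_MINUTE = 1_000_000;
const BATCH_SIZE = 200;
const DELAY_SECONDS = 15;

// Split a list into consecutive batches of at most `size` items.
function toBatches(items: string[], size: number): string[][] {
  const batches: string[][] = [];
  let start = 0;
  while (items.length > start) {
    batches.push(items.slice(start, start + size));
    start += size;
  }
  return batches;
}

const oldTokensPerBatch = 2048 * TOKENS_PER_CHUNK;         // 1,638,400
const tokensPerBatch = BATCH_SIZE * TOKENS_PER_CHUNK;      // 160,000
const batchesPerMinute = 60 / DELAY_SECONDS;               // 4, ignoring request time
const tokensPerMinute = batchesPerMinute * tokensPerBatch; // 640,000

console.log(oldTokensPerBatch > MAX_TOKENS_PER_REQUEST); // true: the original batch was too large
console.log(MAX_TOKENS_PER_REQUEST > tokensPerBatch);    // true
console.log(MAX_TOKENS_PER_MINUTE > tokensPerMinute);    // true

// 11,757 chunks at 200 per batch.
const chunkIds = Array.from({ length: 11757 }, (_, i) => `chunk-${i}`);
const batchCount = toBatches(chunkIds, BATCH_SIZE).length;
console.log(batchCount); // 59
```

&lt;p&gt;Under these assumptions a 200-chunk batch carries about 160,000 tokens, comfortably under the per-request limit, and four batches per minute stay near 640,000 tokens, under the per-minute limit.&lt;/p&gt;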
&lt;p&gt;Each issue was diagnosed and fixed in the same conversation. The tool went from concept to 50 books indexed in a single session.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s Next&lt;/h2&gt;
&lt;p&gt;A few things I want to add:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tag all books with proper topics&lt;/strong&gt; — the batch ingest skipped topic assignment; I&amp;rsquo;ll tag each book so &lt;code&gt;--topics web&lt;/code&gt; or &lt;code&gt;--topics linux&lt;/code&gt; filters actually work&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tier 1 topic files&lt;/strong&gt; — condensed 5-15KB reference files for the most-used topics (SQLi, XSS, privilege escalation, etc.) that load directly into context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Security KB integration&lt;/strong&gt; — the AI Security research KB shares the same database; queries cross both domains automatically&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The knowledge base is live. The friction is gone. Now the books actually get used.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Built with PAI, Claude Code, PostgreSQL, pgvector, and OpenAI embeddings. All processing runs locally except the embedding API calls.&lt;/em&gt;&lt;/p&gt;
</content>
    </item>
    
  </channel>
</rss>
