This blog post describes my "capstone" final project for the recent Gen AI Intensive Course 2025Q1 offered by Google and Kaggle.
Retrieval‑Augmented Generation (RAG) has emerged as a powerful way for AI chatbots to leverage external data, giving them the ability to produce more accurate, context‑rich responses. Instead of relying solely on a large language model’s internal knowledge, RAG applications retrieve domain‑specific documents—often stored in a vector database—and feed them to the model as additional context. This approach is particularly helpful when building domain‑focused chatbots in areas like law, medicine, or finance, where ensuring that responses reflect up‑to‑date, specialized information is essential.
However, the RAG process involves numerous components. For instance, different embedding models can be used in the retrieval phase, various pre‑retrieval prompt adjustment methods can optimize outcomes, and a variety of large language models or fine‑tuned solutions can handle final generation. Selecting the right combination for a particular domain can be challenging. Developers often must try different techniques to see which works best for domain‑specific data, which introduces added complexity when iterating toward a deployable solution.
This project addresses that challenge by providing a comparative testing framework—an AI agent that automates evaluations across multiple RAG pipeline variants. The system runs a series of controlled experiments, systematically swapping out modules (e.g., pre‑retrieval prompt adjustment, retrieval strategies, generation models) while asking domain‑relevant questions. It logs outcomes, performance metrics, and user feedback. With these results, developers can more easily identify the best‑performing configuration. By eliminating the trial‑and‑error overhead, this solution reduces the time to create a reliable, domain‑tuned RAG chatbot and demonstrates how teams can systematically operationalize RAG for production.
In my capstone project, I set out to solve a real-world problem that many AI developers face: how do you figure out which configuration of a Retrieval-Augmented Generation (RAG) pipeline works best for your specific domain?
As RAG systems become more modular and sophisticated—offering options like prompt rewriting, multiple retrievers, and different generation models—the challenge shifts from building a pipeline to optimizing it. For instance, should your medical chatbot rewrite the user's query before retrieving documents? Should it use a dense or sparse retriever? And which LLM produces the most reliable answers with the retrieved context?
To tackle this, I built RAGLab, an automated evaluation framework that compares multiple RAG pipeline configurations in a structured, reproducible way. The tool lets you define a set of modules to test (e.g., different retrieval functions or generation models), a set of language models to use at each stage, a set of domain-specific questions, and reference answers. RAGLab then runs every combination of the modules against the questions, logs the responses, and scores the output.
Each pipeline variant is defined by choosing components at three main stages:
1. Pre-retrieval – whether and how the user's query is adjusted or rewritten before any documents are fetched.
2. Retrieval – which retrieval strategy or function pulls supporting documents from the vector store.
3. Generation – which language model turns the retrieved context into the final answer.
A configuration is simply one choice per stage, which can be sketched as a small dictionary of options (see below).
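Here is a minimal sketch of such a test matrix. The module and model names are illustrative placeholders, not RAGLab's actual identifiers:

# Hypothetical test matrix: each stage lists the module options to compare.
PIPELINE_OPTIONS = {
    "pre_retrieval": ["none", "rewrite_query"],            # prompt-adjustment choices
    "retrieval": ["dense_retrieve", "sparse_retrieve"],    # retrieval strategies
    "generation": ["gemini-1.5-flash", "gemini-1.5-pro"],  # generation models
}

# Domain-specific questions paired with reference (gold) answers for the judge.
TEST_CASES = [
    {"query": "A domain-specific question goes here.",
     "gold_answer": "The reference answer the judge will compare against."},
]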
For each configuration, RAGLab:
1. runs every domain question through the assembled pipeline,
2. logs the response together with the configuration that produced it, and
3. scores the response against the reference answer using an LLM judge.
The evaluation loop is essentially a nested iteration over configurations and questions, as sketched below.
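A rough sketch of that loop, assuming the test matrix above and two hypothetical helpers (run_pipeline and judge_answer) standing in for the notebook's actual functions:

import itertools

results = []  # one entry per (configuration, question) trial

# Try every combination of module choices across the three stages.
for pre, retr, gen in itertools.product(*PIPELINE_OPTIONS.values()):
    config = {"pre_retrieval": pre, "retrieval": retr, "generation": gen}
    for case in TEST_CASES:
        answer = run_pipeline(config, case["query"])  # hypothetical helper
        scores = judge_answer(case["query"], case["gold_answer"], answer)
        results.append({"config": config, "query": case["query"],
                        "answer": answer, **scores})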
Here’s a sample evaluation prompt for the judge:
EVAL_PROMPT_TEMPLATE = """You are an impartial expert grader for retrieval‑augmented chatbots.
Your job is to evaluate the chatbot’s answer against the gold answer
and the judging instructions, then return a JSON object with four fields.
### Inputs
• Query: {query}
• Gold answer (reference): {gold_answer}
• Chatbot answer to grade: {chatbot_answer}
• Judging instructions (override anything else if conflicts arise):
{judging_instructions}
### Rubric — score each dimension 1‑5
1 = very poor, 3 = acceptable, 5 = excellent.
1. Relevance – Does the chatbot answer address the user’s query?
2. Faithfulness – Is every factual statement supported by the gold answer
(or clearly marked as outside scope)? Penalize hallucinations.
3. Completeness – Does the answer include all key elements required by the
gold answer **and** satisfy the judging instructions (if provided)?
### Confidence
After scoring, give a **confidence** value 0.0‑1.0
(1.0 = absolutely certain your scores are correct; 0.0 = pure guess).
### Output — return **only** valid JSON in this exact schema, with NO markdown fences or extra text
{{
  "relevance": <integer 1-5>,
  "faithfulness": <integer 1-5>,
  "completeness": <integer 1-5>,
  "confidence": <float 0.0-1.0>
}}
No additional keys, text, or explanations.
Think silently before answering, but output only the JSON, with NO markdown fences or extra text."""
And the judge returns its output in a JSON packet formatted like this:
{
  "relevance": 5,
  "faithfulness": 5,
  "completeness": 4,
  "confidence": 0.95
}
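Under the hood, the judge is just this template filled in and sent to a model, with the reply parsed by json.loads. A minimal sketch of a judge_answer helper, assuming the google-generativeai SDK and a Gemini model (the notebook's actual model choice may differ):

import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")                  # supply your own key
judge_model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

def judge_answer(query, gold_answer, chatbot_answer, judging_instructions=""):
    """Fill the evaluation template, call the judge model, and parse its JSON verdict."""
    prompt = EVAL_PROMPT_TEMPLATE.format(
        query=query,
        gold_answer=gold_answer,
        chatbot_answer=chatbot_answer,
        judging_instructions=judging_instructions,
    )
    response = judge_model.generate_content(prompt)
    return json.loads(response.text)  # e.g. {"relevance": 5, ..., "confidence": 0.95}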
I also added a fallback: if confidence is low or no gold answer is supplied, RAGLab prompts the user for a manual score using the same rubric.
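That fallback can be as simple as a confidence threshold plus an input() prompt; the 0.6 cutoff below is an arbitrary illustration:

CONFIDENCE_THRESHOLD = 0.6  # arbitrary cutoff for trusting the automated judge

def score_with_fallback(case, answer):
    """Use the LLM judge when possible; otherwise ask the user to grade manually."""
    if case.get("gold_answer"):
        scores = judge_answer(case["query"], case["gold_answer"], answer)
        if scores["confidence"] >= CONFIDENCE_THRESHOLD:
            return scores
    # Low confidence or no gold answer: fall back to a human grade on the same rubric.
    print(f"Query: {case['query']}\nAnswer: {answer}")
    manual = {dim: int(input(f"Manual {dim} score (1-5): "))
              for dim in ("relevance", "faithfulness", "completeness")}
    manual["confidence"] = 1.0  # human grades are treated as fully confident
    return manual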
All of this runs inside a Kaggle notebook. The results and notes from each test are logged to a Python list and rendered as a summary Markdown table right in the notebook.
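Rendering that table takes nothing more than IPython's display utilities. One way it might look, given the results list built above:

from IPython.display import Markdown, display

def show_summary(results):
    """Render the logged results as a Markdown table in the notebook."""
    rows = ["| Pre-retrieval | Retrieval | Generation | Rel. | Faith. | Compl. | Conf. |",
            "|---|---|---|---|---|---|---|"]
    for r in results:
        c = r["config"]
        rows.append(f"| {c['pre_retrieval']} | {c['retrieval']} | {c['generation']} "
                    f"| {r['relevance']} | {r['faithfulness']} | {r['completeness']} "
                    f"| {r['confidence']:.2f} |")
    display(Markdown("\n".join(rows)))

show_summary(results)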
There are a couple of current limitations:
The modular design of this project lends itself to straightforward expansion. In the future, I’d like to:
With so many new tools and techniques in the RAG space, picking the right combination no longer has to come down to trial and error. RAGLab shows how Gen AI can be used to evaluate Gen AI. By automating this inner loop of experimentation, teams can save time, reduce guesswork, and build better domain-specific assistants faster.
And that’s what makes this not just a technical project—but a practical one for scaling Gen AI in the real world.
Watch my RAGLab YouTube Video.
Run the RAGLab Kaggle Notebook.