# Supplementary Materials

## Trust-Guided Multimodal LLM Integration with Reinforcement Learning for Autonomous Driving

WACV 2026 LLVM-AD Workshop

Paper ID: LLVM-AD-13

---

## Contents

1. Code Implementation
2. Extended Ablation Results
3. LLM Prompt Templates
4. Training Configuration

---

## 1. Code Implementation

### Requirements

```bash
pip install -r requirements.txt
```

### Main Components

| File | Description |
|------|-------------|
| `RLAD1.py` | Complete training and evaluation pipeline |
| `run_ablations.py` | Ablation study runner for 15 configurations |
| `ablation_visualizations.py` | Result visualization utilities |
| `verify_ablations.py` | Configuration verification scripts |

### Running Experiments

```bash
# Run all ablation configurations with 3 seeds
python run_ablations.py

# Run a single configuration
python RLAD1.py --algorithm TD3 --total_timesteps 200000 --seed 42
```
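
As an illustration, the single-configuration command above could be repeated over the three seeds used in Section 2. The sketch below only builds the command lines, using the flags shown above. `build_commands` is a hypothetical helper and is not claimed to mirror what `run_ablations.py` does internally.

```python
import sys

SEEDS = (42, 123, 456)  # seeds reported in the ablation tables


def build_commands(seeds=SEEDS, algorithm="TD3", total_timesteps=200_000):
    """Return one argv list per seed for RLAD1.py (hypothetical helper)."""
    return [
        [sys.executable, "RLAD1.py",
         "--algorithm", algorithm,
         "--total_timesteps", str(total_timesteps),
         "--seed", str(seed)]
        for seed in seeds
    ]


if __name__ == "__main__":
    for cmd in build_commands():
        # Printed rather than executed; subprocess.run(cmd, check=True) would launch it.
        print(" ".join(cmd))
```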

---

## 2. Extended Ablation Results

### Complete Configuration Results

The table below reports all 15 ablation configurations on 6 evaluation metrics. Each configuration is evaluated with 3 random seeds (42, 123, 456) and 50 episodes per seed, for a total of 150 episodes per configuration.

| Configuration | Avg Reward | Reward Std | Success Rate | Collisions/ep | Lane Violations | Jerk |
|---------------|------------|------------|--------------|---------------|-----------------|------|
| Baseline | 45.3 | 6.2 | 0.68 | 2.3 | 3.4 | 0.42 |
| Baseline+YOLO | 47.1 | 5.8 | 0.71 | 2.1 | 3.1 | 0.41 |
| LLM Only | 52.7 | 8.2 | 0.76 | 1.8 | 2.1 | 0.38 |
| LLM+Trust | 54.9 | 3.8 | 0.79 | 1.5 | 1.9 | 0.37 |
| LLM+Trust+YOLO | 56.8 | 2.9 | 0.81 | 1.4 | 1.7 | 0.36 |
| Full (no finetune) | 57.2 | 2.5 | 0.82 | 1.3 | 1.7 | 0.36 |
| Full System | 58.6 | 2.1 | 0.84 | 1.2 | 1.6 | 0.35 |
| Trust floor 0.1 | 56.2 | 3.8 | 0.79 | 1.6 | 1.9 | 0.37 |
| Trust floor 0.2 | 57.8 | 2.9 | 0.82 | 1.4 | 1.8 | 0.36 |
| Trust floor 0.3 | 58.6 | 2.1 | 0.84 | 1.2 | 1.6 | 0.35 |
| Trust floor 0.4 | 57.1 | 2.4 | 0.81 | 1.3 | 1.7 | 0.36 |
| Trust floor 0.5 | 55.4 | 2.7 | 0.78 | 1.5 | 1.8 | 0.37 |
| No Fallbacks | 53.8 | 4.2 | 0.77 | 1.6 | 2.0 | 0.38 |
| No LLM Reward | 52.4 | 2.8 | 0.78 | 1.7 | 2.1 | 0.37 |
| Baseline+Domain | 46.8 | 5.5 | 0.70 | 2.2 | 3.2 | 0.41 |

### Per-Seed Variance Analysis

| Configuration | Seed 42 | Seed 123 | Seed 456 | Mean | Std Dev |
|---------------|---------|----------|----------|------|---------|
| Baseline | 44.8 | 45.9 | 45.2 | 45.3 | 0.55 |
| LLM Only | 51.2 | 54.1 | 52.8 | 52.7 | 1.45 |
| Full System | 58.2 | 59.1 | 58.5 | 58.6 | 0.45 |
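
The Mean and Std Dev columns can be rechecked from the per-seed values. The sketch below assumes the sample standard deviation (n-1 denominator), which agrees with the tabulated values to within about 0.01. The estimator actually used is not stated in this document.

```python
from statistics import mean, stdev

# Per-seed average rewards copied from the table above.
per_seed = {
    "Baseline": [44.8, 45.9, 45.2],
    "LLM Only": [51.2, 54.1, 52.8],
    "Full System": [58.2, 59.1, 58.5],
}


def summarize(values):
    """Return (mean, sample standard deviation) for one configuration."""
    return mean(values), stdev(values)  # stdev uses the n-1 denominator


summary = {name: summarize(vals) for name, vals in per_seed.items()}

if __name__ == "__main__":
    for name, (m, s) in summary.items():
        print(f"{name}: mean={m:.1f}, std={s:.2f}")
```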

---

## 3. LLM Prompt Templates

### 3.1 Perception Stage (LLaVA-1.5-7B)

The perception prompt instructs the vision-language model to analyze camera images and identify safety-relevant objects. See `prompts/perception_prompt.txt` for the complete template.

Key output fields:
- Object detections with class, position, distance
- Scene description
- Hazard identification
- Weather and visibility assessment

### 3.2 Planning Stage (Phi-3-mini)

The planning prompt translates perception outputs into high-level driving decisions. See `prompts/planning_prompt.txt` for the complete template.

Key output fields:
- Risk level assessment (LOW/MEDIUM/HIGH/CRITICAL)
- Recommended action
- Urgency score (0.0-1.0)
- Reasoning explanation
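
A minimal sketch of how a planning response could be validated against the fields listed above. It assumes the model returns a JSON object. The key names (`risk_level`, `action`, `urgency`, `reasoning`) are hypothetical, and the actual output format is defined in `prompts/planning_prompt.txt`.

```python
import json
from dataclasses import dataclass

RISK_LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


@dataclass
class PlanningOutput:
    risk_level: str
    action: str
    urgency: float
    reasoning: str


def parse_planning(raw: str) -> PlanningOutput:
    """Parse and validate a planning response; key names are assumptions."""
    data = json.loads(raw)
    risk = str(data["risk_level"]).upper()
    if risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {risk!r}")
    urgency = float(data["urgency"])
    if not 0.0 <= urgency <= 1.0:
        raise ValueError(f"urgency out of range: {urgency}")
    return PlanningOutput(risk, str(data["action"]), urgency,
                          str(data.get("reasoning", "")))
```

Rejecting out-of-range values instead of silently accepting them gives the caller a clear point at which to trigger a fallback.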

### 3.3 Control Stage (Phi-3-mini)

The control prompt converts planning decisions into continuous control outputs. See `prompts/control_prompt.txt` for the complete template.

Key output fields:
- Throttle (0.0-1.0)
- Brake (0.0-1.0)
- Steering (-1.0 to 1.0)
- Confidence score
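
Because language-model outputs are not guaranteed to respect the ranges listed above, a consumer may want to clamp them before use. The sketch below assumes a JSON response with hypothetical keys `throttle`, `brake`, `steering`, and `confidence`. The [0, 1] range for the confidence score is also an assumption, since the source does not state it.

```python
import json


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def parse_control(raw: str) -> dict:
    """Parse a control response and clamp each field to its documented range."""
    data = json.loads(raw)
    return {
        "throttle": clamp(float(data["throttle"]), 0.0, 1.0),
        "brake": clamp(float(data["brake"]), 0.0, 1.0),
        "steering": clamp(float(data["steering"]), -1.0, 1.0),
        # Confidence range [0, 1] is assumed, not taken from the source.
        "confidence": clamp(float(data["confidence"]), 0.0, 1.0),
    }
```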

---

## 4. Training Configuration

### Environment Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| Control frequency | 10 Hz | Standard for autonomous driving |
| Wheelbase | 2.7 m | Typical sedan dimensions |
| Vehicle mass | 1650 kg | Representative passenger vehicle |
| Max steering angle | 30 degrees | Realistic constraint |
| Max episode steps | 1000 | Approximately 100 seconds |
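
To show how these parameters fit together, the sketch below collects them in a configuration object and uses them in a kinematic bicycle model step. The bicycle model is an assumption made for illustration, since this document does not state which vehicle model the environment uses. `EnvConfig` and `bicycle_step` are hypothetical names. Vehicle mass is stored but does not enter a purely kinematic update.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    control_hz: float = 10.0
    wheelbase_m: float = 2.7
    mass_kg: float = 1650.0
    max_steer_deg: float = 30.0
    max_episode_steps: int = 1000

    @property
    def dt(self) -> float:
        """Duration of one control step in seconds."""
        return 1.0 / self.control_hz

    @property
    def max_episode_seconds(self) -> float:
        return self.max_episode_steps * self.dt


def bicycle_step(x, y, heading, speed, steer_deg, cfg=EnvConfig()):
    """Advance a kinematic bicycle model by one control step (assumed model)."""
    limit = cfg.max_steer_deg
    steer = math.radians(max(-limit, min(limit, steer_deg)))  # enforce steering limit
    x += speed * math.cos(heading) * cfg.dt
    y += speed * math.sin(heading) * cfg.dt
    heading += speed / cfg.wheelbase_m * math.tan(steer) * cfg.dt
    return x, y, heading
```

With the defaults, `EnvConfig().max_episode_seconds` reproduces the 100-second episode length in the table (1000 steps at 10 Hz).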

### Reward Function

The composite reward function balances multiple objectives:

| Component | Coefficient | Description |
|-----------|-------------|---------|
| Progress | +0.1 | Distance advancement toward goal |
| Collision | -50 | Per collision event (hard constraint) |
| Lane violation | -5 | Per lane departure event |
| Comfort | -0.1 | Jerk magnitude penalty |
| LLM alignment | Variable | Trust-weighted action alignment |

Total reward is clipped to [-2, 2] for training stability.
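
The components above can be combined as in the following sketch. The function signature is hypothetical. The LLM alignment term is a placeholder (`alignment_weight * trust * alignment`) because the source describes it only as variable and trust-weighted. Clipping is assumed here to apply to the per-step sum.

```python
def step_reward(progress_m, collisions, lane_violations, jerk,
                trust=0.0, alignment=0.0, alignment_weight=1.0):
    """Composite reward from the table above, clipped to [-2, 2].

    The alignment term is an illustrative placeholder, not the paper's formula.
    """
    raw = (
        0.1 * progress_m                         # progress toward goal
        - 50.0 * collisions                      # per collision event
        - 5.0 * lane_violations                  # per lane departure event
        - 0.1 * abs(jerk)                        # comfort penalty
        + alignment_weight * trust * alignment   # assumed form of LLM alignment
    )
    return max(-2.0, min(2.0, raw))
```

Under this per-step clipping assumption, a single collision or lane violation already saturates the lower bound of -2, so the clipped reward does not distinguish between the two event types within one step.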

### Trust Gating Network

Architecture:
- Input: Concatenated sensor features (64D) and LLM features (12D)
- Hidden layer 1: Linear(76, 128), ReLU, Dropout(0.2)
- Hidden layer 2: Linear(128, 64), ReLU, Dropout(0.2)
- Output: Linear(64, 1), Sigmoid, scaled to [0.3, 1.0]

The trust floor of 0.3 prevents complete LLM suppression while enabling meaningful confidence modulation.
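
One way to realize the scaled output is an affine map of the sigmoid, `floor + (1 - floor) * sigmoid(z)`. This form is an assumption, since the source only states that the output is scaled to [0.3, 1.0]. The sketch covers the output scaling only, not the hidden layers, and `scaled_trust` is a hypothetical name.

```python
import math


def scaled_trust(logit: float, floor: float = 0.3) -> float:
    """Map a raw network output to a trust value in [floor, 1.0]."""
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if logit >= 0:
        sig = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        sig = e / (1.0 + e)
    return floor + (1.0 - floor) * sig
```

Changing `floor` to 0.1, 0.2, 0.4, or 0.5 corresponds to the trust floor settings compared in the ablation table.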

---

## Computational Requirements

| Component | GPU Memory | Inference Latency |
|-----------|------------|-------------------|
| LLaVA-1.5-7B (perception) | 2.1 GB | 28 ms |
| Phi-3-mini (planning) | 1.8 GB | 15 ms |
| Phi-3-mini (control) | 1.8 GB | 12 ms |
| Sensor fusion transformer | 450 MB | 12 ms |
| Trust gating network | 180 MB | 2 ms |
| TD3 policy network | 320 MB | 1 ms |
| Total | Approximately 6.7 GB | Approximately 70 ms |

Experiments were conducted on NVIDIA A100 40GB GPUs.
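
The totals can be rechecked against the per-component figures. The sketch below assumes that the stages run sequentially (so latencies add) and that memory footprints add, with Phi-3-mini counted once per stage as in the table. It also compares the total latency with the control period implied by the 10 Hz control frequency in Section 4.

```python
# (GPU memory in GB, latency in ms), copied from the table above.
COMPONENTS = {
    "LLaVA-1.5-7B (perception)": (2.1, 28),
    "Phi-3-mini (planning)": (1.8, 15),
    "Phi-3-mini (control)": (1.8, 12),
    "Sensor fusion transformer": (0.45, 12),
    "Trust gating network": (0.18, 2),
    "TD3 policy network": (0.32, 1),
}

CONTROL_HZ = 10  # control frequency from the environment parameters

total_memory_gb = sum(mem for mem, _ in COMPONENTS.values())
total_latency_ms = sum(lat for _, lat in COMPONENTS.values())
control_period_ms = 1000 / CONTROL_HZ
headroom_ms = control_period_ms - total_latency_ms

if __name__ == "__main__":
    print(f"memory: {total_memory_gb:.2f} GB, latency: {total_latency_ms} ms, "
          f"headroom per control step: {headroom_ms:.0f} ms")
```

Under these assumptions the summed memory is 6.65 GB, which rounds to the reported 6.7 GB, and the 70 ms total latency leaves 30 ms of headroom within each 100 ms control step.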

---

## License

This code is released for academic research purposes.

## Citation

```bibtex
@inproceedings{chennaka2026trust,
  title={Trust-Guided Multimodal LLM Integration with 
         Reinforcement Learning for Autonomous Driving},
  author={Chennaka, Sairam and Nidamanuri, Jaswanth},
  booktitle={WACV LLVM-AD Workshop},
  year={2026}
}
```
