Phil 5.7.2024

SBIRs

Sent the sanitized paper to Orest and Tivern
Sent the latest version of the white paper to Matt
Dahlgren expenses – done
Get flight, car and hotel for the 92nd Symposium. Got flight and car. Working on hotel.
9:00 Standup
Work on pulling out layer activations and using UMAP and/or aligned UMAP
I just discovered that you can plot inside of Visual Studio, you need to run a Jupyter notebook once to set things up, then just “run in interactive window“. Works for plotly and matplotlib!

Phil 4.6.2024

Happy Seis de Maio!

Vote!

The Kinetic Sculpture Race was wet, but still fun:

SBIRs

Ranked a bunch of potential topics for BD
Write up notes from Thursday’s meeting
Work with Protima to access activations on the GPT just using HFace, then visualize with UMAP. Started by downloading and prompting the chess model, which is working!

Got the environment running again after the password reset!
Dahlgren expenses

Phil 3.4.2024

The Kinetic Sculpture Race is tomorrow! hoping that the rain holds off

Last year’s winner

Tasks:

3:30 Ground Rent – done
Change AC service date – done
Chores – finished all the ones that don’t require a car. Didn’t feel like driving more

SBIRs

Initial forms – done
Tech transition meeting – went well
Reset AWS password – done

Phil 5.2.2024

USNA Capstone all day yesterday

SBIRs

9:00 standup from the car. A truck lost a tire in front of me while I was talking. Sheesh!
Meetings in Dahlgren all day – not as much fun as USNA 🙂
Make a LinkedIn and BlueSky post on the new paper
No book club today

GPT Agents

2:00 Weekly meeting – cancelled since I was stuck in the car

Phil 4.30.2024

Tasks

Reschedule AC
International driver’s license
Plants!
Till
iPhone stuff – done

SBIRs

10:00 SEG Meeting – agreed to work initial financials and schedule – done
1:00 SBIR Prop meeting – much discussion of paperwork
Capstone reception 5:00 – 7:00
War Elephants on ArXiv – done

Phil 4.29.2024

Tasks

International driver’s license
~~Screen door~~
Plants!
Till

SBIRs

9:00 – Sprint Demos
12:30 Kickoff
3:00 Sprint Planning

Phil 4.26.2024

Today on AI used with Ill intent:

Baltimore County Police arrested Pikesville High School’s former athletic director Thursday morning and charged him with crimes related to the alleged use of artificial intelligence to impersonate Principal Eric Eiswert, leading the public to believe Eiswert made racist and antisemitic comments behind closed doors.

Also, it seems he probably scored high on the SDO scale: What it was like to be a student of Dazhon Darien, accused of framing principal with AI

It was a good day for a few students at Pikesville High School on Thursday hanging out in the parking lot after school. Their former athletic director, who they said belittled them and made them feel uncomfortable, wasn’t coming back.

Phil 4.25.2024

Can’t seem to backup my phone using itunes any more. Doing the cloud thing

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Followup: Simple probes can catch sleeper agents

This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.

We present experiments measuring the generalization abilities of probes trained off-policy in a toy setting. We show that probes can generalize well to different text formats and also generalize from harmful text the LLM wouldn’t output to harmful text where the LLM has been jailbroken to actually output the harmful text.

Large language models (LLMs) can “lie”, which we define as outputting false statements despite “knowing” the truth in a demonstrable sense. LLMs might “lie”, for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting — prompting GPT-3.5 to lie about factual questions — the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

SBIRs

9:00 Standup
3:00 AFRL meeting – looks like we’ll set up an overleaf project and start generating a white paper every few months. Topic 1-(something) will be first. Going to see what goes on with the MORS talk first?
4:00 ONR meeting – We can repurpose the M30 content into the slide format, then maybe do that with the AFRL white papers
4:30 Book club

GPT Agents

2:00 Meeting

Phil 4.24.2024

Or 4/24/24. Or 24/4/24, which also looks nice

SBIRs

1:30: Some CwoC discussion? Yup
Spent the rest of the day setting up my dev environment

Phil 4.23.2024

Woke up nice and relaxed after a good night’s sleep. The night before a presentation is not easy for me.

I’ve been thinking about this slide from the talk yesterday:

I think that AI researchers are in a place that nuclear researchers were in the ’30’s. There is this amazing technology that is going to change the world, but no one is sure how. Then the world engages in a total war that depends on technology and the Allies are not doing well. Some of the researchers think that a nuclear weapon might turn the tide. It works, but in retrospect it was too much too late. But for 10 years the chance that there could be a broad nuclear war was high, and take as just an extension of current developments – a bigger bomb. It took decades for that viewpoint to shift. AI weapons are probably here already, and there are nations and organizations that are working out the best way to use them – as an extension of current “active measures” strategies and tactics. And like the atomic bomb, we really have no idea where this will go.

SBIRs

Read a bunch of stuff for upcoming meetings
Fire up the NNM instance and see if I can remember how to use it. Add an instruction section to the notebook – got sidetracked into doing a detailed read of a BAA
9:00 standup
11:30 AI Ethics training discussion with Hall Research. They are legit as it gets. Let’s see what kind of training they put together, but for now I give a ringing endorsement
3:30 meeting on the Phase IIe. We have three weeks to respond, but it doesn’t seem like they are asking for much? Very confused. Maybe because it’s an extension?

Phil 4.22.2024

Finished the slide deck and gave the talk. A single question. Still, it’s a nice deck that could get used elsewhere. Also need to update my CV – done

SBIR’s

Need to show Protima how to set up an Overleaf project for research documentation.
Need to start again on the NNM code, but at 4:30, it’s too late in the day

Phil 4.19.2024

Chores

~~Ground rent research~~
~~House cleaning~~
~~Yard~~
~~See if truck is ready~~ – back in the driveway, and it’s electronics magically fixed themselves when it was getting its oil change at no charge. I’m agape.
~~BSO~~
~~Water bill~~

SBIRs

Slides <- could have done more, but now I have a PLAN!

Phil 4.18.2024

Dentist!

7:30 syphony

SBIRs

Did some work on the slides but not enough.
Helped Aaron on the paper and go motivated to create an ASRC template so we never have to do *that* again
9:00 standup
11:00 CUI meeting – cancelled
4:30 book club – cancelled for the best – and saddest – of reasons

GPT-Agents

Set Alden’s overleaf project up. Probably overkill
2:00 meeting – Just Jimmy, but fun!
CUI provocation v3 submitted

Phil 4.17.2024

SBIRs

Big kerfuffle on the report yesterday afternoon. As a result, I just worked on it till it was done. Submitted this morning to show my commitment. Sigh
Need to work on the slide deck today but my motivation is lacking

Phil 4.6.2024

World’s First AI Pageant To Judge Winner On Beauty And Social Media Clout

SBIRs

Fold KillerApps slides into presentation, then start thinking about images
9:00 standup
2:30 AI ethics maybe?
Need some guidance on the way to handle the “culling,” which I think is going to be pretty weird unless the writing around the culls are reworked.

viztales

Dimension reduction, State, Orientation, and Speed

Phil 5.7.2024

Phil 4.6.2024

Phil 3.4.2024

Phil 5.2.2024

Phil 4.30.2024

Phil 4.29.2024

Phil 4.26.2024

Phil 4.25.2024

Phil 4.24.2024

Phil 4.23.2024

Phil 4.22.2024

Phil 4.19.2024

Phil 4.18.2024

Phil 4.17.2024

Phil 4.6.2024