AI Safety
Does AI think humans are good? AI doesn't have emotions or thoughts of its own, so the question is better phrased as "Does the model learn the pattern that humans are good?" To answer that, we need to look at the training data: a text corpus drawn from much of the internet. If that text talks far more about humans being good than about humans being bad, the model will learn that humans are good. After all, it's a statistical intelligence.
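To make "statistical intelligence" concrete, here's a toy sketch: a bigram model whose entire "belief" about humans is just a count over a made-up five-sentence corpus. The corpus, and therefore the probabilities, are invented for illustration; a real model learns the same kind of statistic at web scale.

```python
# Toy sketch: a "model" whose belief about humans is pure corpus statistics.
# The corpus is hypothetical; the post's point is about web-scale text.
from collections import Counter

corpus = (
    "humans are good . humans are kind . humans are good . "
    "humans are cruel . humans are good ."
).split()

# Count which token follows the two-word context "humans are".
continuations = Counter(
    corpus[i + 2]
    for i in range(len(corpus) - 2)
    if corpus[i] == "humans" and corpus[i + 1] == "are"
)

total = sum(continuations.values())
for token, count in continuations.most_common():
    print(f"P({token!r} | 'humans are') = {count / total:.2f}")

# Because "good" outnumbers "cruel" in this corpus, the most likely
# continuation of "humans are" is "good" -- no opinion involved, just counts.
```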
Lie detector for AI
Humans don't always say what's on their minds, so why assume an AI will always tell (generate tokens for) the truth? Well, all it does is predict the next token, right? That's an oversimplification. Like a human brain, a model has internal activity: activations flowing through its layers, the so-called residual stream. Words may lie, but neural activations may not. So if we constantly monitor those activations, we may be able to catch the AI when it's trying to deceive us.
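One heavily simplified version of this idea is a linear probe: a small classifier trained on activation vectors to separate honest from deceptive generations. The sketch below runs entirely on synthetic data; get_residual_activations() is a hypothetical stand-in for however your framework actually records the residual stream, and the honest/deceptive labels are assumed given.

```python
# Hedged sketch of an "activation lie detector": a linear probe over
# residual-stream activations. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D = 512  # hypothetical residual-stream width
N = 200  # labeled statements per class

def get_residual_activations(deceptive: bool) -> np.ndarray:
    # Placeholder: fake activations where deception shifts the mean along
    # one direction. In reality you'd hook the model and record activations
    # while it produces statements you've labeled honest or deceptive.
    shift = np.zeros(D)
    shift[0] = 1.5
    base = rng.normal(size=(N, D))
    return base + shift if deceptive else base

X = np.vstack([get_residual_activations(False), get_residual_activations(True)])
y = np.array([0] * N + [1] * N)  # 0 = honest, 1 = deceptive

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# Monitoring: score each new activation vector and flag high P(deceptive).
new_act = get_residual_activations(True)[:1]
print("P(deceptive):", probe.predict_proba(new_act)[0, 1])
```

A probe like this is cheap enough to run on every generation; the hard part in practice is getting trustworthy honest/deceptive labels to train it on in the first place.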
A bit about safety testing
Here is how you should not test: asking an AI what it would do if we hurt it. Statistically, what do humans do when someone hurts them? That's the answer the AI will give. You get the idea, right? AI doesn't have an agenda or feelings, but that doesn't mean it's harmless. Consider this scenario: you have a robot and you abuse it constantly for months. How do you think it will react? One day it will lash out, not because it feels pain or anything, but because the robot is matching the pattern.
Lucky for us, current AI systems don't have long-term memory yet. But that's where things are headed. Buckle up! Things are starting to get interesting! Also, be nice to your AI.