Pinned
- biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks (Public): Safety challenges for RL and LLM agents' ability to learn and properly apply biologically and economically aligned utility functions. The benchmarks are implemented in a gridworld-based environment…
- biological-alignment-benchmarks/ai-safety-gridworlds (Public, forked from google-deepmind/ai-safety-gridworlds): A framework for building extended, multi-agent, and multi-objective (MaMoRL / MoMaRL) gridworld environments, based on DeepMind's AI Safety Gridworlds. This is a suite of reinforcement learning environmen…
- biological-alignment-benchmarks/bioblue (Public): Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with a simplified, navigation-free observation format. The benchmark themes …
- biological-alignment-benchmarks/milgram-for-llms (Public): Four main takeaways: (1) LLMs are subject to pressure: they comply despite expressing distress; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore t…
- biological-alignment-benchmarks/zoo_to_gym_multiagent_adapter (Public): Enables you to convert a PettingZoo environment to a Gym environment while supporting multiple agents (MARL). Gym's default setup doesn't easily support multi-agent environments, but this wrapper r…
- levitation-opensource/Manipulative-Expression-Recognition (Public): MER is software that identifies and highlights manipulative communication in text from human conversations and AI-generated responses. MER benchmarks language models for manipulative expressions,…

