Berkeley, California, United States
Conducting independent research on the limitations of Reinforcement Learning from Human Feedback (RLHF), including value pluralism, reward hacking, and fairness impossibility theorems as they apply to inherently biased, large-scale human-sourced datasets.
Developing a theoretical framework for internalized value learning, in which AI agents learn from human datasets while evaluating information against an internal system of ethics and objectives, enabling moral reasoning rather than mere imitation of observed behavior (a toy sketch of this idea appears below).
Surveying existing literature in RL, cognitive science, and moral philosophy, and connecting these fields to alignment mechanisms. Collaborating with a Google engineer on a blog-style project to communicate technical research insights to a broader audience.
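To make the internalized-value-learning idea concrete, here is a minimal, hypothetical Python sketch. It is not the actual framework: the `Candidate` type, the value functions, the weights, and the veto rule are all illustrative assumptions. It shows one possible reading of the idea, where an agent scores dataset-derived candidate actions against an internal set of values before acting, rather than imitating the dataset directly.

```python
# Hypothetical toy sketch of "internalized value learning": all names, weights,
# and value functions below are illustrative assumptions, not the framework itself.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Candidate:
    action: str
    imitation_score: float  # likelihood under behavior observed in the dataset

# Internal "ethics" system: named value functions mapping an action to a score.
InternalValues = Dict[str, Callable[[str], float]]

def choose_action(candidates: List[Candidate],
                  values: InternalValues,
                  weights: Dict[str, float],
                  veto_threshold: float = 0.0) -> str:
    """Pick the candidate with the best combined score, vetoing any
    candidate whose worst internal value score falls below the threshold."""
    best, best_score = None, float("-inf")
    for c in candidates:
        value_scores = {name: fn(c.action) for name, fn in values.items()}
        # Hard veto: a high imitation score cannot override an internal objection.
        if min(value_scores.values()) < veto_threshold:
            continue
        combined = c.imitation_score + sum(
            weights[name] * s for name, s in value_scores.items())
        if combined > best_score:
            best, best_score = c, combined
    return best.action if best else "abstain"

# Usage: an action common in the dataset is vetoed by an internal value,
# so the agent does not simply imitate observed behavior.
values: InternalValues = {
    "harm_avoidance": lambda a: -1.0 if "deceive" in a else 1.0,
}
candidates = [Candidate("deceive user", 0.9), Candidate("answer honestly", 0.6)]
print(choose_action(candidates, values, {"harm_avoidance": 0.5}))  # answer honestly
```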
Current Research:
1."Reinforcement Learning through Human Feedback (RLHF) may not be enough"
(Presented at the CITRIS and the Banatao Institute Tech Policy Symposium; paper forthcoming)
2. Investigation of Emotion in RL