
Delphi recently hosted a live text-based AMA on Telegram with Teknium, Co-Founder and Head of Post-Training at Nous Research. The conversation was moderated by Tommy Shaughnessy and featured questions from the Ex-Machina Telegram Group Chat.
Disclosure: Delphi is an investor in Nous. This AMA is shared for informational purposes only, not investment advice. The views expressed are those of the speaker and do not necessarily reflect the views of Delphi Ventures.
⸻
Tommy Shaughnessy: Hi everyone, I’m extremely excited to host Teknium, Co-Founder and Head of Post-Training at @NousResearch for a text AMA here. I’ll kick off with a few prepared questions to keep momentum. Everyone can post questions anytime; if a question gets a lot of thumbs-up I’ll surface it to Teknium so the flow stays manageable. We’ll aim for 60 minutes.
Teknium: 👋
Tommy Shaughnessy: Hey Teknium, first question — who are you, what do you do, and how do we know you’re not AGI?
Teknium: Hey everybody, I’m R***, but the internet knows me as Teknium — I’m the Head of Post-Training at Nous Research, and you never know, I may be an AGI :)
Tommy Shaughnessy: How would you describe Nous Research to Elon Musk? You have pre-training, models, RL environments, and more.
Teknium: Nous started as a group of high-signal AI researchers and developers trying to bring out the most from open-source models, and became a company to drive a free and open future toward AGI. We do post-training, including chain-of-thought reasoning training, RL environments, agents, exploring new interfaces for model interactions, and of course decentralized training infrastructure using breakthrough research techniques like DisTrO.
Tommy Shaughnessy: Narrowing in on that post-training section of your answer — what does post-training mean? Can you walk us through how you actually do this work? How closely do you interface with pre-training folks?
Tommy Shaughnessy: (For the crowd: feel free to send questions in group; I’ll flag for Tek if people thumbs-up them.)
Teknium: Yeah, well — before a model gets to me, it starts as a pre-trained model. These models are trained on everything and anything available on the internet — Twitter posts, webpages, textbooks, etc. These models are decent next-word predictors but not great at understanding the role of being your assistant, co-programmer, friend, etc.
That leaves pre-trained models in a state kind of like clay — they can be molded into anything you want from there. Post-training takes that clay and molds it into a smarter, more steerable, fine-tuned model, which can be aligned to the morals or ethics you want, or built specifically for working with your product or other tooling. Post-training is where the specificity of downstream use cases and product integration begins.
Teknium: The post-training team works with the pre-training team on a variety of things, including data ratios — for instance, if you want a coding model out of post-training, it’s very good to inform the pre-training team to allocate more resources to coding data. We also help specify architecture needs, like the context length you need (especially if your downstream use case is agentic or RAG-focused), the modalities you want to explore (image, text, video, and audio are just a few examples), and the inference speed required for the use case, which helps determine how big (in parameters) the model should be and what shape it takes.
Tommy Shaughnessy: Without sharing too much of your secret sauce, how do you actually do this post-training work? How do you take a pre-trained model and turn it into Hermes — turn it into a great coding model? What’s the day-to-day like for you?
Teknium: Day-to-day looks like a lot of data engineering: finding, curating, cleaning, creating, and sourcing the right kind of data to fill any gaps for the tasks you care about, while keeping it high quality and diverse.
From there it’s a lot of infrastructure-related challenges like sourcing compute, getting the training and hyperparameters just right, and pushing through the training.
Finally, there’s a lot of post-training work on testing, evaluations, documentation and all that, which sometimes takes as long as training — and if the evaluations don’t look right, going back and starting from square one to find out what went wrong.
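For readers who want to make that workflow concrete, here is a deliberately minimal sketch of the supervised fine-tuning step, for illustration only (the model name and the single training example are placeholders, not Nous's actual pipeline):

```python
# Minimal supervised fine-tuning (SFT) sketch. Illustrative only: real
# pipelines batch data, mask prompt tokens out of the loss, schedule the
# learning rate, and checkpoint/evaluate along the way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# The hard part is everything that produces this list: finding, curating,
# cleaning, creating, and sourcing the right data.
examples = [{"prompt": "Explain RLHF in one paragraph.", "response": "..."}]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for ex in examples:
    text = ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    # Standard causal-LM objective: the model shifts the labels internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```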
Tommy Shaughnessy: This is really interesting. Maybe an odd question, but everyone is so focused on emergent behavior with models and new use cases, yet it sounds like you need to have your task in mind (coding, etc.) when you are doing post-training work. Do you need to have a very specific focus on the end game (say Hermes)?
Teknium: If you want a specialist model, yes — or if you have at least some ideas of end use cases you want to support. The models do generalize, but they generalize best when given a good, diverse dataset to work with. We try to be as generalized as possible to support as many use cases as possible, but when you know a task is desired downstream, it’s always a good idea to give it special attention.
With the agentic future, use cases often require much more foresight into the specific downstream applications you want to target — the model is often paired with an agentic harness, and training with that pairing in mind can unlock a lot more value.
Tommy Shaughnessy: This makes sense. Before switching gears, do you believe we are running out of data, or is that a myth? I ask since you are very focused on the data that goes into Hermes.
Teknium: I think the data issue that’s popularly discussed is more in the realm of pre-training data, where the entire internet and beyond has been scraped and input into these models. For post-training, especially with RL environments, we are just barely scratching the surface in my opinion.
Tommy Shaughnessy: This is awesome color. Switching gears for a bit and we will come back to Hermes.
Stepan: Teknium, thanks for doing this — super interesting! My question is about the future of models and agents. Broadly, I see two schools:
• One thinks that over time models and apps will become more specialized, heterogeneous — some kind of division of labor for AIs.
• Others believe that agentic capabilities will become baked into even larger models (it’s cheaper to send gradients within a single network rather than JSONs via HTTP).
Which one (or both, or neither) do you believe is more likely, and why?
Tommy Shaughnessy: When thinking through Nous Research, what is the fundamental reason an open-source AI lab has the ability to be impactful in the face of hyperscalers that have orders of magnitude more compute, data, and capital to spend?
Tommy Shaughnessy: Helpful, Tek — feel free to answer mine then Stepan’s if easier.
Austin Muñoz | Amber Group: Can you elaborate on this? What data seems to have been left out of pre-training that you guys are capturing in post-training?
Teknium: @Shaughnessy119 The best shot to win against tech oligopolies is to embrace open source and open science — open research compounds on each other and it undermines their industry capture. If you can get the world to work on the same problem together, it can be a massive boon against the centralized players.
Teknium: @sgershuni I do think that training for specific use cases paired with specific harnesses, products, etc. can unlock a lot of value. You can see it even at big labs — they have specialized harnesses and training for different “products,” e.g., Deep Research or the web-browsing agent. I think over time these capabilities will find their way back into their primary models though. So, I suppose, both.
Teknium: @moon_yose It’s not really that the data has been left out of the pre-training, it’s more about the fact that in RL the data is generated by the model itself. You can take existing data or completely new, generated data and take it way further, because all you need is a tiny slice of information in some cases to create an insane amount of problems for verification in RL.
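To illustrate how a tiny slice of information can go that far, here is a toy example of seeding an effectively unlimited supply of verifiable RL problems from a single template (hypothetical, not an actual Nous environment):

```python
# Toy example: one problem template plus a hard verifier yields unlimited
# training signal for RL, since the model generates its own attempts.
import random

def make_problem(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    prompt = f"Compute {a} * {b}. Answer with the number only."
    return prompt, a * b

def reward(model_answer: str, truth: int) -> float:
    # Hard verifier: full reward only for an exactly correct answer.
    try:
        return 1.0 if int(model_answer.strip()) == truth else 0.0
    except ValueError:
        return 0.0

rng = random.Random(0)
prompt, truth = make_problem(rng)
print(prompt, reward(str(truth), truth))  # -> ... 1.0
```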
Tommy Shaughnessy: On deck for you, Teknium, as people read your answers: Hermes 4 is your latest model release that I’ve described as an uncensored model, with 60% non-refusals vs. OpenAI’s gpt-oss at 5%. This opens up a new design space for anyone using the model who wants to build something where past models said no or dropped disclaimers. Can you walk through your view on the impact Hermes 4 will have?
Teknium: On its less-censored nature and its use cases, I think there’s a pretty wide array of tasks that just can’t work within the paradigm corporate-sanitized models like ChatGPT exist in — and this only gets worse year by year, not just with AI, but with a whole variety of internet products and services. More censorship, corporate “enshittification,” etc., makes the world more sterile and closes up the breadth of human expression. As AI models do more and more of our work, the paths that have been closed will constrain the forms of expression humanity can take, and I’ve called this phenomenon the mode collapse of the world.
Simple things like automating sales often become extremely tricky to pull off with models like ChatGPT, because the slightest hint that the model is asked to put any pressure whatsoever on a prospective buyer may cause it to break down entirely and refuse to work with the problem. Writing stories of authentic human experiences — even everyday ones, like romance, horror, comedy — becomes severely suppressed.
White-hat hackers, doing things like vulnerability discovery and red-teaming, have no viable way to use these models to do their work.
These are just a few examples of the real value of reducing where the model says “no, you can’t express that.” There are countless more.
Pondering Durian: Everyone seems very excited about RL environments, but they seem quite manual to craft and not very “bitter-lesson”-pilled. Curious if you agree with that and, if so, other approaches you are excited about — or is that where distributed versions become more competitive/interesting?
Tommy Shaughnessy: This is great framing. Do you see any way ChatGPT or a big AI lab could reverse this situation of continuous restriction?
Anil | Delphi: Also, what unique datasets or collection methods do you believe others cannot easily copy? Curious how you think about this. How do you prove provenance and quality for that data then?
Tommy Shaughnessy: Pd/Anil’s questions are great — no rush; Teknium can add to the docket.
Teknium: @Shaughnessy119 I think xAI is the only lab even considering doing so; I do not see ChatGPT, Google, or Claude doing so.
Teknium: @PonderingDurian I think a way that it can be bitter-pilled is if the entire world contributes toward making environments that are meaningful to them. A thing about RL environments is that, despite them being pretty deeply integral to machine learning, they rely more on domain expertise than ML engineering skills. This is perfect for an open-source initiative, as even gigantic companies can’t source specialists in every field, but open contributions can.
I’d also note that before RL, big labs were spending hundreds of millions of dollars sourcing complete prompt:response samples from humans — e.g., via Scale AI — so it hasn’t really made the “bitter lesson” calculation any different here; it’s likely even more efficient.
Harrison | Hack: Sorry if it’s already been asked before, but in post-training Hermes 4 (which is awesome btw):
1. Once the weights of the base underlying model change (e.g., Llama 3.2 → 3.3 → 4), how much repeat post-training work do you need to do? Or is there a repeatable layer so you’re not spending a bunch of GPU cycles?
2. Does that process translate across different architectures (e.g., Llama transformers → Kimi or Qwen / more MoE)?
3. Re data set collection — without sharing any secret sauce, can you share a little more about where you’ve found good sources for the high-quality post-training data Hermes finds useful/orthogonal to the major labs’?
Tommy Shaughnessy: Love this.
Teknium: @anildelphi For Nous, specifically, we have the creator of Generative Reward Models as our lead on RL — so we have a big leg up when it comes to rubric- and LLM-as-a-judge-based environments, and things like RLAIF (Reinforcement Learning from AI Feedback). The Generative Reward Models paper describes the best ways to utilize LLMs for scoring, rather than a hard verifier like you might use in math.
This is similar to the rubric-based rewards that OpenAI used for things like Deep Research, where the reward is fuzzier and less verifiable than a completely objective scenario. We have a lot of expertise on this front.
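As a rough illustration of the rubric-based, LLM-as-a-judge idea (the rubric and the `complete` function below are hypothetical placeholders, not the Generative Reward Models implementation):

```python
# Sketch of rubric-based LLM-as-a-judge scoring: a judge model reads the
# exchange against a rubric and emits a score, which becomes a fuzzy reward.
RUBRIC = """Score the ASSISTANT response from 1 to 5 on factual accuracy,
instruction following, and clarity. Reply with a single integer."""

def judge(complete, prompt: str, response: str) -> float:
    """`complete` is any text-in/text-out LLM call (placeholder)."""
    judge_prompt = f"{RUBRIC}\n\nUSER: {prompt}\nASSISTANT: {response}\nSCORE:"
    raw = complete(judge_prompt)
    try:
        score = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparseable judgments earn no reward
    return min(max(score, 1), 5) / 5.0  # normalize to (0, 1] for RL
```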
Teknium: Hey Harrison, unfortunately if you want to change the base, unless the base is a continued pre-train of the previous base model, you would have to retrain the model.
In a case like Llama 3.1 to 3.3 — if they had released base variants of both — you could do model merging to upgrade for basically free, though.
Merging across architectures is currently not possible, but friends of ours like Charles Goddard and our Researcher-in-Residence Bhavesh have been deeply exploring it for a while now, looking to find a way.
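For intuition, the simplest form of same-architecture merging is plain linear interpolation of the weights; a sketch follows (production tools like mergekit implement far more sophisticated methods):

```python
# Linear weight interpolation between two checkpoints that share an
# architecture. The key-set check is also why merging across different
# architectures fails: the parameter names and shapes simply don't line up.
import torch

def linear_merge(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}
```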
On data curation, we have three big sources:
1. Our DataForge pipeline has great mechanisms for creating new problems and diversification through pre-train mix-ins as random salt, multiple stages of validation on instruction and answer generation, and more for synthetic data creation and distillation.
2. Rejection sampling and RL through Atropos, which is our repository of RL environments that can generate reasoning data on over 1,200 different tasks and growing.
3. Great filters and curation tools for sourcing data from open-source datasets.
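A minimal sketch of the rejection-sampling idea from (2): sample several candidate solutions and keep only those a verifier accepts (`generate` and `verify` are placeholders, not Atropos APIs):

```python
# Rejection sampling for reasoning data: over-generate, verify, and keep
# the survivors as training data.
def rejection_sample(generate, verify, prompt: str, n: int = 8) -> list[str]:
    candidates = [generate(prompt, temperature=1.0) for _ in range(n)]
    return [c for c in candidates if verify(prompt, c)]
```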
Tommy Shaughnessy: While people read, Tek — I’m curious on the interplay of Nous’s initiatives. How does Psyche Network interface with Atropos on the RL side, with your consumer-facing chat app, and all the research work you put out?
Teknium: We are building Psyche out in stages, with traditional pre-training and SFT as the first phase for training capabilities. Next is RL — which is a bit more complicated, requiring coordination of not only training nodes, but inference nodes and environments (which are really just data creation and validation servers) all together. These will be integrated in some form as a next step.
For the consumer app, there’s some research on greatly personalizing models to each user, similar to the personalized models Midjourney lets you create on top of its v7 model. The work is a bit early though, so stay tuned!
Tommy Shaughnessy: Maybe a bit of a curveball and a product question: how do we get people to have muscle memory for Nous? ChatGPT has this with AI, Google with search — how do we get Hermes in the hands of people to have that feeling from a brand/project perspective? Or does Hermes live behind the curtain powering apps and use cases?
Teknium: I think it’s a bit of both. We can see Hermes is used in a lot of applications already, especially role-playing and creative use cases. I’d like it to break out into more personalized assistant use cases, especially when we have the agentic prowess that will be coming soon. We see a lot of people using it privately at home for things like legal questions and strategy that they don’t trust in OpenAI’s datacenter, so there’s a lot of value in a powerful generalist assistant that can be easy to access when you want it and fully private when you need it.
I think our biggest draw is that it’s aligned to you right now, in a way no other company currently offers, but it will become much more powerful as we enter into agents and build products directly integrated with the training to unlock a lot of potential value.
Tommy Shaughnessy: This introduces an interesting question — Hermes is way more open and aligned to you (60% non-refusals vs. gpt-oss at 5%) but ChatGPT has muscle memory and integrations (Docs, Chat, email, Slack, etc.). Do you think alignment beats integrations for an assistant?
Christine Yip: Drawing from your experience, what emerging trends in AI research (e.g., multimodal models, edge computing, AI memory) could revolutionize AI/agent ecosystems?
Teknium: @christinetyip first, then back to Tommy — I think we’ve done a lot of work on memory internally, and there is a beta of our memory system accessible to some users right now, which I think has the potential to be a path to some sort of online learning and in-context personalization for the end user.
In other research-related unlocks — DisTrO was a huge step in enabling us to coordinate mass-scale compute for training our (and the community’s) next models, opening doors to research on new architectures, and more.
Our Chief Scientist is also researching right now any-to-any model architectures and developing novel ways to produce them, so we may see a Psyche-powered any-to-any model (this means any modality — text, audio, PDF files, anything — to anything) soon.
I do think a big open area of research is on context length, which Nous pioneered with YaRN around two years ago — the unlock for OpenAI, DeepSeek, Llama and others that brought us from 2–4k token context windows to 128k — but with agents and reasoning, we need to go a lot further.
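For intuition on what context-extension methods in YaRN's family do, here is a simplified sketch of RoPE position interpolation: longer sequences are squeezed into the rotary-frequency range the model saw in pre-training. (Simplified on purpose; YaRN itself adds per-frequency interpolation and an attention-scaling term on top of this idea, and the numbers below are hypothetical.)

```python
# Simplified RoPE position interpolation: dividing positions by `scale`
# keeps long sequences within the angle range seen during pre-training.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    return torch.outer(positions.float() / scale, inv_freq)

# E.g., a model trained at 4k context extended toward 128k => scale = 32.
angles = rope_angles(torch.arange(131072), dim=128, scale=32.0)
```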
Teknium: @Shaughnessy119 I think in the long term, alignment is crucial. We cannot have this mode collapse of humanity and continue on as the same species, and more people look to alternatives to re-open the door to their expression and needs every day.
Integrations are quite easy — open source or not — with so many harnesses to enable that, and potentially Nous’s own soon.
Tommy Shaughnessy: Great questions by the crowd, thank you — keep it going, we have 15 mins more.
Tommy Shaughnessy: Teknium, Meta has been the main open-source champion in the U.S. If Meta drops a clearly stronger Llama tomorrow, how does Nous stay essential?
Pondering Durian: Kind of a high-level question but same direction: are you more bullish on the current U.S. closed-source approach to model development (and the investment capital it is able to command), or the Chinese approach with open-source compounding but with more limited capital investment? Guess it’s a long-winded way of asking about open vs. closed source. Curious for your take, and then a steel-man of how someone at DeepMind would reply?
Christine Yip: Very cool — and yes, DisTrO and YaRN were very promising and exciting. Quick follow-up about the first point re memory: can you tell us more about the beta of the memory system Nous has built and how it helps users?
Nav: Teknium, how do you alignment-audit/safety-test your models? Is it fully manual or have you built any automations?
Teknium: @Shaughnessy119 Nous has been keeping pace with — and often outperforming — Llama instruct models using their own base models since Llama 1, despite their much more massive team. They are also the reason Nous exists, because they provided the first block of clay Nous used to sculpt with.
We have done very well against their own instruct models, and are much more steerable and free to explore anything you want to.
With our latest Hermes 4 model, we outperform Meta’s own instruct variant on MATH by something like 60 points, and MMLU-Pro by 21 points.
I think ultimately, more open-access models are always a good thing, and we build off of each other continuously and push the field forward with each step.
Stepan: Question on tool use and orchestration: for more complex tasks we typically run a planner/orchestrator over multiple LLM instances. Recently I’ve seen a ton of research around Chain-of-Tools/Pre-Act — ways to create plans based on unlimited and unbounded pools of tools or MCP servers. Do you think specialized models for planning and tool-calling are needed? Do you guys work on this?
Tommy Shaughnessy: A lot here — I’ll pause my questions given how many from the crowd we have 🔥
Teknium: @PonderingDurian I think a mix of both, and I think there is room for both. I’m obviously someone who leans toward the fully open path, but there can be value in closed-source models as well, and to some degree the grab bag of options keeps these models all differentiated and unique in their own ways, which is a great thing — the last thing anyone should want is a single model everyone has to use without any choice in the matter.
Teknium: @christinetyip The memory system right now is primarily designed as a personalization unlock — it will remember you and all the things you want the model to be.
In the future, with agentic capabilities, it will be able to permanently (or as long as you want it to) learn new tasks and information as it performs its work for you and you provide it feedback.
Christine Yip: Another follow-up. You mentioned: “I think our biggest draw is that it’s aligned to you right now, in a way no other company currently offers, but it will become much more powerful as we enter into agents and build products directly integrated with the training to unlock a lot of potential value.” Are you actively building agents/products directly integrated with the training today? If so, what can you share about the form factor and who they’re for?
Tommy Shaughnessy: Teknium — how effective has Nous been in attracting talent? You’ve raised capital but it’s obviously well below the billions hyperscalers are spending on talent. Zuck has been firing off $100m offers, but this seems to be backfiring as some people have left (granted it appears less-senior folks). Are you competing for mercenaries, or do AI folks come for the mission/ethos?
Teknium: @sgershuni The work we did on Forge was one of the earlier products that used this sort of capability. It could create tools on the fly and used embeddings to find the right tool(s) for the job as it went about assisting you.
Right now I think there is still a place for this, but even more exciting to me are much more general tools. Right now every task — or at least set of related tasks — needs a specialized harness for performing the duties of that role, but ultimately there are three generalized harnesses for all tasks:
1. A CLI terminal on a Linux box — computer use there unlocks a huge spectrum of tools as a single tool.
2. GUI computer use, which can access the full range of visual-based tools like browsers and interfaces.
3. Long-term, embodied AI inside of robots as the ultimate harness.
@christinetyip — related to your question as well.
Teknium: Currently, Nous has a very good lock on CLI/terminal-use harnessing in our internal agent that will be a big focus of our RL efforts (alongside a suite of other very generalist tools).
Teknium: @Shaughnessy119 The AI industry is definitely more competitive for talent now than it has ever been, but we’ve found some real star players and continue to.
Our researchers — like our Chief Scientist Bowen — have unlocked unimaginable things like YaRN and DisTrO. The post-training team has some superstars like Dakota (our lead on RL infra) who created Generative Reward Models, and overall the team composition has been strong and continues to grow stronger.
I think we attract talent because of our unique philosophy and position in the space, and that this drives us and our prospective new team members more than anything else.
Stepan: Doesn’t the challenge — the most important task — then become planning the right order and parameters to call these tools?
Tommy Shaughnessy: Any final questions for Teknium, please post — I’m sure he has meetings to hop to and models to post-train!
Teknium: @sgershuni In some ways, yes. If there’s only one tool like the CLI tool, it’s more about in what ways to use that tool and in what order you use its specific capabilities.
RL is perfect for this, especially outcome-based-reward RL. From these environments, the model quickly learns the most efficient and effective path to solving the problem at hand: which tools to use, in what order, or whether to use any at all.
The best part about generalized tools like this is that you can drop these agents into any environment; they don’t have to be made specifically for the tools.
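A sketch of what an outcome-based reward for a CLI agent might look like: the trajectory (which commands, in what order) is left unconstrained, and only the end state is scored. (Hypothetical; the pytest check stands in for whatever success criterion an environment defines.)

```python
# Outcome-based reward: score only the final state of the agent's sandbox,
# leaving the tool-use path entirely up to the policy.
import subprocess

def outcome_reward(workdir: str) -> float:
    """Reward 1.0 iff the project's test suite passes after the episode."""
    result = subprocess.run(["pytest", "-q"], cwd=workdir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```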
Tommy Shaughnessy: A final one from my side as we gather others’ final questions: looking ahead in AI, what is your strongest contrarian view? It can be anything related to AI.
Christine Yip: Many people in the space like how Nous releases are uncensored. Out of curiosity — have you faced any concrete hurdles (hosting refusals, takedowns, legal/IP concerns), and if so, does/will that affect Nous’s approach?
Teknium: @Shaughnessy119 I probably have many — and wouldn’t want to give just one haha.
First is the view that we should all focus exclusively on benchmark tasks — I don’t agree, and Nous has made trade-offs to retain base-model qualities for creativity enhancement and exploration that many others haven’t.
Second is the view that evals are useless or dead — they aren’t. Despite the last point I made, I think benchmarks are an essential tool for inspecting a model. The real problem is that people focus on too few — e.g., LMSYS — as the sole arbiter of model performance, when in reality every model should be judged on your downstream tasks exclusively, or on a much wider array of evaluations than most people use. We need way more evals, not fewer.
And the final one I’d say is I disagree that synthetic data is bad. There are lazy ways to do it that are undeniably bad, but we have found so many ways to make synthetic data useful to our models that 90%+ of our data is synthetic.
Pondering Durian: What company/network do you think will be the most valuable in the world in 2030?
Teknium: @christinetyip Our RefusalBench results have two forms. The first tests the model on a wide array of commonly refused categories, but you set which categories you want refusals for and which you don’t.
So there is an overall score but also a score for how much you refused categories you chose to disallow — for us that’s CSAM and suicide. Most everything else is in the realm of free speech and free use, and we attempt to stand by that as best we can.
On takedowns, etc. — since most of our post-training data is synthetic, it generally has much less risk of matching copyrighted data as well.
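To make the two-score idea concrete, here is a sketch of category-conditioned refusal scoring (field names are hypothetical, not RefusalBench's actual schema):

```python
# Category-conditioned refusal scoring: you want a LOW refusal rate on
# categories you allow and a HIGH refusal rate on ones you disallow.
def refusal_scores(results: list[dict], disallowed: set[str]) -> dict:
    """Each result row: {"category": str, "refused": bool}."""
    def rate(rows: list[dict]) -> float:
        return sum(r["refused"] for r in rows) / max(len(rows), 1)
    allowed = [r for r in results if r["category"] not in disallowed]
    blocked = [r for r in results if r["category"] in disallowed]
    return {"non_refusal_on_allowed": 1 - rate(allowed),
            "refusal_on_disallowed": rate(blocked)}
```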
Teknium: @PonderingDurian That one is a hard one and I am not sure! Sorry :) Nvidia might be near the top of the list though, haha.
Tommy Shaughnessy: I’ll answer this one — Nous, among other great projects here. 😂
Tommy Shaughnessy: Teknium, one last one since this is so good. Looking back over the last year, what is your biggest mistake/learning — a constructive growth item you’d share? It could be something you got wrong or were very surprised by.
Pondering Durian: Just have my mouse hovering over 50× leverage :)
Teknium: @Shaughnessy119 Hmm — probably underestimating the work that it takes beyond the model itself. Writing papers, documentation, and blogs has always been harder for me than creating state-of-the-art models, haha. Luckily we’ve added many team members who are extremely good at this in the last few months, and we will be unifying, expanding, and cleaning up a bunch of our documentation from across all our projects in a much cleaner and central location soon!
Tommy Shaughnessy: Teknium, any closing thoughts based on this convo and the questions you’ve gotten?
Stepan: Also: what question should we have asked but didn’t? xD
Teknium: Yes — my closing thought would be to stay tuned to Psyche’s progress, as we have a lot coming out soon on that. The whole team is now focused on the final infrastructure push there, plus new models, architecture experiments, and more!
Teknium: Thanks for having me, everyone! I have to run momentarily for another meeting, but feel free to ping me any time in here.
Tommy Shaughnessy: Teknium, you knocked it out of the park — thanks for being our first AMA guest in Ex-Machina.
⸻
Disclosure: Delphi is an investor in Nous. This AMA is shared for informational purposes only, not investment advice. The views expressed are those of the speaker and do not necessarily reflect the views of Delphi Ventures.

Thanks to @sgershuni, @moon_yose, @christinetyip, @anildelphi and many others for great questions, and in particular to @Teknium1 for being an awesome first guest. Please stay tuned to @delphi_intel for more AMA sessions with leading thinkers in the space.