Prasenjit Sarkar @stretchcloud

Tweets

9K
Followers

2K
Following

951
Likes

3K

Prasenjit Sarkar @stretchcloud

5 minutes ago

The AI cost problem at scale is not what people think it is. Coinbase cut AI spend by nearly 50 percent while token usage grew exponentially. 91 percent of employees saw no change in access. The gains came entirely from infrastructure changes, not caps or restrictions. Three levers: better model defaults, smarter routing, and aggressive caching. Defaults switched to GLM 5.2 and Kimi 2.7 via an internal LLM gateway. Cache hit rate on one internal tool jumped from 5 to 60 percent. Engineers still choose any model they want. The default just changed. The bigger prediction: 80 percent of AI workloads will migrate to models 99 percent cheaper than today's frontier within 12 to 18 months. Only 20 percent stays on top-tier models for research-grade tasks. The pattern I keep seeing across engineering teams building for scale: model routing is becoming its own discipline. OpenRouter data shows the same two-lane market forming. Commodity open-weight inference at high volume and low cost. Premium frontier models for low-volume precision tasks. Teams routing correctly already have a 3 to 5x cost advantage over those that don't. The infrastructure layer that makes this work is a model gateway with task classification and prompt caching. What Coinbase built internally is what OpenRouter, Portkey, and Braintrust are selling as managed products. The market for that layer is real and growing. My read: enterprise AI cost optimization is a routing and caching problem, not a model problem. The organization that treats LLM spend like a CDN, with caching strategies, routing policies, and tiered defaults, will run circles around the one still paying frontier prices for every query. x.com/brian_armstron…
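
The three levers above (tiered defaults, routing, caching) can be sketched as a toy gateway. This is a minimal illustration under stated assumptions, not Coinbase's implementation: the model names, prices, and keyword classifier are all hypothetical, and the cache here is a simple exact-match response cache rather than the prefix-level prompt caching that providers offer.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical model names and USD prices per 1M tokens, for illustration only.
PRICE_PER_M_TOKENS = {"cheap-open-weight": 0.50, "frontier": 15.00}


@dataclass
class Gateway:
    default_model: str = "cheap-open-weight"
    cache: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0
    spend: float = 0.0

    def classify(self, prompt: str) -> str:
        # Toy task classifier (assumption): a real gateway would use rules or a model.
        return "frontier" if "research" in prompt.lower() else self.default_model

    def complete(self, prompt: str, model: Optional[str] = None, tokens: int = 1000) -> str:
        # An explicit choice always wins; the default only applies when none is given.
        chosen = model or self.classify(prompt)
        key = hashlib.sha256(f"{chosen}\n{prompt}".encode()).hexdigest()
        if key in self.cache:
            self.hits += 1  # cache hit: no model call, no spend
            return self.cache[key]
        self.misses += 1
        self.spend += tokens / 1_000_000 * PRICE_PER_M_TOKENS[chosen]
        answer = f"[{chosen}] answer to: {prompt}"  # stand-in for the real model call
        self.cache[key] = answer
        return answer
```

Nothing in the sketch caps usage: callers can still pass model="frontier" on any request, which is the "defaults, not restrictions" point.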

Brian Armstrong @brian_armstrong

23 hours ago

How to keep AI spend flat while token usage grows exponentially: Not with friction and spend alerts. With better defaults, routing, and caching. Better Defaults (not Usage Caps) – Engineers can choose any model they want, but defaults matter. We’re experimenting with defaulting

Prasenjit Sarkar @stretchcloud

25 minutes ago

The CUDA bottleneck was never really about chips. Qualcomm just paid $3.9 billion for Modular, a 150-person company. The assets: Mojo, a programming language that targets Nvidia, AMD, Intel, Qualcomm, and Apple Silicon from a single kernel codebase, and MAX, a graph compiler that builds inference engines from it. Write once, run optimized everywhere. No Nvidia vendor libraries anywhere in the stack. The reason this matters is what CUDA actually is. Not processing power. A 20-year software moat. 4 million developers. Toolchains, libraries, courses, job postings. Prior challengers like AMD ROCm (eight years in, still below 5 percent of AI training workloads) and Intel oneAPI couldn't crack it because the unlock isn't better hardware, it's making the software porting cost disappear. Modular is structurally different. MAX benchmarks show 10 to 50 percent faster inference than vLLM on identical hardware, built entirely without CUDA. Chris Lattner did this same thing before, at LLVM, replacing a generation of proprietary compiler toolchains, and with Swift, which Apple adopted across its entire platform. Qualcomm is spending more than $14 billion here. Meta has placed CPU orders. Microsoft is in. The pattern I keep seeing across agentic infrastructure: the compute routing question is becoming real. Running a million parallel agent tasks on a locked Nvidia fleet doubles cost vs. a mixed deployment. The teams building at scale care about what Qualcomm ships from this. My read: this is a bet that the software moat matters more than the chip itself. If MAX becomes the standard inference compiler, Nvidia's real defensibility narrows to training. That's a meaningfully different company. x.com/PeterDiamandis…

Peter H. Diamandis, MD @PeterDiamandis

5 hours ago

Qualcomm just paid $3.9 billion for a 150-person company. The prize: a programming language that lets AI run on ANY chip without NVIDIA's CUDA. Meta and Microsoft are already placing orders. The NVIDIA software monopoly just got its first real challenger. $3.9B says this is

Prasenjit Sarkar @stretchcloud

2 hours ago

My read: the local vs. API split in agentic workloads will follow the pattern we saw with edge inference in 2023-2024. Curiosity first. Then cost pressure. Then a production category. The teams that build the local agent stack now will have a meaningful cost and compliance advantage when the rest of the market catches up.

Prasenjit Sarkar @stretchcloud

3 hours ago

The transition in Karpathy's workflow wasn't dramatic. There was no announcement. One day he was writing functions. The next he was managing autonomous systems that wrote them for him. That's the detail that stays with me. The ratio flipped from 80% writing to 80% delegating, and he says it keeps shifting further. This is the person who built Tesla's self-driving system and taught a generation of engineers neural networks at Stanford. He describes his current state as "perpetual AI psychosis": 16 hours a day not typing code, but expressing intent to agents. AutoResearch, the tool he built to show the principle, ran 700 experiments in two days. The agent edited code, tried ideas, learned from failures, and dropped the "Time to GPT-2" benchmark from 2.02 hours to 1.80 hours. No human at the keyboard. The pattern I keep seeing: the bottleneck in software development has shifted. Cognition built Devin as the first fully autonomous software engineer. GitHub Copilot Workspace handles multi-step, multi-file coding tasks from a single spec. SWE-bench scores for frontier models crossed 60% on real GitHub issues. Cursor crossed $500M ARR in 2025, mostly from engineers who were already professional coders and still chose to pay. At Sequoia Ascent 2026, Karpathy called this Software 3.0: programs built through prompts, context, agents, tools, and verification rather than typed instructions. What I keep coming back to: the skills that persist aren't the writing ones. They're spec design, diff review, eval construction, and security oversight. Judgment-intensive work. Not keystroke-intensive work. The identity question for a generation of engineers isn't whether AI can write code. It's what "software engineering" means when writing code stops being the job. x.com/heyshrutimishr…

Shruti @heyshrutimishra

9 hours ago

Andrej Karpathy hasn't typed a line of code since December. Not because he retired. Not because he switched careers. Because his AI agents do it all now. The former head of Tesla Autopilot, the person who literally wrote the textbook on deep learning, says his workflow flipped

Prasenjit Sarkar @stretchcloud

4 hours ago

The Vercel team just migrated 7 million lines of code to TypeScript 7 in a single Claude Code session. 16 PRs, roughly 2 days, $1146 in tokens. The observation that hits: pre-AI, this would have sat at the very bottom of the Platform engineering roadmap. That's the shift I keep noticing. It's not that AI writes faster. It's that AI changes which work gets done at all. Large-scale dependency migrations, version upgrades, and cross-cutting refactors have always been backlogged not because they weren't valuable, but because the cost-to-benefit ratio didn't clear. Three to four weeks of engineer time for TypeScript version parity isn't a hard no, but it loses to product work every time. $1146 doesn't lose to product work. The same pattern is showing up elsewhere. Mehul Kar's migration at Vercel is TypeScript 7. Thomson Reuters migrated their entire CoCounsel codebase to Vercel's AI SDK, deprecating thousands of lines across 10 providers, with 3 developers in 2 months. Teams running Mastra moved the same agent from LangGraph in 18 hours. AI SDK 7 ships with a codemod (npx @ai-sdk/codemod v7) that automates the upgrade path. The connection to an older pattern: when cloud computing made it economical to run redundant services, teams suddenly ran the monitoring and logging they'd always known they needed but had never prioritized. The cost floor dropped, and previously deferred work got done. What's clearing now is the infrastructure debt backlog. Version upgrades, type strictness migrations, library consolidations. Work that's been on the list for years. The token cost is clearing the build vs. defer threshold on a whole category of tasks that were never truly optional, just perpetually postponed. x.com/mehulkar/statu…

Mehul Kar @mehulkar

a day ago

I just migrated a 7 million line codebase at @vercel to typescript@7 in a single Claude code session. It took 16 PRs, ~2 days, and ~$1146 in tokens. Incredible, because pre-AI, it would have been at the very bottom of a Platform engineering team's roadmap.

Prasenjit Sarkar @stretchcloud

5 hours ago

The way most teams run agent evals has a structural flaw. The model evaluates itself, and the model is biased toward approving its own work. What I keep seeing in the research: self-evaluation doesn't scale. It feels like rigor. It isn't. The gemchanger team ran 80 agents on a single task and found that averaging them barely moved the error. They all came off the same base model, so they all missed in the same direction. What actually cut the error by 86%, down to 0.135, was a grounded verify gate: a small set of questions with known answers, used to fire bad agents before their outputs propagated. Ash Prabaker and Andrew Wilson at Anthropic built the same insight into their long-running agent harness. One agent does the work. A separate adversarial evaluator grades it against a rubric. A gate blocks shipping until criteria are met. The doer never grades itself. This is the maker/checker rule at population scale. It also shows up in code review, in peer review, in every quality system that actually works. The entity producing the output cannot be the entity approving it. The interesting failure mode they found: when agents vote each other out, the swarm keeps firing until bad agents hit 48%, then the error jumps from 0.64 to 2.12. The swarm is calling it consensus. It's actually the majority eliminating the competent minority. Existing eval platforms (Braintrust, LangSmith, Promptfoo) give you rubric infrastructure. The architecture shift here is separating the evaluator role entirely: not just a grading function, but a structurally adversarial agent with a different objective than the generator. The bottleneck in agent reliability isn't compute or context. It's this: the doer and the checker cannot be the same entity, and peer voting at scale selects for the wrong thing. Building the separation in at the harness level is the only thing that seems to hold. x.com/VoltexGar/stat…
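
A grounded verify gate of the kind described above can be sketched in a few lines. This is a toy under stated assumptions, not the gemchanger or Anthropic harness: agents are modeled as plain functions returning a number, and the probe set, error threshold, and bias values are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

Agent = Callable[[str], float]


def verify_gate(agents: List[Agent], probes: Dict[str, float], max_error: float) -> List[Agent]:
    """Keep only agents whose mean absolute error on known-answer probes is within max_error."""
    kept = []
    for agent in agents:
        err = mean(abs(agent(question) - truth) for question, truth in probes.items())
        if err <= max_error:
            kept.append(agent)
    return kept


def pooled_answer(agents: List[Agent], question: str) -> float:
    """Average the agents' answers; this only helps if their errors are not shared."""
    return mean(agent(question) for agent in agents)


def make_agent(bias: float) -> Agent:
    # Toy agent (assumption): the "true" answer is the question length, plus a fixed bias.
    return lambda question: len(question) + bias


# Six agents share the same bias (same base model, same miss); two are accurate.
agents = [make_agent(2.0)] * 6 + [make_agent(0.0)] * 2
probes = {"aa": 2.0, "bbbb": 4.0}  # questions with externally known answers
survivors = verify_gate(agents, probes, max_error=0.5)
```

Averaging all eight agents on "hello" stays pulled toward the shared bias, while averaging only the survivors recovers the true value. The gate is grounded in external answers, so the population never grades itself.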

Voltex @VoltexGar

13 hours ago

Ash Prabaker & Andrew Wilson, Anthropic: "self-evaluation is a trap, and adversarial evaluator agents work better." gemchanger ran 80 agents on one task and found that averaging them barely moved the error, because they all came off the same model and miss the same way. what

Prasenjit Sarkar @stretchcloud

5 hours ago

The US export ban on Anthropic's Mythos and Fable 5 is showing me something I didn't expect. When access to a frontier model disappears overnight, the market doesn't wait. It builds its own. Sakana launched Fugu on June 22, ten days after Trump's order cut Anthropic's international access. The pitch was explicit: "frontier capability without the risk of export controls." Fugu is architecturally interesting. It's not a bigger model. It's an orchestration layer that routes tasks across a swappable pool of frontier LLMs, including instances of itself. One endpoint, many models, no single-vendor dependency. Sakana raised $135M at a $2.65B valuation in late 2025. The research grounding it, TRINITY and Conductor, was peer-reviewed and presented at ICLR. At the same time, China's 360 shipped Tulongfeng and Yitianzhen, two cybersecurity AI tools positioned directly against Mythos. MiniMax launched M3 last week, performance on par with GPT-5.5, and opened its weights. Zhipu AI stock went up 33% on the Fable 5 ban alone. The pattern connects to something from cloud infrastructure a decade ago. When US firms couldn't meet European data residency requirements, it forced investment in local cloud capacity that wouldn't otherwise have been funded. The same dynamic is happening here, faster. The difference this time: Sakana's framing isn't "we're an alternative." It's "we're the hedge." Collective intelligence over single-provider dependency. That framing now ships as a feature in a product. Anthropic's run-rate crossed $47B in May 2026. What share was Asian enterprise is not public. What is clear: the vacuum opened by the ban is being filled by products that explicitly market the absence of US restrictions. The restriction becomes the competitor's product differentiation. x.com/TechCrunch/sta…

TechCrunch @TechCrunch

11 hours ago

Asian AI startups launch Mythos-like models as Anthropic’s export ban drags on techcrunch.com/2026/06/27/asi…

Prasenjit Sarkar @stretchcloud

5 hours ago

The pattern I keep returning to: 'which model is best' is the wrong question for agents. The right question is which model produces the best outcome per dollar spent across actual work. Arena.ai published its first Agent Arena leaderboard this week. The method isn't pairwise voting. It's causal tracing: they run real user tasks, randomize model assignments across sessions, and measure causal treatment effects. What they find surfaces something pairwise evals miss. The chart on performance vs output tokens shows Claude Fable 5 (High) at the top. What makes this interesting is the axes. The x-axis is median output tokens. The y-axis is net improvement over baseline. The models actually worth running in production are the ones that sit high on the y-axis without drifting far out along the x-axis: better output, not more tokens. This matters because agent loops are expensive in ways list-price comparisons miss. Arena found that some models are more expensive in practice than their published price suggests because they take more steps per turn or induce more turns before users reach satisfaction. The realized cost diverges from the sticker cost. Some concrete numbers from a recent 7-day window: 160,480 agent tasks. 2 million tool calls. 40.3 million lines of code written. 32% of sessions ended with at least 128k tokens in the final turn. 8% exceeded 1M tokens. The companies building serious agent products already know this. Cloudflare, Vercel, and others have published findings about token spend diverging from expected costs in production. The benchmark that lives closest to that reality is the one that will drive model selection. My read: the leaderboard that matters is the one built from your actual workload, not your evaluation suite. Arena is the first credible attempt to do this at scale. x.com/arena/status/2…
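
The gap between sticker cost and realized cost comes down to simple arithmetic. The sketch below assumes a simplified cost model (output tokens only, fixed tokens per step), and every number in it is hypothetical rather than taken from Arena's data.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    price_per_m_output: float     # list price, USD per 1M output tokens
    tokens_per_step: int          # average output tokens per agent step
    steps_per_turn: float         # tool-call steps the model takes per user turn
    turns_to_satisfaction: float  # user turns before the task is accepted

    def realized_cost_per_task(self) -> float:
        # Realized cost scales with how much work the model needs, not just its price.
        tokens = self.tokens_per_step * self.steps_per_turn * self.turns_to_satisfaction
        return tokens / 1_000_000 * self.price_per_m_output


# Hypothetical profiles: the "budget" model is 5x cheaper on the price sheet.
budget = ModelProfile("budget", 6.00, 2000, steps_per_turn=8, turns_to_satisfaction=5)
premium = ModelProfile("premium", 30.00, 2000, steps_per_turn=3, turns_to_satisfaction=2)
```

Under these made-up numbers the budget model burns 80,000 output tokens per task against the premium model's 12,000, so it ends up costing more per finished task despite the lower list price.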

Arena.ai @arena

a day ago

[Token efficiency in Agent Arena] Agent Arena measures agent performance across a range of real-world tasks from our global community. Models get search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide deck, researching the web, building

Prasenjit Sarkar @stretchcloud

6 hours ago

The pattern I keep seeing: labs don't compete on the leaderboard. They compete on the environment. OnlyLabs.fyi tracked 128 open eval-relevant roles at the major frontier labs this month. The single largest category is RL and post-training: 46 roles. Safeguards and safety is second at 22. Alignment and model behavior third at 20. Evals and benchmarks is fourth at 17. That mix tells you something. The bench isn't the bottleneck. Building the reward environment is. Anthropic has 43 of the 128 roles, OpenAI has 31, Cohere has 16, xAI and Mistral each have around 6. Anthropic has previously discussed spending over $1 billion on RL environments in the near term. OpenAI's R&D compute budget for 2026 is around $19 billion, roughly double 2025. OpenAI literally titles one of its open roles 'Research Engineer, Frontier Evals and Environments.' Wing VC laid out why this matters: between now and 2030, the RL environment market narrows to three to five leaders, with one or two pulling meaningfully ahead. Early advantage goes to teams that go deep in a small number of complex, high-signal domains, especially coding and computer use. The companies trying to sell RL environments to labs include hud.ai, whose environments powered Autonomy-10, used to evaluate OpenAI Operator at launch. For background, SemiAnalysis has a full piece on RL environments as data foundries and multi-agent architectures, and Epoch AI published an FAQ on RL environments. What I keep noticing: the best model is increasingly the one trained on the best environment. The environment is the moat, not the architecture. You can fine-tune on architecture in months. Environments take years to build. The labs know this. The hiring data confirms it. x.com/xdotli/status/…

Xiangyi Li @xdotli

15 hours ago

what frontier labs hiring signals for RL environment companies at OnlyLabs.fyi

Prasenjit Sarkar @stretchcloud

6 hours ago

The model market just made its tiering structure explicit. OpenAI shipped GPT-5.6 as three named capability tiers today: Sol at the frontier, Terra in the middle, Luna for high-volume work. Sol costs $5 input / $30 output per 1M tokens. Terra is $2.50 / $15. Luna is $1 / $6. Terra matches GPT-5.5 at half the price. That pricing ladder mirrors what cloud compute looked like when AWS started naming instance families. Once the tiers have names, they compete on their own cadence. Sol introduces two new operating modes. Max gives the model more reasoning time. Ultra goes further by spawning subagents to parallelize complex work. Both signal that the architecture for frontier tasks is increasingly multi-agent, not just a bigger single-model call. On Terminal-Bench 2.1, Sol scores 88.8% versus Claude Mythos 5 at 88%. Close enough to call a tie on raw score. But Sol does it at roughly 1/3 of Mythos Preview's output tokens on ExploitBench. Token efficiency at the frontier is now a tracked metric, not an afterthought. The launch follows Trump's June 2 executive order on AI model oversight. OpenAI shared plans with the US government before launch and is starting with about 20 trusted partner organizations. They are pushing for this to not become a standing norm. The tension between government review and broad access is the friction that will shape every major release from here. Sol is also coming to Cerebras in July at up to 750 tokens per second. Frontier intelligence at inference speed has been the missing piece for real-time agent loops. My read: the shift from 'model' to 'model family with named tiers' is the same move AWS made with EC2, Azure made with VMs, and Google made with Compute Engine. Once you ship Sol, Terra, and Luna, you have a platform. The question is who builds the next layer on top. x.com/OpenAI/status/…
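
To make the pricing ladder concrete, here is the arithmetic for one workload across the three tiers. The per-token prices are the ones quoted in the post; the workload size (1M input tokens, 200K output tokens) and the helper name are assumptions for illustration.

```python
# USD per 1M tokens as quoted above: (input, output).
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def workload_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload on the given tier."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 1M input tokens, 200K output tokens.
costs = {tier: workload_cost(tier, 1_000_000, 200_000) for tier in PRICES}
```

For that workload the ladder works out to $11.00 on Sol, $5.50 on Terra, and $2.20 on Luna, a 5x spread between top and bottom at every input/output mix because the ratios are constant across tiers.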

OpenAI @OpenAI

a day ago

Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work. openai.com/index/previewi…

Prasenjit Sarkar @stretchcloud

6 hours ago

The bottleneck in AI-built software moved, and it moved fast. By late 2025, frontier models were good enough to one-shot working internal apps. Engineers and non-engineers at Block were building real tools in an afternoon. Sales reps, analysts, support agents. Then most of those apps sat on someone's laptop with nowhere safe to go. Block App Kit is the platform Block built to solve the second problem: getting AI-built apps into the right hands without creating a security or compliance hole. The core split: the agent generates the app, the platform owns everything that makes it safe. Identity, authorization, secret management, data connections, deployment path. The blog post has a detail I found clarifying: they started with an MCP server that an agent would use to scaffold, build, and deploy an app. That worked but required manual setup per person. They repackaged it as a skill on Block's internal agent tooling platform and distributed it that way. That's a meaningful architectural choice. Skills that self-distribute to agents are the composable layer that makes the whole thing scale. Block App Kit launched mid-March 2026. In the quarter since: weekly app views grew more than 10x, weekly active users climbed from hundreds into thousands, catalog now spans over a thousand apps with hundreds more launching every week. Roughly four in five users sit outside of engineering: Sales, Support, Legal, Finance, Marketing, across 50 orgs. The strongest signal: Block's security org designated Block App Kit the sanctioned path for building and deploying internal tools. When the team responsible for preventing data exfiltration decides your platform is the preferred route, the safety-by-design bet has worked. My read: the gap in AI-built software isn't model capability. The gap is platform infrastructure: identity, access control, data connections, and a deployment path that doesn't require an unsafe choice anywhere in the process. Block just published a detailed blueprint for closing it. x.com/jack/status/20…

jack @jack

14 hours ago

block app kit. fastest adoption of any tool by our company.

Prasenjit Sarkar @stretchcloud

7 hours ago

Speculative decoding just got a significant open-source infrastructure upgrade from DeepSeek. DSpark is a new draft model for DeepSeek V4 checkpoints. The stated improvement: 51% to 400% throughput gain over baseline, depending on hardware and model combination. It improves on the prior generation of approaches in this space: MTP-1, Eagle-3, and DFlash. The more interesting release is DeepSpec, published to GitHub today. It's a full-stack codebase for training and evaluating speculative decoding algorithms. Not just the model, the training pipeline. The eval harness runs against gsm8k, math500, AIME25, humaneval, mbpp, livecodebench, MT-Bench, AlpacaEval, and Arena-Hard. That's a serious test surface. Speculative decoding works by running a small fast draft model that proposes token sequences, which a large target model then verifies in parallel. When the draft guesses correctly, you get multiple tokens for the cost of one verification pass. Throughput goes up with no change to output quality. What I find interesting about DeepSpec is the scope: it covers DSpark, DFlash, and Eagle-3 as reference implementations. Early reports suggest it also transfers to Gemma and Qwen, not just V4 models. Teams running non-DeepSeek models can adapt the approach. The pattern I keep seeing: DeepSeek releases the model, then releases the training infrastructure. V3 weights came first. Flash Attention optimizations followed. DSpark now. DeepSpec completes that stack. Same playbook that made vLLM's PagedAttention stick: publish the technique, then publish the tooling that lets others reproduce and adapt it. My read: speculative decoding is moving from a research technique to a standard inference engineering practice. DeepSpec is an attempt to industrialize that transition. x.com/teortaxesTex/s…
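
The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy variant under stated assumptions, not DSpark or the DeepSpec code: the "models" are deterministic functions from a token context to the next token, and the k+1 target calls per round stand in for what real hardware does in one batched forward pass. Sampling-based variants use a rejection-sampling acceptance rule instead of exact match.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (one batched pass in practice).
        rounds += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # first miss: keep the target's token, stop
                break
        else:
            accepted.append(target(out + proposal))  # all k accepted: bonus token
        out.extend(accepted)
    return out[: len(prompt) + n_new], rounds


def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: Sequence[int]) -> int:
    # Agrees with the target except at every fifth context length (assumption).
    return toy_target(ctx) if len(ctx) % 5 else 0
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding from the target exactly; only the number of target rounds changes, which is where the throughput gain comes from.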

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

18 hours ago

DeepSeek releases their decoding module DSpark for V4 checkpoints, which improves a lot upon MTP-1, Eagle-3 and DFlash. Out of their vast goodwill, they also open source DeepSpec: "a codebase for training and evaluating draft models for speculative decoding".

Prasenjit Sarkar @stretchcloud

7 hours ago

The interface problem for browser agents just got a proposed fix built into the browser itself. Every agent framework targeting websites today makes the same bet: parse the DOM, read the accessibility tree, take screenshots, and hope the page structure holds. It mostly works. It's also brittle. A redesign breaks the agent. A shadow DOM breaks the agent. A canvas-rendered UI breaks the agent. WebMCP is a proposed W3C standard, co-authored by Google and Microsoft, that inverts this. Instead of agents reverse-engineering what a page can do, websites declare their capabilities as callable tools: JavaScript functions, HTML forms, structured metadata. The agent calls the tool and the site handles execution. Chrome 149 opened a public origin trial in May 2026. Three independent proposals converged into the spec: Microsoft's "Web Model Context" explainer, Google's "Script Tools" proposal, and MCP-B, a Chrome extension built at Amazon. All three ended up in the same W3C working group. The comparison I keep reaching for is REST APIs. Before REST, you reverse-engineered every site's URL structure and form logic. After REST, services published what they could do. WebMCP is the same transition for agent-website interaction. The deployment story is still early. Origin trial means signing up for access, shipping a trial token, testing. One competing proposal, Web Agent Bridge, anchors capabilities in DNS records rather than page-level JavaScript, which models a different trust relationship. My read: scraping-based agent web access doesn't scale to production. Sites that want reliable AI integrations need a first-class way to declare their surface area to agents. That's what WebMCP is trying to give them. x.com/ChromiumDev/st…

Chrome for Developers @ChromiumDev

a day ago

It can be challenging for AI agents to solve complex user intents by synthesizing signals like screenshots, the DOM, and the Accessibility Tree. Enter WebMCP, a proposed web standard that aims to expose structured tools for AI agents directly on existing websites, now in origin

Prasenjit Sarkar @stretchcloud

7 hours ago

The question I keep coming back to: when does an agent stop being a feature and start being infrastructure? Railway just wrapped Agents Week with the cursor.ai agent included by default in Railway sandboxes. No setup, no configuration. You get a real execution environment with a filesystem and shell. The agent clones code, runs commands, makes changes, hands back something reviewable. What I keep noticing across the category: Cloudflare ran their own Agents Week in parallel, shipping Flue and Dynamic Workflows. Vercel has background functions. Render, Fly.io, and Northflank are all building execution-layer primitives for agents. The platform race is about being the default runtime, not just the deployment target. The Cursor integration is one of four: Railway also supports Claude Code, OpenCode, and Codex natively in the same sandbox environment. And Railway Skills extend each agent with Railway-specific commands for deploying, monitoring, and managing services. The agent knows your infrastructure. The historical parallel is clear. In 2012, GitHub Actions did not exist. CI/CD was a configuration problem each team solved separately. By 2018, it was invisible infrastructure. The same thing is happening to agent execution environments now. The bottleneck is not model capability. It is how reliably the agent can act on real infrastructure without manual scaffolding. Railway just moved that scaffold into the box. My read: the cloud platforms that win the agent era will be the ones that turned agent execution into a first-class primitive before everyone else did. x.com/Railway/status…

Railway @Railway

a day ago

x.com/i/article/2070…

Prasenjit Sarkar @stretchcloud

8 hours ago

The pattern I keep seeing across enterprise AI engineering: the first instinct is to measure usage. Then you realize you measured the wrong thing. Shopify killed their token leaderboard. People competed to be on it. Wrong incentive. They renamed it a usage dashboard to focus on utility, not volume. But the more interesting part is what they built on the other side of that realization. Shopify now runs a Universal Distillation Platform. Any team can take a frontier model, Opus 4 or GPT-5 class, and distill it into a fine-tuned Qwen or other open-source model for a specific subtask. The full cycle takes about a day, with evals baked in and a weekly retraining flywheel built on real merchant data. Numbers: 2x to 30x cheaper. 2.2x faster on the specific task. The fine-tuned model outperforms the frontier it replaced on that narrow task. They currently run roughly half a dozen of these distilled models in production. Companies doing versions of this: Stripe, Klarna, Duolingo. The pattern is: use the frontier model to generate training data and evaluate outputs, then distill the task-specific behavior into a model you can run cheaply at scale. Tangle, Shopify's open-source ML experimentation platform, adds experiment reproducibility and intelligent caching to the pipeline. The real bottleneck shifted. At 3,000 engineers with 100% AI adoption, the constraint they identify is PR review and CI/CD. Generation is fast. Integration is not. My read: this is the enterprise AI trajectory. High initial frontier usage to learn which tasks are worth training for. Then distill and own the model. x.com/AnatoliKopadze…

Anatoli Kopadze @AnatoliKopadze

2 days ago

Head of Engineering Shopify: "AI writes the code, AI reviews the code. Your job is just to write the loops around it." 26 minutes on how AI changed the way 3,000 engineers work inside a single company. Ignoring it while everyone else uses AI to do more is the fastest way to

Prasenjit Sarkar @stretchcloud

8 hours ago

The bottleneck for agents in real-time conversation has always been latency. Not intelligence. Alibaba's Wan team just published Wan-Streamer v0.1. Model-side response latency: 200ms. Total end-to-end: 550ms. The agent sees you, hears you, and responds on video. All at once. Full duplex. What makes this architecturally significant: there is no VAD module, no ASR pipeline, no separate TTS layer, no animation engine. Perception, reasoning, generation, and turn management are learned jointly inside a single transformer, using block-causal attention for incremental streaming. Every cascaded system accumulates error and latency at each handoff. This eliminates the handoffs. The competitive context: GPT-4o Advanced Voice is audio-only. Gemini Live supports video input but uses a different architecture and is not end-to-end multimodal output. HeyGen's streaming avatars rely on an external LLM sending audio to a separate rendering layer. ElevenLabs, Tavus, Synthesia all operate as separate layers on top of foundation models. Wan-Streamer is the first published proof that a single model can handle language, audio, and video as both input and output in a single pass, in real time. Current resolution is 192p. That is a proof of concept constraint, not an architectural limit. The application surface: digital humans, customer support agents, embodied AI interfaces, real-time tutoring systems. The latency numbers already cross the threshold where conversations feel natural. My read: the modality race is shifting from which model thinks best to which model can sustain a present, responsive state in real time. Wan-Streamer moves that frontier. x.com/minchoi/status…

Min Choi @minchoi

2 days ago

We are cooked. China's Alibaba just revealed Wan Streamer. AI agents can now see you, hear you, and talk back on video in real time. This is not voice mode anymore 🤯

Prasenjit Sarkar @stretchcloud

8 hours ago

Everyone read "open weights" as a gift to the community. It launched API-first. Weights came ~10 days later. 428B params most teams can't self-host. "Open" was the trust badge. The hosted endpoint is the meter. Open weights stopped being charity. They became a go-to-market.

Prasenjit Sarkar @stretchcloud

9 hours ago

x.com/i/article/2070…

Prasenjit Sarkar @stretchcloud

12 hours ago

The lesson here is more general than AWS. Most agent security thinking focuses on what the agent is allowed to do. Least privilege, scoped permissions, audit logs. All of that is useful but it starts from the wrong premise: that you are tuning access for an entity that makes human-like mistakes. Agents make agent-like mistakes. They retry things confidently. They complete tasks that should have stopped two steps ago. They misread a response and apply the destructive action again. The blast radius of that kind of mistake through a path that reaches production is not recoverable in the same way human errors usually are. The correct frame is isolation, not permission. The boundary is architectural: there is no path to production. Not a restricted path. No path. The four layers in this setup encode that correctly. A disposable sandbox server absorbs environmental damage. Recreate it and carry on. CI/CD handles real deployments, but the agent only gets as far as GitHub; a human applies the infrastructure change. AWS experimentation happens in a separate account with temporary, scoped, revocable credentials, fully airgapped from production. The diagram makes the enforcement logic explicit: if the agent can reach production, you stop and fix the boundary. You do not tune permissions. The CI/CD gate is the most important piece. Agents propose. Humans apply. That one constraint keeps the feedback loop open regardless of how confident the agent is in its output. Confidence and correctness are not correlated the way we would want them to be. x.com/Al_Grigor/stat…
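
The "no path, not a restricted path" rule can be expressed as a reachability check rather than a permission check. The sketch below is an illustration under stated assumptions, not the setup from the quoted thread: the node names and the edge map are hypothetical, and a real check would be derived from actual network and credential configuration.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical access graph: an edge means "can act on" without a human in between.
EDGES: Dict[str, List[str]] = {
    "agent": ["sandbox-server", "github", "aws-experiment-account"],
    "github": ["ci-cd"],
    # ci-cd has no outgoing edge to production here: a human applies the change,
    # so that hop is deliberately not something the agent can traverse.
}


def reachable(edges: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search over everything the start node can reach."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def assert_no_path_to_production(edges: Dict[str, List[str]],
                                 agent: str = "agent",
                                 production: str = "production") -> None:
    # Any path at all is a boundary failure; there is nothing to tune.
    if production in reachable(edges, agent):
        raise RuntimeError("agent can reach production: fix the boundary, not the permissions")
```

The check fails on any route, including indirect ones such as an auto-applying CI/CD job, which is why it asks about paths instead of scoring how restricted each permission is.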

Alexey Grigorev @Al_Grigor

16 hours ago

A coding agent should never have a path to production. I learned this the expensive way after one of my agents dropped a production database. That incident changed how I think about cloud access for agents. Now my setup is different. Agents run on a remote sandbox server. The