Pulpie is open source and available on Hugging Face today. See our article to get started and for more details on how we built Pulpie.
x.com/FeynAI/status/…
The gains are architectural.
Today's best extractors are decoders limited by memory bandwidth. Pulpie is an encoder bound by compute.
GPUs are starved for bandwidth, not compute. As a result, Pulpie is performant everywhere. Our testing shows Pulpie to be 7x faster on A100, and 20x faster on L4 GPUs.
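A rough roofline-style sketch of the bandwidth-versus-compute argument above. This is an illustration, not Pulpie's benchmark: the GPU figures are approximate vendor peak numbers (dense FP16 FLOP/s and memory bandwidth) and should be treated as assumptions, and the model counts only weight reads, ignoring attention, activations, batching, and KV caches. All function names are hypothetical.

```python
# Approximate peak specs (assumptions): (dense FP16 FLOP/s, memory bandwidth in bytes/s)
GPUS = {
    "A100": (312e12, 1555e9),
    "L4": (121e12, 300e9),
}

def ridge_point(peak_flops, mem_bandwidth):
    """FLOPs per byte a workload needs before the GPU becomes compute-bound."""
    return peak_flops / mem_bandwidth

def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read.

    Each weight contributes about 2 FLOPs (multiply + add) per token and is
    read once per forward pass, so intensity grows with tokens per pass.
    """
    return 2 * tokens_per_pass / bytes_per_param

def bound(tokens_per_pass, gpu):
    """Classify a workload as 'compute' or 'bandwidth' bound on the given GPU."""
    peak_flops, mem_bandwidth = GPUS[gpu]
    if arithmetic_intensity(tokens_per_pass) >= ridge_point(peak_flops, mem_bandwidth):
        return "compute"
    return "bandwidth"

for gpu in GPUS:
    # A decoder emits 1 token per pass; an encoder sees the whole input at once.
    print(gpu, "decoder step:", bound(1, gpu), "| encoder pass (512 tokens):", bound(512, gpu))
```

Under these assumptions the L4 needs roughly twice as many FLOPs per byte as the A100 to saturate its compute, which is consistent with a compute-bound model gaining more on the L4.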
Introducing Pulpie, models for cleaning the web.
70% of a typical HTML page is ads, navigation, and sidebars. Pulpie removes this noise and returns a clean document you can use in pre-training or as context.
Previously, cleaning 1 billion pages with the best extractors cost $159,000. Pulpie brings that down to $7,900. A 20x decrease.
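The arithmetic behind those figures, as a quick check. Only the two dollar amounts and the page count come from the post; the variable names are illustrative.

```python
PAGES = 1_000_000_000      # pages cleaned
BASELINE_COST = 159_000    # USD, best existing extractors (from the post)
PULPIE_COST = 7_900        # USD, Pulpie (from the post)

# Cost per million pages for each option
baseline_per_million = BASELINE_COST * 1_000_000 / PAGES
pulpie_per_million = PULPIE_COST * 1_000_000 / PAGES

# How many times cheaper Pulpie is
ratio = BASELINE_COST / PULPIE_COST

print(f"baseline: ${baseline_per_million:.2f} per million pages")
print(f"pulpie:   ${pulpie_per_million:.2f} per million pages")
print(f"ratio:    {ratio:.1f}x")
```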
🔥🚀Chonkie 1.6.8 is out with PyEmscripten wheels!
🐍You can now run the full Python Chonkie library directly in your browser using Pyodide🦛
This is an early release that brings Chonkie’s core functionality to the web, and it's a big step towards browser-native Python chunking.
We’d love to hear your feedback! Try it out and let us know what works, what doesn't, and what you'd like to see next!
Happy coding! 🦛✨
Happy to share that @ChonkieAI just crossed 4K GitHub stars and 4 million PyPI downloads! 🎉
From a small side project to a fast-growing library used by millions.
Huge thanks to the community. Stay tuned, bigger things are coming 🔥
@TeraflopAI You can also check out this excellent blog post by @EnricoShippole where he breaks down what you can build on top of this:
🔗 teraflopai.com/blog/chonkie
Highly recommended if you want to see the real possibilities 👀
🐦⬛+🦛 = 🔥
You can now leverage @TeraflopAI directly inside the Chonkie ecosystem.
Seamless integration, zero friction. Just pure power.
Check the post below for all the details 👇
We are happy to announce our official integration into @ChonkieAI. You can now use @TeraflopAI’s powerful text segmentation API directly within the Chonkie open-source library to chunk text seamlessly at scale.
As always, a huge shout-out to all our contributors. They're the reason open source thrives. Check out all of our releases on GitHub:
ChonkieJS: github.com/chonkie-inc/ch…
Chonkie Python: github.com/chonkie-inc/ch…
Happy Chonking! 🦛✨
Chonkie Python continues to grow with our latest addition, the TeraflopAI Chunker.
With help from TeraflopAI's segmentation API, this chunker specializes in semantic splitting. We've found it particularly good for legal documents.
🚀 chunk v0.10.1 — multi-byte pattern chunking
Split text at CJK punctuation, emoji, tokenizer markers, or any UTF-8 pattern. Compose with single-byte delimiters. Zero-copy. UTF-8 safe.
SIMD-accelerated for 1-3 patterns, Aho-Corasick for 4+.
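A minimal Python sketch of the idea of splitting at multi-byte patterns. It is not the chunk library's API or implementation (no SIMD, no Aho-Corasick, no zero-copy); `split_at_patterns` is a hypothetical helper that uses a regex alternation and keeps each delimiter attached to the chunk it ends.

```python
import re

def split_at_patterns(text, patterns):
    """Split text after every occurrence of any pattern (hypothetical helper).

    Patterns may be any strings: CJK punctuation, emoji, or ASCII delimiters.
    Python strings are sequences of code points, so a multi-byte character is
    never cut in half.
    """
    patterns = [p for p in patterns if p]  # ignore empty patterns
    if not patterns:
        return [text] if text else []
    # Try longer patterns first so they win over their own prefixes.
    ordered = sorted(patterns, key=len, reverse=True)
    matcher = re.compile("|".join(re.escape(p) for p in ordered))

    chunks = []
    start = 0
    for match in matcher.finditer(text):
        chunks.append(text[start:match.end()])  # delimiter stays with its chunk
        start = match.end()
    if start < len(text):
        chunks.append(text[start:])  # trailing text without a delimiter
    return chunks

chunks = split_at_patterns("你好。世界！ok. done", ["。", "！", ". "])
print(chunks)
```

Joining the chunks back together reproduces the input exactly, which is a useful property to check in any chunker.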
🔗 github.com/chonkie-inc/ch…
Our OSS engineer @itsclelia recently built litesearch, a fully local document ingestion and retrieval CLI/TUI application powered by LiteParse ⚡
litesearch demonstrates how developers can assemble a high-performance, local-first retrieval pipeline using open tools from across the ecosystem:
• Parsing: LiteParse, the fast and accurate document parser we recently open sourced
• Chunking: @ChonkieAI
• Embeddings: A local @nomic_ai model via @huggingface transformers.js
• Vector storage: A local @qdrant_engine edge shard (custom-built in Rust and compiled as a native add-on)
• Retrieval: Query stored files with optional path-based filtering and configurable relevance thresholds
• Runtime: @bunjavascript for speed and versatility
💻 Check out the repository and try it yourself: github.com/AstraBert/lite…
📚 LiteParse docs: developers.llamaindex.ai/liteparse?utm_…
stop using MCPs to give your agents documentation.
mandex gives your AI agent the same docs — but local, version-pinned, and 25ms instead of 800ms.
One download. Works offline. No API keys. No rate limits. No monthly bill.
100+ packages. 1,500+ versions. Free forever.
🔗👇
Shoutout to all our contributors who chipped in big for this update. Check out our full release notes on GitHub. github.com/chonkie-inc/ch…
Happy chonking! 🦛✨
⚡ Native async support is here.
await chunker.achunk(text)
Every chunking method now has an async equivalent. Chonkie no longer blocks your event loop.
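A minimal sketch of the pattern, assuming a synchronous `chunk` method paired with an async `achunk` that runs it off the event loop. `SimpleChunker` is a hypothetical stand-in, not a Chonkie class, and Chonkie's own implementation may differ; only the `achunk` name comes from the post.

```python
import asyncio

class SimpleChunker:
    """Hypothetical fixed-size character chunker, for illustration only."""

    def __init__(self, chunk_size=10):
        self.chunk_size = chunk_size

    def chunk(self, text):
        """Synchronous chunking: slice the text into fixed-size pieces."""
        return [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]

    async def achunk(self, text):
        """Async equivalent: run the sync method in a worker thread
        so the event loop stays free (requires Python 3.9+)."""
        return await asyncio.to_thread(self.chunk, text)

async def main():
    chunker = SimpleChunker(chunk_size=10)
    docs = ["a" * 25, "b" * 5]
    # Chunk several documents concurrently without blocking the loop.
    return await asyncio.gather(*(chunker.achunk(doc) for doc in docs))

results = asyncio.run(main())
print(results)
```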
🦛 Chonkie v1.6 just dropped. Our biggest release yet.
→ HTML table chunking
→ Self-hostable chunking API
→ Native async support
The hippo is growing bigger and stronger. Here's what's new 🧵