lvhan028
  • China

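Starred repositories. Each entry gives the repository description, followed by its primary language (when listed), star count, fork count, and last-update date.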

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Rust 9,234 822 Updated Jan 8, 2025

Tile primitives for speedy kernels

Cuda 1,907 92 Updated Jan 4, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,332 132 Updated Jan 7, 2025

Material for gpu-mode lectures

Jupyter Notebook 3,425 347 Updated Jan 6, 2025
C++ 50 2 Updated Dec 4, 2024

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 677 52 Updated Sep 4, 2024

A throughput-oriented high-performance serving framework for LLMs

Cuda 686 29 Updated Sep 21, 2024

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 279 43 Updated Jan 8, 2025

A data annotation toolbox that supports image, audio, and video data.

Python 945 92 Updated Jan 8, 2025

A Comprehensive Toolkit for High-Quality PDF Content Extraction

Python 6,303 413 Updated Jan 3, 2025

A one-stop, open-source, high-quality data extraction tool for converting PDF to Markdown and JSON.

Python 23,584 1,730 Updated Jan 8, 2025

[NeurIPS'24 Spotlight] To speed up long-context LLM inference, it computes attention with approximate, dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 whil…

Python 861 39 Updated Dec 28, 2024

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python 5,746 520 Updated Dec 14, 2024

Run any open-source LLM, such as Llama or Mistral, as an OpenAI-compatible API endpoint in the cloud.

Python 10,351 657 Updated Jan 6, 2025

Self-host LLMs with LMDeploy and BentoML

Python 17 2 Updated Dec 25, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 5,113 456 Updated Jan 8, 2025

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Python 2,697 163 Updated Dec 26, 2024

✨✨Latest Advances on Multimodal Large Language Models

13,415 850 Updated Jan 6, 2025

[ACL2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

334 10 Updated Mar 22, 2024

Get up and running with Llama 3.3, Mistral, Gemma 2, and other large language models.

Go 106,431 8,517 Updated Jan 8, 2025

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python 21,012 2,311 Updated Aug 12, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 1,746 170 Updated Jan 8, 2025

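A curated collection of open-source Chinese large language models, focusing on smaller models that can be deployed privately and are cheaper to train, including base models, domain-specific fine-tunes and applications, datasets, and tutorials.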

17,341 1,661 Updated Sep 19, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python 7,166 665 Updated Jan 8, 2025

LLM Group Chat Framework: chat with multiple LLMs at the same time.

TypeScript 260 23 Updated Apr 10, 2024

HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

Python 1,679 134 Updated Jan 7, 2025

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

C++ 8,045 418 Updated Sep 6, 2024

Mamba SSM architecture

Python 13,718 1,176 Updated Jan 6, 2025

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python 1,937 176 Updated Nov 20, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 9,100 1,050 Updated Jan 8, 2025