With TensorRT-LLM, Google's latest open language models can now run accelerated on NVIDIA AI platforms, including local RTX AI PCs.
On February 21, 2024, NVIDIA, in collaboration with Google, announced optimizations for Gemma across all NVIDIA AI platforms. Gemma is Google's family of state-of-the-art lightweight 2B and 7B open language models, designed to run anywhere, reducing costs and speeding up innovation for domain-specific use cases.
Teams from NVIDIA and Google have worked closely together, using NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, to accelerate Gemma, which is built on the same research and technology as the Gemini models, when running on NVIDIA GPUs in data centers, in the cloud, and on PCs equipped with NVIDIA RTX GPUs.
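As a rough illustration of what running Gemma through TensorRT-LLM can look like, the sketch below uses the library's high-level Python LLM API. The model identifier, prompt, and sampling settings are placeholders, and the exact API surface may vary between TensorRT-LLM releases, so treat this as a sketch rather than a definitive recipe.

```python
# Minimal sketch: generating text from a Gemma checkpoint with
# TensorRT-LLM's high-level Python API. API names assume a recent
# TensorRT-LLM release; the model path and settings are placeholders.
from tensorrt_llm import LLM, SamplingParams


def main():
    # The LLM wrapper builds or loads a TensorRT engine for the model.
    llm = LLM(model="google/gemma-7b-it")  # hypothetical local or Hub path

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    outputs = llm.generate(
        ["Explain what TensorRT-LLM does in one sentence."],
        sampling,
    )
    for out in outputs:
        print(out.outputs[0].text)


if __name__ == "__main__":
    main()
```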
This lets developers target the installed base of more than 100 million NVIDIA RTX GPUs in high-performance AI PCs worldwide.
Developers can also run Gemma in the cloud on NVIDIA GPUs, including upcoming instances built on NVIDIA H200 Tensor Core GPUs, each with 141GB of HBM3e memory at 4.8TB/s, which Google plans to deploy later this year.
Furthermore, enterprise developers can fine-tune Gemma with NVIDIA's rich ecosystem of tools, including NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM, and deploy the optimized models in production applications.
For more details on how TensorRT-LLM accelerates Gemma inference, along with other resources for developers, including multiple Gemma model checkpoints and FP8-quantized versions of the model, all optimized with TensorRT-LLM, visit [link].
You can directly experience Gemma 2B and Gemma 7B on the NVIDIA AI Playground through your browser.
Gemma is coming soon to Chat with RTX.
Support for Gemma is also coming soon to the NVIDIA Chat with RTX tech demo, which uses retrieval-augmented generation (RAG) and TensorRT-LLM software to give users generative AI capabilities on their local RTX-powered Windows PCs. Video Link: https://www.bilibili.com/video/BV1Ky421z7PT/
With Chat with RTX, users can easily connect local files on their PC to a large language model, making it possible to build a personalized chatbot from their own data.
Since the model runs locally, results are generated quickly and user data stays on the device. Unlike cloud-based LLM services, Chat with RTX lets users handle sensitive data on a local PC without sharing it with a third party or even connecting to the internet.
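For readers curious about the general pattern behind such a tool, the sketch below shows a bare-bones retrieval-augmented generation loop: embed local documents, retrieve the most relevant ones for a question, and prepend them to the prompt sent to a local model. This is not Chat with RTX's actual implementation; the embed() and generate() functions are stand-ins for a real embedding model and a local, TensorRT-LLM-backed generator.

```python
# Bare-bones RAG loop over local documents. embed() and generate() are
# placeholders for whatever local embedding model and LLM backend you use.
import numpy as np


def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: swap in a real embedding model; here we return
    # deterministic random unit vectors just so the script runs.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


def generate(prompt: str) -> str:
    # Placeholder: call a locally running LLM (e.g., via TensorRT-LLM) here.
    return f"[model answer conditioned on a prompt of {len(prompt)} chars]"


def answer(question: str, documents: list[str], top_k: int = 3) -> str:
    doc_vecs = embed(documents)            # index the user's local files
    q_vec = embed([question])[0]
    scores = doc_vecs @ q_vec              # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(documents[i] for i in best)
    prompt = (
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)


if __name__ == "__main__":
    docs = ["Notes about project A.", "Meeting minutes.", "Travel itinerary."]
    print(answer("What do my notes say about project A?", docs))
```

In a real deployment the documents would be chunked and indexed once, with only the retrieval and generation steps running per query; that design choice keeps responses fast while the data never leaves the machine.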