Key Technology
Patent 1
Optimal Model Compression
Technology that lightweights a given model by searching for Pareto-optimal compressed candidates created through various combinations of compression techniques.
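The search described above can be sketched as a Pareto-front filter over compressed candidates. This is an illustrative sketch only, not DeepAuto's implementation; the candidate names, sizes, and accuracies are made up.

```python
# Illustrative sketch: given candidates produced by different combinations
# of compression techniques, keep only the Pareto-optimal ones, i.e. those
# for which no other candidate is both smaller and at least as accurate.

def pareto_front(candidates):
    """candidates: list of (name, size_gb, accuracy).
    A candidate is dominated if some other candidate is strictly smaller
    with accuracy at least as high, or the same size with strictly higher
    accuracy."""
    front = []
    for name, size, acc in candidates:
        dominated = any(
            (s < size and a >= acc) or (s <= size and a > acc)
            for _, s, a in candidates
        )
        if not dominated:
            front.append((name, size, acc))
    return front

candidates = [
    ("fp16 baseline",       43.7, 0.80),
    ("8-bit quant",         22.0, 0.79),
    ("8-bit quant + prune", 13.5, 0.79),  # dominates plain 8-bit quant
    ("4-bit quant",          6.1, 0.76),
    ("4-bit quant + prune",  5.0, 0.60),  # smallest, but large quality drop
]
print(pareto_front(candidates))  # plain "8-bit quant" is dominated and dropped
```

A real system would attach latency and cost objectives as well, but the dominance test generalizes directly to more dimensions.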
Comparison with LLM
vLLM is the current state-of-the-art library for fast and easy-to-use LLM inference and serving.
| Metric | As-Is | Our Solution 01 | Our Solution 02 |
|---|---|---|---|
| Target Model | Luxia 21.4B | Luxia 21.4B | DeepAuto Model Search |
| Serving Framework | vLLM (Flash & Paged Attention) | DeepAuto Compression | DeepAuto Compression |
| Memory Size (GB) | 43.72 | 13.5 (-69.12%) | 6.10 (-86.04%) |
| Latency (C=32k, B=16, ms) | 186.45 | 76.85 (-58.78%) | 40.07 (-78.50%) |
| Throughput (C=32k, req/s) | 0.039 | 0.059 (+51.28%) | 0.062 (+58.97%) |
| LogicKor (Evaluation #1) | 3.775 | 3.775 (0%) | 5.675 (+50.33%) |
| KMMLU-Hard (Evaluation #2) | 0.21 | 0.24 (+14.3%) | 0.23 (+9.52%) |
Comparison with Flash Attention
Flash Attention is the state-of-the-art technique for accelerating the attention mechanism.
| Model | Metric | PyTorch | Flash Attention | Ours (HiP Attention) |
|---|---|---|---|---|
| LLaMA-2-7B | Latency (prompt, μs) | 576.31 (x1.00) | 562.76 (x1.02) | 15.62 (x36.90) |
| LLaMA-2-7B | Latency (decode, ms) | OOM | 56.64 (x1.00) | 18.80 (x3.01) |
| LLaMA-2-13B | Latency (prompt, μs) | 719.72 (x1.00) | 705.34 (x1.02) | 19.67 (x36.58) |
| LLaMA-2-13B | Latency (decode, ms) | OOM | 71.17 (x1.00) | 23.40 (x3.04) |
Patent 2
Optimal Query Routing
Technology that instantly determines the difficulty of a given query and routes it to the optimal language model for quality and cost.
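The routing idea can be sketched as below. The difficulty heuristic and model names are hypothetical stand-ins; a production router would use a trained lightweight classifier rather than keyword matching.

```python
# Hypothetical sketch of difficulty-based query routing; the scoring
# heuristic and model tiers are illustrative, not DeepAuto's actual router.

def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1] based on keyword markers."""
    hard_markers = ("prove", "derive", "optimize", "regulation", "diagnos")
    score = 0.3 + 0.2 * sum(m in query.lower() for m in hard_markers)
    return min(score, 1.0)

def route(query: str) -> str:
    """Send easy queries to a small cheap model, hard ones to a large one."""
    d = estimate_difficulty(query)
    if d < 0.4:
        return "small-7b"       # cheapest tier
    elif d < 0.7:
        return "medium-8b"
    return "large-70b"          # highest quality, highest cost

print(route("What is the capital of France?"))       # small-7b
print(route("Prove that the optimizer converges."))  # large-70b
```

The point of the design is that most traffic is easy and can be answered by cheap models, so the average served model size stays far below the largest model in the pool.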
Comparison of Response Quality
Tested on 600 unseen queries from Finance, Law, BioMed, Code, Math, and General domains
Evaluated by Prometheus v2 on a scale from 1 (min) to 5 (max).
| Domain | Gemma-7B | AdaptLLM-Law | Meta-LLaMA-3-8B | Meta-LLaMA-3-70B | Our Query Routing |
|---|---|---|---|---|---|
| Provider | Google | Microsoft | Meta | Meta | DeepAuto.ai |
| Model Size | 7B | 7B | 8B | 70B | Avg. 1.28B |
| Finance | 2.24 | 4.33 | 4.2 | 4.35 | 4.52 |
| Law | 3.21 | 3.42 | 3.77 | 4.28 | 4.46 |
| BioMed | 3.6 | 3.75 | 4.26 | 4.02 | 4.5 |
| Code | 3.08 | 3.76 | 4.3 | 4.16 | 4.86 |
| Math | 2.36 | 3.66 | 3.91 | 3.48 | 4.68 |
| General | 3.7 | 3.52 | 4.49 | 4.32 | 4.51 |
| Average | 3.03 | 3.74 | 4.16 | 4.1 | 4.59 |
Frequency of Model Selection
Tested on 600 unseen queries from Finance, Law, BioMed, Code, Math, and General domains
Explore Other Technologies
Model Hub & Dataset Hub
Explore the latest models & datasets
We periodically integrate the latest open-source models and datasets and provide intuitive visualization tools for analyzing their key metrics, making it easier to explore both.
Optimal Model Search
Find the perfect model for you
DeepAuto.ai's Model Search technology routes each user query in real time to the optimal LLM from a database of more than 10,000 SOTA models, taking the complexity of the query into account. It can also recommend the model with the lowest serving cost among the optimal candidates.
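The cost-aware selection step can be sketched as below; the model names, predicted quality scores, and prices are invented for the example.

```python
# Illustrative sketch: among models whose predicted quality for a query
# clears a threshold, recommend the one with the lowest serving cost.
# All names, scores, and prices here are made up.

MODELS = [
    # (name, predicted_quality, usd_per_1k_tokens)
    ("small-7b",  3.7, 0.0002),
    ("medium-8b", 4.2, 0.0004),
    ("large-70b", 4.3, 0.0030),
]

def cheapest_adequate(models, min_quality):
    ok = [m for m in models if m[1] >= min_quality]
    if not ok:
        return max(models, key=lambda m: m[1])  # fall back to best quality
    return min(ok, key=lambda m: m[2])

print(cheapest_adequate(MODELS, 4.0)[0])  # medium-8b: adequate, far cheaper than large-70b
```

The design separates the two concerns cleanly: a quality predictor filters candidates, then price decides among the survivors.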
Optimal Model Compression
Compress your model optimally
DeepAuto.ai's Model Compression technology can reduce the size of a given target model by up to 80% while minimizing performance loss. Additionally, by applying an efficient attention mechanism, it enables up to a 4x increase in generation speed.
PEFT & Hyper-parameter Optimization
Fine-tune your model efficiently
DeepAuto.ai's Finetuning technology significantly reduces training time and cost through parameter-efficient methods. It also integrates hyperparameter optimization to enable effective training.
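To see why parameter-efficient methods cut training cost, here is a back-of-the-envelope sketch using LoRA-style low-rank adapters (an assumption for illustration; the text does not name the exact method). Instead of updating a full d_out x d_in weight matrix, one trains two low-rank factors B (d_out x r) and A (r x d_in) and adds B @ A to the frozen weight.

```python
# Trainable-parameter count: full fine-tuning vs. a low-rank adapter
# (LoRA-style; assumed method, dimensions are typical for a 7B-class model).

def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, r):
    # B is d_out x r, A is r x d_in; only these are trained.
    return d_out * r + r * d_in

d_out = d_in = 4096   # one attention projection matrix
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, r=8)
print(full, lora, f"{100 * lora / full:.2f}% of full")  # 16777216 65536 0.39% of full
```

With under 1% of the parameters trainable per layer, optimizer state and gradient memory shrink proportionally, which is where most of the time and cost savings come from.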
Model Evaluation
Evaluate your model with ease
DeepAuto.ai's Evaluation feature allows you to effortlessly assess your model on important public or your own private benchmarks with just a few clicks. Utilize affordable computing infrastructure to evaluate and share your model.
Efficient & Stable Auto-scaled Serving
Serve your model at low-cost
DeepAuto.ai's Model Serving technology reliably and quickly serves optimally compressed models, offering the ability to automatically scale instances up or down in response to customer traffic. It provides a standardized API format to facilitate quick integration with customer applications.
Low-cost High-end GPUs & Data Storage
Launch your cloud workspace
DeepAuto.ai's Cloud Workspace offers high-end GPUs at an affordable price, enabling customers to train and test large AI models anytime, anywhere.
Dynamic Query Routing
Instantly route each question to the optimal model
DeepAuto.ai's Dynamic Query Routing technology instantly predicts the optimal model to answer a given user's question most accurately and appropriately. It then connects to the most cost-effective model, ensuring the highest quality while minimizing API costs.
Multi-modal Language AI
Capable of handling various input formats
DeepAuto.ai's Multimodal-Language AI supports various input formats, not just text. It enables question-answering from diverse input materials such as PDFs and images.
RAG Technology
Retrieve accurate data from the Web, databases, and more
DeepAuto.ai's RAG technology is equipped with a retriever that explores various external data and a verifier that validates the retrieved data, providing accurate and verified answers.
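The retrieve-then-verify pattern described above can be sketched minimally as below. The toy corpus and word-overlap scoring are illustrative stand-ins; the real retriever and verifier are not specified here.

```python
# Minimal retrieve-then-verify sketch: a retriever ranks documents for a
# query, then a verifier accepts only documents that actually support it.
# Corpus and scoring are toy examples.

CORPUS = [
    "The Basel III accord sets bank capital requirements.",
    "Photosynthesis converts light energy into chemical energy.",
    "vLLM serves large language models with paged attention.",
]

def retrieve(query, corpus, k=2):
    """Rank documents by word overlap with the query, return the top k."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def verify(query, doc, min_overlap=2):
    """Accept a document only if it shares enough terms with the query."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) >= min_overlap

query = "what does vllm use to serve large language models"
evidence = [d for d in retrieve(query, CORPUS) if verify(query, d)]
print(evidence)  # only the vLLM document survives verification
```

The separation matters: retrieval alone returns the top-k regardless of relevance, while the verifier discards weakly matching documents before they can mislead the answer generator.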