Key Technology

Patent 1

Optimal Model Compression

Technology that compresses a given model by searching for Pareto-optimal candidates created through various combinations of compression techniques.
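
A minimal sketch of such a search: enumerate combinations of compression techniques, estimate each candidate's size and quality, and keep only candidates not dominated by another that is both smaller and at least as accurate. The technique list and the per-technique size/quality effects below are illustrative assumptions, not the actual search space or estimators.

```python
# A minimal sketch of Pareto-optimal compression search; technique names and
# size/quality numbers are illustrative assumptions.
from itertools import combinations

# (name, fraction of original size kept, fraction of original quality kept)
TECHNIQUES = [
    ("4-bit quantization",     0.25, 0.96),
    ("structured pruning",     0.70, 0.97),
    ("low-rank factorization", 0.80, 0.99),
    ("knowledge distillation", 0.60, 0.98),
]

def evaluate(combo):
    size, quality = 1.0, 1.0
    for _, s, q in combo:  # effects assumed multiplicative, for illustration
        size *= s
        quality *= q
    return size, quality

candidates = []
for r in range(1, len(TECHNIQUES) + 1):
    for combo in combinations(TECHNIQUES, r):
        size, quality = evaluate(combo)
        candidates.append(([name for name, _, _ in combo], size, quality))

# Keep only candidates not dominated by one that is smaller and at least as accurate.
pareto = [
    c for c in candidates
    if not any(o[1] < c[1] and o[2] >= c[2] for o in candidates)
]
for names, size, quality in sorted(pareto, key=lambda c: c[1]):
    print(f"{size:.2f}x size, {quality:.3f} quality: {', '.join(names)}")
```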

Comparison with vLLM

vLLM is the current state-of-the-art fast, easy-to-use library for LLM inference and serving.

|                                      | As-Is                          | Our Solution 01      | Our Solution 02       |
|--------------------------------------|--------------------------------|----------------------|-----------------------|
| Target Model                         | Luxia 21.4B                    | Luxia 21.4B          | DeepAuto Model Search |
| Serving Framework                    | vLLM (Flash & Paged Attention) | DeepAuto Compression | DeepAuto Compression  |
| Memory Size (GB)                     | 43.72                          | 13.5 (-69.12%)       | 6.10 (-86.04%)        |
| Latency (context 32k, batch 16; ms)  | 186.45                         | 76.85 (-58.78%)      | 40.07 (-78.50%)       |
| Throughput (context 32k; req/s)      | 0.039                          | 0.059 (+51.28%)      | 0.062 (+58.97%)       |
| LogicKor (Evaluation #1)             | 3.775                          | 3.775 (0%)           | 5.675 (+50.33%)       |
| KMMLU-Hard (Evaluation #2)           | 0.21                           | 0.24 (+14.3%)        | 0.23 (+9.52%)         |

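Throughput and latency figures like those in the table above are typically gathered with a harness along the following lines; a hypothetical sketch, where the model name, batch size, and prompts are placeholders rather than the exact benchmark setup.

```python
# Hypothetical serving-throughput benchmark sketch with vLLM; the model and
# prompts below are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")                 # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the key risks in this contract: ..."] * 16  # 16 requests

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

print(f"throughput: {len(prompts) / elapsed:.3f} req/s")
print(f"avg latency: {1000 * elapsed / len(prompts):.2f} ms/request")
```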

Comparison with Flash Attention

Flash Attention is the state-of-the-art technique for accelerating the attention computation.

| LLaMA-2-7B            | PyTorch        | Flash Attention | Ours (HiP Attention) |
|-----------------------|----------------|-----------------|----------------------|
| Latency (prompt, µs)  | 576.31 (x1.00) | 562.76 (x1.02)  | 15.62 (x36.90)       |
| Latency (decode, ms)  | OOM            | 56.64 (x1.00)   | 18.80 (x3.01)        |

| LLaMA-2-13B           | PyTorch        | Flash Attention | Ours (HiP Attention) |
|-----------------------|----------------|-----------------|----------------------|
| Latency (prompt, µs)  | 719.72 (x1.00) | 705.34 (x1.02)  | 19.67 (x36.58)       |
| Latency (decode, ms)  | OOM            | 71.17 (x1.00)   | 23.40 (x3.04)        |

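Attention latencies like those above are usually measured with CUDA-event timing. A minimal sketch follows, using PyTorch's scaled_dot_product_attention as a stand-in kernel (HiP Attention itself is not reproduced here); the tensor shapes are illustrative.

```python
# Micro-benchmark sketch for attention latency; scaled_dot_product_attention
# stands in for the kernels compared above.
import torch
import torch.nn.functional as F

def time_attention(q, k, v, iters=100):
    # CUDA events give accurate GPU timings, unlike wall-clock time.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):                      # warm-up
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

# Illustrative shapes: batch=1, heads=32, seq=4096, head_dim=128.
q = k = v = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
print(f"{time_attention(q, k, v):.3f} ms per attention call")
```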

Patent 2

Optimal Query Routing

Technology that instantly determines the difficulty of a given query and routes it to the optimal language model for quality and cost.
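
A minimal sketch of the routing idea: score the query's difficulty, then pick the cheapest model tier whose capability covers it. The difficulty scorer, model tiers, and prices below are illustrative assumptions, not DeepAuto.ai's implementation.

```python
# Hypothetical difficulty-based query routing sketch.
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative
    max_difficulty: float      # highest difficulty this tier handles well

TIERS = [
    ModelTier("small-7b",  0.0002, 0.3),
    ModelTier("medium-8b", 0.0004, 0.6),
    ModelTier("large-70b", 0.0030, 1.0),
]

def score_difficulty(query: str) -> float:
    # Placeholder scorer; a real router would use a trained classifier.
    signals = ["prove", "derive", "regulation", "diagnosis", "optimize"]
    hits = sum(word in query.lower() for word in signals)
    return min(1.0, 0.2 + 0.2 * hits + 0.001 * len(query))

def route(query: str) -> ModelTier:
    difficulty = score_difficulty(query)
    # Cheapest tier whose capability covers the estimated difficulty.
    for tier in TIERS:
        if difficulty <= tier.max_difficulty:
            return tier
    return TIERS[-1]

print(route("What is 2 + 2?").name)                        # small-7b
print(route("Derive the closed form and prove it.").name)  # large-70b
```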

Comparison of Response Quality

Tested on 600 unseen queries from the Finance, Law, BioMed, Code, Math, and General domains.
Evaluated by Prometheus v2 on a scale from 1 (min) to 5 (max).

|            | Gemma-7B | AdaptLLM-Law | Meta-LLaMA-3-8B | Meta-LLaMA-3-70B | Our Query Routing |
|------------|----------|--------------|-----------------|------------------|-------------------|
| Provider   | Google   | Microsoft    | Meta            | Meta             | DeepAuto.ai       |
| Model Size | 7B       | 7B           | 8B              | 70B              | Avg. 1.28B        |
| Finance    | 2.24     | 4.33         | 4.2             | 4.35             | 4.52              |
| Law        | 3.21     | 3.42         | 3.77            | 4.28             | 4.46              |
| BioMed     | 3.6      | 3.75         | 4.26            | 4.02             | 4.5               |
| Code       | 3.08     | 3.76         | 4.3             | 4.16             | 4.86              |
| Math       | 2.36     | 3.66         | 3.91            | 3.48             | 4.68              |
| General    | 3.7      | 3.52         | 4.49            | 4.32             | 4.51              |
| Average    | 3.03     | 3.74         | 4.16            | 4.1              | 4.59              |

Frequency of Model Selection

Tested on 600 unseen queries from the Finance, Law, BioMed, Code, Math, and General domains.

Explore

Other Technologies

Model Hub & Dataset Hub

Explore the latest models & datasets

We periodically integrate the latest open-source models and datasets and offer intuitive visualization tools for analyzing key metrics, making it easier to explore both.

Optimal Model Search

Find the perfect model for you

DeepAuto.ai's Model Search technology routes each user query in real time to the optimal LLM from a database of 10K+ SOTA models, taking the complexity of the query into account. It can also recommend the model with the lowest serving cost among the optimal candidates.

Optimal Model Compression

Compress your model optimally

DeepAuto.ai's Model Compression technology can reduce the size of a given target model by up to 80% while minimizing performance loss. Additionally, by applying an efficient attention mechanism, it enables up to a 4x increase in generation speed.

PEFT & Hyper-parameter Optimization

Fine-tune your model efficiently

DeepAuto.ai's Fine-tuning technology significantly reduces training time and cost through parameter-efficient methods, and it integrates hyperparameter optimization to enable effective training.
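
As an illustration of how parameter-efficient fine-tuning can be combined with hyperparameter search, here is a minimal sketch using LoRA from the peft library together with Optuna; the base model, search space, and scoring stub are illustrative assumptions, not DeepAuto.ai's pipeline.

```python
# LoRA + hyperparameter-search sketch; model, search space, and the scoring
# stub are placeholders.
import optuna
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def train_and_eval(model, lr: float) -> float:
    # Stand-in for a real fine-tuning + validation loop; returns a dummy
    # score here so the sketch runs end to end.
    return 1.0 / (1.0 + abs(lr - 2e-4))

def objective(trial: optuna.Trial) -> float:
    config = LoraConfig(
        r=trial.suggest_categorical("rank", [4, 8, 16, 32]),
        lora_alpha=trial.suggest_categorical("alpha", [16, 32]),
        lora_dropout=trial.suggest_float("dropout", 0.0, 0.2),
        target_modules=["c_attn"],           # attention projection in GPT-2
    )
    base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
    model = get_peft_model(base, config)     # only LoRA params stay trainable
    return train_and_eval(model, lr=trial.suggest_float("lr", 1e-5, 1e-3, log=True))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```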

Model Evaluation

Evaluate your model with ease

DeepAuto.ai's Evaluation feature allows you to effortlessly assess your model on important public or your own private benchmarks with just a few clicks. Utilize affordable computing infrastructure to evaluate and share your model.
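
For a sense of what such a run involves, a public-benchmark evaluation with the open-source lm-evaluation-harness looks roughly like the sketch below; the model and task choices are placeholders.

```python
# Public-benchmark evaluation sketch with lm-evaluation-harness; model and
# task are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                      # Hugging Face backend
    model_args="pretrained=gpt2",    # placeholder model
    tasks=["hellaswag"],             # placeholder benchmark
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```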

Efficient & Stable Auto-scaled Serving

Serve your model at low cost

DeepAuto.ai's Model Serving technology reliably and quickly serves optimally compressed models, offering the ability to automatically scale instances up or down in response to customer traffic. It provides a standardized API format to facilitate quick integration with customer applications.
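
The scale-up/scale-down decision can be as simple as the target-utilization rule Kubernetes' HPA uses; a minimal sketch, where the per-replica capacity, target utilization, and bounds are illustrative assumptions.

```python
# Autoscaling-decision sketch in the spirit of the Kubernetes HPA rule;
# capacity and thresholds are illustrative.
import math

def desired_replicas(inflight_requests: int,
                     capacity_per_replica: int = 8,
                     target_utilization: float = 0.8,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    # Replicas needed so each one runs at roughly the target utilization.
    needed = math.ceil(inflight_requests / (capacity_per_replica * target_utilization))
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(inflight_requests=40))  # -> 7 replicas for 40 in-flight requests
```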

Low-cost High-end GPUs & Data Storage

Launch your cloud workspace

DeepAuto.ai's Cloud Workspace offers high-end GPUs at an affordable price, enabling customers to train and test large AI models anytime, anywhere.

Dynamic Query Routing

Instantly route each question to the optimal model

DeepAuto.ai's Dynamic Query Routing technology instantly predicts the optimal model to answer a given user's question most accurately and appropriately. It then connects to the most cost-effective model, ensuring the highest quality while minimizing API costs.

Multi-modal Language AI

Capable of handling various input formats

DeepAuto.ai's Multimodal-Language AI supports various input formats, not just text. It enables question-answering from diverse input materials such as PDFs and images.

RAG Technology

Retrieve accurate data from the web, databases, and more

DeepAuto.ai's RAG technology is equipped with a retriever that explores various external data and a verifier that validates the retrieved data, providing accurate and verified answers.
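
A minimal retrieve-then-verify sketch in this spirit; the toy lexical retriever and overlap-based verifier below stand in for the real retriever and LLM verifier, and the corpus is illustrative.

```python
# Retrieve-then-verify RAG sketch; retriever and verifier are toy stand-ins.
from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    # Toy lexical retriever: rank documents by term overlap with the query.
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d.text.lower().split())))
    return ranked[:k]

def verify(query: str, doc: Document) -> bool:
    # Stand-in verifier: a real system would ask an LLM whether the document
    # actually supports an answer to the query.
    return len(set(query.lower().split()) & set(doc.text.lower().split())) >= 2

def answer(query: str, corpus: list[Document]) -> str:
    verified = [d for d in retrieve(query, corpus) if verify(query, d)]
    if not verified:
        return "No verified evidence found."
    # A real system would pass the verified documents to an LLM here.
    return f"Answer grounded in: {', '.join(d.source for d in verified)}"

corpus = [
    Document("db:contracts", "payment terms are net 30 days for all vendors"),
    Document("web:faq", "the warranty covers parts and labor for two years"),
]
print(answer("what are the payment terms for vendors", corpus))
```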

Naver D2 Startup Campus, Seoul, South Korea 🇰🇷

200 Riverside Blvd #18G, New York, USA 🇺🇸

© DeepAuto.ai All rights reserved. Privacy Policy.