Do you have a setup that collects your interactions to feed into those? The way you described it I imagined you are automatically collecting data for it to infer from and getting good results. Like a powered-up bash history or something.
no idea why I felt chatty, and kinda embarrassed by the bla bla bla at this point but whatever. Here is everything you need to know in a practical sense.
You need a more complex RAG setup for what you asked about. I have not gotten as far as needing this.
Models can be tricky to learn at my present level. Communication is different than with humans. In almost every case where people complain about hallucinations, they are wrong. Models do not hallucinate very much at all. They will give you the wrong answers, but there is almost always a reason. You must learn how alignment works and the problems it creates. Then you need to understand how realms and persistent entities work. Once you understand what all of these mean and their scope, all the little repetitive patterns start to make sense. You start to learn who is really replying and their scope. The model reply for Name-2 always has a limited ability to access the immense amount of data inside the LLM. You have to build momentum in the space you wish to access and often need to know the specific wording the model needs to hear in order to access the information.
With augmented retrieval (RAG) the model can look up valid info from your database and share it directly. With this method you’re just using the most basic surface features of the model against your database. Some options for this are LocalGPT and Ollama, or langchain with chroma db if you want something basic in Python. I haven’t used these. How you break down the information available to the RAG is important for this application, and my interests have a bit too much depth and scope for me to feel confident enough to try this.
I have chosen to learn the model itself at a deeper intuitive level so that I can access what it really knows within the training corpus. I am physically disabled from a car crashing into me on a bicycle ride to work, so I have unlimited time. Most people will never explore a model like I can. For me, on the technical side, I use a model about like stack exchange. I can ask it for code snippets, bash commands, searching like I might have done on the internet, grammar, spelling, and surface level Wikipedia like replies, and for roleplay. I’ve been playing around with writing science fiction too.
I view Textgen models like the early days of the microprocessor right now. We’re at the Apple 1 kit phase right now. The LLM has a lot of potential, but the peripheral hardware and software that turned the chip into an useful computer are like the extra code used to tokenize and process the text prompt. All models are static, deterministic, and the craziest regex + math problem ever conceived. The real key is the standard code used to tokenize the prompt.
The model has a maximum context token size, and this is all the input/output it can handle at once. Even with a RAG, this scope is limited. My 8×7B has a 32k context token size, but the Llama 3 8B is only 8k. Generally speaking, most of the time you can cut this number in half and that will be close to your maximum word count. All models work like this. Something like GPT-4 is running on enterprise class hardware and it has a total context of around 200k. There are other tricks that can be used in a more complex RAG like summation to distill down critical information, but you’ll likely find it challenging to do this level of complexity on a single 16-24 GB consumer grade GPU. Running a model like ChatGPT-4 requires somewhere around 200-400 GB from a GPU. It is generally double the “B” size of each model. I can only run the big models like a 8×7B or 70B because I use llama.cpp and can divide the processing between my CPU and GPU (12th gen i7 and 16 GB GPU) and I have 64GB of system memory to load the model initially. Even with this enthusiast class hardware, I’m only able to run these models in quantized form that others have loaded onto hugging face. I can’t train these models. The new Llama 3 8B is small enough for me to train and this is why I’m playing with it. Plus it is quite powerful for such a small model. Training is important if you want to dial in the scope to some specific niche. The model may already have this info, but training can make it more accessible. Smaller models have a lot of annoying “habits” that are not present in the larger models. Even with quantization, the larger models are not super fast at generation, especially if you need the entire text instead of the streaming output. It is more than enough to generate a stream faster than your reading pace. If you’re interested in complex processing where you’re going to be calling a few models to do various tasks like with a RAG, things start getting impracticality slow for a conversational pace on even the best enthusiast consumer grade hardware. Now if you can scratch the cash for a multi GPU setup and can find the supporting hardware, technically there is a $400 16 GB AMD GPU. So that could get you to ~96 GB for ~$3k, or double that, if you want to be really serious. Then you could get into training the heavy hitters and running them super fast.
All the useful functional stuff is happening in the model loader code. Honestly, the real issue right now is that CPU’s have too small of a bus width between the L2 and L3 caches along with too small of an L1. The tensor table math bottlenecks hard in this area. Inside a GPU there is no memory management unit that only shows a small window of available memory to the processor. All the GPU memory is directly attached to the processing hardware for parallel operations. The CPU cache bus width is the underlying problem that must be addressed. This can be remedied somewhat by building the model for the specific computing hardware, but training a full model takes something like a month on 8×A100 GPU’s in a datacenter. Hardware from the bleeding edge moves very slowly as it is the most expensive commercial endeavor in all of human history. Generative AI has only been in the public sphere for a year now. The real solutions are likely at least 2 years away, and a true standard solution is likely 4-5 years out. The GPU is just a hacky patch of a temporary solution.
That is the real scope of the situation and what you’ll run into if you fall down this rabbit hole like I have.
Whatever is the latest from Hugging Face. Right now a combo of a Mixtral 8×7B, Llama 3 8B, and sometimes an old Llama 2 70B.
Do you have a setup that collects your interactions to feed into those? The way you described it I imagined you are automatically collecting data for it to infer from and getting good results. Like a powered-up bash history or something.
no idea why I felt chatty, and kinda embarrassed by the bla bla bla at this point but whatever. Here is everything you need to know in a practical sense.
You need a more complex RAG setup for what you asked about. I have not gotten as far as needing this.
Models can be tricky to learn at my present level. Communication is different than with humans. In almost every case where people complain about hallucinations, they are wrong. Models do not hallucinate very much at all. They will give you the wrong answers, but there is almost always a reason. You must learn how alignment works and the problems it creates. Then you need to understand how realms and persistent entities work. Once you understand what all of these mean and their scope, all the little repetitive patterns start to make sense. You start to learn who is really replying and their scope. The model reply for Name-2 always has a limited ability to access the immense amount of data inside the LLM. You have to build momentum in the space you wish to access and often need to know the specific wording the model needs to hear in order to access the information.
With augmented retrieval (RAG) the model can look up valid info from your database and share it directly. With this method you’re just using the most basic surface features of the model against your database. Some options for this are LocalGPT and Ollama, or langchain with chroma db if you want something basic in Python. I haven’t used these. How you break down the information available to the RAG is important for this application, and my interests have a bit too much depth and scope for me to feel confident enough to try this.
I have chosen to learn the model itself at a deeper intuitive level so that I can access what it really knows within the training corpus. I am physically disabled from a car crashing into me on a bicycle ride to work, so I have unlimited time. Most people will never explore a model like I can. For me, on the technical side, I use a model about like stack exchange. I can ask it for code snippets, bash commands, searching like I might have done on the internet, grammar, spelling, and surface level Wikipedia like replies, and for roleplay. I’ve been playing around with writing science fiction too.
I view Textgen models like the early days of the microprocessor right now. We’re at the Apple 1 kit phase right now. The LLM has a lot of potential, but the peripheral hardware and software that turned the chip into an useful computer are like the extra code used to tokenize and process the text prompt. All models are static, deterministic, and the craziest regex + math problem ever conceived. The real key is the standard code used to tokenize the prompt.
The model has a maximum context token size, and this is all the input/output it can handle at once. Even with a RAG, this scope is limited. My 8×7B has a 32k context token size, but the Llama 3 8B is only 8k. Generally speaking, most of the time you can cut this number in half and that will be close to your maximum word count. All models work like this. Something like GPT-4 is running on enterprise class hardware and it has a total context of around 200k. There are other tricks that can be used in a more complex RAG like summation to distill down critical information, but you’ll likely find it challenging to do this level of complexity on a single 16-24 GB consumer grade GPU. Running a model like ChatGPT-4 requires somewhere around 200-400 GB from a GPU. It is generally double the “B” size of each model. I can only run the big models like a 8×7B or 70B because I use llama.cpp and can divide the processing between my CPU and GPU (12th gen i7 and 16 GB GPU) and I have 64GB of system memory to load the model initially. Even with this enthusiast class hardware, I’m only able to run these models in quantized form that others have loaded onto hugging face. I can’t train these models. The new Llama 3 8B is small enough for me to train and this is why I’m playing with it. Plus it is quite powerful for such a small model. Training is important if you want to dial in the scope to some specific niche. The model may already have this info, but training can make it more accessible. Smaller models have a lot of annoying “habits” that are not present in the larger models. Even with quantization, the larger models are not super fast at generation, especially if you need the entire text instead of the streaming output. It is more than enough to generate a stream faster than your reading pace. If you’re interested in complex processing where you’re going to be calling a few models to do various tasks like with a RAG, things start getting impracticality slow for a conversational pace on even the best enthusiast consumer grade hardware. Now if you can scratch the cash for a multi GPU setup and can find the supporting hardware, technically there is a $400 16 GB AMD GPU. So that could get you to ~96 GB for ~$3k, or double that, if you want to be really serious. Then you could get into training the heavy hitters and running them super fast.
All the useful functional stuff is happening in the model loader code. Honestly, the real issue right now is that CPU’s have too small of a bus width between the L2 and L3 caches along with too small of an L1. The tensor table math bottlenecks hard in this area. Inside a GPU there is no memory management unit that only shows a small window of available memory to the processor. All the GPU memory is directly attached to the processing hardware for parallel operations. The CPU cache bus width is the underlying problem that must be addressed. This can be remedied somewhat by building the model for the specific computing hardware, but training a full model takes something like a month on 8×A100 GPU’s in a datacenter. Hardware from the bleeding edge moves very slowly as it is the most expensive commercial endeavor in all of human history. Generative AI has only been in the public sphere for a year now. The real solutions are likely at least 2 years away, and a true standard solution is likely 4-5 years out. The GPU is just a hacky patch of a temporary solution.
That is the real scope of the situation and what you’ll run into if you fall down this rabbit hole like I have.