• Pumpkin Escobar@lemmy.world
    4 months ago

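    There’s quantization, which basically compresses the model by storing each weight in a smaller data type. Going from 16-bit to 8-bit weights cuts memory requirements in half, and 4-bit cuts them to about a quarter, usually at some cost in output quality.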
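
    To put rough numbers on that, here’s a back-of-the-envelope sketch. It only counts weight storage (no activations, KV cache, or per-block quantization overhead), and the `weight_memory_gb` helper and the dtype table are just names I made up for illustration:

```python
# Rough weight-memory estimate. Ignores activations, KV cache,
# and the small overhead real quantization formats add.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_WEIGHT[dtype] / 1e9


if __name__ == "__main__":
    n_params = 405e9  # Llama 405B parameter count, rounded
    for dtype in BYTES_PER_WEIGHT:
        print(f"{dtype}: ~{weight_memory_gb(n_params, dtype):.1f} GB")
```

    For a 405B model that works out to about 810 GB at 16-bit, 405 GB at 8-bit, and a bit over 200 GB at 4-bit, so even heavily quantized it won’t fit on a single consumer GPU.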

    There’s also airllm, which loads one part of the model into memory, runs those calculations, unloads that part, loads the next part, and so on. It’s a nice option, but all that loading and unloading means performance is never going to be great, especially on a huge model like Llama 405B.
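
    The load/compute/unload loop looks something like this toy sketch. To be clear, this is not airllm’s actual API or code: the “layers” here are just a made-up scale and bias stored in JSON files, and `save_layers`, `run_streamed`, and `demo` are hypothetical names. It only shows the idea of keeping one piece in memory at a time:

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each toy layer (scale, bias) to its own file; return the paths."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths


def run_streamed(paths, x):
    """Apply the layers in order, holding only one layer in memory at a time."""
    for path in paths:
        with open(path) as f:
            layer = json.load(f)  # load this part from disk
        x = [layer["scale"] * v + layer["bias"] for v in x]  # run its calculation
        del layer  # drop it before loading the next part
    return x


def demo():
    """Save three toy layers to a temp directory and stream an input through them."""
    layers = [
        {"scale": 2.0, "bias": 1.0},
        {"scale": 0.5, "bias": 0.0},
        {"scale": 1.0, "bias": -1.0},
    ]
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(layers, d)
        return run_streamed(paths, [1.0, 2.0])


if __name__ == "__main__":
    print(demo())
```

    Every layer gets read from disk on every forward pass, which is exactly where the slowdown comes from: peak memory is about one layer, but you pay the disk read each time.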

    Then there are some neat projects to distribute models across multiple computers like exo and petals. They’re more targeted at a p2p-style random collection of computers. I’ve run petals in a small cluster and it works reasonably well.