Ollama releases Python and JavaScript Libraries (opens in new tab)

(ollama.ai)

607 pointsadamhowell2y ago149 comments

149 comments

Are these libraries for connecting to an ollama service that the user has already installed or do they work without the user installing anything? Sorry for not checking the code but maybe someone has the same question here.

I looked at using ollama when I started making FreeChat [0] but couldn't figure out a way to make it work without asking the user to install it first (think I asked in your discord at the time). I wanted FreeChat to be 1-click install from the mac app store so I ended up bundling the llama.cpp server instead which it runs on localhost for inference. At some point I'd love to swap it out for ollama and take advantage of all the cool model pulling stuff you guys have done, I just need it to be embeddable.

My ideal setup would be importing an ollama package in swift which would start the server if the user doesn't already have it running. I know this is just js and python to start but a dev can dream :)

Either way, congrats on the release!

[0]: https://github.com/psugihara/FreeChat

SnowLprd2y ago

On the subject of installing Ollama, I found it to be a frustrating and user-hostile experience. I instead recommend the much more user-friendly LLM[0] by Simon Willison.

Among the problems with Ollama include:

* Ollama silently adds a login item with no way to opt out: <https://github.com/jmorganca/ollama/issues/162>

* Ollama spawns at least four processes, some persistently in the background: 1 x Ollama application, 1 x `ollama` server component, 2 x Ollama Helper

* Ollama provides no information at install time about what directories will be created or where models will be downloaded.

* Ollama prompts users to install the `ollama` CLI tool, with admin access required, with no way to cancel, and with no way to even quit the application at that point. Ollama provides no clarity that about what is actually happening during this step: all it is doing is symlinking `/Applications/Ollama.app/Contents/Resources/ollama` to `/usr/local/bin/`

The worst part is that not only is none of this explained at install time, but the project README doesn’t tell you any of this information either. Potential users deserve to know what will happen on first launch, but when a PR arrived to at least provide that clarification in the README, Ollama maintainers summarily closed that PR and still have not rectified the aforementioned UX problems.

As an open source maintainer myself, I understand and appreciate that Ollama developers volunteer their time and energy into the project, and they can run it as they see fit. So I intend no disrespect. But these problems, and a seeming unwillingness to prioritize their resolution, caused me to delete Ollama from my system entirely.

As I said above, I think LLM[0] by Simon Willison is an excellent and user-friendly alternative.

[0]: https://llm.datasette.io/

WhackyIdeas2y ago

I think it boils down to a level of oblivious disrespect for the user from the points you raised about ollama. I am sure it’s completely unintentional from their dev’s, simply not prioritising the important parts which might be a little boring for them to spend time on, but to be taken seriously as a professional product I would expect more. Just because other apps may not have the same standards either re complete disclosure, it shouldn’t be normalised if you are wanting to be respected fully from other devs as well as the general public - after all, other devs who appreciate good standards will also be likely to promote a product for free (which you did for LLM[0]) so why waste the promotion opportunity when it results in even better code and disclosure.

b33j0r2y ago

I don’t have any direct criticisms about anyone in particular, but the other thing is that any rational person spending this much time and money probably attempts to think of a business plan. So vendor lock-in creeps into our ideas even unintentionally. Nothing wrong with intentionally, per se.

We all tell ourselves it’s value-add, but come on, there’s always an element of “we’ll make ourselves a one-stop shop!”

So for example, I think the idea of modelfiles is sound. Like dockerfiles, cool! But other than superficially, it’s a totally bespoke and incompatible with everything else we came up with last year.

Bespoke has its connotation for reasons. Last I checked, the tokenizer fails on whitespace sometimes. Which is fine except for “why did you make us all learn a new file format and make an improvised blunderbuss to parse it!?”

(Heh. Two spaces between the verb and the args gave me a most perplexing error after copy/pasting a file path).

skwirl2y ago

You don’t sound like the kind of user ollama was meant to serve. What you are describing is pretty typical of macOS applications. You were looking for more of a traditional Linux style command line process or a Python library. Looks like you found what you were after, but I would imagine that your definition of user friendly is not really what most people understand it to mean.

jsjohnst2y ago

Respectfully, I disagree. Not OP, but this “installer” isn’t a standard macOS installer. With a standard installer I can pick the “show files” menu option and see what’s being installed and where. This is home rolled and does what arguably could be considered shady dark patterns. When Zoom and Dropbox did similar things, they were rightly called out, as should this.

1 more reply

_ea1k2y ago

I partially agree. My only issue with them is that the documentation is a little more hidden than I'd like.

Their install is basically a tl;dr of an installer. That's great!

It'd be nice if it also pointed me directly to a readme with specific instructions on service management, config directories, storage directories, and where the history file is stored.

vegabook2y ago

nix-shell makes most of this go away, except the ollama files will still be in `~/.ollama` which you can delete at any time.

  nix-shell -p ollama

in two tmux windows, then

  ollama serve

in one and

  ollama run llama2

in the other.

Exit and all the users, processes etc, go away.

https://search.nixos.org/packages?channel=23.11&show=ollama&...

nulld3v2y ago

The Linux binary (pre-built or packaged by your distro) is just a CLI. The Mac binary instead also contains a desktop app.

I agree with OP that this is very confusing. The fact the Mac OS installation comes with a desktop app is not documented anywhere at all! The only way you can discover this is by downloading the Mac binary.

tesla_frunk2y ago

Is this any different from

    brew install ollama

siquick2y ago

"User hostile experience" is complete hyperbole and disrespectful to the efforts of the maintainers of this excellent library.

gremlinunderway2y ago

It's not hyperbole when he listed multiple examples and issues which clearly highlight why he calls it that.

I don't think there was anything hyperbolic or disrespectful in that post at all. If I was a maintainer there and someone put in the effort to list out the specific issues like that I would be very happy for the feedback.

People need to stop seeing negative feedback as some sort of slight against them. It's not. Any feedback should be seen as a gift, negative or positive alike. We live in a massive attention-competition world, so to get anyone to spend the time to use, test and go out of their way to even write out in detail their feedback on something you provide is free information. Not just free information, but free analysis.

Really wish that we could all understand and empathize with frustration on software has nothing to do with the maintainers or devs unless directly targeted.

You could say possibly that the overall tone of the post was "disrespectful" because of its negativity, but I think receiving that kind of post which ties together not just the issues in some bland objective manner but highlights appropriately the biggest pain points and how they're pain points in context of a workflow is incredibly useful.

I am constantly pushing and begging for this feedback on my work, so to get this for free is a gift.

1 more reply

SnowLprd2y ago

What I said is an utterly factual statement: I found the experience to be user-hostile. You might have a different experience, and I will not deny you your experience even in the face of your clearly-stated intention to deny me mine.

Moreover, I already conveyed my understanding of and appreciation for the work open-source maintainers do, and I outright said above that I intend no disrespect.

1 more reply

refulgentis2y ago

It's very, very, very annoying how much some people are tripping over themselves to pretend a llama.cpp wrapper is some gift of love from saints to the hoi polloi. Y'all need to chill. It's good work and good. It's not great or the best thing ever or particularly high on either simple user friendliness or power user friendly. It's young. Let it breathe. Let people speak.

2 more replies

config_yml2y ago

Indeed, I thought the user experience was great. Simple way to download, install and start: everything just worked.

fzysingularity2y ago

Big fan of Simon Willison's `llm`[1] client. We did something similar recently with our multi-modal inference server that can be called directly from the `llm` CLI (c.f. "Serving LLMs on a budget" [2]). There's also `ospeak` [3] which we'll probably try to integrate to talk to your LLM from console. Great to see tools that radically simplify the developer-experience for local LLMs/foundation models.

[1] https://github.com/simonw/llm

[2] https://docs.nos.run/docs/blog/serving-llms-on-a-budget.html...

[3] https://github.com/simonw/ospeak

jsjohnst2y ago

I agree that alternative is good, but if you want to try ollama without the user experience drawbacks, install via homebrew.

okasaki2y ago

There's also a docker container (that I can recommend): https://hub.docker.com/r/ollama/ollama

wrasee2y ago

I got the same feeling. I think it’s generally bad practice to ask a user for their admin password without a good rationale as to why you’re asking, particularly if it’s non-obvious. It’s the ‘trust me bro’ approach to security that that even if this is a trustworthy app it encourages the behaviour of just going ahead and entering your password and not asking too many questions.

The install on Linux is the same. You’re essentially encouraged to just

    curl https://ollama.ai/install.sh | sh

which is generally a terrible idea. Of course you can read the script but that misses the point in that that’s clearly not the intended behaviour.

As other commenters have said, it is convenient. Sure.

cqqxo4zV46cp2y ago

We really need to kill this meme. All the “pipe to shell” trick really did from a security perspective is lay bare to some naive people the pre-existing risks involved in running third-party code. I recall some secondary ‘exploits’ around having sh execute something different to what you’d see if you just inspected the script yourself, by way of serving different content, or some HTML/CSS wizardry to have you copy out something unexpected, or wherever. But really, modern-day Linux is less and less about ‘just’ installing packages from your first-party OS package manager’s repositories. Beyond that, piping a downloaded script to your shell is just a different way of being as insecure as most people already are anyway.

dinosaurdynasty2y ago

https://github.com/ollama/ollama/blob/main/docs/linux.md

They have manual install instructions if you are so inclined.

icyfox2y ago

Just for connecting to an existing service: https://github.com/ollama/ollama-python/blob/main/ollama/_cl...

thrdbndndn2y ago

For the client API it's pretty clear:

    from ollama import Client
    client = Client(host='http://localhost:11434')

But I don't quite get how the example in "Usage" can work:

    import ollama
    response = ollama.chat(model='llama2', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
    ])
    print(response['message']['content'])

Since there is no parameter for host and/or port.

jmorgan2y ago

Once you have a custom `client` you can use it in place of `ollama`. For example:

  client = Client(host='http://my.ollama.host:11434')
  response = client.chat(model='llama2', messages=[...])

1 more reply

Abishek_Muthian2y ago

I used Ollama docker image to integrate with Gait Analyzer[1], a self-hosted gait analysis tool; all I had to do was to set up the docker compose file,

I was able to get the setup done with a single script for the end user and I used langchain to interact with Ollama.

[1] https://github.com/abishekmuthian/gaitanalyzer

ivanfioravanti2y ago

I posted about the Python library few hours after release. Great experience. Easy, fast and works well.

I create a GIST with a quick and dirty way of generating a dataset for fine-tuning Mistral model using Instruction Format on a given topic: https://gist.github.com/ivanfioravanti/bcacc48ef68b02e9b7a40...

jumperabg2y ago

How does this fine-tuning work? I can see that you are loading a train.jsonl file and the some instructions but is the output model generated or this is some kind of a new way of training the models?

jerpint2y ago

The gist is only to create the dataset not to fine tune

eurekin2y ago

What's your observations about finetunes - are they really useful for anything practical? :)

tinyhouse2y ago

Does olana support fine-tuning? I assume not. (Not asking about finetuned models that I know they support)

pknerd2y ago

can we use it on cloud or I gotta download it locally? it might not work on my MacBook 2015 with 8GB ram

LoganDark2y ago

Gist isn't an acronym, it's a word. (e.g. "get the gist of things")

filleokus2y ago

An off topic question: Is there such a thing as a "small-ish language model". A model that you could simple give instructions / "capabilities" which a user can interact with. Almost like Siri-level of intelligence.

Imagine you have an API-endpoint where you can set the level of some lights and you give the chat a system prompt explaining how to build the JSON body of the request, and the user can prompt it with stuff like "Turn off all the lights" or "Make it bright in the bedroom" etc.

How low could the memory consumption of such a model be? We don't need to store who the first kaiser of Germany was, "just" enough to kinda map human speech onto available API's.

andy992y ago

There are "smaller" models, for example tinyllama 1.1B (tiny seems like an exaggeration). PHI2 is 2.7B parameters. I can't name a 500M parameter model but there is probably one.

The problem is they are all still broadly trained and so they end up being Jack of all trades master of none. You'd have to fine tune them if you want them good at some narrow task and other than code completion I don't know that anyone has done that.

If you want to generate json or other structured output, there is Outlines https://github.com/outlines-dev/outlines that constrains the output to match a regex so it guarantees e.g. the model will generate a valid API call, although it could still be nonsense if the model doesn't understand, it will just match the regex. There are other similar tools around. I believe llama.cpp also has something built in that will constrain the output to some grammar.

nattaylor2y ago

https://pypi.org/project/languagemodels/ can load some small models but forming JSON-reliably seems to require a larger-ish model (or fine tuning)

Aside: I expect Apple will do exactly what you're proposing and that's why they're exposing more APIs for system apps

spaniard892772y ago

Not really. You can use small models for task like text classification etc (traditional nlp) and those run in pretty much anything. We're talking about BERT-like models like distillbert for example.

Now, models that have "reasoning" as an emergent property... I haven't seen anthing under 3B that's capable of making anything useful. The smaller I've seen is litellama and while it's not 100% useless, it's really just an experiment.

Also, everything requires new and/or expensive hardware. For GPU you really are about 1k€ at minumum for something decent for running models. CPU inference is way slower and forget about anythin that has no AVX and preferably AVX2.

I try models on my old thinkpad x260 with 8Gb ram, which is perfectly capable for developing stuff and those small task oriented I've told you about, but even though I've tried everything under the sun, with quantization etc, it's safe to say you can only run decent LLMs with a decent inference speed with expensive hardware now.

Now, if you want task like, language detection, classifying text into categories, etc, very basic Question Answering, then go on HugginFace and try youself, you'll be capable of running most models on modest hardware.

In fact, I have a website (https://github.com/iagovar/cometocoruna/tree/main) where I'm using a small flask server in my data pipeline to extract event information from text blobs I get scraping sites. That runs every day in an old Atom + 4Gb RAM laptop that I use as sever.

Experts in the field say that might change (somewhat) with mamba models, but I can't really say more.

I've been playing with the idea of dumping some money. But I'm 36, unemployed and just got into coding about 1.5 years ago, so until I secure some income I don't want to hit my saving hard, this is not the US where I can land a job easy (Junior looking for job, just in case someone here needs one).

oblio2y ago

Speaking of, I imagine Alexa, Siri, etc, should now be replaced by LLMs? Or where they already implemented using LLMs?

3abiton2y ago

Exactly this, I have not yet to run into a "small" model that is good enough (gpt-3) quality

reacharavindh2y ago

Not directly related to what Ollama aims to achieve. But, I’ll ask nevertheless.

Local LLMs are great! But, it would be more useful once we can _easily_ throw our own data for them to use as reference or even as a source of truth. This is where it opens doors that a closed system like OpenAI cannot - I’m never going to upload some data to ChatGPT for them to train on.

Could Ollama make it easier and standardize the way to add documents to local LLMs?

I’m not talking about uploading one image or model and asking a question about it. I’m referring to pointing a repository of 1000 text files and asking LLMs questions based on their contents.

jerpint2y ago

For now RAG is the best “hack” to achieve this at very low cost since it doesn’t require any fine tuning

I’ve implemented a RAG library if you’re ever interested but they are a dime a dozen now :)

https://www.github.com/jerpint/buster

jampekka2y ago

Sounds like Retrieval Augmented Generation. This is the technique used by e.g. most customized chatbots.

emmanueloga_2y ago

I don’t know if Ollama can do this but https://gpt4all.io/ can.

reacharavindh2y ago

Basically, I want to do what this product does, but locally with a model running on Ollama. https://www.zenfetch.com/

rex1232y ago

Hey - Akash from Zenfetch here. We’ve actually tested some of our features with local models and have found that they significantly underperform compared to hosted models. With that said, we are actively working on new approaches to offer a local version of Zenfetch.

In the meanwhile, we do have agreements in place with all of our AI providers to ensure none of our users information is used for training or any other purpose. Hope that helps!

reacharavindh2y ago

Hey. Congratulations on your product. I’m guessing it will be greatly useful for your target audience.

I don’t have a serious need that I’d think worth paying for. So, I’m probably not in your target. I wanted to do this for a personal use case.

Throw all my personal documents at a local model and ask very personal questions like “the investment I made on that thing in 2010, how did I do against this other thing?” Or “from my online activity, when did I start focusing on this X tech?” Or even “find me that receipt/invoice from that ebike I purchased in 2021 and the insurance I took out on it”.

There is no way I’m taking the promise of a cloud product and upload all my personal documents to it. Hence my ask about the ability to do this locally - slowly is perfectly fine for my cheap need :-)

1 more reply

camillomiller2y ago

Interactive smart knowledge bases is such a massively cool direction for LLMs. I’ve seen Chat with RTX at the NVIDIA preview at CES and it’s mindblowingly simple and cool to use. I believe that interactive search in limited domains is gonna be massive for LLMs

NetOpWibby2y ago

Ooh, I want this too.

asterix_pano2y ago

Llama_index basically does that. You have even some tuto using Streamlit that creates a UI around it for you.

BeetleB2y ago

> I’m never going to upload some data to ChatGPT for them to train on.

If you use the API, they do not train on it.

(However, that doesn't mean they don't retain it for a while).

As others have said, RAG is probably the way to go - although I don't know how well RAG performs on local LLMs.

martin822y ago

Data is the new oil.

You can be 100% sure that OpenAI will do whatever they want whenever they want with any and every little bit of data that you upload to them.

With GPTs and their Embeddings endpoint, they encourage you to upload your own data en masse.

sciolist2y ago

There's two main ways to "add documents to LLMs" - using documents in retrieval augmented generation (RAG) and training/finetuning models. I believe you can use RAG with Ollama, however Ollama doesn't do the training of models.

hobofan2y ago

You can "use RAG" with Ollama, in the sense that you can put RAG chunks into a completion prompt.

To index documents for RAG, Ollama also offers an embedding endpoint where you can use LLM models to generate embeddings, however AFAIK that is very inefficient. You'd usually want to use a much smaller embedding model like JINA v2[0], which are currently not supported by Ollama[1].

[0]: https://huggingface.co/jinaai/jina-embeddings-v2-base-en

[1]: https://github.com/ollama/ollama/issues/327

CubsFan10602y ago

Maybe take a look at this? https://github.com/imartinez/privateGPT

It's meant to do exactly what you want. I've had mixed results.

porridgeraisin2y ago

Used ollama as part of a bash pipeline for a tiny throwaway app.

It blocks until there is something on the mic, then sends the wav to whisper.cpp, which then sends it to llama which picks out a structured "remind me" object from it, which gets saved to a text file.

awayto2y ago

I made something pretty similar over winter break so I could have something read books to me. ... Then it turned into a prompting mechanism of course! It uses Whisper, Ollama, and TTS from CoquiAI. It's written in shell and should hopefully be "Posix-compliant", but it does use zenity from Ubuntu; not sure how widely used zenity is.

https://github.com/jcmccormick/runtts

killermouse02y ago

Would you share that code? I'm not familiar with using the mic in Linux, but interested to do something similar!

nbbaier2y ago

I'd also be really interested in seeing this

deepsquirrelnet2y ago

I love the ollama project. Having a local llm running as a service makes sense to me. It works really well for my use.

I’ll give this Python library a try. I’ve been wanting to try some fine tuning with LLMs in the loop experiments.

palashkulsh2y ago

Noob question, and may be probably being asked at the wrong place. Is there any way to find out min system requirements for running ollama run commands with different models.

mike9782y ago

I have a 11th gen intel cpu with 64gb ram and I can run most of big models slowly... so it's partly what you can put up with.

mark_l_watson2y ago

On my 32G M2 Pro Mac, I can run up to about 30B models using 4 bit quantization. It is fast unless I am generating a lot of text. If I ask a 30B model to generate 5 pages of text it can take over 1 minute. Running smaller models like Mistral 7B is very fast.

Install Ollama from https://ollama.ai and experiment with it using the command line interface. I mostly use Ollama’s local API from Common Lisp or Racket - so simple to do.

EDIT: if you only have 8G RAM, try some of the 3B models. I suggest using at least 4 bit quantization.

hellsten2y ago

Check out this guide for some recommendations: https://www.hardware-corner.net/guides/computer-to-run-llama...

You can easily experiment with smaller models, for example, Mistral 7B or Phi-2 on M1/M2/M3 processors. With more memory, you can run larger models, and better memory bandwidth (M2 Ultra vs. M2 base model) means improved performance (tokens/second).

slawr18052y ago

They have a high level summary of ram requirements for the parameter size of each model and how much storage each model uses on their GitHub: https://github.com/ollama/ollama#model-library

nextlevelwizard2y ago

Rule of thumb I have used is to check the size and if it fits into your GPUs VRAM then it will run nicely.

I have not ran into a llama that won't run, but if it doesn't fit into my GPU you have to count seconds per token instead of tokens per second

wazoox2y ago

Llama2 7b and Mistral 7b run at about 8 tk/s on my Mac Pro, which is usable if you're not in a hurry.

palashkulsh2y ago

Thank you so much everyone, all the help was really needed and useful : )

explorigin2y ago

I run ollama on my steamdeck. It's a bit slow but can run most 7b models.

sqs2y ago

I posted about my awesome experiences using Ollama a few months ago: https://news.ycombinator.com/item?id=37662915. Ollama is definitely the easiest way to run LLMs locally, and that means it’s the best building block for applications that need to use inference. It’s like how Docker made it so any application can execute something kinda portably kinda safely on any machine. With Ollama, any application can run LLM inference on any machine.

Since that post, we shipped experimental support in our product for Ollama-based local inference. We had to write our own client in TypeScript but will probably be able to switch to this instead.

keyle2y ago

Could you maybe compare it to llama.cpp?

All it took for me to get going is `make` and I basically have it working locally as a console app.

coder5432y ago

Ollama is built around llama.cpp, but it automatically handles templating the chat requests to the format each model expects, and it automatically loads and unloads models on demand based on which model an API client is requesting. Ollama also handles downloading and caching models (including quantized models), so you just request them by name.

Recently, it got better (though maybe not perfect yet) at calculating how many layers of any model will fit onto the GPU, letting you get the best performance without a bunch of tedious trial and error.

Similar to Dockerfiles, ollama offers Modelfiles that you can use to tweak the existing library of models (the parameters and such), or import gguf files directly if you find a model that isn’t in the library.

Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.

Running “make” on llama.cpp is really only the first step. It’s not comparable.

palmfacehn2y ago

This is interesting. I wouldn't have given the project a deeper look without this information. The lander is ambiguous. My immediate takeaway was, "Here's yet another front end promising ease of use."

1 more reply

regularfry2y ago

For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.

It's annoying that it seems to have its own model cache, but I can live with that.

1 more reply

lolinder2y ago

For me the big deal with Ollama is the ease of instantly setting up a local inference API. I've got a beefy machine with a GPU downstairs, but Ollama allows me to easily use it from a Raspberry Pi on the main floor.

acd10j2y ago

In my experience award for easiest to run locally will go to llamafile models https://github.com/Mozilla-Ocho/llamafile.

sqs2y ago

Also one feature request - if the library (or another related library) could also transparently spin up a local Ollama instance if the user doesn’t have one already. “Transparent-on-demand-Ollama” or something.

chown2y ago

I have been working on something similar to that in Msty [1]. I haven’t announced the app anywhere (including my friends as I got a few things in pipeline that I want to get out first :)

[1]: https://msty.app

zenlikethat2y ago

That gets into process management which can get dicey, but I agree, a "daemonless" mode could be really interesting

donpdonp2y ago

I'd like to see a comparison to nitro https://github.com/janhq/nitro which has been fantastic for running a local LLM.

refulgentis2y ago

> Ollama is definitely the easiest way to run LLMs locally

Nitro outstripped them, 3 MB executable with OpenAI HTTP server and persistent model load

jmorgan2y ago

Persistent model loading will be possible with: https://github.com/ollama/ollama/pull/2146 – sorry it isn't yet! More to come on filesize and API improvements

akulbe2y ago

I just wanted to say thank you for being communicative and approachable and nice.

evantbyrne2y ago

Who cares about executable size when the models are measured in gigabytes lol. I would prefer a Go/Node/Python/etc server for a HTTP service even at 10x the size over some guy's bespoke c++ any day of the week. Also, measuring the size of an executable after zipping is a nonsense benchmark in of itself

refulgentis2y ago

Not some guy, agree on zip, disagree entirely with tone of the comment (what exactly separates ollama from those same exact hyperbolic descriptions?)

joaomdmoura2y ago

So cool! I have bene using Ollama for weeks now and I just love it! Easiest way to run local LLMs, we are actually embedding them into our product right now and super excited about it!

visarga2y ago

I am using ollama as LLM server + ollama-webui as chat app server. Great UI

nbbaier2y ago

What's the product?

Kostic2y ago

I used this half a year ago, love the UX but it was not possible to accelerate the workloads using an AMD GPU. How's the support for AMD GPUs under Ollama today?

mchiang2y ago

Hi, I'm one of the maintainers on Ollama. We are working on supporting ROCm in the official releases.

If you do build from source, it should work (Instructions below):

https://github.com/ollama/ollama/blob/main/docs/development....

The reason why it's not in released builds is because we are still testing ROCm.

accelbred2y ago

I'm using it on an AMD GPU with the clblast backend.

brucethemoose22y ago

Unfortunately "AMD" and "easy" are mutually exclusive right now.

You can be a linux/python dev and set up rocm.

Or you can run llama.cpp's very slow OpenCL backend, but with easy setup.

Or you can run MLC's very fast Vulkan backend, but with no model splitting and medium-hard setup.

jquaint2y ago

I'm a huge fan of Ollama. Really like how easy it makes local LLM + neovim https://github.com/David-Kunz/gen.nvim

imrehg2y ago

This should be nice to be easier to integrate with things like Vanna.ai, that was on HN recently.

There a bunch of methods need to be implemented to work, but then usual OpenAI buts can be switched out to anything else, e.g. see the code stub in https://vanna.ai/docs/bigquery-other-llm-vannadb.html

Looking forward to more remixes for other tools too.

hatmanstack2y ago

Why does this feel like an exercise in the high priesting of coding. Shouldn't a python library have everything necessary and work out of the box?

behnamoh2y ago

What I hate about ollama is that it makes server configuration a PITA. ollama relies on llama.cpp to run GGUF models but while llama.cpp can keep the model in memory using `mlock` (helpful to reduce inference times), ollama simply won't let you do that:

https://github.com/ollama/ollama/issues/1536

Not to mention, they hide all the server configs in favor of their own "sane defaults".

jmorgan2y ago

Sorry this isn't easier!

You can enable mlock manually in the /api/generate and /api/chat endpoints by specifying the "use_mlock" option:

{“options”: {“use_mlock”: true}}

Many other sever configurations are also available there: https://github.com/ollama/ollama/blob/main/docs/api.md#reque...

bestai2y ago

I think a faq with the answers of this kind of questions could be useful for users.

mfalcon2y ago

I love Ollama's simplicity to download and consume different models with its REST API. I've never used it in a "production" environment, anyone knows how Ollama performs? or is it better to move to something like Vllm for that?

hellsten2y ago

The performance will probably be similar as long as you remember to tune the settings listed here: https://github.com/ollama/ollama/blob/main/docs/api.md

Try to, for example, set 'num_gpu' to 99 and 'use_mlock' to true.

jerpint2y ago

They all probably already use elements of deep learning but are very likely trained in a supervised way to output structured data (I.e. actions)

tripleo12y ago

techn002y ago

Does Ollama support GBNF grammars?

visarga2y ago

No, but it does support json formatting

pamelafox2y ago

API wise, it looks very similar to the OpenAI python SDK but not quite the same. I was hoping I could swap out one client for another. Can anyone confirm they’re intentionally using an incompatible interface?

WiSaGaN2y ago

There is an issue for this: [1]. I think it's more of priority issue.

[1] https://github.com/ollama/ollama/issues/305

d4rkp4ttern2y ago

Same question here. Ollama is fantastic as it makes it very easy to run models locally, But if you already have a lot of code that processes OpenAI API responses (with retry, streaming, async, caching etc), it would be nice to be able to simply switch the API client to Ollama, without having to have a whole other branch of code that handles Ollama API responses. One way to do an easy switch is using the litellm library as a go-between but it’s not ideal.

For an OpenAI compatible API my current favorite method is to spin up models using oobabooga TGW. Your OpenAI API code then works seamlessly by simply switching out the api_base to the ooba endpoint. Regarding chat formatting, even ooba’s Mistral formatting has issues[1] so I am doing my own in Langroid using HuggingFace tokenizer.apply_chat_template [2]

[1] https://github.com/oobabooga/text-generation-webui/issues/53...

[2] https://github.com/langroid/langroid/blob/main/langroid/lang...

Related question - I assume ollama auto detects and applies the right chat formatting template for a model?

lhenault2y ago

I've built exactly this if you want to give it a try : https://github.com/lhenault/simpleAI

WhackyIdeas2y ago

This is going to make my current project a million times easier. Nice.

malux852y ago

I love ollama, the engine underneath is llama.cpp, and they have the first version of self-extend about to me merged into main, so with any luck it will be available in ollama soon too!

brucethemoose22y ago

A lot of the new models coming out are long context anyway. Check out Yi, InternLM and Mixtral.

Also, you really want to wait until flash attention is merged before using mega context with llama.cpp. The 8 bit KV cache would be ideal too.

dchuk2y ago

Is anyone using this as an api behind a multi user web application? Or does it need to be fed off of a message queue or something to basically keep it single threaded?

Havoc2y ago

What model format does ollama use? Or is one constrained to the handful of preselected models they list?

mchiang2y ago

You can import GGUF, PyTorch or safetensors models into Ollama. I'll caveat that there are current limitations to some model architectures

https://github.com/ollama/ollama/blob/main/docs/import.md

Havoc2y ago

Thanks!

awongh2y ago

Wow, I guess I wouldn’t have thought there would be GPU support. What’s the mechanism for this?

brucethemoose22y ago

Via llama.cpp's GPU support.

cranberryturkey2y ago

`ollama serve` exposes an api you can query with fetch. why the need for a library?

lobocinza2y ago

ollama feels like llama.cpp with extra undesired complexities. It feels like the former project is desperately trying to differentiate and monetize while the latter is where all the things that matter happens.

sjwhevvvvvsj2y ago

Literally wrote an Ollama wrapper class last week. Doh!

bearjaws2y ago

If you're using TypeScript I highly recommend modelfusion https://modelfusion.dev/guide/

It is far more robust, integrates with any LLM local or hosted, supports multi-modal, retries, structure parsing using zod and more.

kvz2y ago

This looks really nice but it’s good to point out that this project can use the Ollama HTTP API or any other API, but does not run models itself. So not a replacement to Ollama, but rather to the Ollama npm. Perhaps that was obvious because the post is about that, but I briefly thought this could run models too.

nextlevelwizard2y ago

What is the benefit?

Ollama already exposes REST API that you can query with whatever language (or you know, just using curl) - why do I want to use Python or JS?

JrProgrammer2y ago

What’s the benefit of abstracting something?

nextlevelwizard2y ago

There is a reason why "leftpad" is followed by "incident".

girvo2y ago

That one doesn’t have to write the glue code around your HTTP client library?

nextlevelwizard2y ago

Feels pretty bad to install dependency just so you can avoid making a HTTP request.

leansensei2y ago

There is also an Elixir library: https://overbring.com/blog/2024-01-14-ollamex-ollama-api-emb...

3Sophons2y ago

The Rust+Wasm stack provides a strong alternative to Python in AI inference.

* Lightweight. Total runtime size is 30MB as opposed 4GB for Python and 350MB for Ollama. * Fast. Full native speed on GPUs. * Portable. Single cross-platform binary on different CPUs, GPUs and OSes. * Secure. Sandboxed and isolated execution on untrusted devices. * Modern languages for inference apps. * Container-ready. Supported in Docker, containerd, Podman, and Kubernetes. * OpenAI compatible. Seamlessly integrate into the OpenAI tooling ecosystem.

Give it a try --- https://www.secondstate.io/articles/wasm-runtime-agi/

anhldbk2y ago

Interesting. But the gguf file for llama2 is 4.78 GB in size.

For ollama, llama2:7b is 3.8 GB. See: https://ollama.ai/library/llama2/tags. Still I see ollama requires less RAM to run llama 2

fillskills2y ago

Why would anyone downvote this? There is nothing against HN rules and the comment itself is adding new and relevant information.

coder5432y ago

From the HN Guidelines:

“Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.”

That user almost exclusively links to what appears to be their own product, which is self promotion. They also do it without clarifying their involvement, which could come across as astroturfing.

Self promotion sometimes (not all the time) is fine, but it should also be clearly stated as such. Doing it in a thread about a competing product is not ideal. If it came up naturally, that would be different from just interjecting a sales pitch.

I haven’t downvoted them, but I came close.

jdlyga2y ago

Thanks Ollama

1 more reply

rezonant2y ago

I wish JS libraries would stop using default exports. They are not ergonomic as soon as you want to export one more thing in your package, which includes types, so all but the most trivial package requires multiple exports.

Just use a sensibly named export, you were going to write a "how to use" code snippet for the top of your readme anyway.

Also means that all of the code snippets your users send you will be immediately sensible, even without them having to include their import statements (assuming they don't use "as" renaming, which only makes sense when there's conflicts anyway)

1 more reply

j / k navigate · click thread line to collapse

149 comments

rgbrgb2y ago

Either way, congrats on the release!

[0]: https://github.com/psugihara/FreeChat

SnowLprd2y ago

On the subject of installing Ollama, I found it to be a frustrating and user-hostile experience. I instead recommend the much more user-friendly LLM[0] by Simon Willison.

Among the problems with Ollama include:

* Ollama silently adds a login item with no way to opt out: <https://github.com/jmorganca/ollama/issues/162>

* Ollama spawns at least four processes, some persistently in the background: 1 x Ollama application, 1 x `ollama` server component, 2 x Ollama Helper

* Ollama provides no information at install time about what directories will be created or where models will be downloaded.

As I said above, I think LLM[0] by Simon Willison is an excellent and user-friendly alternative.

[0]: https://llm.datasette.io/

WhackyIdeas2y ago

b33j0r2y ago

We all tell ourselves it’s value-add, but come on, there’s always an element of “we’ll make ourselves a one-stop shop!”

(Heh. Two spaces between the verb and the args gave me a most perplexing error after copy/pasting a file path).

skwirl2y ago

jsjohnst2y ago

1 more reply

_ea1k2y ago

I partially agree. My only issue with them is that the documentation is a little more hidden than I'd like.

Their install is basically a tl;dr of an installer. That's great!

It'd be nice if it also pointed me directly to a readme with specific instructions on service management, config directories, storage directories, and where the history file is stored.

vegabook2y ago

nix-shell makes most of this go away, except the ollama files will still be in `~/.ollama` which you can delete at any time.

  nix-shell -p ollama

in two tmux windows, then

  ollama serve

in one and

  ollama run llama2

in the other.

Exit and all the users, processes etc, go away.

https://search.nixos.org/packages?channel=23.11&show=ollama&...

nulld3v2y ago

The Linux binary (pre-built or packaged by your distro) is just a CLI. The Mac binary instead also contains a desktop app.

tesla_frunk2y ago

Is this any different from

    brew install ollama

siquick2y ago

"User hostile experience" is complete hyperbole and disrespectful to the efforts of the maintainers of this excellent library.

gremlinunderway2y ago

It's not hyperbole when he listed multiple examples and issues which clearly highlight why he calls it that.

Really wish that we could all understand and empathize with frustration on software has nothing to do with the maintainers or devs unless directly targeted.

I am constantly pushing and begging for this feedback on my work, so to get this for free is a gift.

1 more reply

SnowLprd2y ago

Moreover, I already conveyed my understanding of and appreciation for the work open-source maintainers do, and I outright said above that I intend no disrespect.

1 more reply

refulgentis2y ago

2 more replies

config_yml2y ago

Indeed, I thought the user experience was great. Simple way to download, install and start: everything just worked.

fzysingularity2y ago

[1] https://github.com/simonw/llm

[2] https://docs.nos.run/docs/blog/serving-llms-on-a-budget.html...

[3] https://github.com/simonw/ospeak

jsjohnst2y ago

I agree that alternative is good, but if you want to try ollama without the user experience drawbacks, install via homebrew.

okasaki2y ago

There's also a docker container (that I can recommend): https://hub.docker.com/r/ollama/ollama

wrasee2y ago

The install on Linux is the same. You’re essentially encouraged to just

    curl https://ollama.ai/install.sh | sh

which is generally a terrible idea. Of course you can read the script but that misses the point in that that’s clearly not the intended behaviour.

As other commenters have said, it is convenient. Sure.

cqqxo4zV46cp2y ago

dinosaurdynasty2y ago

https://github.com/ollama/ollama/blob/main/docs/linux.md

They have manual install instructions if you are so inclined.

icyfox2y ago

Just for connecting to an existing service: https://github.com/ollama/ollama-python/blob/main/ollama/_cl...

thrdbndndn2y ago

For the client API it's pretty clear:

    from ollama import Client
    client = Client(host='http://localhost:11434')

But I don't quite get how the example in "Usage" can work:

    import ollama
    response = ollama.chat(model='llama2', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
    ])
    print(response['message']['content'])

Since there is no parameter for host and/or port.

jmorgan2y ago

Once you have a custom `client` you can use it in place of `ollama`. For example:

  client = Client(host='http://my.ollama.host:11434')
  response = client.chat(model='llama2', messages=[...])

1 more reply

Abishek_Muthian2y ago

I used Ollama docker image to integrate with Gait Analyzer[1], a self-hosted gait analysis tool; all I had to do was to set up the docker compose file,

I was able to get the setup done with a single script for the end user and I used langchain to interact with Ollama.

[1] https://github.com/abishekmuthian/gaitanalyzer

ivanfioravanti2y ago

I posted about the Python library few hours after release. Great experience. Easy, fast and works well.

jumperabg2y ago

How does this fine-tuning work? I can see that you are loading a train.jsonl file and the some instructions but is the output model generated or this is some kind of a new way of training the models?

jerpint2y ago

The gist is only to create the dataset not to fine tune

eurekin2y ago

What's your observations about finetunes - are they really useful for anything practical? :)

tinyhouse2y ago

Does olana support fine-tuning? I assume not. (Not asking about finetuned models that I know they support)

pknerd2y ago

can we use it on cloud or I gotta download it locally? it might not work on my MacBook 2015 with 8GB ram

LoganDark2y ago

Gist isn't an acronym, it's a word. (e.g. "get the gist of things")

filleokus2y ago

How low could the memory consumption of such a model be? We don't need to store who the first kaiser of Germany was, "just" enough to kinda map human speech onto available API's.

andy992y ago

There are "smaller" models, for example tinyllama 1.1B (tiny seems like an exaggeration). PHI2 is 2.7B parameters. I can't name a 500M parameter model but there is probably one.

nattaylor2y ago

https://pypi.org/project/languagemodels/ can load some small models but forming JSON-reliably seems to require a larger-ish model (or fine tuning)

Aside: I expect Apple will do exactly what you're proposing and that's why they're exposing more APIs for system apps

spaniard892772y ago

Not really. You can use small models for task like text classification etc (traditional nlp) and those run in pretty much anything. We're talking about BERT-like models like distillbert for example.

Experts in the field say that might change (somewhat) with mamba models, but I can't really say more.

oblio2y ago

Speaking of, I imagine Alexa, Siri, etc, should now be replaced by LLMs? Or where they already implemented using LLMs?

3abiton2y ago

Exactly this, I have not yet to run into a "small" model that is good enough (gpt-3) quality

reacharavindh2y ago

Not directly related to what Ollama aims to achieve. But, I’ll ask nevertheless.

Could Ollama make it easier and standardize the way to add documents to local LLMs?

I’m not talking about uploading one image or model and asking a question about it. I’m referring to pointing a repository of 1000 text files and asking LLMs questions based on their contents.

jerpint2y ago

For now RAG is the best “hack” to achieve this at very low cost since it doesn’t require any fine tuning

I’ve implemented a RAG library if you’re ever interested but they are a dime a dozen now :)

https://www.github.com/jerpint/buster

jampekka2y ago

Sounds like Retrieval Augmented Generation. This is the technique used by e.g. most customized chatbots.

emmanueloga_2y ago

I don’t know if Ollama can do this but https://gpt4all.io/ can.

reacharavindh2y ago

Basically, I want to do what this product does, but locally with a model running on Ollama. https://www.zenfetch.com/

rex1232y ago

In the meanwhile, we do have agreements in place with all of our AI providers to ensure none of our users information is used for training or any other purpose. Hope that helps!

reacharavindh2y ago

Hey. Congratulations on your product. I’m guessing it will be greatly useful for your target audience.

I don’t have a serious need that I’d think worth paying for. So, I’m probably not in your target. I wanted to do this for a personal use case.

1 more reply

camillomiller2y ago

NetOpWibby2y ago

Ooh, I want this too.

asterix_pano2y ago

Llama_index basically does that. You have even some tuto using Streamlit that creates a UI around it for you.

BeetleB2y ago

> I’m never going to upload some data to ChatGPT for them to train on.

If you use the API, they do not train on it.

(However, that doesn't mean they don't retain it for a while).

As others have said, RAG is probably the way to go - although I don't know how well RAG performs on local LLMs.

martin822y ago

Data is the new oil.

You can be 100% sure that OpenAI will do whatever they want whenever they want with any and every little bit of data that you upload to them.

With GPTs and their Embeddings endpoint, they encourage you to upload your own data en masse.

sciolist2y ago

hobofan2y ago

You can "use RAG" with Ollama, in the sense that you can put RAG chunks into a completion prompt.

[0]: https://huggingface.co/jinaai/jina-embeddings-v2-base-en

[1]: https://github.com/ollama/ollama/issues/327

CubsFan10602y ago

Maybe take a look at this? https://github.com/imartinez/privateGPT

It's meant to do exactly what you want. I've had mixed results.

porridgeraisin2y ago

Used ollama as part of a bash pipeline for a tiny throwaway app.

awayto2y ago

https://github.com/jcmccormick/runtts

killermouse02y ago

Would you share that code? I'm not familiar with using the mic in Linux, but interested to do something similar!

nbbaier2y ago

I'd also be really interested in seeing this

deepsquirrelnet2y ago

I love the ollama project. Having a local llm running as a service makes sense to me. It works really well for my use.

I’ll give this Python library a try. I’ve been wanting to try some fine tuning with LLMs in the loop experiments.

palashkulsh2y ago

Noob question, and may be probably being asked at the wrong place. Is there any way to find out min system requirements for running ollama run commands with different models.

mike9782y ago

I have a 11th gen intel cpu with 64gb ram and I can run most of big models slowly... so it's partly what you can put up with.

mark_l_watson2y ago

Install Ollama from https://ollama.ai and experiment with it using the command line interface. I mostly use Ollama’s local API from Common Lisp or Racket - so simple to do.

EDIT: if you only have 8G RAM, try some of the 3B models. I suggest using at least 4 bit quantization.

hellsten2y ago

Check out this guide for some recommendations: https://www.hardware-corner.net/guides/computer-to-run-llama...

slawr18052y ago

They have a high level summary of ram requirements for the parameter size of each model and how much storage each model uses on their GitHub: https://github.com/ollama/ollama#model-library

nextlevelwizard2y ago

Rule of thumb I have used is to check the size and if it fits into your GPUs VRAM then it will run nicely.

I have not ran into a llama that won't run, but if it doesn't fit into my GPU you have to count seconds per token instead of tokens per second

wazoox2y ago

Llama2 7b and Mistral 7b run at about 8 tk/s on my Mac Pro, which is usable if you're not in a hurry.

palashkulsh2y ago

Thank you so much everyone, all the help was really needed and useful : )

explorigin2y ago

I run ollama on my steamdeck. It's a bit slow but can run most 7b models.

sqs2y ago

Since that post, we shipped experimental support in our product for Ollama-based local inference. We had to write our own client in TypeScript but will probably be able to switch to this instead.

keyle2y ago

Could you maybe compare it to llama.cpp?

All it took for me to get going is `make` and I basically have it working locally as a console app.

coder5432y ago

Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.

Running “make” on llama.cpp is really only the first step. It’s not comparable.

palmfacehn2y ago

1 more reply

regularfry2y ago

For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.

It's annoying that it seems to have its own model cache, but I can live with that.

1 more reply

lolinder2y ago

acd10j2y ago

In my experience award for easiest to run locally will go to llamafile models https://github.com/Mozilla-Ocho/llamafile.

sqs2y ago

chown2y ago

I have been working on something similar to that in Msty [1]. I haven’t announced the app anywhere (including my friends as I got a few things in pipeline that I want to get out first :)

[1]: https://msty.app

zenlikethat2y ago

That gets into process management which can get dicey, but I agree, a "daemonless" mode could be really interesting

donpdonp2y ago

I'd like to see a comparison to nitro https://github.com/janhq/nitro which has been fantastic for running a local LLM.

refulgentis2y ago

> Ollama is definitely the easiest way to run LLMs locally

Nitro outstripped them, 3 MB executable with OpenAI HTTP server and persistent model load

jmorgan2y ago

Persistent model loading will be possible with: https://github.com/ollama/ollama/pull/2146 – sorry it isn't yet! More to come on filesize and API improvements

akulbe2y ago

I just wanted to say thank you for being communicative and approachable and nice.

evantbyrne2y ago

refulgentis2y ago

Not some guy, agree on zip, disagree entirely with tone of the comment (what exactly separates ollama from those same exact hyperbolic descriptions?)

joaomdmoura2y ago

So cool! I have bene using Ollama for weeks now and I just love it! Easiest way to run local LLMs, we are actually embedding them into our product right now and super excited about it!

visarga2y ago

I am using ollama as LLM server + ollama-webui as chat app server. Great UI

nbbaier2y ago

What's the product?

Kostic2y ago

I used this half a year ago, love the UX but it was not possible to accelerate the workloads using an AMD GPU. How's the support for AMD GPUs under Ollama today?

mchiang2y ago

Hi, I'm one of the maintainers on Ollama. We are working on supporting ROCm in the official releases.

If you do build from source, it should work (Instructions below):

https://github.com/ollama/ollama/blob/main/docs/development....

The reason why it's not in released builds is because we are still testing ROCm.

accelbred2y ago

I'm using it on an AMD GPU with the clblast backend.

brucethemoose22y ago

Unfortunately "AMD" and "easy" are mutually exclusive right now.

You can be a linux/python dev and set up rocm.

Or you can run llama.cpp's very slow OpenCL backend, but with easy setup.

Or you can run MLC's very fast Vulkan backend, but with no model splitting and medium-hard setup.

jquaint2y ago

I'm a huge fan of Ollama. Really like how easy it makes local LLM + neovim https://github.com/David-Kunz/gen.nvim

imrehg2y ago

This should be nice to be easier to integrate with things like Vanna.ai, that was on HN recently.

Looking forward to more remixes for other tools too.

hatmanstack2y ago

Why does this feel like an exercise in the high priesting of coding. Shouldn't a python library have everything necessary and work out of the box?

behnamoh2y ago

https://github.com/ollama/ollama/issues/1536

Not to mention, they hide all the server configs in favor of their own "sane defaults".

jmorgan2y ago

Sorry this isn't easier!

You can enable mlock manually in the /api/generate and /api/chat endpoints by specifying the "use_mlock" option:

{“options”: {“use_mlock”: true}}

Many other sever configurations are also available there: https://github.com/ollama/ollama/blob/main/docs/api.md#reque...

bestai2y ago

I think a faq with the answers of this kind of questions could be useful for users.

mfalcon2y ago

hellsten2y ago

The performance will probably be similar as long as you remember to tune the settings listed here: https://github.com/ollama/ollama/blob/main/docs/api.md

Try to, for example, set 'num_gpu' to 99 and 'use_mlock' to true.

jerpint2y ago

They all probably already use elements of deep learning but are very likely trained in a supervised way to output structured data (I.e. actions)

tripleo12y ago

techn002y ago

Does Ollama support GBNF grammars?

visarga2y ago

No, but it does support json formatting

pamelafox2y ago

WiSaGaN2y ago

There is an issue for this: [1]. I think it's more of priority issue.

[1] https://github.com/ollama/ollama/issues/305

d4rkp4ttern2y ago

[1] https://github.com/oobabooga/text-generation-webui/issues/53...

[2] https://github.com/langroid/langroid/blob/main/langroid/lang...

Related question - I assume ollama auto detects and applies the right chat formatting template for a model?

lhenault2y ago

I've built exactly this if you want to give it a try : https://github.com/lhenault/simpleAI

WhackyIdeas2y ago

This is going to make my current project a million times easier. Nice.

malux852y ago

I love ollama, the engine underneath is llama.cpp, and they have the first version of self-extend about to me merged into main, so with any luck it will be available in ollama soon too!

brucethemoose22y ago

A lot of the new models coming out are long context anyway. Check out Yi, InternLM and Mixtral.

Also, you really want to wait until flash attention is merged before using mega context with llama.cpp. The 8 bit KV cache would be ideal too.

dchuk2y ago

Is anyone using this as an api behind a multi user web application? Or does it need to be fed off of a message queue or something to basically keep it single threaded?

Havoc2y ago

What model format does ollama use? Or is one constrained to the handful of preselected models they list?

mchiang2y ago

You can import GGUF, PyTorch or safetensors models into Ollama. I'll caveat that there are current limitations to some model architectures

https://github.com/ollama/ollama/blob/main/docs/import.md

Havoc2y ago

Thanks!

awongh2y ago

Wow, I guess I wouldn’t have thought there would be GPU support. What’s the mechanism for this?

brucethemoose22y ago

Via llama.cpp's GPU support.

cranberryturkey2y ago

`ollama serve` exposes an api you can query with fetch. why the need for a library?

lobocinza2y ago

sjwhevvvvvsj2y ago

Literally wrote an Ollama wrapper class last week. Doh!

bearjaws2y ago

If you're using TypeScript I highly recommend modelfusion https://modelfusion.dev/guide/

It is far more robust, integrates with any LLM local or hosted, supports multi-modal, retries, structure parsing using zod and more.

kvz2y ago

nextlevelwizard2y ago

What is the benefit?

Ollama already exposes REST API that you can query with whatever language (or you know, just using curl) - why do I want to use Python or JS?

JrProgrammer2y ago

What’s the benefit of abstracting something?

nextlevelwizard2y ago

There is a reason why "leftpad" is followed by "incident".

girvo2y ago

That one doesn’t have to write the glue code around your HTTP client library?

nextlevelwizard2y ago

Feels pretty bad to install dependency just so you can avoid making a HTTP request.

leansensei2y ago

There is also an Elixir library: https://overbring.com/blog/2024-01-14-ollamex-ollama-api-emb...

3Sophons2y ago

The Rust+Wasm stack provides a strong alternative to Python in AI inference.

Give it a try --- https://www.secondstate.io/articles/wasm-runtime-agi/

anhldbk2y ago

Interesting. But the gguf file for llama2 is 4.78 GB in size.

For ollama, llama2:7b is 3.8 GB. See: https://ollama.ai/library/llama2/tags. Still I see ollama requires less RAM to run llama 2

fillskills2y ago

Why would anyone downvote this? There is nothing against HN rules and the comment itself is adding new and relevant information.

coder5432y ago

From the HN Guidelines:

“Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.”

That user almost exclusively links to what appears to be their own product, which is self promotion. They also do it without clarifying their involvement, which could come across as astroturfing.

I haven’t downvoted them, but I came close.

jdlyga2y ago

Thanks Ollama

1 more reply

rezonant2y ago

Just use a sensibly named export, you were going to write a "how to use" code snippet for the top of your readme anyway.

1 more reply

j / k navigate · click thread line to collapse