thegraphcourses.org • Generative AI for Work & Research
Last updated: May 28, 2025
Different AI tasks require different considerations:
📊 Bulk Operations (e.g. data cleaning): Price per use matters a lot
💬 Chatbots: Fast response time is important
🔍 Code and Data Analysis: Accuracy is non-negotiable
📑 Document Understanding: Ability to process long files is essential
This guide teaches you to evaluate the best models for your specific use case.
Check public leaderboards to shortlist candidates:
Best practice: Run your own evaluations with your specific use cases.
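A minimal sketch of what "running your own evaluations" can look like, assuming you can wrap each candidate model in a function that takes a prompt and returns text. The test cases, the `stub_strong` / `stub_weak` functions, and the substring-match scoring are all illustrative placeholders, not a standard benchmark; in practice you would swap the stubs for real API calls and use prompts from your own work.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical test cases: (prompt, text the answer must contain).
TEST_CASES: List[Tuple[str, str]] = [
    ("What is 2 + 2? Answer with a number.", "4"),
    ("What is the capital of France?", "Paris"),
]


def score_model(ask: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    if not cases:
        return 0.0
    passed = sum(
        1 for prompt, expected in cases if expected.lower() in ask(prompt).lower()
    )
    return passed / len(cases)


# Stand-in "models" so the sketch runs offline; replace with real API calls.
def stub_strong(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "The capital is Paris."


def stub_weak(prompt: str) -> str:
    return "I am not sure."


def rank_models(
    models: Dict[str, Callable[[str], str]], cases: List[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Score every model and sort from best to worst."""
    scores = {name: score_model(fn, cases) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ranking = rank_models({"strong": stub_strong, "weak": stub_weak}, TEST_CASES)
    for name, score in ranking:
        print(f"{name}: {score:.0%}")
```

Substring matching is the simplest possible check; tasks with open-ended answers usually need a rubric or human review instead.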
Pricing typically listed per million tokens
Token = chunk of text (see platform.openai.com/tokenizer)
100 tokens ≈ 75 words
Input and output tokens priced separately
Strategy: Test cheap models first before moving to premium options
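The pricing rules above can be turned into a quick back-of-the-envelope estimate. This sketch uses the 100 tokens ≈ 75 words rule of thumb and separate input and output prices; the prices and workload in the example are made up for illustration and do not correspond to any real model.

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token count from the 100 tokens ~ 75 words rule of thumb."""
    return round(word_count * 100 / 75)


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost in dollars, with input and output tokens priced separately."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical bulk job: 1,000 documents, 750 words in and 150 words out each.
    n_docs = 1_000
    tokens_in = n_docs * estimate_tokens(750)
    tokens_out = n_docs * estimate_tokens(150)
    # Hypothetical prices: $0.50 per million input tokens, $2.00 per million output.
    cost = estimate_cost(tokens_in, tokens_out, 0.50, 2.00)
    print(f"Estimated cost: ${cost:.2f}")
```

For exact counts, use the provider's tokenizer rather than the word-based approximation.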
Cheap, smaller models are usually faster but less intelligent
For chatbots & interactive apps: Consider fast models for better UX
For async processes: Speed is less critical, focus on quality
Modern models have huge limits (1M tokens ~ entire Bible), but fitting in context ≠ good responses.
Applications can use RAG to surpass context limits effectively.
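To illustrate the idea behind RAG (retrieval-augmented generation): instead of sending a whole document, the application selects only the most relevant chunks and puts those in the prompt. The sketch below ranks chunks by simple word overlap with the question, which is an assumption made to keep the example self-contained; real systems typically use embedding-based search. All function names here are hypothetical.

```python
import re
from typing import List, Set


def split_into_chunks(text: str, words_per_chunk: int = 50) -> List[str]:
    """Split a long text into consecutive chunks of a fixed number of words."""
    words = text.split()
    return [
        " ".join(words[i : i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def word_set(text: str) -> Set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k chunks sharing the most words with the query."""
    query_words = word_set(query)
    ranked = sorted(
        chunks, key=lambda chunk: len(query_words & word_set(chunk)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(query: str, chunks: List[str], top_k: int = 2) -> str:
    """Assemble a prompt containing only the retrieved chunks."""
    context = "\n---\n".join(retrieve(query, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    chunks = [
        "Refunds are processed within 14 days.",
        "Our office is in Nairobi.",
        "Shipping takes 3 days.",
    ]
    print(build_prompt("How long do refunds take?", chunks, top_k=1))
```

Only the retrieved chunks count against the context limit, so the source collection can be far larger than what the model could read at once.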
Open-weight models:
• Often cheaper to run
• Can deploy on your infrastructure
• More control and customization
Closed models: Typically more powerful but pricier
Provider | OpenAI | OpenAI | Anthropic Claude | Anthropic Claude | Google Gemini | Google Gemini | Alibaba Qwen | Alibaba Qwen |
---|---|---|---|---|---|---|---|---|
Tier | ⚡️ FAST, CHEAP | 👨🏻‍🏫 GENIUS | ⚡️ FAST, CHEAP | 👨🏻‍🏫 GENIUS | ⚡️ FAST, CHEAP | 👨🏻‍🏫 GENIUS | ⚡️ FAST, CHEAP | 👨🏻‍🏫 GENIUS |
Model Name | gpt 4.1 mini | o4 mini high | 3.5 Haiku | 4 Sonnet | 2.5 Flash | 2.5 Pro | Qwen3 30B A3B | Qwen3 235B A22B thinking |
Performance (MMLU) | | | | | | | | |
Price ($/M tokens, 3:1 in/out) | | | | | | | | |
Context Limit (M tokens) | | | | | | | | |
Open / Closed Weights | Closed | Closed | Closed | Closed | Closed | Closed | Open | Open |
Links | Analysis, OpenRouter | Analysis, OpenRouter | Analysis, OpenRouter | Analysis, OpenRouter | Analysis, OpenRouter | Analysis, OpenRouter | Analysis, OpenRouter | Analysis, OpenRouter |
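The price row in the table is labeled as a 3/1 in/out figure. Assuming this means a single blended price that weights input tokens three times as heavily as output tokens (a convention some leaderboards use), it can be computed as below. The example prices are hypothetical and not taken from any model in the table.

```python
def blended_price(
    input_price_per_m: float,
    output_price_per_m: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted average of input and output prices, in $ per million tokens."""
    total_weight = input_weight + output_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (
        input_weight * input_price_per_m + output_weight * output_price_per_m
    ) / total_weight


if __name__ == "__main__":
    # Hypothetical model: $1.00 per million input tokens, $4.00 per million output.
    print(f"Blended price: ${blended_price(1.00, 4.00):.2f} per million tokens")
```

If your workload produces much more output than input (for example, long-form generation), adjust the weights to match before comparing models.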
Begin with models in the main dropdowns of public-facing UIs from major providers (OpenAI, Anthropic, Google).
Why: With 400+ models on platforms like OpenRouter, starting with the flagship models helps narrow your choices.
Different providers' offerings are quite similar these days, so consider picking the company whose apps you already use.
Note: If you use Microsoft Copilot, it likely runs on OpenAI's models under the hood.
As of writing, Google has some of the best-performing models on benchmarks:
• Gemini 2.5 Pro: Top performance
• Gemini 2.5 Flash: One of the cheapest good models
Great starting point if you have no other preferences.
While not always top on benchmarks, it excels at:
• Programming and design tasks
• Smart responses without being too chatty
• Often lower actual usage costs than 2.5 Pro due to efficiency
Note: 2.5 Pro is a thinking model, which can be verbose
Important distinction: These recommendations are for API usage.
For chat interfaces, we actually pay for ChatGPT because it offers:
• Superior user interface
• Deep research capabilities
• Image generation & code execution
• Easy-to-use voice mode