Strategic Insight: Multimodal AI Orchestration for Healthcare Reasoning: Challenges, Costs, and Sustainability
What is Microsoft's MAI-DxO, and what are the implications for healthcare organizations?
9th July 2025
Abstract
Healthcare is a complex domain that demands trustworthy, multi-step reasoning and integration of diverse data (text, images, labs, etc.). Recent approaches combine multimodal inputs, chain-of-thought reasoning, and planner/orchestrator architectures using multiple, diversely trained large language models (LLMs) to overcome the limitations of a single-LLM approach in clinical decision support.
This white paper discusses the challenges in medical reasoning addressed by these advanced AI techniques, comparing a traditional retrieval-augmented single-model query to a heavy multi-modal/multi-model orchestrated approach.
These approaches, however, carry a heavy compute cost that must be weighed. We illustrate the cost per query (in tokens, dollars, and compute/energy) for each method and extrapolate the impact for a typical health system with 3,000 clinicians under various usage scenarios (low, mid, high).
Using Microsoft’s MAI-DxO (Medical AI Diagnostic Orchestrator) as a case study, we calculate how many times more expensive (in token and compute costs) the orchestrated approach can be and the annual energy consumption equivalent in terms of household electricity usage.
Finally, we summarize mitigation strategies – such as intelligent model orchestration and resource optimization – to minimize costs and energy footprint while harnessing advanced AI for healthcare.
Introduction
Modern healthcare decisions often require synthesizing multiple sources of information and expert reasoning steps. A single physician may review clinical notes, lab results, and medical images, and then iteratively formulate hypotheses, order tests, and refine the diagnosis.
Emulating this process is a grand challenge for AI systems. General-purpose LLMs (such as GPT-4 and PaLM) have demonstrated impressive conversational abilities, but single-model systems often fall short in high-stakes medical settings that require precision, transparency, and multimodal understanding. Even small hallucinations or missed details can compromise patient safety. Moreover, standard LLM chatbots lack access to patient-specific data and struggle with multi-step problem solving, making their responses generic and potentially outdated.
To address these gaps, researchers are exploring advanced architectures that go beyond a lone chatbot. Three key innovations have emerged:
Multimodal Models: Incorporating text, imaging, lab, genomics, and other data into AI reasoning. Healthcare cases often span multiple modalities and sequential steps, so models must handle diverse data (e.g., X-rays + clinical text) and process information in stages.
Chain-of-Thought Reasoning: Guiding models to reason through problems step-by-step (similar to a clinician’s thought process). Chain-of-thought (CoT) prompting has been shown to enhance LLM accuracy on complex medical tasks by breaking problems into logical steps. This not only improves problem-solving but also makes the reasoning process more transparent and aligned with clinical thinking, helping bridge the gap between AI’s opaque decisions and the clear, accountable reasoning that healthcare requires.
Planner/Orchestrator Architectures: Using a central orchestrator (planner) to coordinate multiple specialized AI agents or models, each expert in a domain (radiology, pathology, etc.), in a multi-agent system. This mirrors the collaborative approach of a medical team: the orchestrator delegates tasks to domain-specific models and integrates their findings. Such architectures ensure that each sub-task (e.g. interpreting an MRI, fetching patient history) is handled by the most appropriate model, and that intermediate results are checked and combined coherently. The orchestrator adds oversight, enforcing systematic decision-making steps and cost awareness, which is critical for safety and efficiency in clinical workflows.
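To make the delegation pattern concrete, the sketch below shows a minimal planner/orchestrator loop. It is purely illustrative, not the architecture of any specific product: the agent registry, the plan format, and the `call_model` helper are all assumptions.

```python
# Minimal planner/orchestrator sketch. The agent registry, plan format,
# and call_model helper are illustrative assumptions, not a real product API.
from typing import Callable

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for an LLM API call via whatever SDK an implementation uses."""
    raise NotImplementedError

# Registry of domain-specific agents the orchestrator can delegate to.
AGENTS: dict[str, Callable[[str], str]] = {
    "radiology": lambda task: call_model("radiology-model", task),
    "pathology": lambda task: call_model("pathology-model", task),
    "history":   lambda task: call_model("ehr-summarizer", task),
}

def orchestrate(case_description: str) -> str:
    # 1. Planner decomposes the case into domain-specific sub-tasks.
    plan = call_model("planner",
                      f"Split into radiology/pathology/history tasks, one per line "
                      f"as 'domain: task':\n{case_description}")
    findings = []
    for line in plan.splitlines():
        domain, _, task = line.partition(":")
        agent = AGENTS.get(domain.strip().lower())
        if agent:  # 2. Delegate each sub-task to the matching specialist.
            findings.append(f"{domain}: {agent(task.strip())}")
    # 3. Orchestrator checks and synthesizes intermediate results into one answer.
    return call_model("planner", "Synthesize and verify these findings:\n" + "\n".join(findings))
```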
These approaches promise to overcome the limitations of single-model AI in healthcare. In the following sections, we discuss how they address key challenges in clinical reasoning and analyze their implications for cost and energy consumption at scale.
Challenges in Healthcare Reasoning and How Advanced AI Helps
1. Need for Multi-Step, Logical Reasoning: Medical diagnosis is an iterative reasoning process, not a one-shot Q&A. Traditional LLM chatbots lack robust multi-step problem-solving – they tend to provide an answer in one step and often cannot explain or revisit their reasoning. This is particularly risky in healthcare, where reasoning steps and rationale must be clearly articulated. Chain-of-thought reasoning (CoT) addresses this by forcing the model to think aloud, generating intermediate conclusions and justifications.
In medicine, CoT prompting makes AI’s decision process more human-like and transparent, which builds clinician trust. It enables handling of complex, context-sensitive questions by structuring the logic (e.g. considering differential diagnoses one by one, asking follow-up questions) rather than guessing an answer.
For example, a CoT-enabled system presented with a patient’s symptoms can outline its reasoning: possible conditions, what additional info it needs, and why it concludes a certain diagnosis. This sequential approach mitigates the “black box” issue and aligns AI decisions with medical standards of reasoning and ethics.
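As an illustration, a CoT prompt can be as simple as instructing the model to enumerate differentials and information needs before concluding. The template below is a hypothetical sketch, not a validated clinical prompt:

```python
# Hypothetical chain-of-thought prompt template for a diagnostic question.
# The structure (differentials -> missing info -> conclusion) mirrors the text above.
COT_TEMPLATE = """You are assisting with a diagnostic case.
Patient presentation:
{presentation}

Reason step by step before answering:
1. List the plausible differential diagnoses.
2. For each, state what additional information or test would confirm or rule it out.
3. Only then state the most likely diagnosis and your rationale.
"""

prompt = COT_TEMPLATE.format(
    presentation="54-year-old with exertional chest pain, normal ECG, elevated D-dimer."
)
# `prompt` would then be sent to the model; the stepwise output is what makes
# the reasoning inspectable by a clinician.
```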
2. Multimodal Data Integration: A significant challenge is that patient data comes in many forms: text (history, notes), images (X-rays, MRIs), waveforms (EKG), lab tables, genomics reports, etc. A single text-only model inherently loses out on information contained in images or other formats. Historically, AI models have been narrow (an algorithm might excel at reading X-rays or at answering text questions, but not both). However, multimodal AI is now enabling combined understanding, for instance, multimodal LLMs that can analyze an image and incorporate its findings into a textual reasoning chain.
In practice, this could enable an AI system to, for instance, interpret a chest X-ray and detect a nodule, then factor that into the patient’s narrative to inform subsequent decisions. The Healthcare Agent Orchestrator demonstrated by Microsoft embodies this by uniting specialized models: e.g. a radiology model (MedImageParse) for imaging, a report generation model (CXRReportGen) for summarizing findings, and others for genomic or clinical data.
The orchestrator ensures all these modalities are considered together, replicating the way a tumor board combines radiology, pathology, and genetics inputs for a holistic decision. By enabling cross-modal reasoning, these systems address a core limitation of text-only LLMs and reflect real-world diagnostic practices that require seeing the whole patient picture.
3. Collaboration and Expert Diversity: In real healthcare, difficult cases are solved by collaboration – multiple specialists deliberating (e.g. a cardiologist, radiologist, and geneticist each contribute insights). Most AI models to date have been single-agent, which “fails to reflect the real-world teamwork that defines healthcare practice”.
This is where planner/orchestrator architectures shine. An orchestrator can coordinate a set of AI agents with diverse expertise, akin to a panel of virtual clinicians working together. Microsoft’s MAI-DxO is a prime example: it queries several leading models (OpenAI’s GPT-4, Google’s Gemini, Anthropic’s Claude, Meta’s Llama, xAI’s Grok, etc.) in a chain-of-debate style, so that multiple “opinions” or diagnostic hypotheses are considered. This approach emulates a team of doctors brainstorming: the AI agents can propose different diagnoses, question each other’s suggestions, and the orchestrator synthesizes the results.
In tests on complex cases, this multi-agent orchestration solved cases that individual models missed, achieving a correct diagnosis rate ~4× higher than experienced human physicians on the same challenges.
Just as importantly, the orchestrator architecture improved decision quality and safety: it enforced that each step (asking about symptoms, ordering tests, etc.) was done systematically and that results were verified before finalizing a diagnosis. This reduces the risk of errors or oversights that a single-pass model might make. The orchestrator could even be configured to avoid costly or unnecessary tests, showing a capability to optimize resource use in diagnosis.
4. Transparency and Trust: In high-stakes domains like healthcare, explainability is crucial. Clinicians need to know why the AI made a recommendation before acting on it. Black-box answers are rarely acceptable in a clinical workflow. The combination of CoT reasoning and orchestrated multi-agent debate inherently produces a traceable line of reasoning.
Each step, query, and intermediate conclusion can be logged and reviewed, offering traceability that a single end-to-end model lacks. For instance, MAI-DxO doesn’t just output “Diagnosis: X”; it shows its work, listing the questions it asked, the test results obtained, and the rationale leading to X. This not only helps in validating the AI’s conclusion but also aligns with how physicians are trained to think out loud and document their reasoning. By making the AI’s thought process inspectable, these advanced systems build user confidence and enable oversight, a critical attribute for any clinical AI tool.
5. Continual Learning and Adaptability: Medical knowledge evolves rapidly, and patient data is dynamic. Single static models often lack up-to-date information and access to patient-specific context. A chained or orchestrated approach can integrate external knowledge sources and tools on the fly. For example, an orchestrator might call a search tool or a database lookup when the AI’s internal knowledge is insufficient, ensuring the answer incorporates the latest medical literature or patient record.
This on-demand retrieval, a form of RAG (Retrieval Augmented Generation), helps mitigate hallucinations and outdated answers. It essentially allows the system to learn or update in real-time during the reasoning chain, which a single model wouldn’t do once deployed. This adaptability is key to staying current with medical guidelines and tailoring responses to each patient’s specific data.
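A minimal sketch of this pattern might look as follows; the `search_guidelines` and `call_model` helpers are hypothetical stand-ins for whatever retrieval tool and model endpoint an implementation actually uses:

```python
# Sketch of on-demand retrieval (RAG) inside a reasoning chain.
# search_guidelines and call_model are hypothetical placeholders.
def search_guidelines(query: str) -> str:
    """Placeholder: query a guideline database or literature search tool."""
    raise NotImplementedError

def call_model(prompt: str) -> str:
    raise NotImplementedError

def answer_with_rag(question: str, patient_record: str) -> str:
    # Retrieve current external knowledge instead of relying on frozen weights.
    evidence = search_guidelines(question)
    prompt = (
        f"Patient record:\n{patient_record}\n\n"
        f"Current guidelines:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Answer using the guidelines above and cite which passage you relied on."
    )
    return call_model(prompt)
```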
In summary, the move toward multimodal, multi-step, orchestrated AI directly addresses the limitations that have hindered AI adoption in healthcare: it improves reasoning depth, handles all relevant data types, mirrors collaborative workflows, and provides needed transparency. The next sections examine the trade-offs of these sophisticated approaches – in particular, the costs in terms of computation (tokens and processing) – and how those trade-offs scale when deploying such systems across a health network.
Cost Implications: Traditional RAG vs. Orchestrated Approach
Advanced AI reasoning in healthcare brings clear benefits, but it also raises practical concerns about computational cost. A multi-step, multi-model orchestrated query will inherently use more processing power (and thus more tokens and energy) than a straightforward single-model query.
In this section, we illustrate the cost per query for a traditional RAG (retrieval-augmented generation) single-model approach versus a heavily orchestrated multi-modal/multi-model approach, and then extrapolate the costs for a large healthcare practice.
Assumptions and Definitions:
Cost per token: We assume an average cost of roughly $0.02 per 1,000 tokens as a representative figure across various AI providers (OpenAI, Google, Microsoft, etc.) in 2025. In reality, costs vary (e.g. GPT-4 Turbo is about $0.01 per 1K input tokens and $0.03 per 1K output tokens, while simpler models can be <$0.005 per 1K). For this analysis, $0.02/1K is a fair middle-ground reflecting a mix of models.
Energy per query: Based on recent estimates, a typical single LLM query (with a few hundred tokens response) consumes on the order of 0.3 Wh (watt-hours) of energy. More complex “reasoning” queries using larger models or more tokens might consume up to an order of magnitude more (older estimates put heavy queries around ~3 Wh each). We will use 0.3 Wh as a baseline for a simple query and 3.0 Wh for an orchestrated multi-agent query for illustration (about 10× the energy, corresponding to significantly more tokens and computation per query).
Clinical practice size: We consider a health system with 3,000 clinicians. We examine three annual usage levels per clinician:
Low usage: 200 AI queries per year per clinician (roughly 1 query per workday).
Mid usage: 500 queries per year (roughly 10 per week, or ~2 per working day).
High usage: 1,000 queries per year (about 4 per working day, or ~20 per week).
These numbers represent a sensitivity analysis from occasional use to fairly regular use of the AI assistant. Across 3,000 clinicians, these scenarios equate to 600,000, 1.5 million, and 3 million total queries per year, respectively.
Cost per Single-Model Query (RAG): In a traditional RAG setup, the clinician’s question is augmented with retrieved context (e.g. relevant guidelines or patient history), and a single LLM (e.g. a medical GPT-4 instance) produces an answer. Suppose the prompt + retrieved text + answer together are ~1,000 tokens. At ~$0.02 per 1K tokens, that’s about $0.02 per query. Energy-wise, such a query might use ~0.3 Wh. This is quite small – for perspective, 0.3 Wh is less electricity than an LED lightbulb uses in a few minutes. Even heavy chat users would add only a minor amount to their overall electricity footprint with this kind of usage.
Cost per Orchestrated Multi-Model Query: In an orchestrated approach (like MAI-DxO or a multi-agent tumor board AI), a single user query triggers multiple LLM invocations and possibly iterative reasoning steps. For example, Microsoft’s MAI-DxO might query 5 different frontier models in its panel for each case, and do a chain-of-debate or verification cycle among them. This could involve several thousand tokens of total processing (including each model’s prompt and response, orchestrator instructions, etc.).
Let’s estimate an orchestrated session might consume ~5,000 tokens of LLM processing in total (which is plausible if, say, 5 models each process ~1,000 tokens on average). At $0.02 per 1K, that’s $0.10 per orchestrated query. In some cases the token usage could be higher – for instance, if the orchestrator runs multiple rounds of debate or tool use, token counts could swell further (some chain-of-thought dialogues easily exceed 10,000 tokens).
Energy consumption could similarly multiply. Empirical research shows that “reasoning” LLM models (which think step-by-step) used on average 543 tokens per question vs. only ~38 tokens for direct-answer models – roughly 14× more tokens, hence more compute. The same study found a 70B-parameter reasoning model answering 600k questions could emit as much CO₂ as a transatlantic flight. While our orchestrated scenario differs, it underscores that multi-step reasoning vastly increases compute load. Using our simpler assumption of ~10× more energy, an orchestrated query might consume around 3 Wh of electricity. This is still modest in absolute terms (3 Wh runs a kitchen oven for only a few seconds), but at scale the difference is significant.
Per-Query Cost Comparison: In summary, one complex orchestrated query could easily cost an order of magnitude more (in both dollars and energy) than a single-model query. For instance, if a RAG query costs $0.02 and 0.3 Wh, the orchestrated query might be ~$0.10 (or more) and 3 Wh for a rich, multi-agent diagnostic reasoning session. The cost multiplier here is ~5× if 5 models are called in parallel, or up to ~10–15× if the chain-of-thought involves multiple rounds. In the context of MAI-DxO, which explicitly “queries several leading AI models” per case, one can expect on the order of 5–10× the token usage (and thus cost) per case compared to using a single model. This is the price of orchestrating a virtual panel of experts instead of a single “all-knowing” model.
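The arithmetic behind this comparison is simple enough to verify directly. The sketch below encodes the assumptions stated earlier (blended $0.02 per 1K tokens, ~1K vs ~5K tokens, 0.3 vs 3 Wh):

```python
# Per-query cost/energy comparison under the assumptions stated above.
PRICE_PER_1K_TOKENS = 0.02   # USD, blended assumption across providers

rag_tokens, rag_wh = 1_000, 0.3      # single-model RAG query
orch_tokens, orch_wh = 5_000, 3.0    # orchestrated query (~5 models x ~1K tokens each)

rag_cost = rag_tokens / 1_000 * PRICE_PER_1K_TOKENS    # -> $0.02
orch_cost = orch_tokens / 1_000 * PRICE_PER_1K_TOKENS  # -> $0.10

print(f"RAG:          ${rag_cost:.2f}, {rag_wh} Wh")
print(f"Orchestrated: ${orch_cost:.2f}, {orch_wh} Wh")
print(f"Cost multiplier: {orch_cost / rag_cost:.0f}x, "
      f"energy multiplier: {orch_wh / rag_wh:.0f}x")
# -> 5x cost, 10x energy under these assumptions
```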
Annual Cost for a Health System: Extrapolating to a typical 3,000-clinician health system, we can estimate annual usage costs:
Low scenario (600,000 queries/year):
Single-model RAG approach: 600k * $0.02 ≈ $12,000 per year.
Orchestrated approach: 600k * $0.10 ≈ $60,000 per year.
The orchestrated system might cost on the order of $50k more per year in this low-usage scenario. Energy-wise, 600k single-model queries at 0.3 Wh each use about 180 kWh; orchestrated at 3 Wh each uses about 1,800 kWh annually.
Mid scenario (1,500,000 queries/year):
Single-model: 1.5M * $0.02 = $30,000/year.
Orchestrated: 1.5M * $0.10 = $150,000/year.
Difference ~$120k. Energy: single ~450 kWh vs orchestrated ~4,500 kWh.
High scenario (3,000,000 queries/year):
Single-model: 3M * $0.02 = $60,000/year.
Orchestrated: 3M * $0.10 = $300,000/year.
Difference ~$240k. Energy: single ~900 kWh vs orchestrated ~9,000 kWh.
These estimates indicate that for a large health system, an orchestrated AI assistant could incur hundreds of thousands of dollars in additional annual API/compute costs compared to a simpler approach, if used widely. The exact multiplier depends on how heavy the orchestration is – in our example it is a 5× cost jump (assuming five models consulted per query). If the orchestrator engages in iterative back-and-forth reasoning, the cost could be higher (10× or more).
For context, Microsoft’s MAI-DxO, by coordinating multiple models and verifying results, delivered superior accuracy and even reduced medical test spending by ~20%, which might justify the compute cost in a cost-benefit analysis of patient outcomes. However, from an IT budgeting and sustainability perspective, these costs and the energy footprint are non-trivial when scaled to millions of queries.
Energy Impact in Terms of Homes: To ground these energy numbers, consider that an average U.S. household uses about 10,500 kWh of electricity per year. In the high scenario above (3M orchestrated queries/year ≈ 9,000 kWh), the AI system’s annual inference energy would be roughly 85% of what one household uses in a year. In mid usage (4,500 kWh), it’s around 0.43 house-year, and low (1,800 kWh) about 0.17 house-year. In other words, the power draw of running millions of medical AI queries starts to become comparable to the electricity consumption of an American home (though still much less than, say, hospital medical equipment or HVAC systems consume across a whole institution).
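The scenario figures above can be reproduced with a few lines of arithmetic; the per-query costs, the energy figures, and the 10,500 kWh household benchmark are the assumptions already stated in this paper:

```python
# Annual extrapolation for a 3,000-clinician system, reproducing the figures above.
CLINICIANS = 3_000
HOUSEHOLD_KWH_PER_YEAR = 10_500  # average U.S. household (Epoch AI figure)

scenarios = {"low": 200, "mid": 500, "high": 1_000}  # queries/clinician/year

for name, per_clinician in scenarios.items():
    queries = CLINICIANS * per_clinician
    rag_usd, orch_usd = queries * 0.02, queries * 0.10                 # $/query assumptions
    rag_kwh, orch_kwh = queries * 0.3 / 1_000, queries * 3.0 / 1_000   # Wh -> kWh
    house_frac = orch_kwh / HOUSEHOLD_KWH_PER_YEAR
    print(f"{name:>4}: {queries:,} queries | RAG ${rag_usd:,.0f} / {rag_kwh:,.0f} kWh"
          f" | orch ${orch_usd:,.0f} / {orch_kwh:,.0f} kWh ({house_frac:.0%} of a home)")
# high: 3,000,000 queries | RAG $60,000 / 900 kWh | orch $300,000 / 9,000 kWh (86% of a home)
```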
If our assumptions are slightly conservative and the orchestrated queries are more energy-intensive, it’s conceivable that a fully scaled AI assistant for a large provider could use the equivalent of several households’ worth of electricity each year. This underscores the importance of efficiency: as AI adoption grows, the cumulative power usage can make a measurable dent in a health system’s energy profile and carbon footprint.
Mitigating Cost and Energy Concerns
The analysis above highlights a trade-off: the most sophisticated AI reasoning yields better clinical performance but at higher computational cost. To ensure sustainability and affordability, organizations and AI developers can employ several mitigation strategies:
Intelligent Routing of Queries: Not every user query needs the full powerhouse of an orchestrated multi-model debate. Many routine questions (e.g. drug dosage lookup, simple symptom triage) could be handled by a smaller model or a single-step answer. A smart system can triage queries by complexity – using a lightweight model for simple cases and invoking the heavy orchestrator only for complex, ambiguous cases.
This is akin to not ordering an MRI for a chest cough. Research suggests that a smaller LLM can achieve similar accuracy on easier questions with a fraction of the energy cost of a large reasoning model. By finding “the right model for the job” each time, one can significantly cut down average token usage and thus cost/energy. For example, an AI service might first run a quick classifier: “Is this query straightforward or complex?” If straightforward, use a single specialized model (or even a knowledge base lookup); if complex, engage the orchestrator. This adaptive orchestration ensures high-power resources are reserved for when they’re truly needed.
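A minimal sketch of such a triage router is shown below; the classifier prompt, the model names, and the SIMPLE/COMPLEX rule are illustrative assumptions rather than a tested design:

```python
# Sketch of complexity-based query triage. The classifier prompt and model
# names are assumptions; in practice a tuned lightweight classifier is used.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # placeholder LLM call

def run_orchestrator(query: str) -> str:
    raise NotImplementedError  # the heavyweight multi-model path described earlier

def answer(query: str) -> str:
    verdict = call_model(
        "small-classifier",
        "Is this clinical question simple (lookup/single-step) or complex "
        f"(ambiguous, multi-step)? Answer SIMPLE or COMPLEX.\n\n{query}",
    )
    if "SIMPLE" in verdict.upper():
        return call_model("small-specialist", query)  # cheap single-step path
    return run_orchestrator(query)                    # full multi-model panel
```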
Token Optimization and Prompt Efficiency: Multi-step reasoning should be done as efficiently as possible. This includes engineering prompts to be concise, reusing context across steps (to avoid re-sending the same info repeatedly), and leveraging token-saving techniques like cached results or summarizations.
For instance, if an orchestrator fetched a patient’s record in step 1, later steps can refer to a summary or an ID of that record rather than including the full text again. OpenAI’s pricing, for example, offers discounts for cached inputs, encouraging reuse of tokens. By minimizing unnecessary token traffic, one can trim the fat from multi-agent dialogues. Every reduction of, say, 1000 tokens per query translates to about $0.02 saved; at scale of millions of queries, those savings add up.
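The savings arithmetic scales linearly with query volume, as a quick calculation using the paper's $0.02/1K-token assumption shows:

```python
# Back-of-envelope savings from trimming redundant context, per the rule of
# thumb above: ~1,000 tokens saved per query at $0.02 per 1K tokens.
tokens_saved_per_query = 1_000
savings_per_query = tokens_saved_per_query / 1_000 * 0.02  # $0.02/query

for annual_queries in (600_000, 1_500_000, 3_000_000):
    print(f"{annual_queries:,} queries/yr -> ${annual_queries * savings_per_query:,.0f} saved")
# 600k -> $12,000; 1.5M -> $30,000; 3M -> $60,000 per year
```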
Model Improvements and Compression: As model architectures evolve, newer models tend to become more efficient (more work per token or per FLOP). The Epoch AI analysis noted that the newer GPT-4 optimizations (GPT-4o) and improved hardware cut the per-query energy by 10× compared to early 2023 estimates.
Continuing this trend, an orchestrated approach running on future models or specialized hardware might consume far less power for the same reasoning task. Additionally, techniques like model distillation or sparsity (mixture-of-experts activating only portions of the model per query) can reduce active compute.
MAI-DxO itself might be implemented with a mixture-of-experts under the hood (GPT-4 is widely reported, though not officially confirmed, to be an MoE model), meaning not all parameters fire on every token. Embracing such efficient architectures in healthcare AI can mitigate the energy hit of running multiple expert models in parallel. In short, algorithmic and hardware innovation will steadily improve the throughput-per-watt of AI, allowing complex reasoning at lower cost.
Operational Constraints and Caching: Just as MAI-DxO can be configured with cost constraints for ordering medical tests, one could impose limits on the AI’s compute per query. For example, an orchestrator might have a budget of N model calls or M tokens; beyond that, it must make a conclusion. This prevents pathological cases where the AI might loop endlessly or use far more resources than the value of the insight gained.
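In code, such a constraint is just a loop guard. The sketch below is illustrative only: the step runner, the stopping check, and the specific limits are assumptions, not a reference implementation:

```python
# Sketch of a compute budget on the orchestrator: at most N model calls and
# M total tokens per query, after which it must conclude. Limits are illustrative.
MAX_CALLS, MAX_TOKENS = 8, 12_000

def run_next_step(state: str) -> tuple[str, int]:
    raise NotImplementedError  # hypothetical: one reasoning step, returns (output, tokens)

def is_final_answer(text: str) -> bool:
    raise NotImplementedError  # hypothetical stopping check

def conclude(state: str) -> str:
    raise NotImplementedError  # hypothetical finalizer from partial findings

def budgeted_orchestrate(case: str) -> str:
    calls = tokens_used = 0
    state = case
    while calls < MAX_CALLS and tokens_used < MAX_TOKENS:
        step_output, step_tokens = run_next_step(state)
        calls += 1
        tokens_used += step_tokens
        state = step_output
        if is_final_answer(step_output):
            return step_output
    # Budget exhausted: force a conclusion from what has been gathered so far.
    return conclude(state)
```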
Caching at the institutional level is another tactic: many queries from clinicians are similar (e.g. “What’s the dosing for Drug X in renal failure?”). If the AI has answered a question, those results can be stored so that subsequent identical or very similar queries skip the full processing and retrieve the cached answer (with appropriate validation if needed). This is analogous to a library – you don’t commission a new research report if one already exists.
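A minimal cache along these lines might look like the sketch below; the naive normalization is an assumption for illustration, and a production system would add semantic matching and clinical validation before reusing any answer:

```python
# Sketch of an institutional answer cache keyed on a normalized query.
import hashlib
from typing import Callable

_cache: dict[str, str] = {}

def _key(query: str) -> str:
    # Naive normalization: lowercase and collapse whitespace before hashing.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached_answer(query: str, compute: Callable[[str], str]) -> str:
    k = _key(query)
    if k not in _cache:
        _cache[k] = compute(query)  # expensive path, e.g. the orchestrated run
    return _cache[k]                # otherwise skip the full processing
```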
Energy Offsets and Monitoring: As a sustainability measure, healthcare organizations deploying these AIs can monitor the energy used and invest in offsets or renewable energy accordingly. If an AI system is consuming on the order of a few thousand kWh annually, the hospital could ensure this is drawn from renewable sources or offset by efficiency gains elsewhere.
While this doesn’t reduce the compute cost, it can make the net carbon impact neutral. More granular monitoring (tracking GPU hours, etc.) can also highlight inefficiencies in the AI pipeline to be optimized.
Policy and Usage Guidelines: On an organizational level, setting guidelines for appropriate AI use can avoid overuse of the system for trivial matters. For instance, clinicians might be encouraged to use the AI assistant for complex cases but not for questions that a quick manual lookup could answer. By fostering a culture of mindful AI use, the health system can target the high-value applications and curb wasteful consumption.
By combining these strategies, the effective cost per query can be reduced even as the system retains the ability to tackle hard problems with a powerful orchestrated approach. Microsoft’s own research points out the importance of choosing the right model for each task to find the “golden spot” between performance and efficiency. The goal is a tiered AI system that is both smart and resource-savvy – trivial tasks handled by cheap models, and only truly challenging reasoning tasks invoking the full multi-model panel.
Conclusion and Outlook
AI has enormous potential to augment clinical reasoning, as evidenced by systems like MAI-DxO that can surpass human diagnosticians on tough cases. The combination of multimodal inputs, chain-of-thought reasoning, and orchestrated multi-agent collaboration represents a “path to medical superintelligence”, offering more accurate and cost-effective care decisions. However, this potency comes with higher computational demands.
Our analysis shows that a traditional single-model RAG approach vs. a heavy orchestrated approach could differ by roughly 5–10× in token usage and query cost. In a large practice with 3,000 clinicians, that could mean an increase from tens of thousands to a few hundred thousand dollars per year in AI service costs when scaling up to millions of queries.
The energy consumption for the AI inference, while relatively small in absolute terms, could approach the scale of a household’s annual electricity usage over a year of heavy institutional use (e.g. ~9,000 kWh for 3 million orchestrated queries, which is ~85% of a US home’s yearly consumption). This is a non-negligible addition to a hospital’s energy footprint, especially as AI usage grows.
Remember, the modelling above covers a single use case/workflow for clinicians. Extrapolate that across the many workflows a health system runs, and the cumulative impact could easily reach high six- or even seven-figure annual costs, over and above the vendor’s intellectual property fees and your own support and operations costs for the clinical body using such solutions.
The case of MAI-DxO highlights both the promise and the need for prudence. It achieved remarkable diagnostic accuracy gains and even demonstrated the ability to lower medical costs by optimizing test selection. The onus is now on health systems and AI providers to ensure that the computational cost of such tools is managed in a responsible way. The mitigating strategies discussed – from smart triaging of queries to continual efficiency improvements – will be crucial to make AI-driven healthcare reasoning scalable and sustainable.
In summary, advanced orchestrated AI can transform healthcare delivery, but it must be deployed with an eye on operational efficiency. By extrapolating the potential consumption for a typical health system, we see that mindful usage and technical optimizations are key.
The future of clinical AI will likely embrace a hybrid approach: using big “thinking” models only when necessary, while leveraging smaller models and domain-specific logic for routine tasks. This will ensure that we get the best of both worlds – superintelligent assistance for the hardest problems, and minimal overhead for the rest.
In doing so, we can deliver the benefits of AI to patients and providers at scale without incurring prohibitive costs or energy burdens. Healthcare, perhaps more than any domain, will demand this balance of performance and efficiency, and the innovations in multimodal orchestrators are a promising step toward that equilibrium.
Sources:
Abbasian, M. et al. (2023). Conversational Health Agents: A Personalized LLM-Powered Agent Framework. arXiv preprint. – Discusses limitations of current health chatbots (lack of multi-step reasoning, personalization, multimodal data) and proposes a framework to integrate critical thinking and tool use.
Gu, A. et al. (2025). Healthcare Agent Orchestrator: Multi-agent Framework for Domain-Specific Decision Support. Microsoft Tech Community Blog. – Introduces a multi-agent orchestrator for tumor boards, explaining how orchestrating specialist models addresses precision, multimodal integration, and transparency challenges in healthcare AI.
Miao, J. et al. (2024). Chain-of-Thought Utilization in LLMs and Application in Nephrology. Medicina, 60(1):148. – Reviews how chain-of-thought prompting improves LLM reasoning in medicine, making AI decisions more logical, transparent, and trustworthy for clinicians.
Microsoft AI Blog (2025). The Path to Medical Superintelligence. – Announces MAI-DxO orchestrator, showing it turns a single model into a panel of virtual physicians and outperforms individual doctors on NEJM case challenges.
Wired (Knight, W., 2025). “Microsoft’s New AI System Diagnosed Patients 4× More Accurately…” – Reports on MAI-DxO’s performance (85% accuracy vs 20% for doctors) and its approach of querying multiple frontier models in a chain-of-debate style.
Science News (2025). “How much energy does your AI prompt use? It depends.” – Explores energy use of LLMs; notes reasoning models used ~543 tokens vs ~38 for standard (roughly 14× more) and gives examples of carbon footprint at scale.
Epoch AI (You, J., 2025). “How much energy does ChatGPT use?” – Estimates GPT-4 query at ~0.3 Wh (10× less than earlier thought) and provides context that average U.S. household uses ~10,500 kWh/year.
OpenAI Pricing (2025) via Aimultiple – Notes recent pricing of GPT-4 Turbo at $10 per 1M prompt tokens and $30 per 1M output tokens (i.e. $0.01 and $0.03 per 1K) reflecting lower costs for large-context models.
Science News (Dauner & Socher study, 2025) – Suggests using smaller models when appropriate: a standard model achieved similar accuracy to a 70B reasoning model with less than one-third the carbon emissions. This supports adaptive use of models to reduce energy.