Should we train an AI model from scratch?

Usually, no. Training a frontier model from scratch requires data, compute, infrastructure, and specialized expertise that most teams cannot justify. Start with an existing model, RAG, fine-tuning, traditional ML, or a smaller custom model if the use case justifies it.

Is RAG better than fine-tuning?

RAG is better when the system needs current or proprietary information, citations, and source traceability. Fine-tuning is better when the model needs to learn a task pattern, format, style, or behavior from stable examples. Many useful systems combine both.

What is the hardest part of building an AI model?

The hard parts are usually data quality, evaluation, workflow fit, observability, and governance. Model choice matters, but a strong model will still fail if it receives bad context, uses weak data, or cannot be tested against the real task.

How do we know if an AI model is good enough to launch?

It is good enough only if it performs well on realistic examples with the right failure controls. Use domain-specific test sets, human review, edge-case testing, cost and latency tracking, and post-launch monitoring before expanding its scope.


Building an AI Model: What Matters

May 25, 2026 by Asghar Mirzaie

Most teams do not need to build a new AI model from scratch. They need to build a reliable AI system around the right model, the right data, and the right workflow. That distinction matters because the cost, risk, and timeline change completely depending on what “building an AI model” means in your case.

This article is for product teams, operators, publishers, consultants, ecommerce teams, and domain experts trying to turn an AI idea into something useful. If you are still deciding whether to build, buy, fine-tune, or automate a workflow, start with the decision work before the architecture. Refact’s AI development service is built around that order: clarity before code.

Building an AI model now means choosing a model strategy

The phrase “building an AI model” used to point toward training. You gathered data, selected an algorithm, trained the model, tested it, and deployed it. That process still exists, especially for traditional machine learning, but it is no longer the default path for many AI products.

From 2024 to 2026, most practical AI work has fallen into one of these paths:

Use an existing model through an API.
Add company or domain knowledge with retrieval-augmented generation, often called RAG.
Fine-tune an existing model on curated examples.
Build an agentic workflow with tools, memory, retrieval, validation, and human approval.
Train a smaller domain-specific model for prediction, ranking, forecasting, or classification.
Train a large foundation model from scratch, which is rare outside major AI labs.

The last option gets the attention. It is usually the least relevant. The research for this article cites the Stanford 2026 AI Index as reporting that industry produced more than 90% of notable frontier models in 2025. That reflects the reality of frontier model training: large data pipelines, distributed training infrastructure, specialized teams, and compute budgets that most organizations cannot justify.

For most teams, the better question is not “How do we create our own GPT?” It is “Which model strategy gives us enough reliability, control, and speed for this workflow?”

Start with the job, not the model

A weak AI project starts with a capability: “We need a chatbot,” “We need an agent,” or “We need to fine-tune a model.” A stronger project starts with a job that a person or system already performs.

Good AI scopes sound specific:

Classify support tickets by urgency and route them to the right queue.
Summarize sales calls and draft CRM updates for review.
Search internal policies and cite the source document used in the answer.
Flag unusual invoices before finance approves payment.
Recommend the next article, product, or training module based on behavior.

Those examples have inputs, outputs, users, and failure modes. That makes them testable. “Use AI in customer support” is not testable until you decide whether the system is answering customers, drafting replies for agents, triaging requests, finding policy documents, or detecting churn risk.

Refact’s article on generative AI business value makes the same point from a business angle: useful AI work starts with focused workflows and measurable outcomes, not broad adoption language.

Define success before you pick the stack

Before engineering starts, write down what the system must improve. The metric depends on the task. A support routing model may be judged by routing accuracy, first response time, escalation rate, and agent correction rate. A retrieval assistant may be judged by answer accuracy, citation quality, coverage, latency, and the rate of “I do not know” responses.

Also define the failure condition. A model that is 90% accurate may be fine for sorting low-risk content ideas. It may be unacceptable for medical advice, loan decisions, legal review, or account closures. The risk belongs in the product requirements, not in a late-stage compliance meeting.

If nobody can describe what happens when the AI is wrong, the project is not ready to build.

The right path may be API, RAG, fine-tuning, traditional ML, or no AI

There is no universal process for building an AI model because not every problem needs the same kind of intelligence. The decision tree matters more than the trend.

Approach	Use it when	Watch out for
No AI	The work is rules-based, deterministic, or already solved by simple automation.	AI may add cost, latency, and review burden without improving the outcome.
Traditional machine learning	You need prediction, scoring, ranking, forecasting, fraud detection, or tabular classification.	Accuracy can mislead if the data is imbalanced or the test set leaks information.
Existing model API	You need fast validation for summarization, extraction, drafting, classification, or conversation.	You depend on vendor behavior, pricing, rate limits, and data policies.
RAG	The system needs current documents, proprietary knowledge, citations, or permission-aware answers.	Bad chunking, weak ranking, stale documents, or missing context can cause confident wrong answers.
Fine-tuning	You need the model to learn a format, tone, task pattern, or domain behavior from stable examples.	It needs clean training data and does not solve freshness as well as retrieval.
Agentic workflow	The system must plan, use tools, check results, and move through a multi-step process.	Tool failures, rate limits, hidden state, and weak traces make debugging harder.
Custom model training	You have specialized data, a clear business case, and the team to operate the model.	Training, serving, evaluation, monitoring, and governance become your responsibility.

MIT Sloan guidance cited in the research makes a useful distinction: generative AI fits language and common media tasks, while traditional ML may be better for domain-specific prediction with structured data. If the job is churn prediction or inventory forecasting, a large language model may be the wrong starting point.

This is also where build-versus-buy thinking belongs. Refact’s guide to SaaS AI tools is useful if your first decision is whether an existing AI product can handle the workflow before you fund custom development.

Data quality decides how far the model can go

AI projects often look technical from the outside. Inside the work, teams spend a surprising amount of time on data: finding it, cleaning it, labeling it, deduplicating it, checking rights, and deciding whether it reflects the conditions the system will face after launch.

Bad data does not create a small defect. It teaches the system the wrong pattern. If past support tickets were labeled inconsistently, a routing model will inherit the confusion. If a fraud model trains on a dataset where the rare cases are underrepresented, accuracy may look high while the system misses the cases that matter. If a retrieval assistant indexes old policy documents without version control, it may cite rules the company no longer follows.

The DRIL paper summary in the research is a good warning for agent-assisted data collection. Automated agents repeatedly visited low-quality or duplicate sources unless the system enforced deduplication, source-quality checks, metadata, traceability, and human quality control. Automation did not remove the need for editorial judgment. It made that judgment more important.

Useful data work is slow for good reasons

Before model work begins, teams should answer practical questions:

Where did each dataset come from?
Do we have the right to use it for this purpose?
Which fields are missing, duplicated, or unreliable?
Who labeled the examples, and how were disagreements handled?
Does the dataset include edge cases, minority classes, and real production messiness?
What will change over time, and how will drift be detected?
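
Several of these questions can be checked mechanically before anyone trains or indexes anything. The sketch below is a minimal illustration, assuming a hypothetical list of labeled support tickets with `text` and `label` fields; the field names and the `audit` helper are assumptions for this example, not part of any particular tool. It flags rows with missing fields, duplicate texts, and texts that received conflicting labels.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so near-identical rows compare equal."""
    return " ".join(text.lower().split())


def audit(rows, required=("text", "label")):
    """Return row indexes with missing fields, duplicate texts, and label conflicts."""
    missing = [i for i, row in enumerate(rows)
               if any(not row.get(name) for name in required)]

    counts = Counter()
    labels_by_text = defaultdict(set)
    for row in rows:
        if not row.get("text"):
            continue  # already reported as missing
        key = normalize(row["text"])
        counts[key] += 1
        if row.get("label"):
            labels_by_text[key].add(row["label"])

    duplicates = sorted(key for key, n in counts.items() if n > 1)
    conflicts = sorted(key for key, labels in labels_by_text.items()
                       if len(labels) > 1)
    return {"missing": missing, "duplicates": duplicates, "conflicts": conflicts}


# Hypothetical rows for illustration only.
ROWS = [
    {"text": "Reset my password", "label": "account"},
    {"text": "reset my  password", "label": "billing"},  # same text, different label
    {"text": "Invoice is wrong", "label": "billing"},
    {"text": "", "label": "account"},                    # missing text
    {"text": "Cancel my plan", "label": None},           # missing label
]

report = audit(ROWS)
print(report)
```

A check like this does not settle who is right about a conflicting label. It only surfaces the disagreement so a person with domain knowledge can resolve it.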

This is not paperwork. It is model quality. In mental health AI research cited in the brief, label ambiguity and unclear outcomes can dominate performance. The same pattern appears in business workflows. If the team cannot agree on what a “high-quality lead” means, the model will not solve the disagreement.

Evaluation must match the real task

Benchmark scores can show model capability, but they do not prove that a system will work in your product. The Stanford AI Index examples in the research show the split clearly. Top models made major gains on coding and agent benchmarks, yet one leading model still read analog clocks correctly only about half the time. Capability is uneven.

Your evaluation set should come from your task, not from a generic benchmark. If you are building an internal knowledge assistant, test it on real questions employees ask, with messy wording, outdated assumptions, missing documents, and permissions that affect what the model can see. If you are classifying support messages, include ambiguous tickets, angry customers, short messages, multilingual examples, and rare but costly categories.

Practitioner discussions in the research repeat the same warning: high accuracy can still fail users. In imbalanced datasets, a model can look successful by predicting the majority class most of the time. That is why precision, recall, F1, PR-AUC, ranking metrics, and manual error review often matter more than a single accuracy number.
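
A small worked example makes the imbalance problem concrete. The sketch below uses made-up numbers (95 routine tickets and 5 urgent ones) and a hypothetical "model" that always predicts the majority class. The function names are illustrative; in practice a library such as scikit-learn would compute these metrics.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1, treating 0/0 as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up data: 1 = urgent ticket (rare), 0 = routine ticket (common).
y_true = [0] * 95 + [1] * 5
always_routine = [0] * 100  # a "model" that never flags anything

metrics = summarize(y_true, always_routine)
print(metrics)  # accuracy is 0.95 while recall on urgent tickets is 0.0
```

The 95% accuracy here says nothing useful: every urgent ticket was missed, which is exactly the failure the routing system exists to prevent.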

Build the evaluation harness before scaling

A practical evaluation harness should include:

A fixed test set of real examples the system has not seen.
Edge-case suites for known risks.
Human review by people who understand the domain.
Regression tests so fixes do not break past behavior.
Cost, latency, and refusal-rate tracking.
A clear process for adding new failures back into testing.
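
As a rough illustration of the list above, a harness can start as a fixed set of cases grouped into suites, plus a function that reports which cases fail. Everything in this sketch is hypothetical: `EvalCase`, `run_eval`, and the keyword-based `toy_router` stand in for a real test set and a real system under test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    text: str
    expected: str
    suite: str  # for example "core" or "edge"


def run_eval(system, cases):
    """Run every case through `system` and report failures per suite."""
    failures = [c.case_id for c in cases if system(c.text) != c.expected]
    by_suite = {}
    for c in cases:
        stats = by_suite.setdefault(c.suite, {"total": 0, "failed": 0})
        stats["total"] += 1
        if c.case_id in failures:
            stats["failed"] += 1
    return {"failures": failures, "by_suite": by_suite}


def toy_router(text):
    """Stand-in for the real system: a naive keyword rule."""
    return "urgent" if "outage" in text.lower() else "routine"


# Hypothetical fixed test set. New production failures get appended here
# so later changes are checked against them (regression testing).
CASES = [
    EvalCase("c1", "Site outage since 9am", "urgent", "core"),
    EvalCase("c2", "How do I change my avatar?", "routine", "core"),
    EvalCase("e1", "everything is DOWN!!!", "urgent", "edge"),
]

report = run_eval(toy_router, CASES)
print(report)
```

The edge case fails because the toy rule only knows the word "outage". Keeping that case in the set means any future fix can be verified, and any later regression will show up immediately.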

For generative systems, evaluation also needs qualitative review. Did the answer cite the right source? Did it skip uncertainty? Did it invent a policy? Did it take an action without approval? Those questions cannot be reduced to one score.

RAG and fine-tuning solve different problems

Teams often ask whether RAG is better than fine-tuning. The safer answer is that they solve different problems.

Use RAG when the model needs access to changing or proprietary knowledge. Company policies, product catalogs, customer documentation, research libraries, internal wikis, and legal templates are common examples. RAG retrieves relevant material at runtime and gives the model context for the answer. It also supports citations and source traceability when the system is designed well.

Use fine-tuning when the model needs to learn a pattern of behavior. That may mean output format, domain language, classification behavior, or task-specific responses. Fine-tuning can make sense when the training examples are stable, clean, and numerous enough to teach the behavior you want.

Many systems combine both. A customer support assistant may use RAG to pull the latest policy and fine-tuning to follow the company’s response format. The hard part is not naming the technique. The hard part is knowing which failure you are trying to fix.

Hallucinations often start upstream

Teams sometimes blame hallucinations on the model alone. The research points to a broader cause. Many false answers come from retrieval, ranking, routing, permissions, or context construction failures. The model answers badly because the system gave it the wrong evidence, incomplete evidence, or no evidence at all.

That is why context engineering matters. The system must decide what information to retrieve, how to rank it, how much to include, what to exclude, and when to refuse to answer. Adding another sentence to the prompt rarely fixes a broken retrieval pipeline.
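
One way to picture the "when to refuse" decision is a retrieval step that returns nothing when the evidence is too weak. This is a deliberately simplified sketch: it scores documents by word overlap rather than embeddings, and the document IDs, the `min_score` threshold, and the function names are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase word and number tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=2):
    """Rank documents by the share of query tokens they contain."""
    q = tokens(query)
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q & tokens(text)) / len(q) if q else 0.0
        scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    return scored[:k]


def build_context(query, docs, min_score=0.5):
    """Return cited context, or None when no document clears the threshold."""
    hits = [(s, d) for s, d in retrieve(query, docs) if s >= min_score]
    if not hits:
        return None  # caller should answer "I do not know" instead of guessing
    return "\n".join(f"[{doc_id}] {docs[doc_id]}" for _, doc_id in hits)


# Hypothetical, versioned policy snippets.
DOCS = {
    "refund-policy-v3": "Refunds are issued within 14 days of purchase",
    "shipping-policy-v2": "Standard shipping takes 5 business days",
}

answerable = build_context("refunds within 14 days", DOCS)
unanswerable = build_context("parental leave entitlement", DOCS)
print(answerable)
print(unanswerable)
```

The second query returns `None` because nothing in the index supports an answer. Passing that signal to the model, or skipping the model call entirely, is usually a more reliable fix than rewording the prompt.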

Real AI systems are pipelines, not single model calls

A demo can be a prompt and a model. A production system usually needs more structure.

Google Deep Research-style architecture, as described in the research brief, uses planning agents, retrieval agents, web and private data sources, caching, synthesis, citations, conflict comparison, and user steering. That pattern exists because a single model call rarely has enough context or control to complete a serious workflow reliably.

A practical AI pipeline often looks like this:

Trigger: a user action, scheduled job, webhook, or event starts the process.
Context: the system gathers documents, records, permissions, and user history.
Decision: the model classifies, drafts, recommends, or plans.
Action: the system calls a tool, updates a record, drafts a message, or queues a task.
Validation: rules, tests, or another model check the output.
Human approval: risky or ambiguous cases go to a person.
Output: the result appears inside the workflow where the user already works.
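
The stages above can be sketched as ordinary functions around a single model call. This is a toy illustration under stated assumptions: the `decide` step is a placeholder where a real model call would go, and the knowledge base, risk terms, and status names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    ticket: str
    context: list = field(default_factory=list)
    draft: str = ""
    status: str = "pending"


def gather_context(run, kb):
    """Context: pull knowledge-base entries whose key appears in the ticket."""
    run.context = [text for key, text in kb.items() if key in run.ticket.lower()]


def decide(run):
    """Decision: placeholder for the model call that drafts a reply."""
    run.draft = f"Suggested reply based on {len(run.context)} source(s)."


def validate(run):
    """Validation: reject drafts with no supporting context or excessive length."""
    return bool(run.context) and len(run.draft) < 500


def handle(ticket, kb, risky_terms=("refund", "cancel")):
    """Trigger to output: run each stage and route the result."""
    run = Run(ticket)
    gather_context(run, kb)
    decide(run)
    if not validate(run):
        run.status = "rejected"
    elif any(term in ticket.lower() for term in risky_terms):
        run.status = "needs_human_approval"  # Human approval
    else:
        run.status = "ready"  # Output: safe to show inside the workflow
    return run


# Hypothetical knowledge base keyed by topic word.
KB = {"refund": "Refund policy text", "password": "Password reset steps"}

print(handle("I want a refund", KB).status)
print(handle("Password reset not working", KB).status)
print(handle("Where is your office?", KB).status)
```

Even in this toy form, most of the code is context gathering, validation, and routing. The model call is one line.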

In our Workform AI MVP, the important early decision was not “which model should write tasks?” The product needed to ingest and connect information from Slack, email, Asana, and meetings so project managers could see what was happening across scattered systems. The AI value came from context, integration, and workflow focus, not from a model call in isolation.

Observability is part of the product, not an operations add-on

AI failures are hard to debug when the system cannot show what happened. A normal application log might tell you that an API request succeeded. That is not enough for AI. You may need to know the prompt, model version, retrieved documents, ranking results, tool calls, intermediate plan, validation step, latency, cost, and final output.

Practitioner signals in the research are consistent on this point. Teams often discover too late that they cannot reproduce a bad answer because traces are flat, prompts changed, retrieved context was not saved, or a tool returned different data at runtime. In agentic systems, the problem gets worse because several steps may fail partially before the final answer appears.

AI observability should track:

Prompts and system instructions.
Retrieved documents and source versions.
Tool calls, responses, and failures.
Model names, versions, and settings.
Latency, token use, and cost.
User corrections and feedback.
Evaluation results over time.
Data drift and changes in input patterns.
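
In practice, much of this list can live in one structured record written per model call. The sketch below is one possible shape, not a standard schema: the `TraceRecord` fields, the model name, and the email-only `redact` helper are assumptions for illustration, and real redaction needs to cover far more than email addresses.

```python
import json
import re
import time
import uuid
from dataclasses import asdict, dataclass, field

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Mask email addresses before the text is written to logs."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


@dataclass
class TraceRecord:
    prompt: str
    model: str
    retrieved_doc_ids: list
    tool_calls: list
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_log_line(record):
    """Serialize one trace as a JSON line with free-text fields redacted."""
    data = asdict(record)
    data["prompt"] = redact(data["prompt"])
    data["output"] = redact(data["output"])
    return json.dumps(data, sort_keys=True)


# Hypothetical values for illustration only.
record = TraceRecord(
    prompt="Customer ana@example.com asks about refunds",
    model="example-model-v1",
    retrieved_doc_ids=["refund-policy-v3"],
    tool_calls=[{"name": "lookup_order", "ok": True}],
    output="Refunds are issued within 14 days.",
    latency_ms=812.5,
    input_tokens=420,
    output_tokens=35,
)

line = to_log_line(record)
print(line)
```

Saving the retrieved document IDs and model name alongside the prompt is what makes a bad answer reproducible later.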

Privacy and security still matter. You may need redaction, retention rules, access controls, and audit policies. But skipping logs entirely means you will be guessing when the system breaks.

Workflow fit determines adoption

A technically strong model can fail if it sits outside the way people work. Users will ignore it, copy outputs into unofficial tools, or create shadow workflows if the official system adds friction.

The HEPI 2026 student survey in the research is a useful signal outside the enterprise context: students were already using AI heavily, often beyond the tools and guidance their institutions provided. The pattern generalizes. If users see a faster path, they will take it. If your AI system does not fit the real workflow, people will route around it.

That is why the interface matters as much as the model. A support agent may need AI inside the ticketing system, not in a separate chat window. A finance reviewer may need a risk flag, cited evidence, and an approval queue, not a paragraph of explanation. An editor may need source suggestions inside the publishing workflow, not a general research bot.

Refact’s automation and integration work often comes down to this same principle. The system has to meet the user where the work already happens, with the right amount of control.

Governance slows reckless launches, but it helps serious ones scale

Governance is not a final checklist. It shapes what the system is allowed to do, who owns decisions, how users are informed, and when humans must review outputs.

Deloitte 2026 findings cited in the research say only one in five companies has a mature governance model for autonomous AI agents. That gap matters because adoption is running ahead of controls. The same research also reports that organizations see more common gains in productivity, insights, cost reduction, and customer relationships than in revenue or product innovation. In other words, AI is creating operational value, but scaling it safely remains difficult.

A useful governance plan answers plain questions:

What data can the system access?
What actions can it take without approval?
Which outputs require human review?
Who is accountable for mistakes?
How are users told when AI is involved?
How are bias, safety, and security issues reported?
How are changes tested before release?

For high-risk workflows, bounded autonomy is the safer pattern. Let the AI retrieve, draft, summarize, classify, or recommend. Require a person to approve actions that affect money, health, employment, legal standing, customer access, or safety.
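
Bounded autonomy can be expressed as a small policy check that runs before any action executes. The sketch below is an assumption about how such a gate might look: the action names and impact areas mirror the paragraph above, but the `gate` function and its default-deny rule are illustrative, not a prescribed design.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"


# Actions the AI may take on its own, per the paragraph above.
LOW_RISK_ACTIONS = {"retrieve", "draft", "summarize", "classify", "recommend"}

# Areas where a person must approve before anything happens.
HIGH_IMPACT_AREAS = {"money", "health", "employment", "legal",
                     "customer_access", "safety"}


def gate(action, affects):
    """Decide whether an action can run automatically.

    `affects` is the set of impact areas the action touches.
    Unknown actions fall through to approval (default deny).
    """
    if set(affects) & HIGH_IMPACT_AREAS:
        return Decision.NEEDS_APPROVAL
    if action in LOW_RISK_ACTIONS:
        return Decision.AUTO
    return Decision.NEEDS_APPROVAL


print(gate("summarize", set()))
print(gate("issue_refund", {"money"}))
print(gate("draft", {"legal"}))
print(gate("delete_record", set()))
```

The default-deny branch matters most: an action nobody anticipated should wait for a person rather than run because no rule forbade it.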

The practical sequence for building an AI model

The process is less mysterious when you treat the model as one part of a product system.

1. Pick a narrow workflow

Choose one task with a clear user, input, output, and business reason. Avoid multi-department AI programs until one workflow has proven value.

2. Decide whether AI is needed

If rules, search, automation, or traditional software can solve the problem, use them. AI should earn its complexity.

3. Choose the model strategy

Select API, RAG, fine-tuning, traditional ML, agentic orchestration, or custom training based on the task, data, risk, and need for control.

4. Prepare the data

Clean, deduplicate, label, permission, and document the data. Build source traceability before the system depends on it.

5. Build the evaluation set

Use real examples, edge cases, and human review. Decide what “good enough” means before tuning the system around anecdotes.

6. Build the smallest useful version

Test the workflow with a limited group, limited permissions, and clear review. The first version should expose the hard failures early.

7. Instrument the system

Log prompts, context, tools, versions, outputs, feedback, latency, cost, and evaluation results. Without traces, improvement becomes opinion.

8. Launch with ownership

Assign responsibility for monitoring, corrections, retraining, policy changes, and user feedback. AI systems degrade quietly if nobody owns them.

Building an AI model is not a single technical milestone. It is a chain of decisions about scope, data, reliability, workflow, and control. The teams that get value usually do less at first, test harder, and build the evaluation, monitoring, and governance around the model before they expand its authority.

If you are trying to decide which AI path fits a workflow before development starts, Refact’s AI development process is built for that early clarity work.

Written by

Asghar Mirzaie

Asghar Mirzaie is a backend developer at Refact, focused on the APIs, integrations, and infrastructure that power the studio’s products. His work spans data pipelines, third-party services, backend architecture, and deployment systems, helping ensure that products are stable, scalable, and ready for real-world use. Asghar works closely with the team to connect product requirements with reliable technical foundations, especially in systems where performance, automation, and integration quality matter. At Refact, he contributes to the engineering work behind the interfaces, making sure the products the studio builds can run smoothly and dependably.
