Model Benchmarks
12 models evaluated on Vietnamese real estate price prediction using repeated stratified cross-validation. Current runs use 81,790 listings; legacy runs used the earlier 90,566-row dataset (see the Rows column).
| # | Model | Config | RMSE (log target) | MAE | MAPE | Time/fold | Rows |
|---|---|---|---|---|---|---|---|
| 1 | CatBoost (deployed) | default | 0.3360 ± 0.0039 | 0.2246 | 25.2% | 24s | 81,790 |
| 2 | LightGBM | lr=0.030 | 0.3816 ± 0.0026 | 0.2574 | 29.4% | 35s | 81,790 |
| 3 | XGBoost | default | 0.3836 ± 0.0010 | 0.2630 | 29.9% | 20s | 81,790 |
| 4 | AutoGluon | best quality, 600s limit | 0.3993 ± 0.0055 | 0.2717 | 30.3% | 12m 20s | 90,566 |
| 5 | TabICL (foundation) | n=32 | 0.4995 ± 0.0028 | 0.3256 | 38.9% | 1m 59s | 90,566 |
| 6 | KNN | default | 0.5180 ± 0.0033 | 0.3450 | 42.1% | 29s | 90,566 |
| 7 | Ridge | default | 0.5516 ± 0.0253 | 0.3644 | 43.5% | <1s | 12,908 |
| 8 | RandomForest | default | 0.5640 ± 0.0042 | 0.3920 | 46.9% | 18m 4s | 90,566 |
| 9 | LinearRegression | default | 0.5779 ± 0.0067 | 0.3781 | 46.8% | 37s | 90,566 |
| 10 | LinearSVR | default | 0.5922 ± 0.0070 | 0.3638 | 43.9% | 23s | 90,566 |
| 11 | ElasticNet | default | 0.6102 ± 0.0047 | 0.4144 | 49.4% | 15s | 90,566 |
| 12 | Lasso | default | 0.6196 ± 0.0044 | 0.4224 | 51.0% | 8s | 90,566 |
What is a foundation model?
A neural network pre-trained on massive data to serve as a reusable base for many downstream tasks. Unlike traditional models that train from scratch on your specific dataset, a foundation model arrives already knowing general patterns — you just show it a handful of examples and it generalizes. This is the paradigm behind GPT (text), CLIP (images), and now TabICL (tables). Pre-training on 130M synthetic tabular datasets lets TabICL do in-context learning: predicting on new data without any fine-tuning.
Foundation & AutoML
TabICL · foundation model · 130M synthetic datasets
A transformer pre-trained on 130 million synthetic tabular datasets, learning the statistical patterns common across all tabular data. At inference it performs in-context learning: you show it a handful of labeled examples plus a test row, and it predicts without any training on your specific dataset. The tabular equivalent of GPT's few-shot prompting. It wins, with statistical significance, at small scale (<1K rows); gradient boosters catch up as data grows. Runs on an A10G GPU, ~50s latency.
AutoGluon · Amazon · stacked ensemble AutoML
Not a single model but a system. Trains LightGBM + CatBoost + XGBoost + Random Forest + neural nets + FT-Transformer in parallel, then stacks them with a meta-learner. The tradeoff for its accuracy is ~12-minute fold times and a harder-to-deploy artifact.
Gradient Boosting
CatBoost · Yandex · native categorical handling
Gradient boosting that builds decision trees sequentially — each new tree corrects the previous tree's errors. CatBoost's edge is native support for categorical variables via ordered target statistics, which matters enormously here: we have 257 districts, 235 wards, and 974 streets that other frameworks either one-hot-encode (explosion) or ordinal-encode (lossy). Deployed in production at depth=10, autoresearch-tuned.
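The ordered-target-statistics idea fits in a few lines. This is a simplified sketch under our own naming, not CatBoost's exact implementation (which averages over several random permutations and tunes the prior); it shows the key property that each row is encoded using only the targets of rows before it, so the encoding never leaks a row's own label.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior, smoothing=1.0):
    """Ordered target statistics for one categorical column (sketch).

    Rows are assumed pre-shuffled. Each row is encoded from the running
    mean of targets seen so far for its category, smoothed toward a prior.
    """
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count per category
    encoded = []
    for cat, y in zip(categories, targets):
        # use only *past* rows, so the row's own target cannot leak in
        encoded.append((sums[cat] + smoothing * prior) / (counts[cat] + smoothing))
        sums[cat] += y
        counts[cat] += 1
    return encoded
```

With 974 street values, this yields one informative numeric column instead of 974 one-hot columns.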
LightGBM · Microsoft · leaf-wise, histogram-based
Microsoft's speed-optimized gradient booster. Uses histogram-based feature binning and leaf-wise tree growth (vs XGBoost's level-wise), which converges faster on most tabular problems. Trained here with Huber loss to resist the long tail of outlier listings, and handles NaN values natively.
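The Huber loss itself is a simple hybrid of squared and absolute error. A minimal sketch (the `delta` threshold here is our illustrative name; LightGBM exposes it via the `alpha` parameter of its `huber` objective):

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails, so
    outlier listings contribute bounded gradients instead of dominating."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r          # behaves like squared error
    return delta * (r - 0.5 * delta)  # grows linearly past the threshold
```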
XGBoost · the framework that popularized gradient boosting
Popularized gradient boosting on tabular data and still a strong baseline. Slightly behind CatBoost and LightGBM on this dataset because it needs ordinal encoding for high-cardinality categoricals (its native categorical mode fails on district-level granularity).
Tree & Instance-based
RandomForest · bagged decision trees
An ensemble of decision trees trained in parallel on bootstrapped samples, with predictions averaged across trees. Interpretable and robust, but lacks the sequential error-correction mechanism of boosting — which is why it plateaus around RMSE 0.56 here while boosters reach 0.34.
KNN · k-nearest neighbors · non-parametric
Predicts by averaging the 10 nearest listings in feature space, distance-weighted. No training phase; pays the cost at inference time. Struggles as categorical cardinality and feature dimensionality grow.
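The prediction rule can be sketched in plain Python (function and parameter names are ours; a production implementation would use a KD-tree or similar index rather than a full sort):

```python
import math

def knn_predict(train_X, train_y, query, k=10, eps=1e-9):
    """Distance-weighted k-nearest-neighbour regression (sketch).

    Averages the targets of the k closest training rows, weighting each
    by inverse Euclidean distance so nearer listings dominate.
    """
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    dists.sort(key=lambda t: t[0])          # all work happens at inference
    nearest = dists[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```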
Linear Baselines
Ridge · L2-regularized linear regression
Linear regression with L2 (squared coefficient) penalty — shrinks all coefficients toward zero. Useful as a sanity-check baseline: quantifies how much of the variance is linearly explained before we invoke non-linear models.
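For a single centred feature with no intercept, the ridge solution has a closed form that makes the shrinkage explicit, w = Σxy / (Σx² + λ). A sketch under those simplifying assumptions:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge coefficient for one centred feature, no intercept:
    w = sum(x*y) / (sum(x*x) + lam). lam=0 recovers OLS; larger lam
    shrinks the coefficient toward zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```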
Lasso · L1-regularized · sparse
Linear regression with L1 (absolute coefficient) penalty. Drives unimportant coefficients to exactly zero, giving implicit feature selection.
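The mechanism behind that exact-zero behaviour is the soft-thresholding operator applied during coordinate descent; a minimal sketch:

```python
def soft_threshold(w, lam):
    """Lasso proximal operator: shift w toward zero by lam, and snap
    anything inside [-lam, lam] to exactly 0 (implicit feature selection)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0
```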
ElasticNet · L1 + L2 blend
Blends Lasso (L1) and Ridge (L2) penalties. Handles correlated features more gracefully than pure Lasso (which arbitrarily picks one of a correlated pair).
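One way to see the blend is the elastic-net proximal step: an L1 soft-threshold followed by an L2 shrink, so coefficients can still hit exactly zero but correlated features share weight instead of one being dropped. A sketch with `lam1`/`lam2` as illustrative penalty weights:

```python
def elastic_net_prox(w, lam1, lam2):
    """Elastic-net proximal step: soft-threshold by lam1 (L1 part),
    then shrink the survivor by 1/(1 + lam2) (L2 part)."""
    if w > lam1:
        w = w - lam1
    elif w < -lam1:
        w = w + lam1
    else:
        return 0.0
    return w / (1.0 + lam2)
```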
LinearSVR · ε-insensitive loss
Support vector regression with a linear kernel. Optimizes an ε-tube (ignores errors below ε, penalizes errors outside) rather than squared error — different loss geometry, comparable expressiveness to Ridge.
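The ε-tube idea reduces to a one-line loss; a sketch (the default `eps` value here is illustrative, not the benchmark's setting):

```python
def eps_insensitive(residual, eps=0.1):
    """SVR's epsilon-insensitive loss: errors inside the tube cost
    nothing; errors outside grow linearly, not quadratically."""
    return max(0.0, abs(residual) - eps)
```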
LinearRegression · plain OLS · no regularization
Classical ordinary least squares. The null hypothesis against which every more complex model must justify its complexity.
- Target: log_unit_price = ln(price_vnd / area_m2)
- Evaluation: 5-fold stratified cross-validation (quantile-binned target)
- Dataset: 81K+ Vietnamese real estate listings, 44 features (21 numeric + 7 categorical + 16 binary) after LLM price-quality filter
- Source: batdongsan.com.vn, nationwide coverage
- Tuning: Optuna TPE sampler, 50 trials with inner 3-fold CV for LightGBM and CatBoost
- Statistical tests: Nadeau-Bengio corrected t-test with Cohen's d effect size
- Refresh: model and dataset retrained monthly by a scheduled job on our infrastructure (refresh the feature view → export → retrain with the winning config → RMSE/row-count gate → deploy, bouncing the warm inference container onto the fresh model)
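The target transform and the quantile binning used for stratification can be sketched as follows (function names are ours; the binning shown is a rank-based equivalent of what a quantile discretizer plus a stratified K-fold splitter would do):

```python
import math

def log_unit_price(price_vnd, area_m2):
    """The regression target: natural log of price per square metre."""
    return math.log(price_vnd / area_m2)

def quantile_bins(targets, n_bins=5):
    """Bin a continuous target into equal-count quantile bins so a
    stratified splitter can balance cheap and expensive listings
    across folds."""
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    bins = [0] * len(targets)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(targets), n_bins - 1)
    return bins
```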
Cross-validation MAPE (mean) differs from production MdAPE (median): MdAPE = 14.8% on 81K listings with CatBoost depth=10 (verified 2026-04-15). Legacy runs on the 90K / 19-feature dataset are retained for family comparison.
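The gap between the two numbers is just mean versus median over the same per-listing percentage errors; the median ignores the long tail that inflates the mean. A minimal sketch:

```python
def mape(actual, pred):
    """Mean absolute percentage error (actual values must be nonzero).
    Sensitive to a few badly mispriced listings."""
    errs = [abs(a - p) / a for a, p in zip(actual, pred)]
    return sum(errs) / len(errs)

def mdape(actual, pred):
    """Median absolute percentage error. Robust to outliers, which is
    why a production median can sit well below the CV mean."""
    errs = sorted(abs(a - p) / a for a, p in zip(actual, pred))
    n = len(errs)
    mid = n // 2
    return errs[mid] if n % 2 else (errs[mid - 1] + errs[mid]) / 2
```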
Currently deployed: CatBoost (depth=10, autoresearch-tuned) for fast inference. TabICL (foundation model) available as an alternative.
Last updated: April 16, 2026