Hi all,
I have an issue with columnstore and the query optimiser. I am running the following query:
SELECT Store.StoreId, COUNT(*)
FROM Fact.Sale
INNER JOIN Dimension.Store ON Store.StoreId = Sale.StoreId
GROUP BY Store.StoreId
ORDER BY Store.StoreId
OPTION (RECOMPILE)
"Fact.Sale" has almost 4 million rows, while the Store table has 24. The query ends up taking 7 seconds, which is terrible for columnstore. Obviously, if I rewrite the query with subqueries, the query improves, but I am using SSAS and ROLAP to generate queries.
The strange thing is, on another data source with an almost identical query and far more data, it is sub-second. What I've narrowed it down to is the query optimiser and statistics. See the estimated query plan:
and the actual query plan:
This is also the plan that gets executed (obviously); however, what is interesting is that the "Actual Number of Rows" and "Estimated Number of Rows" are radically off.
In the estimated plan, you can see the columnstore scan spits out a lot of rows (3.8m in fact), but the "Parallelism (Repartition Streams)" operator estimates it outputs just 1 row, and from then on every operator continues to estimate 1 row. In the actual plan, for some reason the "Parallelism (Repartition Streams)" actual number of rows is 0, but you can see from the thickness of the arrows that from then on the actual number of rows is the 3.8m (not the 1 it estimated).
So clearly the optimiser is picking a simpler query plan because it thinks there will only be 1 row to push through the rest of the plan. Interestingly enough, on the other data source (on the same SQL Server instance) it gets all the estimates right and adds a "Hash Match (Partial Aggregate)" in batch mode, which makes the query fly.
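In case anyone wants to reproduce the numbers, the actual plan (with the "Actual Number of Rows" per operator) can be captured as XML with something like:

-- Returns the actual execution plan as XML alongside the result set
SET STATISTICS XML ON

SELECT Store.StoreId, COUNT(*)
FROM Fact.Sale
INNER JOIN Dimension.Store ON Store.StoreId = Sale.StoreId
GROUP BY Store.StoreId
ORDER BY Store.StoreId
OPTION (RECOMPILE)

SET STATISTICS XML OFF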
I've updated statistics, created statistics on various columns, rebuilt the table, rebuilt the indexes, and so on (roughly the commands sketched below), but I can never seem to get the right estimates.
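This is roughly the sort of thing I've been running (the FULLSCAN option and the statistics name here are just illustrative):

-- Full-scan statistics update on both tables
UPDATE STATISTICS Fact.Sale WITH FULLSCAN;
UPDATE STATISTICS Dimension.Store WITH FULLSCAN;

-- Rebuild all indexes on the fact table (including the columnstore)
ALTER INDEX ALL ON Fact.Sale REBUILD;

-- Example of a manually created statistic on the join column
CREATE STATISTICS st_Sale_StoreId ON Fact.Sale (StoreId) WITH FULLSCAN;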
Does anyone know how I can probe further into this issue and find a resolution?