FTS Index Size: Flag vs. Auxiliary Table

Hi, I want to accomplish the simple task of flagging a few rows (1-200) in a
large table (many rows i.e. 1,000,000+). I want to find these flagged rows
quickly. After some queries are run against these rows, the flag will be
changed to indicate "don't process these rows". If I use a bit or int or cha
r
flag, and create an index, I will have every row in the index, which seems a
waste of space (correct me if I'm wrong). If the flag is either a value or
null, will the null rows still be indexed? If not, this seems like an ideal
way to go. If so, it seems like it would be better to keep another table to
keep track of the rows needing processing temporarily. This aux table would
always contain only a few rows. So, this is really 2 questions;
1) Are null rows indexed? If not, then an index on a bit column (I
understand that these CAN be indexed in SS 2000), would be a vary small, ver
y
useful index for this problem. No?
2) I'm sure I'm not the only person in history to wonder about this
situation; it must be quite common (yes, I've done my googling but no satis.
answer); what is the OPTIMAL way to handle this situation? Perhaps it's
neither of my ideas and something else entirely.
Thanks!
ChrisI would create a second table, "Samplings" and keep the the keys of the
first batch of 200 rows in it, along with any other information about
the sample. The use of flags is too much like low level assembly
language progamming. Since you are dealing with a small set, anything
will run pretty fast for you.|||> Hi, I want to accomplish the simple task of flagging a few rows (1-200) in ad">
> large table (many rows i.e. 1,000,000+). I want to find these flagged rows
> quickly. After some queries are run against these rows, the flag will be
> changed to indicate "don't process these rows". If I use a bit or int or c
har
> flag, and create an index, I will have every row in the index, which seems
a
> waste of space (correct me if I'm wrong).
Every row will be represented in the index, but not with all columns in the
table. You will only
have the indexed column (which is very narrow) and also the bookmark (for SQ
L Server to find the
row). The bookmark is either 8 bytes for a heap table, or the clustering key
if it is a clustered
table.

> 1) Are null rows indexed?
Yes, the NULLs are included in the index. Seen the the index' perspective, a
NULL is just like any
other value.
Tibor Karaszi, SQL Server MVP
http://www.karaszi.com/sqlserver/default.asp
http://www.solidqualitylearning.com/
Blog: http://solidqualitylearning.com/blogs/tibor/
"querylous" <querylous@.discussions.microsoft.com> wrote in message
news:4FD46302-C3C4-4290-A06F-B6F168C8A218@.microsoft.com...
> Hi, I want to accomplish the simple task of flagging a few rows (1-200) in
a
> large table (many rows i.e. 1,000,000+). I want to find these flagged rows
> quickly. After some queries are run against these rows, the flag will be
> changed to indicate "don't process these rows". If I use a bit or int or c
har
> flag, and create an index, I will have every row in the index, which seems
a
> waste of space (correct me if I'm wrong). If the flag is either a value or
> null, will the null rows still be indexed? If not, this seems like an idea
l
> way to go. If so, it seems like it would be better to keep another table t
o
> keep track of the rows needing processing temporarily. This aux table woul
d
> always contain only a few rows. So, this is really 2 questions;
> 1) Are null rows indexed? If not, then an index on a bit column (I
> understand that these CAN be indexed in SS 2000), would be a vary small, v
ery
> useful index for this problem. No?
> 2) I'm sure I'm not the only person in history to wonder about this
> situation; it must be quite common (yes, I've done my googling but no sati
s.
> answer); what is the OPTIMAL way to handle this situation? Perhaps it's
> neither of my ideas and something else entirely.
> Thanks!
> Chris|||Another option would be to create an indexed view for each flag state - if
for some reason you couldn't use an auxilary table, and/or you'd experience
poor performance due to low selectiveness of the index.
ML
http://milambda.blogspot.com/|||Thanks all. Since null rows ARE indexed, wouldn't all those nulls with just
a
few non nulls slow down the search of the index? I'm not exactly sure how th
e
search would run, but if it's b-tree, won't the search have to keep branchin
g
and branching and branching until it finds what it's looking for? Or does it
have some way of knowing "ok, all the actual values [non nulls] are right
here at the top of the index"'
"Tibor Karaszi" wrote:

> Every row will be represented in the index, but not with all columns in th
e table. You will only
> have the indexed column (which is very narrow) and also the bookmark (for
SQL Server to find the
> row). The bookmark is either 8 bytes for a heap table, or the clustering k
ey if it is a clustered
> table.
>
> Yes, the NULLs are included in the index. Seen the the index' perspective,
a NULL is just like any
> other value.
> --
> Tibor Karaszi, SQL Server MVP
> http://www.karaszi.com/sqlserver/default.asp
> http://www.solidqualitylearning.com/
> Blog: http://solidqualitylearning.com/blogs/tibor/
>
> "querylous" <querylous@.discussions.microsoft.com> wrote in message
> news:4FD46302-C3C4-4290-A06F-B6F168C8A218@.microsoft.com...
>|||The null rows will make the index larger on disk, but they should
not slow things down. An index on thisCol, if used in a
query containing (thisCol is not null), will quickly s

from
the root of the index directly down to the first non-null thisCol
value in the leaf level, then scan across the leaves with non-null
thisCol values.
Although it doesn't actually matter here, duplicate
index keys appear only once at levels above the leaf level of
the index. In no case will there be branching and branching
and looking until it finds what it wants. The initial s

to the
first index row retrieved is effectively a b-tree search within
only the unique values of the indexed columns. Since you
have very few unique values, the non-leaf part of your
index is very small.
Suppose you have 12 offices in a very large building. In
the lobby (the root of the index), there's a sign that says:
ChrisCom: Rooms 8230-8254
Microsoft: Rooms 101-8229
You don't blindly start in the middle of the building and
hop around looking at actual offices to find out where
to go (as you would if you were doing a binary search
on an ordered collection).
If there are a lot of distinct companies in the building,
you might see this, which is only a little more trouble:
AbbottCom - CanadaCom: See detailed directory at left.
CenterCom - LakshmiCom: See detailed directory on page below
MaryCom - ZsoltCom: See detailed directory at right.
Only distinct keys need to be in the upper levels of the index.
Steve Kass
Drew University
querylous wrote:
>Thanks all. Since null rows ARE indexed, wouldn't all those nulls with just
a
>few non nulls slow down the search of the index? I'm not exactly sure how t
he
>search would run, but if it's b-tree, won't the search have to keep branchi
ng
>and branching and branching until it finds what it's looking for? Or does i
t
>have some way of knowing "ok, all the actual values [non nulls] are right
>here at the top of the index"'
>"Tibor Karaszi" wrote:
>
>|||Well, it sort of depends on what you're doing, and what the flag means.
For example, I once had a client that decided that they wanted to
basically archive all of the data that they'd ever used in their
database. (For some compliance reasons.) The solution that they chose
was to give each row a BIT field called ActiveFL (I can hear CELKO
scream right now!), and then basically append it to each query like so:
SELECT * FROM Person
WHERE Name LIKE 'Joe'
AND ActiveFL = 1
So pretty much every query in the system had "AND ActiveFL = 1" appended
to the end. What made matters much worse was that these were rows that
didn't stay "active" for a long time - they basically got inactivated
after a short period of time. So as time went by, you might have, say,
1 million inactive rows and a thousand active rows. Well, what's the
problem with this?
1) You have to remember to add "AND ActiveFL = 1" to all of your queries
in order to not pick up false data. Not toooo bad, I guess you could
use a view for this...
2) Here's the real killer -- You can't effectively index the tables.
Say you have an index on "Name" in my example. If you know that you're
never going to search for the inactive records by name, tough luck,
you're still indexing on the word "Name" for the million inactive rows
that you don't care about. This isn't an efficient indexing scheme, and
as you can imagine, the index overhead just grows over time like this.
3) Even if you were going to index on ActiveFL (as a single column), you
might be in trouble. It's not a selective index, and will probably be
ignored by the optimizer, depending on what the proportion of active to
inactive rows are. The only thing you could possibly do is add ActiveFL
to each index on the table, which will increase your index size and only
compound issue 2.
IMO, In situations like these, where you really don't want to consider
the rows anymore, deleting the inactive rows and moving them to another
"historical" or "archive" table can have a lot of benefits from the
standpoint of performance and manageability.
-Dave
querylous wrote:
> Hi, I want to accomplish the simple task of flagging a few rows (1-200) in
a
> large table (many rows i.e. 1,000,000+). I want to find these flagged rows
> quickly. After some queries are run against these rows, the flag will be
> changed to indicate "don't process these rows". If I use a bit or int or c
har
> flag, and create an index, I will have every row in the index, which seems
a
> waste of space (correct me if I'm wrong). If the flag is either a value or
> null, will the null rows still be indexed? If not, this seems like an idea
l
> way to go. If so, it seems like it would be better to keep another table t
o
> keep track of the rows needing processing temporarily. This aux table woul
d
> always contain only a few rows. So, this is really 2 questions;
> 1) Are null rows indexed? If not, then an index on a bit column (I
> understand that these CAN be indexed in SS 2000), would be a vary small, v
ery
> useful index for this problem. No?
> 2) I'm sure I'm not the only person in history to wonder about this
> situation; it must be quite common (yes, I've done my googling but no sati
s.
> answer); what is the OPTIMAL way to handle this situation? Perhaps it's
> neither of my ideas and something else entirely.
> Thanks!
> Chris|||querylous (querylous@.discussions.microsoft.com) writes:
> Hi, I want to accomplish the simple task of flagging a few rows (1-200)
> in a large table (many rows i.e. 1,000,000+). I want to find these
> flagged rows quickly. After some queries are run against these rows, the
> flag will be changed to indicate "don't process these rows". If I use a
> bit or int or char flag, and create an index, I will have every row in
> the index, which seems a waste of space (correct me if I'm wrong). If
> the flag is either a value or null, will the null rows still be indexed?
> If not, this seems like an ideal way to go. If so, it seems like it
> would be better to keep another table to keep track of the rows needing
> processing temporarily. This aux table would always contain only a few
> rows. So, this is really 2 questions;
In addition to the other posts:
o Indexing on a bit column in a case like yours is likely to be
successful, as the index will be very selective for finding flagged
rows.
o Assuming that your flag is a bit column, there is a gotcha. Don't say:
SELECT ... FROM tbl WHERE flag = 1
but say:
SELECT ... FRPM tbl WHERE flag = convert(bit, 1)
This is because of the rules for data-type conversions in SQL. The
literal 1 is an integer, so the flag will be implicitly converted to
bit. But once the column is involved in a operation, the index
can no longer be s

ed, only scanned, which of course is less
effecient.
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||Steve thanks for this great explanation, it really answers a lot of question
s
I've always had about indexing. So, I think it's fine to create an index on
a
flag column for this table; also fine to create an aux table; I think I'll
try a flag first and see how it goes.
Chris|||Hi Dave, good points; the table on which I'll have the flag contains a lot o
f
data which is used very actively throughout the system. It's a table
containing scheduling info, and relates to many other tables and joins in th
e
system used all the time. But, the flag isn't used to determine which record
s
are active or inactive; it's used to determine if a template has been
generated (a process totally outside the db; ie. we run a script that
generates templates for another application; it queries the db for all rows
we haven't generated a template for, and generates one for each row not
already done. Then it marks those rows as done). Even when rows are marked
"needsTemplate = 0", we'll still be accessing those rows all the time for
other purposes. Ultimately, since it is a schedule, many rows will become
archival; at that point, we will move them to a warehouse table.
So, this is the challenge; based on what I've learned here, I think it's ok
to go with such a flag, or, definitely ok to use an aux table.
Thanks for your input!
Chris
"Dave Markle" <"dma[remove_ZZ]ZZrkle" wrote:

> Well, it sort of depends on what you're doing, and what the flag means.
> For example, I once had a client that decided that they wanted to
> basically archive all of the data that they'd ever used in their
> database. (For some compliance reasons.) The solution that they chose
> was to give each row a BIT field called ActiveFL (I can hear CELKO
> scream right now!), and then basically append it to each query like so:
> SELECT * FROM Person
> WHERE Name LIKE 'Joe'
> AND ActiveFL = 1
> So pretty much every query in the system had "AND ActiveFL = 1" appended
> to the end. What made matters much worse was that these were rows that
> didn't stay "active" for a long time - they basically got inactivated
> after a short period of time. So as time went by, you might have, say,
> 1 million inactive rows and a thousand active rows. Well, what's the
> problem with this?
> 1) You have to remember to add "AND ActiveFL = 1" to all of your queries
> in order to not pick up false data. Not toooo bad, I guess you could
> use a view for this...
> 2) Here's the real killer -- You can't effectively index the tables.
> Say you have an index on "Name" in my example. If you know that you're
> never going to search for the inactive records by name, tough luck,
> you're still indexing on the word "Name" for the million inactive rows
> that you don't care about. This isn't an efficient indexing scheme, and
> as you can imagine, the index overhead just grows over time like this.
> 3) Even if you were going to index on ActiveFL (as a single column), you
> might be in trouble. It's not a selective index, and will probably be
> ignored by the optimizer, depending on what the proportion of active to
> inactive rows are. The only thing you could possibly do is add ActiveFL
> to each index on the table, which will increase your index size and only
> compound issue 2.
> IMO, In situations like these, where you really don't want to consider
> the rows anymore, deleting the inactive rows and moving them to another
> "historical" or "archive" table can have a lot of benefits from the
> standpoint of performance and manageability.
> -Dave
> querylous wrote:
>

2012年3月21日星期三

Flag vs. Auxiliary Table

没有评论:

发表评论

FTS Index Size

Blog Archive

About Me