Monday, February 20, 2012

Newbie: Self-Joins and optimization

Hi,
I have the following optimization question.
I have a table tblMeasurements
which has:
Date_time
StationID
PollutantCode
Value
If i want to extract for *the same* Station AND date_time a series of values
for different pollutants (pollut1, pollut2, ...) as in ;
Date_time | Station | Measurement of Pollut1 | Measurement of Pollut2 |
...etc
I have to create a self-join. (dont i ?) on StationID AND Date_time.
Because I ma trying to automate the SQL string creation through a VB
interface i chose not to use JOIN and/or brackets but a simple
FROM tblMeasurements AS M1, tblMeasurements AS M2, .....
WHERE M1.Date_time=M2.Date_time, M1.StationID=M2.StationID, ....
Two questions:
Am I on the right track or am i missing something fundamental? I think I
cannot escape the Self-Join.
And, Is there any way to optimize the query by * a lot* ? I mean, if
explicit self-joins will give me 5% more then its ok. But If we are talking
big numbers then i should consider it. As i said the reason i chose this
method is because its easier to code, since i dont know beforehand the
number of pollutants the user will choose. But if he/she chooses 5
pollutants, we are talking 5 copies of a 5 million-lines table!
Any feedback/comments will be *greatly* appreciated!
Thanx
-steveSounds like a cross-tab to me.
http://www.aspfaq.com/2462
This is my signature. It is a general reminder.
Please post DDL, sample data and desired results.
See http://www.aspfaq.com/5006 for info.
"steve" <steve@.here.com> wrote in message
news:TdS7e.47989$gJ1.857143@.weber.videotron.net...
> Hi,
> I have the following optimization question.
> I have a table tblMeasurements
> which has:
> Date_time
> StationID
> PollutantCode
> Value
> If i want to extract for *the same* Station AND date_time a series of
> values for different pollutants (pollut1, pollut2, ...) as in ;
> Date_time | Station | Measurement of Pollut1 | Measurement of Pollut2 |
> ...etc
> I have to create a self-join. (dont i ?) on StationID AND Date_time.
> Because I ma trying to automate the SQL string creation through a VB
> interface i chose not to use JOIN and/or brackets but a simple
> FROM tblMeasurements AS M1, tblMeasurements AS M2, .....
> WHERE M1.Date_time=M2.Date_time, M1.StationID=M2.StationID, ....
> Two questions:
> Am I on the right track or am i missing something fundamental? I think I
> cannot escape the Self-Join.
> And, Is there any way to optimize the query by * a lot* ? I mean, if
> explicit self-joins will give me 5% more then its ok. But If we are
> talking big numbers then i should consider it. As i said the reason i
> chose this method is because its easier to code, since i dont know
> beforehand the number of pollutants the user will choose. But if he/she
> chooses 5 pollutants, we are talking 5 copies of a 5 million-lines table!
> Any feedback/comments will be *greatly* appreciated!
> Thanx
> -steve
>
>|||>> Am I on the right track or am I missing something fundamental? <<
Normalization. Each pollutant measurement is a different attribute, but
you are putting data and metadata in your table, then self-joining to
restore the actual data model.
CREATE TABLE PollutionSamples
(sample_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
station_id INTEGER NOT NULL,
pollutant_1_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
pollutant_2_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
.
pollutant_n_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
PRIMARY KEY (sample_date, station_id));
choose. But if he/she chooses 5 pollutants, we are talking 5 copies of
a 5 million line[sic] table! <<
If you mean that you do not know what pollutants you are measuring,
then you do not have enough information to create a database. You need
the appropriate domain & default for each pollutant before you can
declare a column (not a line-- that is a spreadsheet term).
If you mean that you do not know what pollutants the user will want to
see in a report, then that is an issue for front end and probably some
kind of reporting tool that will give you charts, etc.|||thanx for you fast reply!
I dont understand why my way violates normalization'
Actually, you can just add a new code (pollutant code) and its measurements
over a period, very easily.
With your table, you would have to add a new field. Why?
Also, since 1 column - 1 pollutant in your table, that means that the time
periods would have to be the same
otherwise we would have a lot of NULLS. Which is not the case with my table
and my data.
e,g. CO data from 1998-1999, Ozone data from 2000-2004, etc.
As you can see my db knowledge is *minimal*, does what i say makes sense ?
TIA
-steve
"--CELKO--" <jcelko212@.earthlink.net> a crit dans le message de news:
1113587306.568558.147660@.z14g2000cwz.googlegroups.com...
> Normalization. Each pollutant measurement is a different attribute, but
> you are putting data and metadata in your table, then self-joining to
> restore the actual data model.
> CREATE TABLE PollutionSamples
> (sample_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
> station_id INTEGER NOT NULL,
> pollutant_1_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
> pollutant_2_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
> ..
> pollutant_n_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
> PRIMARY KEY (sample_date, station_id));
>
> choose. But if he/she chooses 5 pollutants, we are talking 5 copies of
> a 5 million line[sic] table! <<
> If you mean that you do not know what pollutants you are measuring,
> then you do not have enough information to create a database. You need
> the appropriate domain & default for each pollutant before you can
> declare a column (not a line-- that is a spreadsheet term).
> If you mean that you do not know what pollutants the user will want to
> see in a report, then that is an issue for front end and probably some
> kind of reporting tool that will give you charts, etc.
>|||Your structure s all right, (although I might consider adding the
pollutantCode to the Primary Key)
If you have the luxury of Modifying the Stored Proc code in the event a new
pllutant is added, then you can "hardCode" the columns for the list of
pollutants, instead of using multiple self-joins..
Select StationID, Date_Time,
Min(Case When PollutantCode = 'ozone' then Value End) Ozone,
Min(Case When PollutantCode = 'CO' then Value End) CO,
Min(Case When PollutantCode = 'SO2' then Value End) SO2,
Min(Case When PollutantCode = 'CO2' then Value End) CO2
From tblMeasurements
Group By StationID, Date_Time
Order By Date_Time
"steve" wrote:

> thanx for you fast reply!
> I dont understand why my way violates normalization'
> Actually, you can just add a new code (pollutant code) and its measurement
s
> over a period, very easily.
> With your table, you would have to add a new field. Why?
> Also, since 1 column - 1 pollutant in your table, that means that the time
> periods would have to be the same
> otherwise we would have a lot of NULLS. Which is not the case with my tabl
e
> and my data.
> e,g. CO data from 1998-1999, Ozone data from 2000-2004, etc.
> As you can see my db knowledge is *minimal*, does what i say makes sense ?
> TIA
> -steve
> "--CELKO--" <jcelko212@.earthlink.net> a écrit dans le message de news:
> 1113587306.568558.147660@.z14g2000cwz.googlegroups.com...
>
>|||On 15 Apr 2005 10:48:26 -0700, --CELKO-- wrote:

>Normalization. Each pollutant measurement is a different attribute, but
>you are putting data and metadata in your table, then self-joining to
>restore the actual data model.
>CREATE TABLE PollutionSamples
>(sample_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
> station_id INTEGER NOT NULL,
> pollutant_1_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
> pollutant_2_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
> ..
> pollutant_n_ value DECIMAL (6,2) DEFAULT 0.0 NOT NULL,
>PRIMARY KEY (sample_date, station_id));
Hi Joe,
I really can't believe that you have proposed a structure like this! It
would mean adding a column to the table each time a new pollutant had to
be measured. And I'm not even sure if the full design would fit in the
maximum number of columns allowed in SQL Server.
FWIW, I've been working at a lab where (a.o.) ground samples were
measured for various pollutants. I can't recall the exact number of
pollutants measured, but I think it was somewhere near the one hundred
mark. And that was 20 years ago - the number will probably have
increased since then! Do you really want that many columns in your
table? Would you have seperate columns for each product in a table
holding daily sales in supermarket branches?
P.S.: In our modern world, a default of 0.0 for a pollutant measurement
is either silly or very wishful thinking...
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)|||>> It would mean adding a column to the table each time a new pollutant
had to be measured. And I'm not even sure if the full design would fit
in the maximum number of columns allowed in SQL Server. <<
The list of pollutants that the US requires to be tested have been MUCH
more stable than the ZIP codes we use for our postal code. The problem
is that each attribute's measurement is different scale -- parts per
million, parts per thousand, adjustment for seasons, etc. And I agree
that normally I would not like to see a reallllllllllly long row in a
table, but if they are different attributes of the **same sample**,
then this is the right way. Otherwise, you are splitting a single
sample over multiple rows and will have to re-assemble it later. Hey,
we never minded files with over 100 fields, did we?
either silly or very wishful thinking...<<
LOL! I needed a default for the pseudo-code. When I lived in Los
Angeles, the automobile pollution stations would caliberate their test
wands by waving them in the air during morning traffic. Los Angeles
air quality could be so low that they would get NEGATIVE reading from
an automobile tailpipe.

No comments:

Post a Comment