Several years ago I started downloading day prices for Nasdaq. That is, the open, high, low, close, and volume figures for each trading day for each of the some 3000 or so stocks listed on the Nasdaq exchange. That’s what is going to be used to build our first data warehouse. Microsoft SQL Server 2005 has three tools specially created to do this sort of thing, and that is what we are going to use. Specifically, we will use Analysis Services and Integration Services to do the building, and to some extent we will use Reporting Services for data analysis.
SQL Server Management Studio has some fairly good tutorials on the use of these tools. However, typical of Microsoft tutorials, you get so buried in the specific details that it becomes very hard to grasp the big picture. So, before rushing to build something, we need to look at some basics. The best place to start is with something known as a Data Cube.
The Data Cube Viewed as a Concept
Forgetting for a moment how we actually implement such a thing, let’s just talk about it in the abstract for a moment. I mentioned in the last post how retail sales is a very popular topic to use in Data Warehouse tutorials. The ideas there are quite understandable, and when retail concepts are used as the basis for creating a data warehouse, it makes for easier understanding. Just briefly we will revert to that to make a couple of points, then forge ahead on stock analysis.
Data cubes tend to be oriented towards answering typical questions asked about how the business process is working. For example, if a company is selling bicycles, a top level manager might want to know how the sales of a particular model breaks down according to geographical location, gender of the purchasor, and age of the purchasor. In fact, the manager probably wants to know how this works out for each of the models the company sells. Additionally, since sales according to these factors must surely change over time, then time itself becomes one of the variables to consider. Let’s consider just the first three factors for a moment, then add the other two.
Now we are going to have two selections of gender, male and female. Suppose for the sake of argument that we consider three geographical areas where the particular model of bicycle is sold, Northeast (New York, New Jersey, and Connecticut), Midwest (Indiana and Illinois) and Southwest (Texas, New Mexico, and Arizona). Finally, suppose we have the following age groups of interest: 10 to 19, 20 to 29, 30 to 39, and 40 and above. We could think of each of these category sets as dimensions, like length, width, and height. We assume that we have the sales figures for a particular period according to these three dimensions, so imagine that we have a bunch of boxes, 2 genders times 3 geographical areas times 4 age groups or 24 boxes to be precise. We will label each of the 24 boxes with a particular gender, location, and age group, and place a piece of paper in the box with the corresponding sales figure. Now we stack the boxes.
We will let gender correspond to height so that the stack is only two boxes high. Length can be geographical location while width is age group. The stack of boxes will be 3 boxes long by four boxes wide, by two boxes high, and we now have a data cube. To know what the sales for this bicycle model is for the time period for which the data is collected, just look for the box with the desired location, age group, and gender.
Okay, the stack of boxes isn’t really what we want to think of as a cube, because its length, width and height are not equal. Even worse, we are surely going to want to include other dimensions, at least including the other two mentioned above, bicycle model and time period. Now we have no hope of creating a stack of boxes to illustrate the idea. Even though we can’t create a real physical stack when the dimensions get greater than three, we can still try to imagine it, and we certainly can implement it as a data cube in a database.
Now, Getting Back to Nasdaq Stocks
That’s enough of selling bicycles, let’s get back to Nasdaq. In this case we will have at least two dimensions for our cube. Those will be time and the name of the stock. We could also further subdivide according to the type of company, such as computer software, disk manufacturer, modem manufacturer, router manufacturer and the like. Either way we would actually have a two or three dimensional figure which we could actually draw (or stack if we had that many boxes and that much free time). Now we get specific.
The free day price data is available at eoddata.com. Four weeks of price data is available at any one time, with the prices for all stocks for a given day in a single comma separated value (CSV) file. It is free to register to download the files, and all you get is the occasional spam ad in your email box. On the whole, not a particularly bad bargain.
My data set goes back to December, 2005, well before I started working with SQL Server Analysis Services and making data cubes. I have been downloading the files regularly, archiving them, then putting them in a single table. Each row has colums for date, stock symbol, open, close, high, low and volume. The task is to transform that data into a form usable for creating a data cube. Since the data is already in a database table, we start from there. Actually we could have started with the CSV files, but since the table is already there that is our starting point.
A Few Words About Database Normalization, and a Link to More Details
One of the first thing you learn about creating a database is that it should be normalized. There are degrees of normalization, each successive degree having the characteristics of the degrees lower than it, plus an additional characteristic. I don’t want to get off on that subject too much, so here is a Wikipedia link which discusses the various normal forms in a bit more detail. The point is that we need a database as a starting point which is not the least bit normalized. It will contain several dimension tables, and some number of fact tables. It will be the basis for the initial form of the data warehouse we will build.
A Look Further Onward Before Looking At Details
Much can, and must be said about how to structure this starting, unnormalized database, and that we will do in the next post. For now, I want to say a word or two about generally what happens during and after the creation of that database. First we will employ SQL Server Integration Services to populate it with the existing stock data. Once that is done, SQL Server Analysis Services will be used to create the data cube from it. Once the data cube is created, SQL Server’s data mining tools will be used to look for trends in the data. To some extent we will use SQL Server Reporting Services to create some reports on that analysis. As you will see, Reporting Services does leave something to be desired as a reporting tool, so we will look at some additional approaches for the creation of custom reports.
What’s Coming Next
In our next episode, we will have quite a bit more to say about how to structure this unnormalized starting database, and how to populate it with data using Integration Services. The episode following that will describe how Analysis Services is used to build the cube.
And Finally, In Closing, a Word About Comments
In a word or two, I’d really like to hear yours. So far, the comments received on the various topics covered have been most welcome. With the exception of the spam comments from online poker sites, Russian mail order bride brokers, and the penis enlargement industry caught by the WordPress spam catcher, all the comments received have been most welcome, and have been approved. With this sudden change in topic, I’d like to know that you folks are still logging on and reading. See you next time