Benford’s Law

You might have thought that the distribution of the first digit in a set of data (for example, the populations of the world’s countries) would be even. It turns out that it isn’t, and the actual distribution is described by Benford’s Law.

So instead of a one-in-nine chance of a number starting with each digit from 1 through 9, the probability of a value starting with a 1 is about 30%, starting with a 2 about 17%, with a 3 about 12%, and so on, until the chance of something starting with a 9 is just 4.6% (figures from Wikipedia).
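For reference, the distribution has a simple closed form: the probability that the leading digit is d is P(d) = log10(1 + 1/d). For d = 1 that gives log10(2) ≈ 0.301, which is where the 30% figure comes from, and for d = 9 it gives log10(10/9) ≈ 0.046.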

From Wikipedia again:

This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.

But then when you think about it, it makes some sort of sense. If you have a population which is doubling every year, then if it starts at 1, next year it will be 2, then 4, 8, 16, 32, 64, 128, 256, and so on. The first digit in that series is a 1 pretty often.
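If you want to see that for yourself, here’s a quick sketch in T-SQL (SQL Server, which is what I’m working with these days) that counts the leading digits of a value that doubles 50 times:

-- Count the leading digits of a population that doubles every year.
-- A sketch only; any starting value shows the same skew.
WITH doubling AS (
    SELECT CAST(1 AS bigint) AS pop
    UNION ALL
    SELECT pop * 2 FROM doubling WHERE pop < 500000000000000
)
SELECT LEFT(CAST(pop AS varchar(20)), 1) AS leading_digit,
       COUNT(*) AS years
FROM doubling
GROUP BY LEFT(CAST(pop AS varchar(20)), 1)
ORDER BY leading_digit;

Out of those 50 values, a leading 1 turns up about 30% of the time and a leading 9 not at all, which is pretty close to the percentages above.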

Or think of the population of countries. As an example, take the population of India over time. When I was growing up in the ’90s, the population was in the 800 millions (thanks again, Wikipedia) and growing at about 20% per decade. In 1991 the population was 846,387,888 and by 2001 it had grown to 1,028,737,436. The first digit of the population had zipped through 3, 4, 5, 6, 7, 8 and 9 in the years from 1950 to 2001. Now that the first digit is a 1, it’s going to take a long time to tick over to a 2 at 20% growth per decade (or possibly to creep back to a 9).

I’m aware that I’m looking at the growth of populations rather than a static snapshot but I find it easier to visualise that way.

I heard about Benford’s Law about a year ago and found it quite interesting and unexpected. Looking back, though, I’m surprised I had never noticed it before, because in hindsight it seems almost logical. Reading the history of the law on Wikipedia, it seems it was only discovered reasonably recently and fully explained more recently still.

Note: Wolfram MathWorld has a few good examples of where this law holds on this page: http://mathworld.wolfram.com/BenfordsLaw.html


DISTINCT. Just avoid it! Please….

Don’t use DISTINCT, unless you really mean it.

I’ve come across this issue countless times: a developer writes a query that brings back a few duplicate rows, so they just put DISTINCT into it to remove the duplicates.

The problem with this is that when a query shows duplicate rows, the correct thing to do is usually to fix the cause of the duplicates rather than use DISTINCT to hide them.

A simple example: you have a CUSTOMER table and an ADDRESS table, where each customer has one or more addresses, and you want to find all the customers that have an address in a certain city.

The simple query to do this would be to join the two tables together and return the results.

select c.firstname, c.lastname, a.city
from customer c
join address a on (a.customerid = c.customerid)
where a.city = 'London'

Problem solved? Well, not really, because you then see that some customers have both a postal address and a residential address, and both might be in London, so those customers come back twice. So, you put a DISTINCT into the query, which solves the problem. Except it doesn’t necessarily…

DISTINCT is an indiscriminate tool, and I’ve found that its use is often caused by not understanding the data model that is the foundation of the database. When duplicate rows appear, DISTINCT is the first solution that comes to mind, rather than examining whether join conditions are missing or whether the join hierarchy has been misunderstood.
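In the example above, the better fix is to phrase the query as the question it actually means: which customers have at least one address in London? Written with EXISTS (same tables as before), duplicates can’t arise in the first place:

select c.firstname, c.lastname
from customer c
where exists (select 1
              from address a
              where a.customerid = c.customerid
              and a.city = 'London')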

This post has come about after finding that many of the reporting queries at work have DISTINCT thrown into them. That can mask real issues with the queries, and in my opinion DISTINCT should be reserved for exceptional cases only.


Cursors and SQL Server

I’ve recently started a job where I’ve moved from doing Oracle development to working as a tester, and working with SQL Server.

Most of the basics in SQL Server are the same as in Oracle. The methods for writing queries, modifying data and creating tables are pretty similar. However, it has taken a while to work out best practices for developing procedural T-SQL code. This has been a little difficult without a mentor: I’m just writing code that I think is good, with no one more experienced to help me improve it.

I’ve been reading up on cursors this week while trying to write some procedural code. Oddly, most of the articles I read told me that I shouldn’t use cursors at all. As a database professional this struck me as a little strange, and it wasn’t just the occasional article, but almost every article I found.

Don’t be a novice

After reading quite a few of them, it turns out that they were written for database novices. This article is a typical example. The first example it shows is how to use aggregate functions like COUNT(*) and SUM(column) instead of a cursor. Only a real database novice would even consider using a cursor for that.
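To be clear about the kind of thing I mean: rather than looping over rows and accumulating totals in variables, you just write something like this (ORDERS here is a made-up table for illustration):

select count(*) as order_count,
       sum(orderamount) as total_amount
from orders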

Many of them were saying that in at least 90% of the cases where you might want to use a cursor, you shouldn’t. Maybe I’ve simply been writing those 90% of cases as set-based SQL all along.

I think that part of my surprise about this is my background. I’ve worked with databases since my very first programming job. I worked for a company that did only database development and administration, so I had pretty good training right from the start.

I wonder if many people working with SQL Server have come from a .Net or other background and write procedural code rather than taking advantage of native database features. 

While loops vs cursors

One thing I did find was the number of articles that talked about using a WHILE loop rather than a cursor. An example of this sort of code (taken from this article at Code Magazine) is shown below. You can look in the linked article to see the cursor version for comparison. 

DECLARE @TransactionID int;

SET @TransactionID = (SELECT MIN(TransactionID)
                      FROM Production.TransactionHistory);

WHILE @TransactionID IS NOT NULL
BEGIN
    -- per-row work would go here

    SET @TransactionID = (SELECT MIN(TransactionID)
                          FROM Production.TransactionHistory
                          WHERE TransactionID > @TransactionID);
END

I’ve seen a lot of code using this while-loop form. It has the benefit of being nice and simple, and articles such as the one linked above even claim that the while-loop method is faster than a cursor. I accept that there are instances where the while loop is faster; however, I have two problems with the assertions made in the article:

  1. The while-loop code and the cursor code aren’t doing the same thing. The cursor retrieves three columns but the while loop only gets one. That might not sound like much of a difference, but it can have a large impact: I know from my Oracle work that if you are selecting just the primary key, the database can answer the query from the index without touching the table at all.
  2. Unless there is something I’m missing, this while-loop method will only work well if you are driving it from the primary key (or another indexed column). Each time through the loop, the database has to run a separate query to find the next row. That could be slow!

In my experience, the simplest code is often the most efficient, and I don’t think that the while-loop code is the simplest. There is more code involved in a cursor but a lot of it is filler keywords which take up space but don’t slow things down. The while-loop is sort of hand-coding a cursor, which doesn’t strike me as very efficient or durable.
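For comparison, here’s roughly what the equivalent cursor skeleton looks like in T-SQL (a sketch, with the per-row work left as a placeholder):

DECLARE @TransactionID int;

DECLARE txn_cursor CURSOR FAST_FORWARD FOR
    SELECT TransactionID
    FROM Production.TransactionHistory;

OPEN txn_cursor;
FETCH NEXT FROM txn_cursor INTO @TransactionID;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row work would go here

    FETCH NEXT FROM txn_cursor INTO @TransactionID;
END;

CLOSE txn_cursor;
DEALLOCATE txn_cursor;

Most of the extra lines are OPEN/FETCH/CLOSE/DEALLOCATE scaffolding; the database finds the next row for you instead of re-running a MIN() query on every iteration.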

Disclaimer

Note that this is all coming from an Oracle developer, so maybe some of my assumptions are a bit off the mark. I think I’m pretty good with general database development and concepts, but maybe that assumption is a bit off the mark too.

 


Just like every other blog, I’ve taken a long break!

I’ve left this blog alone for a while: I created the first six posts about nine months ago and have written nothing since. It has never quite found its feet and I haven’t put the time into it.

I think I found that the things I wanted to write about took a lot more words than I wanted to put into a blog post. I wanted to write a few hundred words, but proper posts were taking a few thousand. I also hadn’t been working for much of the past few months, and when the plan was to draw from my work experience, it’s a bit hard when there are no work experiences!

I’ve just started a new job as a data warehouse tester. The project is under a lot of pressure and I’m learning a lot of new skills. I might have to start writing a bit more…


Query performance tuning – part 0.1

Once a user can actually write a query, the first question you will eventually hear is “my query is slow… what can I do about it?”

Learning how to navigate the database and create queries that bring back the correct data is one problem, but it opens the door to a whole new one: making sure that those queries don’t take hours to run.

This post is only going to be a short introduction, since performance tuning can’t be covered in a single post. So, the questions I’m going to tackle are:

  • how to see how the database is interpreting your query
  • a very simple, practical way to check whether the database is doing the right thing.

Execution plan

After writing and submitting your query to an Oracle database, there are two things that the database will do (Disclaimer: this explanation might not be technically correct but it is correct enough).

Firstly, it will parse the query: checking that the syntax is valid, that all the objects it references exist (tables, views, columns and so on), and that the user has access to them. Next, it examines the query and works out how to run it. That “how” process is what I’m going to look at today, and it is called the execution plan.

In any Oracle tool, there will generally be a way to see the execution plan. I will show how to do it using Oracle SQL Developer, which is the tool I have to hand. Other tools such as TOAD or PL/SQL Developer will also have a simple way, and I will show how to do it in SQL*Plus at a later date.

SQL Developer is a free tool from Oracle that can be used as a graphical interface to write and run database queries.

The window below shows a simple query written against two of the tables in the SCOTT schema, EMP and DEPT. I have simply entered the query and hit the green play button above the text field to run the query. 
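The query itself was along these lines (reconstructed, so the exact columns may differ from the original screenshot):

select e.ename, e.job, d.dname
from emp e
join dept d on (d.deptno = e.deptno);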

You can also get the execution plan by clicking one of the icons above the query. In this case, it is the fourth icon along, which looks like a little hierarchy of boxes. The image below shows the result of running it for this simple query.

This shows the steps that the database will take to return the requested data. You read it from the bottom of the hierarchy upwards; here there are two separate access steps at the bottom of the tree, one for each table.

I spent years worrying about what each of the steps in the execution plan do and how slow or fast they are. Things have moved on a lot over the past few years and I don’t worry so much about the detail in most cases. For me, the most important column in the plan is the CARDINALITY column. This tells you how many rows the optimiser thinks are going to be processed at each step in the query. As long as this number is roughly correct then the optimiser will end up doing the right thing.
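One simple way to check those estimates against reality (in Oracle 10g and later) is to run the query with the GATHER_PLAN_STATISTICS hint and then pull the plan back with actual row counts alongside the estimates:

select /*+ gather_plan_statistics */ e.ename, d.dname
from emp e
join dept d on (d.deptno = e.deptno);

-- E-Rows is the optimiser's estimate, A-Rows is what actually happened
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

If E-Rows and A-Rows are in the same ballpark at every step, the optimiser is working from good information.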

So, I’ve gone through two useful bits of information here: how to find the execution plan, and how to see whether it’s doing the right thing. The one thing I haven’t gone through is what to look for when the expected cardinality is wrong. That is a lifetime of work, and I will look at explaining some simple approaches in the near future.


Great performance tuning resource

Thanks to Tom Kyte on Twitter, I’ve been introduced to a great and fairly simple performance tuning resource at http://use-the-index-luke.com/

It seems to provide useful information about how to measure performance and improve it on several different databases (Oracle, SQL Server, MySQL and others). I’m surprised that I’ve never heard of it until now.

Check it out.


Brevity isn’t working

I’m new to this blogging thing, so this is probably not a new problem for a lot of people out there.

I’m trying to write a short post about performance tuning, but it is already at 1000 words and I’m not even close to writing what I want! I was hoping to publish it today, but it’s nowhere near ready. Maybe I’ll have to split it in two and see where I’m at.
