Understanding data – What about a collection of numbers? – Basics
Before we get to Big Data, let us understand what data is. At the bare minimum, all data is a bunch of symbols. If you hand me a page full of data points, each data point is just a string of symbols such as “11.23”, “Elm Street”, “345-333-3333”, “Mary had a little lamb”, “November 22 1963” or “12:30 PM”. It is up to us to look at these symbols, and at the context of the process that created them, and classify them as real numbers or words. Dates and times are also numbers: a counter of the seconds or milliseconds elapsed since a reference point in history, aggregated up into years, months, days, hours, minutes and seconds. A Boolean is just a choice between two options, and its generic version is a choice among more than two options, which can again be coded as a number. A geographical location is denoted by a pair of numbers, the angles (latitude and longitude) measured from reference lines on earth chosen by man. Essentially there are only two fundamental kinds of data: numbers and words. Here words could be a single letter, several letters, a sentence or multiple sentences. Similarly, numbers could be a single real number or a collection of real numbers.
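As a rough illustration of how context turns symbols into numbers, here is a minimal Python sketch. The variable names are made up for this example, and treating the date and time as UTC is purely an assumption; the month name and AM/PM parsing also assume an English locale.

```python
from datetime import datetime, timezone

# Raw symbols as they might appear on a page of data.
raw_price = "11.23"
raw_date = "November 22 1963"
raw_time = "12:30 PM"

# Context tells us the first string is a real number.
price = float(raw_price)

# Context tells us the other two strings together form a moment in time.
# Treating that moment as UTC is an assumption made only for this sketch.
moment = datetime.strptime(f"{raw_date} {raw_time}", "%B %d %Y %I:%M %p")
moment = moment.replace(tzinfo=timezone.utc)

# The same moment as a single number: seconds elapsed since the reference
# point 1970-01-01 00:00 UTC. It is negative because the moment is earlier.
seconds_since_epoch = moment.timestamp()

# A Boolean is a choice between two options, which maps onto 0 and 1.
flag_as_number = int(True)

print(price, seconds_since_epoch, flag_as_number)
```

The reference point used here (the start of 1970) is just the convention Python's `timestamp()` follows; any other reference point would serve the argument equally well.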
Words are human constructs in which certain squiggles such as “a” denote a certain sound, and these sounds combine with other sounds into a collective sound that denotes a word such as “apple”. To that cumulative sound we map a real-world concept, in this case a specific kind of fruit. The mapped concept could also be an imaginary one that is well understood by people in the real world, like a god.
In fact there are other kinds of data (external observations) which seem quite different from words and numbers, but they are typically recorded as a collection of numbers. These are pictures and sounds. Pictures and sounds, though not fully man-made like words, are articulated and observed through human constructs. For instance, the concept of color is an entirely human concept. Other species observe colors too, but each species has a different conception of them. Color in its physical sense is a frequency or a combination of frequencies, but as the photons hit the rods and cones in our eyes, we give it a different meaning. Similarly, sound is a longitudinal wave composed of compressions and rarefactions in a medium, but as it impinges on our tympanum (eardrum) we give it a different meaning. These interactions between our sense organs and nature have been codified into sets of numbers and stored as pictures and sound segments. They are codified in such a manner that we can regenerate the same experience for our sense organs. So what is stored is a replayable recording rather than the true essence of external nature, which is not relevant to us.
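To make the idea of a "replayable recording" concrete, here is a small Python sketch of a sound and a picture stored as plain numbers. The sample rate, the tone frequency and the 2x2 picture are all hypothetical values chosen for illustration, not a description of any particular file format.

```python
import math

# A sound: the pressure wave is measured at regular instants and each
# measurement is stored as a number. Both constants are hypothetical.
SAMPLE_RATE = 8000   # measurements per second
FREQUENCY = 440.0    # tone frequency in cycles per second
samples = [math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE) for n in range(8)]

# A picture: a grid of pixels, each stored as three numbers (red, green,
# blue) between 0 and 255, chosen to match how our eyes respond to light.
picture = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

# Flattened, the picture is nothing more than a collection of numbers.
flat = [channel for row in picture for pixel in row for channel in pixel]

print(samples)
print(flat)
```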
Now, having covered to an extent the various kinds of data points we encounter in our world, let us look at a bunch of numbers. These numbers typically denote some attribute of a certain object of interest, unless they are numbers artificially generated by humans. So all numbers can be divided into numbers observed in nature (a stock price, or the age of the universe) and artificially generated numbers. Please note that I consider a stock price to be not a generated number but a number observed in the world: the world which was given to humans and which has been embellished by human activities. By human-generated numbers I only mean, say, the output of a program which randomly generates numbers. If, after generating such a number, we use it to denote something in the human world, for instance a voter id, then the number changes its nature from an artificial number to a number observed in nature (the voter id of a voter). Artificial numbers to which humans have attached no such significance play no meaningful role in our context of data, so they can be ignored for the current discussion. I introduced them only for the sake of philosophical completeness.
So all numbers we encounter in nature have a certain significance attached to them and, as indicated before, they typically denote some attribute of a certain object or collection of objects. There are also numbers that are, say, coefficients which define a particular model or theory, for instance the gravitational constant. In such a case these coefficients are attributes of a conceptual object: the theory or the model. Sometimes constants are shared across multiple connected theories, and in such a case the number gets its significance as a number associated with the integrated theory. I am not able to think of a significant number which can't be construed as an attribute of an object, and hence I feel I am safe in assuming that every number we observe in nature draws its significance from the object associated with it. These associations can be man-made or natural, but either way the significance arises from the association.
Having established the philosophical framework that “data”, here a bunch of numbers, hinges on, let us move on to understanding a bunch of numbers.
- First, we observe that a bunch of numbers contains a certain number of numbers, an attribute of the collection that we call its count. This count can be very low. If the count is, say, 2 or 3, there is really not much of statistical significance you can draw out of the collection. What count is large enough for statistical significance to start to matter is a tricky question, and it hinges on what you want to do with the statistical analysis of these numbers.
- Given a bunch of numbers, you can order them by comparing them with one another. Such an ordered bunch will always have a highest number and a lowest number. These are derived numbers and are attributes of the collection: they define its boundaries, or range.
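- In that ordered collection we can find the middle number, such that there are an equal number of numbers above and below it. That is called the median. (When the count is even there is no single middle number, and the median is conventionally taken as the average of the two middle numbers.)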
- Then we can find the duplicates and arrive at the set of distinct numbers in the collection. This is a subset of the original collection and hence a different, derived collection.
- Then we can also count how many times each of these distinct numbers appears in the original collection, and from those counts form another collection, the count collection. This collection is linked with the original collection. The counts can again be ordered, and hence we again have boundaries for them: a highest and a lowest count. The number in the original collection which has the highest count is called the mode. Please note that there is no special name for the number which has the lowest count. That is because there are many numbers which don't appear in the collection at all, and the count for those numbers is zero. Ideally those are the numbers which form the true statistical opposite of the mode, and as there are too many of them the concept becomes trivial, so we don't have a special name for it.
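The attributes listed above can be sketched in a few lines of Python. This is only an illustration: the list `numbers` is made up, and the standard library's `statistics.median` and `collections.Counter` are just one convenient way of doing the arithmetic.

```python
from collections import Counter
from statistics import median

# A made-up bunch of numbers, used only for illustration.
numbers = [4, 8, 15, 8, 23, 4, 8, 42]

# Count: how many numbers are in the bunch.
count = len(numbers)

# Ordering gives the boundaries (range) of the collection.
ordered = sorted(numbers)
lowest, highest = ordered[0], ordered[-1]

# Median: the middle of the ordered collection. With an even count,
# statistics.median averages the two middle numbers.
middle = median(ordered)

# Distinct numbers: a derived collection with duplicates removed.
distinct = sorted(set(numbers))

# Count collection: how often each distinct number appears.
counts = Counter(numbers)

# Mode: the number with the highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(count, lowest, highest, middle, distinct, mode_value, mode_count)
```

Note that a number absent from the list, such as 5, simply has a count of zero in `counts`, which is the reason given above for why the least frequent number has no special name.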
The mode and the median might or might not say something about the bunch of numbers. For instance, say the bunch of numbers describes the heights of trees in a garden, and say the garden has an almost equal number of lemon trees around 5-8 ft tall and coconut trees around 90 ft tall. The numbers denoting the heights of the trees in that garden will then form one bunch around 5-8 ft and another bunch around 90 ft. Depending on the number of trees in each set, the median could be around 8 ft or around 90 ft, the mode could also be around 8 ft or 90 ft, and the average or mean (the sum of all the numbers divided by the number of numbers) could be around 50 ft. None of these three numbers, median, mode or mean, says anything meaningful about this bunch of numbers. Only in certain cases do attributes like the median, mode and mean of a collection truly give us an idea about the members of the collection, so we should be wary of such derived numbers which talk about a collection. In fact, one of the things we didn't consider in the above situation is that some of the trees could still be growing; we made an implicit assumption that all of them are fully grown.
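A small Python sketch makes the point with numbers. The heights below are invented for illustration (seven lemon trees and six coconut trees), and `statistics.mean`, `median` and `mode` stand in for the hand calculations.

```python
from statistics import mean, median, mode

# Invented heights in feet, for illustration only.
lemon_heights = [5, 6, 6, 7, 8, 8, 8]
coconut_heights = [88, 89, 90, 90, 91, 92]
heights = lemon_heights + coconut_heights

overall_mean = mean(heights)      # sum of all heights / number of trees
overall_median = median(heights)  # middle of the ordered heights
overall_mode = mode(heights)      # most frequent height

# How far is the mean from the nearest actual tree?
distance_to_nearest_tree = min(abs(h - overall_mean) for h in heights)

print(overall_mean, overall_median, overall_mode, distance_to_nearest_tree)
```

With these made-up values the mean lands between the two bunches, tens of feet away from any actual tree, while the median and mode both sit at 8 ft and say nothing about the coconut trees. Adding two more coconut trees would flip the median to the other bunch.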
In any case, by looking at the above numbers we can conclude that there are certain discrete clusters in this bunch, one around 5-8 ft and another around 90 ft, which suggests that this collection has some deeper structure to it. In this case it is a collection of two disparate collections, made of attributes of different kinds of objects. So within each cluster the mean, mode and median might well make sense, as the base numbers all spring from a similar kind of object in nature. Of course you can find a stunted coconut tree or an unusually tall lemon tree, but those are true anomalies of nature and its rules. In the case of trees the clusters have a direct physical significance, but in other cases the underlying physical significance is not as explicit as this.
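One simple way to see this structure in code is to split the ordered heights at the largest gap between neighbors and then summarize each cluster separately. This largest-gap split is a heuristic I am assuming for the illustration; it works here only because there are exactly two well-separated bunches. The heights are invented.

```python
from statistics import mean, median

# Invented heights in feet, for illustration only.
heights = [5, 6, 6, 7, 8, 8, 8, 88, 89, 90, 90, 91, 92]

ordered = sorted(heights)

# Gap between each number and the next one in the ordered collection.
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

# Split right after the largest gap (assumes exactly two clusters).
split_at = gaps.index(max(gaps)) + 1
short_cluster = ordered[:split_at]
tall_cluster = ordered[split_at:]

# Within each cluster the derived numbers describe real trees again.
for name, cluster in (("short", short_cluster), ("tall", tall_cluster)):
    print(name, len(cluster), mean(cluster), median(cluster))
```

Within each cluster the mean and median fall close to actual tree heights, which is what the paragraph above leads us to expect.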
For instance, let us consider the queue lengths at a bank teller counter. This set of numbers is distinctly different from the numbers in the previous scenario. For one thing, queue lengths don't stabilize the way tree heights do. To understand the difference, we need to ponder the processes which generate queue lengths and tree heights. They are distinctly different processes and hence lead to different results and different numbers. In the case of trees, once they are fully grown the sampling (observation) frequency of the heights doesn't really matter that much. In the bank teller queue scenario, on the other hand, the sampling frequency is quite critical. It wouldn't be wise to sample the queue once a day, say an hour after the bank opens; that kind of sampling would defeat the purpose of measuring this attribute. On the other hand, sampling it every second would generate too much fluff. So what is the right sampling frequency, one which doesn't create too much superfluous data and also doesn't lose significant information? To answer this, one needs to look deeply into the process which generates this attribute called queue length and understand what creates the queue in the first place. We will look at this in another post.
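In the meantime, a tiny Python sketch shows what is at stake. The minute-by-minute queue lengths below are entirely made up, and the helper `sample` is a hypothetical name; the sketch only shows how much a coarse sampling interval can hide, not how to choose the right one.

```python
# Made-up queue lengths, one observation per minute for 30 minutes.
trace = [0, 1, 1, 2, 2, 3, 5, 8, 11, 12, 10, 7, 4, 2, 1,
         0, 0, 1, 1, 2, 1, 0, 0, 1, 2, 4, 6, 5, 3, 1]


def sample(observations, every):
    """Keep one observation out of each `every` consecutive ones."""
    return observations[::every]


for every in (1, 5, 15):
    sampled = sample(trace, every)
    # Fewer samples mean less data to store, but the peak may be missed.
    print(every, len(sampled), max(sampled))
```

With this trace, sampling every minute sees the peak of 12, sampling every 5 minutes sees at most 10, and sampling every 15 minutes sees an empty queue both times.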