What is the most efficient way, memory-wise, to store 1 million phone numbers?
Apparently this is an interview question at Google, please give your ideas.
If memory is our biggest consideration, then we don't need to store the number at all, just the delta between i and i+1.
Now if numbers range from 200 0000 - 999 9999, then there are 7,999,999 possible phone numbers. Since we have 1-million numbers, and if we assume that they are uniformly distributed, we have an Expected distance of E = n_i+1 - n_i ~ 8 (3 bits) between sequential numbers n_i and n_i+1. So for a 32-bit int we could potentially store up to 10 sequential offsets (~400kb optimal total memory footprint), however its likely that we'll have some cases where we need an offset greater than 8 (Maybe we have 400, or 1500??). In which case, we can simply reserve the first 2 bits of the int as a header which tells us what frame size we use to read the bits it stores. For example, maybe we use: 00 = 3x10, 01 = 5x6, 10 = 7x4, 11 = 1*30.
"1"x10^10
(all 10 billion bits are 1). All alternating numbers starting at 0 would be "01"x(10^10)/2
(repeat the string "01" 5 billion times). The approach would fail if you get a random distribution of around half billion numbers, where the encoding size can exceed 10 billion bits.
Write them in ASCII, space-separated.
Zip the resulting string, using your favourite compression algorithm. If order isn't important, sorting them first might help the compression, gives you more repetition closer together.
Oh, did you want efficient random access? Then you should've said.
A possible solution is
Delta frequency distribution will be highly skewed.
I made an experiment using a simple BER-like packing approach for deltas using a 7+3+3+... bit encoding; encoding function was
def delta_write(x, b1, b2):
lim = 1 << (b1 - 1)
if x < lim:
bit_write(x, b1)
else:
bit_write(lim + (x & (lim - 1)), b1)
delta_write(x >> (b1 - 1), b2, b2)
(the two parameters 7 and 3 were determined experimentally)
With this approach I got one million 10-digits numbers with the first 5 digits chosen from one thousand random prefixes with an average of 8.83 bits per number (packed size 1104188).
Huffman coding on digit blocks would probably give very good results. If the numbers were of mixed type (e.g. some U.S., some overseas including access code) you'd need another couple bits to specify which type they were (and therefore which blocks to use).
If the numbers were in some small range--e.g. seven digits--the most compact way to store them would probably be to treat them as integers, sort them, and store (Huffman-encoded) differences in values. E.g. with 10^6 numbers in 7 digits (10^7 possibilities), you'd expect to need about log2(10) ~= 3.3 bits per number.
First I observe that they never start with 0 since 0 is used as escape character in the beginning. So I can simply regard phone numbers as integers. If this weren't the case I'd simply prepend a "1" to the number and then convert it to integer. This wouldn't affect coding efficiency significantly(Probably constant overhead of a few bytes). If there are other characters outside the 10 digits inside the phone numbers just encode with a base higher than 10. This will hurt efficiency though.
I'd order them by size ascending. Then calculate the differences. And then serialize the differences using protobuf as packed repeated fields.
This method is similar to RexKerr's method, except I use the lazy solution of protobuf over an huffman encoder. Probably a bit bigger since the protobuf integer encode is general purpose and doesn't take the probability distribution of phone numbers into account. But it's much easier to code since I just need to use an existing protobuf serializer. This will get problematic once you exceed the size of an UInt64, i.e. there are phone numbers longer than 19 digits. The file format still supports it, but most implementations won't.
Without an index access times will be pretty bad, but it should be rather compact.
A ternary search tree which is a special trie data structure will be memory efficient and will still enable (as trie) partial matching.
If you look at data field representations of the North American Numbering Plan, you'll conclude that the US phone numbers of 1+NPA+NXX+xxxx can be stored in less than 22 bits per phone number field in each area code. Add the area codes and the data representing any US (plus Canadian) phone number can comfortably fit in 32 bits. This is as a bit field representation -- not as an int.
Your thinking on this should not be US Centric, however. Surely the question is not just an exercise is compressing 1 million phone numbers into the fewest bits possible.
US phone numbers can be as short as 3 digits (internal PBX dial plans) to as long as 22 digits (1+NPA+NXX+xxxx+11digit internal PBX dial plan). If the phone number was limited to the ITU specified number format, you have up to 15 digits plus 1 bit for the '+'.
You then should probably define a variable bit field representation of any phone number between 3 digits and 22 digits (or 15 digits for ITU) with each bit field having a X bit header field to indicate the format of the field.
Then put these bit fields into a compressed bit array. Potentially that bit array can be indexed with a trie or some other method.
The efficiency of this is based on the format of the 1 million phone numbers, how quickly you want to access them, and how flexible that data structure is for more phone numbers in the future in differing formats. It is not just counting bits for the "right" answer IMHO.
Say we assume that each phone number is consistent with US format of (3-digit area code)-(7-digit number)
This is a 10-digit number.
However, there are rules for engagement when dealing with phone numbers. They're sparse, for one, meaning not all possible area codes are used. In this case, a simple tree is a-ok. I mean think about it... you only need 269 + 26 for Canada. That's pretty small, and you've cut out a large portion of space PLUS increased search time. Not only that, but it can be augmented for location information.
After that, you've got a 7 digit number. This can be stored in a single 32-bit integer. Sort on insert, and you've got a pretty fast mechanism for retrieval as you can do binary search on the remainder of the number.
I think we can you use Bit Vector here of size 1 million.
Java Example:
private BitSet dir = new BitSet(1000000);
public void addTelephoneNumber(int number)
{
dir.set(number);
}
public void removeTelephoneNumber(int number)
{
if (dir.get(number))
{
dir.flip(number);
}
}
public boolean isNumberPresent(int number)
{
return dir.get(number);
}
I guess an unsigned Int32 or for international numbers an unsigned Int64
Using 32bit unsigned ints that would be 4MB
Int
or snazzy packing to eliminate the probably-unused values of 0-999999999 (that's over half your 34 bit space!).
Commented
Mar 10, 2011 at 19:46
It truly depends on what operations you will want to run on the stored database.
The trivial approach is using unsigned integers, if you just need to store them probably some compression on the raw text representation using a dictionary would be smaller.
At a job interview the point of this question is to size up the applicant's problem-solving skills. Because the focus of the question is memory efficiency, In my opinion the correct answer is to ask the interviewer: "Are the phone numbers international, or are they limited to a single country?" If the numbers are limited to a single country, then the task of maximizing memory efficiency is simplified by the fact that each country has simple rules for distribution of phone numbers by state and city.
8 million bits with each bit 1(used) or 0(available) for one of 8 million numbers example
100 0000
900 0000
= 8 million phone numbers, bit 1 = 1000000 and bit 8 million = 9000000
/********************************************************************************
Filename: Phone_Numbers.c
Author: Paul Romsky
Company: Autoliv AEL
Date: 11 MAR 2013
Problem: What is the most efficient way, memory-wise, to store 1 million
phone numbers?
Caveats: There is no mention if the numbers are contiguous or not, so, to save
space the numbers should be contiguous. The problem (as a specification)
is rather vague and leaves a lot to interpretation, so many different
methods may be desired, but which one(s) desired is not surely known.
Are the phone numbers just the extension (last four digits), or do they
include the exchange (the leading 3 digits), or do they also include
area code and/or international numbers?
There is no mention of the first number, only the range of numbers, so
the first number 000-0000 is used. Although many numbers are not
normally used, they could in fact be used as valid number sequences
within the phone companies and are thus phone numbers nonetheless.
Solution: A simple algorithm. If the numbers are not contiguous a fractal algorithm
could be used.
A standard ANSI C compiler should pack this program into a very small
assembly module of only a few bytes.
Revisions:
Rev Date By Description
--- ----------- -------------------- -------------------------------------------
- 11 MAR 2013 P. Romsky Initial Coding
********************************************************************************/
/* Includes */
#include <stdio.h>
/* Functions */
/********************************************************************************
*
* Main Entry Point
*
********************************************************************************/
int main()
{
unsigned int Number;
/* 1,000,000 Phone Number Storage 000-0000 through 999-9999 */
for(Number = 0000000; Number < 10000000; Number++)
{
/* Retrieve Numbers */
printf("%07u\n", Number);
}
return 0;
}
/* End */
Apparently this is an interview question at Google, although this seems like its a bit too easy.
. Not too easy for me tough