74

What is the most efficient way, memory-wise, to store 1 million phone numbers?

Apparently this is an interview question at Google, please give your ideas.

12
  • 2
    @Phill: there basically is no CW anymore.
    – Matt Ball
    Commented Mar 10, 2011 at 16:21
  • 29
    @Dylan: not quite zero, you do have to remember where you left the printout. Commented Mar 10, 2011 at 16:52
  • 5
    Apparently this is an interview question at Google, although this seems like its a bit too easy.. Not too easy for me tough Commented Sep 13, 2011 at 13:30
  • 3
    First you have to define a phone number. 7 digits (US local)? 10 digits (US long distance)? Or something more exotic-- 5 to 8 digits (China local)? 9 to 12 digits (China, as dialed from outside the country)? I'm sure there are other patterns also, these are just the ones I know. The density of the space matters in how you would pack it. Commented Oct 8, 2011 at 20:48
  • 3
    Does anyone else notice this is basically a simplification of the disk sorting problem in the first 20 pages of Programming Pearls? Right down to the use of phone numbers as the domain and memory as your greatest consideration when weighing design trade-offs. The answer is a bitarray or bit vector.
    – nsfyn55
    Commented May 13, 2012 at 15:52

14 Answers 14

47

If memory is our biggest consideration, then we don't need to store the number at all, just the delta between i and i+1.

Now if numbers range from 200 0000 - 999 9999, then there are 7,999,999 possible phone numbers. Since we have 1-million numbers, and if we assume that they are uniformly distributed, we have an Expected distance of E = n_i+1 - n_i ~ 8 (3 bits) between sequential numbers n_i and n_i+1. So for a 32-bit int we could potentially store up to 10 sequential offsets (~400kb optimal total memory footprint), however its likely that we'll have some cases where we need an offset greater than 8 (Maybe we have 400, or 1500??). In which case, we can simply reserve the first 2 bits of the int as a header which tells us what frame size we use to read the bits it stores. For example, maybe we use: 00 = 3x10, 01 = 5x6, 10 = 7x4, 11 = 1*30.

3
  • 9
    I really like this comment. For the lay man, what you need to do is sort the numbers from smallest to largest. Store the first number in the list. Then for the next number, only store the difference. So an easy example is a) 555-1234 b) 555-2234. In this case 5552234 - 5551234 = 1000. So your storage would be 5551234,1000,... Great thinking, I think this is what Google would be looking for. they never mentioned access speed but I would have included an alternative answer with this which takes that into account.
    – Talon
    Commented Nov 7, 2013 at 15:24
  • 1
    Appreciate the "delta" between one number and the other... Wow! Awesome!
    – CodeMad
    Commented Jan 29, 2015 at 20:13
  • How much further can you get if you use a bit array to store numbers, as suggested in another answer? For 10-digit phone numbers you'd need a 10-billion bit bit-array. Then you can compress patterns e.g. all possible phone numbers are encoded as "1"x10^10 (all 10 billion bits are 1). All alternating numbers starting at 0 would be "01"x(10^10)/2 (repeat the string "01" 5 billion times). The approach would fail if you get a random distribution of around half billion numbers, where the encoding size can exceed 10 billion bits.
    – Marius
    Commented Nov 10, 2018 at 3:09
28

Write them in ASCII, space-separated.

Zip the resulting string, using your favourite compression algorithm. If order isn't important, sorting them first might help the compression, gives you more repetition closer together.

Oh, did you want efficient random access? Then you should've said.

1
  • 2
    A generic packing is quite inefficient because it cannot take a serious advantage from the sorting. Using for example sorting + delta coding you can get close to 1/3 the size of bzip2 --best (I made a test with one million numbers of 10 digits with 1000 5-digits prefixes: bzip2=3660874, delta=1104188, raw=10000000)
    – 6502
    Commented Oct 8, 2011 at 19:44
10

A possible solution is

  1. sort the numbers
  2. encode deltas from one number to the next one

Delta frequency distribution will be highly skewed.

I made an experiment using a simple BER-like packing approach for deltas using a 7+3+3+... bit encoding; encoding function was

def delta_write(x, b1, b2):
    lim = 1 << (b1 - 1)
    if x < lim:
        bit_write(x, b1)
    else:
        bit_write(lim + (x & (lim - 1)), b1)
        delta_write(x >> (b1 - 1), b2, b2)

(the two parameters 7 and 3 were determined experimentally)

With this approach I got one million 10-digits numbers with the first 5 digits chosen from one thousand random prefixes with an average of 8.83 bits per number (packed size 1104188).

2
  • I though about the same steps 1. and 2., with Huffman's used for the encoding. Curious if it will give better result...
    – DK.
    Commented Oct 11, 2011 at 19:13
  • @DK: May be. The nice thing of BER encoding is that there is no tree to store (because it's a predefined one) so as usual YMMV. If I did the computation correctly the minimum theoretical average of bits per number when packing one million numbers built from 1000 prefixes and 5 free digits is about 4.68 bit per number (plus the storage for 1000 prefixes), so apparently 8.83 is still far from the optimum.
    – 6502
    Commented Oct 11, 2011 at 19:59
7

Huffman coding on digit blocks would probably give very good results. If the numbers were of mixed type (e.g. some U.S., some overseas including access code) you'd need another couple bits to specify which type they were (and therefore which blocks to use).

If the numbers were in some small range--e.g. seven digits--the most compact way to store them would probably be to treat them as integers, sort them, and store (Huffman-encoded) differences in values. E.g. with 10^6 numbers in 7 digits (10^7 possibilities), you'd expect to need about log2(10) ~= 3.3 bits per number.

7

First I observe that they never start with 0 since 0 is used as escape character in the beginning. So I can simply regard phone numbers as integers. If this weren't the case I'd simply prepend a "1" to the number and then convert it to integer. This wouldn't affect coding efficiency significantly(Probably constant overhead of a few bytes). If there are other characters outside the 10 digits inside the phone numbers just encode with a base higher than 10. This will hurt efficiency though.

I'd order them by size ascending. Then calculate the differences. And then serialize the differences using protobuf as packed repeated fields.

This method is similar to RexKerr's method, except I use the lazy solution of protobuf over an huffman encoder. Probably a bit bigger since the protobuf integer encode is general purpose and doesn't take the probability distribution of phone numbers into account. But it's much easier to code since I just need to use an existing protobuf serializer. This will get problematic once you exceed the size of an UInt64, i.e. there are phone numbers longer than 19 digits. The file format still supports it, but most implementations won't.

Without an index access times will be pretty bad, but it should be rather compact.

7

A ternary search tree which is a special trie data structure will be memory efficient and will still enable (as trie) partial matching.

http://en.wikipedia.org/wiki/Ternary_search_tree

5

If you look at data field representations of the North American Numbering Plan, you'll conclude that the US phone numbers of 1+NPA+NXX+xxxx can be stored in less than 22 bits per phone number field in each area code. Add the area codes and the data representing any US (plus Canadian) phone number can comfortably fit in 32 bits. This is as a bit field representation -- not as an int.

Your thinking on this should not be US Centric, however. Surely the question is not just an exercise is compressing 1 million phone numbers into the fewest bits possible.

US phone numbers can be as short as 3 digits (internal PBX dial plans) to as long as 22 digits (1+NPA+NXX+xxxx+11digit internal PBX dial plan). If the phone number was limited to the ITU specified number format, you have up to 15 digits plus 1 bit for the '+'.

You then should probably define a variable bit field representation of any phone number between 3 digits and 22 digits (or 15 digits for ITU) with each bit field having a X bit header field to indicate the format of the field.

Then put these bit fields into a compressed bit array. Potentially that bit array can be indexed with a trie or some other method.

The efficiency of this is based on the format of the 1 million phone numbers, how quickly you want to access them, and how flexible that data structure is for more phone numbers in the future in differing formats. It is not just counting bits for the "right" answer IMHO.

3

Say we assume that each phone number is consistent with US format of (3-digit area code)-(7-digit number)

This is a 10-digit number.

However, there are rules for engagement when dealing with phone numbers. They're sparse, for one, meaning not all possible area codes are used. In this case, a simple tree is a-ok. I mean think about it... you only need 269 + 26 for Canada. That's pretty small, and you've cut out a large portion of space PLUS increased search time. Not only that, but it can be augmented for location information.

After that, you've got a 7 digit number. This can be stored in a single 32-bit integer. Sort on insert, and you've got a pretty fast mechanism for retrieval as you can do binary search on the remainder of the number.

2

I think we can you use Bit Vector here of size 1 million.

Java Example:

private BitSet dir = new BitSet(1000000);

public void addTelephoneNumber(int number)
{
    dir.set(number);
}


public void removeTelephoneNumber(int number)
{
    if (dir.get(number))
    {
        dir.flip(number);
    }
}


public boolean isNumberPresent(int number)
{
    return dir.get(number);
}
3
  • 1
    How efficient is this BitSet solution in terms of space? Commented Jan 27, 2012 at 23:58
  • This is the best answer Commented Mar 15, 2016 at 16:50
  • This is not a good answer. It is supposed to store phone numbers, not numbers from 1-1,000,000. Having said that, it is not possible to retrieve the phone numbers from this. You could have a hashing function that maps phone numbers to a specific index in this bit array, but again hashing functions are one way, so it's not possible to reconstruct the original phone number list. The idea behind storing things is to be able to retrieve them after. Commented May 2, 2016 at 3:56
1

I guess an unsigned Int32 or for international numbers an unsigned Int64

Using 32bit unsigned ints that would be 4MB

7
  • Well since phone numbers are at least 7 digits (in my experience) you'd be wasting space to store the numbers 0 - 999,999.
    – Kai
    Commented Mar 10, 2011 at 16:20
  • Phone numbers are not numbers. Don't try and store them as integers. Commented Mar 10, 2011 at 17:29
  • I'd bet in practice (in most the US) you'd need to store the area code to be useful. This makes the phone numbers 10 digits and would require at least a 34 bit Int or snazzy packing to eliminate the probably-unused values of 0-999999999 (that's over half your 34 bit space!). Commented Mar 10, 2011 at 19:46
  • 1
    In other countries you might have to store the leading 0, and you have to handle extntions Commented Mar 11, 2011 at 20:10
  • @NickJohnson They aren't but for the sake of the problem you could use an integer as an index to a phone number, right? eg. "(311)-0031-1151" can be indexed to 31100311151. An integer index would need 37 bits, but you would need at least 7*11 bits needed to store the same number as ASCII.
    – Andrew J
    Commented Feb 4, 2012 at 7:11
1

It truly depends on what operations you will want to run on the stored database.

The trivial approach is using unsigned integers, if you just need to store them probably some compression on the raw text representation using a dictionary would be smaller.

4
  • No! Phone numbers are not numbers! Commented Mar 10, 2011 at 17:29
  • Leading zeros may be significant, you might also have to account for an extention Commented Mar 11, 2011 at 20:09
  • The easy way to deal with leading zeros if you really want to use such an encoding scheme is to prepend "1" to the string, then encode as an integer. When you decode the integer back to a string, strip the leading "1". So the telephone number "456" gets stored as "1456" while the telephone number "0015833258881" gets stored as "10015833258881". There are other problems with storing telephone numbers as integers, but the leading zeros "problem" isn't one of them. Commented Mar 31, 2011 at 2:37
  • I agree with @NickJohnson on this one. I'd not try and store them as int or any number based datatype. The reality is, they are strings - they dont have any "computational" value Commented Mar 11, 2020 at 21:54
1

At a job interview the point of this question is to size up the applicant's problem-solving skills. Because the focus of the question is memory efficiency, In my opinion the correct answer is to ask the interviewer: "Are the phone numbers international, or are they limited to a single country?" If the numbers are limited to a single country, then the task of maximizing memory efficiency is simplified by the fact that each country has simple rules for distribution of phone numbers by state and city.

1

8 million bits with each bit 1(used) or 0(available) for one of 8 million numbers example

100 0000
900 0000
= 8 million phone numbers, bit 1 = 1000000 and bit 8 million = 9000000 
2
  • This is called pigeon hole sort.
    – Ole Tange
    Commented Sep 13, 2012 at 14:54
  • What about a try to avoid using duplicates of that many bits. The structure contains only what is needed. If phone numbers are sparse, that could be quite a cost savings as well without collision issues. I am assuming your solution fails to account for collisions. Commented Nov 9, 2016 at 18:12
-11
/******************************************************************************** 

  Filename: Phone_Numbers.c
    Author: Paul Romsky
   Company: Autoliv AEL
      Date: 11 MAR 2013

   Problem: What is the most efficient way, memory-wise, to store 1 million 
            phone numbers?

   Caveats: There is no mention if the numbers are contiguous or not, so, to save 
            space the numbers should be contiguous.  The problem (as a specification) 
            is rather vague and leaves a lot to interpretation, so many different 
            methods may be desired, but which one(s) desired is not surely known.

            Are the phone numbers just the extension (last four digits), or do they
            include the exchange (the leading 3 digits), or do they also include 
            area code and/or international numbers?

            There is no mention of the first number, only the range of numbers, so 
            the first number 000-0000 is used.  Although many numbers are not 
            normally used, they could in fact be used as valid number sequences 
            within the phone companies and are thus phone numbers nonetheless.

  Solution: A simple algorithm. If the numbers are not contiguous a fractal algorithm
            could be used.

            A standard ANSI C compiler should pack this program into a very small
            assembly module of only a few bytes.

 Revisions:

 Rev Date        By                   Description
 --- ----------- -------------------- -------------------------------------------
  -  11 MAR 2013 P. Romsky            Initial Coding

 ********************************************************************************/

/* Includes */

#include <stdio.h>


/* Functions */

/******************************************************************************** 
 *
 * Main Entry Point
 *
 ********************************************************************************/
int main()
{
  unsigned int Number;

  /* 1,000,000 Phone Number Storage 000-0000 through 999-9999 */

  for(Number = 0000000; Number < 10000000; Number++)
  {
    /* Retrieve Numbers */

    printf("%07u\n", Number);
  }

  return 0;
}

/* End */
6
  • 1
    May I ask what this is? It's not even complete. How does it answer the question?
    – Mysticial
    Commented Mar 11, 2013 at 21:46
  • Mystical, It is a C program. I hope you can see the whole thing, it is mostly comments and the final code that goes into memory is at the bottom. Commented Mar 11, 2013 at 21:58
  • Could you add an explanation as to how this solves the problem? Commented Mar 11, 2013 at 21:58
  • Martijn, can you see the main function at the end of the program? It is a simple loop. The size of the program is very small yet holds 1 million sequntial numbers. The formatting got messed up when I cut and pasted my code. Commented Mar 11, 2013 at 22:03
  • int main() { unsigned int Number; /* 1,000,000 Phone Number Storage 000-0000 through 999-9999 / for(Number = 0000000; Number < 10000000; Number++) { / Retrieve Numbers / printf("%07u\n", Number); } return 0; } / End */ Commented Mar 11, 2013 at 22:04

Not the answer you're looking for? Browse other questions tagged or ask your own question.