Tuesday, May 12, 2009

One of the toughest job-interview questions ever

I mentioned in a previous post that I once interviewed for a job at a well-known search company. One of the five people who interviewed me asked a question that resulted in an hour-long discussion: "Explain how you would develop a frequency-sorted list of the ten thousand most-used words in the English language."

I'm not sure why anyone would ask that kind of question in the course of an interview for a technical writing job (it's more of a software-design kind of question), but it led to a lively discussion, and I still think it's one of the best technical-interview questions I've ever heard. Ask yourself: How would you answer that question?

My initial response was to assail the assumptions underlying the problem. Language is a fluid thing, I argued. It changes in real time. Vocabulary and usage patterns shift day-to-day. To develop a list of words and their frequencies means taking a snapshot of a moving target. Whatever snapshot you take today isn't going to look like the snapshot you take tomorrow -- or even five minutes from now.

So the first question is: Where do we get our sample of words from? Is this about spoken English, or written English? Two different vocabularies with two different frequency patterns. But again, each is mutable, dynamic, fluid, protean, changing minute by minute, day by day.

Suppose we limit the problem to written English. How will we obtain a "representative sampling" of English prose? It should be obvious that there is no such thing. There is no "average corpus." Think about it.

My interviewer wanted to cut the debate short and move on to algorithms and program design, but I resisted, pointing out that problem definition is extremely important; you can't rush into solving a problem before you understand how to pose it.

"Let's assume," my inquisitor said, "that the Web is a good starting place: English web-pages." I tormented my tormentor some more, pointing out that it's dangerous to assume spiders will crawl pages in any desirable (e.g., random) fashion, and anyway, some experts believe "deep Web content" (content that's either uncrawlable or has never been crawled before) constitutes the majority of online content -- so again, we're not likely to obtain any kind of "representative" sample of English words, if there even is such a thing as a representative sample of the English language (which I firmly maintain there is not).

By now, my interviewer was clearly growing impatient with my petulence, so he asked me to talk about designing a program that would obtain a sorted list of 10,000 most-used words. I dutifully regurgitated the standard crawl/canonicalize/parse/tally sorts of things that you'd typically do in such a program.

"How would you organize the words in memory?" my tormentor demanded to know.

"A big hash table," I said. "Just hash them right into the table and bump a counter at each spot."

"How much memory will you need?"

"What've you got?" I smiled.

"No, seriously, how much?" he said.

I said assuming 64-bit hardware and software, maybe something like 64 gigs: enough memory for a 4-billion-slot array of 16 bytes of data per slot. Most words will fit in that space, and a short int will suffice for a counter in each slot. (Longer words can be hashed into a separate smaller array.) Meanwhile you're using 32 bits (64 available; but you're only using 32) of address space, which is enough to hash words of length 7 or less with no collisions at all. (The typical English word has entropy of about 4.5 bits per character.) Longer words entail some risk of hash collision, but with a good hash function that shouldn't be much of a problem.

"What kind of hash function would you use?" the interviewer asked.

"I'd try a very simple linear congruential generator, for speed," I said, "and see how it performs in terms of collisions."

He asked me to draw the hash function on the whiteboard. I scribbled some pseudocode that looked something like:

HASH = INITIAL_VALUE;
FOR EACH ( CHAR IN WORD ) {
HASH *= MAGIC_NUMBER
HASH ^= CHAR
HASH %= BOUNDS
}
RETURN HASH

I explained that the hash table array length should be prime, and the BOUNDS number is less than the table length, but coprime to the table length. Good possible values for the MAGIC_NUMBER might be 7, 13, or 31 (or other small primes). You can test various values until you find one that works well.

"What will you do in the event of hash collisions?" the professor asked.

"How do you know there will be any?" I said. "Look, the English language only has a million words. We're hashing a million words into a table that can hold four billion. The load factor on the table is negligible. If we're getting collisions it means we need a better hash algorithm. There are plenty to choose from. What we ought to do is just run the experiment and see if we even get any hash collisions. "

"Assume we do get some. How will you handle them?"

"Well," I said, "you can handle collisions via linked lists, or resize and rehash the table -- or just use a cuckoo-hash algorithm and be done with it."

This led to a whole discussion of the cuckoo hashing algorithm (which, amazingly, my inquisitor -- supposedly skilled in the art -- had never heard of).

This went on and on for quite a while. We eventually discussed how to harvest the frequencies and create the desired sorted list. But in the end, I returned to my main point, which was that sample noise and sample error are inevitably going to moot the results. Each time you run the program you're going to get a different result (if you do a fresh Web crawl each time). Word frequencies are imprecise; the lower the frequency, the more "noise." Run the program on September 10, and you might find that the word "terrorist" ranks No. 1000 in frequency on the Web. Run it again on September 11, and you might find it ranks No. 100. That's an extreme example. Vocabulary noise is pervasive, though, and at the level of words that rank No. 5000+ (say) on the frequency list, the day-to-day variance in word rank for any given word is going to be substantial. It's not even meaningful to talk about precision in the face of that much noise.

Anyway, whether you agree with my analysis or not, you can see that a question like this can lead to a great deal of discussion in the course of a job interview, cutting across a potentially large number of subject domains. It's a question that leads naturally to more questions. And that's the best kind of question to ask in an interview.

143 comments:

  1. So were you hired?:)

    ReplyDelete
  2. :( you feel into *every* trap the interviewer laid out for you. I am assuming you werent hired: were I the interviewer I suspect I would not have passed you :(

    ReplyDelete
  3. sorry that should be fell not feel :)

    ReplyDelete
  4. simon7:07 AM

    @Tom: Can you explain that any further? Which are the traps?

    ReplyDelete
  5. I didnt see any traps out there?

    ReplyDelete
  6. Anonymous7:26 AM

    hm, i can see what you're hinting at--but in any case: was this what the interviewer wanted to hear?

    my approach would have been quite different: have a simple python program using a dictionary enumerating frequencies, sorting by value and taking the first 10k. this list could be stored for subsequent invocations, such that you can compare against the previous runs (thus you could determine "runnerups" or dynamic orderings, if necessary). the advantage of having that list persisted is that other programs can easily use the information computed by this program, or in a parallel environment you can have many threads reading the list but only one continuously updating the sorted list. i also completely agree with you on the unimportance of the hashing functions for that case, since during the development of the program you can go from simple-but-working to optimally representing the problem domain...

    ReplyDelete
  7. Anonymous7:33 AM

    You wrote: "I explained that the hash table array length should be prime, and the BOUNDS number is less than the table length, but coprime to the table length."

    Let the table length L, which is a positive integer, be a prime. Then let B, the bounds number, be a positive integer less than L. Now it *follows* that L and B are coprimes (ie. saying "but coprime to the table length" is superfluous): the only non-trivial divisor of L is L (since L is a prime), and L cannot divide B, since L is greater.

    ReplyDelete
  8. @Simon & Metaphox

    I listed some of the common ones here: http://news.ycombinator.com/item?id=605211

    So many people fall at the same issues this guy did. When I write interview questions (and indeed pick the interviewers) at our company a question like this would be common.

    We are looking for him to avoid the answer by drawing out the question (if he simply asked "well there are some issues with that, what assumptions am I making regarding the input data?" he would have been told what it took him a LONG time to argue his way to).

    Then he came back to his original gripe: which wasnt important in any way. He had made a point of it before, the interviewer would have noticed. Pushing the issue smacks of arrogance (which I'm not saying he DOES have) and inability to let non-issues slide.

    He says this was the most difficult question he has faced: I agree, because he failed at it. I am assuming he recognises this and has learnt from it :)

    ReplyDelete
  9. Anonymous7:46 AM

    Why not use a Trie instead of a Hash-based aproach ? Seems to be optimized for storing words.

    http://en.wikipedia.org/wiki/Trie

    ReplyDelete
  10. Anonymous8:06 AM

    there certainly doesn't seem to be a culture match

    ReplyDelete
  11. This comment has been removed by the author.

    ReplyDelete
  12. Anonymous8:22 AM

    I am extremely technical and I really have to agree with 'Tom | Errant' in his angle. Kas, you failed and you failed big time. I won't want to work with you as a team-mate nor an employee. You are the 'perfect' developer: Give you a problem and lock you in a room by your self with all of your requested gadgets, beverages and food; leave you alone and sooner or later you'll come out with the 'best' solution [in your opinion]. You're absolute and near perfection in your field of trade, I am sure, and your Myers Briggs is perfect for finding where you fit in. You are right to question the appropriateness for a technical writer's interview. But the interviewer was also right in asking it. Why did he ask it? He was not looking for a complete answer, he was looking for answering technique, methodical approach and emotional response. To me you failed in all areas. If I were looking for a programmer knowing all the latest techniques you'd be my guy, but that guy could also be a $5/hour guy from India or a $2.50/hour teenager from Asia.

    ReplyDelete
  13. @ Anonymous
    Because he probably was assuming it wasn't for a spell-check program and because hashing is constant in read/delete time whereas a trie is definitely not.

    I don't see any traps, your questioning of the question is a good sign and shows you understand the deeper roots behind the problem.

    You can't solve a problem on the level it is created.

    ReplyDelete
  14. Anonymous8:44 AM

    I would assume that this "well-known search company" WANTS this moving list of top words daily.

    It seemed that you were making an assumption about how the result would be used without thinking that the interviewer already had a set purpose.

    ReplyDelete
  15. Anonymous8:51 AM

    This is a typical case where one finds himself smarter than others, I have a cousin that works just like that: you ask him a question then he'll either find the most absurd thing to ask you back (so he doesn't have to answer the question) or he'll give you what he thinks is a deep insight (that somehow no one has ever thought about before). I feel that's a shameful waste of good intelligence, and no good ever comes out of it.
    I'd advise you to be humbler and not to assume others are dumber than you (although sometimes they will probably be).

    ReplyDelete
  16. Anonymous8:52 AM

    We should unquestioningly accept requirements. We are robots to serve our infallible masters. The box is warm. Do not venture outside of it.

    ReplyDelete
  17. Anonymous8:53 AM

    extract the words from a few different web books, dump them into an sqlite db and do a siple sql statement...

    done this before with python and sqlite

    no need for the rant :P

    ReplyDelete
  18. @Anonymous

    That's a horrible implementation, you're dumping stuff back to a (poorly suited to the task) database, just to have it do all the sorting, when you could have done it much faster by handling the stream yourself.

    ReplyDelete
  19. Good post, Kas. An interesting question and I liked your consideration of how to answer (plus, your time in writing it up!).

    While I would agree that the list is fluid, having watched search term usage for years in a few organizations now, that is largely very stable over good stretches of time. Across the whole language, if you're focusing on the top 1% of terms (which 10,000 would be if you assume a 1,000,000 word language as you mention above), it's likely very stable there as well.

    I'd imagine the top 100 words (just to take another chunk off the top) would likely be so stable that you might see one word change in, say, a year or so (if you take the corpus to be large enough). It would actually be interesting to know what the stability of the top 1% is or the top 0.01%. How would you determine that?? :-)

    ReplyDelete
  20. tr -sc '[:alnum:]' '\n' < text | sort -f | uniq -ic | sort -k1,1n

    ReplyDelete
  21. EvilKayak9:27 AM

    If this was your real, you show high levels of incompetence; your ability to make decisions FTL. You probably have clanged on to education system for a while and show lack of effort on making basic assumptions to get a job done. I hate employees like you who come and ask me questions and can't make simple decision.

    you could of just listened to the question, taught up of a few assumptions, and presented your answer by presenting your assumptions and answers.

    ReplyDelete
  22. Anonymous9:42 AM

    There's a lot of bandwagoning going on in these comments. It might not have been a perfect answer, but it definitely wasn't "incompetent", or a failure. It maybe came off as a little arrogant.

    If I was interviewing you, I think it would have been a wash.

    ReplyDelete
  23. Anonymous9:50 AM

    Your solution goes way over my head, but I just tried sorting 10,000 tuples in python in the form: (freq, word). It took a split second.

    ReplyDelete
  24. Anonymous9:51 AM

    The comments, I think, show all that's wrong with the coding interviews: All that is required is to have a large database of typical interview questions in your head (of which this is one) and provide the cookie cutter answer when asked one of these. Now *that* you can get from "a $5/hour guy from India or a $2.50/hour teenager from Asia".
    In his answer Kas does come across as somewhat arrogant but his analysis of the basic assumptions are great, e.g. it's especially hard to get a collection of "typical" random English text. But most interviewees would not even think about this.

    ReplyDelete
  25. Anonymous9:53 AM

    If you think language and usage characteristics change minute by minute and day by day you must be speaking in a very entertaining way. Needless to say ridiculous assumption.

    ReplyDelete
  26. Anonymous10:03 AM

    Some of the people posting comments about this appear to be real jerks -- "you failed and you failed big time", "you show high levels of incompetence"!

    Who are these people? Your observations are perfectly valid, and if I was the interviewer, I would be very happy to have found someone who thinks around questions, rather than just answering them blindly.

    ReplyDelete
  27. Anonymous10:04 AM

    I had a strikingly similar question at an interview at a well-known search company as well. Perhaps the same one. The difference: I answered the question, stating my assumptions before my answer. Perhaps my interview was targeted differently as it was a coding position, but I saw immediately what the interviewer wanted: a chance to see how the candidate reacts to different aspects of the problem. We did theme and variations on this problem for the full 45 minute interview, coding, evaluating, and recoding that problem about 10 times.

    For what it's worth, the question eventually evolves into a discussion of distributed hashing with limited memory sets, finite disk bandwidth, and fast full-text searching.

    ReplyDelete
  28. Anonymous10:05 AM

    I dunno -- the english lexicon is kinda liquid, cuz people are creating new vernacular constantly `specially online - anyhoo - good post

    ReplyDelete
  29. Anonymous10:07 AM

    Seems like a good solution here is markov's :)

    ReplyDelete
  30. Anonymous10:10 AM

    And, yet, it is obvious that, in all of your extreme technical cleverness, you were too clever to meet the non-technical challenges of the interview situation.

    ReplyDelete
  31. Anonymous10:14 AM

    As a Prof that does corpus linguistics/informetrics for a living, I think the answers given, and the counter-questions asked to clarify the fundamentally ill-posed question, were very thoughtful. I would definitely have kept you on my short-list of potential hires.

    ReplyDelete
  32. @Shawn Freeman

    Wouldn't both a hash function and a Trie be linear on the number of characters in the word?

    ReplyDelete
  33. Just FYI: you start to get collisions when the number of items in the table is the square root of the table size. One million is quite a bit larger than the square root of four billion - so you're pretty well guaranteed collisions.

    ReplyDelete
  34. Anonymous10:31 AM

    Speaking as an employee and occasional interviewer at a well-known search company, this sort of worrying about definitions and corpora is a good thing and probably part of what was being looked for. The interviewer was probably rushing because time was limited.

    Your knowledge of hashing is impressive, but I see two technical flaws here.

    First, there is no provided way to extract the answer from the hash table. The 10000 most common hash values are not useful. Are you keeping a list of words? What's the memory cost?

    Second, how does this parallelize? For most hash-collision solutions, naive hash-table adding doesn't work, and it would be an awful lot of data to ship around if it did. And you can't ignore hash collisions with a 32 bit key -- not with all the typos and proper nouns on the web.

    ReplyDelete
  35. The problem is when interviewers ask a non-toy question, and expect a toy-answer. If you want a canned answer, ask a truly canned simple question.

    He IS right, this isn't easy. But there is a better way to answer.

    With caveats, I agree crawling a small set of pages, and gathering statistics via a simple python program.

    Of course, in production, if your spider crawls a highly technical website and sister sites for its sampling, your 10,000 words WON'T be the most frequent in the english language. It IS A HARD PROBLEM.

    If your interviewer DOESN'T realize this, you'll hire a guy whose web bot generates a list consisting of low frequency words because it started crawling the wrong website. And he won't understand why the 'simple' solution is actually broken.

    Irrespective of all this, here is a tip.

    Usually, to keep things moving, you throw out the 'simple' answer, and point out a few caveats with it.

    "Well, I'd write a python program to do it, but..." and rattle off a few caveats. Don't make it exhaustive, just the big ones you can think of.

    This way, the interviewer ( who is time constrained ) can move on, or he can ask more pointed questions. Either way, you've still told him that yes, you can solve the problem at hand, you are aware of the pitfalls in the method, and if the interviewer wants, he can drill down further.

    You've shown the interviewer that you are capable of writing 'hacks' ( hacks are useful to explore the problem ), that you know the pitfall of the hack, and you allow the interviewer to still move ahead with the interview

    ReplyDelete
  36. I should say, you'll hire a guy who writes a simple solution, that when it gives the wrong answer, doen't have the ability to intuit why it is giving the wrong answer.

    ReplyDelete
  37. Tom is correct. Vague questions posed in an interview are there to see if you will incessantly avoid the question by arguing about the question itself. The job is likely not about how well you can argue your assignments, nor is it just about regurgitating material. You can point out you're making certain assumptions about your input data set, but your response should be salient. By the time you finally got to your technical answer you had already lost the game.

    ReplyDelete
  38. Anonymous10:41 AM

    I have to agree with those who said you came off as arrogant. The whole sample tangent portion of the interview was probably stuck in your mind as something to keep on about... but it's b.s.


    "Let's assume," my inquisitor said, "that the Web is a good starting place: English web-pages."

    Now it's my turn. The Internet would be a perfect sample for both written and oral vocabulary. There are quotes, well noted past & present authors, urbandictionary.com, etc. I don't understand your logic here. An average of the English language over a fairly decent recording of history of the English language would be a good sample. I don't understand your logic behind that sampling.

    ReplyDelete
  39. Anonymous10:54 AM

    Your response/solution oozes with arrogance. I'm assuming you didn't get the job. That's called over-design and when you work with someone who acts like this, you hate your life. There's no shortage of software problems to address, so before you come up with a solution, you need to gauge the importance of a problem. If I was interviewing you I would have laughed in my head during the interview. "This guy thinks he's so smart. Thank God I won't have to work with him."

    I'd rather work with someone who's stupid than someone who's arrogant or argumentative or can't answer a simple question.

    ReplyDelete
  40. Anonymous10:58 AM

    So, English has no "average corpus", and yet there are "typical English words"?

    You're a quibbling douche-bag, how does your medicine taste.

    ReplyDelete
  41. Anonymous11:01 AM

    @Yonatan (and Shawn Freeman): Yes, both hash tables and tries are linear in the number of characters in your word.

    @Tom: Yes, the OP should read up on the birthday "paradox". In fact, I'd love to see a hash function that doesn't have any collisions on a single, random "4 billion into 1 million" situation.

    ReplyDelete
  42. Folks with Tom | Errant -

    Are you sure that you are filtering out the right people? You might wind up with a bunch of employees who just know how to look up the answers to common interview questions and interview goals. Some of the best people to work with are sure to not interview well, so you should also leave yourself some room to go with your gut, and not make the 'traps' so hard that people can't spot them without reading a webpage on faking your way through an interview.

    I don't have much experience interviewing - so take this opinion with a grain of salt.

    ReplyDelete
  43. Anonymous11:06 AM

    You sound like an unbelievable douche... this article should be reserved for only a select few, not featured on Wired where a normal person that gets a headache even thinking of the word algorithm reads it. And I agree with the guy above me, you're obviously and arrogant bastard.

    ReplyDelete
  44. Anonymous11:06 AM

    I would have just looked up word frequency norms. They have those.

    ReplyDelete
  45. Anonymous11:18 AM

    I would have hired you. However unlike most technical people, I don't have hangups about people knowing more than me in a given area of programming.

    ReplyDelete
  46. I think he is absolutely correct that the "most-used words in the English language" is not a very well defined concept. It could mean anything up to all verbal and written communication since English became English. I don't see how you can argue that any particular sample you are able to dredge up off the Internet will have the same top ten thousand words in the same order.

    So I think he is justified in his objection to the question. It is effectively impossible as stated.

    On the other hand, requiring such precise use of language is considered extremely annoying by an overwhelming majority of nonmathematicians. Your behavior did not feel appropriate for a job interview. Generally if you are asking somebody to pay you for your time, you need to try to give them what you want to the best of your ability, not convince them it doesn't exist. Your interviewer was probably envisioning an hour of fruitless debate at every meeting involving you. That probably isn't what the company pays people to do.

    Also, this part really rubbed me the wrong way:

    "How much memory will you need?"

    "What've you got?" I smiled.

    That's just being a smartass...

    ReplyDelete
  47. Anonymous11:26 AM

    your point was pointless

    ReplyDelete
  48. Anonymous11:28 AM

    Basically what it comes down to is that interviewers are often douchebags who ask stupid questions to over analyze an employee instead of asking something relevant.

    ReplyDelete
  49. I think you're very wrong about the chance of collision. In a 4 billion bucket hashtable, you have a 50% chance of a collision after just 74466 words. By the time you reach a vocabulary of a million, you're virtually guaranteed at least one collision. The chance of collision (as a python expression) is: 1 - product((BUCKETS-i)/float(BUCKETS) for i in xrange(NUM_WORDS)). For a million, this is as close to 1.0 as makes no difference.

    ReplyDelete
  50. Anonymous11:33 AM

    @EvilKayak: Perhaps you should have "clanged on to education system for a while."

    ReplyDelete
  51. I understand the OP's situation even if he did come across as a little arrogant. Sometimes in interviews candidates are not given the chance to show their abilities through typical interview questions.

    I was once asked a question about sheep herding and sheep rotation. The ultimate answer was Triangular Number (http://en.wikipedia.org/wiki/Triangular_number). Well I don't have a degree in math and have never come across that word/concept in and off itself. I gave the interviewer the formula, but he kept pushing for the word.

    The in another phone interview I was asked five questions I had just read about maybe a day or two earlier. I actually told the guy I had just seen those questions.

    I don't think for most jobs people want honest and intelligent people. They want to fill each position with the minimum at the lowest cost.

    ReplyDelete
  52. Anonymous11:40 AM

    I'd hire this guy after asking other questions, possibly not related to the field, to see if the guy really was arrogant or not.

    ReplyDelete
  53. The interviewer was clearly trying to measure your algorithmic and programming skill.

    It sounds like you didn't think that was the "right" thing for him to measure, because there were all these meta-questions with the problem statement, and you kept trying to steer the discussion to those. Fair enough. That situation comes up on the job all the time; somebody's asking for X, but you can see they would be better served by Y.

    What such a situation calls for is to persuade your colleague (not "tormentor") of your perspective, in a way that makes him feel respected, not irritated or condescended to, and grateful that you helped him out. It doesn't sound like you really did that. I probably would have passed on you as well -- "technically brilliant, but won't be pleasant to work with."

    I hope this comes across in the constructive spirit in which it is intended.

    ReplyDelete
  54. Anonymous11:44 AM

    Judging by the bizarrely incoherent claims that you failed the interview "big time," this explains why America has a chronic shortage of IT personnel.

    The people who run and/or hire at IT companies are ignorant incompetent kooks, to judge by the claims that you "failed" this interview. Consider: that kind of attitude from the interviewer places you in a Catch-22 bind. If you question the assumptions behind the question, it means you fail because you "refuse to let go of the issue." If you don't question the assumptions behind the question, it means your fail because "you're not a deep enough thinker."

    It's entirely evident that no human being could "pass" the test, if the cranks and crackpots who've commented so far are correct. In fact, if you take their claims seriously, it's trivial to find a bizarre enough distortion of logic that will insure that *any* response you make to the question provides alleged "proof" of your "failure."

    Flip a coin to pick a response at random? You fail because you're insane. Immediately sketch out full working code in response to the questioner? No, you fail because you have no vision and would only do exactly and precisely what you're told to, like an automaton. Invent a new search algorithm 100 times faster than any previous algorithm known in response to the question? "Epic fail" because you didn't respond directly to the question. Solve the problem instantaneously before the interviewer asks it, by using ESP? Colossal failure, since this would make you impossible to work with due to your alleged "arrogance." Create world peace and grant everyone on earth eternal life as an incidental side effect of your answer to the question? Massive failure since you're exerting effort far beyond what's required and thus you're badly wasting the company's time and money, and as a result (predictably) the commenters will sneer "I would not want to work with you in any way, shape or form." And so on.

    Short version: no matter what you do, no matter what you say, no matter how excellent your skills, no matter how vast your knowledge, no matter how deep your insight, you can never ever ever ever ever satisfy the fringe lunatics on this website.

    Alas, those fringe lunatics seem altogether representative of IT management.

    Solution?

    Abandon IT for a field not dominated by kooks and cranks and crackpots.

    And we wonder why America keeps sliding farther and farther behind the rest of the world technologically...

    ReplyDelete
  55. Anonymous12:01 PM

    First of all he's interviewing for a tech writing job, not a programming job. Second, anyone who is saying he's not technical enough so they would not hire him is an idiot. Third, any programmer that carry's all that stupid detailed crap around in his head is not going to be a good long term hire, he will probably start to fall behind the state of the art because he always go to what he already knows rather than looking up what is the best way these days.

    I'd hire this guy as a writer.

    ReplyDelete
  56. Anonymous12:02 PM

    Kas, very impressive. I don't have a degree in computer science and I know little about programming, but I do have a Masters in English. Your answer to that abysmal question could have come straight out of the mouth of the father of modern linguistics, Ferdinand de Saussure, who wrote Course in General Linguistics and sparked the Structuralist and Poststructuralist theories in philosophy and other liberal arts disciplines. He argued that language is not stable, but composed of diachronic (over time changes) and synchronic (at the moment) elements. Language changes, and it changes quickly. Compare Shakespeare to contemporary English to view the drastic evolution of language in a few hundred years (even from the 1500's to the 1700's). Some may argue that English is now standardized, but every major paradigm shift, say the information revolution or the advent of computers for example, brings on a whole lexicon of new vocabulary and jargon that would easily fall into the top 10,000 words. Your example of the word 'terrorist' is even more striking because of its shift in importance from day to day. As far as the programming code, I will let more skilled minds in this discipline stick to that, just as your critics in the comments should not step into the realm of linguistics without more research and education. Thank you for sharing your answer and for thinking critically. It is nice to see that not all computer geeks are robots yet.

    ReplyDelete
  57. Anonymous12:12 PM

    And THIS is why I wouldn't have hired you.

    ReplyDelete
  58. Anonymous12:16 PM

    I don't know, programming and coming up with programming designs are an iterative process.

    It's not a fallacy to say "assuming the corpus is good" and then coming up with a first draught idea of how to solve it. While you clearly have working knowledge of algorithmics, your insistence on a corpus was the kind of pedantic whining that I would dislike having in a colleague.

    I'm saying this as constructive criticism.

    It would have been easy to answer all of those questions with answers instead of questions.

    For example: Q: how much memory will it use? A: "the english language has 1 million words. So I think that push comes to shove, I could store the whole dictionary on file and keep an index table of 4 megabytes. That would be the very minimum." is an answer that is perfectly intelligent, doesn't commit you to anything, and displays your reasoning and general knowledge.

    ReplyDelete
  59. Anonymous12:22 PM

    Alternatively Kas you could have prefaced an answer to the question saying this is a naive solution to the problem and I want to bring up some issues I have with the question itself at the end. Put out a solution to the question first to illustrate that you are smart. THEN pick the question apart and illustrate that you are aware of the issues with the original question.

    ReplyDelete
  60. Anonymous12:25 PM

    > Tom said:
    > Just FYI: you start to get collisions when
    > the number of items in the table is the square
    > root of the table size. One million is quite a
    > bit larger than the square root of four
    > billion - so you're pretty well guaranteed
    > collisions.

    Actually, you have a 50% chance of having a collision when the number of entries into your table is roughly the square root of the table size [actually it's closer to 1.18*sqrt(table size)]. The growth of this "collision is more likely than not" population is O(n^.5).

    Assuming you have truly random data (which is a best case scenario), then for a table size of 2^32, you'll have a 50% probability of a collision with a population of about 77,000.

    The question is if you design your algorithm to assume there are no collisions, can you afford a 50% chance of no collisions? Probably not. I'd prefer to have something along the line of 95% guarentee of no collisions. This population is about .32*sqrt(population size). For a table size of 2^32, this population is 20,991.

    However, most applications are designed to have a failure rate of much less than 5%. Because of this, one almost always needs to be concerned with collisions.

    ReplyDelete
  61. Anonymous12:25 PM

    I must take sides with people saying you missed the point of the question. If I am your supervisor and I ask you to do something I don't want a debate. I want a fast solution. I don't want a mindless answer AND I don't want to debate the philisophical underpinings of the question at hand.

    We can assume there is a need for a *rough approximation* of the top 10,0000 words for a given set of data (which maybe you will help select). Give me a quick solution along with caveats and if I need something better I will ask.

    When you say you were disappointed the guy didn't know some portion of hashing theory that you knew you sound a bit wounded - like how could they have missed your technical ability. You lost on your personal ability.

    Imagine the flip side. YOU are the boss. You need a list of the top 10,000 words as a component of a larger issue. YOU ask a coworker to find them and are drawn into a 30 minute debate on WHY (while he avoids HOW).

    Learn from this. Accept your failing and move on. Don't hold onto a feeling that YOU are right and HE is wrong (or we are wrong).

    ReplyDelete
  62. Anonymous12:26 PM

    And I have to add to the post from 12:16pm that discussing concrete algorithm implementations at the stage you were asking is IMO simply wrong.

    The bottom line is that the algorithm is this:
    - data mine
    - make a table of 1 million words, increment word frequencies as you mine
    - sort a table of 1 million words

    End of story. Then he can delve into each step deeper if he wants. And you can reveal how you would handle each complication further and further.

    If you come to the table with all your cards down, you will come off sounding as the eager high school graduate who says "I will use XOR EAX,EAX instead of MOV EAX, 0h to save valuable cycles". It's amateur hour.

    Life is like a theatre show in very many ways. You can't reveal what's behind the curtain before it's time to reveal what's behind the curtain. So many programmers fail to see this, and on top of that feel indignant when they are criticized for failing to see it... saying "What? The stage props were there all along."

    That's not how it works, and there's a deeper reason than just aesthetics to it.

    ReplyDelete
  63. Anonymous12:27 PM

    When you decided to go for a hash table (which I wouldn't; I'd go for a trie), and you pointed out that it would be sparsely populated, I'd suggest you handle collisions using a simple linear probing strategy as it would be a lot faster due to the cache exploiting spatial locality much better. I'd use double hashing as my next best alternative, instead of cuckoo hashing.

    By the way, as a technical writer you still need to be able to think analytically about a problem, and that was what your interviewer was testing you for.
    However, as a technical writer, your job is not to question all decisions made by other people in your company, or be a smart-ass. In a professional situation, if someone wants your input on something, they'll ask you about it. To demonstrate that you were aware of the dangers of making faulty assumptions about the language, I'd suffice telling your interviewer so in a couple of short sentences.

    ReplyDelete
  64. Anonymous12:30 PM

    The answer is so obvious -- just download Hadoop and modify the default word count appropriately. The only hard part is feeding it enough data.

    ReplyDelete
  65. Anonymous12:31 PM

    I think, if your goal was to get a job, you failed: you didn't figure out what the guy wanted to hear. If your goal was to get a job that was right for you, I think you succeeded in weeding this company out (or at least whatever group you interviewed with).

    FWIW I think your response was a strong hire. Grabbing a random set of assumptions and jumping to code is a no hire in my book. Without going through that exercise, I have no way of evaluating the quality of the list you return back to me.

    Yes, there is a real risk that people mentioned that perhaps you are just stalling the question because you have no idea how to implement a solution even with perfectly nailed-down requirements, but I think the second half of the interview rules that possibility out.

    Hash tables are a thoroughly solved problem. I'd have taken whatever the language offered me. Memory usage would be proportional to the number of unique words in the English language -- which is somewhere south of 100,000 words, provided no spam chaff (full of unique nonsense words) got in to the corpus. Easily doable on any modern machine. The fact that the interviewer was judging your skills in re-inventing the wheel is just more justification to no-hire the company.

    ReplyDelete
  66. Anonymous12:44 PM

    I would not hire you, you were too cocky.

    ReplyDelete
  67. Anonymous12:58 PM

    Yes the main reason why you failed is because you were highly defensive and evasive in your responses.

    That reeked of: "I dont really know, but i'm going to snow ball this into a bunch of exceptions and asides and hope the other person gets frustrated enough that they give up the fight." WHICH IS NOT WHAT YOU SHOULD BE DOING IN THIS TYPE OF INTERVIEW.

    Of course it is good to point out the fallacies in his question as well as identifying the fact there are assumptions that are unknown. You should mention then, give examples, propose definitions for those unknowns and assumptions in order to limit the problem scope, and then become constructive in order to build the system he is asking for.

    ReplyDelete
  68. Eric Scrivner12:59 PM

    This question is almost a direct lift from Knuth's "Literate Programming". Whomever recommended the Trie has clearly read Knuth's solution as well.

    ReplyDelete
  69. Anonymous1:01 PM

    I would expect that the interviewer was expecting a solution that scales without requiring near infinite RAM. e.g. disk-based mergesort. (why this is needed for technical writing i have no idea)

    ReplyDelete
  70. If you were interviewing for a well known search company, why wouldn't you have given the answer back using a Map/Reduce based algorithm? If I recall correctly, this was one of the example problems in the original Map/Reduce paper. In fact, I just checked and it was.

    I read this question as saying "Prove to me that you have read the original paper and understand how to implement it." Given that question, you failed.

    ReplyDelete
  71. Anonymous1:05 PM

    People have worried about this before you (and for longer than you)... and they have produced standard corpi:

    For British English BNC (http://www.natcorp.ox.ac.uk/)

    For American English: http://americannationalcorpus.org/

    I'm sure there are many for other languages too..

    You did sound a little arrogant btw..

    ReplyDelete
  72. Yea, I would've stored it as a trie (tree of hashes) also. Most dictionary/spell-checkers use a trie for word storage and fast lookup. Or perhaps used a database to grab things, or put every word as one line in a file, then used the unix 'uniq' program to generate the word stats, then sorted them and tail'ed the top 10k results. Can be done on the command-line without any programming. :)

    ReplyDelete
  73. Anonymous1:08 PM

    I never Liked hash tables as a means of storing large data, i would personally have gone with a sort of a balance B-tree, maybe. its easier than dealing with collitions, and searches are faster and its already sorted.

    ReplyDelete
  74. Solipsism: The pretentious morons preferred way of solving any problem.

    ReplyDelete
  75. I didnt make it past the first paragraph (where the interview question was!) but thank you for that! I had fun working out the solution!

    (I'll visit later to read the rest...)

    ReplyDelete
  76. Anonymous1:26 PM

    Probably the "well known search company" wanted you to explain how you'd use MapReduce to sift their huge database of English language material, which includes books as well as web pages.

    ReplyDelete
  77. Anonymous1:32 PM

    I think your solution has some problems, but I (as a software developer who has interviewed candidates) disagree with those who say that your arguing the details of the problem was the wrong thing to do. Obviously you weren't applying to be an engineer, but I never, ever want an engineer who just produces "a fast solution" (as a previous comment put it), I want an engineer who actually understands the problem. I also want to make sure that *I* understand the problem and you questioning me on the details is the only way we are going to get there. If the problem is ambiguous or you don't understand it or you aren't sure that your solution is actually going to match what they really want (as opposed to what they ask for) then you should start questioning the assumptions.

    One quibble that neither you nor the interviewer seemed to consider is the definition of "used". Is a word used when it is written or when it is read? Should words on more popular web pages count as more commonly used than words on less popular web pages?

    That aside, when the interviewer points you in a different direction, you go there. You seemed to do that when he asked you for an algorithm (although I have some problems with your solution), so that's okay. In general, however, if a "simple" interview question prompts a lengthy, interesting discussion about the problem then I'd probably consider it a success.

    ReplyDelete
  78. To tell you all the truth, I agree with Tom. The current corporate world is best served by people who do not think too hard, who don't need to argue every point, and who can accept certain compromises in order to produce what is best for the company.

    Your arrogant attitude towards the interviewer's question only serves to anger him. As smart as you are, you let your smarts get in the way of something very trivial (a stupid job interview.)

    If there is one thing institutionalized education has tought me, it's how to survive in a corporate world. I am probably worse than you when it comes to arrogance and superiority. In my mind, I'm enlightened, everyone is stupid, I know everything. I know this is true. Unfortunately, I have to push those feelings aside in order to be happy in this life and to prosper.

    I suggest you either go into academia and research, or you seriously rethink your attitude towards other people.

    ReplyDelete
  79. Anonymous1:36 PM

    Interesting to note the the standard corpora do not agree on what is most common :). The list of caveats, statements of limitations, and explanations that come with these standard corpora are also noteworthy.

    ReplyDelete
  80. Anonymous1:43 PM

    1) You were interviewing for a job - implies equal courtesy and respect on both sides (perhaps something that was lacking on both sides of this interview)
    2) Your answer while sound and technical was contextually inappropriate, give the answer from the point of view of the job you are interviewing for (a Tech Writer)
    3) As some one doing the hiring, your job is not to know as much as your reports (especially in any fast paced technological field), this is near impossible. Your job as the manager or supervisor is to get the desired out put from your reports under the direction of the higher-ups. As a result, the interviewer needs to find some one who 'fits' from the perspective of business need at that time.

    ReplyDelete
  81. Anonymous1:48 PM

    Did anyone not notice that at one point it is an interviewer and the next a professor?

    ReplyDelete
  82. Anonymous1:52 PM

    Out of all the comments, Andy said it best. This post was just pure arrogance and proof that smartasses will annoy anyone and everyone to no end. Lively debate and discussion, if that's what you want to call it, should be done over coffee, not an interview where a company is trying to decide whether it wants you working there. They wanted to see your methodology and technical ability, but instead you went off on some stupid philosophical discussion about the fluidity of language. If I were the interviewer I'd have said, "Thanks for pointing that out, Captain Obvious. Now just answer the question."

    ReplyDelete
  83. James2:32 PM

    If this is the same internet search firm where I work, I can only say that I'm glad the outcome of this interview was what it was.

    ReplyDelete
  84. Anonymous2:36 PM

    "If I am your supervisor and I ask you to do something I don't want a debate. I want a fast solution."

    Superbly stated, sir! Congratulations! You just prevented the modern integrated circuit revolution.

    Your arrogant ignorant sneering supervisorial attitude, sir, precisely encapsulates the mindset of the people who demanded a quick solution to the problem of reducing the size of component digital circuitry, so that it could fit into the Minuteman missile. The obedient drones who answered that question quickly and efficiently produced the obvious and most efficient answer to that problem, and since it was the most obvious and the most efficient solution, it was of course wrong.

    Miniaturized electronic components in blocks never went anywhere. Vast amounts of time and money got wasted. The Pentagon wasted billions, and produced nothing but dead ends.

    But one engineer, Jack St. Clair Kilby, declined to obey the mindlessly foolish recipe for failure "If I am your supervisor and I ask you to do something I don't want a debate." Fortunately, Jack St. Clair Kilby got hired during a holiday when there were no supervisors around to bully him. And so Jack St. Clair Kilby had the luxury of ignoring your appallingly foolish and ignorant edict "obey, and no debate."

    Instead, Jack St. Clair Kilby sat around thinking for a few weeks about what the right question ought to be. He realized that the right question was not how to miniaturize the digital circuit components, but how to integrate the entire circuit in a size small enough to fit into the Minuteman missile. Once he asked the new question, Kilby immediately realized that the solution was to etch the entire circuit as a whole on a single piece of silicon. In other words, he invented the first integrated circuit chip.

    I congratulate you, sir, on having the brazen huevos to boast that you would have fired the man who invented the modern integrated circuit because he wasn't following your orders and he wanted a debate.

    The arrogantly mindless sneer "If I am your supervisor and I ask you to do something I don't want a debate" represents Exhibit A in the decline and fall of American management, ladies and gentlemen.

    ReplyDelete
  85. Anonymous2:36 PM

    This is what you want probably: http://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf



    (Referenced as implementation of "Top" agreagator in Sawzall language, used as fronted to MapReduce)



    Nice post.

    ReplyDelete
  86. Anonymous2:38 PM

    This isn't a tough question at all. I'd be more concerned about the fact that you think it's tough than I would about any "arrogance."

    How I'd answer the question:

    "First, do you have the corpus identified? No? Well, how much time do you want to spend on that? We can just get a premade corpus for testing, such things exist. How about we just use the works of Shakespeare? That nicely bounds the 'What do you mean by English?' problem, and lets us concentrate on a technical approach without getting bogged down in linguistics.

    "Is this something we only need to do once? If it's just a one-time thing, then the first thing I'm going to do is take ten or fifteen minutes to write a Python program that builds a word-frequency histogram. It's going to be a quick and dirty solution - my definition of 'word' is going to be defined in a regular expression that we'll probably need to re-examine later. Oh, crap, I have to write a regular expression. Make that twenty minutes. Anyway, if we're just trying to get an answer it can be quick and dirty.

    "There's all kinds of reasons this might not work. I don't think it's likely to run out of memory - I know I can build a dictionary with a half-million entries in it on my laptop, and I seem to remember that Shakespeare used something like 200,000 distinct words. And while it's not the biggest corpus in the world, it's big enough to give us some usable performance data. I mean, I could work out on paper how much memory I expect this to use and how fast I expect it to run, but there's not much sense in doing that if I can just find out for myself by taking a few minutes to write a program.

    "While it's running we can talk about other things, like what this program might look like if we needed to optimize it for larger corpora. Like, it would certainly be faster if we used a trie, and we'd need to organize it functionally to use map/reduce if it turns out we need to run it on multiple CPUs. The definition of 'word' is probably something we should look at too, especially if we want to run this on something that's not English."

    ReplyDelete
  87. Anonymous2:43 PM

    @Integrated Circuit Man: Yet still so many 'not so free' nations are clipping along the technological road fairly well ..... perhaps [another] part of the decline is too much liberty?

    ReplyDelete
  88. Anonymous2:51 PM

    Did the interviewer have water for you on the table already? Did you spill it on your resume by accident and turn red with embarrassment?

    ReplyDelete
  89. Anonymous2:53 PM

    > but that guy could also be a $5/hour guy from India or a $2.50/hour teenager from Asia.

    Good luck with that.

    ReplyDelete
  90. Anonymous3:06 PM

    "The current corporate world is best served by people who do not think too hard..."

    "...Perhaps [another] part of the decline is too much liberty?"

    Primo magnifico. I bow before the master.

    Don't think too hard...and cut back on that pesky liberty -- wow, sure sounds like the last eight years of our beloved previous administration in the White House, doesn't it?

    Don't I know you guys? Didn't you plan the invasion of Iraq, and tell us Saddam had WMDs, and assure us that the levees around New Orleans didn't need reinforcement because, well, y'know, the current corporate world is best served by not thinking too hard?

    I love it. The people on this forum busily condemning the guy who interviewed with Google have got a live show & tell demo going: WHAT'S WRONG WITH AMERICA: EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK.

    Keep it coming, people. Pure comedy gold. "If I were the interviewer I'd have said, `Thanks for pointing that out, Captain Obvious. Now just answer the question.'"

    Exactly what the White House said to the CIA analysts who tried to explain that the question "Where are Saddam's WMDs?" was meaningless. In fact, I betcha `Thanks for pointing that out, Captain Obvious. Now just answer the question,' are the precise words used by the Intel exec who asked how the Itanium project was coming along when his engineers tried to tell him the Itanium was a fiasco back in 2001.

    ReplyDelete
  91. Anonymous3:19 PM

    A further irony is that when this was on Reddit this morning, it happened to be next to an article about the problem with anti-intellectualism in the U.S. Most of these comments just prove that America is not interested in thinking.

    ReplyDelete
  92. Anonymous3:33 PM

    My own two cents:

    I fully agree with one of the anonymous above:

    "I think, if your goal was to get a job, you failed: you didn't figure out what the guy wanted to hear. If your goal was to get a job that was right for you, I think you succeeded in weeding this company out (or at least whatever group you interviewed with).

    FWIW I think your response was a strong hire. Grabbing a random set of assumptions and jumping to code is a no hire in my book. Without going through that exercise, I have no way of evaluating the quality of the list you return back to me."

    My sentiments exactly. If I was interviewing you, I would hire you right away. Someone who didn't display critical thinking and who jumped straight to coding without examining assumptions would be a quick no hire.

    Just because you debate something, doesn't mean it's arrogant. It all depends on how you treat your debater. The mere act of questioning is NOT arrogance! I think it's the opposite. It's all the people who scream "arrogant" at the mere presence of a questioning attitude are the ones that are arrogant. It's like the age old adage that says that people who complain about other people's egos are usually doing so because they feel their own big and stinky egos being constricted.

    At the same time you have to face reality. There are a lot of morons out there who want to employ robots who unthinkingly obey. That's reality. What are you going to do about it? You'll have to find a way to make such situations workable for you. It all depends on what you strive for in life.

    If I was the interviewer who did not want this kind of detailed analysis of assumptions, what I would do is I would launch into a 5 minute speech about how much I love and appreciate critical thinking, and how I have duly noted your ability to think critically, and THEN ask you to please move on. That way you'd probably feel that I GOT IT, I saw the quality of your insight and appreciated it, and you had nothing further to prove to me in that regard, and could safely move on. I guess the reason you kept challenging the assumptions is because at no time did the interviewer express appreciation of your critical thinking skills.

    ReplyDelete
  93. Anonymous3:35 PM

    Some people are perhaps over-aggressive in their (valid) criticism.

    Sounds like flaunting your knowledge as a classic means of avoiding the question. Until you show that you're perfectly capable of answering the question. Then you just sound like a pedant (I have this problem too so no offense intended).

    In this particualr case, I see a win with this interviewer. If (s)he wasn't enjoying the distraction you would have been cut off very quickly. However in general I would call this bad technique.

    My non-mathematician answer?

    define corpus
    create database table
    crawl corpus
    add each new word, increment counter for repeated usage
    sort by count
    repeat as needed for a dynamic list

    The original question was a one sentence quickie, and that should be more than sufficient answer. You took more time exhaustively dissecting the question and questioning the validity of the problem, as well as the motivation behind it than would have been lost in the execution of an 'dirty' solution.

    I think given reasonable access to a designated corpus and a modern PC I could have had the list in hand before you finished with your pedantic antics.

    But, based on the interviewer's responses to your prompting you might just fit in with that particular team.

    ReplyDelete
  94. Anonymous4:00 PM

    Big douchebag: answers a question like you did in an interview.

    Bigger douchebag: answers a question like you did and blogs about it.

    Biggest douchebag: answers a question like you did, blogs about it, and makes the front page of reddit.

    ReplyDelete
  95. Anonymous4:04 PM

    I find it fascinating that the only language that I saw mentioned in all of these comments was Python.

    Also, there are a notably high frequency of condescending, angry, douchebag comments here.

    Lesson: Python programmers are condescending douchebags.

    ReplyDelete
  96. Anonymous4:19 PM

    Averaged over n >= 30 days you will get a normal distribution of most commonly used words. As n approaches infinity the distribution becomes normal. 1 day of terrorist being 100, will be insignificant over the course of a one month or greater web crawl. Anyone who has any understanding of real statistics would know that. Words are not as fluid as you think, they are generational, they do not change monthly. Crawling social networking sites in say 5 years when it becomes transient between demographics will give you sufficient basis for commonly used words as well as news (professional and amateur including blogs) to give you sufficient professional word frequency. The only problem here is weighting, which can be done by using census information. I won't go further but it's not as difficult a philosophical question as you think. Silly programmers, you just need a DOE engineers perspective.

    ReplyDelete
  97. Anonymous4:56 PM

    I see a lot of people defending critical thinking, and I agree with that. Just do not forget these few key points:
    1. This was a job interview, which is limited in time. If the interviewer wants you to demonstrate analytical thinking, go ahead and demonstrate it. This nitpicking should be limited to 1 sentence, max.
    2. A big internet search company of today got a lot of employees. Some of them are hired to be thinking, some of them are hired to be technical writers. I believe maybe you applied for the wrong position, if you want to help define linguistic concepts for some internet search company's crawler.

    ReplyDelete
  98. Anonymous5:04 PM

    Or you could have just went with a linked list of words and incremented the count, then perform a quicksort on the count key and spit out the top ten thousand. That's 36 Megabytes plus a tad, all in memory. Fast too.

    Jeez, no biscuit for you, dog.

    ReplyDelete
  99. Anonymous5:34 PM

    Yup, you failed. It's good to see the bigger picture, but you demonstrated you couldn't get the gist of a problem, nor approach it constructively.

    ReplyDelete
  100. Hello KT, the correct answer, and therefore the one that would have gotten you the job, was, "let me restate your question.....". you were supposed to have an open dialog with your interviewer. Everyone remember the NASA problem where the designers in one country used metric and in another inches. So then the pieces that they each build did not fit together...once they got them out into space..bad time to find that out!!!! that's what the open dialog is for, to keep from makeing monumental mistakes like NASA did. its cost millions of dollars..... and a simple converstion would have prevented it. After you both understood the question should you have moved on the the answer... A bell curve...stats... would have been a fine answer. caio

    ReplyDelete
  101. Anonymous5:44 PM

    " The arrogantly mindless sneer "If I am your supervisor and I ask you to do something I don't want a debate" represents Exhibit A in the decline and fall of American management, ladies and gentlemen. "

    Wow. I had never been accused of the decline and fall of American management before! I know you missed my point and I'm afraid that I won't be able to explain it better than I did.

    There is a time to talk about drawbacks/risks and a time to get your work done. Knowing the difference is an important skill in a workplace. Treating your lead or manager like an idiot because it seems to you that they fail to grasp your superior knowledge surrounding the finer details of a problem is a sure way to make enemies and avoid getting hired.

    You need to pick better battles. Standing up for such a stupid point of contention in an interview and avoiding the obvious answer smacks of an ignorance worth avoiding.

    ReplyDelete
  102. Anonymous6:02 PM

    You are amazingly annoying, and someone I would not hire based on this interview. You were aware of what the interviewer wanted, yet instead of providing that, you wanted to challenge parts of the question that your interviewer obviously did not care about or want. After repeated attempts by the interview to steer you to what he wanted, you persisted in shifting to what you wanted. Since this interview is my only way of gauging your personality I would cut your interview day short and thank you for your time.

    The tech companies that I've worked at, try to filter out jackasses; you would clog up the filter so badly.

    ReplyDelete
  103. Anonymous6:06 PM

    The interviewer asked the interviewee to "Explain how you would develop a frequency-sorted list of the ten thousand most-used words in the English language."

    What obvious answer?

    The interviewee DID explain how HE would go about it which rightly included raising the very serious problems of definition and constraints.

    The question as posed is insoluble unless one had access to every English word every spoken, printed. Anything short of having universal access would give rise to an approximation. Sometimes approximations are good enough so that is fine.

    Am intrigued by the notion of sampling as possible solution. Not really sure how far this would get us as the underlying word frequency data are NOT normally distributed but, rather, have a Zipfian distribution.

    ReplyDelete
  104. Anonymous6:18 PM

    Both arrogant (oh my god the interviewer does not even know about Cuckoo hashing...) and you obviously do not understand much about statistic, probabilities, and code theory (combo! :p ) -- I mean, maybe you can do the math, you probably can -- but you don't have the intuitive feeling that would have saved you for all those bad answers. Your "solution" was completely overkill and yet it still contained obvious flaws :/

    ReplyDelete
  105. Anonymous6:24 PM

    There are many papers on this topic, a while back I built a learning algorithm that would parse text books and guess the genre based on the words used. (And yes, we ran into the same problem with the subjective nature of the English Language)

    The first hundred or so words are all you really need to determine the language someone is using and the next three to four hundred or so can give you an idea of what you are reading.

    Not sure why anyone would say the first 10,000 words in a generic sense, I would try and find the first 10,000 words in a given subject and work my way from there, or find a fraction of distinct words from a variety of subjects.

    For the storage, I would use the same thing Microsoft uses (or used) for spell check. A Tree where each node contains 26 pointers one for each character of the alphabet. Using a tree prevents hash collisions and keeps search time at word length.

    ReplyDelete
  106. Interesting read. If an interviewer's questions lack background explanation, then, it's a good thing to demand this level of specification. I believe the commonly quoted statistic is: Around 75% of engineering projects fail due to poorly defined problematics/specifications.

    It appears that you touched a nerve with quite a few interviewers here. The bitterness of replies suggests that they may have been guilty of asking broadly scoped questions and not getting their presumed response.

    Some of the weak spelling and grammar amongst the comments was amusing. (Ironic, also, given the topic.)

    ReplyDelete
  107. Anonymous6:41 PM

    You're one of those guys that goes through life trying to prove to everyone how intelligent you are. Not by actually being intelligent, but by trying to make everyone around you look stupid through your arrogance. You not only showed this through your answer to the question, but more so by posting it on the internet looking for others to compliment your technical knowledge.

    Get some putty and fill in that chip on your shoulder, hero.

    ReplyDelete
  108. Anonymous6:46 PM

    Some sad comments indeed!

    Keep thinking about not hiring smart people, then you're competitor will.

    Keep being the shining start of your toy company.

    ReplyDelete
  109. Anonymous8:25 PM

    Anonymous claims those that fail kas for his interview would have prevented "Jack St. Clair Kilby" from miniaturizing electronics.

    I disagree, I say kas himself would have prevented Mr Kilby from miniaturizing electronics, because if he was surrounded by 'smarter then thou' colleagues who endless debated about why he would fail, and how stupid he was, or why all his assumptions were wrong, I bet he wouldn't have finished when it came time to bring his invention to production.

    I've asked many people these types of questions, where the assumptions are a little weak, but the question brings up topics and methods I want to see the interviewee handle. The question I have revolves around developing a sorting algorithm where examining the value of an element is much more costly then swapping it. I've had people debate me endlessly about the assumptions to the problem I'm asking them to solve.
    Such as,
    "what kind of thing would be more expensive to examine then to swap? That's pretty much impossible in modern memory and cpu design to achieve. Can you give me an example of where I'd encounter this, surely your company doesn't have a need for this type of sorting?"

    That comment is a NOHIRE, however, i've had other candidates that say,

    "Interestingly, I have an interest in EE, and its actually pretty much impossible to have a modern electronics design where that would be possible, I'd love to explain why to you, but I get that you're trying to get me think of sorting a little bit different. So here is how i'd solve it..."

    I've hired those people.

    Its as if kas had said,

    "Look, first off, if you really require a true representation of the 'most-used' words used in the 'English Language' today, then that actually is a very difficult problem due to the ever changing nature of language and the difficulties in finding a representative set of words. However, if we limit the source to English words, say to the ones used on the web, then this becomes manageable. Do you think I can make this simplifying assumption?"

    ReplyDelete
  110. Apart from the question of "why anyone would ask that kind of question in the course of an interview for a technical writing job", the next obvious question is : why any candidate interviewing for a technical writing job would know so much about hashing.

    That would make you a NOHIRE for me: there's no way you are going to be happy at technical writing, when you have such strong opinions of development issues.

    ReplyDelete
  111. Anonymous11:37 PM

    I get the strong impression that OP's recollection is heavily embellished. I very much doubt the interview went as described. Treppenwitze.

    ReplyDelete
  112. A job interview doesn't have to be merely tap-dancing to the tune of the interviewer.

    Kas effectively presented his strengths and weaknesses through his answers. It's to the benefit of both parties that he look elsewhere if his assets didn't match the position.

    That's not failure unless the goal was simply to get that specific job at any cost.

    ReplyDelete
  113. The comments here are rather bewildering. To describe this interview as a fail is extraordinary.

    If, in the course of an interview for a technical job, the interviewer pulled out a piece of wood with several holes in it and asked you to insert the appropriate pegs would you do it without question?

    It could be the case that some people aren't interested in jobs where the interview, and likely the job thereafter, is simply jumping through a series of arbitrary hoops.

    I think it's good that some people are willing to dig deeper and ask questions about the fundamental objective of what's to be accomplished. There are far too many people out there that are simply content to simply meet baseline requirements and never challenge themselves to exceed expectations or show forward thinking.

    ReplyDelete
  114. Anonymous5:01 AM

    As someone who's interviewed candidates for a tech writing role, I'm already impressed if the interviewee knows the difference between "it's" and "its".

    ReplyDelete
  115. Anonymous2:14 PM

    tadman,

    So if you went to an interview, and the interviewer pulled out a piece of wood with holes in it, asked you some obvious lateral thinking exercise involving pegs. Would you divert the interview as to why he was using wood, when steel is fire resistant and stronger? Or perhaps question him on why he didn't use a particle board made from recycled fiber instead of a dead tree? Or why would a software job would involve wood?

    It sounds as if you think that interview questions have to involve the real world solutions rather then questions that try to probe at different personality or software development approaches.

    You claim that "questioning the fundamental objective of what's to be accomplished" in a obvious interview question is a sign that the engineer is not willing "to simply meet baseline requirements and never challenge themselves to exceed expectations or show forward thinking".

    In fact questioning insessantly about the fundimentals of an 'interview question' to me is an obvious sign the interviewee isn't capable of understanding the context of why a problem exists,thus isn't capable of true forward thinking.

    I mean, if you're not capable of understanding why you're being asked that question in an interview, you're not really able to think outside of the problem you were given. Regardless of all the 'holes' you found in the assumptions.

    ReplyDelete
  116. Anonymous3:29 PM

    Interviewer: "How do you build a bridge?"
    Interviewee: "Over what?"

    FAIL.

    ReplyDelete
  117. Anonymous5:16 PM

    I think the issue was really that he was making statements that degraded the question, rather then asking questions to clarify which direction he should go. As a product manager, I do expect to be challenged by the tech writers I work with, but when someone says "do this", there is usually a reason in mind. I don't mind if someone asks "Why?". But as mentioned in previous posts, there comes at time to stop asking and move on, unless the point completely changes the work. My guess, similar to what others have stated, is the point of the interview question was to test to see if the interviewee 1) knew that they should ask more discovery questions so they're clear on the objective, 2) if they were able to ascertain (or knew to ask about) what is important to know (and thus write about) and what isn't, and 3) demonstrate basic problem-solving skills and ability to "think outside the box". Also, a basic interviewing skill to know is that the people you interview with may likely be your coworkers, if not one of your supervisors so they want to make sure you get along with them and that they want to work with you.

    ReplyDelete
  118. Really good question...

    Maybe I will add that list to my questions :)
    http://www.mockquestions.com

    ReplyDelete
  119. Anonymous1:06 AM

    Kas, if I saw one of my colleagues ask you a numbskull question like this and heard your answer, I'd privately tell him later, "Wow, you just got your ass handed to you," and then vote for you in the post-interview meeting.

    Conversely, if a candidate just blurted, "hash table" in response without any obvious thought or clarification of the requirements, that'd be, well, it'd be okay, but less good.

    And you guys who are taking him down for caring enough to _communicate_ with his interviewer instead of just submitting to torture... you guys are idiots. "Bridge over what?" indeed. It actually does matter.

    ReplyDelete
  120. Anonymous2:29 AM

    Yes, exactly. "Bridge over what?" matters a _great_ deal. A 20-mile-long bridge over shallow tropical water with sand at the bottom is an *entirely* and *radically* different problem than a 300-foot-long bridge over the Hudson river with deep mud at the bottom. One case requires the causeway bridge to Key West and demands a peculiar type of construction, whilst the other situation requires caissons and decompression chambers and blasting down to the bedrock under the mud several hudnred feet below the surface of the water. But of course the ignorant incompetent fool who sneered that knee-jerk sound bite was far too clueless to realize this.

    What's deeply disturbing about the responses to Kas' post is, of course, that many of the halfwitted ignorami who sneered "I would not hire him" are clearly talking from personal experience. These people actually seem to be in management. This represents a brutal blow to the credibility of American management and a savage condemnation of the American way of doing business as a whole.

    Hey! Ignorant management fools! Try reading the book "The Mythical Man-Month." Then go on to persue "The Peter Principle." Then read the chapter in Friedrich Hayek's "The Road To Serfdom" titled "Why the Worst Rise To the Top." Yes, these tomes describe _you_.

    It cannot have escaped the notice of any sane person that we have just emerged from an 8-year-long period in which the very worst people rose to the very top of our government. The result nearly wrecked America. We started out that period with a trillion dollar surplus, now we've got a ten trillion dollar deficity. In 2000, the American economy was vibrant and growing fast, in 2009 the AMerican economy has collapsed. IN 2000 we were at peace, in 2009 we're mired in two lost wars and gearing up for a third. What we observe from the startlingly ignorant and foolish responses condemning Kas' post is that the same thing appears to have happened in American business -- the worst have risen to the top, and are rapidly running America into the ground.

    After reading to bizarrely ignorant and spectacularly foolish responses condemning Kas, I now understand why 50% of all large programming projects get abandoned after bogging down into total chaos.

    The reason is aptly summarized in the absurdly moronic statement by one of the clueless managment types: "Wow. I had never been accused of the decline and fall of American management before! I know you missed my point and I'm afraid that I won't be able to explain it better than I did.

    "There is a time to talk about drawbacks/risks and a time to get your work done. Knowing the difference is an important skill in a workplace. Treating your lead or manager like an idiot because it seems to you that they fail to grasp your superior knowledge surrounding the finer details of a problem is a sure way to make enemies and avoid getting hired."

    Yes indeedy, "there is a time to get your work done." Perfect statement of the halfwit attitude which leads to half of all large programming projects getting bogged down and abandoned after pissing away millions of man-hours and tens of millions of wasted dollars. Don't bother to examine the underlying assumptions of your peabrained manager, never raise significant deeper questions which might lead to the entire project hitting a brick wall...no, just follow orders like a good little robot, and above all, never ever come off as "arrogant." Translation: never display insight, never think outside the box.

    The people in management who have responded here by ridiculing and sneering at Kas have the cranial capacity of lobotomized trilobites. America could take a giant leap forward if we took the whole lot of them and dumped them at sea along with our radioactive waste.

    ReplyDelete
  121. Anonymous8:26 AM

    Clearly there are two strongly opposing sides here (me being Cpt. Oblivious):
    1) Kas is very technical and accurate in his answer and should be commended on that
    2) The person conducting the interview could possibly take the confidence as arrogance

    No one has fully explored (in the comments) that perhaps a 'truce' should have been drawn within the context of the following:
    1) This was quite clearly a question that was meant to be an instrument within the larger bounds of a full interview
    2) The interviewer should have at least acknowledged the technical abilities of kas and guide the interview in the appropriate direction

    If on the other hand the question was: I have this problem, how do I solve it? And it were for an architect type position, then surely whip out the white boards and dry erase markers and let's have it out.

    However, the context here was an interview for a technical writer. Kas should have been in a technical write frame of mind and the interviewer did not have enough skill to steer the interview in the right direction.

    So in a way, they BOTH failed, but they both won.

    Regarding the clear resentment of management types and 'those morons ontop':
    Say what you will and moan as much as you want, they are still ontop and they still gross more than you, they still have more free-time than you to spend doing what they like (not necessarily family in most cases). But they got there more than likely because of their ability to use a little more of the emotional/creative side of their brain.

    It's about people, methods and insight. Technical facts, figure and critical thinking are great as well, but one has to begin balancing both. The moment you realize this you too can be an incompetent moron on top too.

    rgds
    CCIE, CISSP, MBA, PMP, manager of 250 talented workers much more intelligent than myself, 35 days of vacation a year, house in the Caribbean and grossing 178k/yr

    @"cranial capacity of lobotomized trilobites" you can come by anytime to look at the fossils collected in Argentina during my last trip since you are so in awe with them.

    ReplyDelete
  122. You dodged the question and presented yourself as a smart aleck.
    If Sergey Brin was like you, he'd say "Rank pages by relevance? But relevance is in the eye of the beholder. It isn't well-defined." And Google wouldn't exist.
    You might be a good candidate for coding business logic for which someone else has written a spec, but you can't do creative work.
    And for the record, I'm not in management. Not even a team leader.

    ReplyDelete
  123. Oh and btw on the subject of hash collisions. One thing I've learned from experience is that every event that has a very small but non-zero probability will sometime occur. Usually it'll be at exactly the wrong moment.

    ReplyDelete
  124. I'd never hire any of the arrogant wankers in these blog comments. Arrogance just means you can't work in a team. Good luck with that.

    ReplyDelete
  125. Gosh, there are so many comments by Anonymous! I'd like to thank everyone for participating -- even the most abusive and rude respondents (you know who you are); I thank you for your input. It's still a free country.

    The two deeply thoughtful Anonymous responses further above -- the ones that start with "Kas, if I saw..." and "Yes, exactly" -- are very much appreciated. Very thoughtful comments. Creative, too. "Lobotomized trilobytes"?! LOL... awesome.

    You guys made my day.

    I'm glad to see so much dialog on this. Now, if you'll excuse me, I have to change clothes. These tomato stains look like they're not going to come out...

    ReplyDelete
  126. Anonymous3:06 PM

    It was a very good question, and I enjoyed the conversation that followed a great deal.

    Obviously, you did extremely well; and if the interviewer didn't think so, it's his company's loss.

    Regardless, for somebody as strong as you it may be too late to come to Google. It's not the company it used to be even 5 years ago. There are great startups in the Silicon Valley.

    ReplyDelete
  127. Well... the way I wrote it is with a multi-dimensional linked list. The
    program I wrote below takes a letter, attempts to find the first letter
    within the linked list, which is populated in the order that it finds
    letters. Lets take an example word, "winners". Should this be the first
    word, the first letter would find its place in the linked list. Should
    there already be values in the list, it would continue down this same
    first position in the list. The reason that an array of letters is not
    used is due to the memory restraints, as well as the most frequently
    used letters will tend to be towards the beginning of the list, and
    letters that are not used will not occupy any space at all!

    For instance, should an array be used for this purpose, each populated
    with the letters of the English alphabet, the array would grow
    exponentially and not be able to contain a very long list of words.
    Attempting to create an array of all the possible words using all the
    letters of the alphabet with only two letters would create an array that
    of size 642, which is 4096 bytes large, assuming we will use ASCII. An
    array that could contain all the words 6 letters or less would be
    68719476736 bytes, or 64 GB, which is much larger than the average
    volatile memory capacity of a modern PC. Attempting to make an array
    that could contain all the possible words up to 13 letters long would
    require 281474976710656 GB of memory, which is much larger than the
    non-volatile storage capacity of computers. In fact, the longest word
    recorded in a respectable dictionary is 45 letters. It would be quite
    impossible to create an multi-dimensional array to store that word.

    Therefore the only reasonable way of containing such information would
    be with a linked list, which is only populated when needed. Therefore, a
    45 letter word would only require, assuming 1 byte for the letter, two
    bytes to store the number of times the word was encountered, one byte to
    store the pointer to the next link in the list, and one byte to store
    the pointer to the child link ( assuming we are on a 32 bit system ),
    225 bytes. This is a huge improvement over the multi-dimensional array.
    Not only that, but letters are shared amongst words! The words "book"
    and "boot" would share the letters "boo" within the multi-dimensional
    linked list, the second word only costing an additional 5 bytes.

    Also, due to the fact that the list is only populated by letters it
    finds, the more common letters will tend to be near the top of each
    level in the list, reducing the time it takes to walk through the list.
    Assume that the program finds the word "live". In order to increase the
    count it will first walk through the first list. It first looks for 'L':

    t
    e
    n
    f
    l -- Here it finds L.

    Then we start walking down L's child list, which will contain mostly
    vowels, as words that start with L are usually followed by a vowel.

    L's child list:

    e
    a
    i -- Ok, found I.

    Then L->I's child list.

    m
    n
    b
    v -- Found V!

    L->I->V's child list:

    e -- Found E, first one.

    So, in order to find LIVE, we took 13 steps. Now the multi-dimensional
    array does have the advantage here in speed, as each letter can be
    reduced to a numerical value and its place in the array found easily in
    one step. Too bad the memory constrains make it impossible for use.

    Here is a program I wrote in C, which demonstrates this concept of a
    multi-dimensional linked list. It was written for a *nix based system
    and takes files for input, or from stdin. The file will compile with gcc
    with no arguments required.

    This was written to find the most common sets of letters, but could be
    easily modified to find the most common words.

    Usage: ./a.out group_size ( filename )

    Ex.
    for i in WorksOfShakespeare/*; do cat $i; done |./a.out 4 | sort -k1n

    OR

    Looking for 5 letter words in the slackware HowTo documentation:
    "linux" appears to be a very popular letter combination here, being
    encountered 27658 times... :)

    $ for i in /usr/doc/Linux-HOWTOs/*; do cat $i; done |./a.out 5 | sort
    -k1n | tail
    13302 ument
    13315 onfig
    13562 confi
    14882 ystem
    14883 syste
    15393 inter
    19485 tions
    20179 ction
    27658 linux
    37586 ation

    File below:
    ---------------

    Code on Page:
    http://computertechsolution.com/files/charCount.html

    ReplyDelete
  128. Oh, by the way, if you can find a better way to do this, let me know.

    ReplyDelete
  129. I don't know if it was said before in the comments (didn't read them all) but I think maybe the Interviewer was looking for a specific set of algorithms dealing with the frequency of specific items in a stream of data.
    I think this page maybe does a good job explaining them:
    http://www.americanscientist.org/issues/pub/the-britney-spears-problem
    There should be no need for gigabytes of storage to find the 1000 most frequent words.

    ReplyDelete
  130. I would start from Ogden's Basic English 850-word list and build progressively from there (add 100 words per category) until you reach the 10,000 word list. You could cross-check this with a contemporary corpus and have a merged list. See: http://www.scribd.com/doc/14227/Ogden-850-Word-List-Revised-EnglishDari-Afghan-Farsi

    ReplyDelete
  131. If you need more interview questions. Go through these links
    PHP Interview questionsC Interview questions.
    Java Interview questions.

    ReplyDelete
  132. Nothing to worry or get nervous, just be confident and never tell any lie.. answer what you know.. are the first principles while attending an Interview. I had lot of experience in this area, so collected a big list of interview questions and answers sites (more than 220 sites) on wide variety of areas. This doesn't cover just interview questions but also has information related to how to dress, how and what to ask the person who is interviewing you like if it is HR, you might want to know about the work environment, about the overtime rules, about the holiday structure, any medical benefits, insurance coverages etc.,. Thought it will be useful to all, so sharing them at the below link -- might be of some help to you... today and even in future..
    http://markthispage.blogspot.com/2009/06/sites-you-must-refer-to-if-you-going.html

    ReplyDelete
  133. I had a very similar situation in an interview for a coding position.
    I decided to take a different approach, answering the question – I put my assumptions before the answer as I guessed that the interviewer would like to see my reaction to a situation.
    I had to put variants, coding, testing etc and revising the process all other again several times.

    The interviewer finally asked me related question on - memory, disk bandwidth, and fast full-text searching.

    ReplyDelete
  134. Padmanaban3:22 AM

    Interview is just a process to select the right person, so have a complete study about the company as well as the job position. Internet provides an easy way to search jobsopenings related for your needs.

    ReplyDelete
  135. Anonymous12:37 PM

    If I were asked that interview question, I would have simply stated the truth: "I don't know!"

    That requires a lot of courage. And to further the point, since the question was irrelavent, as you were interviewing for a writing position, that should have gotten you the job.

    If I were the interviewer, and you answered as such, I probably would have hired you, because you didn't fall for my trap. It's a little like the question, "If a chicken and a half laid an egg and a half in a day and a half, how much does a pound of butter weigh?"

    ReplyDelete
  136. Awesome Awesome article!!!

    Mock Questions

    ReplyDelete
  137. First of all. Thanks very much for your useful post.

    I just came across your blog and wanted to drop you a note telling you how impressed I was with the information you have posted here.

    Please let me introduce you some info related to this post and I hope that it is useful for community.

    Source: common interview questions

    Thanks again
    Ngo

    ReplyDelete
  138. Anonymous5:07 PM

    If I were the interviewer, not only would I have not hired you, I would have kicked you out of my office.

    When the interviewer asks "Explain how you would develop ", he isn't asking you to come up with a solution on the spot. He is asking what your steps would be to come up with a solution. He wasn't asking you to try to prove how smart you are by pointing out faults with the premise.

    That should have been obviously considering you were interviewing for a WRITING job and not a programming job.

    ReplyDelete
  139. Anonymous2:11 AM

    It is clear from the varying degree of responses here, that there isn't a right or wrong way to answer this. Some like the way he responded and others don't. Whether the interviewer was looking for a thought provoking answer or his ability to demonstrate a particular answering technique, and emotional response we will never know. So lets stop pretending like we're know it alls for a second. No one's perfect, and I'm sure he in no way way wanted to appear arrogant. That is furthest from anyones goal during an interview. He simply wanted to demonstrate his knowledge. Heck I've failed at interviews for jobs I know I'd be perfect for.

    ReplyDelete
  140. I think you may confused when he asked about hash function.I read this question as saying "Prove to me that you have read the original paper and understand how to implement it." Given that question, you failed.

    ReplyDelete
  141. Thanks for taking the time to discuss this,would you mind updating your blog with more information? It is extremely helpful for me.


    Job Interview

    ReplyDelete
  142. Hi!
    Guys,
    You can see all type question and answer of Interview...
    You can Click Here: Job Interview Questions

    ReplyDelete

Add a comment.

Note: Only a member of this blog may post a comment.