The Beauty of Data Structures and Algorithms-35 Trie Trees

This series is not original, the original text of the article is excerpted from Geek Time – The Beauty of Data Structure Algorithms, which is used for daily learning records. If there is any infringement, please contact me to delete it, thank you!

Search engine search keyword hint function, I think you should be familiar with it, right? In order to facilitate fast input, when you enter a certain part of the text to be searched in the search box of the search engine, the search engine will automatically pop up a drop-down box with various keyword hints. You can directly select what you want to search from the drop-down box instead of typing everything in, which saves our search time to a certain extent.

Although we use this function almost every day, as an engineer, have you ever thought about how it is implemented? What data structures and algorithms does it use at the bottom?

Search engines such as Google and Baidu have very comprehensive and accurate keyword hinting functions, and they must have done a lot of optimization, but the same is true. The most basic principle at the bottom is the data structure to be discussed today: Trie tree .

What is a “Trie tree”?

Trie tree, also called “dictionary tree”. As the name suggests, it is a tree structure. It is a data structure specially dealing with string matching, which is used to solve the problem of quickly finding a certain string in a set of strings.

Of course, such a problem can have multiple solutions, such as hash tables, red-black trees, or some of the string matching algorithms we mentioned in the previous sections, but Trie trees have their own unique advantages in solving this problem. . Not only that, but the problems that the Trie tree can solve is not limited to this, we will analyze it later.

Now, let’s take a look at what the Trie tree looks like.

I will give a simple example to illustrate. We have 6 strings, they are: how, hi, her, hello, so, see. We want to look for the existence of a certain string multiple times in it. If each search is to match the string to be searched with these 6 strings in turn, the efficiency is relatively low. Is there a more efficient method?

At this time, we can preprocess these 6 strings and organize them into the structure of the Trie tree. After each search, we will perform a matching search in the Trie tree. The essence of the Trie tree is to use the common prefix between strings to combine repeated prefixes together . The final result is what it looks like in the picture below.

ata Structures and Algorithms-35 Trie Trees

Among them, the root node does not contain any information. Each node represents a character in a string, and a path from the root node to the red node represents a string (note: red nodes are not all leaf nodes).

In order to make it easier for you to understand how the Trie tree is constructed, I have drawn a decomposition process of the Trie tree construction. Each step of the construction process is equivalent to inserting a string into the Trie tree. When all strings are inserted, the Trie tree is constructed.

The Beauty of Data Structures and Algorithms-35 Trie Trees

When we search for a string in the Trie tree, such as the string “her”, then we split the string to be searched into a single character h, e, r, and then start matching from the root node of the Trie tree. As shown in the figure, the green path is the path matched in the Trie tree.

The Beauty of Data Structures and Algorithms-35 Trie Trees

What if we’re looking for the string “he”? We also use the same method as above, starting from the root node and following a certain path to match, as shown in the figure, the green path is the path matched by the string “he”. However, the last node “e” of the path is not red. That is, “he” is a prefix substring of some string, but does not match any string exactly.

The Beauty of Data Structures and Algorithms-35 Trie Trees

How to implement a Trie tree?

Knowing what a Trie tree looks like, let’s now see how to implement a Trie tree with code.

From the introduction of the Trie tree just now, the Trie tree mainly has two operations, one is to construct a set of strings into a Trie tree . If this process is broken down, it is a process of inserting a string into the Trie tree. Another is to query a string in a Trie tree .

After understanding the two main operations of the Trie tree, let’s look at how to store a Trie tree?

From the previous figure, we can see that the Trie tree is a multi-fork tree. We know that in a binary tree, the left and right child nodes of a node are stored through two pointers, as shown in the following Java code. So for a multi-fork tree, how do we store pointers to all child nodes of a node?

class BinaryTreeNode {
  char data;
  BinaryTreeNode left;
  BinaryTreeNode right;  
}

Let me first introduce one of the storage methods, which is also a classic storage method, which is described in most data structure and algorithm books. Remember the hash table we talked about earlier? With the help of the idea of ​​a hash table, we store the pointer of the child node through an array whose subscripts and characters are mapped one-to-one. This sentence is a bit abstract and not very easy to understand. I drew a picture for you to see.

The Beauty of Data Structures and Algorithms-35 Trie Trees

Assuming that there are only 26 lowercase letters from a to z in our string, we store the pointer to the child node a at the subscript 0 position in the array, and store the pointer to the child node b at the subscript 1 position, By analogy, the position with the subscript 25 stores the pointer to the child node z. If the child node of a character does not exist, we store null in the corresponding subscript position.

class TrieNode {
  char data;
  TrieNode children[26];
}

When we look up the string in the Trie tree, we can quickly find the pointer of the matching child node by subtracting the ASCII code of “a” from the ASCII code of the character. For example, the ASCII code of d minus the ASCII code of a is 3, and the pointer of the child node d is stored in the position with the subscript 3 in the array.

After describing so much, you may still be a little confused. I translated the above description into code, you can read it together, it should help you understand.

public class Trie {
  private TrieNode root = new TrieNode('/'); // 存储无意义字符
  // 往Trie树中插入一个字符串
  public void insert(char[] text) {
    TrieNode p = root;
    for (int i = 0; i < text.length; ++i) {
      int index = text[i] - 'a';
      if (p.children[index] == null) {
        TrieNode newNode = new TrieNode(text[i]);
        p.children[index] = newNode;
      }
      p = p.children[index];
    }
    p.isEndingChar = true;
  }
  // 在Trie树中查找一个字符串
  public boolean find(char[] pattern) {
    TrieNode p = root;
    for (int i = 0; i < pattern.length; ++i) {
      int index = pattern[i] - 'a';
      if (p.children[index] == null) {
        return false; // 不存在pattern
      }
      p = p.children[index];
    }
    if (p.isEndingChar == false) return false; // 不能完全匹配,只是前缀
    else return true; // 找到pattern
  }
  public class TrieNode {
    public char data;
    public TrieNode[] children = new TrieNode[26];
    public boolean isEndingChar = false;
    public TrieNode(char data) {
      this.data = data;
    }
  }
}

The implementation of the Trie tree, you should understand it by now. Now, let’s see, what is the time complexity of finding a certain string in a Trie tree?

If you want to query some strings frequently in a set of strings, it will be very efficient to use a Trie tree. The process of building a Trie tree requires scanning all strings, and the time complexity is O(n) (n represents the sum of the lengths of all strings). But once the build is successful, subsequent query operations will be very efficient.

For each query, if the length of the string to be queried is k, then we only need to compare about k nodes to complete the query operation. It has nothing to do with the length and number of the original set of strings. So, after building the Trie tree, the time complexity of finding a string in it is O(k), where k represents the length of the string to be found.

Are Trie trees really memory-intensive?

Earlier we talked about the implementation of the Trie tree, and also analyzed the time complexity. By now you should know that Trie trees are a very unique and efficient way to match strings. However, about the Trie tree, have you ever heard such a statement: “The Trie tree is very memory-intensive, and uses a space-for-time idea”. What is the reason for this?

When we talked about the implementation of the Trie tree just now, we talked about using an array to store pointers to the child nodes of a node. If the string contains 26 characters from a to z, each node stores an array of length 26, and each array stores an 8-byte pointer (or 4 bytes, which is the same size as the CPU, operating system, compiler, etc.). Moreover, even if a node has only a few children, far less than 26, such as 3 or 4, we have to maintain an array of length 26.

As we said earlier, the essence of the Trie tree is to avoid repeatedly storing the same prefix substring of a set of strings, but now each character (corresponding to a node) is stored much larger than 1 byte. According to the example we gave above, the length of the array is 26, and each element is 8 bytes, then each node will need an additional 26*8=208 bytes. And that’s the case with only 26 characters.

If the string contains not only lowercase letters, but also uppercase letters, numbers, and even Chinese, it requires more storage space. So, that said, in some cases, Trie trees will not necessarily save storage space. In the case that there are not many repeated prefixes, the Trie tree not only cannot save memory, but may also waste more memory.

Of course, we cannot deny that Trie trees are very efficient, although they may be a waste of memory. So in order to solve this memory problem, do we have other ways?

We can sacrifice a little bit of query efficiency and replace the array in each node with another data structure to store a node’s child node pointers. Which data structure to use? We actually have many choices, such as ordered arrays, jump tables, hash tables, red-black trees, etc.

Suppose we use an ordered array, and the pointers in the array are arranged in order of the size of the characters in the child nodes they point to. When querying, we can quickly find the pointer of the child node that a character should match by using the binary search method. However, when inserting a string into the Trie tree, in order to maintain the order of the data in the array, it will be slightly slower.

The idea of ​​replacing it with other data structures is similar. I won’t analyze them one by one here. You can analyze it yourself based on what you have learned before.

In fact, there are many variants of Trie trees, all of which can solve the problem of memory consumption to a certain extent. For example, indentation optimization is for a node with only one child node, and this node is not the end node of a string, this node can be merged with the child node. This saves space, but increases the difficulty of coding. I won’t go into details here, if you are interested, you can study it yourself.

The Beauty of Data Structures and Algorithms-35 Trie Trees

Comparison of Trie tree with hash table and red-black tree

In fact, the problem of string matching, generally speaking, is actually the problem of data search. We have talked a lot about data structures that support efficient operation of dynamic data, such as hash tables, red-black trees, jump tables, and so on. In fact, these data structures can also implement the function of finding a string within a set of strings. We chose two data structures, hash table and red-black tree, and compared them with Trie tree to see their respective advantages, disadvantages and application scenarios.

In the scenario just described, looking up a string within a set of strings, the Trie tree doesn’t actually perform well. It has extremely strict requirements on the strings to be processed.

First, the character set contained in the string cannot be too large. As we mentioned earlier, if the character set is too large, then storage space may be wasted a lot. Even if it can be optimized, it has to pay the price of sacrificing query and insertion efficiency.

Second, the prefixes of the strings are required to overlap more, otherwise the space consumption will be much larger.

Third, if we want to solve the problem with a Trie tree, then we have to implement a Trie tree from scratch and ensure that there are no bugs. This is engineering to complicate simple problems. Unless necessary, this is generally not recommended.

Fourth, we know that the data blocks connected by pointers are discontinuous, and the pointers are used in the Trie tree, so it is not friendly to the cache, and the performance will be reduced.

Based on these points, for the problem of finding strings in a set of strings, we prefer to use hash tables or red-black trees in engineering. Because of these two data structures, we don’t need to implement them ourselves, we can just use the ready-made class libraries provided in the programming language.

At this point, you may be confused. After talking for a long time, I denied the Trie tree and asked you to use a red-black tree or a hash table. Is the Trie tree useless? Is today’s content just for nothing?

In fact, Trie tree is just not suitable for exact match search, this kind of problem is more suitable for solving with hash table or red-black tree. Trie trees are more suitable for finding strings that match prefixes, which is a scenario similar to the opening question.

Answer the opening

The Trie tree is finished, let’s look at the question mentioned in the opening paragraph: how to use the Trie tree to realize the prompt function of search keywords?

We assume that the keyword base consists of users’ popular search keywords. We construct this thesaurus as a Trie tree. When the user enters one of the words, the word is used as a prefix substring to match in the Trie tree. For the convenience of explanation, we assume that there are only 6 keywords in the thesaurus: hello, her, hi, how, so, and see. When the user enters the letter h, we display hello, her, hi, and how prefixed with h in the search prompt box. When the user continues to type the letter e, we display hello and her prefixed with he in the search prompt box. This is the most basic algorithm principle for searching for keyword hints.

The Beauty of Data Structures and Algorithms-35 Trie Trees

However, what I am talking about is only the most basic implementation principle. In fact, the search key word prompt function of search engines is far from being as simple as I said. If you dig a little deeper, you will think that the above solution encounters the following problems:

  • The idea I just talked about is the search keyword prompt for English. For more complex Chinese, how should the data in the thesaurus be constructed into a Trie tree?
  • If there are many keywords in the thesaurus, when the user enters the keyword when the search prompts, there are also many keywords that can be matched in the Trie tree as a prefix. How to choose which content to display?
  • In a search engine like Google, Google can still use the correct spelling for keyword hints even if the user spells the wrong word. How is this done?

You can think about how to solve it first. If not, it doesn’t matter. We will explain these problems in the actual combat chapter.

In fact, this application of Trie tree can be extended to a wider application, that is, automatic input completion, such as input method automatic completion function, IDE code editor automatic completion function, and browser URL input automatic completion function etc.

Content summary

Today we talked about a special kind of tree, the Trie tree. A Trie tree is a data structure that solves the problem of fast string matching. If there are not many repeated prefixes in the set of strings used to construct the Trie tree, the data structure of the Trie tree is generally more memory-intensive, and it is a problem-solving idea that replaces space with time.

Although it consumes more memory, it is very efficient to do string matching in the Trie tree when it is insensitive to memory or the memory consumption is within the acceptable range. The time complexity is O(k), and k represents the string to be matched. length.

However, the advantage of the Trie tree is not that it is used to search for dynamic collection data, because this work can be replaced by a more suitable hash table or red-black tree. The most advantage of the Trie tree is to find strings with matching prefixes. For example, the scenario of the keyword prompt function in the search engine is more suitable for solving it. It is also a classic application scenario of the Trie tree.

2 thoughts on “The Beauty of Data Structures and Algorithms-35 Trie Trees

Leave a Reply

Your email address will not be published.

en_USEnglish