Table of Contents

Introduction

Design Architecture

Scalability, Availability and Security

Google Distributed File System

Communication Protocols

Introduction

Google has users in every part of the world and is widely considered the biggest search engine. It runs more than 1 million servers in data centers around the world and processes hundreds of millions of search requests every day. The search engine is used by entering keywords on the home page; Google retrieves the top-scoring pages from its index of previously crawled pages and shows them to the user in a matter of seconds.

Design Architecture

Google owes its position as the leading Internet search company largely to its ranking algorithms. The search system handles more than 88 billion searches a month, and users typically receive query results within 0.2 seconds. This section gives a high-level view of the whole system. Google uses distributed crawlers to fetch pages from the Web: a URL server sends lists of URLs to the crawlers, and the crawlers pass the fetched pages to a store server, which compresses them and saves them in a repository. Every web page is assigned an ID number called a docID whenever a new URL is parsed out of a page.
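The docID assignment and page storage step can be sketched in a few lines of Python. This is only a toy illustration of the idea, not Google's actual code; the Repository class and its methods are invented for the example.

```python
import zlib

class Repository:
    """Toy stand-in for the store server: compresses and keeps fetched pages."""
    def __init__(self):
        self.pages = {}      # docID -> compressed page content
        self.doc_ids = {}    # URL -> docID
        self.next_doc_id = 0

    def doc_id_for(self, url):
        # Assign a fresh docID the first time a URL is seen.
        if url not in self.doc_ids:
            self.doc_ids[url] = self.next_doc_id
            self.next_doc_id += 1
        return self.doc_ids[url]

    def store(self, url, html):
        doc_id = self.doc_id_for(url)
        self.pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

repo = Repository()
print(repo.store("http://example.com/", "<html>hello</html>"))  # docID 0
```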

The indexer performs several functions: it reads the repository, decompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records a word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer also performs another important task: it parses every link in every page and stores the essential information about each one in an anchors file, which records exactly where each link points from and to, together with the link's text.
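Continuing the toy pipeline, here is one plausible way to represent hits and a forward index in Python. The Hit fields follow the description above; the structure is an assumption for illustration, not the actual barrel format.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Hit:
    word: str
    position: int      # word offset within the document
    font_size: int     # rough relative font size
    capitalized: bool

def parse_hits(text, font_size=1):
    """Turn a document's text into a list of hits, one per word occurrence."""
    return [Hit(w.lower(), i, font_size, w[:1].isupper())
            for i, w in enumerate(text.split())]

# Forward index: barrels keyed by docID, each holding that document's hits.
forward_index = defaultdict(list)

def index_document(doc_id, text):
    forward_index[doc_id].extend(parse_hits(text))

index_document(0, "Google indexes the Web")
```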

The URL resolver converts relative URLs into absolute URLs, and then into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to, and it generates a database of links, where each link is a pair of docIDs. This links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID, and resorts them by wordID; this operation requires temporary storage space.
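Given a links database of docID pairs, PageRank can be computed iteratively. The sketch below is a minimal textbook version (dangling pages with no outlinks are ignored for simplicity, and the damping factor 0.85 is the conventional choice, not a value stated in this article):

```python
def pagerank(links, n_docs, d=0.85, iterations=50):
    """links: list of (from_docID, to_docID) pairs from the URL resolver."""
    out_degree = [0] * n_docs
    incoming = [[] for _ in range(n_docs)]
    for src, dst in links:
        out_degree[src] += 1
        incoming[dst].append(src)

    rank = [1.0 / n_docs] * n_docs
    for _ in range(iterations):
        rank = [(1 - d) / n_docs +
                d * sum(rank[src] / out_degree[src] for src in incoming[dst])
                for dst in range(n_docs)]
    return rank

# Tiny link graph: 0 -> 1, 1 -> 2, 2 -> 0, 0 -> 2
print(pagerank([(0, 1), (1, 2), (2, 0), (0, 2)], 3))
```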

The sorter also produces a list of wordIDs and their offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. The searcher runs on a web server and uses the DumpLexicon lexicon, the inverted index, and the PageRanks to answer queries.
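Putting the pieces together, inverting the forward index and answering a query might look like the following sketch (it reuses the Hit records from the earlier example; ranking here uses PageRank alone, whereas the real searcher combines many signals):

```python
from collections import defaultdict

def invert(forward_index):
    """Resort hits by word to produce the inverted index: word -> postings."""
    inverted = defaultdict(list)
    for doc_id, hits in forward_index.items():
        for hit in hits:
            inverted[hit.word].append((doc_id, hit.position))
    return inverted

def search(query, inverted, ranks):
    """Return docIDs containing every query word, ordered by PageRank."""
    words = query.lower().split()
    candidates = set.intersection(
        *({doc for doc, _ in inverted.get(w, [])} for w in words))
    return sorted(candidates, key=lambda doc: ranks[doc], reverse=True)
```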

Scalability, Availability and Security

From a distributed systems perspective, Google's search engine is a fascinating case study. The system copes with extraordinarily high demand, and its scalability in particular deserves attention.

Scalability is the ability of a distributed system to operate effectively and efficiently at many different scales, from a small business intranet to the whole Internet. A scalable system remains effective even when the number of users and resources grows significantly. The challenge of scalability has three components. (1) Controlling the cost of physical resources: when demand for a resource is high, it should be possible to extend the system at reasonable cost. If a single server cannot handle all the access requests to a search engine, more servers must be added to prevent a performance bottleneck.

Measured along these three dimensions, Google's results are undoubtedly excellent. To keep scaling, functions such as indexing, ranking, and searching all require highly distributed solutions. (2) Controlling performance loss: when a distributed system deals with very large numbers of users and resources, it inevitably produces very large data sets.

How these large data sets are stored and managed greatly affects the performance of a distributed system. Here a hierarchical scheme is clearly more scalable than a linear one, although some performance loss cannot be avoided entirely (a comparison is sketched below). Google's engine is interactive, so it is essential that searches complete with as little latency as possible; given good network performance, a search finishes in under 0.2 seconds. Low latency keeps users coming back, which is what allows Google to increase its profits by selling more advertisements.
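To make the linear-versus-hierarchical comparison concrete: locating data spread linearly over N servers can, in the worst case, touch every server, while a hierarchical scheme touches only one server per level of a tree. The fanout of 16 below is an arbitrary assumption for illustration.

```python
import math

def linear_hops(n_servers):
    # Linear scheme: worst case consults every server.
    return n_servers

def hierarchical_hops(n_servers, fanout=16):
    # Hierarchical scheme: one hop per level of a tree with the given fanout.
    return max(1, math.ceil(math.log(n_servers, fanout)))

for n in (16, 1_000, 1_000_000):
    print(n, linear_hops(n), hierarchical_hops(n))  # 1e6 servers: 5 hops, not 1e6
```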

Google's annual revenue of roughly US $32 billion rests on this processing speed and on the efficient use of the related resources, including network, computing, and storage. (3) Preventing software resources from running out: Internet (IPv4) addresses are only 32 bits long, and the supply of addresses is being exhausted as usage grows. Google has no solution for this at the moment; adopting 128-bit Internet addresses (IPv6) would require changing many software components.
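The arithmetic behind this point is simple: a 32-bit address space caps out at about 4.3 billion addresses, while 128 bits is astronomically larger.

```python
# Size of the IPv4 (32-bit) versus IPv6 (128-bit) address spaces.
print(2 ** 32)    # 4,294,967,296 possible IPv4 addresses
print(2 ** 128)   # about 3.4e38 possible IPv6 addresses
```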

The availability and usability of a distributed system depend in part on how easily clients can add new services that share its resources. Google's search engine must satisfy the most demanding requirements, in the shortest possible time, for web crawling, indexing, and sorting. To meet these needs, Google has developed its own physical architecture, with a middleware layer that defines a distributed-system infrastructure on which new services and applications can be developed while keeping Google's massive code base intact.

Security of information resources has three components: confidentiality (protection against disclosure to unauthorized individuals), integrity (protection against alteration or corruption), and availability (protection against interference with the means of access).

Google Distributed File System

The Google File System (GFS) was implemented in response to the rapid growth of Google's big-data management needs. GFS faces the challenges of managing distribution along with an increased risk of hardware failure: it must handle enormous amounts of data while keeping that data safe. Rather than adopt any existing distributed file system, Google decided to build a new file storage system of its own.

GFS is optimized for large files, from gigabytes up to terabytes in size. A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients. These machines are commodity Linux machines running user-level server processes; a client and a chunk server can run on the same machine if its resources permit.

Chunk servers store chunks on local disks as ordinary Linux files. Each chunk is identified by a chunk handle, and a read or write specifies a chunk handle and a byte range. For reliability, every chunk is replicated on at least three chunk servers. The master maintains the metadata for the entire GFS, and it periodically exchanges HeartBeat messages with each chunk server to give it instructions and collect its state. Clients carry out all data-bearing communication directly with the chunk servers; because GFS does not provide a POSIX interface, it does not need to hook into the Linux vnode layer.
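A minimal sketch of this read path, assuming toy Master and ChunkServer classes invented for the example (the real protocol also involves leases, checksums, and caching of chunk locations):

```python
import random

CHUNK_SIZE = 64 * 2 ** 20   # GFS chunks are a fixed 64 MB

class Master:
    """Holds metadata only; file data never flows through the master."""
    def __init__(self):
        self.chunk_table = {}   # (path, chunk_index) -> chunk handle
        self.locations = {}     # chunk handle -> addresses of replica servers

    def lookup(self, path, offset):
        handle = self.chunk_table[(path, offset // CHUNK_SIZE)]
        return handle, self.locations[handle]

class ChunkServer:
    """Stores chunks; a real chunk server keeps them as plain Linux files."""
    def __init__(self):
        self.chunks = {}        # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, servers, path, offset, length):
    # 1. Ask the master for the chunk handle and the replica locations.
    handle, replicas = master.lookup(path, offset)
    # 2. Fetch the byte range directly from any one replica.
    server = servers[random.choice(replicas)]
    return server.read(handle, offset % CHUNK_SIZE, length)
```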

Neither clients nor chunk servers cache file data; this keeps the clients and the system consistent. Linux's own buffer cache already keeps frequently accessed data in memory, so chunk servers do not need a separate file cache, which simplifies the system and helps GFS performance.

Communication Protocols

Choosing and tuning the right communication protocols can make or break a system's design. Google uses a simple and efficient protocol for remote calls; to transmit a remote call, a serialization component must transform the procedure-call data. For this Google created protocol buffers, a simplified, high-performance serialization component. For publishing and subscribing, Google uses a separate protocol.

Protocol buffers focus first on describing the data, then on serializing it. The goal is a simple, flexible, and extensible way of specifying and serializing data, independent of language and platform. The serialized data is suitable for storage, transmission, and any scenario requiring a serializable data format. Google chose protocol buffers for several reasons; notably, the interface is deliberately less expressive than XML, but far simpler and more efficient to process.
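Much of that efficiency comes from the compact wire format. At its core is base-128 varint encoding, sketched below in Python (a simplified illustration of the encoding idea, not the protocol buffers library itself):

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    n, shift = 0, 0
    for i, byte in enumerate(data):
        n |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return n, i + 1           # value and bytes consumed
        shift += 7

assert decode_varint(encode_varint(300))[0] == 300
print(encode_varint(300).hex())       # 'ac02': two bytes, versus many in XML
```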

To fully satisfy Google's communication needs, the designers also adopted publish-subscribe, which allows events to be distributed reliably and in real time to large numbers of potential subscribers. The system is primarily designed to support Google AdWords. Google's publish-subscribe takes a topic-based approach that focuses on timely, reliable delivery; this increases the overhead, but the communication it provides is effective.
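The essence of topic-based publish-subscribe fits in a few lines. The broker below is a toy in-process sketch (the topic name and event shape are invented); Google's real system adds the reliability and delivery guarantees described above:

```python
from collections import defaultdict

class Broker:
    """Minimal topic-based publish-subscribe broker."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(event)

broker = Broker()
broker.subscribe("ads.clicks", lambda e: print("billing saw:", e))
broker.subscribe("ads.clicks", lambda e: print("analytics saw:", e))
broker.publish("ads.clicks", {"ad_id": 42, "cost": 0.10})
```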

Google provides the fastest retrieval speeds without consuming excessive resources. Google's distributed file system, I believe, is the technology that allows Google to retrieve the relevant data in just 0.2 seconds. The Google File System was created to provide redundant storage of massive amounts of data on cheap, commodity computers.

The Google File System was designed around the assumptions of a high component failure rate and a need for high throughput. One major difference between GFS and other distributed file systems is that GFS uses only a single master; in a traditional design, a single master would be both a single point of failure and a throughput bottleneck.

GFS avoids these problems by keeping the master's involvement minimal and its state replicated: clients cache chunk metadata, and the master's replicas only need updating when its state changes. The design, although simple, is sufficient, and the system is very fault-tolerant.
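A minimal sketch of that idea, assuming an invented ReplicatedMaster class (the real GFS replicates an operation log rather than mirroring writes directly):

```python
class ReplicatedMaster:
    """Master whose state changes are mirrored to standbys as they happen."""
    def __init__(self, replicas):
        self.state = {}
        self.replicas = replicas          # standby copies of the metadata

    def apply(self, key, value):
        self.state[key] = value
        # Replication work is only done when the state actually changes.
        for replica in self.replicas:
            replica[key] = value

standbys = [{}, {}]
master = ReplicatedMaster(standbys)
master.apply("/logs/chunk-0001", ["serverA", "serverB", "serverC"])
# If the master dies, a standby already holds the latest metadata:
assert standbys[0] == master.state
```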

In the unlikely event of a malfunction or error in the system, the master, chunk servers, and replicas can be restarted within seconds, and "shadow" masters can stand in for the primary master with read-only access. GFS does still have efficiency problems: Google currently runs more than 450,000 machines, yet only about a third of them are used effectively, so Google pays extra for energy and space. GFS achieves high performance at low cost, so the next task is to eliminate this unnecessary expense.

Author

  • luketaylor

    Luke Taylor is an educational blogger and professor who uses his blog to share his insights on educational issues. He has written extensively on topics such as online learning, assessment, and student engagement. He has also been a guest speaker on various college campuses.
