Martin Kleppmann

Designing Data-Intensive Applications

  • esandrew quoted 4 years ago
    However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge. In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.
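    As an illustration of the fan-out-on-write scheme this excerpt describes, a rough Python sketch (illustrative names only, not Twitter's actual code) might look like this:

        from collections import defaultdict, deque

        # In-memory stand-ins for the follower graph and the per-user
        # home timeline caches (bounded to the newest 800 entries).
        followers = defaultdict(set)                        # author_id -> follower ids
        home_timelines = defaultdict(lambda: deque(maxlen=800))

        def post_tweet(author_id, tweet):
            # Fan-out on write: one tweet becomes one cache write per
            # follower, so the write cost scales with follower count.
            for follower_id in followers[author_id]:
                home_timelines[follower_id].appendleft((author_id, tweet))

        def read_home_timeline(user_id):
            # Reads are cheap: the timeline is already materialized.
            return list(home_timelines[user_id])

    A user with 30 million followers turns a single post_tweet call into 30 million cache writes, which is exactly the fan-out load the excerpt treats as the key scalability parameter.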
  • Samson Mwathi quoted 7 months ago
    Many applications today are data-intensive, as opposed to compute-intensive.
  • b9449300348 quoted last year
    CPU clock speeds are barely increasing, but multi-core processors are standard
  • Peter Gazaryan quoted 2 years ago
    A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

    Store data so that they, or another application, can find it again later (databases)

    Remember the result of an expensive operation, to speed up reads (caches)

    Allow users to search data by keyword or filter it in various ways (search indexes)

    Send a message to another process, to be handled asynchronously (stream processing)

    Periodically crunch a large amount of accumulated data (batch processing)
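    As a small illustration of the "caches" building block in the list above, a read-through cache might look roughly like this (the slow operation is simulated):

        import time

        cache = {}  # key -> previously computed result

        def expensive_computation(key):
            # Stand-in for a slow database query or rendering step.
            time.sleep(0.1)
            return key.upper()

        def get(key):
            # Read-through cache: serve from memory when possible; otherwise
            # compute once, remember the result, and serve it on later reads.
            if key not in cache:
                cache[key] = expensive_computation(key)
            return cache[key]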
  • exordiumexordium quoted 3 years ago
    The currently trendy style of application development involves breaking down functionality into a set of services that communicate via synchronous network requests such as REST APIs (see “Dataflow Through Services: REST and RPC”). The advantage of such a service-oriented architecture over a single monolithic application is primarily organizational scalability through loose coupling: different teams can work on different services, which reduces coordination effort between teams (as long as the services can be deployed and updated independently).
  • esandrew quoted 4 years ago
    There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.
  • esandrew quoted 4 years ago
    Sometimes, when discussing scalable data systems, people make comments along the lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses.
  • Yurii Rebryk quoted 4 years ago
    A widely used alternative is to send messages via a message broker (also known as a message queue), which is essentially a kind of database that is optimized for handling message streams
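    A minimal sketch of that producer/broker/consumer shape, using Python's standard queue module as a stand-in for a real broker such as RabbitMQ or Kafka:

        import queue
        import threading

        broker = queue.Queue()   # stand-in for a durable message broker

        def producer():
            for i in range(5):
                broker.put({"event": "page_view", "seq": i})   # publish

        def consumer():
            while True:
                msg = broker.get()          # handled asynchronously,
                print("processed", msg)     # decoupled from the producer
                broker.task_done()

        threading.Thread(target=consumer, daemon=True).start()
        producer()
        broker.join()   # block until every message has been processed

    A real broker adds durability, acknowledgements, and delivery to multiple consumers, but the decoupling of sender from receiver is the same.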
  • Yurii Rebryk quoted 4 years ago
    MapReduce’s approach of fully materializing intermediate state has downsides compared to Unix pipes:

    • A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed, whereas processes connected by a Unix pipe are started at the same time, with output being consumed as soon as it is produced. Skew or varying load on different machines means that a job often has a few straggler tasks that take much longer to complete than the others. Having to wait until all of the preceding job’s tasks have completed slows down the execution of the workflow as a whole.

    • Mappers are often redundant: they just read back the same file that was just written by a reducer, and prepare it for the next stage of partitioning and sorting. In many cases, the mapper code could be part of the previous reducer: if the reducer output was partitioned and sorted in the same way as mapper output, then reducers could be chained together directly, without interleaving with mapper stages.

    • Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
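    The contrast with Unix pipes can be seen in miniature with Python generators: a pipelined chain consumes records as they are produced, whereas a materialized version (analogous to MapReduce's intermediate files) must finish and store each stage before the next one starts:

        def read_records():
            for i in range(1_000_000):
                yield i

        def mapper(records):
            for r in records:
                yield r * 2

        # Pipelined, Unix-pipe style: records flow through both stages
        # immediately; nothing is fully materialized in between.
        pipelined = sum(mapper(read_records()))

        # Materialized, MapReduce style: the first stage runs to completion
        # and its entire output is stored before the second stage begins.
        intermediate = list(read_records())
        materialized = sum(mapper(intermediate))

        assert pipelined == materialized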
  • Yurii Rebryk quoted 4 years ago
    If we cannot fully trust that every individual component of the system will be free from corruption—that every piece of hardware is fault-free and that every piece of software is bug-free—then we must at least periodically check the integrity of our data.
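    One way to make that periodic check concrete is to store a checksum next to each record and recompute it during an audit pass; the record layout here is illustrative:

        import hashlib
        import json

        def checksum(record):
            # Deterministic SHA-256 over a canonical encoding of the record.
            encoded = json.dumps(record, sort_keys=True).encode("utf-8")
            return hashlib.sha256(encoded).hexdigest()

        def audit(records_with_checksums):
            # Recompute each checksum and flag any record whose stored
            # checksum no longer matches the current contents.
            return [record for record, stored in records_with_checksums
                    if checksum(record) != stored]

        # Example: a record is silently corrupted after its checksum was taken.
        records = [{"id": 1, "balance": 100}, {"id": 2, "balance": 250}]
        stored = [(r, checksum(r)) for r in records]
        records[1]["balance"] = 999
        print(audit(stored))   # -> [{'id': 2, 'balance': 999}]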