
How a Bloom Filter Works: Build One From Scratch in Python

Started by joomlamz, yesterday at 22:15



Category: Tutorials | FreeCodeCamp Premium
Main language: Portuguese (technology content)

Tutorial Content / Step-by-Step Guide:
-------------------------------------------------------------------------
A Bloom filter gives you something that feels like magic: it can tell you whether an item is in a set of millions, using only a byte or two of memory per item, however large the items themselves are. And it answers in the same tiny amount of time no matter how much you have stored.

That sounds impossible. A normal set has to remember every item, so its memory grows with the data. But a Bloom filter remembers almost nothing about the items themselves, yet it still answers membership questions. The catch is that it's allowed to be wrong in one specific, controllable direction.

It's not magic. Once you build one yourself, the trick becomes clear, and you'll understand exactly what it can and can't promise.

In this tutorial, we'll build a working Bloom filter from scratch in Python, using nothing but a list of bits and a couple of hash functions. By the end, you'll understand bit arrays, why we use several hashes, what a false positive is, the one guarantee a Bloom filter never breaks, and how to size one for a target error rate.

Table of Contents

• What a Bloom Filter Actually Is

• A Short History

• Where Bloom Filters Are Used

• The Core Idea: a Bit Array and a Few Hashes

• Turning an Item into Positions

• Adding and Checking

• False Positives Are Normal

• Sizing it for a Target Error Rate

• What it Cannot Do: Delete

• Putting it Together

What a Bloom Filter Actually Is

A Bloom filter is a probabilistic data structure. Its whole job is to answer one question, "is this item in the set?", and it gives one of only two answers:

• Definitely not in the set. This answer is always correct.

• Possibly in the set. This answer is usually correct, but it's occasionally wrong.

The surprising part is that it answers without storing the items at all. A normal set, like Python's set or a hash table, keeps every item it has seen, so its memory grows with both the number of items and the size of each one.

A Bloom filter keeps only a fixed row of bits. Its size is decided up front and never changes, whether you store short words or long URLs or whole files.

So a Bloom filter isn't really a container. It's closer to a fingerprint of a set. You can't ask it to list what's inside, or to hand an item back. You can only ask "have you probably seen this?", and you can trust its "no" completely.

A quick way to picture it: instead of keeping a guest list of names, you keep a wall of light switches. When a guest arrives, you flip a few switches chosen from their name. To check whether someone came, you look at their switches. If any one of them is off, they definitely never arrived. If all of them are on, they probably did, though someone else's name might have flipped those same switches.
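
The light-switch picture can be sketched in a few lines of Python. This is only an illustration of the analogy, not the implementation this tutorial goes on to build: the class name `SwitchWall`, the wall size of 64, the three switches per guest, and the use of salted SHA-256 to pick switches are all assumptions made for the example.

```python
import hashlib


class SwitchWall:
    """Toy Bloom filter: a wall of switches (bits), a few flipped per guest."""

    def __init__(self, num_switches=64, switches_per_guest=3):
        # Hypothetical sizes, chosen only for illustration.
        self.num_switches = num_switches
        self.switches_per_guest = switches_per_guest
        self.switches = [False] * num_switches  # every switch starts off

    def _positions(self, name):
        # Derive several switch positions from one name by salting the hash
        # input with a counter. SHA-256 is an assumption of this sketch; any
        # well-mixed hash would do.
        positions = []
        for i in range(self.switches_per_guest):
            digest = hashlib.sha256(f"{i}:{name}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_switches)
        return positions

    def arrive(self, name):
        # Flip on the guest's switches. The name itself is never stored.
        for pos in self._positions(name):
            self.switches[pos] = True

    def probably_came(self, name):
        # Any switch off: definitely never arrived.
        # All switches on: probably arrived (or other guests flipped them).
        return all(self.switches[pos] for pos in self._positions(name))


wall = SwitchWall()
for guest in ["alice", "bob", "carol"]:
    wall.arrive(guest)

print(wall.probably_came("alice"))  # True: a guest who arrived is never missed
print(wall.probably_came("zed"))    # usually False; True would be a false positive
```

Note that the wall holds 64 booleans whether three guests arrive or three hundred. More guests only means more switches are on, which is what makes false positives more likely.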

That picture also explains why you would reach for one instead of a plain set. For a million URLs averaging fifty bytes each, a real set costs tens of megabytes and grows with the length of the URLs. A Bloom filter for the same million items at a one percent error rate costs about 1.2 megabytes, fixed, no matter how long the URLs are.
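
The 1.2 megabyte figure can be checked with the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. The sketch below assumes the optimal number of hash functions is used, and the function name `bloom_size_bits` is hypothetical.

```python
import math


def bloom_size_bits(n_items, error_rate):
    """Bits needed for n_items at the given false positive rate.

    Uses the standard formula m = -n * ln(p) / (ln 2)^2, which assumes
    the optimal number of hash functions.
    """
    return math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))


bits = bloom_size_bits(1_000_000, 0.01)
megabytes = bits / 8 / 1_000_000  # decimal megabytes
print(f"{bits} bits, about {megabytes:.1f} MB")  # about 1.2 MB
```

The length of the URLs never appears in the calculation. Only the item count and the target error rate matter, which works out to a little under ten bits per item at one percent.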

When the set is huge, has to live in memory on every machine, or holds large items, that saving is the difference between practical and impossible. The price is the rare false positive, and the usual pattern makes that cheap: a "no" skips an expensive lookup, and a "yes" just triggers the slower exact check you would have run anyway.

The rule of thumb: if you need exact answers, deletion, or the

... [The tutorial continues at the link below] ...

