python tutorial - Python Hashing (Hash Tables And Hashlib) - learn python - python programming
Hash Tables
- While an array can be used to construct hash tables, array indexes its elements using integers.
- However, if we want to store data and use keys other than integer, such as 'string', we may want to use dictionary.
- Dictionaries in Python are implemented using hash tables. It is an array whose indexes are obtained using a hash function on the keys.
- We declare an empty dictionary like this:
>>> D = {}
click below button to copy the code. By Python tutorial team
- Then, we can add its elements:
>>> D['a'] = 1
>>> D['b'] = 2
>>> D['c'] = 3
>>> D
{'a': 1, 'c': 3, 'b': 2}
click below button to copy the code. By Python tutorial team
- It's a structure with (key, value) pair:
D[key] = value
click below button to copy the code. By Python tutorial team
- The string used to "index" the hash table D is called the "key". To access the data stored in the table, we need to know the key:
>>> D['b']
2
click below button to copy the code. By Python tutorial team
How we loop through the hash table ?
>>> for k in D.keys():
... print D[k]
...
1
3
2
click below button to copy the code. By Python tutorial team
- If we want to print the (key, value) pair:
>>> for k,v in D.items():
... print k,':',v
...
a : 1
c : 3
b : 2
click below button to copy the code. By Python tutorial team
Hashing from two arrays
- Using two Arrays of equal length, create a Hash object where the elements from one array (the keys) are associated with the elements of the other (the values):
>>> keys = ['a', 'b', 'c']
>>> values = [1, 2, 3]
>>> hash = {k:v for k, v in zip(keys, values)}
>>> hash
{'a': 1, 'c': 3, 'b': 2}
click below button to copy the code. By Python tutorial team
Hashing
- Here are some hashing samples using built-in hash function:
>>> map(hash, [0, 1, 2, 3])
[0, 1, 2, 3]
>>> map(hash, ['0','1','2','3'])
[6144018481, 6272018864, 6400019251, 6528019634]
>>> hash('0')
6144018481
click below button to copy the code. By Python tutorial team
- As we can see from the example, Python is using different hash() function depending on the type of data.
- Python provides hashlib for secure hashes and message digests:
md5(), sha*():
>>> import hashlib
>>> hashlib.md5('a')
>>> hashlib.md5('a').digest()
'\x0c\xc1u\xb9\xc0\xf1\xb6\xa81\xc3\x99\xe2iw&a;'
>>> hashlib.md5('a').hexdigest()
'0cc175b9c0f1b6a831c399e269772661'
>>> hashlib.sha512('a')
>>> hashlib.sha512('a').digest()
'\x1f@\xfc\x92\xda$\x16\x94u\ty\xeel\xf5\x82\xf2\xd5\xd7\xd2\x8e\x183]\xe0Z\xbcT\xd0V\x0e\x0fS\x02\x86\x0ce+\xf0\x8dV\x02R\xaa^t!\x05F\xf3i\xfb\xbb\xce\x8c\x12\xcf\xc7\x95{&R;\xfe\x9au'
>>> hashlib.sha512('a').hexdigest()
'1f40fc92da241694750979ee6cf582f2d5d7d28e18335de05abc54d0560e0f5302860c652bf08d560252aa5e74210546f369fbbbce8c12cfc7957b2652fe9a75'
>>>
click below button to copy the code. By Python tutorial team
Hashing Sample Code
- In this section, we used 64 bit integer (hash value from hash()) for the comparison of shingles instead of directly working on the string.
from __future__ import division
import itertools
import re
import hashlib
# a shingle in this code is a string with K-words
K = 4
def jaccard_set(s1, s2):
" takes two sets and returns Jaccard coefficient"
u = s1.union(s2)
i = s1.intersection(s2)
return len(i)/len(u)
def make_a_set_of_tokens(doc):
"""makes a set of K-tokens"""
# replace non-alphanumeric char with a space, and then split
tokens = re.sub("[^\w]", " ", doc).split()
sh = set()
for i in range(len(tokens)-K):
t = tokens[i]
for x in tokens[i+1:i+K]:
t += ' ' + x
sh.add(t)
return sh
if __name__ == '__main__':
documents = ["The legal system is made up of civil courts, criminal courts and specialty courts
such as family law courts and bankruptcy court...."]
shingles = []
# handle documents one by one
# makes a list of sets which are compresized of a list of K words string
for doc in documents:
# makes a set of tokens
# sh = set([' ', ..., ' '])
sh = make_a_set_of_tokens(doc)
print("sh = %s") %(sh)
# hasing
bucket = map(hash, sh)
# print("bucket = %s") %(bucket)
# shingles : list of sets (sh)
shingles.append(set(bucket))
#print("shingles=%s") %(shingles)
combinations = list( itertools.combinations([x for x in range(len(shingles))], 2) )
print("combinations=%s") %(combinations)
# compare each pair in combinations tuple of shingles
for c in combinations:
i1 = c[0]
i2 = c[1]
jac = jaccard_set(shingles[i1], shingles[i2])
print("%s : jaccard=%s") %(c,jac)
click below button to copy the code. By Python tutorial team
- Output is exactly the same as the one we got using string comparison:
combinations=[(0, 1), (0, 2), (1, 2)]
(0, 1) : jaccard=0.0196078431373
(0, 2) : jaccard=0.0
(1, 2) : jaccard=0.0